Comment #0 by qs.il.paperinik — 2022-09-26T12:09:10Z
On https://dlang.org/spec/entity.html it says:
> The full list of named character entities from the HTML 5 Spec is supported except for the named entities which contain multiple code points.
This exception seems like an arbitrary limitation. It might make the implementation easier, but could surprise users.
Even if this enhancement is rejected, the compiler should at least recognize HTML5 multi-code-point entities and emit a specific error, something like: “character entity `\&acE;` specifies multiple code points (U+223E, U+0333), which is not supported. Use \u223E\u0333 instead.”
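For readers unfamiliar with the syntax, here is a minimal sketch of what works under the restriction described above and what the suggested workaround looks like. The name `acE` for the constant is only an illustration.

```d
// Named entities that map to a single code point are accepted
// inside string literals.
static assert("\&amp;" == "&");
static assert("\&copy;" == "\u00A9");

// Workaround for an entity that maps to two code points, such as
// &acE; (U+223E followed by U+0333): spell out each code point
// with its own \u escape.
enum acE = "\u223E\u0333";
static assert(acE == "\u223E" ~ "\u0333");
```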
Comment #1 by dkorpel — 2022-09-26T12:22:26Z
I don't like enhancements based on hypothetical users. Are there actual users running into this limitation?
My impression is that it's a rarely used feature, but it's not very intrusive, so there's no pressure to remove it either. I could be wrong though.
> Even if this enhancement is rejected, the compiler should at least recognize
> HTML5 multi-code-point entities and emit a specific error
If we maintain code to recognize such entities, we might as well support them no?
Comment #2 by qs.il.paperinik — 2022-09-26T12:50:08Z
(In reply to Dennis from comment #1)
> I don't like enhancements based on hypothetical users. Are there actual
> users running into this limitation?
>
> My impression is that it's a rarely used feature, but it's not very
> intrusive, so there's no pressure to remove it either. I could be
> wrong though.
I have no idea how often it’s used. I just read over the spec and it said multi-code-point entities aren’t supported. I wondered what the compiler would say, so I tried using one, and it reported the same error as for a name that isn’t an HTML5 entity at all.
Maybe the reason for not supporting them is they’d be the only escape sequence to introduce multiple code-points, thus counting the number of code-points / code-units becomes non-trivial: One would have to know the entities. I have no idea why exactly this is an issue, tho. If I need the size of a string literal, I won’t count the characters manually but assign it to an `enum` and ask for its `length`.
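A small sketch of that approach, using the `\u` spelling of the two code points behind `&acE;`. The names `s8`, `s16`, and `s32` are made up for illustration; `walkLength` from `std.range` is used here to count code points rather than code units.

```d
import std.range : walkLength;

// The same two code points stored as UTF-8, UTF-16 and UTF-32.
enum string  s8  = "\u223E\u0333";
enum wstring s16 = "\u223E\u0333"w;
enum dstring s32 = "\u223E\u0333"d;

// `length` counts code units, so it depends on the encoding:
static assert(s8.length  == 5); // 3 bytes for U+223E, 2 for U+0333
static assert(s16.length == 2); // one UTF-16 unit each
static assert(s32.length == 2); // one UTF-32 unit each

// Counting code points gives 2 regardless of encoding.
static assert(s8.walkLength == 2);
```

The compiler does the counting either way, so nobody has to know by heart how many code points an entity expands to.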
> > Even if this enhancement is rejected, the compiler should at least recognize
> > HTML5 multi-code-point entities and emit a specific error
>
> If we maintain code to recognize such entities, we might as well support
> them no?
Exactly. The error message effectively says that the entity does not exist, but that is incorrect. It’s akin to “‹name› does not exist in this scope” versus “‹name› cannot be used in this context”.
(It reminds me of the binary literals debate; when recognition and error is harder than recognition and support, the way to go is clear.)
Comment #3 by dkorpel — 2022-09-26T13:40:39Z
(In reply to Bolpat from comment #2)
> Maybe the reason for not supporting them is they’d be the only escape
> sequence to introduce multiple code-points, thus counting the number of
> code-points / code-units becomes non-trivial:
Looking at the source code, it seems like the only reason is because no one bothered to implement it. `Lexer.escapeSequence` returns a single `dchar`, so it would require a bit of refactoring.
Comment #4 by dkorpel — 2022-09-26T13:53:22Z
(In reply to Dennis from comment #3)
> Looking at the source code, it seems like the only reason is because no one
> bothered to implement it. `Lexer.escapeSequence` returns a single `dchar`,
> so it would require a bit of refactoring.
Oh, there's also the fact that escape sequences are used in character literals too, so you need to account for this:
```
dchar x = '\&acE;'; // Requires two dchars: U+223E and U+0333
```
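To illustrate the distinction, here is a rough sketch of how a string-literal context and a character-literal context could treat the same entity differently. This is not dmd's code: `entityCodePoints` and `checkCharLiteral` are hypothetical helpers, the two-entry table stands in for the full entity list, and the messages are invented for the example.

```d
// Hypothetical helper: maps an entity name to the code points it
// stands for. Returns null for a name that is not in the (toy) table.
dstring entityCodePoints(string name)
{
    switch (name)
    {
        case "amp": return "&"d;
        case "acE": return "\u223E\u0333"d;
        default:    return null;
    }
}

// Hypothetical check for the character-literal context: returns an
// error message, or null if the entity fits into a single dchar.
string checkCharLiteral(string name)
{
    const points = entityCodePoints(name);
    if (points.length == 0)
        return "unknown character entity &" ~ name ~ ";";
    if (points.length > 1)
        return "character entity &" ~ name ~ "; specifies multiple code points";
    return null; // exactly one code point
}

// A string literal could simply take every code point...
static assert(entityCodePoints("acE").length == 2);
// ...while a character literal only has room for one.
static assert(checkCharLiteral("amp") is null);
static assert(checkCharLiteral("acE") !is null);
```

Returning a slice of code points instead of a single `dchar` is only one possible shape for such a refactoring; it also makes it easy to tell "unknown entity" apart from "too many code points".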
Comment #5 by dlang-bot — 2022-09-26T15:41:30Z
@dkorpel updated dlang/dmd pull request #14489 "Fix 23376 - Allow multi-code-point HTML entities" fixing this issue:
- Fix 23376 - Allow multi-code-point HTML entities
https://github.com/dlang/dmd/pull/14489
Comment #6 by dlang-bot — 2022-09-27T09:46:17Z
@dkorpel created dlang/dlang.org pull request #3419 "Issue 23376 - Allow multi-code-point HTML entities" mentioning this issue:
- Issue 23376 - Allow multi-code-point HTML entities
https://github.com/dlang/dlang.org/pull/3419