Bug 23186 – wchar/dchar do not have their endianess defined
Status: RESOLVED
Resolution: FIXED
Severity: enhancement
Priority: P1
Component: dlang.org
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2022-06-13T23:52:46Z
Last change time: 2022-09-02T17:10:26Z
Keywords: pull
Assigned to: No Owner
Creator: Richard Cattermole
Comments
Comment #0 by alphaglosined — 2022-06-13T23:52:46Z
UTF-16 and UTF-32 each come in little-endian and big-endian versions.
Even if the endianness is target defined, it would be good for the spec to declare it as such.
Comment #1 by dkorpel — 2022-06-15T10:12:41Z
This is relevant when e.g. converting a `ubyte[]` to a `wchar[]` or `dchar[]`, but I don't think the language ever does that itself. `wchar` and `dchar` are defined as "unsigned 16/32 bit" basic types, just like `ushort` or `uint`, and endianness in general is already specified to be target defined here:
https://dlang.org/spec/abi.html#endianness
Would it suffice to add char types to the table below it?
https://dlang.org/spec/abi.html#basic_types
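A minimal sketch of the conversion mentioned in this comment, assuming a plain array cast is used to reinterpret the bytes. The helper name `firstUnit` is hypothetical and only serves the illustration. The same two bytes yield a different code unit depending on the target's byte order:

```d
/// Reinterpret raw bytes as UTF-16 code units and return the first one.
/// No byte swapping happens; the result depends on the target's byte order.
/// (Hypothetical helper for illustration; the length must be a multiple of 2.)
wchar firstUnit(ubyte[] bytes)
{
    return (cast(wchar[]) bytes)[0];
}

void main()
{
    // The two bytes of 'A' (U+0041) as stored in a UTF-16LE file.
    ubyte[] bytes = [0x41, 0x00];

    version (LittleEndian)
        assert(firstUnit(bytes) == 0x0041); // reads back as 'A'
    else
        assert(firstUnit(bytes) == 0x4100); // a different code unit
}
```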
Comment #2 by alphaglosined — 2022-06-15T16:33:27Z
No, this isn't an ABI thing, it's about encodings.
Ideally, wchar/dchar would have little and big endian versions so that we can represent both forms of the encoding in the type system.
It has to be in: https://dlang.org/spec/type.html#basic-data-types
However, it can be kept pretty simple, something like ``Unicode 8-bit code point with matching target endian``.
Comment #3 by dkorpel — 2022-06-15T21:34:38Z
(In reply to Richard Cattermole from comment #2)
> No, this isn't an ABI thing, it's about encodings.
I don't follow, do you have a reference for me? I'm looking at:
https://en.wikipedia.org/wiki/UTF-16
"Each Unicode code point is encoded either as one or two 16-bit code units. How these 16-bit codes are stored as bytes then depends on the 'endianness' of the text file or communication protocol."
The `wchar` type is an integer, the 16-bit code. No integral operations on a `wchar` reveal the endianness, only once you reinterpret cast 'the text file' (a `ubyte[]`) will endianness come up, but at that point I think it's no different than casting a `ubyte[]` to a `ushort[]`. We don't have BE and LE `short` types either.
> However, it can be kept pretty simple something like `Unicode 8-bit code
> point with matching target endian`.
There's no endian difference for 8-bit code points, or are we talking about bit order instead of byte order?
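A small sketch of the argument above, assuming nothing beyond the language itself and `std.system`. The helper names `highByte` and `firstByteInMemory` are hypothetical. Arithmetic on a `wchar` gives the same answer on every target; only looking at its bytes in memory exposes the byte order:

```d
import std.system : Endian, endian;

/// High byte obtained by arithmetic: identical on every target.
ubyte highByte(wchar c)
{
    return cast(ubyte)(c >> 8);
}

/// First byte as laid out in memory: depends on the target's byte order.
ubyte firstByteInMemory(wchar c)
{
    return (cast(ubyte*) &c)[0];
}

void main()
{
    wchar c = 0x20AC; // U+20AC EURO SIGN, a single UTF-16 code unit

    // Integral operations never reveal endianness.
    assert(highByte(c) == 0x20);
    assert((c & 0xFF) == 0xAC);

    // Reinterpreting the storage does.
    const expected = endian == Endian.littleEndian ? 0xAC : 0x20;
    assert(firstByteInMemory(c) == expected);
}
```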
Comment #4 by alphaglosined — 2022-06-15T21:44:37Z
(In reply to Dennis from comment #3)
> (In reply to Richard Cattermole from comment #2)
> > No, this isn't an ABI thing, it's about encodings.
>
> I don't follow, do you have a reference for me? I'm looking at:
>
> https://en.wikipedia.org/wiki/UTF-16
>
> "Each Unicode code point is encoded either as one or two 16-bit code units.
> How these 16-bit codes are stored as bytes then depends on the 'endianness'
> of the text file or communication protocol."
>
> The `wchar` type is an integer, the 16-bit code. No integral operations on a
> `wchar` reveal the endianness, only once you reinterpret cast 'the text
> file' (a `ubyte[]`) will endianness come up, but at that point I think it's
> no different than casting a `ubyte[]` to a `ushort[]`. We don't have BE and
> LE `short` types either.
Indeed. With integers you kind of expect the byte order to match the CPU's endianness. But you cannot assume the same for UTF (hence we should document it).
> > However, it can be kept pretty simple something like `Unicode 8-bit code
> > point with matching target endian`.
>
> There's no endian difference for 8-bit code points, or are we talking about
> bit order instead of byte order?
That should have been UTF-16 or UTF-32, but it's the same.
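For illustration, a sketch of what handling both forms of the encoding looks like when the byte order is stated explicitly instead of being taken from the target. The `ByteOrder` enum and `decodeUtf16` function are hypothetical names, not part of the language or Phobos:

```d
/// Byte order of the encoded input (hypothetical helper type).
enum ByteOrder { little, big }

/// Assemble UTF-16 code units from bytes with an explicit byte order.
/// The result is the same on little-endian and big-endian targets.
wchar[] decodeUtf16(const(ubyte)[] bytes, ByteOrder order)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");
    auto units = new wchar[bytes.length / 2];
    foreach (i, ref u; units)
    {
        const a = bytes[2 * i];
        const b = bytes[2 * i + 1];
        u = order == ByteOrder.little
            ? cast(wchar)(a | (b << 8))
            : cast(wchar)((a << 8) | b);
    }
    return units;
}

void main()
{
    // "hi" as UTF-16LE and as UTF-16BE.
    ubyte[] le = [0x68, 0x00, 0x69, 0x00];
    ubyte[] be = [0x00, 0x68, 0x00, 0x69];
    assert(decodeUtf16(le, ByteOrder.little) == "hi"w);
    assert(decodeUtf16(be, ByteOrder.big) == "hi"w);
}
```

Once decoded this way, the resulting `wchar[]` holds ordinary code-unit values, and the byte order of the original input no longer matters.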
Comment #5 by dlang-bot — 2022-06-16T09:33:54Z
@dkorpel created dlang/dlang.org pull request #3319 "Fix 23186 - wchar/dchar do not have their endianess defined" fixing this issue:
- Fix 23186 - wchar/dchar do not have their endianess defined
https://github.com/dlang/dlang.org/pull/3319
Comment #6 by dlang-bot — 2022-09-02T17:10:26Z
dlang/dlang.org pull request #3319 "Fix 23186 - wchar/dchar do not have their endianess defined" was merged into master:
- d3e822cf7d4acfd38fcf3dc3a632c3644741c6d3 by Dennis Korpel:
Fix 23186 - wchar/dchar do not have their endianess defined
https://github.com/dlang/dlang.org/pull/3319