Bug 23186 – wchar/dchar do not have their endianess defined

Status: RESOLVED
Resolution: FIXED
Severity: enhancement
Priority: P1
Component: dlang.org
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2022-06-13T23:52:46Z
Last change time: 2022-09-02T17:10:26Z
Keywords: pull
Assigned to: No Owner
Creator: Richard Cattermole

Comments

Comment #0 by alphaglosined — 2022-06-13T23:52:46Z
For UTF-16 and UTF-32 there are little-endian and big-endian versions. Even if the byte order is target defined, it would be good to have this declared as such.
Comment #1 by dkorpel — 2022-06-15T10:12:41Z
This is relevant when e.g. converting a `ubyte[]` to a `wchar[]` or `dchar[]`, but I don't think the language ever does that itself. A `wchar` and `dchar` are defined as "unsigned 16/32 bit" basic types, just like `ushort` or `uint`, and endianness in general is already specified to be target defined here: https://dlang.org/spec/abi.html#endianness Would it suffice to add char types to the table below it? https://dlang.org/spec/abi.html#basic_types
Comment #2 by alphaglosined — 2022-06-15T16:33:27Z
No, this isn't an ABI thing, it's about encodings. Ideally, wchar/dchar would have little and big endian versions so that we can represent both forms of the encoding in the type system. It has to be in: https://dlang.org/spec/type.html#basic-data-types However, it can be kept pretty simple, something like ``Unicode 8-bit code point with matching target endian``.
Comment #3 by dkorpel — 2022-06-15T21:34:38Z
(In reply to Richard Cattermole from comment #2)
> No, this isn't an ABI thing, it's about encodings.

I don't follow, do you have a reference for me? I'm looking at:

https://en.wikipedia.org/wiki/UTF-16

"Each Unicode code point is encoded either as one or two 16-bit code units. How these 16-bit codes are stored as bytes then depends on the 'endianness' of the text file or communication protocol."

The `wchar` type is an integer, the 16-bit code. No integral operations on a `wchar` reveal the endianness, only once you reinterpret cast 'the text file' (a `ubyte[]`) will endianness come up, but at that point I think it's no different than casting a `ubyte[]` to a `ushort[]`. We don't have BE and LE `short` types either.

> However, it can be kept pretty simple something like `Unicode 8-bit code
> point with matching target endian`.

There's no endian difference for 8-bit code points, or are we talking about bit order instead of byte order?
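
The distinction drawn in comment #3 can be illustrated with a minimal sketch. It assumes a plain reinterpreting array cast; `reinterpretAsUtf16` is a hypothetical helper name used only for this example and is not part of Phobos or the spec. The sketch shows that arithmetic on a `wchar` value gives the same result on every target, while the code unit obtained by reinterpreting raw bytes depends on the target's byte order.

```d
// Hypothetical helper for this sketch: reinterpret raw bytes as UTF-16
// code units. The cast performs no byte swapping.
const(wchar)[] reinterpretAsUtf16(const(ubyte)[] bytes)
{
    return cast(const(wchar)[]) bytes;
}

void main()
{
    // "A" (U+0041) serialized as UTF-16LE: low byte first.
    ubyte[] bytes = [0x41, 0x00];
    auto units = reinterpretAsUtf16(bytes);

    version (LittleEndian)
        assert(units[0] == 0x0041); // bytes match the target's byte order
    else
        assert(units[0] == 0x4100); // same bytes, different code unit

    // Integral operations on the value itself do not reveal endianness.
    wchar w = 'A';
    assert((w >> 8) == 0x00 && (w & 0xFF) == 0x41);
}
```

The same behavior would be observed when casting a `ubyte[]` to a `ushort[]`, which is the comparison made in the comment.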
Comment #4 by alphaglosined — 2022-06-15T21:44:37Z
(In reply to Dennis from comment #3)
> (In reply to Richard Cattermole from comment #2)
> > No, this isn't an ABI thing, it's about encodings.
>
> I don't follow, do you have a reference for me? I'm looking at:
>
> https://en.wikipedia.org/wiki/UTF-16
>
> "Each Unicode code point is encoded either as one or two 16-bit code units.
> How these 16-bit codes are stored as bytes then depends on the 'endianness'
> of the text file or communication protocol."
>
> The `wchar` type is an integer, the 16-bit code. No integral operations on a
> `wchar` reveal the endianness, only once you reinterpret cast 'the text
> file' (a `ubyte[]`) will endianness come up, but at that point I think it's
> no different than casting a `ubyte[]` to a `ushort[]`. We don't have BE and
> LE `short` types either.

Indeed. With integers, you kind of expect the byte order to match the CPU's endianness. But you cannot assume the same for UTF (hence we should document it).

> > However, it can be kept pretty simple something like `Unicode 8-bit code
> > point with matching target endian`.
>
> There's no endian difference for 8-bit code points, or are we talking about
> bit order instead of byte order?

That should have been UTF-16 or UTF-32, but it's the same.
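
When the byte order of incoming UTF-16 data is known and may differ from the target's, the conversion can be made explicit instead of relying on a cast. The following is a minimal sketch using `bigEndianToNative` and `littleEndianToNative` from `std.bitmanip`; `decodeUnit` is a hypothetical helper name introduced only for this example.

```d
import std.bitmanip : bigEndianToNative, littleEndianToNative;

// Hypothetical helper for this sketch: convert two raw bytes with a known
// byte order into a native-endian UTF-16 code unit.
wchar decodeUnit(ubyte[2] raw, bool bigEndian)
{
    return bigEndian ? bigEndianToNative!wchar(raw)
                     : littleEndianToNative!wchar(raw);
}

void main()
{
    ubyte[2] le = [0x41, 0x00]; // "A" as UTF-16LE
    ubyte[2] be = [0x00, 0x41]; // "A" as UTF-16BE

    // Both decode to the same native wchar on any target.
    assert(decodeUnit(le, false) == 'A');
    assert(decodeUnit(be, true) == 'A');
}
```

After such a conversion the resulting `wchar` holds the code unit in the target's native byte order, which is the property the issue asks the spec to state.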
Comment #5 by dlang-bot — 2022-06-16T09:33:54Z
@dkorpel created dlang/dlang.org pull request #3319 "Fix 23186 - wchar/dchar do not have their endianess defined" fixing this issue:

- Fix 23186 - wchar/dchar do not have their endianess defined

https://github.com/dlang/dlang.org/pull/3319
Comment #6 by dlang-bot — 2022-09-02T17:10:26Z
dlang/dlang.org pull request #3319 "Fix 23186 - wchar/dchar do not have their endianess defined" was merged into master:

- d3e822cf7d4acfd38fcf3dc3a632c3644741c6d3 by Dennis Korpel: Fix 23186 - wchar/dchar do not have their endianess defined

https://github.com/dlang/dlang.org/pull/3319