Bug 18105 – std.conv.parse!wchar cannot take a string

Status
RESOLVED
Resolution
WONTFIX
Severity
normal
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2017-12-19T18:39:36Z
Last change time
2017-12-20T13:56:54Z
Assigned to
dechcaudron+dlang.issue.tracking
Creator
dechcaudron+dlang.issue.tracking

Comments

Comment #0 by dechcaudron+dlang.issue.tracking — 2017-12-19T18:39:36Z
Since any char fits into a wchar, and given that parse!dchar(string) works, it seems logical that it should also work for wchar. Supporting it would also allow to!wchar(string) to work the way to!dchar(string) does. I'm merely reporting an issue that has affected me personally and that I've already fixed in my Phobos fork. Will create a PR.
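A minimal sketch of the reported asymmetry (the string literal is illustrative; the commented-out line is the call this report asks for):

    import std.conv : parse;

    void main()
    {
        string s = "hi";
        dchar d = parse!dchar(s);    // compiles: consumes one code point
        assert(d == 'h' && s == "i");
        // wchar w = parse!wchar(s); // does not compile: no wchar overload
    }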
Comment #1 by schveiguy — 2017-12-19T19:01:59Z
What if the string's first dchar is a surrogate pair? It doesn't fit into a wchar then.
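For illustration, a sketch of the surrogate-pair case (U+1F600 is just one example of a code point outside the BMP):

    void main()
    {
        // A code point outside the BMP needs two UTF-16 code units
        // (a surrogate pair), so it cannot fit into a single wchar.
        wstring w = "\U0001F600"w;
        assert(w.length == 2);
        assert(w[0] == 0xD83D && w[1] == 0xDE00); // high, low surrogate
    }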
Comment #2 by dechcaudron+dlang.issue.tracking — 2017-12-19T19:22:41Z
(In reply to Steven Schveighoffer from comment #1)
> What if the string's first dchar is a surrogate pair? It doesn't fit into a
> wchar then.

That is correct, and it will pose a problem when using 'to' (I'm already seeing it: it throws UTFException). But 'parse' will indeed get the first element of the pair in such a case.
Comment #3 by schveiguy — 2017-12-19T20:51:25Z
This means parse cannot advance the range, since it can't go past the current code point. It doesn't really have a way to parse this properly. There are no other types it does this for. I think it would be a mistake to add such a feature. Have you looked at std.utf to see if it can do what you are looking for?
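A sketch of what std.utf offers here, assuming decode/encode behave as documented (the exact counts below assume UTF-8 input):

    import std.utf : decode, encode;

    void main()
    {
        string s = "\U0001F600!";
        size_t i = 0;
        dchar d = decode(s, i);      // decodes a full code point, advances i
        assert(d == '\U0001F600' && i == 4); // four UTF-8 code units consumed

        wchar[2] buf;
        size_t n = encode(buf, d);   // re-encode the code point as UTF-16
        assert(n == 2);              // a surrogate pair: two wchar code units
    }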
Comment #4 by dechcaudron+dlang.issue.tracking — 2017-12-20T07:43:21Z
(In reply to Steven Schveighoffer from comment #3)
> This means parse cannot advance the range, since it can't go past the
> current code point. It doesn't really have a way to parse this properly.

I'm sorry, I don't get it. I see no disadvantages to using wchar as the destination type compared to using char. Would you mind elaborating?
Comment #5 by dechcaudron+dlang.issue.tracking — 2017-12-20T10:57:14Z
I am now aware of the complexity of UTF-16 encoding, surrogate pairs and all (I just read the standard). While I agree the solution is not trivial, if we do allow parse!char(string), we should allow parse!wchar(string) in the same fashion. When using 'char', there is no guarantee that the returned value will be a valid UTF-8 code point; it just takes the next 'char' code unit from the string, and so should 'wchar', IMHO. That is, it would take the next two 'char's from the string.
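For reference, a small example of the parse!char behavior described above (assuming current Phobos semantics, where the char overload pops a single code unit):

    import std.conv : parse;

    void main()
    {
        string s = "é";          // two UTF-8 code units: 0xC3 0xA9
        char c = parse!char(s);  // pops exactly one code unit
        assert(c == 0xC3);       // not a complete code point on its own
        assert(s.length == 1);   // the 0xA9 continuation byte remains
    }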
Comment #6 by schveiguy — 2017-12-20T13:56:54Z
(In reply to dechcaudron+dlang.issue.tracking from comment #5)
> While I agree the solution is not trivial, if we do allow
> parse!char(string), we should allow parse!wchar(string) in the same fashion.

The reason you can parse a string into chars is because you can actually do it: I can consume one char off a string and return it, no problem. You can't do the same with wchar. There's no way to advance the string "partially" into a code point.

> When using 'char', there is no guarantee that the returned value will be a
> valid UTF-8 code point.

No, but a char is not necessarily a UTF code point, it's a UTF-8 code unit. There is no direct translation of N chars to 1 wchar, so there's no way to advance the range properly.

> It just gets the next 'char' code unit from the string, and so should
> 'wchar' IMHO. That is, taking the next 2 'char' from the string.

It is NOT the same thing to take 2 chars and stuff them into a wchar. This is not only incorrect, it's pretty much useless. I'm not sure of your use case, but I think you want one of two things:

1. std.utf.byUTF!wchar:

       foreach (c; "hello".byUTF!wchar)
       {
           static assert(is(typeof(c) == wchar));
           writeln(c); // writes 'h', 'e', 'l', 'l', 'o' on separate lines
       }

   This will properly encode surrogate pairs.

2. cast(ushort[]) myString;

   This will look at the string in 16-bit chunks, but these aren't valid characters.
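As a follow-up to option 1, a sketch showing that byUTF!wchar yields proper surrogate pairs for non-BMP input (the sample string is illustrative):

    import std.stdio : writefln;
    import std.utf : byUTF;

    void main()
    {
        foreach (wchar c; "a\U0001F600b".byUTF!wchar)
            writefln("%04X", cast(ushort) c);
        // Prints: 0061, D83D, DE00, 0062 (one per line)
    }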