Strings are dealt with fantastically in D overall, but the very nature of UTF-8 and UTF-16 makes it easy to screw things up when trying to deal with individual characters. In foreach loops, D can convert the chars or wchars to dchar and avoid the problem.
foreach(dchar c; str)
However, if you fail to give an element type
foreach(c; str)
then the element type will be the element type of the array. Normally this is good, but for strings, wstrings, char[], and wchar[] this is generally _bad_. In most cases, what people will likely want is to loop over each code point rather than each code unit - that is, what they really should be using is
foreach(dchar c; str)
But if they fail to put the element type, that's not what they'll get. So, I propose that it become a warning whenever someone tries to loop over a char[], wchar[], string, or wstring without giving the element type. That way, the programmer will realize that they should have put dchar, since that's almost certainly what they wanted. If, however, they really did want char or wchar, then they're free to put that as the element type and iterate normally. But because iterating over the chars or wchars is almost certainly _not_ what the programmer wants, it seems worthwhile to give a warning in cases where they fail to put the element type. It doesn't restrict the programmer at all, but it will help them avoid bugs with regard to the UTF-8 and UTF-16 encodings.
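To make the report's point concrete, here is a minimal sketch (not from the report itself) showing how the two loop forms disagree on a string containing a multi-byte character:

```d
import std.typecons : Tuple, tuple;

// Count UTF-8 code units vs. code points in a string.
Tuple!(size_t, size_t) countBoth(string str)
{
    size_t units, points;
    foreach (char c; str)  // element type char: iterates code units
        ++units;
    foreach (dchar c; str) // element type dchar: iterates code points
        ++points;
    return tuple(units, points);
}

void main()
{
    auto r = countBoth("héllo"); // 'é' occupies two UTF-8 code units
    assert(r[0] == 6);           // six code units
    assert(r[1] == 5);           // five code points
}
```

With an inferred element type, `foreach (c; str)` behaves like the first loop, which is the trap described above.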
Comment #1 by awishformore — 2010-07-20T05:30:51Z
I agree.
I'm absolutely happy with the way D handles foreach loops on strings and char[].
However, disambiguation of foreach loops for char and wchar is an issue that should be addressed. I think the best solution would be to make it the coding standard to always specify char/wchar/dchar.
Issuing a warning whenever you iterate over a string or wstring without explicitly declaring char/wchar/dchar would indeed be the way to do so. The user is still free to ignore the warning, but accidental misuse would be avoided and a consistently clear coding style would be encouraged.
Comment #2 by lio+bugzilla — 2014-01-17T01:30:01Z
I took the liberty to remove the suggested solution from the title, since I think there are a couple of possible fixes here:
1. Issue a warning (original suggestion)
2. Issue an error, always require a value type (breaking change)
3. Infer the value type as "dchar" in all cases (breaking change)
4. Throw an exception at runtime when >char, >wchar unicode is encountered (breaking change)
I think this issue is serious enough to warrant a breaking change. I taught a D workshop in China, and everybody expected foreach to "just work", and rightfully so.
foreach(c; "你好") {}
This should just work! And it's hard to explain to people why it doesn't without getting into Unicode encoding issues, which no user wants to care about.
I'm going to argue for fix 3, and I'd say it's worth taking a breaking change for this issue.
The breaking change is compile time only, and limited to foreach over char[] or wchar[], with a non-ref, inferred value type, and where the scope cares about the value type being char or wchar.
That last part is important: In all of druntime and phobos there were only 2 places where that was the case. All others, including all tests(!), compiled (and ran) successfully without changes. The two places were fixed by adding the appropriate type, in both cases "char". A nice side effect of this change is that it makes it immediately obvious that the foreach does NOT handle the full Unicode character set. It's self-documenting, in a way.
Note that we might still choose a runtime exception. It's hardly useful to get a char with value 0xE8 out of a char[]. But throwing a sudden exception is a breaking change that might be too risky to take on.
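As a sketch of the kind of code fix 3 would affect (the function below is hypothetical, not one of the two druntime/Phobos sites mentioned above): any scope that pins down the inferred element type would break loudly at compile time rather than silently changing meaning.

```d
void scan(string s)
{
    foreach (c; s) // inferred element type: char today, dchar under fix 3
    {
        // Code like this "cares" about the element type and would stop
        // compiling under fix 3, surfacing the change at compile time:
        static assert(is(typeof(c) == char));
        static assert(c.sizeof == 1);
    }
}

void main()
{
    scan("hello");
}
```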
Comment #3 by monarchdodra — 2014-01-21T10:07:21Z
(In reply to comment #2)
> I took the liberty to remove the suggested solution from the title, since I
> think there are a couple of possible fixes here:
>
> 1. Issue a warning (original suggestion)
> 2. Issue an error, always require a value type (breaking change)
> 3. Infer the value type as "dchar" in all cases (breaking change)
> 4. Throw an exception at runtime when >char, >wchar unicode is encountered
> (breaking change)
>
> I think this issue is serious enough to warrant a breaking change. I taught a D
> workshop, in China, and everybody expected foreach to "just work", and
> rightfully so.
>
> foreach(c; "你好") {}
>
> This should just work! And it's hard to explain people why it doesn't, without
> getting into Unicode encoding issues, which no user wants to care about.
>
> I'm going to argue for fix 3. and I'd say it's worth taking a breaking change
> for this issue.
>
> The breaking change is compile time only, and limited to foreach over char[] or
> wchar[], with a non-ref, inferred value type, and where the scope cares about
> the value type being char or wchar.
>
> That last part is important: In all of druntime and phobos there were only 2
> places where that was the case. All others, including all tests(!), compiled
> (and ran) successfully without changes. The two places were fixed by adding the
> appropriate type, in both cases "char". A nice side effect of this change is
> that it makes it immediately obvious that the foreach does NOT handle the full
> Unicode character set. It's self-documenting, in a way.
The major issue though is that while it still compiles, it becomes a *silent* breaking change, which is the worst kind of change.
I can think of a few places where the foreach is written "foreach(c; s)" *because* we specifically want to iterate over the raw code units (not necessarily the *chars*).
Just because the code still compiles doesn't mean it's still correct. These kinds of changes can introduce some really insidious bugs, or performance issues.
At this point the only "solution" I really see would be to simply deprecate and ban implicit char inference. Code would have to make an explicit choice. After two or three years of this, we'd be able to safely assume that there is no more implicit char inference out there, and then we'd be able to set the default to dchar. Anything else (IMO) is just a disaster waiting to happen.
So I'd say I'd opt for "2)", and *once* we've had "2)" for a while, we can opt to add "3)".
The question, though, is whether it's all worth it... I don't know.
> Note that we might still choose a runtime exception. It's hardly useful to get
> a char with value 0xE8 out of a char[]. But throwing a sudden exception is a
> breaking change that might be too risky to take on.
Except if you are explicitly iterating the code units because you know your needle is ASCII. Besides,
1) You'd *kill* performance.
2) You'd make a fair amount of nothrow functions throw.
Comment #4 by lio+bugzilla — 2014-01-21T18:20:57Z
The pull request contains some more arguments against option 2:
* a different element type for "ref" compared to non-ref is unprecedented
* won't fix iteration for composed characters
* what if iteration over code points was intended in the first place? Silent breaking change
I'm personally particularly interested in fixing the foreach over a Unicode literal. I guess we could change the type of such a literal and, for example, pick a type that fits the characters in the literal.
Comment #5 by code — 2014-01-21T18:30:31Z
(In reply to comment #4)
> I'm personally particularly interested to fix the foreach over a unicode
> literal. I guess we could change the type of such a literal and, for example,
> pick a type that fits the characters in the literal.
Sounds pretty hacky.
Andrei posted a good alternative solution.
http://forum.dlang.org/post/[email protected]
We should also try to use range foreach (.front, .popFront, .empty), because decoding is much faster than with the runtime implementation, which uses a delegate.
After a proper deprecation cycle this could work like so.
// iterate over any unicode string
foreach (c; "foobar") {} //dchar
foreach (c; "foobar"c) {} //dchar
foreach (c; "foobar"w) {} //dchar
foreach (c; "foobar"d) {} //dchar
// iterate over representation
foreach (c; "foobar".rep) {} //ubyte
foreach (c; "foobar"c.rep) {} //ubyte
foreach (c; "foobar"w.rep) {} //ushort
foreach (c; "foobar"d.rep) {} //uint
// conversion becomes an error
foreach (char c; "foobar"c) {} //error can't convert dchar to char
foreach (char c; "foobar"w) {} //error can't convert dchar to char
foreach (char c; "foobar"d) {} //error can't convert dchar to char
foreach (wchar wc; "foobar"c) {} //error can't convert dchar to wchar
foreach (wchar wc; "foobar"w) {} //error can't convert dchar to wchar
foreach (wchar wc; "foobar"d) {} //error can't convert dchar to wchar
// std.utf.transcode for transcoding (lazy iteration)
foreach (c; "foobar".transcode!char()) {} //char
foreach (wc; "foobar".transcode!wchar()) {} //wchar
// and so on and so forth
// further changes
"foobar".length // use "foobar".rep.length
"foobar"[0] // use "foobar".rep[0]
Comment #6 by yebblies — 2014-01-21T20:13:11Z
I suggest the following steps:
1. Warn when foreach arg type is inferred to be a narrow string
2. Deprecate ...
3. Error ...
4. New behavior, maybe
This both prevents accidental narrow-char iteration, and lets us put off deciding on the new behavior for at least a year. Win!
Comment #7 by monarchdodra — 2014-01-21T23:42:52Z
(In reply to comment #5)
> (In reply to comment #4)
> Sounds pretty hacky.
>
> Andrei posted a good alternative solution.
> http://forum.dlang.org/post/[email protected]
>
> We should also try to use range foreach (.front, .popFront, .empty), because
> decoding is much faster than with the runtime implementation, which uses a
> delegate.
Arguably, that's an implementation problem? In particular, front/popFront is *known* to be a slow iterative process, due to the double decode+stride.
Whenever I need *real* speed for string decoding, *NOTHING* beats std.utf.decode. And technically, I see no reason the built-in foreach couldn't *hope* to one day achieve the same speeds.
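A sketch of the explicit-decode loop being referred to: std.utf.decode returns one code point and advances the index past however many code units it consumed, so it is a single pass with no separate stride step.

```d
import std.utf : decode;

// Collect all code points of a UTF-8 string in one pass.
dchar[] decodeAll(string s)
{
    dchar[] result;
    size_t i;
    while (i < s.length)
        result ~= decode(s, i); // advances i past the consumed code units
    return result;
}

void main()
{
    assert(decodeAll("héllo").length == 5);
    assert(decodeAll("你好") == "你好"d);
}
```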
> After a proper deprecation cycle this could work like so.
>
> // iterate over any unicode string
> foreach (c; "foobar") {} //dchar
> foreach (c; "foobar"c) {} //dchar
> foreach (c; "foobar"w) {} //dchar
> foreach (c; "foobar"d) {} //dchar
>
> // iterate over representation
> foreach (c; "foobar".rep) {} //ubyte
> foreach (c; "foobar"c.rep) {} //ubyte
> foreach (c; "foobar"w.rep) {} //ushort
> foreach (c; "foobar"d.rep) {} //uint
Sounds good to me.
> // conversion becomes an error
> foreach (char c; "foobar"c) {} //error can't convert dchar to char
> foreach (char c; "foobar"w) {} //error can't convert dchar to char
> foreach (char c; "foobar"d) {} //error can't convert dchar to char
> foreach (wchar wc; "foobar"c) {} //error can't convert dchar to wchar
> foreach (wchar wc; "foobar"w) {} //error can't convert dchar to wchar
> foreach (wchar wc; "foobar"d) {} //error can't convert dchar to wchar
>
> // std.utf.transcode for transcoding (lazy iteration)
> foreach (c; "foobar".transcode!char()) {} //char
> foreach (wc; "foobar".transcode!wchar()) {} //wchar
> // and so on and so forth
>
> // further changes
> "foobar".length // use "foobar".rep.length
> "foobar"[0] // use "foobar".rep[0]
Now that just seems excessive to me. Especially the last one. We can't completely act like a string isn't an array. Plus:
auto c = "foobar"[0];
static assert(is(typeof(c) == char)); // Fails
Comment #8 by bugzilla — 2014-01-23T21:34:22Z
I'm very much opposed to this. The way it works now has been that way from the beginning, and an unknowably vast amount of code depends on it.
Furthermore, I don't like the inherent slowdowns it causes. Only rarely does one need decoded chars, the rest of the time working with bytes is fast and correct.
Comment #9 by kozzi11 — 2014-01-24T02:22:51Z
(In reply to comment #8)
> I'm very much opposed to this. The way it works now has been that way from the
> beginning, and an unknowably vast amount of code depends on it.
>
> Furthermore, I don't like the inherent slowdowns it causes. Only rarely does
> one need decoded chars, the rest of the time working with bytes is fast and
> correct.
Yes, I agree with this. string is just an alias for an immutable array of chars; there is nothing special about it. (I am not happy with this, but this is the current situation.)
What I really hate is this:
char[] a = cast(char[])"asdsad";
writeln(typeid(typeof(a.front))); // prints dchar instead of char
This is really unexpected behavior :(.
I think a new special type, let's call it `EncodeString`, should be added.
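The surprise described here comes from std.range.primitives, which special-cases narrow strings so that front decodes; a sketch, with std.string.representation as the escape hatch:

```d
import std.range.primitives : front;
import std.string : representation;

void main()
{
    char[] a = "asdsad".dup;

    // Range primitives auto-decode narrow strings:
    static assert(is(typeof(a.front) == dchar));

    // representation() exposes the raw code units instead:
    static assert(is(typeof(a.representation.front) == ubyte));
}
```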
Comment #10 by code — 2014-01-25T20:50:26Z
(In reply to comment #8)
> I'm very much opposed to this. The way it works now has been that way from the
> beginning, and an unknowably vast amount of code depends on it.
>
You fail to recognize that it's broken from the beginning.
The knowably vast number of people who stumble over this should show you how surprising it is. It's the only place in the language where Unicode handling is opt-in, and choosing an incorrect default goes against basic D principles.
D is designed such that most "obvious" code is fast and safe. On occasion a function might need to escape the confines of type safety for ultimate speed and control. For such rare cases D offers native pointers, type casts...
> Furthermore, I don't like the inherent slowdowns it causes. Only rarely does
> one need decoded chars, the rest of the time working with bytes is fast and
> correct.
That's not the best argument. Handling UTF-8 only adds a single comparison
if (str[i] < 0x80)
    return str[i];
else
    return decodeUTF8(str, i);
that the branch predictor will always get right.
What's true is that we had several codegen issues with Unicode decoding, e.g. the comparison wasn't inlined.
Currently foreach decodes through a delegate, which is painfully slow.
https://github.com/rejectedsoftware/vibe.d/pull/327
So handling UTF-8 requires care in library design and a good optimizer but I don't see how it is inherently slower.
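The comparison in question, written out as a runnable sketch; decodeUTF8 in the snippet above corresponds here to std.utf.decode, and the function name is illustrative:

```d
import std.utf : decode;

// ASCII fast path: one comparison per character; only multi-byte
// sequences pay for full decoding.
dchar nextCodePoint(string str, ref size_t i)
{
    if (str[i] < 0x80)
        return str[i++];       // ASCII code unit, no decoding needed
    else
        return decode(str, i); // multi-byte UTF-8 sequence
}

void main()
{
    string s = "a你";
    size_t i;
    assert(nextCodePoint(s, i) == 'a');
    assert(nextCodePoint(s, i) == '你');
    assert(i == s.length);
}
```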
Comment #11 by bugzilla — 2014-01-26T01:17:28Z
> You fail to recognize that it's broken from the beginning.
I understand and recognize your arguments, I just do not agree with the conclusion.
1. Silently breaking code with such a change is a seriously bad idea at this point.
2. I do not agree with the hypothesis that the slowdown is trivial. I just got done doing a project for a client, where high speed was essential. I got bit by front() doing a decode. The speed hit was not acceptable, and I had to insert lots of ugly casts to ubyte arrays. The code almost never needed a decode character, but simply doing a copy() did a decode and then an encode. Arghh!
Actually needing a dchar is rather rare. The bulk of algorithms with strings work better and faster using octets (like good ole' copy()). I've suspected for a while now that std.algorithm is slower than necessary because of the pointless decoding going on, and I know that std.regex suffered serious slowdowns because of it.
To me this is much like the recurring argument that D should hide the reality of twos-complement arithmetic with various checks for overflows and carries. D strings are octets, not dchars, and one simply needs to be aware of that when programming with them.
Comment #12 by code — 2014-01-26T11:31:39Z
(In reply to comment #11)
> > You fail to recognize that it's broken from the beginning.
>
> I understand and recognize your arguments, I just do not agree with the
> conclusion.
>
> 1. Silently breaking code with such a change is a seriously bad idea at this
> point.
>
Nobody suggested breaking code; the suggestion is to deprecate the old behavior.
> 2. I do not agree with the hypothesis that the slowdown is trivial. I just got
> done doing a project for a client, where high speed was essential. I got bit by
> front() doing a decode. The speed hit was not acceptable, and I had to insert
> lots of ugly casts to ubyte arrays. The code almost never needed a decode
> character, but simply doing a copy() did a decode and then an encode. Arghh!
>
It's not a trivial slowdown, I've optimized this too in various places.
But for most applications it doesn't matter.
The problem with UTF is that you need to handle it correctly from A to B.
https://github.com/sociomantic/git-hub/pull/45#issuecomment-26527066
Doing the optimization when needed is much simpler.
> Actually needing a dchar is rather rare. The bulk of algorithms with strings
> work better and faster using octets (like good ole' copy()). I've suspected for
> a while now that std.algorithm is slower than necessary because the pointless
> decoding going on, and I know that std.regex suffered serious slowdowns because
> of it.
Dmitry is working on some good stuff here.
https://github.com/D-Programming-Language/phobos/pull/1685
But surely you cannot suggest having functions that are incorrect by default?
> To me this is much like the recurring argument that D should hide the reality
> of twos-complement arithmetic with various checks for overflows and carries.
Integer overflow is rarely an issue; Unicode handling is.
Also, your alphabet fits into this http://dlang.org/ascii-table.html; mine doesn't.
> D strings are octets, not dchars, and one simply needs to be aware of that when programming with them.
That may be true, but the data suggests that we're incapable of this.
https://d.puremagic.com/issues/show_bug.cgi?id=6791
https://d.puremagic.com/issues/buglist.cgi?quicksearch=unicode
I think the biggest issue is the inconsistency.
Some language parts use automatic transcoding, but a few others work on the representation.
Comment #13 by github-bugzilla — 2014-03-16T10:19:23Z
(In reply to Martin Nowak from comment #12)
> The problem with UTF is that you need to handle it correctly from A to B.
The problem is dchar makes it harder to handle UTF correctly, because it's more subtly incorrect, hence more difficult to fix. Without automatic decoding people are more likely to encounter failure during development and fix it before release.
Comment #16 by dfj1esp02 — 2014-07-16T13:28:54Z
Added link to discussion.
Comment #17 by verylonglogin.reg — 2014-08-05T14:37:26Z
GitHub commits closed this by mistake. Reopened.
Comment #18 by dmitry.olsh — 2016-04-06T07:13:26Z
(In reply to Walter Bright from comment #11)
> > You fail to recognize that it's broken from the beginning.
>
> and I know that std.regex suffered serious
> slowdowns because of it.
This turned out to be factually wrong, after I spent a year and a half constructing a non-decoding version of std.regex for no significant gain ;) A brief exercise with a profiler shows decoding to be ~0.5% in a recent version; as Martin points out, a single 100%-predictable comparison is not a problem.
The only case where not decoding is faster is bulk-mode operations that can take advantage of SIMD or auto-vectorization, such as:
a) Skipping comments (i.e. looping until '*' is hit, then checking for '/')
b) Comparing strings or searching for substring.
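Case (a) as a sketch: the scan compares only ASCII bytes, and UTF-8 multi-byte sequences never contain bytes below 0x80, so the loop can run over raw code units with no decoding, which is what makes it amenable to SIMD/auto-vectorization:

```d
// Return the index just past the closing "*/" of a comment, scanning
// raw code units; no decoding is needed because continuation bytes of
// multi-byte UTF-8 sequences can never equal ASCII '*' or '/'.
size_t skipComment(const(char)[] src, size_t i)
{
    while (i + 1 < src.length)
    {
        if (src[i] == '*' && src[i + 1] == '/')
            return i + 2;
        ++i;
    }
    return src.length;
}

void main()
{
    auto s = "/* héllo */x";
    assert(s[skipComment(s, 2) .. $] == "x");
}
```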
Comment #19 by dfj1esp02 — 2016-04-06T09:28:04Z
(In reply to Dmitry Olshansky from comment #18)
> b) Comparing strings or searching for substring.
Doesn't regex search for substring?
Comment #20 by dmitry.olsh — 2016-04-06T19:47:34Z
(In reply to Sobirari Muhomori from comment #19)
> (In reply to Dmitry Olshansky from comment #18)
> > b) Comparing strings or searching for substring.
>
> Doesn't regex search for substring?
No, it examines one code point at a time. There is a special prefix matcher that does indeed avoid decoding, but it is very limited in what it can do, so it's used only to find a possible start of a match.
Comment #21 by robert.schadek — 2024-12-13T17:52:39Z