Bug 1235 – std.string.tolower() fails on certain utf8 characters

Status
RESOLVED
Resolution
FIXED
Severity
minor
Priority
P2
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2007-05-15T19:08:00Z
Last change time
2015-06-09T05:15:21Z
Assigned to
bugzilla
Creator
d

Comments

Comment #0 by d — 2007-05-15T19:08:32Z
    import std.string;

    int main(char[][] args)
    {
        printf("tolower(\"\\u0130e\") -> \"%.*s\"\n", tolower("\u0130e"));
        return 0;
    }

produces incorrect output:

    tolower("\u0130e") -> "i e"

The bug comes from erroneous code in phobos/std/string.d, line 843:

    if (r.length != i + j)
        r = r[0 .. i + j];

The Turkish dotted capital I (U+0130) is correctly converted to ASCII i (U+0069), but the converted character does not use the same number of bytes as the original character, so the code above is incorrect. As far as I understand the implementation, it could be removed completely.

A similar issue is present in toupper(), with the additional twist that conversion to uppercase should not be special-cased for the ASCII subset in the Turkish locale.

Additionally, the non-ASCII code path is triggered by if (c >= 0x7F) where it should be if (c > 0x7F).
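To illustrate the length mismatch, here is a minimal sketch in current D. It is not the Phobos implementation or the actual fix. The helpers sketchLower and lowerByAppending are hypothetical names invented for this example, and the case mapping is deliberately limited to ASCII A-Z plus U+0130. The point is only that a result built by appending re-encoded code points gets the right length on its own, without any slicing to a length computed from the input.

```d
import std.stdio : writeln;
import std.utf : encode;

// Hypothetical, deliberately tiny case mapping used only for this sketch:
// ASCII A-Z plus U+0130, mapped the way the report describes.
dchar sketchLower(dchar c)
{
    if (c == '\u0130')
        return 'i';
    if (c >= 'A' && c <= 'Z')
        return cast(dchar)(c + ('a' - 'A'));
    return c;
}

// Builds the result by appending each re-encoded code point, so the
// output length is whatever the converted characters actually need.
string lowerByAppending(string s)
{
    char[] result;
    foreach (dchar c; s) // decodes the UTF-8 input one code point at a time
    {
        char[4] buf;
        immutable n = encode(buf, sketchLower(c));
        result ~= buf[0 .. n];
    }
    return result.idup;
}

void main()
{
    string input = "\u0130e";
    string output = lowerByAppending(input);
    writeln(input.length);  // 3: U+0130 takes two bytes in UTF-8, 'e' takes one
    writeln(output.length); // 2: 'i' and 'e' take one byte each
    writeln(output);        // ie
}
```

The input is three bytes long and the converted string is two, so any code that assumes the converted text occupies the same byte positions as the original will leave a stray byte or cut the result short.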
Comment #1 by bugzilla — 2007-06-28T22:57:41Z
I agree, with the exception that for UTF characters there is no such thing as a locale, so toupper("i") cannot return \u0130.
Comment #2 by bugzilla — 2007-07-01T14:03:43Z
Fixed in DMD 1.018 and DMD 2.002.