Bug 10203 – std.string.toUpperInPlace is... not in place

Status
RESOLVED
Resolution
FIXED
Severity
normal
Priority
P2
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2013-05-29T09:53:00Z
Last change time
2016-06-03T16:15:34Z
Assigned to
nobody
Creator
monarchdodra
See also
http://d.puremagic.com/issues/show_bug.cgi?id=9629

Comments

Comment #0 by monarchdodra — 2013-05-29T09:53:34Z
As reported by manu in his talk. All in the title. toUpperInPlace will actually allocated a new string. As a matter of fact, it will allocate, a LOT: a new string per char.
Comment #1 by monarchdodra — 2013-05-29T12:18:46Z
Comment #2 by monarchdodra — 2013-05-30T02:13:50Z
(In reply to comment #1) > Fixed here > > https://github.com/D-Programming-Language/phobos/pull/1322 Turns out, upon investigation, that my fix is incorrect. What is more problematic though, is that the review revealed that there are cases where toUpperInPlace simply *cannot* be inplace: there characters out there, whose UTF representation is not the same length for upper case and lower case. What this means is that the lower case representation could be longer than the original length. How could this be done in place? It also reveals more problems: If the resulting string is actually smaller, than slices that aliased the same data will now reference garbage values. For example: 'İ', which is u0130, and a two byte encoding will become 'i'. So when I take the string "İa": auto a = "\xC4\xB0\x61"; auto b = a; toLowerInPlace(a); //Now: //a == "\x69\xB0" //b == "\x69\xB0\x61" Oops: Trailing code unit :/ Or: say, I have "aİa", and want to reduce in place "İ" auto a = "\x61\xC4\xB0\x61"; auto b = a[1 .. 3]; toLowerInPlace(b); //Now: //b == "\x69" //OK //a == "\x61\x69\xB0\x61" //Wait, what is that B0 doinh here? -------- It seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior. I think it should either be tagged with a massive red unsafe, or deprecated.
Comment #3 by monarchdodra — 2013-05-30T02:19:53Z
(In reply to comment #2)> So when I take the string "İa": > auto a = "\xC4\xB0\x61"; > auto b = a; > toLowerInPlace(a); > > //Now: > //a == "\x69\xB0" > //b == "\x69\xB0\x61" Oops: Trailing code unit :/ Wait, this example is wrong, corrected as: take the string "aİ" auto a = "\x61\xC4\xB0"; auto b = a; toLowerInPlace(a); //Now: //a == "\x61\x69" //b == "\x61\x69\xB0" Oops: Trailing code unit :/ Sorry.
Comment #4 by dmitry.olsh — 2013-09-21T13:58:19Z
(In reply to comment #3) > Wait, this example is wrong, corrected as: > > take the string "aİ" > auto a = "\x61\xC4\xB0"; auto a = "\x61\xC4\xB0".dup; > auto b = a; > toLowerInPlace(a); > > //Now: > //a == "\x61\x69" > //b == "\x61\x69\xB0" Oops: Trailing code unit :/ > > Sorry. I belive we now should have solid treatment of toUpper/toLower (pending a bugfix in the works). For me this gives now: //a == 61 69 CC 87 //b == 61 C4 B0 i.e. toLowerInPlace fails to do this in place because resulting length increases. Should it try to extend if a.capacity allows it? (I bet it has about 15 bytes of storage)
Comment #5 by bearophile_hugs — 2013-09-21T14:13:19Z
(In reply to comment #2) > It seems to me that, overall, toLowerInPlace is a function that is broken, that > cannot respect the specs it promises, and violates the principal of least > surprise in regards to behavior. > > I think it should either be tagged with a massive red unsafe, or deprecated. Even if it's impossible for Unicode strings, I'd like to keep a version of it for just arrays of ASCII chars, that is a common enough use case.
Comment #6 by dmitry.olsh — 2013-09-21T14:26:34Z
(In reply to comment #5) > (In reply to comment #2) > > > It seems to me that, overall, toLowerInPlace is a function that is broken, that > > cannot respect the specs it promises, and violates the principal of least > > surprise in regards to behavior. > > > > I think it should either be tagged with a massive red unsafe, or deprecated. > > Even if it's impossible for Unicode strings, I'd like to keep a version of it > for just arrays of ASCII chars, that is a common enough use case. It also works quite well for UTF-16. And it does now. And I would have kept it if only because of backwards compatibility perspective. There is no other primitive yet, but a version(s) with output range for all string transformations is something to look forward to. The only question is whether or not should this function try to use extra capacity beyond the length if it's present (that would make InPlace suffix look saner).
Comment #7 by devw0rp — 2013-09-21T14:35:51Z
It may be worth considering presenting only what is possible, so offer toUpperInPlace or toLowerInPlace only on ASCII data. As you say, Unicode strings change in byte length dramatically, but the ASCII range stays the same. So I think it would be worth offering functions with different names, or perhaps with locale template arguments. (Locale.ascii, or whatever.) Just so there can be a guarantee of something never allocating when you want to avoid that, but still let you get at a "almost there" function when you want that.
Comment #8 by jack — 2016-06-03T16:15:34Z
At some point this was noted in the docs and fixed to only allocate if it runs out of space in the array. I'm closing this as fixed.