← Back to index | Original Bugzilla link

Bug 10203 – std.string.toUpperInPlace is... not in place

Status: RESOLVED
Resolution: FIXED
Severity: normal
Priority: P2
Component: phobos
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2013-05-29T09:53:00Z
Last change time: 2016-06-03T16:15:34Z
Assigned to: nobody
Creator: monarchdodra
See also: http://d.puremagic.com/issues/show_bug.cgi?id=9629

Comments

Comment #0 by monarchdodra — 2013-05-29T09:53:34Z

As reported by manu in his talk. All in the title. toUpperInPlace will actually allocated a new string. As a matter of fact, it will allocate, a LOT: a new string per char.

Comment #1 by monarchdodra — 2013-05-29T12:18:46Z

Fixed here https://github.com/D-Programming-Language/phobos/pull/1322

Comment #2 by monarchdodra — 2013-05-30T02:13:50Z

(In reply to comment #1) > Fixed here > > https://github.com/D-Programming-Language/phobos/pull/1322 Turns out, upon investigation, that my fix is incorrect. What is more problematic though, is that the review revealed that there are cases where toUpperInPlace simply *cannot* be inplace: there characters out there, whose UTF representation is not the same length for upper case and lower case. What this means is that the lower case representation could be longer than the original length. How could this be done in place? It also reveals more problems: If the resulting string is actually smaller, than slices that aliased the same data will now reference garbage values. For example: 'İ', which is u0130, and a two byte encoding will become 'i'. So when I take the string "İa": auto a = "\xC4\xB0\x61"; auto b = a; toLowerInPlace(a); //Now: //a == "\x69\xB0" //b == "\x69\xB0\x61" Oops: Trailing code unit :/ Or: say, I have "aİa", and want to reduce in place "İ" auto a = "\x61\xC4\xB0\x61"; auto b = a[1 .. 3]; toLowerInPlace(b); //Now: //b == "\x69" //OK //a == "\x61\x69\xB0\x61" //Wait, what is that B0 doinh here? -------- It seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior. I think it should either be tagged with a massive red unsafe, or deprecated.

Comment #3 by monarchdodra — 2013-05-30T02:19:53Z

(In reply to comment #2)> So when I take the string "İa": > auto a = "\xC4\xB0\x61"; > auto b = a; > toLowerInPlace(a); > > //Now: > //a == "\x69\xB0" > //b == "\x69\xB0\x61" Oops: Trailing code unit :/ Wait, this example is wrong, corrected as: take the string "aİ" auto a = "\x61\xC4\xB0"; auto b = a; toLowerInPlace(a); //Now: //a == "\x61\x69" //b == "\x61\x69\xB0" Oops: Trailing code unit :/ Sorry.

Comment #4 by dmitry.olsh — 2013-09-21T13:58:19Z

(In reply to comment #3) > Wait, this example is wrong, corrected as: > > take the string "aİ" > auto a = "\x61\xC4\xB0"; auto a = "\x61\xC4\xB0".dup; > auto b = a; > toLowerInPlace(a); > > //Now: > //a == "\x61\x69" > //b == "\x61\x69\xB0" Oops: Trailing code unit :/ > > Sorry. I belive we now should have solid treatment of toUpper/toLower (pending a bugfix in the works). For me this gives now: //a == 61 69 CC 87 //b == 61 C4 B0 i.e. toLowerInPlace fails to do this in place because resulting length increases. Should it try to extend if a.capacity allows it? (I bet it has about 15 bytes of storage)

Comment #5 by bearophile_hugs — 2013-09-21T14:13:19Z

(In reply to comment #2) > It seems to me that, overall, toLowerInPlace is a function that is broken, that > cannot respect the specs it promises, and violates the principal of least > surprise in regards to behavior. > > I think it should either be tagged with a massive red unsafe, or deprecated. Even if it's impossible for Unicode strings, I'd like to keep a version of it for just arrays of ASCII chars, that is a common enough use case.

Comment #6 by dmitry.olsh — 2013-09-21T14:26:34Z

(In reply to comment #5) > (In reply to comment #2) > > > It seems to me that, overall, toLowerInPlace is a function that is broken, that > > cannot respect the specs it promises, and violates the principal of least > > surprise in regards to behavior. > > > > I think it should either be tagged with a massive red unsafe, or deprecated. > > Even if it's impossible for Unicode strings, I'd like to keep a version of it > for just arrays of ASCII chars, that is a common enough use case. It also works quite well for UTF-16. And it does now. And I would have kept it if only because of backwards compatibility perspective. There is no other primitive yet, but a version(s) with output range for all string transformations is something to look forward to. The only question is whether or not should this function try to use extra capacity beyond the length if it's present (that would make InPlace suffix look saner).

Comment #7 by devw0rp — 2013-09-21T14:35:51Z

It may be worth considering presenting only what is possible, so offer toUpperInPlace or toLowerInPlace only on ASCII data. As you say, Unicode strings change in byte length dramatically, but the ASCII range stays the same. So I think it would be worth offering functions with different names, or perhaps with locale template arguments. (Locale.ascii, or whatever.) Just so there can be a guarantee of something never allocating when you want to avoid that, but still let you get at a "almost there" function when you want that.

Comment #8 by jack — 2016-06-03T16:15:34Z

At some point this was noted in the docs and fixed to only allocate if it runs out of space in the array. I'm closing this as fixed.