Bug 11017 – std.string/uni.toLower is very slow

Status: RESOLVED
Resolution: FIXED
Severity: normal
Priority: P2
Component: phobos
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2013-09-12T10:52:00Z
Last change time: 2014-02-25T13:03:52Z
Assigned to: nobody
Creator: peter.alexander.au

Comments

Comment #0 by peter.alexander.au — 2013-09-12T10:52:33Z
char[] s = new char[10_000_000];
s[] = 'A';
auto s2 = s.toLower;

This takes 4.3 seconds on my machine.

char[] s = new char[10_000_000];
s[] = 'A';
auto s2 = s.map!toLower.to!string;

This only takes 1.1 seconds.

Looking at the code for std.uni.toLower, it appears the string is constructed using repeated ~=. It should use an Appender of some sort.
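The difference between the two construction strategies can be sketched as follows. This is a minimal illustration, not the Phobos implementation: `lowerWithConcat` and `lowerWithAppender` are hypothetical helper names, and both apply only the simple one-to-one `std.uni.toLower(dchar)` mapping per code point.

```d
import std.array : appender;
import std.uni : toLower;

// Hypothetical helper: builds the result with repeated ~=.
string lowerWithConcat(const(char)[] s)
{
    string result;
    foreach (dchar c; s)
        result ~= toLower(c); // appending a dchar encodes it as UTF-8
    return result;
}

// Hypothetical helper: same mapping, collected through an Appender.
string lowerWithAppender(const(char)[] s)
{
    auto app = appender!string();
    foreach (dchar c; s)
        app.put(toLower(c)); // Appender encodes the dchar as UTF-8
    return app.data;
}

void main()
{
    auto s = new char[1_000];
    s[] = 'A';
    // Both strategies must produce the same string; only the cost differs.
    assert(lowerWithConcat(s) == lowerWithAppender(s));
}
```

The two helpers are meant to be timed against each other on a large input; no timings are claimed here beyond those reported above.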
Comment #1 by dmitry.olsh — 2013-09-12T11:59:08Z
(In reply to comment #0)
> char[] s = new char[10_000_000];
> s[] = 'A';
> auto s2 = s.toLower;
>
> This takes 4.3 seconds on my machine.
>
>
> char[] s = new char[10_000_000];
> s[] = 'A';
> auto s2 = s.map!toLower.to!string;
>
> This only takes 1.1 seconds.
>

There are 2 things to consider here. First, the 2nd one is not correct in general (1 code point can map to many, e.g. German sharp S).

> Looking at the code for std.uni.toLower, it appears the string is constructed
> using repeated ~=. It should use an Appender of some sort.

This could indeed be fixed. I suspect that putting an optimistic reserve(original.length) there would work even better.

See also issue 10864: http://d.puremagic.com/issues/show_bug.cgi?id=10864
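A rough sketch of the reserve idea combined with a one-to-many mapping might look like this. It is an assumption-laden illustration rather than the Phobos code: `upperWithReserve` is a hypothetical helper, and the single hard-coded expansion (ß to "SS") stands in for a full Unicode special-casing table.

```d
import std.array : appender;
import std.uni : toUpper;

// Hypothetical helper, not the Phobos implementation: uppercases a string,
// reserving the input length up front while still letting the output grow.
string upperWithReserve(const(char)[] s)
{
    auto app = appender!string();
    app.reserve(s.length); // optimistic guess: output is about as long as input
    foreach (dchar c; s)
    {
        if (c == '\u00DF')
            app.put("SS");       // one code point maps to two (illustrative special case)
        else
            app.put(toUpper(c)); // simple one-to-one mapping
    }
    return app.data;
}

void main()
{
    assert(upperWithReserve("straße") == "STRASSE");
    assert(upperWithReserve("abc") == "ABC");
}
```

The reserve is only a hint: when a mapping expands, the Appender grows past the reserved capacity, so correctness does not depend on the guess.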
Comment #2 by peter.alexander.au — 2013-09-12T12:45:45Z
(In reply to comment #1)
> There are 2 things to consider here. First, the 2nd one is not correct in general
> (1 code point can map to many, e.g. German sharp S).

Good point, although std.uni.toUpper doesn't handle it either :-)

assert("ß".toUpper == "ß"); // passes
Comment #3 by dmitry.olsh — 2013-09-12T12:50:37Z
(In reply to comment #2)
> (In reply to comment #1)
> > There are 2 things to consider here. First, the 2nd one is not correct in general
> > (1 code point can map to many, e.g. German sharp S).
>
> Good point, although std.uni.toUpper doesn't handle it either :-)
>
> assert("ß".toUpper == "ß"); // passes

toLower will do. Sharp S is capital ;)
Comment #4 by peter.alexander.au — 2013-09-12T12:52:31Z
(In reply to comment #3)
> toLower will do. Sharp S is capital ;)

assert("ß".toLower == "ß");
assert("ß".toUpper == "ß");

Both pass.
Comment #5 by dmitry.olsh — 2013-09-12T14:01:05Z
(In reply to comment #4)
> (In reply to comment #3)
> > toLower will do. Sharp S is capital ;)
>
> assert("ß".toLower == "ß");
> assert("ß".toUpper == "ß");
>
> Both pass.

Something wicked has happened.
I see that I've messed up toUpper in the table generator while introducing toTitleCase (which isn't even exposed yet!). toLower is fine; toUpper is apparently broken in half of the cases.
How I missed that, I've no idea ... gotta expand the test coverage around toLower/toUpper.
Comment #6 by dmitry.olsh — 2013-09-12T14:07:17Z
(In reply to comment #5)
> (In reply to comment #4)
> > (In reply to comment #3)
> > > toLower will do. Sharp S is capital ;)
> >
> > assert("ß".toLower == "ß");
> > assert("ß".toUpper == "ß");
> >
> > Both pass.
>
> Something wicked has happened.
> I see that I've messed up toUpper in the table generator while introducing
> toTitleCase (which isn't even exposed yet!). toLower is fine; toUpper is
> apparently broken in half of the cases.
> How I missed that, I've no idea ... gotta expand the test coverage around
> toLower/toUpper.

P.S. And there are both kinds of sharp s ... \u1E9E and \u00df
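For reference, the two code points can be told apart with the classification functions in std.uni. This is a small sketch that only checks their case properties; the variable names are illustrative, and it makes no claim about what the string-level toLower/toUpper return for them.

```d
import std.uni : isLower, isUpper;

void main()
{
    dchar capitalSharpS = '\u1E9E'; // LATIN CAPITAL LETTER SHARP S
    dchar smallSharpS = '\u00DF';   // LATIN SMALL LETTER SHARP S

    assert(isUpper(capitalSharpS)); // U+1E9E is an uppercase letter
    assert(isLower(smallSharpS));   // U+00DF is a lowercase letter
    assert(capitalSharpS != smallSharpS);
}
```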
Comment #7 by peter.alexander.au — 2014-02-22T12:25:47Z