← Back to index | Original Bugzilla link

Bug 11229 – std.string.toLower is slow

Status: RESOLVED
Resolution: FIXED
Severity: enhancement
Priority: P2
Component: phobos
Product: D
Version: D2
Platform: x86
OS: Windows
Creation time: 2013-10-11T16:56:00Z
Last change time: 2016-05-10T18:23:48Z
Keywords: performance
Assigned to: nobody
Creator: bearophile_hugs

Comments

Comment #0 by bearophile_hugs — 2013-10-11T16:56:53Z

As test text I have used the Gutenberg "Pride and Prejudice": http://www.gutenberg.org/cache/epub/1342/pg1342.txt The simple D program: void main() { import std.string: toLower; import std.file: read; (cast(string)"pg1342.txt".read).toLower; } A similar Python2.6 program: def main(): open("pg1342.txt").read().lower() main() The Python code runs (including starting the interpreter) in about 0.07 seconds on my Core2 PC. The D version compiled with dmd 2.064 with -O -release -inline -noboundscheck runs in about 0.30 seconds. The two programs are not exactly equivalent because Python2.6 strings are not Unicode. But I think a fast path for ASCII or near-ASCII text inside std.string.toLower could bring its performance to something similar the Python performance. (A possible alternative solution is to introduce a function similar to toLower function in the std.ascii module.) Note that this program: void main() { import std.ascii: toLower; import std.file: read; auto txt = cast(char[])"pg1342.txt".read; foreach (ref c; txt) c = cast(char)c.toLower; } Compiled in the same way runs in about 0.03 seconds.

Comment #1 by jrdemail2000-dlang — 2016-03-10T05:58:26Z

Some comments I made in this forum post: https://forum.dlang.org/post/[email protected] Routines like to toLower and asLowerCase call functions that work for all unicode characters. I was able to create much faster versions by checking if the character was ascii, then calling either the ascii version or the more general version. Same is true for a few routines like isNumber. Some have the ascii check optimization built in, but not all. If this optimization is added, it might also be useful to add a few common combinations (or a generic solution, if that's feasible). For example, to check if a character is alpha-numeric, one currently ORs two tests from the standard library, isAlpha and isNumber. Putting in an ascii optimization check requires putting it before doing the OR, rather than inside the tests being ORed.

Comment #2 by jack — 2016-04-01T19:49:06Z

The single character version of toLower and isAlpha were already optimized for ASCII, but isNumber wasn't: https://github.com/D-Programming-Language/phobos/pull/4137

Comment #3 by jrdemail2000-dlang — 2016-04-03T21:34:20Z

Thanks for addressing this. The same treatment could probably be applied to many of the 'is' routines in std.uni, but the other one that seem likely to be commonly used is 'isPunctuation'. 'isWhite' does something I'm not following, not obvious if it would benefit. And a combined isAlphaNumeric would be useful.

Comment #4 by jack — 2016-04-04T22:53:00Z

Part 2: https://github.com/D-Programming-Language/phobos/pull/4150

Comment #5 by jack — 2016-04-04T23:05:01Z

isWhite and isSpace seem to already be optimized for ASCII

Comment #6 by jack — 2016-04-27T18:26:18Z

This, (cast(string)"pg1342.txt".read).map!toLower.array is almost three times faster than (cast(string)"pg1342.txt".read).toLower because of the ASCII optimizations in dchar toLower(dchar c)

Comment #7 by jrdemail2000-dlang — 2016-05-08T20:26:17Z

Similar to the previous comment, I tried an alternate implementation for std.uni.asLowerCase using map with std.uni.toLower (the single character version). The single character toLower already has the ascii check optimization. Timing with LDC was improved in all cases. 2.5-3x for latin languages, 1.5-2x for Japanese and Chinese. For DMD the improvements were 5x-20x, depending on whether depending on latin vs asian text and whether decoding to dchar was included or not. Program used to do timing is here: https://dpaste.dzfl.pl/a0e2fa1c71fd Texts used for timing were books in several languages found on the Project Gutenberg site (http://www.gutenberg.org/), with the boilerplate text removed. Latin languages tested were in English, Finnish, German, Spanish. Timing was done on OSX; DMD 2.071 (-release -O -boundscheck=off -inline); LDC 1.0.0-beta1 (Phobos 2.070.2; -release -O -boundscheck=off). That the Japanese and Chinese language docs showed improvement suggests the map + toLower was faster than std.uni.asLowerCase apart from the ascii optimization. Texts used for these had only about 3% ascii characters, so the optimization was rarely used. The replacement for asLowerCase I used is: auto mapAsLowerCase(Range)(Range str) if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && !isConvertibleToString!Range) { static if (ElementEncodingType!Range.sizeof < dchar.sizeof) { import std.utf : byDchar; return str.byDchar.mapAsLowerCase; } else { import std.algorithm : map; import std.uni : toLower; return str.map!(x => x.toLower); } }

Comment #8 by jack — 2016-05-09T18:46:17Z

(In reply to Jon Degenhardt from comment #7) > auto mapAsLowerCase(Range)(Range str) > if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && > !isConvertibleToString!Range) > { > static if (ElementEncodingType!Range.sizeof < dchar.sizeof) > { > import std.utf : byDchar; > return str.byDchar.mapAsLowerCase; > } > else > { > import std.algorithm : map; > import std.uni : toLower; > > return str.map!(x => x.toLower); > } > } I attempted to replace asLowerCase with this, but it's blocked by https://issues.dlang.org/show_bug.cgi?id=16005

Comment #9 by jrdemail2000-dlang — 2016-05-09T22:25:05Z

(In reply to Jack Stouffer from comment #8) > (In reply to Jon Degenhardt from comment #7) > > auto mapAsLowerCase(Range)(Range str) > > if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && > > !isConvertibleToString!Range) > > { > > static if (ElementEncodingType!Range.sizeof < dchar.sizeof) > > { > > import std.utf : byDchar; > > return str.byDchar.mapAsLowerCase; > > } > > else > > { > > import std.algorithm : map; > > import std.uni : toLower; > > > > return str.map!(x => x.toLower); > > } > > } > > I attempted to replace asLowerCase with this, but it's blocked by > https://issues.dlang.org/show_bug.cgi?id=16005 The version I wrote was reasonable for performance analysis, but it is not fully consistent functionally with the current version of asLowerCase. The current version of asLowerCase does "full case folding", which means the character length may expand. The version I wrote (and the single character version of toLower it calls) do "simple case folding", where the character length may expand. An example is "Latin Capital Letter I With Dot Above" (İ, u+130). In simple case folding, it becomes "Latin small letter I" (the ascii lower case letter I, u+0069). In full case folding, it becomes the character sequence [0069 0307] (lower case I followed by 'combining dot above). Both case folding approaches have valid uses. In retrospect, what my analysis did not do is differentiate the cost of the ascii check from the cost of full-case folding. One reasonable question is if the ascii check can be incorporated into std.uni.toCaser. If so, that would preserve the current full case-folding functionality. There still might be value to having a simple-case folding version as well, but that could be treated as a separate topic.

Comment #10 by jrdemail2000-dlang — 2016-05-09T22:30:33Z

(In reply to Jon Degenhardt from comment #9) > The version I wrote (and the single character version of toLower it > calls) do "simple case folding", where the character length may expand. > Sorry, that should have read: "simple case folding", where the character length may *not* expand.

Comment #11 by github-bugzilla — 2016-05-10T12:12:21Z

Commit pushed to master at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/5e40a7553a49cf9b4ffbf4fb3b0ba81b2a985498 Fix Issue 11229: Add ASCII case for string versions of std.uni.toLower/toUpper

Comment #12 by jrdemail2000-dlang — 2016-05-10T18:23:48Z

Ascii check optimization for asLowerCase and asUpperCase in this commit: https://github.com/dlang/phobos/commit/f257e8330e5babbf5c74325dc32a2886bd181f02