Comment #0 by bearophile_hugs — 2013-10-11T16:56:53Z
As test text I have used the Gutenberg "Pride and Prejudice":
http://www.gutenberg.org/cache/epub/1342/pg1342.txt
The simple D program:
void main() {
import std.string: toLower;
import std.file: read;
(cast(string)"pg1342.txt".read).toLower;
}
A similar Python2.6 program:
def main():
open("pg1342.txt").read().lower()
main()
The Python code runs (including starting the interpreter) in about 0.07 seconds on my Core2 PC.
The D version compiled with dmd 2.064 with -O -release -inline -noboundscheck runs in about 0.30 seconds.
The two programs are not exactly equivalent because Python2.6 strings are not Unicode. But I think a fast path for ASCII or near-ASCII text inside std.string.toLower could bring its performance to something similar the Python performance.
(A possible alternative solution is to introduce a function similar to toLower function in the std.ascii module.)
Note that this program:
void main() {
import std.ascii: toLower;
import std.file: read;
auto txt = cast(char[])"pg1342.txt".read;
foreach (ref c; txt)
c = cast(char)c.toLower;
}
Compiled in the same way runs in about 0.03 seconds.
Comment #1 by jrdemail2000-dlang — 2016-03-10T05:58:26Z
Some comments I made in this forum post: https://forum.dlang.org/post/[email protected]
Routines like to toLower and asLowerCase call functions that work for all unicode characters. I was able to create much faster versions by checking if the character was ascii, then calling either the ascii version or the more general version. Same is true for a few routines like isNumber. Some have the ascii check optimization built in, but not all.
If this optimization is added, it might also be useful to add a few common combinations (or a generic solution, if that's feasible). For example, to check if a character is alpha-numeric, one currently ORs two tests from the standard library, isAlpha and isNumber. Putting in an ascii optimization check requires putting it before doing the OR, rather than inside the tests being ORed.
Comment #3 by jrdemail2000-dlang — 2016-04-03T21:34:20Z
Thanks for addressing this.
The same treatment could probably be applied to many of the 'is' routines in std.uni, but the other one that seem likely to be commonly used is 'isPunctuation'. 'isWhite' does something I'm not following, not obvious if it would benefit. And a combined isAlphaNumeric would be useful.
isWhite and isSpace seem to already be optimized for ASCII
Comment #6 by jack — 2016-04-27T18:26:18Z
This,
(cast(string)"pg1342.txt".read).map!toLower.array
is almost three times faster than
(cast(string)"pg1342.txt".read).toLower
because of the ASCII optimizations in dchar toLower(dchar c)
Comment #7 by jrdemail2000-dlang — 2016-05-08T20:26:17Z
Similar to the previous comment, I tried an alternate implementation for std.uni.asLowerCase using map with std.uni.toLower (the single character version). The single character toLower already has the ascii check optimization.
Timing with LDC was improved in all cases. 2.5-3x for latin languages, 1.5-2x for Japanese and Chinese. For DMD the improvements were 5x-20x, depending on whether depending on latin vs asian text and whether decoding to dchar was included or not.
Program used to do timing is here: https://dpaste.dzfl.pl/a0e2fa1c71fd
Texts used for timing were books in several languages found on the Project Gutenberg site (http://www.gutenberg.org/), with the boilerplate text removed. Latin languages tested were in English, Finnish, German, Spanish.
Timing was done on OSX; DMD 2.071 (-release -O -boundscheck=off -inline); LDC 1.0.0-beta1 (Phobos 2.070.2; -release -O -boundscheck=off).
That the Japanese and Chinese language docs showed improvement suggests the map + toLower was faster than std.uni.asLowerCase apart from the ascii optimization. Texts used for these had only about 3% ascii characters, so the optimization was rarely used.
The replacement for asLowerCase I used is:
auto mapAsLowerCase(Range)(Range str)
if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) &&
!isConvertibleToString!Range)
{
static if (ElementEncodingType!Range.sizeof < dchar.sizeof)
{
import std.utf : byDchar;
return str.byDchar.mapAsLowerCase;
}
else
{
import std.algorithm : map;
import std.uni : toLower;
return str.map!(x => x.toLower);
}
}
Comment #8 by jack — 2016-05-09T18:46:17Z
(In reply to Jon Degenhardt from comment #7)
> auto mapAsLowerCase(Range)(Range str)
> if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) &&
> !isConvertibleToString!Range)
> {
> static if (ElementEncodingType!Range.sizeof < dchar.sizeof)
> {
> import std.utf : byDchar;
> return str.byDchar.mapAsLowerCase;
> }
> else
> {
> import std.algorithm : map;
> import std.uni : toLower;
>
> return str.map!(x => x.toLower);
> }
> }
I attempted to replace asLowerCase with this, but it's blocked by https://issues.dlang.org/show_bug.cgi?id=16005
Comment #9 by jrdemail2000-dlang — 2016-05-09T22:25:05Z
(In reply to Jack Stouffer from comment #8)
> (In reply to Jon Degenhardt from comment #7)
> > auto mapAsLowerCase(Range)(Range str)
> > if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) &&
> > !isConvertibleToString!Range)
> > {
> > static if (ElementEncodingType!Range.sizeof < dchar.sizeof)
> > {
> > import std.utf : byDchar;
> > return str.byDchar.mapAsLowerCase;
> > }
> > else
> > {
> > import std.algorithm : map;
> > import std.uni : toLower;
> >
> > return str.map!(x => x.toLower);
> > }
> > }
>
> I attempted to replace asLowerCase with this, but it's blocked by
> https://issues.dlang.org/show_bug.cgi?id=16005
The version I wrote was reasonable for performance analysis, but it is not fully consistent functionally with the current version of asLowerCase. The current version of asLowerCase does "full case folding", which means the character length may expand. The version I wrote (and the single character version of toLower it calls) do "simple case folding", where the character length may expand. An example is "Latin Capital Letter I With Dot Above" (İ, u+130). In simple case folding, it becomes "Latin small letter I" (the ascii lower case letter I, u+0069). In full case folding, it becomes the character sequence [0069 0307] (lower case I followed by 'combining dot above).
Both case folding approaches have valid uses. In retrospect, what my analysis did not do is differentiate the cost of the ascii check from the cost of full-case folding.
One reasonable question is if the ascii check can be incorporated into std.uni.toCaser. If so, that would preserve the current full case-folding functionality. There still might be value to having a simple-case folding version as well, but that could be treated as a separate topic.
Comment #10 by jrdemail2000-dlang — 2016-05-09T22:30:33Z
(In reply to Jon Degenhardt from comment #9)
> The version I wrote (and the single character version of toLower it
> calls) do "simple case folding", where the character length may expand.
>
Sorry, that should have read:
"simple case folding", where the character length may *not* expand.
Comment #11 by github-bugzilla — 2016-05-10T12:12:21Z