Comment #0 by Jesse.K.Phillips+D — 2012-02-09T09:27:58Z
The previous implementation is said to do some caching of the last used engine. For these timings, english.dic has 134,950 entries.
Test code
----------
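import std.file;
import std.string;
import std.datetime;
import std.regex;

private int[string] model;

void main() {
    auto name = "english.dic";

    // Build a frequency table of the lower-cased dictionary lines.
    foreach(w; std.file.readText(name).toLower.splitLines)
        model[w] += 1;

    // Both regexes are deliberately constructed on every iteration,
    // alternating patterns so a single-entry cache cannot help.
    foreach(w; std.string.split(readText(name)))
        if(!match(w, regex(r"\d")).empty)
        {}
        else if(!match(w, regex(r"\W")).empty)
        {}
}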
-------
I'm trying to avoid the caching here, but still see better performance in 2.056. Note that these timings were taken with MinGW on Windows. I find it odd that user time is actually fast while real time is the slow piece; does MinGW have access to the proper information?
$ time ./test2.056.exe
real 0m0.860s
user 0m0.047s
sys 0m0.000s
$ time ./test2.058.exe
real 0m55.500s
user 0m0.031s
sys 0m0.000s
Comment #1 by dmitry.olsh — 2012-02-24T11:14:52Z
I'm willing to investigate the issue. Can you attach the english.dic file?
Comment #2 by code — 2012-02-24T13:48:57Z
You are compiling two different regexes, so a single-entry cache will only solve part of your problem.
Comment #3 by Jesse.K.Phillips+D — 2012-02-24T18:02:05Z
The exact file isn't important, and I can't get it right now, but you could grab a similar one from http://www.winedt.org/Dict/
I realize that the example given avoids the benefit of single-entry caching, but the old implementation still performs better on it, and that is probably something to work towards.
Comment #4 by dmitry.olsh — 2012-02-26T02:22:02Z
Profiling shows that about 99% of the time is spent in the GC, ouch.
What's at work here is that the new regex engine is more costly to create and allocates a bunch of structures on the heap. The biggest of them, such as the Tries, are cached, but the others are not.
I think I'll spend some time on introducing more caching and probably seek out some GC-unfriendly stuff in the parser.
Still, I should point out that \d and \W in the new engine are Unicode-aware and correspond to MUCH broader character classes than in the previous engine. (That belongs in the ddocs somewhere.)
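To illustrate the point about Unicode-aware classes, here is a minimal sketch that checks \d against an ASCII digit, a non-ASCII decimal digit, and a letter. The helper name `digitMatches` and the sample strings are made up for illustration, and it assumes \d covers the Unicode decimal-digit category as described above.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: reports, per sample, whether \d finds a match.
bool[] digitMatches(string[] samples)
{
    auto digit = regex(r"\d"); // compiled once, reused for every sample
    bool[] results;
    foreach (s; samples)
        results ~= !match(s, digit).empty;
    return results;
}

void main()
{
    // "\u0663" is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
    writeln(digitMatches(["7", "\u0663", "x"]));
}
```

If the second entry is reported as a match, the class is wider than the ASCII-only [0-9].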
Comment #5 by dmitry.olsh — 2012-02-26T06:32:30Z
Anyway, how do 2.056 and 2.058 compare when you don't create regex objects inside a tight loop?
It is a strange thing to do under any circumstances; even with N-slot caching you pay some extra on each iteration to look up and copy out the compiled regex you need.
I'm dreaming that one day the compiler can just see it's a loop invariant and move it out for you.
Hm.. that could happen sometime soon: if 'regex' is pure and its result is immutable, the compiler would have the guarantees it needs to go ahead and optimize.
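The per-iteration cost of a cache can be pictured with a small sketch. This is not how std.regex caches internally; `cachedRegex` is a hypothetical helper, and for brevity it uses an unbounded associative array rather than a fixed number of slots.

```d
import std.regex;
import std.stdio;

// Hypothetical pattern cache, keyed by the pattern text.
Regex!(char) cachedRegex(string pattern)
{
    static Regex!(char)[string] cache; // thread-local storage

    // Every call pays for a hash lookup and a copy of the compiled regex.
    if (auto hit = pattern in cache)
        return *hit;

    auto compiled = regex(pattern); // the expensive part, done once per pattern
    cache[pattern] = compiled;
    return compiled;
}

void main()
{
    foreach (w; ["apple", "b4", "can't"])
    {
        if (!match(w, cachedRegex(r"\d")).empty)
            writeln(w, ": contains a digit");
        else if (!match(w, cachedRegex(r"\W")).empty)
            writeln(w, ": contains a non-word character");
    }
}
```

Even on a cache hit the loop still does the lookup and copy each time, which is the overhead that hoisting the regex out of the loop avoids entirely.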
Comment #6 by Jesse.K.Phillips+D — 2012-02-26T18:03:06Z
After moving the regex outside the loop (and, I think, making some other changes), performance improved immensely. Declaring the regexes as module variables didn't seem to gain any more. I didn't have time to play with it much further; the result was acceptable, though I hope to do more with regex and just need to watch out for tight loops.
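For reference, a minimal sketch of what hoisting the regexes out of the loop can look like. This is not the exact code that was timed; `countFlagged` and the sample words are made up for illustration.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: counts words containing a digit or a non-word character.
size_t countFlagged(string[] words)
{
    // Compile each pattern once, before the loop.
    auto digit = regex(r"\d");
    auto nonWord = regex(r"\W");

    size_t flagged;
    foreach (w; words)
    {
        if (!match(w, digit).empty)
            ++flagged;
        else if (!match(w, nonWord).empty)
            ++flagged;
    }
    return flagged;
}

void main()
{
    writeln(countFlagged(["apple", "b4", "can't", "dog"]));
}
```

The loop body now only runs the matchers, so the construction cost and its heap allocations are paid twice in total rather than on every word.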
Comment #7 by andy — 2015-01-25T18:47:06Z
This seems like it's resolved, so I'm closing it.
Please reopen if there is still a problem.