Bug 13532 – std.regex performance (enums; regex vs ctRegex)

Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P5
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2014-09-26T02:02:36Z
Last change time
2017-12-18T22:54:51Z
Keywords
performance
Assigned to
Dmitry Olshansky
Creator
Vladimir Panteleev
See also
https://issues.dlang.org/show_bug.cgi?id=16457

Comments

Comment #0 by dlang-bugzilla — 2014-09-26T02:02:36Z
I noticed something strange after accidentally introducing a performance regression in a program using std.regex. Benchmark program: /////////////////////////////////////////// import std.algorithm; import std.array; import std.conv; import std.datetime; import std.file; import std.regex; import std.stdio; import std.string; enum expr = `;.*`; enum repl = ""; enum fn = `alice30.txt`; enum N = 5000; string[] lines; void regexInline() { lines .map!(line => line .replaceAll(regex(expr), repl) ) .array ; } void regexAuto() { auto r = regex(expr); lines .map!(line => line .replaceAll(r, repl) ) .array ; } void regexStatic() { static r = regex(expr); lines .map!(line => line .replaceAll(r, repl) ) .array ; } void regexEnum() { enum r = regex(expr); lines .map!(line => line .replaceAll(r, repl) ) .array ; } void ctRegexInline() { lines .map!(line => line .replaceAll(ctRegex!expr, repl) ) .array ; } void ctRegexAuto() { auto r = ctRegex!expr; lines .map!(line => line .replaceAll(r, repl) ) .array ; } void ctRegexStatic() { static r = ctRegex!expr; lines .map!(line => line .replaceAll(r, repl) ) .array ; } void ctRegexEnum() { enum r = ctRegex!expr; lines .map!(line => line .replaceAll(r, repl) ) .array ; } Regex!char re(string pattern)() { static Regex!char r; if (r.empty) r = regex(pattern); return r; } void reInline() { lines .map!(line => line .replaceAll(re!expr, repl) ) .array ; } alias funcs = TypeTuple!( regexInline, regexAuto, regexStatic, regexEnum, ctRegexInline, ctRegexAuto, ctRegexStatic, ctRegexEnum, reInline, ); void main() { auto text = cast(string)read(fn); lines = text.splitLines(); auto results = benchmark!funcs(N); foreach (i, func; funcs) writeln( __traits(identifier, func), "\t", to!Duration(results[i]), ); } /////////////////////////////////////////// Here are my results: regexInline 10 secs, 174 ms, 254 μs, and 2 hnsecs regexAuto 8 secs, 249 ms, 92 μs, and 5 hnsecs regexStatic 8 secs, 155 ms, 231 μs, and 1 hnsec regexEnum 19 secs, 358 ms, 66 μs, and 8 hnsecs ctRegexInline 21 secs, 399 ms, 346 μs, and 5 hnsecs ctRegexAuto 10 secs, 57 ms, and 418 μs ctRegexStatic 10 secs, 66 ms, 489 μs, and 9 hnsecs ctRegexEnum 21 secs, 593 ms, 486 μs, and 9 hnsecs reInline 8 secs, 430 ms, 852 μs, and 3 hnsecs The first surprise for me was that declaring a regex object (either Regex or StaticRegex) with "enum" was so much slower. It makes sense now that I think about it: creating a struct literal inside a loop will be more expensive than referencing one already residing somewhere in memory. Perhaps it might be worth mentioning in the documentation to avoid using enum with compiled regexes. The second surprise was that ctRegex was slower than regular regex, although the difference is not significative. I don't know whether this needs any action, feel free to WONTFIX.
Comment #1 by hsteoh — 2014-09-26T03:50:32Z
ctRegex is slower than regular regex?! Whoa. That just sounds completely wrong. What's the cause of this slowdown? I thought the whole point of ctRegex is to outperform runtime regex by making use of compile-time optimization. Whatever happened to that?? If this is the case, we might as well throw ctRegex away.
Comment #2 by dlang-bugzilla — 2014-09-26T03:53:46Z
Well, it's slower for this particular case, not necessarily in general. CCing Dmitry.
Comment #3 by dmitry.olsh — 2014-09-27T12:55:02Z
(In reply to Vladimir Panteleev from comment #0) > The first surprise for me was that declaring a regex object (either Regex or > StaticRegex) with "enum" was so much slower. It makes sense now that I think > about it: creating a struct literal inside a loop will be more expensive than > referencing one already residing somewhere in memory. Perhaps it might be > worth mentioning in the documentation to avoid using enum with compiled regexes. It's a common anti-pattern, it's the same issue with array literals, it's the same issue with anything that takes some time to compute or allocates. regex function call does both. It's worth adding a note though, fell free t create a pull. I'm not sure I'll get to it soon. (In reply to Vladimir Panteleev from comment #2) > Well, it's slower for this particular case, not necessarily in general. > CCing Dmitry. That's right. Problem is simple backtracking engine of CTFE version which is an unfortunate historical point as I'd pick the other engine of the two if I could go back in time. (In reply to hsteoh from comment #1) > ctRegex is slower than regular regex?! Whoa. That just sounds completely > wrong. What's the cause of this slowdown? I thought the whole point of > ctRegex is to outperform runtime regex by making use of compile-time > optimization. Whatever happened to that?? If this is the case, we might as > well throw ctRegex away. I'm fully aware of this. Unfortunately adding yet another engine (C-T "robust" engine) is increasingly a maintenace disaster. Consider also that working on compile-time generated regex is a nightmare of ~5-10 minutes to run all tests and constant out of memory conditions. Duplicating the amount of work done at CTFE is something DMD CAN'T handle at the moment. Another problem is regex accumulated a lot of technical debt, and needs a serious amount of refactoring before pling up more stuff. Then with modular design (the one I roughly outlined in my talk) we can put more and more components into it. Sadly all of this goes very slooowly.
Comment #4 by dmitry.olsh — 2016-04-06T12:55:25Z
(In reply to Vladimir Panteleev from comment #2) > Well, it's slower for this particular case, not necessarily in general. > CCing Dmitry. Please try with https://github.com/D-Programming-Language/phobos/pull/4147
Comment #5 by dlang-bugzilla — 2016-05-09T00:52:54Z
(In reply to Dmitry Olshansky from comment #4) > (In reply to Vladimir Panteleev from comment #2) > > Well, it's slower for this particular case, not necessarily in general. > > CCing Dmitry. > > Please try with https://github.com/D-Programming-Language/phobos/pull/4147 As before (all changes are at most 20% since 2014).
Comment #6 by dlang-bugzilla — 2017-07-02T13:40:41Z
2017 timings with LDC 1.2.0 (DMD v2.072.2, LLVM 4.0.0): regexInline 7 secs, 342 ms, 775 μs, and 9 hnsecs regexAuto 5 secs, 195 ms, and 526 μs regexStatic 5 secs, 158 ms, 479 μs, and 2 hnsecs regexEnum 18 secs, 777 ms, 420 μs, and 7 hnsecs ctRegexInline 20 secs, 38 ms, and 25 μs ctRegexAuto 6 secs, 16 ms, 155 μs, and 1 hnsec ctRegexStatic 5 secs, 921 ms, 572 μs, and 3 hnsecs ctRegexEnum 20 secs, 422 ms, 889 μs, and 4 hnsecs reInline 5 secs, 84 ms, 943 μs, and 1 hnsec
Comment #7 by dmitry.olsh — 2017-09-05T07:47:19Z
Comment #8 by github-bugzilla — 2017-10-16T18:16:39Z
Commits pushed to master at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/a877469f07819fa26cd12248f11fd59cbea6563a Fix issue 13532 - std.regex performance (enums; regex vs ctRegex) https://github.com/dlang/phobos/commit/ad489989ec3fac9f65f4bb9d43d2254a0b718dc7 Merge pull request #5722 from DmitryOlshansky/regex-matcher-interfaces std.regex: major internal redesign, also fixes issue 13532 merged-on-behalf-of: Andrei Alexandrescu <[email protected]>
Comment #9 by github-bugzilla — 2017-12-18T22:54:51Z
Commits pushed to stable at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/a877469f07819fa26cd12248f11fd59cbea6563a Fix issue 13532 - std.regex performance (enums; regex vs ctRegex) https://github.com/dlang/phobos/commit/ad489989ec3fac9f65f4bb9d43d2254a0b718dc7 Merge pull request #5722 from DmitryOlshansky/regex-matcher-interfaces