Bug 12690 – std.regex BacktrackingMatcher bmatch is faster than ThompsonMatcher but discouraged

Status
NEW
Severity
enhancement
Priority
P4
Component
phobos
Product
D
Version
D2
Platform
x86_64
OS
Linux
Creation time
2014-05-02T15:05:08Z
Last change time
2024-12-01T16:21:06Z
Assigned to
Dmitry Olshansky
Creator
Diez
Moved to GitHub: phobos#9632 →

Attachments

IDFilenameSummaryContent-TypeSize
1348dvt4.dSource code for test casetext/x-dsrc1128
1351dvt6.dSource code for more complex test case (UTF-8)text/x-dsrc1779

Comments

Comment #0 by yxcvbasdfgqwert02 — 2014-05-02T15:05:08Z
Created attachment 1348 Source code for test case The discouraged function std.regex.bmatch is significantly faster then std.regex.match. The attached test case already shows the performance difference with a simple expression and a small haystack. In my real-world program, both are way more complex and it's many expressions against many haystacks. Most expressions match, only a few of them not. The performance difference is a knock-out criterion. Total real-world program running time: Using bmatch: 5 seconds Using match: 47 seconds All benchmarks with dmd 2.065 on Ubuntu 64 bit, compiled with -release. Unfortunately I'm not allowed to post the source code or haystack of the real-world application, so the small test case has to be enough.
Comment #1 by dmitry.olsh — 2014-05-02T20:51:19Z
For the test case in question on Win32 I get: //With bmatch D:\D>dvt4a Time=3 secs, 802 ms, 707 μs, and 7 hnsecs, matches per second=1577823 //With match D:\D>dvt4b Time=4 secs, 274 ms, 43 μs, and 7 hnsecs, matches per second=1403822 Which is around 10%. Note that proper flags for optimized build are: dmd -O -release -inline On ldc proper flags (IIRC) are: ldc -O3 -release --- I would not even consider result w/o inlining as showing anything. Please compile your real-world program with the above flags. --- On the test and results of such. The provided test measures the overhead of starting an engine (memory allocation/deallocation eats ~25% of total time in both cases). Overall indeed BT has a bit less of work to do at start-up. Generally it is known that BT matchers do have an advantage on patterns that tend to match, the problem with them is low stability (small change in pattern may change performance radically) and bad performance on ambiguous patterns. I imagine the problem you are solving involves a number of patterns that is close to the number of haystacks. Even then a 10% slower start-up is no more then ~1.21% of total time. P.S. There is an enhancement that may be of interest: https://issues.dlang.org/show_bug.cgi?id=12227
Comment #2 by yxcvbasdfgqwert02 — 2014-05-04T10:20:21Z
Created attachment 1351 Source code for more complex test case (UTF-8) Thank you for your help. I verified the compilation options of my real-world program and in fact the -inline option was missing (-O was already set). However, it didn't change anything at the overall performance, it's still 5 seconds (backtracking) versus 47 seconds (Thompson). With this magnitude in performance difference there's no chance to switch to the Thompson matcher. The attached source is almost the same as before, but with average expression and haystack from my real-world program (confidential informations scrambled). Now it's not a small performance difference, but a factor of more than 5. As I understand the meaning of the word "discouraged", the backtracking matcher is to be removed in some future. Please don't. As long as the Thompson matcher has this poor performance (for certain use cases), there's need for both matchers. In fact, using the Thompson matcher is way slower than an equivalent Java 1.7 program, while using the backtracking matcher is slightly faster. Of course, optimally there's only one matcher which combines the advantages of all underlaying matchers. In my real-world program, there's about 100 times more haystacks than patterns (940 vs. 112_000). As soon as a pattern matches, the next haystack is examined. Each haystack needed about 23 tries (avg) until a pattern matched, actually there were 2_584_925 match calls in total of which 111_675 matched the haystack (a "few" haystacks didn't match any pattern). Note that the attached test case source file needs to be saved in UTF-8 encoding, since the regular expression string contains UTF-8 characters (hopefully, they don't get destroyed during upload into the issue tracking system). The UTF-8 signature comment line should help Windows editors to automatically switch to UTF-8 encoding when opening the source file (like an UTF-8 BOM). The idea of matching multiple patterns with only one call of the matcher is very interesting (though I didn't test it yet). Consider a 5 times faster matcher with 5 matches at once ...
Comment #3 by dmitry.olsh — 2014-05-04T11:47:08Z
(In reply to Diez from comment #2) > Created attachment 1351 [details] > Source code for more complex test case (UTF-8) > > Thank you for your help. I verified the compilation options of my real-world > program and in fact the -inline option was missing (-O was already set). > However, it didn't change anything at the overall performance, it's still 5 > seconds (backtracking) versus 47 seconds (Thompson). With this magnitude in > performance difference there's no chance to switch to the Thompson matcher. > > The attached source is almost the same as before, but with average > expression and haystack from my real-world program (confidential > informations scrambled). Now it's not a small performance difference, but a > factor of more than 5. > I see. The fact that -inline doesn't change a thing is troublesome but lets leave that to compiler guys. The new pattern makes things much clearer - as I suspected the preparation stage of the Thompson engine is the one to blame. Preparing an internal freelist alone takes ~1/3 of total run-time in this case. The figures are much higher this time about x3 difference. I have a hunch that it allocates much more then needed in this case, might be a bug in counting up repetitions. Need to investigate further. > As I understand the meaning of the word "discouraged", the backtracking > matcher is to be removed in some future. Please don't. As long as the > Thompson matcher has this poor performance (for certain use cases), there's > need for both matchers. In fact, using the Thompson matcher is way slower > than an equivalent Java 1.7 program, while using the backtracking matcher is > slightly faster. Of course, optimally there's only one matcher which > combines the advantages of all underlaying matchers. Discouraged means exactly what you say - picking the type of underlying matcher should be done internally based on pattern properties (or read that as one meta-matcher). It occurred to me that - There are more types of matchers, some are very limited in what they can. Thus should't push it on user to understand when he/she can use them. - More matchers - more functions doesn't scale. - Also replace functions have the same NxM combinatorics (n methods, m functions), there is also splitter. I envision that when std.regex turns package, it will gradually expose internals for those who wants to make customize/their own matcher. > In my real-world program, there's about 100 times more haystacks than > patterns (940 vs. 112_000). As soon as a pattern matches, the next haystack > is examined. Each haystack needed about 23 tries (avg) until a pattern > matched, actually there were 2_584_925 match calls in total of which 111_675 > matched the haystack (a "few" haystacks didn't match any pattern). > > Note that the attached test case source file needs to be saved in UTF-8 > encoding, since the regular expression string contains UTF-8 characters > (hopefully, they don't get destroyed during upload into the issue tracking > system). The UTF-8 signature comment line should help Windows editors to > automatically switch to UTF-8 encoding when opening the source file (like an > UTF-8 BOM). > > The idea of matching multiple patterns with only one call of the matcher is > very interesting (though I didn't test it yet). Well, it's not yet implemented ;) But I expect to get to it done somewhere in May. > Consider a 5 times faster > matcher with 5 matches at once ...
Comment #4 by robert.schadek — 2024-12-01T16:21:06Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/phobos/issues/9632 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB