Bug 18600 – Regex performance enhancement for repeated matchFirst calls
Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86
OS
Mac OS X
Creation time
2018-03-12T18:20:13Z
Last change time
2018-05-14T14:47:05Z
Assigned to
No Owner
Creator
Jon Degenhardt
Comments
Comment #0 by jrdemail2000-dlang — 2018-03-12T18:20:13Z
This is a follow-up to https://issues.dlang.org/show_bug.cgi?id=18114, which identified a regex performance regression introduced in the 2.078 release. A number of improvements in 2.079 cut the performance regression meaningfuly enough to close the issue, but a meaningful delta remained, from 10-25% in my tests, depending on the regex issued. This indicates headroom for improvement exists.
Subsequently, Dmitry Olshansky identified another performance opportunity. The test cases make repeated calls to matchFirst. This is the usage in eBay's TSV Utilities where this was identified. In this usage, a regex match is made against every line of input in turn.
Each call to matchFirst instantiates a new version of the regex engine, a relatively expensive operation. Caching and reusing the engine on subsequent calls would be faster. This optimization has been available for a while, however, an increase in instantiation cost may have contributed to the performance regression identified in 2.078.
In Dmitry's tests, caching the regex engine showed approximately 3x improvement. This would be an improvement significantly beyond the regression introduced in 2.078.
My performance tests from issue 18114 (see issue 18114 for test details):
| Regex | 2.077.1 | 2.078.0-beta1 | 2.079.0 |
|-----------------------+--------------+---------------+--------------|
| 'abc' | 3819.314 ms | 6202.892 ms | 5077.937 ms |
| '(aa)+(cc)+g' | 5457.678 ms | 8209.269 ms | 6672.057 ms |
| '(aa|ax).+[gxb][cyw]' | 10121.181 ms | 12448.443 ms | 11254.978 ms |
A material improvement over 2.078.0-beta1, but still 10-25% off 2.077.1, depending on the regex.
Dimitry's tests caching the engine, using the second regex:
# Before
▶ ./find_regex '(aa)+(cc)+g' strings.txt
106437 matches; 5538.671 milliseconds
# After
▶ ./find_regex '(aa)+(cc)+g' strings.txt
106437 matches; 1880.550 milliseconds
This would be approximately a 3x improvement.
Comment #1 by github-bugzilla — 2018-03-22T01:21:32Z
Comment #2 by jrdemail2000-dlang — 2018-05-14T14:47:05Z
Phobos PR 6268 was included in LDC 1.10.0-beta1. For this release the standard benchmark I used for the TSV Utilities improved as follows:
LDC 1.7.0 (before regression): 8.37 seconds
LDC 1.8.0 (after regression): 10.01 seconds
LDC 1.9.0 (first fixes): 9.44 seconds
LDC 1.10.0-beta1 (Phobos PR 6268): 5.85 seconds
First fixes: Phobos PR 5981, DMD PR 7599
The benchmark test used reads a TSV file line-by-line and checks individual fields for regex matches. A significant amount of processing time is IO, so the percentage gain on the regex portion is higher than the overall gain. The overall gain from LDC 1.7.0 is 30%.
Test was run on MacOS, MacMini with 16GB RAM, SSD drives. The file used was
2.7GB, 14 million lines. Test info can be found here:
https://github.com/eBay/tsv-utils-dlang/blob/master/docs/ComparativeBenchmarks2018.md
Great result!