Bug 18600 – Regex performance enhancement for repeated matchFirst calls

Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86
OS
Mac OS X
Creation time
2018-03-12T18:20:13Z
Last change time
2018-05-14T14:47:05Z
Assigned to
No Owner
Creator
Jon Degenhardt

Comments

Comment #0 by jrdemail2000-dlang — 2018-03-12T18:20:13Z
This is a follow-up to https://issues.dlang.org/show_bug.cgi?id=18114, which identified a regex performance regression introduced in the 2.078 release. A number of improvements in 2.079 cut the performance regression meaningfuly enough to close the issue, but a meaningful delta remained, from 10-25% in my tests, depending on the regex issued. This indicates headroom for improvement exists. Subsequently, Dmitry Olshansky identified another performance opportunity. The test cases make repeated calls to matchFirst. This is the usage in eBay's TSV Utilities where this was identified. In this usage, a regex match is made against every line of input in turn. Each call to matchFirst instantiates a new version of the regex engine, a relatively expensive operation. Caching and reusing the engine on subsequent calls would be faster. This optimization has been available for a while, however, an increase in instantiation cost may have contributed to the performance regression identified in 2.078. In Dmitry's tests, caching the regex engine showed approximately 3x improvement. This would be an improvement significantly beyond the regression introduced in 2.078. My performance tests from issue 18114 (see issue 18114 for test details): | Regex | 2.077.1 | 2.078.0-beta1 | 2.079.0 | |-----------------------+--------------+---------------+--------------| | 'abc' | 3819.314 ms | 6202.892 ms | 5077.937 ms | | '(aa)+(cc)+g' | 5457.678 ms | 8209.269 ms | 6672.057 ms | | '(aa|ax).+[gxb][cyw]' | 10121.181 ms | 12448.443 ms | 11254.978 ms | A material improvement over 2.078.0-beta1, but still 10-25% off 2.077.1, depending on the regex. Dimitry's tests caching the engine, using the second regex: # Before ▶ ./find_regex '(aa)+(cc)+g' strings.txt 106437 matches; 5538.671 milliseconds # After ▶ ./find_regex '(aa)+(cc)+g' strings.txt 106437 matches; 1880.550 milliseconds This would be approximately a 3x improvement.
Comment #1 by github-bugzilla — 2018-03-22T01:21:32Z
Commit pushed to master at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/4318073f40ae3e82ac1847da5e037ab2f091d6fc Fix issue 18600 Revamp std.regex caching for matchFirst case
Comment #2 by jrdemail2000-dlang — 2018-05-14T14:47:05Z
Phobos PR 6268 was included in LDC 1.10.0-beta1. For this release the standard benchmark I used for the TSV Utilities improved as follows: LDC 1.7.0 (before regression): 8.37 seconds LDC 1.8.0 (after regression): 10.01 seconds LDC 1.9.0 (first fixes): 9.44 seconds LDC 1.10.0-beta1 (Phobos PR 6268): 5.85 seconds First fixes: Phobos PR 5981, DMD PR 7599 The benchmark test used reads a TSV file line-by-line and checks individual fields for regex matches. A significant amount of processing time is IO, so the percentage gain on the regex portion is higher than the overall gain. The overall gain from LDC 1.7.0 is 30%. Test was run on MacOS, MacMini with 16GB RAM, SSD drives. The file used was 2.7GB, 14 million lines. Test info can be found here: https://github.com/eBay/tsv-utils-dlang/blob/master/docs/ComparativeBenchmarks2018.md Great result!