Bug 9646 – std.algorithm.splitter for strings has opportunities for improvement

Status
NEW
Severity
enhancement
Priority
P4
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2013-03-04T11:10:42Z
Last change time
2024-12-01T16:16:50Z
Keywords
performance, preapproved
Assigned to
No Owner
Creator
Steven Schveighoffer
Moved to GitHub: phobos#9960 →

Comments

Comment #0 by schveiguy — 2013-03-04T11:10:42Z
Based on the way std.algorithm.splitter is implemented, it should perform at least as well as a simple hand-written range that does a brute force search for separators. However, with certain test data, and a simple range, it does not perform as well. An example (take from a post on the newsgroups), along with a simple MySplitter type that can split on any substring. I have not done any in-depth research as to why this works better: import std.stdio; import std.datetime; import std.algorithm; struct MySplitter { private string s; private string separator; private string source; this(string src, string sep) { source = src; separator = sep; popFront(); } @property string front() { return s; } @property bool empty() { return s.ptr == null; } void popFront() { if(!source.length) { s = source; source = null; } else { size_t i = 0; bool found = false; outer: for(; i + separator.length <= source.length; i++) { if(source[i] == separator[0]) { found = true; for(size_t j = i+1, k=1; k < separator.length; ++j, ++k) if(source[j] != separator[k]) { found = false; continue outer; } break; } } s = source[0..i]; if(found) source = source[i + separator.length..$]; else source = source[$..$]; } } } auto message = "REGISTER sip:example.com SIP/2.0\r Content-Length: 0\r Contact: <sip:[email protected]:59788;transport=tls>;expires=4294967295;events=\"message-summaryq\";q=0.9\r To: <sip:[email protected]>\r User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r Max-Forwards: 70\r CSeq: 1 REGISTER\r Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r Call-ID: 2910497622026445\r From: <sip:[email protected]>;tag=2910497618150713\r\n\r\n"; void main() { enum iterations = 5_000_000; auto t1 = Clock.currTime(); foreach(i; 0..iterations) { foreach(notused; splitter(message, "\r\n")) { if(!i) writeln(notused); } } writefln("std.algorithm.splitter took %s", Clock.currTime()-t1); t1 = Clock.currTime(); foreach(i; 0..iterations) { foreach(notused; MySplitter(message, "\r\n")) { if(!i) writeln(notused); } } writefln("MySplitter took %s", Clock.currTime()-t1); } result (macbook pro 2.2GHz i7): REGISTER sip:example.com SIP/2.0 Content-Length: 0 Contact: <sip:[email protected]:59788;transport=tls>;expires=4294967295;events="message-summaryq";q=0.9 To: <sip:[email protected]> User-Agent: ("VENDOR=MyCompany" "My User Agent") Max-Forwards: 70 CSeq: 1 REGISTER Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690 Call-ID: 2910497622026445 From: <sip:[email protected]>;tag=2910497618150713 std.algorithm.splitter took 5 secs, 197 ms, and 89 μs REGISTER sip:example.com SIP/2.0 Content-Length: 0 Contact: <sip:[email protected]:59788;transport=tls>;expires=4294967295;events="message-summaryq";q=0.9 To: <sip:[email protected]> User-Agent: ("VENDOR=MyCompany" "My User Agent") Max-Forwards: 70 CSeq: 1 REGISTER Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690 Call-ID: 2910497622026445 From: <sip:[email protected]>;tag=2910497618150713 MySplitter took 3 secs, 668 ms, and 714 μs
Comment #1 by andrei — 2013-03-04T11:31:47Z
The forum discussion: http://goo.gl/CZVCB
Comment #2 by github-bugzilla — 2016-06-02T20:40:40Z
Commit pushed to master at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/a9d5b8ca779d2dd89c8b49abae385e793d469e5d Improve speed of find for random access needles (strings) For find a string within a string, std.algorithm.searching.find was unnecessarily slow. The reason is it created intermediate slices. A naively written nested-for-loop implementation was a few times faster. For random access ranges (which strings are) this uses an index based algorithm, which does not need to create an intermediate slice. Speed is now comparable to the nested-for-loop implementation even in rather pathological cases. This might help with issue 9646.
Comment #3 by qznc — 2016-06-28T17:31:20Z
For an update. This is still an issue for LDC 1.0.0 and DMD 2.071.0. One problem is certainly that find does not get inlined. While everything of MySplitter gets inlined. Maybe this is more of a dmd than a phobos bug. Does LDC use the LLVM inliner?
Comment #4 by github-bugzilla — 2016-10-01T11:45:16Z
Commit pushed to stable at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/a9d5b8ca779d2dd89c8b49abae385e793d469e5d Improve speed of find for random access needles (strings)
Comment #5 by greensunny12 — 2018-03-30T01:16:31Z
New numbers: DMD (2.079.0): std.algorithm.splitter took 2 secs, 736 ms, 348 μs, and 3 hnsecs MySplitter took 3 secs, 176 ms, 18 μs, and 9 hnsecs LDC (1.8.0): > ldc -O4 -mcpu=native -release -flto=full std.algorithm.splitter took 1 sec, 954 ms, 444 μs, and 9 hnsecs MySplitter took 1 sec, 635 ms, 394 μs, and 8 hnsecs So I there's still an opportunity for improvement with LDC.
Comment #6 by robert.schadek — 2024-12-01T16:16:50Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/phobos/issues/9960 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB