Bug 16517 – topN performance very poor on certain data sets

Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86
OS
Mac OS X
Creation time
2016-09-21T06:59:00Z
Last change time
2016-10-23T16:50:25Z
Assigned to
nobody
Creator
jrdemail2000-dlang

Comments

Comment #0 by jrdemail2000-dlang — 2016-09-21T06:59:55Z
topN has very poor performance on certain data sets, to the point of being unusable. Note: There was a forum thread, issues (eg. 15553) and merged patches in Jan 2016. Best I can tell, these fixes are all included in the dmd/ldc releases I've tested, indicating performance issues remain. The data set used is not artificial, but the distribution of values are quite skewed, this may contribute to the performance issues. Data is a file from the google books ngram viewer project: https://books.google.com/ngrams. File is a TSV file with 4 columns: ngram, year, total occurrences, number of books occurred in. File used has 10,512,769 lines. Tests were run by reading one of the fields (10M values), storing in an array, and running topN to get the median value. This was run for fields 2,3,4 and compared to the same algorithm using sort. Sort was faster in each case, but for the fields 3 and 4 topN was dramatically slower, to the point of being unusable. Summary (Times in milliseconds, Compiler: LDC 1.1.0-beta2): Field 2 Field 3 Field 4 sort: 289 285 273 topN: 1756 148793 668906 As this was a median calculation, the nth argument to topN was the mid-point. Some of the earlier fixes discuss performance issues for topN where nth is at one end. Data file: http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-0.gz It's the first file under "1-grams" from this URL: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html =====Test program (median.d)===== import std.stdio; import std.array : appender; import std.algorithm : sort, topN; import std.conv; import std.range; import std.datetime: StopWatch; void main() { StopWatch swTotal, swSort; swTotal.start; Appender!(double[]) values; foreach (line; stdin.byLine) values.put(line.to!double); size_t medianIndex = values.data.length/2; swSort.start; version(topn) { topN(values.data, medianIndex); auto method = "topN"; } else { sort(values.data); auto method = "sort"; } swTotal.stop; swSort.stop; writefln("median (via %s): %g; lines: %d; total time (ms): %d; %s time (ms): %d", method, values.data[medianIndex], values.data.length, swTotal.peek.msecs, method, swSort.peek.msecs); } ========================== Test session: $ make ldc2 -release -O -boundscheck=off -d-version=topn -ofmedian_by_topn median.d ldc2 -release -O -boundscheck=off -ofmedian_by_sort median.d $ gcut -f 2 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 1972; lines: 10512769; total time (ms): 1756; sort time (ms): 289 $ gcut -f 2 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 1972; lines: 10512769; total time (ms): 3199; topN time (ms): 1756 $ gcut -f 3 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 3; lines: 10512769; total time (ms): 1549; sort time (ms): 285 $ gcut -f 3 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 3; lines: 10512769; total time (ms): 150033; topN time (ms): 148793 $ gcut -f 4 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 2; lines: 10512769; total time (ms): 1523; sort time (ms): 273 $ gcut -f 4 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 2; lines: 10512769; total time (ms): 670124; topN time (ms): 668906 $ ldc2 --version LDC - the LLVM D compiler (1.1.0-beta2): based on DMD v2.071.1 and LLVM 3.9.0 built with LDC - the LLVM D compiler (1.1.0-beta2) Default target: x86_64-apple-darwin14.5.0
Comment #1 by andrei — 2016-09-23T16:05:48Z
I don't have a ldc testing rig installed, but with https://github.com/dlang/phobos/pull/4815 and dmd I get the following for your code and data: $ dmd -O -inline -release -ofmedian_by_topn -boundscheck=off -version=topn issue16517.d $ dmd -O -inline -release -ofmedian_by_sort -boundscheck=off issue16517.d $ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 1972; lines: 10512769; total time (ms): 2246; sort time (ms): 412 $ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 1972; lines: 10512769; total time (ms): 1823; topN time (ms): 35 Although results obtained using dmd are not comparable, the trend is encouraging. If you could patch your stdlib with the pull request it would be great. Thanks!
Comment #2 by andrei — 2016-09-23T16:12:18Z
More measurements: $ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 3; lines: 10512769; total time (ms): 2046; sort time (ms): 408 $ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 3; lines: 10512769; total time (ms): 1330; topN time (ms): 64 $ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 2; lines: 10512769; total time (ms): 1964; sort time (ms): 376 $ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 2; lines: 10512769; total time (ms): 1549; topN time (ms): 78
Comment #3 by jrdemail2000-dlang — 2016-09-24T06:37:02Z
Apologies for the slow response. I'm not building my own LDC, just using current release build, so I'm not quite set up to pull the patch and test it. I'll take a look and what it takes, if I can do it easily I will. To do a bit of comparison with your numbers - On my machine, the DMD numbers are a bit faster than yours. Typical DMD times for the sort version on the 3 fields are 350/345/330ms, about 85% of your times. By the same token, on my machine LDC runs are about 85% of the DMD times. Dangerous to assume the ratios will hold up for topN, but if they do that be outstanding.
Comment #4 by jrdemail2000-dlang — 2016-09-25T03:18:00Z
I took a try at building the pull request as part of a local LDC instance, but hit incompatibilities in other code files. Will have to wait on the LDC tests.
Comment #5 by jrdemail2000-dlang — 2016-09-25T07:49:38Z
Got an LDC build with pull https://github.com/dlang/phobos/pull/4815. Also built the C++ nth_element program posted in the forum thread. Did seven runs each version, the median times are below. C++ compiler line: g++-mp-4.9 -O3 -static-libgcc -static-libstdc++ -std=c++11 Median sort times (ms): Field 2 Field 3 Field 4 DMD 351 348 326 LDC 273 261 245 C++ 281 260 246 Median topN / nth_element times (ms): Field 2 Field 3 Field 4 LDC: 21 44 57 C++: 41 60 71 This looks really good!
Comment #6 by andrei — 2016-10-23T16:50:25Z