Bug 21560 – md5 poor performance out of the box

Status: NEW
Severity: enhancement
Priority: P4
Component: phobos
Product: D
Version: D2
Platform: x86_64
OS: Linux
Creation time: 2021-01-20T17:51:28Z
Last change time: 2024-12-01T16:38:16Z
Assigned to: No Owner
Creator: Witold Baryluk

Comments

Comment #0 by witold.baryluk+d — 2021-01-20T17:51:28Z

Comment #1 by witold.baryluk+d — 2021-01-20T17:57:45Z

100GB sparse file in tmpfs.

phobos:
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real 3m23.589s

md5sum from Debian testing:
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real 2m20.709s

32KiB buffers were used in both cases (confirmed by strace). Code compiled in release mode, with optimisations enabled.

AMD ThreadRipper 2950X, water cooled. 128GB, quad channel DDR4-2933 memory. Linux 5.10.4

Comment #2 by witold.baryluk+d — 2021-01-20T17:58:34Z

void main(string[] args)
{
    import std.digest.md : MD5, toHexString;
    import std.digest : LetterCase;
    import std.stdio : File, writefln;

    foreach (filename; args[1..$])
    {
        ubyte[32768] buffer_ = void;
        MD5 md5;
        md5.start();
        foreach (ubyte[] buffer; File(filename).byChunk(buffer_))
        {
            md5.put(buffer);
        }
        auto hash = md5.finish();
        writefln!("%s %s")(toHexString!(LetterCase.lower)(hash), filename);
    }
}

Comment #3 by witold.baryluk+d — 2021-01-20T18:04:57Z

Also, openssl 1.1.1i-2 from Debian testing (uses 8KiB buffers):

MD5(/usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data)= 009f07e9b8fb09a820dd180441502d46
real 2m18.517s

Similar to md5sum (coreutils 8.32). Both are faster than phobos.

Comment #4 by witold.baryluk+d — 2021-01-20T18:12:57Z

FYI: using File(filename).byChunk(32*1024), which allocates the buffer once on the heap instead of on the stack (where it could be unaligned and need big stack offsets, leading to slightly worse instruction encodings), gives the same performance results.

Comment #5 by b2.temp — 2021-01-20T18:18:09Z

Have you tried ldc2 too? There are options to enable better vectorization and bit operations.

Comment #6 by zopsicle — 2021-01-20T18:25:00Z

Since you have 128 GB memory you could load the entire file into a byte array and compute the hash from there. Start the timer after loading the entire file. This should eliminate any potential difference in I/O from the tests. Of course, you will need to do the same with md5sum and OpenSSL.
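The suggestion above could be sketched in D roughly as follows. This is only an illustration, not code from the thread: the helper name `timedMD5` is hypothetical, and it assumes the whole file fits in memory and can be read with `std.file.read`. The stopwatch is started only after the file has been loaded, so the reported time covers the digest and not the I/O.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.digest : LetterCase, toHexString;
import std.digest.md : MD5;
import std.file : read;
import std.stdio : writefln;

/// Hypothetical helper: hash an in-memory buffer and report how long
/// the digest computation alone took.
ubyte[16] timedMD5(const(ubyte)[] data, out Duration elapsed)
{
    auto sw = StopWatch(AutoStart.yes);
    MD5 md5;
    md5.start();
    md5.put(data);
    auto hash = md5.finish();
    sw.stop();
    elapsed = sw.peek();
    return hash;
}

void main(string[] args)
{
    foreach (filename; args[1 .. $])
    {
        // Load everything first so that file I/O is excluded from the timing.
        auto data = cast(const(ubyte)[]) read(filename);

        Duration elapsed;
        auto hash = timedMD5(data, elapsed);
        writefln("%s  %s  (%d ms hashing)",
                 toHexString!(LetterCase.lower)(hash), filename,
                 elapsed.total!"msecs");
    }
}
```

For a fair comparison, md5sum and OpenSSL would need an equivalent in-memory harness, since their command-line tools always include the read path.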

Comment #7 by witold.baryluk+d — 2021-01-20T22:01:42Z

(In reply to Basile-z from comment #5)
> using ldc2 too ? there are option to enable best vectorization and bit op

Quite a bit better with ldc2 (1.24.0 with LLVM 11.0.0, -release -mcpu=native -O3):

009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real 2m47.200s

But that is with ldc's precompiled phobos from Debian testing (-O -inline -release, aka -O2), so vectorization and constant propagation are somewhat limited.

Comment #8 by witold.baryluk+d — 2021-01-20T22:23:55Z

BTW, when using the precompiled dmd and phobos from dmd 2.095 from dlang.org, it is really, really slow:

dmd -O -inline -release -mcpu=avx2 -boundscheck=off md5.d
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real 15m14.536s

It uses the precompiled 64-bit phobos and links statically to it. It almost feels like it is in debug mode, or maybe built with bounds checks on. I haven't checked what it uses by default in some time.

Comment #9 by witold.baryluk+d — 2021-01-21T01:54:14Z

In-memory tests, on 16384 byte blocks (should fit nicely in caches).

OpenSSL 1.1.1i (gcc-10.2.1 -fPIC -O2 -fstack-protector-strong ... -DOPENSSL_PIC -DMD5_ASM ...): 833MB/s
standard/optimized (non-asm) C version with clang-11.0.0 -O3 -march=native -flto -fomit-frame-pointer: 707MB/s
standard/optimized (non-asm) C version with gcc-10.2.1 -O3 -march=native -flto -fomit-frame-pointer: 594MB/s
hand optimized x86-64 assembly with gcc-10.2.1 ....: 716MB/s
hand optimized x86-64 assembly with clang-11.0.0 ....: 717MB/s
md5sum-coreutils-8.32-4 on big files in tmpfs (uses 32KiB buffers, but also doing syscalls), C + gcc-10: 763MB/s
md5sum-busybox-static-1.30.1-6 on big files in tmpfs (uses 32KiB buffers, but also doing syscalls), C + gcc-10: 565MB/s

D / phobos:

gdc-10.2.1 -O3 -march=native -frelease -fno-weak (using shared Phobos, which uses -fPIC, from Debian testing): 569MB/s
dmd-2.095 -O -inline -release (precompiled Phobos from dlang.org binary release, statically linked): 120MB/s
ldc2-1.24.0 -O3 -release (precompiled Phobos from Debian testing, dynamically linked): 677MB/s
dmd-2.095 -O -inline -release -mcpu=avx2 -boundscheck=no + hand compiled Phobos with same options, statically linked: 544MB/s

"performance" cpu frequency governor, no other load on system. Reruns were 10s+ each, with a few MB/s of variation between reruns.

So, ldc2 actually does very well, approaching the performance of the pure-C version compiled with clang.

gdc, despite poor codegen with -fPIC in MD5.transform (it missed a lot of inlining opportunities for 1–2-instruction functions), is close to the pure-C version compiled with gcc.

For dmd, it apparently depends on how Phobos is compiled. The version distributed on dlang.org, and as built by default, does poorly (120MB/s). Compiled by hand with the options above, it actually doesn't do too badly (544MB/s).
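The harness used for these in-memory numbers is not shown in the thread. A minimal sketch of how such a measurement might look in D is given below, assuming a single 16384-byte block hashed repeatedly, 1 MB taken as 10^6 bytes, and a hypothetical helper named `md5Throughput`; the iteration count is arbitrary.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.digest.md : MD5;
import std.stdio : writefln;

/// Hypothetical helper: feed `block` to MD5 `iterations` times and
/// return the throughput in MB/s (assuming 1 MB = 10^6 bytes).
double md5Throughput(const(ubyte)[] block, size_t iterations)
{
    MD5 md5;
    md5.start();

    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. iterations)
        md5.put(block);
    auto hash = md5.finish(); // finalize the digest inside the timed region
    sw.stop();

    immutable seconds = sw.peek().total!"nsecs" / 1e9;
    immutable bytes = cast(double) block.length * iterations;
    return bytes / seconds / 1e6;
}

void main()
{
    // One 16384-byte block filled with a simple byte pattern.
    auto block = new ubyte[16_384];
    foreach (i, ref b; block)
        b = cast(ubyte) i;

    // Roughly 3.3 GB of input in total, so a run lasts a few seconds.
    writefln("%.0f MB/s", md5Throughput(block, 200_000));
}
```

Results from a sketch like this depend heavily on how both the program and Phobos itself were compiled, which is the point of the comparison above.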

Comment #10 by robert.schadek — 2024-12-01T16:38:16Z

THIS ISSUE HAS BEEN MOVED TO GITHUB

https://github.com/dlang/phobos/issues/10453

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB