Comment #0 by witold.baryluk+d — 2021-01-20T17:51:28Z
Comment #1 by witold.baryluk+d — 2021-01-20T17:57:45Z
100GB sparse file in tmpfs.
phobos:
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real 3m23.589s
md5sum from Debian testing:
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real 2m20.709s
32KiB buffers were used in both cases (confirmed by strace).
Code compiled in release mode, with optimisations enabled.
AMD Threadripper 2950X, water cooled. 128GB of quad-channel DDR4-2933 memory.
Linux 5.10.4
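To compare these wall-clock times with throughput figures, the conversion can be sketched as below. This is only an illustration: the helper name throughputMBps is hypothetical, and it assumes that "100GB" means 100 GiB and that MB/s means decimal megabytes per second. Under those assumptions the two timings above work out to roughly 527 MB/s for phobos and 763 MB/s for md5sum.

```d
import std.stdio : writefln;

/// Decimal megabytes per second for `bytes` processed in `seconds`.
/// (Hypothetical helper, not part of Phobos.)
double throughputMBps(ulong bytes, double seconds)
{
    return bytes / 1e6 / seconds;
}

void main()
{
    // Assumption: the "100GB" sparse file is exactly 100 GiB.
    enum ulong fileSize = 100UL * 1024 * 1024 * 1024;

    // "real" times reported above, converted to seconds.
    writefln("phobos: %.0f MB/s", throughputMBps(fileSize, 3 * 60 + 23.589));
    writefln("md5sum: %.0f MB/s", throughputMBps(fileSize, 2 * 60 + 20.709));
}
```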
Comment #2 by witold.baryluk+d — 2021-01-20T17:58:34Z
Comment #3 by witold.baryluk+d — 2021-01-20T18:04:57Z
Also openssl 1.1.1i-2 from Debian testing (uses 8KiB buffers):
MD5(/usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data)= 009f07e9b8fb09a820dd180441502d46
real 2m18.517s
Similar to md5sum (coreutils 8.32).
Both faster than phobos.
Comment #4 by witold.baryluk+d — 2021-01-20T18:12:57Z
FYI: using File(filename).byChunk(32*1024), which allocates the buffer once on the heap instead of on the stack (where it could be unaligned and need big stack offsets, leading to slightly worse instruction encodings), gives the same performance results.
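For reference, a minimal sketch of the byChunk variant might look like the following. The original md5.d is not shown in this thread, so the function name md5OfFile and the overall structure are assumptions, not the code that was actually benchmarked.

```d
import std.ascii : LetterCase;
import std.digest : toHexString;
import std.digest.md : MD5;
import std.stdio : File, writeln;

/// Hash a file in fixed-size chunks (hypothetical helper).
/// byChunk allocates one buffer on the heap and reuses it for every read.
string md5OfFile(string path, size_t chunkSize = 32 * 1024)
{
    MD5 md5;
    md5.start();
    foreach (ubyte[] chunk; File(path, "rb").byChunk(chunkSize))
        md5.put(chunk);
    ubyte[16] digest = md5.finish();
    // toHexString returns a static array; copy it before returning.
    auto hex = toHexString!(LetterCase.lower)(digest);
    return hex[].idup;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: md5 FILE");
        return;
    }
    // Same output layout as md5sum: digest, two spaces, file name.
    writeln(md5OfFile(args[1]), "  ", args[1]);
}
```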
Comment #5 by b2.temp — 2021-01-20T18:18:09Z
Using ldc2 too? There are options to enable the best vectorization and bit ops.
Comment #6 by zopsicle — 2021-01-20T18:25:00Z
Since you have 128 GB memory you could load the entire file into a byte array and compute the hash from there. Start the timer after loading the entire file. This should eliminate any potential difference in I/O from the tests. Of course, you will need to do the same with md5sum and OpenSSL.
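The in-memory measurement suggested here could be sketched as follows. This is an illustration only: md5InMemory is a hypothetical helper, the 16 KiB block size is an arbitrary choice, and the timer covers nothing but the hashing loop.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.digest.md : MD5;
import std.file : read;
import std.stdio : writefln, writeln;

/// Hash an in-memory buffer block by block (hypothetical helper).
ubyte[16] md5InMemory(const(ubyte)[] data, size_t blockSize = 16 * 1024)
{
    MD5 md5;
    md5.start();
    while (data.length > 0)
    {
        immutable n = data.length < blockSize ? data.length : blockSize;
        md5.put(data[0 .. n]);
        data = data[n .. $];
    }
    return md5.finish();
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: md5mem FILE");
        return;
    }
    // All file I/O happens here, before the timer starts.
    auto data = cast(const(ubyte)[]) read(args[1]);

    auto sw = StopWatch(AutoStart.yes);
    auto digest = md5InMemory(data);
    sw.stop();

    immutable secs = sw.peek.total!"usecs" / 1e6;
    // Print the digest in hex and the throughput in decimal MB/s.
    writefln("%(%02x%)  %.1f MB/s", digest[], data.length / 1e6 / secs);
}
```

For a fair comparison, md5sum and OpenSSL would need an equivalent harness that hashes a preloaded buffer.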
Comment #7 by witold.baryluk+d — 2021-01-20T22:01:42Z
(In reply to Basile-z from comment #5)
> Using ldc2 too? There are options to enable the best vectorization and bit ops.
Quite a bit better with ldc2 (1.24.0 with LLVM 11.0.0, -release -mcpu=native -O3):
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real 2m47.200s
But that is with ldc's precompiled Phobos from Debian testing (built with -O -inline -release, aka -O2), so vectorization and constant propagation are somewhat limited.
Comment #8 by witold.baryluk+d — 2021-01-20T22:23:55Z
BTW, when using the precompiled dmd and Phobos from the dmd 2.095 release on dlang.org, it is really, really slow:
dmd -O -inline -release -mcpu=avx2 -boundscheck=off md5.d
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real 15m14.536s
It uses the precompiled 64-bit Phobos and links to it statically. It almost feels like it was built in debug mode, or maybe with bounds checks on. I haven't checked what it uses by default in some time.
Comment #9 by witold.baryluk+d — 2021-01-21T01:54:14Z
In-memory tests, on 16384 byte blocks (should fit nicely in caches).
OpenSSL 1.1.1i (gcc-10.2.1 -fPIC -O2 -fstack-protector-strong ... -DOPENSSL_PIC -DMD5_ASM ...):
833MB/s
standard/optimized (non-asm) C version with clang-11.0.0 -O3 -march=native -flto -fomit-frame-pointer:
707MB/s
standard/optimized (non-asm) C version with gcc-10.2.1 -O3 -march=native -flto -fomit-frame-pointer:
594MB/s
hand optimized x86-64 assembly with gcc-10.2.1 ....:
716MB/s
hand optimized x86-64 assembly with clang-11.0.0 ....:
717MB/s
md5sum-coreutils-8.32-4 on big files in tmpfs (uses 32KiB buffers, but also doing syscalls), C + gcc-10:
763MB/s
md5sum-busybox-static-1.30.1-6 on big files in tmpfs (uses 32KiB buffers, but also doing syscalls), C + gcc-10:
565MB/s
D / phobos:
gdc-10.2.1 -O3 -march=native -frelease -fno-weak (using shared Phobos, which uses -fPIC, from Debian testing)
569MB/s
dmd-2.095 -O -inline -release (precompiled Phobos from dlang.org binary release, statically linked)
120MB/s
ldc2-1.24.0 -O3 -release (precompiled Phobos from Debian testing, dynamically linked)
677MB/s
dmd-2.095 -O -inline -release -mcpu=avx2 -boundscheck=off + hand-compiled Phobos with the same options, statically linked.
544MB/s
"performance" CPU frequency governor, no other load on the system. Each run took 10s or more, with variations of a few MB/s between reruns.
So, ldc2 actually does very well, approaching the performance of the pure-C version compiled with clang.
gdc, despite poor codegen with -fPIC in MD5.transform (it missed a lot of inlining opportunities for 1–2-instruction functions), is close to the pure-C version compiled with gcc.
dmd: apparently it depends on how you compile Phobos. The version distributed on dlang.org, as built by default, does poorly. Properly compiled, it actually doesn't do too badly.
The precompiled version performs horribly, though.
Comment #10 by robert.schadek — 2024-12-01T16:38:16Z