Bug 18627 – std.complex is a lot slower than builtin complex types at number crunching

Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86
OS
Windows
Creation time
2018-03-18T16:38:25Z
Last change time
2021-04-24T01:06:50Z
Keywords
pull
Assigned to
No Owner
Creator
ponce

Comments

Comment #0 by aliloko — 2018-03-18T16:38:25Z
BENCHMARK: Please consider the following program. It's just number crunching; there are no complex exponentials or similar function calls. The FFT function exercises complex addition, multiplication, and division.

--- main.d

import std.string;
import std.datetime;
import std.datetime.stopwatch : benchmark, StopWatch;
import std.complex;
import std.stdio;
import std.math;

void main()
{
    cfloat[] A = new cfloat[1024];
    cdouble[] B = new cdouble[1024];
    Complex!float[] C = new Complex!float[1024];
    Complex!double[] D = new Complex!double[1024];

    foreach(i; 0..1024)
    {
        // Initialize with something
        A[i] = i + 0i;
        B[i] = i + 0i;
        C[i] = Complex!float(i, 0);
        D[i] = Complex!double(i, 0);
    }

    void fA() { bench!(float, cfloat, true)(A); }
    void fB() { bench!(double, cdouble, true)(B); }
    void fC() { bench!(float, Complex!float, false)(C); }
    void fD() { bench!(double, Complex!double, false)(D); }

    auto r = benchmark!(fA, fB, fC, fD)(1000);
    Duration fAResult = r[0];
    Duration fBResult = r[1];
    Duration fCResult = r[2];
    Duration fDResult = r[3];
    writefln("With cfloat: %s", fAResult);
    writefln("With cdouble: %s", fBResult);
    writefln("With Complex!float: %s", fCResult);
    writefln("With Complex!double: %s", fDResult);
}

void bench(T, ComplexType, bool usingBuiltinTypes)(ComplexType[] array)
{
    // forward then reverse FFT
    FFT!(T, ComplexType, true, usingBuiltinTypes)(array);
    FFT!(T, ComplexType, false, usingBuiltinTypes)(array);
}

// just for the benchmark purpose
private void FFT(T, ComplexType, bool direction, bool usingBuiltinTypes)(ComplexType[] buffer) pure nothrow @nogc
{
    int size = cast(int)(buffer.length);
    int m = 10;

    ComplexType* pbuffer = buffer.ptr;

    // do the bit reversal
    int i2 = cast(int)size / 2;
    int j = 0;
    for (int i = 0; i < size - 1; ++i)
    {
        if (i < j)
        {
            auto tmp = pbuffer[i];
            pbuffer[i] = pbuffer[j];
            pbuffer[j] = tmp;
        }

        int k = i2;
        while (k <= j)
        {
            j = j - k;
            k = k / 2;
        }
        j += k;
    }

    // compute the FFT
    static if (usingBuiltinTypes)
    {
        ComplexType c = -1+0i;
    }
    else
    {
        ComplexType c = -1;
    }

    int l2 = 1;
    for (int l = 0; l < m; ++l)
    {
        int l1 = l2;
        l2 = l2 * 2;

        static if (usingBuiltinTypes)
        {
            ComplexType u = 1+0i;
        }
        else
        {
            ComplexType u = 1;
        }

        for (int j2 = 0; j2 < l1; ++j2)
        {
            int i = j2;
            while (i < size)
            {
                int i1 = i + l1;
                ComplexType t1 = u * pbuffer[i1];
                pbuffer[i1] = pbuffer[i] - t1;
                pbuffer[i] += t1;
                i += l2;
            }
            u = u * c;
        }

        T newImag = sqrt((1 - c.re) / 2);
        static if (direction)
            newImag = -newImag;
        T newReal = sqrt((1 + c.re) / 2);

        static if (usingBuiltinTypes)
        {
            c = newReal + 1.0fi * newImag;
        }
        else
        {
            c = ComplexType(newReal, newImag);
        }
    }

    // scaling when doing the reverse transformation, to avoid being multiplied by size
    static if (!direction)
    {
        T divider = 1 / cast(T)size;
        for (int i = 0; i < size; ++i)
        {
            pbuffer[i] = pbuffer[i] * divider;
        }
    }
}

BENCHMARKS

- With DMD v2.079.0, Windows:

$ rdmd -O -inline -release main.d

With cfloat: 1 sec, 187 ms, 225 μs, and 5 hnsecs
With cdouble: 1 sec, 315 ms, 385 μs, and 7 hnsecs
With Complex!float: 2 secs, 451 ms, 319 μs, and 2 hnsecs (!)
With Complex!double: 11 secs, 539 ms, and 96 μs (!!!)

$ rdmd -O -inline -release main.d -m64

With cfloat: 1 sec, 196 ms, 312 μs, and 7 hnsecs
With cdouble: 1 sec, 295 ms, 927 μs, and 9 hnsecs
With Complex!float: 2 secs, 505 ms, 536 μs, and 4 hnsecs (!)
With Complex!double: 11 secs, 590 ms, 574 μs, and 4 hnsecs (!!!)
- With LDC 1.8.0 32-bit:

$ ldc2.exe -O3 -release -enable-inlining main.d

With cfloat: 564 ms, 669 us, and 9 hnsecs
With cdouble: 535 ms, 570 us, and 2 hnsecs
With Complex!float: 563 ms, 809 us, and 4 hnsecs
With Complex!double: 537 ms, 289 us, and 6 hnsecs

- With LDC 1.8.0 64-bit:

$ ldc2.exe -O3 -release -enable-inlining main.d

With cfloat: 577 ms, 472 us, and 5 hnsecs
With cdouble: 554 ms, 357 us, and 6 hnsecs
With Complex!float: 588 ms, 463 us, and 9 hnsecs
With Complex!double: 558 ms, 881 us, and 7 hnsecs

COMMENTARY

We are looking at a guaranteed 2x to 10x regression when using std.complex with DMD for complex arithmetic. The latest LDC seems to have fixed this discrepancy, but I'm still using an older LDC where the difference is 2x in favour of the builtin complexes. This will definitely make my life harder, and I don't understand why the rug is being pulled out from under the builtin complex numbers. While I do use a separate FFT optimized with LDC intrinsics, there is a lot of complex number crunching in phase vocoders, filtering, and many other signal processing tasks, and I find the builtin complex types in D incredibly well suited to this. This is a competitive advantage, even just for readability. Please reconsider the deprecation.
Comment #1 by greeenify — 2018-03-18T16:47:38Z
Comment #2 by aliloko — 2018-03-18T17:08:34Z
I've posted there, thanks.
Comment #3 by ibuclaw — 2018-03-20T06:18:10Z
FYI, GDC results were missing, so I'll post them anyway, along with DMD as a comparative benchmark, because each machine is different and DMD may optimize weirdly for one CPU but be perfectly fine for another (see for instance issue 5100).

DMD64 D Compiler v2.076.1
---
$ dmd complex.d -O -inline -release

With cfloat: 75 ms, 688 μs, and 2 hnsecs
With cdouble: 61 ms, 546 μs, and 7 hnsecs
With Complex!float: 161 ms, 816 μs, and 8 hnsecs
With Complex!double: 109 ms, 66 μs, and 1 hnsec
---
There seems to be room for improvement in dmd or the general phobos implementation.

gdc (GCC) 8.0.1 20180226 (2.076.1 library and patches)
---
$ gdc complex.d -O2 -frelease

With cfloat: 154 ms, 871 μs, and 8 hnsecs
With cdouble: 59 ms, 205 μs, and 7 hnsecs
With Complex!float: 32 ms, 566 μs, and 5 hnsecs
With Complex!double: 34 ms, 961 μs, and 6 hnsecs
---
However, with gdc, std.complex is /faster/ than native.
Comment #4 by aliloko — 2018-03-20T12:58:43Z
This benchmark is a variation that does only division.

--- divide.d

import std.string;
import std.datetime;
import std.datetime.stopwatch : benchmark, StopWatch;
import std.complex;
import std.stdio;
import std.math;

void main()
{
    int[] divider = new int[1024];
    cfloat[] A = new cfloat[1024];
    cdouble[] B = new cdouble[1024];
    Complex!float[] C = new Complex!float[1024];
    Complex!double[] D = new Complex!double[1024];

    foreach(i; 0..1024)
    {
        divider[i] = (i*69060) / 10000;

        // Initialize with something
        A[i] = i + 1i;
        B[i] = i + 1i;
        C[i] = Complex!float(i, 1);
        D[i] = Complex!double(i, 1);
    }

    void justDivide(ComplexType)(ComplexType[] arr)
    {
        int size = cast(int)(arr.length);
        for (int i = 0; i < size; ++i)
        {
            arr[i] = divider[i] / arr[i];
        }
    }

    void fA() { justDivide!(cfloat)(A); }
    void fB() { justDivide!(cdouble)(B); }
    void fC() { justDivide!(Complex!float)(C); }
    void fD() { justDivide!(Complex!double)(D); }

    auto r = benchmark!(fA, fB, fC, fD)(1000000);
    {
        writefln("With cfloat: %s", r[0]);
        writefln("With cdouble: %s", r[1]);
        writefln("With Complex!float: %s", r[2]);
        writefln("With Complex!double: %s", r[3]);
    }
}

RESULTS

* With ldc 1.8.0 64-bit:

$ ldc2.exe -O3 -enable-inlining -release divide.d -m64
$ divide.exe

With cfloat: 7 secs, 623 ms, 829 μs, and 9 hnsecs
With cdouble: 7 secs, 594 ms, 449 μs, and 8 hnsecs
With Complex!float: 7 secs, 988 ms, 642 μs, and 4 hnsecs
With Complex!double: 15 secs, 501 ms, 128 μs, and 4 hnsecs

* With ldc 1.8.0 32-bit:

$ ldc2.exe -O3 -enable-inlining -release divide.d -m32
$ divide.exe

With cfloat: 7 secs, 618 ms, 202 μs, and 1 hnsec
With cdouble: 7 secs, 593 ms, 777 μs, and 2 hnsecs
With Complex!float: 7 secs, 958 ms, 692 μs, and 9 hnsecs
With Complex!double: 15 secs, 414 ms, and 344 μs

This shows that even with the latest LDC you can see a regression.

I appreciate that std.complex gives more precision in the divide operation, but that also makes it _different_ from the builtin complex it replaces.
Comment #5 by aliloko — 2018-03-20T13:30:50Z
Division with DMD 32-bit:

With cfloat: 1 minute, 18 secs, 451 ms, 932 μs, and 9 hnsecs
With cdouble: 1 minute, 19 secs, 747 ms, 70 μs, and 5 hnsecs
With Complex!float: 27 secs, 412 ms, 926 μs, and 5 hnsecs
With Complex!double: 25 secs, 39 ms, 159 μs, and 2 hnsecs
Comment #6 by aliloko — 2018-03-20T13:46:12Z
Conversely, complex division with DMD seems faster with std.complex than with the builtins.
Comment #7 by greensunny12 — 2018-03-20T13:50:13Z
> Division with DMD 32-bit:

Using DMD for any performance argument is a bit of a moot point, as DMD's optimizer is pretty bad. Going by DMD numbers would halt almost all development, since there are many, many performance regressions with DMD.
Comment #8 by aliloko — 2018-03-20T20:19:35Z
@Seb: It's not only about DMD; there is a 2x performance regression with Complex!double vs cdouble using LDC, and there are probably more I haven't exposed yet. And yes, I use cdouble for designing IIR filters in a real-time program. Our main product uses builtin complexes and is downloaded 2000 times per month.
Comment #9 by aliloko — 2018-03-22T11:08:32Z
I think at the very least std.complex should provide a function that divides Complex values without the additional precision that comes from the range check with the two fabs() calls. People who want speed could opt in, and everyone else would keep the increased precision without noticing. See the sketch below.
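For illustration, something like the following could work. This is only a rough sketch: the fastDiv name is made up and it is not an existing std.complex function; it just applies the naive textbook formula with no fabs()-based range reduction.

---
import std.complex : Complex, complex;
import std.math : approxEqual;

// Hypothetical opt-in helper (NOT part of std.complex): divide using the
// naive formula, skipping the fabs() range-reduction check that guards
// against overflow/underflow. Faster, but less robust for extreme values.
Complex!T fastDiv(T)(Complex!T a, Complex!T b) pure nothrow @nogc
{
    immutable T denom = b.re * b.re + b.im * b.im;
    return Complex!T((a.re * b.re + a.im * b.im) / denom,
                     (a.im * b.re - a.re * b.im) / denom);
}

unittest
{
    // (1 + 2i) / (3 + 4i) = 0.44 + 0.08i
    auto z = fastDiv(complex(1.0, 2.0), complex(3.0, 4.0));
    assert(approxEqual(z.re, 0.44) && approxEqual(z.im, 0.08));
}
---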
Comment #10 by ibuclaw — 2020-07-05T16:33:48Z
(In reply to ponce from comment #4)
> This benchmark is a variation that does only division.
>
> --- divide.d

* With gdc -O2 -frelease -m64

With cfloat: 11 secs, 204 ms, 475 μs, and 2 hnsecs
With cdouble: 13 secs, 420 ms, 497 μs, and 6 hnsecs
With Complex!float: 4 secs, 689 ms, 546 μs, and 2 hnsecs
With Complex!double: 8 secs, 903 ms, 172 μs, and 4 hnsecs

* With gdc -O2 -frelease -m32

With cfloat: 29 secs, 471 ms, 678 μs, and 9 hnsecs
With cdouble: 29 secs, 176 ms, 189 μs, and 2 hnsecs
With Complex!float: 13 secs, 379 ms, 856 μs, and 8 hnsecs
With Complex!double: 18 secs, 240 ms, 975 μs, and 5 hnsecs

Native complex floating point must die.
Comment #11 by dlang-bot — 2021-02-28T00:23:19Z
@ibuclaw created dlang/phobos pull request #7814 "fix Issue 18627 - Use cephes algorithm for complex divide" fixing this issue: - fix Issue 18627 - Use cephes algorithm for complex divide https://github.com/dlang/phobos/pull/7814
Comment #12 by dlang-bot — 2021-03-22T08:18:51Z
dlang/phobos pull request #7814 "fix Issue 18627 - Use cephes algorithm for complex divide" was merged into master: - 70595f5d51011a6258d001523c8749411b9d8152 by Iain Buclaw: fix Issue 18627 - Use cephes algorithm for complex divide https://github.com/dlang/phobos/pull/7814
Comment #13 by ibuclaw — 2021-03-24T09:20:10Z
(In reply to ponce from comment #4)
> RESULTS
>
> * With ldc 1.8.0 64-bit:
>
> $ ldc2.exe -O3 -enable-inlining -release divide.d -m64
> $ divide.exe
>
> With cfloat: 7 secs, 623 ms, 829 μs, and 9 hnsecs
> With cdouble: 7 secs, 594 ms, 449 μs, and 8 hnsecs
> With Complex!float: 7 secs, 988 ms, 642 μs, and 4 hnsecs
> With Complex!double: 15 secs, 501 ms, 128 μs, and 4 hnsecs
>
> * With ldc 1.8.0 32-bit:
>
> $ ldc2.exe -O3 -enable-inlining -release divide.d -m32
> $ divide.exe
>
> With cfloat: 7 secs, 618 ms, 202 μs, and 1 hnsec
> With cdouble: 7 secs, 593 ms, 777 μs, and 2 hnsecs
> With Complex!float: 7 secs, 958 ms, 692 μs, and 9 hnsecs
> With Complex!double: 15 secs, 414 ms, and 344 μs
>
> This shows that even with the latest LDC you can see a regression.

A bug should probably be raised against LDC for not using range reduction (i.e. Smith's algorithm) in its native complex division implementation. The slowdown is not a regression; LDC is just using the wrong algorithm by default (the "fast" naive version should be generated only when compiling with `-ffast-math`). GDC and LDC could coordinate with each other and predefine `version(FastMath)` when the `-ffast-math` flag is given on the command line.
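To make the two code paths concrete, here's a rough sketch. Illustrative only: smithDiv is a made-up name, not Phobos, GDC, or LDC code, and version(FastMath) is the proposed, not yet existing, version identifier.

---
import std.complex : Complex;
import std.math : fabs;

// Range-reduced complex division (Smith, 1962) vs the naive fast path.
Complex!T smithDiv(T)(Complex!T a, Complex!T b)
{
    version (FastMath)
    {
        // Naive formula: fastest, but b.re^2 + b.im^2 can overflow/underflow.
        immutable T denom = b.re * b.re + b.im * b.im;
        return Complex!T((a.re * b.re + a.im * b.im) / denom,
                         (a.im * b.re - a.re * b.im) / denom);
    }
    else
    {
        // Smith's algorithm: divide through by the larger component of b so
        // the intermediate ratio r has magnitude <= 1, avoiding spurious
        // overflow at the cost of an fabs() comparison and a branch.
        if (fabs(b.re) >= fabs(b.im))
        {
            immutable T r = b.im / b.re;
            immutable T d = b.re + b.im * r;
            return Complex!T((a.re + a.im * r) / d, (a.im - a.re * r) / d);
        }
        else
        {
            immutable T r = b.re / b.im;
            immutable T d = b.im + b.re * r;
            return Complex!T((a.re * r + a.im) / d, (a.im * r - a.re) / d);
        }
    }
}
---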
Comment #14 by aliloko — 2021-03-24T09:30:35Z
No problem from here: a lot of our complex code is now SIMD, so I doubt we'll see a practical problem apart from the transition work. It's easy to recreate the desired division algorithm manually if ever needed.
Comment #15 by ibuclaw — 2021-04-16T14:45:17Z
Not sure if this should really be marked as resolved/fixed, but anyhow... With the following (lazy) function generator:

---
import std.complex : C = Complex;
import std.meta : AliasSeq;
import std.format : format;

static foreach (T; AliasSeq!(cfloat, cdouble, creal))
{
    // Unary operators
    mixin(format!"%s %s_unary_add(%s a) { return +a; }"
          (T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_unary_sub(%s a) { return -a; }"
          (T.stringof, T.stringof, T.stringof));

    // Binary operators
    mixin(format!"%s %s_binary_add(%s a, %s b) { return a + b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_sub(%s a, %s b) { return a - b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_mul(%s a, %s b) { return a * b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_div(%s a, %s b) { return a / b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
}

static foreach (T; AliasSeq!(float, double, real))
{
    // Unary operators
    mixin(format!"C!%s std_c%s_unary_add(C!%s a) { return +a; }"
          (T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_unary_sub(C!%s a) { return -a; }"
          (T.stringof, T.stringof, T.stringof));

    // Binary operators
    mixin(format!"C!%s std_c%s_binary_add(C!%s a, C!%s b) { return a + b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_sub(C!%s a, C!%s b) { return a - b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_mul(C!%s a, C!%s b) { return a * b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_div(C!%s a, C!%s b) { return a / b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
}
---

On x86_64/GDC, the results are:

========================================
cfloat_unary_add:
    movq %xmm0, -8(%rsp) movss -8(%rsp), %xmm0 movss %xmm0, -16(%rsp) movss -4(%rsp), %xmm0 movss %xmm0, -12(%rsp) movq -16(%rsp), %xmm0 ret
---
std_cfloat_unary_add:
    ret
========================================
cdouble_unary_add:
    ret
---
std_cdouble_unary_add:
    ret
========================================
creal_unary_add:
    fldt 8(%rsp) fldt 24(%rsp) fxch %st(1) ret
---
std_creal_unary_add:
    movdqa 8(%rsp), %xmm0 movdqa 24(%rsp), %xmm1 movq %rdi, %rax movaps %xmm0, (%rdi) movaps %xmm1, 16(%rdi) ret
========================================
cfloat_unary_sub:
    movq %xmm0, -8(%rsp) movss -8(%rsp), %xmm0 movss .LC4(%rip), %xmm2 movaps %xmm0, %xmm1 movss -4(%rsp), %xmm0 xorps %xmm2, %xmm1 xorps %xmm2, %xmm0 movss %xmm1, -16(%rsp) movss %xmm0, -12(%rsp) movq -16(%rsp), %xmm0 ret
    .LC4: .long -2147483648 .long 0 .long 0 .long 0
---
std_cfloat_unary_sub:
    movq .LC7(%rip), %xmm1 xorps %xmm1, %xmm0 ret
    .LC7: .long -2147483648 .long -2147483648
========================================
cdouble_unary_sub:
    movq .LC5(%rip), %xmm2 xorpd %xmm2, %xmm1 xorpd %xmm2, %xmm0 ret
    .LC5: .long 0 .long -2147483648 .long 0 .long 0
---
std_cdouble_unary_sub:
    movq %xmm0, -24(%rsp) movq %xmm1, -16(%rsp) movapd -24(%rsp), %xmm2 xorpd .LC8(%rip), %xmm2 movaps %xmm2, -24(%rsp) movsd -16(%rsp), %xmm1 movsd -24(%rsp), %xmm0 ret
    .LC8: .long 0 .long -2147483648 .long 0 .long -2147483648
========================================
creal_unary_sub:
    fldt 8(%rsp) fchs fldt 24(%rsp) fchs fxch %st(1) ret
---
std_creal_unary_sub:
    fldt 24(%rsp) movq %rdi, %rax fchs fldt 8(%rsp) fchs fstpt (%rdi) fstpt 16(%rdi) ret
========================================
cfloat_binary_add:
    movq %xmm0, -8(%rsp) movq %xmm1, -16(%rsp) movss -8(%rsp), %xmm1 movss -16(%rsp), %xmm0 addss %xmm0, %xmm1 movss -12(%rsp), %xmm0 addss -4(%rsp), %xmm0 movss %xmm1, -24(%rsp) movss %xmm0, -20(%rsp) movq -24(%rsp), %xmm0 ret
---
std_cfloat_binary_add:
    addps %xmm1, %xmm0 ret
========================================
cdouble_binary_add:
    addsd %xmm3, %xmm1 addsd %xmm2, %xmm0 ret
---
std_cdouble_binary_add:
    movq %xmm0, -40(%rsp) movq %xmm1, -32(%rsp) movq %xmm2, -24(%rsp) movq %xmm3, -16(%rsp) movapd -24(%rsp), %xmm4 addpd -40(%rsp), %xmm4 movaps %xmm4, -40(%rsp) movsd -32(%rsp), %xmm1 movsd -40(%rsp), %xmm0 ret
========================================
creal_binary_add:
    fldt 8(%rsp) fldt 40(%rsp) faddp %st, %st(1) fldt 24(%rsp) fldt 56(%rsp) faddp %st, %st(1) fxch %st(1) ret
---
std_creal_binary_add:
    fldt 24(%rsp) movq %rdi, %rax fldt 56(%rsp) faddp %st, %st(1) fldt 40(%rsp) fldt 8(%rsp) faddp %st, %st(1) fstpt (%rdi) fstpt 16(%rdi) ret
========================================
cfloat_binary_sub:
    movq %xmm0, -8(%rsp) movss -8(%rsp), %xmm0 movq %xmm1, -16(%rsp) movaps %xmm0, %xmm1 movss -4(%rsp), %xmm0 subss -16(%rsp), %xmm1 subss -12(%rsp), %xmm0 movss %xmm1, -24(%rsp) movss %xmm0, -20(%rsp) movq -24(%rsp), %xmm0 ret
---
std_cfloat_binary_sub:
    subps %xmm1, %xmm0 ret
========================================
cdouble_binary_sub:
    subsd %xmm3, %xmm1 subsd %xmm2, %xmm0 ret
---
std_cdouble_binary_sub:
    movq %xmm0, -40(%rsp) movq %xmm1, -32(%rsp) movapd -40(%rsp), %xmm4 movq %xmm2, -24(%rsp) movq %xmm3, -16(%rsp) subpd -24(%rsp), %xmm4 movaps %xmm4, -40(%rsp) movsd -32(%rsp), %xmm1 movsd -40(%rsp), %xmm0 ret
========================================
creal_binary_sub:
    fldt 8(%rsp) fldt 40(%rsp) fsubrp %st, %st(1) fldt 24(%rsp) fldt 56(%rsp) fsubrp %st, %st(1) fxch %st(1) ret
---
std_creal_binary_sub:
    fldt 24(%rsp) movq %rdi, %rax fldt 56(%rsp) fsubrp %st, %st(1) fldt 8(%rsp) fldt 40(%rsp) fsubrp %st, %st(1) fstpt (%rdi) fstpt 16(%rdi) ret
========================================
cfloat_binary_mul:
    movq %xmm0, -8(%rsp) movss -8(%rsp), %xmm0 movss -4(%rsp), %xmm2 movq %xmm1, -16(%rsp) movss -16(%rsp), %xmm3 movss -12(%rsp), %xmm4 movaps %xmm0, %xmm1 movaps %xmm2, %xmm5 mulss %xmm3, %xmm1 mulss %xmm4, %xmm5 mulss %xmm4, %xmm0 mulss %xmm3, %xmm2 subss %xmm5, %xmm1 addss %xmm2, %xmm0 movss %xmm1, -24(%rsp) movss %xmm0, -20(%rsp) movq -24(%rsp), %xmm0 ret
---
std_cfloat_binary_mul:
    movdqa %xmm0, %xmm2 movaps %xmm1, %xmm0 shufps $0xe5, %xmm1, %xmm1 shufps $0xe0, %xmm0, %xmm0 mulps %xmm2, %xmm0 shufps $0xe1, %xmm2, %xmm2 mulps %xmm1, %xmm2 movaps %xmm0, %xmm1 subps %xmm2, %xmm1 addps %xmm2, %xmm0 movss %xmm1, %xmm0 ret
========================================
cdouble_binary_mul:
    movapd %xmm0, %xmm4 movapd %xmm1, %xmm5 mulsd %xmm2, %xmm0 mulsd %xmm3, %xmm5 mulsd %xmm3, %xmm4 mulsd %xmm2, %xmm1 subsd %xmm5, %xmm0 addsd %xmm4, %xmm1 ret
---
std_cdouble_binary_mul:
    movq %xmm2, -40(%rsp) movq %xmm3, -32(%rsp) movapd -40(%rsp), %xmm2 movq %xmm1, -16(%rsp) movapd -40(%rsp), %xmm1 movq %xmm0, -24(%rsp) movapd -24(%rsp), %xmm0 unpcklpd %xmm2, %xmm2 mulpd -24(%rsp), %xmm2 unpckhpd %xmm1, %xmm1 shufpd $1, %xmm0, %xmm0 mulpd %xmm1, %xmm0 movapd %xmm2, %xmm1 subpd %xmm0, %xmm1 addpd %xmm0, %xmm2 movsd %xmm1, %xmm2 movaps %xmm2, -40(%rsp) movsd -32(%rsp), %xmm1 movsd -40(%rsp), %xmm0 ret
========================================
creal_binary_mul:
    fldt 8(%rsp) fldt 24(%rsp) fldt 40(%rsp) fldt 56(%rsp) fld %st(3) fmul %st(2), %st fld %st(3) fmul %st(2), %st fsubrp %st, %st(1) fxch %st(4) fmulp %st, %st(1) fxch %st(2) fmulp %st, %st(1) faddp %st, %st(1) fxch %st(1) ret
---
std_creal_binary_mul:
    fldt 40(%rsp) movq %rdi, %rax fldt 56(%rsp) fldt 24(%rsp) fldt 8(%rsp) fld %st(3) fmul %st(1), %st fld %st(2) fmul %st(4), %st fsubrp %st, %st(1) fstpt (%rdi) fxch %st(3) fmulp %st, %st(1) fxch %st(2) fmulp %st, %st(1) faddp %st, %st(1) fstpt 16(%rdi) ret
========================================
cfloat_binary_div:
    movq %xmm1, -16(%rsp) movss -16(%rsp), %xmm5 movss -12(%rsp), %xmm4 movq %xmm0, -8(%rsp) movss -8(%rsp), %xmm3 movss -4(%rsp), %xmm0 movaps %xmm5, %xmm2 movaps %xmm4, %xmm1 mulss %xmm4, %xmm1 movaps %xmm0, %xmm6 mulss %xmm5, %xmm2 mulss %xmm4, %xmm6 mulss %xmm5, %xmm0 addss %xmm1, %xmm2 movaps %xmm3, %xmm1 mulss %xmm5, %xmm1 mulss %xmm4, %xmm3 addss %xmm6, %xmm1 subss %xmm3, %xmm0 divss %xmm2, %xmm1 divss %xmm2, %xmm0 movss %xmm1, -24(%rsp) movss %xmm0, -20(%rsp) movq -24(%rsp), %xmm0 ret
---
std_cfloat_binary_div:
    movq %xmm1, %rax movdqa %xmm1, %xmm2 movdqa %xmm0, %xmm3 shrq $32, %rax movaps %xmm2, %xmm4 mulss %xmm2, %xmm4 movd %eax, %xmm1 movq %xmm0, %rax movaps %xmm1, %xmm0 shrq $32, %rax mulss %xmm1, %xmm0 movq %rax, %xmm5 movd %eax, %xmm6 mulss %xmm1, %xmm6 addss %xmm0, %xmm4 movaps %xmm2, %xmm0 mulss %xmm3, %xmm0 mulss %xmm5, %xmm2 mulss %xmm1, %xmm3 addss %xmm6, %xmm0 subss %xmm3, %xmm2 divss %xmm4, %xmm0 divss %xmm4, %xmm2 unpcklps %xmm2, %xmm0 ret
========================================
cdouble_binary_div:
    movapd %xmm0, %xmm4 movapd %xmm2, %xmm5 movapd %xmm3, %xmm0 mulsd %xmm3, %xmm0 movapd %xmm1, %xmm6 mulsd %xmm2, %xmm5 mulsd %xmm3, %xmm6 mulsd %xmm2, %xmm1 addsd %xmm0, %xmm5 movapd %xmm4, %xmm0 mulsd %xmm2, %xmm0 mulsd %xmm3, %xmm4 addsd %xmm6, %xmm0 subsd %xmm4, %xmm1 divsd %xmm5, %xmm0 divsd %xmm5, %xmm1 ret
---
std_cdouble_binary_div:
    movq %xmm2, -40(%rsp) movsd -40(%rsp), %xmm2 movq %xmm3, -32(%rsp) movapd -40(%rsp), %xmm3 movsd -32(%rsp), %xmm4 movq %xmm1, -16(%rsp) movapd -40(%rsp), %xmm1 mulsd %xmm2, %xmm2 movq %xmm0, -24(%rsp) mulsd %xmm4, %xmm4 unpcklpd %xmm3, %xmm3 movapd -24(%rsp), %xmm0 mulpd -24(%rsp), %xmm3 unpckhpd %xmm1, %xmm1 shufpd $1, %xmm0, %xmm0 mulpd %xmm1, %xmm0 addsd %xmm4, %xmm2 movapd %xmm3, %xmm1 addpd %xmm0, %xmm1 subpd %xmm0, %xmm3 unpcklpd %xmm2, %xmm2 movsd %xmm1, %xmm3 divpd %xmm2, %xmm3 movaps %xmm3, -40(%rsp) movsd -32(%rsp), %xmm1 movsd -40(%rsp), %xmm0 ret
========================================
creal_binary_div:
    fldt 8(%rsp) fldt 24(%rsp) fldt 40(%rsp) fldt 56(%rsp) fld %st(1) fmul %st(2), %st fld %st(1) fmul %st(2), %st faddp %st, %st(1) fld %st(4) fmul %st(3), %st fld %st(4) fmul %st(3), %st faddp %st, %st(1) fdiv %st(1), %st fxch %st(4) fmulp %st, %st(3) fxch %st(4) fmulp %st, %st(1) fsubrp %st, %st(1) fdivp %st, %st(2) ret
---
std_creal_binary_div:
    fldt 40(%rsp) movq %rdi, %rax fldt 56(%rsp) fldt 24(%rsp) fldt 8(%rsp) fld %st(3) fmul %st(4), %st fld %st(3) fmul %st(4), %st faddp %st, %st(1) fld %st(4) fmul %st(2), %st fld %st(3) fmul %st(5), %st faddp %st, %st(1) fdiv %st(1), %st fstpt (%rdi) fxch %st(4) fmulp %st, %st(2) fmulp %st, %st(2) fsubp %st, %st(1) fdivp %st, %st(1) fstpt 16(%rdi) ret
========================================

Just visually comparing:

- cfloat -> Complex!float: the difference looks negligible.
- creal -> Complex!real: just adds a small overhead of moving data on/off the ST registers (this is expected, and not a performance bug).
- cdouble -> Complex!double: it may look like cdouble still has a small edge, however the use of *pd instructions on the std.complex side would in fact make it quicker (i.e. one divpd is 2x faster than two divsd instructions in the cdouble_binary_div functions).
I actually found that LLVM seemed able to pick up the intent of the FastMath complex divide functions, so LDC might give more pleasing output. Benchmarks to follow soon...
Comment #16 by ibuclaw — 2021-04-24T01:06:50Z
cfloat_unary_add: 15 secs, 195 ms, 935 μs, and 5 hnsecs
std_cfloat_unary_add: 2 secs, 491 ms, 834 μs, and 9 hnsecs

cfloat_unary_sub: 14 secs, 926 ms, 587 μs, and 6 hnsecs
std_cfloat_unary_sub: 4 secs, 858 ms, 349 μs, and 4 hnsecs

cfloat_binary_add: 22 secs, 363 ms, 951 μs, and 9 hnsecs
std_cfloat_binary_add: 5 secs, 403 ms, 108 μs, and 9 hnsecs

cfloat_binary_sub: 22 secs, 236 ms, and 902 μs
std_cfloat_binary_sub: 5 secs, 266 ms, 697 μs, and 6 hnsecs

cfloat_binary_mul: 24 secs, 858 ms, 63 μs, and 7 hnsecs
std_cfloat_binary_mul: 7 secs, 186 ms, 291 μs, and 8 hnsecs

cfloat_binary_div: 30 secs, 225 ms, 114 μs, and 4 hnsecs
std_cfloat_binary_div: 17 secs, 900 ms, 164 μs, and 6 hnsecs

cfloat_binary_div(FastMath): 29 secs, 230 ms, 821 μs, and 5 hnsecs
std_cfloat_binary_div(FastMath): 12 secs, 208 ms, 118 μs, and 7 hnsecs

cdouble_unary_add: 2 secs, 788 ms, 525 μs, and 6 hnsecs
std_cdouble_unary_add: 2 secs, 922 ms, 224 μs, and 1 hnsec

cdouble_unary_sub: 2 secs, 502 ms, and 734 μs
std_cdouble_unary_sub: 2 secs, 915 ms, 203 μs, and 9 hnsecs

cdouble_binary_add: 2 secs, 869 ms, 820 μs, and 1 hnsec
std_cdouble_binary_add: 3 secs, 108 ms, 545 μs, and 4 hnsecs

cdouble_binary_sub: 2 secs, 836 ms, 796 μs, and 5 hnsecs
std_cdouble_binary_sub: 3 secs, 159 ms, 209 μs, and 3 hnsecs

cdouble_binary_mul: 4 secs, 785 ms, 197 μs, and 6 hnsecs
std_cdouble_binary_mul: 5 secs, 197 ms, 572 μs, and 9 hnsecs

cdouble_binary_div: 14 secs, 238 ms, 332 μs, and 6 hnsecs
std_cdouble_binary_div: 15 secs, 933 ms, 301 μs, and 8 hnsecs

cdouble_binary_div(FastMath): 10 secs, 700 ms, and 32 μs
std_cdouble_binary_div(FastMath): 11 secs, 8 ms, 868 μs, and 5 hnsecs

creal_unary_add: 8 secs, 183 ms, 254 μs, and 3 hnsecs
std_creal_unary_add: 14 secs, 72 ms, 96 μs, and 2 hnsecs

creal_unary_sub: 8 secs, 425 ms, 681 μs, and 9 hnsecs
std_creal_unary_sub: 10 secs, 854 ms, 312 μs, and 8 hnsecs

creal_binary_add: 3 minutes, 50 secs, 877 ms, 637 μs, and 6 hnsecs
std_creal_binary_add: 3 minutes, 57 secs, 397 ms, 952 μs, and 4 hnsecs

creal_binary_sub: 4 minutes, 4 secs, 982 ms, 715 μs, and 2 hnsecs
std_creal_binary_sub: 4 minutes, 11 secs, 485 ms, 74 μs, and 8 hnsecs

creal_binary_mul: 11 minutes, 31 secs, 328 ms, 600 μs, and 7 hnsecs
std_creal_binary_mul: 11 minutes, 46 secs, 26 ms, 451 μs, and 2 hnsecs

creal_binary_div: 20 minutes, 48 secs, 778 ms, and 747 μs
std_creal_binary_div: 20 minutes, 2 secs, 439 ms, and 535 μs

creal_binary_div(FastMath): 18 minutes, 38 secs, 613 ms, 679 μs, and 6 hnsecs
std_creal_binary_div(FastMath): 18 minutes, 42 secs, 400 ms, 343 μs, and 7 hnsecs