Bug 18627 – std.complex is a lot slower than builtin complex types at number crunching
Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86
OS
Windows
Creation time
2018-03-18T16:38:25Z
Last change time
2021-04-24T01:06:50Z
Keywords
pull
Assigned to
No Owner
Creator
ponce
Comments
Comment #0 by aliloko — 2018-03-18T16:38:25Z
BENCHMARK:
Please consider the following program:
It's just number crunching, there is no complex exponential or such function calls.
The FFT function features complex addition, multiplication and division.
--- main.d
import std.string;
import std.datetime;
import std.datetime.stopwatch : benchmark, StopWatch;
import std.complex;
import std.stdio;
import std.math;
void main()
{
cfloat[] A = new cfloat[1024];
cdouble[] B = new cdouble[1024];
Complex!float[] C = new Complex!float[1024];
Complex!double[] D = new Complex!double[1024];
foreach(i; 0..1024)
{
// Initialize with something
A[i] = i + 0i;
B[i] = i + 0i;
C[i] = Complex!float(i, 0);
D[i] = Complex!double(i, 0);
}
void fA()
{
bench!(float, cfloat, true)(A);
}
void fB()
{
bench!(double, cdouble, true)(B);
}
void fC()
{
bench!(float, Complex!float, false)(C);
}
void fD()
{
bench!(double, Complex!double, false)(D);
}
auto r = benchmark!(fA, fB, fC, fD)(1000);
Duration fAResult = r[0];
Duration fBResult = r[1];
Duration fCResult = r[2];
Duration fDResult = r[3];
writefln("With cfloat: %s", fAResult);
writefln("With cdouble: %s", fBResult);
writefln("With Complex!float: %s", fCResult);
writefln("With Complex!double: %s", fDResult);
}
void bench(T, ComplexType, bool usingBuiltinTypes)(ComplexType[] array)
{
// forward then reverse FFT
FFT!(T, ComplexType, true, usingBuiltinTypes)(array);
FFT!(T, ComplexType, false, usingBuiltinTypes)(array);
}
// just for the benchmark purpose
private void FFT(T, ComplexType, bool direction, bool usingBuiltinTypes)(ComplexType[] buffer) pure nothrow @nogc
{
int size = cast(int)(buffer.length);
int m = 10;
ComplexType* pbuffer = buffer.ptr;
// do the bit reversal
int i2 = cast(int)size / 2;
int j = 0;
for (int i = 0; i < size - 1; ++i)
{
if (i < j)
{
auto tmp = pbuffer[i];
pbuffer[i] = pbuffer[j];
pbuffer[j] = tmp;
}
int k = i2;
while(k <= j)
{
j = j - k;
k = k / 2;
}
j += k;
}
// compute the FFT
static if (usingBuiltinTypes)
{
ComplexType c = -1+0i;
}
else
{
ComplexType c = -1;
}
int l2 = 1;
for (int l = 0; l < m; ++l)
{
int l1 = l2;
l2 = l2 * 2;
static if (usingBuiltinTypes)
{
ComplexType u = 1+0i;
}
else
{
ComplexType u = 1;
}
for (int j2 = 0; j2 < l1; ++j2)
{
int i = j2;
while (i < size)
{
int i1 = i + l1;
ComplexType t1 = u * pbuffer[i1];
pbuffer[i1] = pbuffer[i] - t1;
pbuffer[i] += t1;
i += l2;
}
u = u * c;
}
T newImag = sqrt((1 - c.re) / 2);
static if (direction)
newImag = -newImag;
T newReal = sqrt((1 + c.re) / 2);
static if (usingBuiltinTypes)
{
c = newReal + 1.0fi * newImag;
}
else
{
c = ComplexType(newReal, newImag);
}
}
// scaling when doing the reverse transformation, to avoid being multiplied by size
static if (!direction)
{
T divider = 1 / cast(T)size;
for (int i = 0; i < size; ++i)
{
pbuffer[i] = pbuffer[i] * divider;
}
}
}
BENCHMARKS
- With DMD v2.079.0, Windows:
$ rdmd -O -inline -release main.d
With cfloat: 1 sec, 187 ms, 225 ╬╝s, and 5 hnsecs
With cdouble: 1 sec, 315 ms, 385 ╬╝s, and 7 hnsecs
With Complex!float: 2 secs, 451 ms, 319 ╬╝s, and 2 hnsecs (!)
With Complex!double: 11 secs, 539 ms, and 96 ╬╝s (!!!)
$ rdmd -O -inline -release main.d -m64
With cfloat: 1 sec, 196 ms, 312 ╬╝s, and 7 hnsecs
With cdouble: 1 sec, 295 ms, 927 ╬╝s, and 9 hnsecs
With Complex!float: 2 secs, 505 ms, 536 ╬╝s, and 4 hnsecs (!)
With Complex!double: 11 secs, 590 ms, 574 ╬╝s, and 4 hnsecs (!!!)
- With LDC 1.8.0 32-bit:
$ ldc2.exe -O3 -release -enable-inlining main.d
With cfloat: 564 ms, 669 us, and 9 hnsecs
With cdouble: 535 ms, 570 us, and 2 hnsecs
With Complex!float: 563 ms, 809 us, and 4 hnsecs
With Complex!double: 537 ms, 289 us, and 6 hnsecs
- With LDC 1.8.0 64-bit:
$ ldc2.exe -O3 -release -enable-inlining main.d
With cfloat: 577 ms, 472 us, and 5 hnsecs
With cdouble: 554 ms, 357 us, and 6 hnsecs
With Complex!float: 588 ms, 463 us, and 9 hnsecs
With Complex!double: 558 ms, 881 us, and 7 hnsecs
COMMENTARY
We are looking to a guaranteed 2x to 10x regression when using complex number + DMD and complex arithmetic.
Latest LDC seems to have fixed this discrepancy but I'm still using an older LDC where the difference is 2x in favour of builtin complexes.
This will definately make my life harder and I don't understand why the rug is pulled below builtin complex numbers.
While I do use another FFT optimized with LDC intrinsics, there is a lot of complex number crunching in phase vocoders, filtering, and many signal processing tasks. I find the builtin complex in D incredibly well suited to this.
This is a competive avantage, even for readability. Please reconsider the deprecation.
FYI, GDC is missing, but I'll post it anyway, along with DMD as a comparative benchmark, because each machine is different and DMD may optimize weirdly for one CPU but is perfectly fine for another (see for instance issue 5100)
DMD64 D Compiler v2.076.1
---
$ dmd complex.d -O -inline -release
With cfloat: 75 ms, 688 μs, and 2 hnsecs
With cdouble: 61 ms, 546 μs, and 7 hnsecs
With Complex!float: 161 ms, 816 μs, and 8 hnsecs
With Complex!double: 109 ms, 66 μs, and 1 hnsec
---
There seems to be room for improvement in dmd or the general phobos implementation.
gdc (GCC) 8.0.1 20180226 (2.076.1 library and patches)
---
$ gdc complex.d -O2 -frelease
With cfloat: 154 ms, 871 μs, and 8 hnsecs
With cdouble: 59 ms, 205 μs, and 7 hnsecs
With Complex!float: 32 ms, 566 μs, and 5 hnsecs
With Complex!double: 34 ms, 961 μs, and 6 hnsecs
---
However with gdc, std.complex is /faster/ than native.
Comment #4 by aliloko — 2018-03-20T12:58:43Z
This benchmark is a variation that does only division.
--- divide.d
import std.string;
import std.datetime;
import std.datetime.stopwatch : benchmark, StopWatch;
import std.complex;
import std.stdio;
import std.math;
void main()
{
int[] divider = new int[1024];
cfloat[] A = new cfloat[1024];
cdouble[] B = new cdouble[1024];
Complex!float[] C = new Complex!float[1024];
Complex!double[] D = new Complex!double[1024];
foreach(i; 0..1024)
{
divider[i] = (i*69060) / 10000;
// Initialize with something
A[i] = i + 1i;
B[i] = i + 1i;
C[i] = Complex!float(i, 1);
D[i] = Complex!double(i, 1);
}
void justDivide(ComplexType)(ComplexType[] arr)
{
int size = cast(int)(arr.length);
for (int i = 0; i < size; ++i)
{
arr[i] = divider[i] / arr[i];
}
}
void fA()
{
justDivide!(cfloat)(A);
}
void fB()
{
justDivide!(cdouble)(B);
}
void fC()
{
justDivide!(Complex!float)(C);
}
void fD()
{
justDivide!(Complex!double)(D);
}
auto r = benchmark!(fA, fB, fC, fD)(1000000);
{
writefln("With cfloat: %s", r[0] );
writefln("With cdouble: %s", r[1] );
writefln("With Complex!float: %s", r[2] );
writefln("With Complex!double: %s", r[3] );
}
}
RESULTS
* With ldc 1.8.0 64-bit:
$ ldc2.exe -O3 -enable-inlining -release divide.d -m64
$ divide.exe
With cfloat: 7 secs, 623 ms, 829 ╬╝s, and 9 hnsecs
With cdouble: 7 secs, 594 ms, 449 ╬╝s, and 8 hnsecs
With Complex!float: 7 secs, 988 ms, 642 ╬╝s, and 4 hnsecs
With Complex!double: 15 secs, 501 ms, 128 ╬╝s, and 4 hnsecs
* With ldc 1.8.0 32-bit:
$ ldc2.exe -O3 -enable-inlining -release divide.d -m32
$ divide.exe
With cfloat: 7 secs, 618 ms, 202 ╬╝s, and 1 hnsec
With cdouble: 7 secs, 593 ms, 777 ╬╝s, and 2 hnsecs
With Complex!float: 7 secs, 958 ms, 692 ╬╝s, and 9 hnsecs
With Complex!double: 15 secs, 414 ms, and 344 ╬╝s
This show that even with latest LDC you can have a regression.
I appreciate that std.complex gives more precision in the divide operation, it's also something that is _different_ from builtin complex it replaces.
Comment #5 by aliloko — 2018-03-20T13:30:50Z
Division with DMD 32-bit:
With cfloat: 1 minute, 18 secs, 451 ms, 932 ╬╝s, and 9 hnsecs
With cdouble: 1 minute, 19 secs, 747 ms, 70 ╬╝s, and 5 hnsecs
With Complex!float: 27 secs, 412 ms, 926 ╬╝s, and 5 hnsecs
With Complex!double: 25 secs, 39 ms, 159 ╬╝s, and 2 hnsecs
Comment #6 by aliloko — 2018-03-20T13:46:12Z
Conversely complex divide seems faster with DMD with std.complex than builtins.
Comment #7 by greensunny12 — 2018-03-20T13:50:13Z
> Division with DMD 32-bit:
Using DMD for any performance arguments is a bit of a moot point as DMD's optimizer is pretty bad. So this would halt almost all development as there are many many performance regressions with DMD.
Comment #8 by aliloko — 2018-03-20T20:19:35Z
@Seb: It's not only about DMD, there is a 2x performance regression with Complex!double vs cdouble using LDC. There are probably more I haven't exposed yet. And yes, I use cdouble for designing IIR filters, in a real-time program.
Our main product use builtin complexes, it's downloaded 2000 times per month.
Comment #9 by aliloko — 2018-03-22T11:08:32Z
I think at the very least std.complex should contain a function to divide Complex without the additional precision provided by the check with the 2 fabs().
People that want speed could opt-in, and others will enjoy increased precision without noticing.
Comment #10 by ibuclaw — 2020-07-05T16:33:48Z
(In reply to ponce from comment #4)
> This benchmark is a variation that does only division.
>
> --- divide.d
* With gdc -O2 -frelease -m64
With cfloat: 11 secs, 204 ms, 475 μs, and 2 hnsecs
With cdouble: 13 secs, 420 ms, 497 μs, and 6 hnsecs
With Complex!float: 4 secs, 689 ms, 546 μs, and 2 hnsecs
With Complex!double: 8 secs, 903 ms, 172 μs, and 4 hnsecs
* With gdc -O2 -frelease -m32
With cfloat: 29 secs, 471 ms, 678 μs, and 9 hnsecs
With cdouble: 29 secs, 176 ms, 189 μs, and 2 hnsecs
With Complex!float: 13 secs, 379 ms, 856 μs, and 8 hnsecs
With Complex!double: 18 secs, 240 ms, 975 μs, and 5 hnsecs
Native complex floating point must die.
Comment #11 by dlang-bot — 2021-02-28T00:23:19Z
@ibuclaw created dlang/phobos pull request #7814 "fix Issue 18627 - Use cephes algorithm for complex divide" fixing this issue:
- fix Issue 18627 - Use cephes algorithm for complex divide
https://github.com/dlang/phobos/pull/7814
Comment #12 by dlang-bot — 2021-03-22T08:18:51Z
dlang/phobos pull request #7814 "fix Issue 18627 - Use cephes algorithm for complex divide" was merged into master:
- 70595f5d51011a6258d001523c8749411b9d8152 by Iain Buclaw:
fix Issue 18627 - Use cephes algorithm for complex divide
https://github.com/dlang/phobos/pull/7814
Comment #13 by ibuclaw — 2021-03-24T09:20:10Z
(In reply to ponce from comment #4)
> RESULTS
>
> * With ldc 1.8.0 64-bit:
>
> $ ldc2.exe -O3 -enable-inlining -release divide.d -m64
> $ divide.exe
>
> With cfloat: 7 secs, 623 ms, 829 ╬╝s, and 9 hnsecs
> With cdouble: 7 secs, 594 ms, 449 ╬╝s, and 8 hnsecs
> With Complex!float: 7 secs, 988 ms, 642 ╬╝s, and 4 hnsecs
> With Complex!double: 15 secs, 501 ms, 128 ╬╝s, and 4 hnsecs
>
>
> * With ldc 1.8.0 32-bit:
>
> $ ldc2.exe -O3 -enable-inlining -release divide.d -m32
> $ divide.exe
>
> With cfloat: 7 secs, 618 ms, 202 ╬╝s, and 1 hnsec
> With cdouble: 7 secs, 593 ms, 777 ╬╝s, and 2 hnsecs
> With Complex!float: 7 secs, 958 ms, 692 ╬╝s, and 9 hnsecs
> With Complex!double: 15 secs, 414 ms, and 344 ╬╝s
>
>
> This show that even with latest LDC you can have a regression.
>
> I appreciate that std.complex gives more precision in the divide operation,
> it's also something that is _different_ from builtin complex it replaces.
A bug probably should be raised against LDC for not using range reduction (i.e: Smiths algorithm) in their native complex division implementation.
The slowdown is not a regression, LDC is just using the wrong algorithm by default (i.e: the "fast" naive version should be generated only when compiling with `-ffast-math`).
GDC and LDC could coordinate with each other and predefine `version(FastMath)` when the `-ffast-math` flag is given on the command-line.
Comment #14 by aliloko — 2021-03-24T09:30:35Z
No problem from here, a lot of our complex code is now SIMD. I doubt we'll see a practical problem apart from the transition work. It's easy to recreate the desired division algorithm manually if ever needed.