Bug 6658 – Slow static array equality

Status
NEW
Severity
enhancement
Priority
P4
Component
dmd
Product
D
Version
D2
Platform
x86
OS
Windows
Creation time
2011-09-12T16:20:16Z
Last change time
2024-12-13T17:56:24Z
Keywords
performance
Assigned to
No Owner
Creator
bearophile_hugs
Moved to GitHub: dmd#18369

Comments

Comment #0 by bearophile_hugs — 2011-09-12T16:20:16Z
Equality comparisons between very short arrays are quite slow in D/DMD. When the arrays are both fixed-size and short, I suggest inlining the item comparisons instead of calling a runtime function, to speed up the code significantly. A benchmark:

    bool eq1(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
        return a == b;
    }

    bool eq2(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
        foreach (size_t i; 0 .. a.length)
            if (a[i] != b[i])
                return false;
        return true;
    }

    void main() {
        import std.stdio, std.random;
        enum size_t N = 3;
        enum size_t M = 100;
        rndGen.seed(1);
        int[N][M] data;
        foreach (ref d; data)
            foreach (ref x; d)
                x = uniform(0, 4);
        int total;
        foreach (i; 0 .. 20_000)
            foreach (j; 0 .. M)
                foreach (k; 0 .. M)
                    static if (1)
                        total += eq1(data[j], data[k]); // 11.5 seconds
                    else
                        total += eq2(data[j], data[k]); // 2.45 seconds
        writeln(total);
    }
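A minimal sketch of the inlined comparison the report is asking the compiler to emit, written by hand with `static foreach` so the loop is unrolled at compile time and no runtime function is called. The name `eqUnrolled` is illustrative, not part of any compiler or library:

```d
// Hand-unrolled equality for short fixed-size arrays: the static foreach
// expands into N element comparisons at compile time.
bool eqUnrolled(T, size_t N)(const ref T[N] a, const ref T[N] b)
    pure nothrow @nogc @safe
{
    static foreach (i; 0 .. N)
        if (a[i] != b[i])   // plain element compare, no runtime call
            return false;
    return true;
}

void main()
{
    int[3] x = [1, 2, 3];
    int[3] y = [1, 2, 3];
    int[3] z = [1, 2, 4];
    assert(eqUnrolled(x, y));
    assert(!eqUnrolled(x, z));
}
```

This is roughly what the hand-written eq2() above compiles to once the loop bound is a compile-time constant.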
Comment #1 by bearophile_hugs — 2013-03-31T03:29:16Z
After closing Issue 9477 the situation has changed a little. On the same PC with DMD 2.063alpha the run-time with eq2() is 2.04 seconds and the run-time with eq1() is 6.37 seconds (3.12X). The asm produced by dmd (-O -release -inline -noboundscheck) for the two versions:

    eq1:
            push EAX
            mov  ECX, 0Ch
            push ESI
            mov  ESI, 0Ch[ESP]
            push EDI
            mov  EDI, EAX
            xor  EAX, EAX
            rep  cmpsb
            je   L19
            sbb  EAX, EAX
            sbb  EAX, 0FFFFFFFFh
    L19:    neg  EAX
            sbb  EAX, EAX
            inc  EAX
            pop  EDI
            pop  ESI
            pop  ECX
            ret  4

    eq2:
            push EBX
            mov  EDX, EAX
            mov  ECX, 8[ESP]
            xor  EBX, EBX
    L9:     mov  EAX, [EBX*4][ECX]
            cmp  EAX, [EBX*4][EDX]
            je   L17
            pop  EBX
            xor  EAX, EAX
            ret  4
    L17:    inc  EBX
            cmp  EBX, 3
            jb   L9
            pop  EBX
            mov  EAX, 1
            ret  4

Is cmpsb efficient on modern CPUs? Maybe the answer is negative: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
Comment #2 by dlang-bugzilla — 2013-04-01T14:56:15Z
Can we close this issue as a duplicate of issue 5282?

> Is cmpsb efficient on modern CPUs?

As the linked discussion mentions, it's not as efficient for short, fixed-size data blocks. Backends other than DMD might take advantage of a known array length, but DMD does not.
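For context on the memcmp angle discussed here: when the length argument is a compile-time constant, a backend that treats memcmp as an intrinsic can expand the comparison inline instead of emitting a call. A hand-written sketch of that lowering, assuming byte-wise equality is valid for the element type (true for integers, not for floating-point values or types with padding); `eqMemcmp` is an illustrative name, not an existing API:

```d
import core.stdc.string : memcmp;

// N * T.sizeof is a compile-time constant here, which is what lets an
// optimizing backend replace the memcmp call with inline code.
bool eqMemcmp(T, size_t N)(const ref T[N] a, const ref T[N] b)
{
    return memcmp(a.ptr, b.ptr, N * T.sizeof) == 0;
}

void main()
{
    int[3] x = [1, 2, 3];
    int[3] y = [1, 2, 3];
    assert(eqMemcmp(x, y));
    x[0] = 9;
    assert(!eqMemcmp(x, y));
}
```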
Comment #3 by bearophile_hugs — 2013-04-01T16:07:15Z
(In reply to comment #2)
> Can we close this issue as a duplicate of issue 5282?
>
> > Is cmpsb efficient on modern CPUs?
>
> As the linked discussion mentions, it's not as efficient for short, fixed-size
> data blocks. Backends other than DMD might take advantage of a known array
> length, but DMD does not.

Do you know why DMD can't be smarter here? The compiler knows the types and lengths, so it knows when one of these situations occurs. At worst, calling a template function defined in the object module would be enough to handle it.
Comment #4 by dlang-bugzilla — 2013-04-01T16:10:35Z
It can, it's just that no one has implemented such an optimization yet. Keep in mind that the patch would be against DMDBE (assuming the strategy would be to optimize the memcmp intrinsic), so it would only improve the non-free, non-redistributable part of one D compiler. D users who want their programs to be fast should use GDC or LDC. In contrast, the patch to issue 9477 was in the frontend, so it benefits all of GDC, LDC and DMD.
Comment #5 by ibuclaw — 2013-04-02T00:32:11Z
(In reply to comment #4)
> It can, it's just that no one has implemented such an optimization yet. Keep in
> mind that the patch would be against DMDBE (assuming the strategy would be to
> optimize the memcmp intrinsic), so it would only improve the non-free,
> non-redistributable part of one D compiler. D users who want their programs to
> be fast should use GDC or LDC. In contrast, the patch to issue 9477 was in the
> frontend, so it benefits all of GDC, LDC and DMD.

I'd just like to say that statement is false. It wasn't a patch in the frontend, and no one but DMD benefits.
Comment #6 by dlang-bugzilla — 2013-04-02T00:35:39Z
(In reply to comment #5)
> I'd just like to say that statement is false. It wasn't a patch in the
> frontend, and no one but DMD benefits.

Really?? We're talking about this, right?
https://github.com/D-Programming-Language/dmd/commit/c984175cf25dfa17b3956e8e33ff83547fa56b0a

I thought GDC/LDC simply provided their own implementation of el_* and the elem type.
Comment #7 by dlang-bugzilla — 2013-04-02T00:42:00Z
I don't understand why GDC and LDC did not use the existing code in e2ir. This would imply that every implementation would have to reimplement exactly the same things as DMD (emit calls to Druntime, for example).
Comment #8 by dlang-bugzilla — 2013-04-02T00:55:41Z
DMD: https://github.com/D-Programming-Language/dmd/blob/v2.062/src/e2ir.c#L2431
GDC: https://github.com/D-Programming-GDC/GDC/blob/master/gcc/d/d-elem.cc#L100
LDC: https://github.com/ldc-developers/ldc/blob/master/gen/arrays.cpp#L812

It's all reimplementations of the same thing. I guess there was a reason why the layer of abstraction was chosen to be this high, but in this situation it's not doing anyone any favors.

Well, I guess that now that DMD is faster than both GDC and LDC at something, the burden is on those compilers' maintainers to catch up with it :)
Comment #9 by ibuclaw — 2013-04-02T02:45:00Z
(In reply to comment #8)
> It's all reimplementations of the same thing. I guess there was a reason why
> the layer of abstraction was chosen to be this high, but in this situation
> it's not doing anyone any favors.

(In reply to comment #6)
> I thought GDC/LDC simply provided their own implementation of el_* and the elem
> type.

When gdc was first released, it provided its own implementation of dt* and the dt_t type. This is something I now regard as a bad move. The DMD implementation of toobj.c, todt.c and typinf.c has been drifting further from GDC's copy of that part of the frontend, and so has become a pain to maintain on each frontend update. Frankly, I can never see it working to hack and wrap dmd backend calls into producing something that another backend sees as correct. So I'll be yanking those three files out of dfrontend/ sometime around this fall too.
Comment #10 by dlang-bugzilla — 2013-04-02T02:50:44Z
I don't know what dt_t is, but judging from all 3 of its mentions in the 5381-line e2ir.c, I can only suppose that it is a leaky abstraction poking out of the DMD backend. Anyway, I don't see how this applies to e2ir.c, since, as I've mentioned, dt_t only occurs 3 times in the file.
Comment #11 by ibuclaw — 2013-04-02T03:01:26Z
(In reply to comment #10)
> I don't know what dt_t is, but judging from all 3 of its mentions in the
> 5381-line e2ir.c, I can only suppose that it is a leaky abstraction poking out
> of the DMD backend.
>
> Anyway, I don't see how this applies to e2ir.c, since, as I've mentioned, dt_t
> only occurs 3 times in the file.

In brief: why define all these dmd backend symbols (OPcall) that are of no use to gcc, when you can just build gcc trees directly (CALL_EXPR)?
Comment #12 by dlang-bugzilla — 2013-04-02T11:57:32Z
I don't understand the argument. Because... choosing a lower (and simpler) layer of abstraction would mean that less code would need to be reimplemented? As I understand it, every time DMD implements a new expression type (for example, dot multiply), GDC and LDC would need to be updated. However, DMD already has a glue layer that lowers all of the D-specific expressions to something closer to abstract machine code, which I'd think is what alternative backends would be more interested in.
Comment #13 by dmitry.olsh — 2018-05-31T12:05:52Z
Still true, even after the new DRuntime patch for templated array ops. Interestingly, using array ops on static arrays is faster than plain == on static arrays:

    // New code
    bool eq1(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
        return a == b;
    }

    bool eq2(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
        foreach (size_t i; 0 .. a.length)
            if (a[i] != b[i])
                return false;
        return true;
    }

    bool eq3(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
        return a[] == b[];
    }

    void main() {
        import std.stdio, std.random;
        enum size_t N = 3;
        enum size_t M = 100;
        rndGen.seed(1);
        int[N][M] data;
        foreach (ref d; data)
            foreach (ref x; d)
                x = uniform(0, 4);
        int total;
        foreach (i; 0 .. 20_000)
            foreach (j; 0 .. M)
                foreach (k; 0 .. M)
                    version (first)
                        total += eq1(data[j], data[k]); // 11.5 seconds
                    else version (second)
                        total += eq2(data[j], data[k]); // 2.45 seconds
                    else version (third)
                        total += eq3(data[j], data[k]);
        writeln(total);
    }

All 3 versions on my system:

    ▶ time ./saa
    4960000
    ./saa  0.69s user 0.00s system 99% cpu 0.688 total
    ▶ dmd -release -O -inline saa.d
    ▶ time ./saa
    4960000
    ./saa  3.13s user 0.00s system 99% cpu 3.133 total
    ▶ time ./saa
    4960000
    ./saa  1.32s user 0.00s system 99% cpu 1.322 total
Comment #14 by robert.schadek — 2024-12-13T17:56:24Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/dmd/issues/18369 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB