Comment #0 by bearophile_hugs — 2011-09-12T16:20:16Z
Equality comparisons between very short arrays are quite slow in D/DMD. When the arrays are both fixed-size and short, I suggest inlining the item comparisons instead of calling a runtime function, to speed up the code significantly.
A benchmark:
bool eq1(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
    return a == b;
}

bool eq2(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
    foreach (size_t i; 0 .. a.length)
        if (a[i] != b[i])
            return false;
    return true;
}

void main() {
    import std.stdio, std.random;

    enum size_t N = 3;
    enum size_t M = 100;
    rndGen.seed(1);

    int[N][M] data;
    foreach (ref d; data)
        foreach (ref x; d)
            x = uniform(0, 4);

    int total;
    foreach (i; 0 .. 20_000)
        foreach (j; 0 .. M)
            foreach (k; 0 .. M)
                static if (1)
                    total += eq1(data[j], data[k]); // 11.5 seconds
                else
                    total += eq2(data[j], data[k]); // 2.45 seconds

    writeln(total);
}
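As a point of reference, here is a minimal sketch, not part of the original report, of what the suggested inlining looks like when written by hand: with N known at compile time, the whole comparison can be unrolled into N scalar compares with no loop and no runtime call. The name eqUnrolled is invented here, and static foreach only exists in compiler releases much newer than the one benchmarked above.

bool eqUnrolled(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
    // Each compile-time iteration emits one element comparison, so for
    // int[3] this becomes three compares, which is roughly what the
    // requested optimization would generate automatically.
    static foreach (i; 0 .. N)
    {
        if (a[i] != b[i])
            return false;
    }
    return true;
}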
Comment #1 by bearophile_hugs — 2013-03-31T03:29:16Z
After the closing of Issue 9477 the situation has changed a little.
On the same PC, with DMD 2.063alpha, the run time with eq2() is 2.04 seconds and the run time with eq1() is 6.37 seconds (3.12x slower).
The asm produced by dmd (-O -release -inline -noboundscheck) for the two versions:
eq1:
        push    EAX
        mov     ECX, 0Ch
        push    ESI
        mov     ESI, 0Ch[ESP]
        push    EDI
        mov     EDI, EAX
        xor     EAX, EAX
        rep
        cmpsb
        je      L19
        sbb     EAX, EAX
        sbb     EAX, 0FFFFFFFFh
L19:    neg     EAX
        sbb     EAX, EAX
        inc     EAX
        pop     EDI
        pop     ESI
        pop     ECX
        ret     4

eq2:
        push    EBX
        mov     EDX, EAX
        mov     ECX, 8[ESP]
        xor     EBX, EBX
L9:     mov     EAX, [EBX*4][ECX]
        cmp     EAX, [EBX*4][EDX]
        je      L17
        pop     EBX
        xor     EAX, EAX
        ret     4
L17:    inc     EBX
        cmp     EBX, 3
        jb      L9
        pop     EBX
        mov     EAX, 1
        ret     4
Is cmpsb efficient on modern CPUs? It seems the answer may be no:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
Comment #2 by dlang-bugzilla — 2013-04-01T14:56:15Z
Can we close this issue as a duplicate of issue 5282?
> Is cmpsb efficient on modern CPUs?
As the linked discussion mentions, it's not as efficient for short, fixed-size data blocks. Backends other than DMD might take advantage of a known array length, but DMD does not.
Comment #3 by bearophile_hugs — 2013-04-01T16:07:15Z
(In reply to comment #2)
> Can we close this issue as a duplicate of issue 5282?
>
> > Is cmpsb efficient on modern CPUs?
>
> As the linked discussion mentions, it's not as efficient for short, fixed-size
> data blocks. Backends other than DMD might take advantage of a known array
> length, but DMD does not.
Do you know why DMD can't be smarter here? The compiler knows the types and lengths, so it knows when such a situation occurs. At worst, calling a template function defined in the object module could solve the problem.
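A sketch of that fallback idea, assuming no such helper exists (equalsStaticArray is an invented name, not a druntime symbol): the frontend, which knows T and N, could rewrite a == b on static arrays into a call to a templated helper in the object module, essentially eq2 moved into druntime, which any backend can then inline.

bool equalsStaticArray(T, size_t N)(const ref T[N] a, const ref T[N] b) {
    // Hypothetical object-module helper: a == b on T[N] operands would be
    // lowered to equalsStaticArray!(T, N)(a, b) instead of a memcmp-style
    // runtime call; being a template, its body is visible to the optimizer.
    foreach (i; 0 .. N)
        if (a[i] != b[i])
            return false;
    return true;
}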
Comment #4 by dlang-bugzilla — 2013-04-01T16:10:35Z
It can, it's just that no one has implemented such an optimization yet. Keep in mind that the patch would be against DMDBE (assuming the strategy would be to optimize the memcmp intrinsic), so it would only improve the non-free, non-redistributable part of one D compiler. D users who want their programs to be fast should use GDC or LDC. In contrast, the patch to issue 9477 was in the frontend, so it benefits all of GDC, LDC and DMD.
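For illustration only, as an assumption about what optimizing the memcmp intrinsic would amount to rather than actual DMD backend code: for a 12-byte block of plain ints, a memcmp-for-equality call can be replaced by three word-sized compares, i.e. roughly the code that eq2 already compiles to. The name eq12bytes is invented for this sketch.

bool eq12bytes(const(uint)* a, const(uint)* b) pure nothrow @nogc {
    // Three 32-bit compares instead of `rep cmpsb` over 12 bytes;
    // equivalent to memcmp(a, b, 12) == 0 for this fixed size.
    return a[0] == b[0] && a[1] == b[1] && a[2] == b[2];
}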
Comment #5 by ibuclaw — 2013-04-02T00:32:11Z
(In reply to comment #4)
> It can, it's just that no one has implemented such an optimization yet. Keep in
> mind that the patch would be against DMDBE (assuming the strategy would be to
> optimize the memcmp intrinsic), so it would only improve the non-free,
> non-redistributable part of one D compiler. D users who want their programs to
> be fast should use GDC or LDC. In contrast, the patch to issue 9477 was in the
> frontend, so it benefits all of GDC, LDC and DMD.
I'd just like to say that statement is false. It wasn't a patch in the frontend, and no one but DMD benefits.
Comment #6 by dlang-bugzilla — 2013-04-02T00:35:39Z
(In reply to comment #5)
> I'd just like to say that statement is false. It wasn't a patch in the
> frontend, and no one but DMD benefits.
Really??
We're talking about this, right?
https://github.com/D-Programming-Language/dmd/commit/c984175cf25dfa17b3956e8e33ff83547fa56b0a
I thought GDC/LDC simply provided their own implementation of el_* and the elem type.
Comment #7 by dlang-bugzilla — 2013-04-02T00:42:00Z
I don't understand why GDC and LDC did not use the existing code in e2ir. This would imply that every implementation has to reimplement exactly the same things as DMD (emitting calls to druntime, for example).
Comment #8 by dlang-bugzilla — 2013-04-02T00:55:41Z
DMD: https://github.com/D-Programming-Language/dmd/blob/v2.062/src/e2ir.c#L2431
GDC: https://github.com/D-Programming-GDC/GDC/blob/master/gcc/d/d-elem.cc#L100
LDC: https://github.com/ldc-developers/ldc/blob/master/gen/arrays.cpp#L812
It's all reimplementations of the same thing. I guess there was a reason why the layer of abstraction was chosen to be this high, but in this situation it's not doing anyone any favors.
Comment #9 by ibuclaw — 2013-04-02
(In reply to comment #8)
> It's all reimplementations of the same thing. I guess there was a reason why
> the layer of abstraction was chosen to be this high, but in this situation
> it's not doing anyone any favors.
(In reply to comment #6)
> I thought GDC/LDC simply provided their own implementation of el_* and the
> elem type.
When gdc was first released, it provided its own implementation of dt* and the dt_t type. This is something I now widely regard as a bad move. The DMD implementation of toobj.c, todt.c and typinf.c has been drifting further from GDC's copy of that part of the frontend, and has become a pain to maintain on each frontend update. Frankly, I can't see it ever working to hack and wrap dmd backend calls into producing something that another backend sees as correct. So I'll be yanking those three files out of dfrontend/ sometime around this fall too.
Comment #10 by dlang-bugzilla — 2013-04-02T02:50:44Z
I don't know what dt_t is, but judging from all 3 of its mentions in the 5381-line e2ir.c, I can only suppose that it is a leaky abstraction poking out of the DMD backend.
Anyway, I don't see how this applies to e2ir.c, since, as I've mentioned, dt_t only occurs 3 times in the file.
Comment #11 by ibuclaw — 2013-04-02T03:01:26Z
(In reply to comment #10)
> I don't know what dt_t is, but judging from all 3 of its mentions in the
> 5381-line e2ir.c, I can only suppose that it is a leaky abstraction poking out
> of the DMD backend.
>
> Anyway, I don't see how this applies to e2ir.c, since, as I've mentioned, dt_t
> only occurs 3 times in the file.
In brief: why define all these dmd backend symbols (OPcall) that are of no use to gcc, when you can just build gcc trees (CALL_EXPR) directly?
Comment #12 by dlang-bugzilla — 2013-04-02T11:57:32Z
I don't understand the argument.
Because... choosing a lower (and simpler) layer of abstraction would mean that less code would need to be reimplemented?
As I understand it, every time DMD implements a new expression type (for example, dot multiply), GDC and LDC would need to be updated. However, DMD already has a glue layer that lowers all of the D-specific expressions to something closer to abstract machine code, which I'd think is what alternative backends would be more interested in.
Comment #13 by dmitry.olsh — 2018-05-31T12:05:52Z
Still true, even after the new druntime patch for templated array ops.
Interestingly, using array ops on static arrays is faster than plain == on static arrays:
// New code
bool eq1(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
    return a == b;
}

bool eq2(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
    foreach (size_t i; 0 .. a.length)
        if (a[i] != b[i])
            return false;
    return true;
}

bool eq3(T, size_t N)(const ref T[N] a, const ref T[N] b) pure nothrow {
    return a[] == b[];
}

void main() {
    import std.stdio, std.random;

    enum size_t N = 3;
    enum size_t M = 100;
    rndGen.seed(1);

    int[N][M] data;
    foreach (ref d; data)
        foreach (ref x; d)
            x = uniform(0, 4);

    int total;
    foreach (i; 0 .. 20_000)
        foreach (j; 0 .. M)
            foreach (k; 0 .. M)
                version (first)
                    total += eq1(data[j], data[k]); // 11.5 seconds
                else version (second)
                    total += eq2(data[j], data[k]); // 2.45 seconds
                else version (third)
                    total += eq3(data[j], data[k]);

    writeln(total);
}
All 3 versions on my system:
▶ time ./saa
4960000
./saa 0.69s user 0.00s system 99% cpu 0.688 total
▶ dmd -release -O -inline saa.d
▶ time ./saa
4960000
./saa 3.13s user 0.00s system 99% cpu 3.133 total
▶ time ./saa
4960000
./saa 1.32s user 0.00s system 99% cpu 1.322 total
Comment #14 by robert.schadek — 2024-12-13T17:56:24Z