Bug 9963 – Absurdly Inefficient Codegen For Adding Boolean Predicates
Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P2
Component
dmd
Product
D
Version
D2
Platform
All
OS
All
Creation time
2013-04-19T12:25:00Z
Last change time
2015-06-09T05:11:47Z
Keywords
performance
Assigned to
nobody
Creator
dsimcha
Comments
Comment #0 by dsimcha — 2013-04-19T12:25:32Z
D source Code:
__gshared ulong n_less = 0, n_greater = 0;
void doConditional(ubyte thresh, ubyte[] arr) {
ulong l, g;
foreach(val; arr) {
l += (thresh < val);
g += !(thresh < val);
}
n_less += l;
n_greater += g;
}
DMD-generated ASM code (foreach loop only, from obj2asm, when compiled with -O -inline -release):
L33: mov RDX,-018h[RBP]
mov CL,[RDX][R8]
cmp CL,R9B
mov EAX,1
ja L47
xor EAX,EAX
L47: cdqe
add R11,RAX
cmp R9B,CL
sbb EAX,EAX
inc EAX
cdqe
add RBX,RAX
inc R8
cmp R8,-010h[RBP]
jb L33
Why use sbb + neg + two cmp instructions instead of just using setb and setae? This executes in about 0.495 seconds for an array of 100 million elements.
GCC's codegen for the same function:
L20: movzx ECX,[RAX][RDX]
xor R10D,R10D
cmp ECX,EDI
setnle R10B
add R9,R10
cmp ECX,EDI
setle CL
add RAX,1
movzx ECX,CL
add R8,RCX
cmp RAX,RSI
jne L20
This executes in about 0.095 seconds for an array of 100 million elements.
My hand-compilation for this loop:
LStart:
cmp DL, byte ptr [RAX];
setae R9B;
adc R10, 0;
inc RAX;
add R11, R9;
cmp RAX, RBX;
jb LStart;
This executes in about 0.071 seconds for an array of 100 million elements.