Bug 16605 – core.simd generates slow/irrelevant code

Status
RESOLVED
Resolution
INVALID
Severity
minor
Priority
P1
Component
dmd
Product
D
Version
D2
Platform
x86_64
OS
Linux
Creation time
2016-10-08T10:04:44Z
Last change time
2020-03-21T03:56:39Z
Keywords
SIMD
Assigned to
No Owner
Creator
Malte Kießling

Comments

Comment #0 by malte.kiessling — 2016-10-08T10:04:44Z
I tried working with core.simd. I noticed that (at least for trivial operations like +=, *= etc.) the generated code is rather slow (slower than without SSE instructions!). I used asm.dlang.org with the newest dmd to get the results below.

This code:
****
import core.simd;

void doStuff()
{
    float4 x = [1.0,0.4,1234.0,124.0];
    float4 y = [1.0,0.4,1234.0,124.0];
    float4 z = [1.0,0.4,1234.0,123.0];
    for(long i = 0; i<1_000_000; i++)
    {
        x += y;
        x += z;
        z += x;
    }
}
****
Results in the following assembly (I only pasted the function):
****
void example.doStuff():
 push rbp
 mov rbp,rsp
 sub rsp,0x40
 movaps xmm0,XMMWORD PTR [rip+0x0] # f <void example.doStuff()+0xf>
 movaps XMMWORD PTR [rbp-0x40],xmm0
 movaps xmm1,XMMWORD PTR [rip+0x0] # 1a <void example.doStuff()+0x1a>
 movaps XMMWORD PTR [rbp-0x30],xmm1
 movaps xmm2,XMMWORD PTR [rip+0x0] # 25 <void example.doStuff()+0x25>
 movaps XMMWORD PTR [rbp-0x20],xmm2
 mov QWORD PTR [rbp-0x10],0x0
 cmp QWORD PTR [rbp-0x10],0xf4240
 jge 6e <void example.doStuff()+0x6e>
 movaps xmm3,XMMWORD PTR [rbp-0x30]
 movaps xmm4,XMMWORD PTR [rbp-0x40]
 addps xmm4,xmm3
 movaps XMMWORD PTR [rbp-0x40],xmm4
 movaps xmm0,XMMWORD PTR [rbp-0x20]
 movaps xmm1,XMMWORD PTR [rbp-0x40]
 addps xmm1,xmm0
 movaps XMMWORD PTR [rbp-0x40],xmm1
 movaps xmm2,XMMWORD PTR [rbp-0x40]
 movaps xmm3,XMMWORD PTR [rbp-0x20]
 addps xmm3,xmm2
 movaps XMMWORD PTR [rbp-0x20],xmm3
 inc QWORD PTR [rbp-0x10]
 jmp 31 <void example.doStuff()+0x31>
 leave
 ret
****
The most important part is the body of the for-loop:
****
x += y;
x += z;
z += x;
****
Becomes
****
movaps xmm3,XMMWORD PTR [rbp-0x30]
movaps xmm4,XMMWORD PTR [rbp-0x40]
addps xmm4,xmm3
movaps XMMWORD PTR [rbp-0x40],xmm4
movaps xmm0,XMMWORD PTR [rbp-0x20]
movaps xmm1,XMMWORD PTR [rbp-0x40]
addps xmm1,xmm0
movaps XMMWORD PTR [rbp-0x40],xmm1
movaps xmm2,XMMWORD PTR [rbp-0x40]
movaps xmm3,XMMWORD PTR [rbp-0x20]
addps xmm3,xmm2
movaps XMMWORD PTR [rbp-0x20],xmm3
****
Instead of
****
addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0
****
So the results of the calculation are written back to memory after every addition in each loop iteration, instead of being loaded into the xmm registers once before the loop and stored back once afterwards.

Also, at the beginning the float4 values are loaded into xmm0-xmm2. Instead of being reused inside the loop, these registers are ignored there and only serve to copy the values into memory.

The result is that the generated code runs slower than the manual operation on an array, instead of being a significant speedup.
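
For reference, the "manual operation on an array" mentioned above could look like the sketch below. It is not taken from the report: the function name `doStuffScalar`, the `iterations` parameter, and the returned value are assumptions added so the result can be inspected. The constants and the loop body mirror the float4 example.

```d
// Hypothetical scalar baseline for comparison; not part of the original report.
// Same constants and loop body as doStuff(), but on plain float[4] arrays.
float[4] doStuffScalar(long iterations)
{
    float[4] x = [1.0f, 0.4f, 1234.0f, 124.0f];
    float[4] y = [1.0f, 0.4f, 1234.0f, 124.0f];
    float[4] z = [1.0f, 0.4f, 1234.0f, 123.0f];

    for (long i = 0; i < iterations; i++)
    {
        // Element-wise array operations, mirroring the float4 statements.
        x[] += y[];
        x[] += z[];
        z[] += x[];
    }
    return z; // returned so the caller can check the result
}
```

Calling `doStuffScalar(1_000_000)` runs the same number of iterations as the report's loop. Any timing comparison against the float4 version would depend on the compiler switches used.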
Comment #1 by malte.kiessling — 2016-10-08T10:10:08Z
asm.dlang.org example that shows this: https://goo.gl/wVQjQh
Comment #2 by malte.kiessling — 2016-10-08T12:03:43Z
I get somewhat similar output with ldc: http://tinyurl.com/hye9774
Though it's better, inside the loop it still stores the values back to memory.
Comment #3 by b2.temp — 2016-10-08T12:07:14Z
(In reply to Malte Kießling from comment #1)
> asm.dlang.org example that shows this: https://goo.gl/wVQjQh

Unfortunately this report is based on backend output produced with only the "-release" switch, which just removes the assertions!

You should retry with "-release -O -boundscheck=off"
Comment #4 by malte.kiessling — 2016-10-08T12:13:20Z
(In reply to b2.temp from comment #3)
> (In reply to Malte Kießling from comment #1)
> > asm.dlang.org example that shows this: https://goo.gl/wVQjQh
>
> Unfortunately this report is based on backend output produced with only the
> "-release" switch, which just removes the assertions!
>
> You should retry with "-release -O -boundscheck=off"

Whoops, I see. With "-release -O -boundscheck=off" I get the following:
****
movaps xmm4,XMMWORD PTR [rip+0x0] # 4c <void example.doStuff()+0x4c>
movaps xmm0,XMMWORD PTR [rsp]
addps xmm0,xmm4
movaps XMMWORD PTR [rsp],xmm0
movaps xmm1,XMMWORD PTR [rsp+0x10]
movaps xmm2,XMMWORD PTR [rsp]
addps xmm2,xmm1
movaps XMMWORD PTR [rsp],xmm2
movaps xmm3,XMMWORD PTR [rsp]
movaps xmm4,XMMWORD PTR [rsp+0x10]
addps xmm4,xmm3
movaps XMMWORD PTR [rsp+0x10],xmm4
****

Which is the same.
Comment #5 by b2.temp — 2016-10-08T12:31:44Z
(In reply to Malte Kießling from comment #4)
> (In reply to b2.temp from comment #3)
> > (In reply to Malte Kießling from comment #1)
> > > asm.dlang.org example that shows this: https://goo.gl/wVQjQh
> >
> > Unfortunately this report is based on backend output produced with only the
> > "-release" switch, which just removes the assertions!
> >
> > You should retry with "-release -O -boundscheck=off"
>
> Whoops, I see. With "-release -O -boundscheck=off" I get the following:
> ****
> movaps xmm4,XMMWORD PTR [rip+0x0] # 4c <void example.doStuff()+0x4c>
> movaps xmm0,XMMWORD PTR [rsp]
> addps xmm0,xmm4
> movaps XMMWORD PTR [rsp],xmm0
> movaps xmm1,XMMWORD PTR [rsp+0x10]
> movaps xmm2,XMMWORD PTR [rsp]
> addps xmm2,xmm1
> movaps XMMWORD PTR [rsp],xmm2
> movaps xmm3,XMMWORD PTR [rsp]
> movaps xmm4,XMMWORD PTR [rsp+0x10]
> addps xmm4,xmm3
> movaps XMMWORD PTR [rsp+0x10],xmm4
> ****
>
> Which is the same.

Not at all, you should have:

push rax
movaps xmm2,XMMWORD PTR [rip+0x0] # 8 <void example.doStuff()+0x8>
movaps xmm3,XMMWORD PTR [rip+0x0] # f <void example.doStuff()+0xf>
xor eax,eax
movaps xmm0,XMMWORD PTR [rip+0x0] # 18 <void example.doStuff()+0x18>
addps xmm2,xmm0
movaps xmm1,xmm3
addps xmm2,xmm1
movaps xmm4,xmm2
addps xmm3,xmm4
inc rax
cmp rax,0xf4240
jb 11 <void example.doStuff()+0x11>
pop rax
ret

See https://goo.gl/C3aquU