Bug 16605 – core.simd generates slow/irrelevant code

Status: RESOLVED
Resolution: INVALID
Severity: minor
Priority: P1
Component: dmd
Product: D
Version: D2
Platform: x86_64
OS: Linux
Creation time: 2016-10-08T10:04:44Z
Last change time: 2020-03-21T03:56:39Z
Keywords: SIMD
Assigned to: No Owner
Creator: Malte Kießling

Comments

Comment #0 by malte.kiessling — 2016-10-08T10:04:44Z

I tried working with core.simd and noticed that, at least for trivial operations like += and *=, the generated code is slow (slower than without SSE instructions!). I used asm.dlang.org with the newest dmd to get the results below.

This code:

****
import core.simd;

void doStuff()
{
    float4 x = [1.0,0.4,1234.0,124.0];
    float4 y = [1.0,0.4,1234.0,124.0];
    float4 z = [1.0,0.4,1234.0,123.0];
    for(long i = 0; i<1_000_000; i++)
    {
        x += y;
        x += z;
        z += x;
    }
}
****

results in the following assembly (I only pasted the function):

****
void example.doStuff():
 push   rbp
 mov    rbp,rsp
 sub    rsp,0x40
 movaps xmm0,XMMWORD PTR [rip+0x0]        # f <void example.doStuff()+0xf>
 movaps XMMWORD PTR [rbp-0x40],xmm0
 movaps xmm1,XMMWORD PTR [rip+0x0]        # 1a <void example.doStuff()+0x1a>
 movaps XMMWORD PTR [rbp-0x30],xmm1
 movaps xmm2,XMMWORD PTR [rip+0x0]        # 25 <void example.doStuff()+0x25>
 movaps XMMWORD PTR [rbp-0x20],xmm2
 mov    QWORD PTR [rbp-0x10],0x0
 cmp    QWORD PTR [rbp-0x10],0xf4240
 jge    6e <void example.doStuff()+0x6e>
 movaps xmm3,XMMWORD PTR [rbp-0x30]
 movaps xmm4,XMMWORD PTR [rbp-0x40]
 addps  xmm4,xmm3
 movaps XMMWORD PTR [rbp-0x40],xmm4
 movaps xmm0,XMMWORD PTR [rbp-0x20]
 movaps xmm1,XMMWORD PTR [rbp-0x40]
 addps  xmm1,xmm0
 movaps XMMWORD PTR [rbp-0x40],xmm1
 movaps xmm2,XMMWORD PTR [rbp-0x40]
 movaps xmm3,XMMWORD PTR [rbp-0x20]
 addps  xmm3,xmm2
 movaps XMMWORD PTR [rbp-0x20],xmm3
 inc    QWORD PTR [rbp-0x10]
 jmp    31 <void example.doStuff()+0x31>
 leave
 ret
****

The most important part is the body of the for loop:

****
x += y;
x += z;
z += x;
****

becomes

****
movaps xmm3,XMMWORD PTR [rbp-0x30]
movaps xmm4,XMMWORD PTR [rbp-0x40]
addps  xmm4,xmm3
movaps XMMWORD PTR [rbp-0x40],xmm4
movaps xmm0,XMMWORD PTR [rbp-0x20]
movaps xmm1,XMMWORD PTR [rbp-0x40]
addps  xmm1,xmm0
movaps XMMWORD PTR [rbp-0x40],xmm1
movaps xmm2,XMMWORD PTR [rbp-0x40]
movaps xmm3,XMMWORD PTR [rbp-0x20]
addps  xmm3,xmm2
movaps XMMWORD PTR [rbp-0x20],xmm3
****

instead of

****
addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0
****

So the results of the calculation are written back to memory on every loop iteration, instead of being loaded into the xmm registers once before the loop and stored back afterwards.

Also, at the start of the function the float4 values are loaded into xmm0-xmm2. Instead of being used inside the loop, these registers are ignored there and only serve to copy the values into memory.

As a result, the generated code runs slower than the same operations done manually on an array, instead of giving a significant speedup.

Comment #1 by malte.kiessling — 2016-10-08T10:10:08Z

asm.dlang.org example that shows this: https://goo.gl/wVQjQh

Comment #2 by malte.kiessling — 2016-10-08T12:03:43Z

I get similar output with ldc: http://tinyurl.com/hye9774
Though it's better, inside the loop it's still storing the values back to memory.

Comment #3 by b2.temp — 2016-10-08T12:07:14Z

(In reply to Malte Kießling from comment #1)
> asm.dlang.org example that shows this: https://goo.gl/wVQjQh

Unfortunately this report is based only on backend output produced with the switch "-release", which just removes the assertions!

You should retry with "-release -O -boundscheck=off"

Comment #4 by malte.kiessling — 2016-10-08T12:13:20Z

(In reply to b2.temp from comment #3)
> (In reply to Malte Kießling from comment #1)
> > asm.dlang.org example that shows this: https://goo.gl/wVQjQh
>
> Unfortunately this report is based only on backend output produced with the
> switch "-release", which just removes the assertions!
>
> You should retry with "-release -O -boundscheck=off"

Whoops, I see. With "-release -O -boundscheck=off" I get the following:

****
movaps xmm4,XMMWORD PTR [rip+0x0]        # 4c <void example.doStuff()+0x4c>
movaps xmm0,XMMWORD PTR [rsp]
addps  xmm0,xmm4
movaps XMMWORD PTR [rsp],xmm0
movaps xmm1,XMMWORD PTR [rsp+0x10]
movaps xmm2,XMMWORD PTR [rsp]
addps  xmm2,xmm1
movaps XMMWORD PTR [rsp],xmm2
movaps xmm3,XMMWORD PTR [rsp]
movaps xmm4,XMMWORD PTR [rsp+0x10]
addps  xmm4,xmm3
movaps XMMWORD PTR [rsp+0x10],xmm4
****

Which is the same.

Comment #5 by b2.temp — 2016-10-08T12:31:44Z

(In reply to Malte Kießling from comment #4)
> (In reply to b2.temp from comment #3)
> > (In reply to Malte Kießling from comment #1)
> > > asm.dlang.org example that shows this: https://goo.gl/wVQjQh
> >
> > Unfortunately this report is based only on backend output produced with the
> > switch "-release", which just removes the assertions!
> >
> > You should retry with "-release -O -boundscheck=off"
>
> Whoops, I see. With "-release -O -boundscheck=off" I get the following:
> ****
> movaps xmm4,XMMWORD PTR [rip+0x0]        # 4c <void example.doStuff()+0x4c>
> movaps xmm0,XMMWORD PTR [rsp]
> addps  xmm0,xmm4
> movaps XMMWORD PTR [rsp],xmm0
> movaps xmm1,XMMWORD PTR [rsp+0x10]
> movaps xmm2,XMMWORD PTR [rsp]
> addps  xmm2,xmm1
> movaps XMMWORD PTR [rsp],xmm2
> movaps xmm3,XMMWORD PTR [rsp]
> movaps xmm4,XMMWORD PTR [rsp+0x10]
> addps  xmm4,xmm3
> movaps XMMWORD PTR [rsp+0x10],xmm4
> ****
>
> Which is the same.

Not at all, you should have:

 push   rax
 movaps xmm2,XMMWORD PTR [rip+0x0]        # 8 <void example.doStuff()+0x8>
 movaps xmm3,XMMWORD PTR [rip+0x0]        # f <void example.doStuff()+0xf>
 xor    eax,eax
 movaps xmm0,XMMWORD PTR [rip+0x0]        # 18 <void example.doStuff()+0x18>
 addps  xmm2,xmm0
 movaps xmm1,xmm3
 addps  xmm2,xmm1
 movaps xmm4,xmm2
 addps  xmm3,xmm4
 inc    rax
 cmp    rax,0xf4240
 jb     11 <void example.doStuff()+0x11>
 pop    rax
 ret

see https://goo.gl/C3aquU