Comment #0 by malte.kiessling — 2016-10-08T10:04:44Z
I tried working with core.simd and noticed that, at least for trivial operations like += and *=, the generated code is slow (slower than without SSE instructions!). I used asm.dlang.org with the newest dmd to get the results below.
This code:
****
import core.simd;

void doStuff()
{
    float4 x = [1.0,0.4,1234.0,124.0];
    float4 y = [1.0,0.4,1234.0,124.0];
    float4 z = [1.0,0.4,1234.0,123.0];
    for(long i = 0; i<1_000_000; i++) {
        x += y;
        x += z;
        z += x;
    }
}
****
Results in the following assembly (I only pasted the function):
****
void example.doStuff():
push rbp
mov rbp,rsp
sub rsp,0x40
movaps xmm0,XMMWORD PTR [rip+0x0] # f <void example.doStuff()+0xf>
movaps XMMWORD PTR [rbp-0x40],xmm0
movaps xmm1,XMMWORD PTR [rip+0x0] # 1a <void example.doStuff()+0x1a>
movaps XMMWORD PTR [rbp-0x30],xmm1
movaps xmm2,XMMWORD PTR [rip+0x0] # 25 <void example.doStuff()+0x25>
movaps XMMWORD PTR [rbp-0x20],xmm2
mov QWORD PTR [rbp-0x10],0x0
cmp QWORD PTR [rbp-0x10],0xf4240
jge 6e <void example.doStuff()+0x6e>
movaps xmm3,XMMWORD PTR [rbp-0x30]
movaps xmm4,XMMWORD PTR [rbp-0x40]
addps xmm4,xmm3
movaps XMMWORD PTR [rbp-0x40],xmm4
movaps xmm0,XMMWORD PTR [rbp-0x20]
movaps xmm1,XMMWORD PTR [rbp-0x40]
addps xmm1,xmm0
movaps XMMWORD PTR [rbp-0x40],xmm1
movaps xmm2,XMMWORD PTR [rbp-0x40]
movaps xmm3,XMMWORD PTR [rbp-0x20]
addps xmm3,xmm2
movaps XMMWORD PTR [rbp-0x20],xmm3
inc QWORD PTR [rbp-0x10]
jmp 31 <void example.doStuff()+0x31>
leave
ret
****
The most important thing here is in the body of the for-loop:
****
x += y;
x += z;
z += x;
****
Becomes
****
movaps xmm3,XMMWORD PTR [rbp-0x30]
movaps xmm4,XMMWORD PTR [rbp-0x40]
addps xmm4,xmm3
movaps XMMWORD PTR [rbp-0x40],xmm4
movaps xmm0,XMMWORD PTR [rbp-0x20]
movaps xmm1,XMMWORD PTR [rbp-0x40]
addps xmm1,xmm0
movaps XMMWORD PTR [rbp-0x40],xmm1
movaps xmm2,XMMWORD PTR [rbp-0x40]
movaps xmm3,XMMWORD PTR [rbp-0x20]
addps xmm3,xmm2
movaps XMMWORD PTR [rbp-0x20],xmm3
****
Instead of
****
addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0
****
So the results of the calculation are written back to memory on every loop iteration, instead of loading the values into xmm registers before the loop and storing them back afterwards.
Also, at the beginning the values of the three float4 variables are loaded into xmm0-xmm2. Instead of being used inside the loop, these registers are ignored there and only serve to copy the initial values to the stack.
The result is that the generated code runs slower than the manual operation on an array, instead of giving a significant speedup.
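For anyone who wants to reproduce the comparison, here is a minimal sketch, not the exact benchmark behind the numbers above. It assumes that "manual operation on an array" means the same arithmetic on plain `float[4]` arrays; the names `simdLoop`, `scalarLoop` and `iterations` are made up for illustration. Both functions return their result so the loop stays observable to the optimizer.

```d
import core.simd;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

enum long iterations = 1_000_000;

// Hypothetical helper: the loop from this report, using core.simd.
float4 simdLoop()
{
    float4 x = [1.0f, 0.4f, 1234.0f, 124.0f];
    float4 y = [1.0f, 0.4f, 1234.0f, 124.0f];
    float4 z = [1.0f, 0.4f, 1234.0f, 123.0f];
    for (long i = 0; i < iterations; i++)
    {
        x += y;
        x += z;
        z += x;
    }
    return z; // returned so the work cannot be discarded as unused
}

// Hypothetical helper: the same arithmetic on plain static arrays
// (an assumption about what the "manual" version looks like).
float[4] scalarLoop()
{
    float[4] x = [1.0f, 0.4f, 1234.0f, 124.0f];
    float[4] y = [1.0f, 0.4f, 1234.0f, 124.0f];
    float[4] z = [1.0f, 0.4f, 1234.0f, 123.0f];
    for (long i = 0; i < iterations; i++)
    {
        x[] += y[];
        x[] += z[];
        z[] += x[];
    }
    return z;
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    float4 simdResult = simdLoop();
    const simdTime = sw.peek;

    sw.reset(); // the stopwatch keeps running after reset
    float[4] scalarResult = scalarLoop();
    const scalarTime = sw.peek;

    writefln("core.simd: %s, float[4]: %s", simdTime, scalarTime);
    writefln("results: %s / %s", simdResult.array, scalarResult);
}
```

The timings depend heavily on the compiler switches, so any numbers from this sketch are only meaningful when the switches are stated alongside them.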
Comment #1 by malte.kiessling — 2016-10-08T10:10:08Z
asm.dlang.org example that shows this: https://goo.gl/wVQjQh
Comment #2 by malte.kiessling — 2016-10-08T12:03:43Z
I get similar output with ldc: http://tinyurl.com/hye9774
It's better, but it still stores the values back to memory inside the loop.
Comment #3 by b2.temp — 2016-10-08T12:07:14Z
(In reply to Malte Kießling from comment #1)
> asm.dlang.org example that shows this: https://goo.gl/wVQjQh
Unfortunately this report is only based on the backend output with the switch "-release", so it just removes the assertions!
You should retry with "-release -O -boundscheck=off"
Comment #4 by malte.kiessling — 2016-10-08T12:13:20Z
(In reply to b2.temp from comment #3)
> (In reply to Malte Kießling from comment #1)
> > asm.dlang.org example that shows this: https://goo.gl/wVQjQh
>
> Unfortunately this report is only based on the backend output with the
> switch "-release", so it just removes the assertions!
>
> You should retry with "-release -O -boundscheck=off"
Whoops, I see. With "-release -O -boundscheck=off" I get the following:
****
movaps xmm4,XMMWORD PTR [rip+0x0] # 4c <void example.doStuff()+0x4c>
movaps xmm0,XMMWORD PTR [rsp]
addps xmm0,xmm4
movaps XMMWORD PTR [rsp],xmm0
movaps xmm1,XMMWORD PTR [rsp+0x10]
movaps xmm2,XMMWORD PTR [rsp]
addps xmm2,xmm1
movaps XMMWORD PTR [rsp],xmm2
movaps xmm3,XMMWORD PTR [rsp]
movaps xmm4,XMMWORD PTR [rsp+0x10]
addps xmm4,xmm3
movaps XMMWORD PTR [rsp+0x10],xmm4
****
Which is the same.
Comment #5 by b2.temp — 2016-10-08T12:31:44Z
(In reply to Malte Kießling from comment #4)
> (In reply to b2.temp from comment #3)
> > (In reply to Malte Kießling from comment #1)
> > > asm.dlang.org example that shows this: https://goo.gl/wVQjQh
> >
> > Unfortunately this report is only based on the backend output with the
> > switch "-release", so it just removes the assertions!
> >
> > You should retry with "-release -O -boundscheck=off"
>
> Whoops, I see. With "-release -O -boundscheck=off" I get the following:
> ****
> movaps xmm4,XMMWORD PTR [rip+0x0] # 4c <void example.doStuff()+0x4c>
> movaps xmm0,XMMWORD PTR [rsp]
> addps xmm0,xmm4
> movaps XMMWORD PTR [rsp],xmm0
> movaps xmm1,XMMWORD PTR [rsp+0x10]
> movaps xmm2,XMMWORD PTR [rsp]
> addps xmm2,xmm1
> movaps XMMWORD PTR [rsp],xmm2
> movaps xmm3,XMMWORD PTR [rsp]
> movaps xmm4,XMMWORD PTR [rsp+0x10]
> addps xmm4,xmm3
> movaps XMMWORD PTR [rsp+0x10],xmm4
> ****
>
> Which is the same.
Not at all, you should get:
push rax
movaps xmm2,XMMWORD PTR [rip+0x0] # 8 <void example.doStuff()+0x8>
movaps xmm3,XMMWORD PTR [rip+0x0] # f <void example.doStuff()+0xf>
xor eax,eax
movaps xmm0,XMMWORD PTR [rip+0x0] # 18 <void example.doStuff()+0x18>
addps xmm2,xmm0
movaps xmm1,xmm3
addps xmm2,xmm1
movaps xmm4,xmm2
addps xmm3,xmm4
inc rax
cmp rax,0xf4240
jb 11 <void example.doStuff()+0x11>
pop rax
ret
See https://goo.gl/C3aquU