Bug 16489 – [backend][optimization][registers] DMD is 10-20 times slower for GLAS

Status: RESOLVED
Resolution: LATER
Severity: major
Priority: P1
Component: dmd
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2016-09-12T08:59:59Z
Last change time: 2019-04-12T15:08:32Z
Keywords: performance, SIMD
Assigned to: No Owner
Creator: Илья Ярошенко

Comments

Comment #0 by ilyayaroshenko — 2016-09-12T08:59:59Z
Small static arrays should be allocated in registers if possible [2]. Currently DMD loads and stores the values of a static array on every access. As a result, DMD is 10-20(!) times slower than LDC for GLAS matrix multiplication. This is the largest DMD BE problem for GLAS [1].

[1] http://docs.mir.dlang.io/latest/mir_glas_l3.html
[2] https://github.com/libmir/mir/blob/v0.17.0-alpha0/source/mir/glas/internal/gemm.d#L360

Related Issue: https://issues.dlang.org/show_bug.cgi?id=16488
Comment #1 by bugzilla — 2016-09-26T23:20:07Z
Could you post a short example, please?
Comment #2 by ilyayaroshenko — 2016-09-27T09:09:45Z
size_t length; // > 0
__vector(float[4])[2]* a; //aligned
float[6]* b;
__vector(float[4])[2][6] reg; // should be located in the registers
// init
reg = 0;
__vector(float[4])[2] ai = void;
__vector(float[4])[6] bi = void;
do
{
    ai[0] = a[0][0]; // should be located in the registers
    ai[1] = a[0][1]; // should be located in the registers
    foreach(i; AliasSeq!(0, 1, 2, 3, 4, 5))
    {
        bi[i] = b[0][i]; // Issue 16488, // should be located in the registers
        reg[i][0] += ai[0] * bi[i];
        reg[i][1] += ai[1] * bi[i];
    }
    a++;
    b++;
}
while(--length);
Comment #3 by bugzilla — 2016-09-27T19:29:53Z
Ok, I understand. This is the 'slicing' optimization, where an aggregate can be sliced up and stored in multiple registers. I went over it with deadalnix a while ago, as it was identified as a key optimization. It applies more generally than just for SIMD. I also worked out a scheme for implementing it in the DMD BE; I don't think it is that hard, unless I've misunderstood it.

The slicing can be done if:
1. all accesses lie within slices (not across slice boundaries)
2. a pointer to the aggregate is not taken (because then you lose control of condition 1).

The slicing then becomes a rewrite of the IR so that the aggregate is decomposed into multiple independent variables, and the rest of the backend then proceeds normally.
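To make the rewrite concrete, here is a minimal hand-written sketch of what slicing does to a small static array. It is an illustration only, not the scheme worked out for the DMD BE: the function names sumAggregate and sumSliced are hypothetical, and plain float is used instead of __vector(float[4]) so that the example does not depend on SIMD support. Both conditions hold here: every index into acc is a compile-time constant that stays within one element, and no pointer to acc is taken inside the loop.

```d
// Hypothetical names for illustration; they do not come from the report or from DMD.

// Before slicing: the accumulators live in one aggregate (a small static array).
float[4] sumAggregate(const(float)[] data)
{
    float[4] acc = 0; // sets all four elements to 0
    foreach (i; 0 .. data.length / 4)
    {
        // Each access uses a constant index, so it stays within a single slice.
        acc[0] += data[4 * i + 0];
        acc[1] += data[4 * i + 1];
        acc[2] += data[4 * i + 2];
        acc[3] += data[4 * i + 3];
    }
    return acc;
}

// After slicing (done by hand here): the aggregate is decomposed into four
// independent variables, each of which is an ordinary candidate for a register.
float[4] sumSliced(const(float)[] data)
{
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    foreach (i; 0 .. data.length / 4)
    {
        acc0 += data[4 * i + 0];
        acc1 += data[4 * i + 1];
        acc2 += data[4 * i + 2];
        acc3 += data[4 * i + 3];
    }
    // Reassemble the aggregate only once, at the end.
    float[4] result = [acc0, acc1, acc2, acc3];
    return result;
}
```

Both functions return the same values for the same input. Indexing acc with a run-time value, or passing &acc to another function, would violate the conditions above and prevent the decomposition.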
Comment #4 by bugzilla — 2016-11-09T21:41:30Z
There are enough issues with the example that it won't compile, and inventing changes to make it compile may not show the issue. Can you please post one that does compile and illustrates the issue?
Comment #5 by ilyayaroshenko — 2016-11-10T06:45:38Z
(In reply to Walter Bright from comment #4)
> There are enough issues with the example that it won't compile, and
> inventing changes to make it compile may not show the issue. Can you please
> post one that does compile and illustrates the issue?

void foo(
    ref __vector(float[4])[2][6] c,
    __vector(float[4])[2]* a,
    __vector(float[4])[6]* b,
    size_t length)
{
    import std.meta;
    __vector(float[4])[2][6] reg = void; // should be located in the registers
    reg = c;
    __vector(float[4])[2] ai = void;
    __vector(float[4])[6] bi = void;
    do
    {
        ai[0] = a[0][0]; // should be located in the registers
        ai[1] = a[0][1]; // should be located in the registers
        foreach(i; AliasSeq!(0, 1, 2, 3, 4, 5))
        {
            bi[i] = b[0][i]; // Issue 16488, // should be located in the registers
            reg[i][0] += ai[0] * bi[i];
            reg[i][1] += ai[1] * bi[i];
        }
        a++;
        b++;
    }
    while(--length);
    c = reg;
}