Bug 2278 – Guarantee alignment of stack-allocated variables on x86

Status
NEW
Severity
enhancement
Priority
P4
Component
dmd
Product
D
Version
D2
Platform
x86
OS
Windows
Creation time
2008-08-11T10:17:19Z
Last change time
2024-12-13T17:48:45Z
Keywords
performance
Assigned to
Walter Bright
Creator
Don
Moved to GitHub: dmd#17771 →

Comments

Comment #0 by clugdbug — 2008-08-11T10:17:19Z
Use of SSE instructions in 32-bit Windows is problematic, since Windows and the C calling convention align the stack only to 4 bytes, not 8. It's too late for C and C++ to fix this problem. But D still has a chance, with a simple addition to the ABI...

Insert the following line into the spec:

D functions must be called with a stack aligned to an 8 byte boundary.

And how to implement this:

(1) whenever a D function is called, insert a 'push EBP'/'pop EBP' around it, if it has an odd number of (pushed arguments + pushed registers so far in this function). Note that this applies to invoking a delegate, too. (EBP is the best register to use, since it's guaranteed to be preserved, and it's almost certainly been used recently. On Intel CPUs this means it won't cause a register read stall).
(2) if local variables are created, make sure that the frame allocates an even number of DWORDs. (Create an unused local int, if necessary).
(3) extern() functions need stack alignment code at the top of them, since they could be called from other languages, with wrong stack alignment. Here's an example.
---
void main()
{
    asm
    {
        naked;
        mov EBP, ESP;
        and ESP, 0xFFFF_FFC0; // align to a 64 byte boundary.
        call alignedmain;
        mov ESP, EBP;
        ret;
    }
}
---
(4) alloca() also needs to ensure that it allocates an even number of DWORDs.

Note that a clever compiler could play games with the frame pointer to eliminate the (tiny -- approx 1.5 cycles) overhead of (1) in almost all cases (e.g., by converting one of the 'push reg's into 'mov [EBP+xx], reg').

The important thing to note about this solution (compared to using step (3) everywhere) is that it has lower overhead, and means that the innermost functions, which are most likely to need stack alignment, don't need to manually align it. Also note that when there's an even number of parameters, the overhead is _zero_.
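The padding rule in steps (1), (2) and (4) and the masking in step (3) come down to simple arithmetic. The following is only a minimal sketch, assuming 4-byte stack slots and an 8-byte target; the helper names `paddingDwords` and `alignDown` are hypothetical and are not part of DMD or the D ABI.

```d
/// Hypothetical helper: how many extra 4-byte slots (0 or 1) must be
/// pushed so that the total number of DWORDs stays even.
size_t paddingDwords(size_t pushedDwords) pure nothrow @nogc @safe
{
    return pushedDwords & 1;
}

/// Hypothetical helper: round an address down to a power-of-two boundary.
/// This is the same operation as `and ESP, mask` in step (3).
size_t alignDown(size_t address, size_t alignment) pure nothrow @nogc @safe
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return address & ~(alignment - 1);
}

void main()
{
    // Three pushed DWORDs need one dummy push; four need none.
    assert(paddingDwords(3) == 1);
    assert(paddingDwords(4) == 0);

    // Masking with ~(64 - 1) clears the low six bits of the address.
    assert(alignDown(0x0012_FF7C, 64) == 0x0012_FF40);
    assert(alignDown(0x0012_FF40, 64) == 0x0012_FF40);
}
```

With an even DWORD count the padding is zero, which matches the claim that an even number of parameters costs nothing.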
Comment #1 by andrei — 2008-08-11T10:29:07Z
This looks like a broad change for a particular case. The particular case is short numeric arrays of constant size (because those get stack-allocated). So why not have the compiler align only those at 8-byte boundaries and leave everything else alone? Copy semantics for constant-size arrays will certainly help too.
Comment #2 by bugzilla — 2008-08-11T16:38:56Z
Keeping the stack always aligned is not that simple. The code generator will also push/pop register pairs when it runs out of them. Probably the most practical approach is to align static arrays by using the code to AND the ESP register, but this means that there will be two frame pointers for the function. Ug.
Comment #3 by shro8822 — 2008-08-11T17:08:17Z
IIRC there is an x86 instruction pair (enter/leave?) that moves the top of the stack in a way that can be undone. If that allows non-literal arguments, a pair of these around the scope would do it.

    offset = FP
    offset += ENTER_META_DATA.sizeof
    offset &= 0x0f
    offset -= ENTER_META_DATA.sizeof
    enter offset // push offset space and some metadata
    ..... scope
    leave // pop it all off
Comment #4 by davidl — 2008-08-12T12:06:45Z
enter & leave are just simple sugar for pushing and popping ebp or whatever. if you can do it by enter & leave, you can do it simply by replacing it with pushing & popping ebp.

    align(8) void func() // make sure the stack aligns to 8
    {
    }
    void func(){} // align to 4

this might be useful to cut the use of the stack. aligning to 8 for all might leave a lot of stack memory unused (but i'm not sure about this). with the instructions mentioned by W, it should be a fair enough trade-off of runtime efficiency & stack memory usage.
Comment #5 by bugzilla — 2008-08-12T18:21:29Z
The problem with entering a function and then aligning the stack is that the code in the function can no longer access the function parameters with a known offset. Probably the best approach to this is to do the equivalent to alloca() - allocate the aligned data on the stack separately, and store a pointer to it in the regular stack frame. The compiler can sugar over all this.
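What the alloca()-style approach could look like when written by hand is sketched below. This is only an illustration under stated assumptions: the helper `alignUp`, the alias `Vec` and the 16-byte target are hypothetical choices, and the compiler-generated version would of course hide all of this.

```d
import core.stdc.stdio : printf;

/// Hypothetical helper: round `p` up to the next multiple of `alignment`
/// (which must be a power of two).
void* alignUp(void* p, size_t alignment) nothrow @nogc
{
    immutable mask = alignment - 1;
    return cast(void*)((cast(size_t) p + mask) & ~mask);
}

void main()
{
    alias Vec = float[4];   // 16 bytes of payload
    enum alignment = 16;    // assumed target alignment

    // Over-allocate by alignment - 1 bytes so that an aligned Vec always
    // fits somewhere inside `raw`, wherever the frame happens to start.
    ubyte[Vec.sizeof + alignment - 1] raw = void;

    // This pointer plays the role of the one stored in the regular frame.
    Vec* vec = cast(Vec*) alignUp(raw.ptr, alignment);
    *vec = [1.0f, 2.0f, 3.0f, 4.0f];

    assert(cast(size_t) vec % alignment == 0);
    printf("%u\n", cast(uint)(cast(size_t) vec % alignment));
}
```

The function parameters and ordinary locals keep their known offsets; only the aligned data is reached through one extra indirection.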
Comment #6 by clugdbug — 2010-01-15T04:51:12Z
*** Issue 1847 has been marked as a duplicate of this issue. ***
Comment #7 by elfy.nv — 2010-12-17T18:10:27Z
In D2, on entering main() the stack may or may not be aligned to 8 bytes, depending on the length of the command line with which the program was run. This may cause as much as a 2x difference with no apparent reason for it. (Lack of alignment is a pity, but this particular case is plainly confusing.)

Example. Run with different command lines, for example with and without extension.

    import core.stdc.stdio: printf;
    import std.date: getUTCtime, ticksPerSecond;

    void main()
    {
        double d = 0.0;
        auto t0 = getUTCtime();
        for (size_t i = 0; i < 100_000_000; i++)
            d += 1;
        auto t1 = getUTCtime();
        printf("%lf\n", d);
        printf("%u\n", (cast(size_t)&d) % 8);
        printf("%lf\n", (cast(double)t1 - cast(double)t0) / ticksPerSecond);
    }

Also, this code shows that inside a frame, variables are placed as if stack alignment was expected (note that a & d are either both aligned on 8 or both unaligned):

    import core.stdc.stdio: printf;

    void main()
    {
        int a;
        double d;
        printf("%X:%u %X:%u\n", &a, (cast(size_t)&a) % 8, &d, (cast(size_t)&d) % 8);
    }

Also +1 for some way to have locals aligned, be it an explicit align(n) before the declaration of a variable, or before a function (I like this one), or throughout the whole program.
Comment #8 by code — 2011-09-21T09:47:09Z
SSE is getting more and more important in performance-critical applications. There should be at least one way to make sure that a certain variable that is being allocated on the stack is aligned. I recently came across this issue in D2.
Comment #9 by turkeyman — 2012-05-24T03:08:55Z
I'm at the point where I can't reasonably work around this issue anymore. It's not just for SSE (although that is one very important case); there are also structures that encapsulate SSE variables (16 bytes), structures that must be L1 line aligned (64/128 bytes), structures that must be GPU page aligned (4k-ish), virtual page alignment, and occasionally other alignments are required (for instance, in one case an algorithm's performance was nearly doubled by aligning to 256 bytes and squatting a byte of data in the unused low bits of the pointer).

Structure alignment is really, really important, and it's very annoying to work around (and often wastes memory in doing so).

As we did with 256-bit vectors, can we define the grammar for attributing a struct with an alignment? Then GDC/LDC can hook it straight up, and DMD can produce an unsupported message for the time being.
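The trick of squatting a byte in the low bits of a pointer can be illustrated roughly as follows. This is a sketch only, assuming a 256-byte-aligned block so that the low 8 bits are always zero; the names `tagMask`, `packTagged`, `pointerOf` and `tagOf` are hypothetical and not taken from any library.

```d
/// With 256-byte alignment the low 8 bits of an address are always zero.
enum size_t tagMask = 0xFF;

/// Hypothetical helper: store one byte in the unused low bits.
size_t packTagged(void* p, ubyte tag) nothrow @nogc
{
    assert((cast(size_t) p & tagMask) == 0, "pointer must be 256-byte aligned");
    return cast(size_t) p | tag;
}

/// Hypothetical helper: recover the original pointer.
void* pointerOf(size_t tagged) nothrow @nogc
{
    return cast(void*)(tagged & ~tagMask);
}

/// Hypothetical helper: recover the stored byte.
ubyte tagOf(size_t tagged) nothrow @nogc
{
    return cast(ubyte)(tagged & tagMask);
}

void main()
{
    // Without compiler support, the aligned block has to be carved out
    // of an over-sized buffer by hand.
    ubyte[256 + 255] raw;
    void* p = cast(void*)((cast(size_t) raw.ptr + tagMask) & ~tagMask);

    immutable tagged = packTagged(p, 0xAB);
    assert(pointerOf(tagged) is p);
    assert(tagOf(tagged) == 0xAB);
}
```

The hand-rolled buffer in `main` wastes up to 255 bytes, which is the kind of memory overhead the workaround incurs.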
Comment #10 by bearophile_hugs — 2012-07-24T06:02:27Z
In DMD 2.060beta this problem seems partially solved, for structs:

    import core.stdc.stdio: printf;

    align(16) struct Foo { ubyte u; }
    // struct Foo { ubyte u; } // try this

    void main()
    {
        Foo f1;
        ubyte[3] b1;
        Foo f2;
        ubyte[5] b2;
        Foo f3;
        ubyte[7] b3;
        short s1;
        Foo f4;
        printf("%u\n", cast(size_t)&f1 % 16);
        printf("%u\n", cast(size_t)&f2 % 16);
        printf("%u\n", cast(size_t)&f3 % 16);
        printf("%u\n", cast(size_t)&f4 % 16);
    }

Output:

    0
    0
    0
    0

But this syntax is not supported yet:

    void main()
    {
        align(16) ubyte u;
    }
Comment #11 by temtaime — 2013-08-15T15:47:59Z
BUMP. 2.063.2 regression?

    import core.stdc.stdio: printf;

    align(16) struct Foo { ubyte u; }
    // struct Foo { ubyte u; } // try this

    void main()
    {
        Foo f1;
        ubyte[3] b1;
        Foo f2;
        ubyte[5] b2;
        Foo f3;
        ubyte[7] b3;
        short s1;
        Foo f4;
        printf("%u\n", cast(size_t)&f1 % 16);
        printf("%u\n", cast(size_t)&f2 % 16);
        printf("%u\n", cast(size_t)&f3 % 16);
        printf("%u\n", cast(size_t)&f4 % 16);
    }

Output:

    8
    8
    8
    8
Comment #12 by temtaime — 2014-02-19T06:06:01Z
BUMP.

    import std.stdio : writeln;

    align(16) struct A { ubyte t; }

    void main()
    {
        A a;
        writeln(cast(size_t)&a % 16);
    }

Prints 4 right now. When will it be fixed?
Comment #13 by s_lange — 2014-03-10T07:33:21Z
The problem with alignment of the stack for (128-bit) SSE is that it even needs to be aligned to a 16-byte (or double quadword) boundary for fast access (via aligned moves); 8 bytes (or a quadword) won't be enough (yes, there are unaligned SSE move instructions, but...). And that's only for basic 128-bit SSE; AVX may need 256-bit or even 512-bit alignment, at least when you need to use fast aligned move instructions.

Hopefully, no (pure) 32-bit x86 CPU has 256- or 512-bit AVX registers. However, it's possible to use the upper parts of 256- or 512-bit AVX registers in 32-bit Windows on a 64-bit CPU in 32-bit compatibility mode, but I'm not sure whether older versions of Windows recognize and correctly save/restore them on context switch, which is essential for using them safely (someone needs to check this).

So here you have it: 32-bit Windows requires 16-byte alignment for SSE. 64-bit Windows already has 16-byte alignment, but may require even more if AVX registers are used.
Comment #14 by verylonglogin.reg — 2014-03-18T07:00:56Z
This isn't a regression. It was just a lucky build in Comment 10. As a [partial] workaround one can use an auto-aligned buffer, e.g. this one: http://denis-sh.bitbucket.org/unstandard/unstd.memory.misc.html
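The general idea behind such an auto-aligned buffer might look roughly like the wrapper below. This is not the API of the linked library; `AlignedStorage` and its `ptr` member are hypothetical names, and the sketch assumes `T` is plain data for which zero-filled storage is acceptable.

```d
/// Hypothetical wrapper: reserves enough raw bytes that a T aligned to
/// `alignment` always fits, wherever the struct itself ends up.
struct AlignedStorage(T, size_t alignment)
{
    static assert(alignment != 0 && (alignment & (alignment - 1)) == 0,
                  "alignment must be a power of two");

    // Zero-filled by default; note this is not the same as T.init.
    private ubyte[T.sizeof + alignment - 1] raw;

    /// Recomputed on every call, because copying or moving the struct
    /// changes the address of `raw`.
    T* ptr() nothrow @nogc
    {
        immutable mask = alignment - 1;
        return cast(T*)((cast(size_t) raw.ptr + mask) & ~mask);
    }
}

void main()
{
    AlignedStorage!(double[2], 16) storage;
    double[2]* v = storage.ptr;

    (*v)[0] = 1.5;
    (*v)[1] = 2.5;

    assert(cast(size_t) v % 16 == 0);
    assert((*v)[0] + (*v)[1] == 4.0);
}
```

The cost is up to `alignment - 1` wasted bytes per variable plus an address computation on each access, which is why this is only a partial workaround.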
Comment #15 by robert.schadek — 2024-12-13T17:48:45Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/dmd/issues/17771 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT