Bug 14571 – [REG2.064] Large static arrays seem to lock up DMD
Status
RESOLVED
Resolution
FIXED
Severity
regression
Priority
P1
Component
dmd
Product
D
Version
D2
Platform
All
OS
All
Creation time
2015-05-11T06:38:00Z
Last change time
2015-07-17T18:40:45Z
Keywords
performance
Assigned to
nobody
Creator
turkeyman
Comments
Comment #0 by turkeyman — 2015-05-11T06:38:57Z
Large static arrays seem to lock up the compiler for some reason:
struct InputPoint
{
double[3] position;
ubyte[4] color;
uint pad;
}
struct GrowableArray(T, size_t chunkSize)
{
struct Chunk
{
T[chunkSize] data;
}
}
alias X = GrowableArray!(InputPoint, 1024); // <-- this compiles quickly
//alias X = GrowableArray!(InputPoint, 128*1024); // <-- this is very slow, appears to hang the compiler
I have to say, I'm kinda worried that Chunk.init is going to be some gigantic statically allocated block of data, rather than a single instance of T that is repeated 'chunkSize' times...
I use patterns like this in C quite a bit. Should this be avoided in D?
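For reference, a quick compile-time sketch (reusing the declarations above; not part of the repro itself) of how big one Chunk, and therefore its .init, would be:
alias SmallChunk = GrowableArray!(InputPoint, 1024).Chunk;

// InputPoint.sizeof is 32 bytes, so this prints 32768 (32 KiB);
// with 128*1024 elements the same Chunk would be 4 MiB of .init data.
pragma(msg, SmallChunk.sizeof);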
DMD-Win64 0.67.1
Comment #1 by ag0aep6g — 2015-05-11T09:52:37Z
A little shell script with a reduced test case:
----
> times
for i in `seq 2 2 60`
do
cat > test.d << code
struct InputPoint
{
double position;
}
struct GrowableArray
{
InputPoint[$i*1024] data;
}
code
(echo -n "$i "; time -f %U dmd -c test.d 2>&1) | tee -a times
done
gnuplot -e 'set terminal dumb; plot "times"'
----
Output:
----
2 0.01
4 0.04
6 0.08
8 0.13
10 0.22
12 0.30
14 0.40
16 0.53
18 0.70
20 0.87
22 1.04
24 1.32
26 1.50
28 1.82
30 2.15
32 2.62
34 3.22
36 4.05
38 5.19
40 6.23
42 7.68
44 10.13
46 10.97
48 11.97
50 14.07
52 22.59
54 24.41
56 26.22
58 33.27
60 33.24
[gnuplot dumb-terminal plot of the timings above: compile time (user seconds) vs. array size multiplier, rising steeply toward the right]
----
That looks quadratic. Eww.
Comment #2 by turkeyman — 2015-05-11T10:14:46Z
Even more amazing than a quadratic increase in compile time resulting from a linear increase in array size is that script you just produced... that is truly astonishing!
Seriously, I'm speechless. O_O
Comment #4 by Kenji Hara — 2015-05-12T22:20:36Z
(In reply to Vladimir Panteleev from comment #3)
> This is a regression.
>
> Introduced in https://github.com/D-Programming-Language/dmd/pull/2605
The root issue is the inefficiency of 'dt' structure in dmd backend.
Reduced test case:
struct S(T, size_t dim)
{
T[dim] data;
}
version(OK) alias X = S!( uint, 128*1024);
version(NG) alias X = S!(double, 128*1024);
If the static array element is a zero value (e.g. uint.init), it's handled as "N zero bytes". If the element value is non-zero (e.g. double.init), it's handled as 131072 repetitions of a RealExp.
Therefore, the following code also hits the performance issue.
struct S(T, size_t dim)
{
T[dim] data = 1; // non-zero element
}
alias X = S!(uint, 128*1024);
Comment #5 by ag0aep6g — 2015-05-17T09:14:44Z
(In reply to Kenji Hara from comment #4)
> Reduced test case:
>
> struct S(T, size_t dim)
> {
> T[dim] data;
> }
> version(OK) alias X = S!( uint, 128*1024);
> version(NG) alias X = S!(double, 128*1024);
This compiles quickly for me. Only when I wrap T in another struct does it grind to a halt:
----
struct D(T) {T m;}
struct S(T, size_t dim)
{
D!T[dim] data;
}
version(OK) alias X = S!( uint, 128*1024);
version(NG) alias X = S!(double, 128*1024);
----
[...]
> Therefore, the following code also hits the performance issue.
>
> struct S(T, size_t dim)
> {
> T[dim] data = 1; // non-zero element
> }
> alias X = S!(uint, 128*1024);
Same here. I only see the issue with another struct:
----
struct D(T) {T m = 1; /* non-zero element */}
struct S(T, size_t dim)
{
D!T[dim] data;
}
alias X = S!(uint, 128*1024);
----
Comment #6 by bugzilla — 2015-05-22T09:11:02Z
I'll fix this, but you should know that this necessarily creates large sections in the executable file of repeated data. For such large arrays, it is better to allocate and initialize them on program startup.
The way globals work on modern CPUs means that you are not saving any execution time by using static data. Large static arrays are an artifact of FORTRAN.
Comment #7 by dlang-bugzilla — 2015-05-22T09:20:33Z
(In reply to Walter Bright from comment #6)
> I'll fix this, but you should know that this necessarily creates large
> sections in the executable file of repeated data. For such large arrays, it
> is better to allocate and initialize them on program startup.
>
> The way globals work on modern CPUs means that you are not saving any execution
> time by using static data. Large static arrays are an artifact of FORTRAN.
There are no global variables in the problematic programs given here.
The problem is entirely with .init.
Additionally, dynamic arrays cannot always be used as a drop-in replacement for static arrays. They add one level of indirection and cannot easily be aligned together with other types (as with static arrays in a struct).
(In reply to ag0aep6g from comment #1)
> struct InputPoint
> {
> double position;
> }
The problem should be avoided if you change the declaration to "double position = 0;"
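For example, a minimal sketch of that workaround applied to the reduced case from comment #1 (just the shape of the fix, not tested against every DMD version discussed here):
struct InputPoint
{
    double position = 0; // all-zero initializer, so no explicit .init block is needed
}
struct GrowableArray
{
    InputPoint[128 * 1024] data; // the whole array is zero-initialized
}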
Comment #9 by bugzilla — 2015-05-22
(In reply to Vladimir Panteleev from comment #7)
> The problem is entirely with .init.
Which is a global variable, and has all the downsides of them. My advice still applies.
> Additionally, dynamic arrays cannot always be used as a drop-in replacement
> for static arrays. They add one level of indirection and cannot easily be
> aligned together with other types (as with static arrays in a struct).
Global arrays already have that extra indirection. I'm also hard pressed to see how a large struct is going to suffer from an extra indirection to the array, when it surely is going to suffer much worse from cache invalidation.
> The problem should be avoided if you change the declaration to "double
> position = 0;"
True, then it goes into the BSS segment, but the other issues still apply that are not fixable by changing the compiler (i.e. executable bloat).
Comment #10 by bugzilla — 2015-05-22T10:45:30Z
Essentially, I think you'll find that speed improvements from using very large static arrays are illusory.
I'll still fix the compiler problem, though, just be aware there is no fix for the giant object files and exe files you'll also get.
Comment #11 by dlang-bugzilla — 2015-05-22T10:48:24Z
(In reply to Walter Bright from comment #9)
> (In reply to Vladimir Panteleev from comment #7)
> > The problem is entirely with .init.
>
> Which is a global variable, and has all the downsides of them. My advice
> still applies.
Certain data structures are much easier to implement using static arrays (for example, object pools with statically-sized chunks).
> Global arrays already have that extra indirection.
You mean TLS?
Arrays in the data segment have a fixed address. This means that arr.ptr is actually known at compile time (whether the language exposes it or not). arr[5] involves no address calculation at all during program execution as well.
> True, then it goes into the BSS segment, but the other issues still apply
> that are not fixable by changing the compiler (i.e. executable bloat).
BSS does not bloat executable size, only virtual memory.
Comment #12 by dlang-bugzilla — 2015-05-22T10:49:15Z
(In reply to Vladimir Panteleev from comment #11)
> Arrays in the data segment have a fixed address. This means that arr.ptr is
> actually known at compile time (whether the language exposes it or not).
Sorry, that's wrong. It's known at link time.
Comment #13 by bugzilla — 2015-05-22T10:49:26Z
Oh, and lest I forget, using large thread local static arrays is surely going to be a mistake, as it'll consume memory for every thread, and will make thread creation quite expensive.
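To illustrate (a sketch of standard D behaviour; the variable names and sizes are arbitrary): module-level variables are thread-local by default, so every new thread pays for its own copy unless the array is explicitly made process-global.
// Thread-local by default: each new thread gets its own 4 MiB copy.
double[512 * 1024] perThreadData;

// __gshared (or shared) keeps a single process-wide instance instead.
__gshared double[512 * 1024] processWideData;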
Comment #14 by bugzilla — 2015-05-22T10:57:10Z
(In reply to Vladimir Panteleev from comment #11)
> Certain data structures are much easier to implement using static arrays
> (for example, object pools with statically-sized chunks).
>
> > Global arrays already have that extra indirection.
>
> You mean TLS?
Take a look at the code generated for global data. In 64 bit code, it's all relative to the program counter. In 32 bit code, it's indirect because of shared library support (PIC).
> Arrays in the data segment have a fixed address. This means that arr.ptr is
> actually known at compile time (whether the language exposes it or not).
Link time, not compile time.
> arr[5] involves no address calculation at all during program execution as
> well.
That died a couple decades ago with the advent of DLLs. x86-64 code doesn't even have a direct addressing mode. Even the presumably direct addressing modes in 32-bit x86 are indirect because of the segment registers, and despite that, the CPU does such a good job of address pipelining that you'll never see the effect of using a register offset.
> BSS does not bloat executable size, only virtual memory.
I know, that's why I mentioned it. BSS is special.
Comment #15 by dlang-bugzilla — 2015-05-22T11:10:11Z
(In reply to Walter Bright from comment #14)
> Take a look at the code generated for global data. In 64 bit code, it's all
> relative to the program counter. In 32 bit code, it's indirect because of
> shared library support (PIC).
Not on Win32, though.
> That died a couple decades ago with the advent of DLLs.
DLLs are relocated at load time (and usually are linked with a base unlikely to conflict, so relocations are often not done). The hypothetical ptr[5] would be relocated as well.
> x86-64 code doesn't even have a direct addressing mode. Even the presumably
> direct addressing modes in 32-bit x86 are indirect because of the segment
> registers, and despite that, the CPU does such a good job of address
> pipelining that you'll never see the effect of using a register offset.
I would need to run some benchmarks to test this. But a quick test shows that 64-bit code has dedicated CPU instructions for relative addressing of globals, while indexing arrays on the heap still requires two instructions (mov rax, arr followed by mov dword ptr [rax+idx*4], value).
> I know, that's why I mentioned it. BSS is special.
I understood your post as saying that executable bloat still applies even though the data goes into BSS.
Comment #16 by bugzilla — 2015-05-23T02:52:33Z
(In reply to Vladimir Panteleev from comment #15)
> DLLs are relocated at load time (and usually are linked with a base unlikely
> to conflict, so relocations are often not done). The hypothetical ptr[5]
> would be relocated as well.
It goes through a relocation thunk. So does TLS.
> Not on Win32
Win32 is dead. Even phones have 64 bit processors, don't they?
> I would need to run some benchmarks to test this. But a quick test shows
> that 64-bit code has dedicated CPU instructions for relative addressing of
> globals, while indexing arrays on the heap still requires two instructions
> (mov rax, arr followed by mov dword ptr [rax+idx*4], value).
64 bit code indexes static data with the Program Counter.
Furthermore, if you're accessing large arrays, the cost of getting a pointer to the start of it is utterly swamped by accessing the data itself. Like I said, I bet if you do some benchmarking, you'd be hard pressed to find ANY improvement of static large arrays over allocated ones.
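A minimal sketch of such a benchmark (the size, names, and timing approach are arbitrary choices; it only shows the shape of the comparison):
import core.time : MonoTime;
import std.stdio : writeln;

enum N = 1024 * 1024;

__gshared uint[N] staticData; // static storage; zero-initialized, so it lands in BSS

void main()
{
    auto heapData = new uint[N]; // GC-allocated equivalent

    ulong sum1, sum2;

    auto t0 = MonoTime.currTime;
    foreach (i; 0 .. N) sum1 += staticData[i]; // index static data
    auto t1 = MonoTime.currTime;
    foreach (i; 0 .. N) sum2 += heapData[i];   // index heap data through the slice
    auto t2 = MonoTime.currTime;

    writeln("static: ", t1 - t0, "  heap: ", t2 - t1, "  (sums: ", sum1, " ", sum2, ")");
}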
Comment #17 by dlang-bugzilla — 2015-05-23T03:28:11Z
(In reply to Walter Bright from comment #16)
> (In reply to Vladimir Panteleev from comment #15)
> > DLLs are relocated at load time (and usually are linked with a base unlikely
> > to conflict, so relocations are often not done). The hypothetical ptr[5]
> > would be relocated as well.
>
> It goes through a relocation thunk. So does TLS.
I don't know what a "relocation thunk" is (no Google hits), but all offsets are adjusted at load time, if that's what you mean.
> Win32 is dead. Even phones have 64 bit processors, don't they?
Only the newer ones... and the phones don't use x86_64. Is the situation on ARM the same?
> Furthermore, if you're accessing large arrays, the cost of getting a pointer
> to the start of it is utterly swamped by accessing the data itself. Like I
> said, I bet if you do some benchmarking, you'd be hard pressed to find ANY
> improvement of static large arrays over allocated ones.
I think you are right. But I also think that this is a valid, working pattern in C/C++/Delphi programs, so it should be supported. There is value in having a 1:1 port of a program work as-is without refactorings to work around compiler limitations.
Comment #18 by turkeyman — 2015-05-23T07:06:40Z
(In reply to Walter Bright from comment #6)
> I'll fix this, but you should know that this necessarily creates large
> sections in the executable file of repeated data. For such large arrays, it
> is better to allocate and initialize them on program startup.
I reported it because it was a bug, and I agree it's not excellent practice. That said though, Vladimir has already presented my thoughts; it's perfectly valid code, people do it, code like this exists.
In my case, I am porting C++ code, and it's quite a lot of code... it is difficult to refactor and port at the same time.
Regarding .init, this is something I never really thought about too much, but I'm now really concerned. I have been concerned by classinfo's in the past, but somehow init slipped under my radar.
I think a few things need to be considered and/or possible.
1. How do I disable the existence of 'init' for a type? It's conceivable that I want to produce an uninitialised (and uninitialisable) type, like these ones I have here.
2. Any type with a static array naturally has an unreasonably large .init value; what optimisations are possible with relation to init? Can they be allocated+synthesised at init (*cough*) time, rather than built into the exe?
An array is a series of repeated elements, so storing the full array in the binary is not only wasteful, but can only lead to disaster when people put a static array as a member of a type and the length is large, or perhaps the length is fed in by a 3rd party who doesn't have this specific consideration in mind (nor should they).
3. Can D effectively link-strip .init when it is un-referenced? How can we make this possible if there is something preventing it?
I'd love to spend some time working towards D binaries being the same predictable size as C/C++ binaries. For some reason, despite my efforts, I always seem to end up with D binaries that are easily 10 times the size of their counterpart C binary.
In fact, I constantly find myself in the surprising situation where I create a D interface for a C lib, which simply declares extern(C)'s along with minimal D code for adaptation, no actual functional D code in sight, and the .lib it produces is significantly larger than the entire C lib that it represents.
I have never taken the time to explore the problem, I suspect it's just classinfo's and init values... are there other known bloat inducing problems?
> The way globals work on modern CPUs means that you are not saving any execution
> time by using static data. Large static arrays are an artifact of FORTRAN.
This isn't about execution time, it's about perfectly valid code that looks completely benign causing an unexpected explosion to your binary.
Not all programmers are aware or conscious of this sort of thing. It shouldn't be so simple for an unsuspecting (junior?) programmer to make a mess like this, and likely not understand what they've done, or that they've even done it at all.
Comment #19 by dlang-bugzilla — 2015-05-23T07:21:34Z
(In reply to Manu from comment #18)
> 1. How do I disable the existence of 'init' for a type?
I think that if you make sure that all of the initial values are zero bytes, the compiler won't generate a .init block. Instead, the TypeInfo will have a null .init pointer, and the runtime will use that as a clue to simply do a memset instead of copying over the .init data when allocating new instances. You might also be able to use this to ensure that your complex types aren't accidentally creating .init blocks.
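A rough sketch of how one might check that through TypeInfo (assuming the initializer() property of current runtimes; older runtimes expose the same information as init):
struct ZeroInit    { double position = 0; } // all fields are zero bytes
struct NonZeroInit { double position; }     // double.init is NaN, i.e. non-zero bits

void main()
{
    // An all-zero type is reported with a null pointer (and just a length)...
    assert(typeid(ZeroInit).initializer().ptr is null);
    // ...while a non-zero initializer comes with a real .init block.
    assert(typeid(NonZeroInit).initializer().ptr !is null);
}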
> 2. Any type with a static array naturally has an unreasonably large .init
> value; what optimisations are possible with relation to init? Can they be
> allocated+synthesised at init (*cough*) time, rather than built into the exe?
Not at the moment, AFAIK.
> 3. Can D effectively link-strip .init when it is un-referenced? How can we
> make this possible if there is something preventing it?
Each .init would need to be in its own section to allow linker garbage collection. DMD doesn't seem to do this at the moment, though (at least not on Win32/Win64).
Whether to put things in individual sections is usually a trade-off between link time and resulting executable size. It would be great if DMD at least gave the user some control over this. gcc has e.g. -ffunction-sections and -fdata-sections.
> I'd love to spend some time working towards D binaries being the same
> predictable size as C/C++ binaries. For some reason, despite my efforts, I
> always seem to end up with D binaries that are easily 10 times the size of
> their counterpart C binary.
I agree, bloated executables are not nice. This becomes a real problem with proprietary/closed-source applications, since then the compiler is pulling in code and data that is never actually used, and which should not be present in the published executable.
> I have never taken the time to explore the problem, I suspect it's just
> classinfo's and init values... are there other known bloat inducing problems?
Yes.
- Static constructors pull in everything they reference.
- Object.factory requires that all classes that the compiler sees must be instantiatable, which means pulling in their vtables, invariants, virtual methods, and all their dependencies.
- Many things which could be emitted in separate sections are put in one section. As a result, anything that's referenced within that section pulls in everything else from it, and all their dependencies.
- There are probably other problems.
This is generally one of the more neglected aspects of D and the current implementations. People working on embedded D stuff are constantly running into the above problems as well.
Comment #20 by github-bugzilla — 2015-05-24T17:51:53Z