/* Crude benchmark for comparing dmd/gdc output. */
import std.algorithm : map, filter, reduce;
import std.conv : to;
import std.range : iota;
import std.stdio : writeln;
auto fun(int n)
{
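// Lazy pipeline: multiples of 7, keep the even ones, fold with (a,b) => a/2 + b.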
return iota(n)
.map!(a => a*7)
.filter!(a => a % 2 == 0)
.reduce!((a,b) => a/2 + b);
}
void main() {
writeln(fun(100_000_000));
}
/* RESULTS:
Compiled with:
dmd -release -O -inline test.d -oftest.dmd
gdc -frelease -finline -O3 test.d -o test.gdc
(dmd git HEAD, gdc 5.2.1)
Execution times:
% time test.dmd ; time test.gdc
1399999944
real 0m0.628s
user 0m0.627s
sys 0m0.001s
1399999944
real 0m0.168s
user 0m0.167s
sys 0m0.000s
%
As can be seen, the executable produced by gdc runs about 3.7 times faster than
the executable produced by dmd. Why?
Looking at the disassembly, the first thing that stands out is that gdc has
inlined the call to writeln, whereas dmd calls a separate function. While this
isn't the bottleneck, it gives a hint of the things to come. We look next at
fun(), which in both cases is a standalone function.
The dmd version of fun() is pretty straightforward: it calls iota() to create
the range, then map(), then filter(), and finally reduce() where most of the
work takes place. This is pretty much a direct translation of the code.
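For illustration, the UFCS chain is equivalent to nested function calls, one
per pipeline stage, which matches the calls dmd emits (fun2 is just a
hypothetical name for this rewrite):

auto fun2(int n)
{
    // Same pipeline as fun(), with the UFCS chain desugared into nested calls:
    return reduce!((a, b) => a/2 + b)(
               filter!(a => a % 2 == 0)(
                   map!(a => a*7)(
                       iota(n))));
}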
The gdc version of fun() is markedly different. We immediately notice that the
only function call left in it is a call to enforce(). Not only have the
first-level calls to iota, map, filter, and reduce been inlined; pretty much
their *entire* call trees have been inlined as well, except for the call to
enforce(). (The enforce() comes from reduce(), which rejects an empty range
when no explicit seed is given.)
Here I'd like to highlight that in the dmd version of the code, the call to
iota() in turn makes another function call, to the ctor of the returned range. So
we see that gdc has inlined two levels of function calls where dmd has inlined
none, even though one would expect that with -inline, at least the call from
iota() to the ctor of the range should have been inlined, since it's the only
place where that ctor would be called; iota() itself being merely a thin
wrapper around it. (The ctor itself is also pretty simple; I'm not sure why dmd
fails to inline it.)
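To make the shape concrete, here's a simplified sketch of the pattern being
described (not the actual Phobos source, which is more general):

auto simpleIota(int end)
{
    static struct Result
    {
        int current, pastLast;
        @property bool empty() const { return current == pastLast; }
        @property int front() const { return current; }
        void popFront() { ++current; }
    }
    return Result(0, end);  // the range-construction call dmd leaves un-inlined
}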
Similar remarks apply to the calls to map() and filter() as well.
Now let's look at reduce(), which is where the actual action takes place. The
dmd version, of course, involves a separate function call, which in the grand
scheme of things isn't all that important, since it's only a single function
call. However, a look at the disassembly of reduce() shows that dmd has not
inlined the calls to .empty, .front, and .popFront. In fact, the function calls
yet another function -- reduceImpl -- where the main loop sits. Inside this
main loop, .empty, .front, and .popFront are again called with no inlining --
even though .empty has a trivial body, .front involves only 1 multiplication,
and .popFront only 1 multiplication and a single odd/even test. On top of
this, each of these nested function calls involves a certain amount of
boilerplate: twiddling with the stack registers, shuffling call arguments
about, etc., that add up to quite a large overhead in reduceImpl's inner loop.
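Schematically, the main loop in reduceImpl boils down to the standard
range-consumption idiom; here's a hedged sketch (myReduce is a hypothetical
stand-in; the actual Phobos code differs in detail):

auto myReduce(alias fun, Range)(Range r)
{
    import std.exception : enforce;
    enforce(!r.empty, "cannot reduce an empty range without an explicit seed");
    auto acc = r.front;             // seed = first element
    r.popFront();
    for (; !r.empty; r.popFront())  // in the dmd build, .empty, .front and
        acc = fun(acc, r.front);    // .popFront are real calls on every pass
    return acc;
}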
The gdc version, by contrast, inlines *everything*, except the call to
enforce() which is outside the inner loop. This aggressive inlining allowed gdc
to trim the loop body down to only about 18 instructions, with no function
calls. While the dmd inner loop itself has only 15 instructions, it involves 3
function calls, with .front having 8 instructions, .empty also 8 instructions,
and .popFront 13 instructions, making a total of 44 instructions per iteration.
A significant percentage of these instructions are function call boilerplate.
The entire inner loop in the gdc version would fit in one or two CPU cache
lines, whereas the dmd version, spread across four separate function bodies
plus call boilerplate, would require a lot more.
To dmd's credit, it did manage to inline the nested calls inside .empty,
.front, and .popFront, which would otherwise have involved even more function
calls (each wrapper range forwards these calls to the next range down). This
probably helped to reduce the cost of running their respective function bodies.
However, this isn't quite enough, since the overhead of 3 function calls in the
inner loop is pretty expensive when the cost could have been eliminated
completely, as gdc had done.
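For concreteness, here's roughly what the fully-inlined computation amounts to
when written out by hand. This is a hedged equivalent of fun() (funInlined is
a hypothetical name; the empty-range enforce() check is omitted; gdc's actual
codegen of course differs):

int funInlined(int n)
{
    int acc = 0;                // seed: the first filtered element is 0*7 == 0
    foreach (i; 1 .. n)
    {
        immutable v = i * 7;    // map
        if (v % 2 == 0)         // filter
            acc = acc/2 + v;    // reduce step
    }
    return acc;
}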
*/
Comment #1 by hsteoh — 2015-08-21T05:19:57Z
Further notes:
- gdc not only inlines the call trees of .empty, .front, and .popFront, it also applies further loop optimizations such as strength reduction, refactoring the a => a*7 into an incremental a += b; b += 7 pattern (see the sketch after this list). It's unclear whether dmd is capable of this, but in any case the opportunity is missed: since .popFront was not inlined, the optimizer never got the chance to apply strength reduction.
- gdc's aggressive inlining also allowed various loop counters and accumulators to be held entirely in registers, while the function calls generated by dmd necessitated dereferencing addresses of stack variables, an extra layer of indirection. Again, a missed opportunity due to not inlining aggressively enough.
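To illustrate the strength reduction mentioned above, here's the
transformation rendered as equivalent D source (gdc performs it at the
optimizer level; consume is a hypothetical stand-in for the rest of the loop
body):

void before(int n, void delegate(int) consume)
{
    foreach (i; 0 .. n)
        consume(i * 7);     // one multiplication per iteration
}
void after(int n, void delegate(int) consume)
{
    int b = 0;              // tracks i*7 incrementally
    foreach (i; 0 .. n)
    {
        consume(b);
        b += 7;             // the multiplication becomes an addition
    }
}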
For reference, here's the inner loop produced by gdc:
403b80: 89 d7 mov %edx,%edi
403b82: c1 ef 1f shr $0x1f,%edi
403b85: 8d 34 3a lea (%rdx,%rdi,1),%esi
403b88: 83 c2 07 add $0x7,%edx
403b8b: 83 e6 01 and $0x1,%esi
403b8e: 39 fe cmp %edi,%esi
403b90: 75 1e jne 403bb0 <int test.fun(int)+0x80>
403b92: 89 c6 mov %eax,%esi
403b94: 8d 14 cd 00 00 00 00 lea 0x0(,%rcx,8),%edx
403b9b: c1 ee 1f shr $0x1f,%esi
403b9e: 01 f0 add %esi,%eax
403ba0: 29 ca sub %ecx,%edx
403ba2: d1 f8 sar %eax
403ba4: 01 d0 add %edx,%eax
403ba6: 83 c2 07 add $0x7,%edx
403ba9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
403bb0: 83 c1 01 add $0x1,%ecx
403bb3: 39 cb cmp %ecx,%ebx
403bb5: 75 c9 jne 403b80 <int test.fun(int)+0x50>
Comment #2 by jack — 2016-06-06T14:47:23Z
*** Issue 16120 has been marked as a duplicate of this issue. ***
Comment #3 by robert.schadek — 2024-12-13T18:44:18Z