← Back to index | Original Bugzilla link

Bug 6643 – Very slow compilation for large switch() using -O and -inline

Status: RESOLVED
Resolution: DUPLICATE
Severity: normal
Priority: P2
Component: dmd
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2011-09-10T09:07:00Z
Last change time: 2015-06-09T05:11:55Z
Keywords: performance
Assigned to: nobody
Creator: beirich2

Attachments

ID	Filename	Summary	Content-Type	Size
1023	bugtest.d	Test case	application/octet-stream	76367
1024	trace_log_def.zip	trace.log and trace.def	application/octet-stream	494692

Comments

Comment #0 by beirich2 — 2011-09-10T09:07:24Z

Created attachment 1023 Test case I am attempting to port some CPU emulators from C# to D. It uses a large central switch statement (2221 lines in this case) with a case per opcode; the contents of each opcode handler are 'manually inlined' because in C# the automatic inliner is not very aggressive and there is no explicit inline modifier (same as in D). Building the attached program takes under 1 second with only -inline, about 4-5 seconds with only -O, and around 15 minutes with -O and -inline. -release changes outcomes slightly but not in a significant way. Because of the nature of this issue, it was not possible to reduce this to a smaller test case. It is not the number of case statements in the switch that cause this issue, it is the size or complexity of the code inside the switch statement. It's very possible the issue is simply related to very large methods rather than anything at all to do with switch. The resulting EXE does work and -O -inline is about 17% faster than -O alone when run in a benchmark. It just takes a very long time to build. This isn't a total blocker, however, this is only one CPU and I have about 6-7 I would like to port immediately, and long term, maybe around a dozen. The Z80 CPU has a switch statement which is about 4x bigger than this. As I am very new to the D toolchain, I'm not currently using incremental builds. I assuming this is possible and it should mitigate the problem somewhat. It does make performance testing and optimization an issue, though. I am using DMD 2.054.

Comment #1 by bearophile_hugs — 2011-09-10T10:37:30Z

Are you able to compile DMD and profile it during its compilation of (a reduced version of) this code? This profiling data will probably help to fix this performance bug.

Comment #2 by beirich2 — 2011-09-10T21:13:25Z

Created attachment 1024 trace.log and trace.def First, I commented out half of the case statements, and this dropped the -O -inline compile time from 15 minutes to 2 minutes. So there is some definite non-linearity happening there. Then, I compiled DMD 2.055 with it's "trace" makefile setting. I used that version of DMD to compile the half-size test case with -O -inline. This took a very long time with instrumentation - over an hour. In any case, I attached the resulting trace.log and trace.def. It doesn't really mean anything to me, so I hope it means something to someone here!

Comment #3 by bearophile_hugs — 2011-09-11T04:05:24Z

The head of the trace log, a bit reformatted for clarity: Num Tree Func Calls Time Time 25 687874498 336057504 void cdecl copyprop() -896850051 294735899 294735899 unsigned cdecl vec_index(unsigned ,unsigned *) 3318 77706260 66901704 void cdecl flowcp() 159748 60550176 38034252 LIST *cdecl listrds(unsigned *,elem *,unsigned *) 137723661 5558797 5558797 int cdecl ERTOL(elem *) 48 2881023 2670153 void cdecl flowae() 14607381 1832766 1832766 void cdecl vec_orass(unsigned *,unsigned *) 30662 1028359 1028359 void cdecl updaterd(elem *,unsigned *,unsigned *) 9533245 1008464 1008464 void *cdecl mem_calloc(unsigned ) 6762446 972262 972262 void cdecl vec_subass(unsigned *,unsigned *) 5832042 801878 801878 void cdecl vec_sub(unsigned *,unsigned *,unsigned *) 9542441 1158120 592461 void cdecl vec_free(unsigned *) 8264499 1335987 579438 unsigned *cdecl vec_calloc(unsigned ) 9530121 565679 565679 void cdecl mem_free(void *) 3582398 471076 471076 void cdecl vec_andass(unsigned *,unsigned *) 1270712 411849 160136 unsigned *cdecl vec_clone(unsigned *) 1977317 120103 120103 void cdecl vec_copy(unsigned *,unsigned *) 1281149 97412 97412 int cdecl vec_equal(unsigned *,unsigned *) 25 59795507 95264 void cdecl loopopt()

Comment #4 by bearophile_hugs — 2011-09-11T05:32:04Z

Maybe it's possible to speed up the vec_index function (in srt\tk\vec.c) using smarter "bithacks".

Comment #5 by bearophile_hugs — 2011-09-11T06:10:47Z

With a test (removing @property, replacing immutable with const, and removing the imports) the LDC1 compiler shows normal compilation speeds with every combination of switches I have tried. Note: probably LDC1 disables the inlining logic of the D front-end and uses the LLVM one.

Comment #6 by yebblies — 2011-09-11T08:26:07Z

This is (from what I can see) a problem with the optimizer, a similar kind of bug to issue 2396. It doesn't seem to have anything to do with the switch, but looks like a slow algorithm trying to reorder assignments, and choking on a 2000+ line function. Doing some regex magic and placing each case block in a (){}(); delegate call gives normal compile times for me. Maybe this will be a suitable workaround for you. Unfortunately there are far fewer people working on the dmd backend than the frontend, so depending on the difficulty it may be a while before this is fixed.

Comment #7 by yebblies — 2012-02-01T06:26:44Z

*** This issue has been marked as a duplicate of issue 2396 ***