Created attachment 1208
fast divison by constant for uint values
It's a common knowledge and is totally expected for any modern compiler to
re-write divisions by constant to double word width _multiplication_ by a
specific constant followed by shift.
DMD still DOESN'T DO it.
Attached is an example of this optimization done via meta-programming in D
along with test and a benchmark. On average (on my PC) the mul version is
around 3 times faster then compiler-generated div. On LDC and GDC results are
that compiler-generate version is a bit faster.
Compiler can easily do a better job especially with 64bit values (as 2x64
accumulator is completely unaccessible for the programmer).
For full description of one such technique see Agner Fog's manuals on assembly
optimizations: http://www.agner.org/optimize/optimizing_assembly.pdf
See the chapter 16. "Problematic instructions", section on DIV/IDIV is 16.9.
Comment #1 by bearophile_hugs — 2013-04-11T11:00:21Z
Dupe of 5607 ?
(Even if it's a dupe you have good code in attach)
Comment #2 by dmitry.olsh — 2013-04-11T11:16:38Z
(In reply to comment #1)
> Dupe of 5607 ?
5607 is a special case of this enhancement.
This one is generalization since any division by constant can be optimized.
Using small divisors only allows some more specialized varations of this optimization (less accurate but enough for numbers in question).
>
> (Even if it's a dupe you have good code in attach)
I'd say this one supersedes 5607, so how about merge your example here and close 5607 ? (the 2 have to be merged anyway since the root cause is the same)
Comment #3 by bearophile_hugs — 2013-04-11T11:33:55Z
(In reply to comment #2)
> any division by constant can be optimized.
I didn't know this.
> I'd say this one supersedes 5607, so how about merge your example here and
> close 5607 ? (the 2 have to be merged anyway since the root cause is the same)
If you think this supersedes 5607, then copy what you think should be copied, and close it.
Comment #4 by dmitry.olsh — 2013-04-11T12:28:08Z
*** Issue 5607 has been marked as a duplicate of this issue. ***
Comment #5 by dmitry.olsh — 2013-04-11T12:33:09Z
Another data point taken from issue 5607 on div/mod heavy program seems to confirm my estimate of 3x+ speed difference with mul-trick applied.
The program is bound on div/mul operaiton so other differences of GCC vs DMC codegen has little effect.
---------------------------------
Timings, n = 100_000_000:
GCC: 3.13 s
DMD: 11.69 s
dmd -O -release -inline testd.d
gcc -O3 -s testc.c -o testc.exe
32 bit
GCC 4.5.2
DMD 2.052beta
---------------------------------
// D2 code
import std.c.stdio: printf;
int main() {
uint total = 0;
uint i;
for (i = 0; i < 100000000; i++) {
uint n = i;
uint res = n % 10;
n /= 10;
while (n != 0) {
res = (n % 10) + (10 * res);
n /= 10;
}
total += res;
}
printf("%u\n", total); // 461784529
return 0;
}
---------------------------------
// C code
#include "stdio.h"
typedef unsigned long uint;
int main() {
uint total = 0;
uint i;
for (i = 0; i < 100000000; i++) {
uint n = i;
uint res = n % 10;
n /= 10;
while (n != 0) {
res = (n % 10) + (10 * res);
n /= 10;
}
total += res;
}
printf("%u\n", total); // 461784529
return 0;
}
---------------------------------
DMD inner loop:
L1C: lea EBX,[EBX*4][EBX]
mov EAX,ESI
mov ECX,0Ah
xor EDX,EDX
add EBX,EBX
div ECX
add EBX,EDX
test EAX,EAX
mov ESI,EAX
jne L1C
---------------------------------
GCC inner loop:
L7:
movl %ecx, %eax
mull %esi
leal (%ebx,%ebx,4), %ebx
shrl $3, %edx
leal (%edx,%edx,4), %eax
addl %eax, %eax
subl %eax, %ecx
testl %edx, %edx
leal (%ecx,%ebx,2), %ebx
movl %edx, %ecx
jne L7