Comment #0 by bearophile_hugs — 2010-04-24T16:57:35Z
std.numeric.gcd can use a faster Binary GCD algorithm, especially when the input type is unsigned. This page has both C code (and asm, but the C code is probably enough in many situations):
http://en.wikipedia.org/wiki/Binary_GCD_algorithm
Comment #1 by Marco.Leise — 2012-01-31T12:11:35Z
Replace uint with ulong for the longer version ;)
I use this and it is notably faster than what I used before.
uint gcd(uint u, uint v)
{
int shift;
if (u == 0 || v == 0) return u | v;
for (shift = 0; ((u | v) & 1) == 0; ++shift) {
u >>= 1;
v >>= 1;
}
while ((u & 1) == 0) u >>= 1;
do {
while ((v & 1) == 0) v >>= 1;
if (u < v) {
v -= u;
} else {
uint diff = u - v;
u = v;
v = diff;
}
v >>= 1;
} while (v != 0);
return u << shift;
}
Comment #2 by peter.alexander.au — 2013-01-05T11:15:04Z
(In reply to comment #1)
> Replace uint with ulong for the longer version ;)
> I use this and it is notably faster than what I used before.
I implemented this (exactly as you have it) and it was slower than the algorithm that is already there. I tested on all pairs of integers below 10,000, and also on the pairs (x^2, y) for all x,y < 10,000. At best it was 50% slower, at worst 3x slower.
All tests used dmd -O -release -inline
I suspect the reason for the performance reduction is due to poor pipelining. The binary version involves a lot more branching, and more loop iterations than the standard algorithm. Also, the branches taken are highly unpredictable.
Maybe I'll look at this again in the future to try and make it faster, but it's pretty low on my priority list.
Comment #3 by bearophile_hugs — 2013-01-06T03:44:33Z
(In reply to comment #2)
> I implemented this (exactly as you have it) and it was slower than the
> algorithm that is already there. I tested on all pairs of integers below
> 10,000, and also on the pairs (x^2, y) for all x,y < 10,000. At best it was 50%
> slower, at worst 3x slower.
> ...
> Maybe I'll look at this again in the future to try and make it faster, but it's
> pretty low on my priority list.
Thank you for doing some experiments. Once the experiments are conclusive, this enhancement can be closed.
(Then at the moment a more important function for Phobos is an efficient GCD for bigints.)
Comment #4 by clugdbug — 2013-01-08T02:37:41Z
FWIW, you can get rid of most of the conditional branches by using:
min(u,v) = v + ( (cast(int)(u-v)) >> (8*int.sizeof - 1)) & (u-v)
the shift smears the sign bit of u-v so that it makes a mask either 0x0000_0000 or 0xFFFF_FFFF.
I think the general consensus is that (at least if you use asm), binary GCD is faster on all known processors, but not necessarily by a large amount.
Comment #5 by lomereiter — 2013-01-08T06:42:51Z
(In reply to comment #0)
> std.numeric.gcd can use a faster Binary GCD algorithm, especially when the
> input type is unsigned. This page has both C code (and asm, but the C code is
> probably enough in many situations):
>
> http://en.wikipedia.org/wiki/Binary_GCD_algorithm
Maybe instead of reinventing the wheel LibTomMath library should be used? It is in public domain, has decent performance, and is stable enough to provide implementation of big integers in TCL and Rubinius.
Comment #6 by bearophile_hugs — 2014-01-07T18:04:17Z
Created attachment 1313
Faster GCD
Code for a binary GCD that on LDC2 is about twice faster on uint values, and it's rather faster with dmd too.
Timings ("gcd" is the Phobos one):
DMD 2.061alpha (32 bit):
gcd, time = 1924
gcd_recursive, time = 1937
gcd_iterative_sub, time = 2931
gcd_iterative_mod, time = 1948
gcd_binary, time = 3357
gcd_binary_mod, time = 2136
gcd_binary_2, time = 1406
gcd_binary_2, time = 1391
Results sum: 1284258816
LDC2 (32 bit):
gcd, time = 1930
gcd_recursive, time = 1924
gcd_iterative_sub, time = 3148
gcd_iterative_mod, time = 1926
gcd_binary, time = 2635
gcd_binary_mod, time = 2036
gcd_binary_2, time = 1461
gcd_binary_3, time = 1026
Results sum: 1284258816
Comment #7 by alexandru.razvan.c — 2016-12-09T15:46:15Z
After conducting some benchmarks we arrived to the conclusion that currently the dmd compiler works best with Euclid's algorithm for GCD, otherwise we use Stein's algorithm for ldc or gdc.
I tested both the previously shared benchmarks on the forum and a couple of my own and noted that gcd_binary2 implemented by [email protected] has the best results, so I used that.
PR: https://github.com/dlang/phobos/pull/4940
Comment #8 by github-bugzilla — 2016-12-12T14:20:02Z