The -version=foo build, which declares the complex double variables at function scope instead of loop scope, runs about 5x slower than the default build; build and timing commands to reproduce follow the code below.
import std.stdio, std.string;

// Mandelbrot set written as a PBM (P4) bitmap, after the shootout benchmark.
void main(char[][] args)
{
    char bit_num = 0, byte_acc = 0;
    const int iter = 50;
    const double lim = 2.0 * 2.0;

    version(foo)
    {
        cdouble Z, C;   // declared at function scope
    }

    int n = atoi(args[1]);
    writefln("P4\n%d %d", n, n);

    for(int y = 0; y < n; y++)
        for(int x = 0; x < n; x++)
        {
            version(foo)
            {}
            else
            {
                cdouble Z, C;   // declared at loop scope
            }
            Z = 0 + 0i;
            C = 2*cast(double)x/n - 1.5 + 2i*cast(double)y/n - 1i;
            for(int i = 0; i < iter && norm(Z) <= lim; i++)
                Z = Z*Z + C;

            byte_acc = (byte_acc << 1) | ((norm(Z) > lim) ? 0x00 : 0x01);
            bit_num++;
            if(bit_num == 8)
            {
                putc(byte_acc, stdout);
                bit_num = byte_acc = 0;
            }
            else if(x == n-1)   // pad and flush the last byte of each row
            {
                byte_acc <<= (8 - n%8);
                putc(byte_acc, stdout);
                bit_num = byte_acc = 0;
            }
        }
}

double norm(cdouble C)
{
    return C.re*C.re + C.im*C.im;
}
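To reproduce, build the program twice, once with -version=foo and once without, and time both runs externally; for example (file and binary names here are just placeholders):
$ dmd -O -inline mandel.d
$ dmd -O -inline -version=foo -ofmandelfoo mandel.d
$ time ./mandel 1000 > out.pbm
$ time ./mandelfoo 1000 > out.pbm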
Comment #1 by jpelcis — 2006-07-16T21:11:58Z
I used the following code to time it:
long getCount () {
    asm {
        naked;
        rdtsc;  // cycle count returned in EDX:EAX, i.e. the long return value
        ret;
    }
}
and had the first and last lines of main check the time. I got a performance difference, but it was under 10%. What parameter were you passing and how were you testing the time?
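Roughly along these lines (the final writefln is just illustrative):
void main(char[][] args)
{
    long start = getCount();
    // ... body of main from comment #0 ...
    long end = getCount();
    writefln("cycles elapsed: %d", end - start);
}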
Comment #2 by godaves — 2006-07-16T22:53:23Z
I timed it (the exact code posted) on Linux/P4 and Win32/AMD64, using an external timer in both cases, and in both cases the difference held steady at about 5x over several runs. I was using 1000 as n.
When I added your timer function the difference virtually disappeared, with the slower build now running at the better time once your internal timer was added.
Sure looks like an alignment issue, which can be transitive and sensitive to seemingly unrelated changes in the code elsewhere (which makes it all the more frustrating). I've run into this with math.pow() too.
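One quick way to test the alignment theory (just a sketch, using the variable names from comment #0) would be to print Z's stack address once in each build and compare:
// hypothetical diagnostic: place inside the inner loop where Z is in scope,
// guarded so it prints only once, then compare the two builds
if (x == 0 && y == 0)
    writefln("address of Z mod 16 = %s", cast(size_t)&Z % 16);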
Comment #3 by witold.baryluk+d — 2007-06-21T11:31:08Z
Timings (n=1000, Athlon 1GHz) of the code from comment #0 (zero changes):
dmd-1.015: 1.95 sec
dmd-1.015 -version=foo: 2.03 sec
dmd-1.015 -O -inline: 1.03 sec
dmd-1.015 -O -inline -version=foo: 1.60 sec
Measured with the bash builtin time. The optimised version without foo is faster because the local variables can be properly aligned on the stack and access to them can be optimised.
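For context, the compiler can report what alignment the type itself requires; a quick check (the printed values depend on platform and compiler version):
import std.stdio;
void main()
{
    // sizeof is the storage size, alignof the alignment the compiler uses for the type
    writefln("cdouble.sizeof = %s, cdouble.alignof = %s",
             cdouble.sizeof, cdouble.alignof);
}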
Comment #4 by andrei — 2018-01-16T19:39:13Z
I adapted the code to the current D compiler; here it is:
=================================================
import std.stdio, std.string;

void main(char[][] args)
{
    char bit_num = 0, byte_acc = 0;
    const int iter = 50;
    const double lim = 2.0 * 2.0;

    version(foo)
    {
        cdouble Z, C;
    }

    import core.stdc.stdlib;
    int n = atoi(args[1].toStringz);
    writefln("P4\n%d %d", n, n);

    for(int y = 0; y < n; y++)
        for(int x = 0; x < n; x++)
        {
            version(foo)
            {}
            else
            {
                cdouble Z, C;
            }
            Z = 0 + 0i;
            C = 2*cast(double)x/n - 1.5 + 2i*cast(double)y/n - 1i;
            for(int i = 0; i < iter && norm(Z) <= lim; i++)
                Z = Z*Z + C;

            byte_acc = cast(char) ((byte_acc << 1) | ((norm(Z) > lim) ? 0x00 : 0x01));
            bit_num++;
            if(bit_num == 8)
            {
                //putc(byte_acc,core.stdc.stdio.stdout);
                bit_num = byte_acc = 0;
            }
            else if(x == n-1)
            {
                byte_acc <<= (8 - n%8);
                //putc(byte_acc,core.stdc.stdio.stdout);
                bit_num = byte_acc = 0;
            }
        }
}

double norm(cdouble C)
{
    return C.re*C.re + C.im*C.im;
}
=================================================
Then I built two versions like this (the file is test.d):
=================================================
$ dmd -O -inline -release -version=foo -oftestfoo test
$ dmd -O -inline -release test
=================================================
Then I measured like this:
=================================================
$ time ./testfoo 10000
P4
10000 10000
./testfoo 10000 17.70s user 0.00s system 99% cpu 17.704 total
$ time ./test 10000
P4
10000 10000
./test 10000 17.71s user 0.00s system 99% cpu 17.714 total
=================================================
I'll close this as "works for me"; please reopen if I missed something. Thanks!