Vector literals in the form:
float4 v = [1,2,3,4];
still don't appear to work...
The literal should be allocated and aligned, ideally PC-relative, right next to the function (at least possible on x64).
Additionally, assigning the literals '0' or [0,0,0,0] should generate the appropriate xor opcode.
Comment #1 by turkeyman — 2012-01-31T17:27:58Z
There's a bunch of other handy immediates too:
-1 = cmpeq(anyReg, anyReg)
- 1 opcode, always better than a load
1, 3, 7, 15, etc. can be generated by shifting -1 right (logical shift) by some immediate
- 2 opcodes, pc-relative load might be better, depends on pipeline
2, 4, 8, 16, etc. can be generated by shifting 1 left
- 3 opcodes, pc-relative load is probably better, depends on pipeline
1.0f (0x3f800000) can be generated by shifting -1 right 25 bits, then left 23 bits
- 3 opcodes, pc-relative load is probably better, depends on pipeline
etc...
Just something to be aware of
Comment #2 by turkeyman — 2012-03-01T13:53:30Z
I'd like to see this promoted above the priority of other SIMD tasks.
This is holding me up. I can't write unit tests without literals.
Comment #3 by WorksOnMyMachine — 2012-03-01T14:16:27Z
Hex float constants don't seem to be expressive enough to produce all 2^32 or 2^64 bit patterns for float or double (NaNs, exponent masks, mantissa-only masks, etc.).
This is a problem for defining more exotic data while maintaining the correct 'type' for the constant (float4 instead of int4, double2 instead of long2, etc.)
Comment #4 by turkeyman — 2012-03-11T12:11:46Z
Walter had planned to use standard array literal syntax as far as I knew:
float4 v = [1.0, 2.0, 3.0, 4.0];
short8 sv = [1,2,3,4,5,6,7,8];
How he intended to disambiguate the type, I have no idea.
Vector literals must be 128-bit aligned, so the compiler needs to recognise they're being assigned to a vector and treat them specially.
It sounds better to have an explicit literal syntax to me.
I guess there's always this option, but it's horrible:
int4 v = __vector([1,2,3,4]);
Comment #5 by lovelydear — 2012-04-21T05:35:44Z
See also issue 7414
Comment #6 by bugzilla — 2012-04-30T22:25:35Z
Note that for the moment you can do things like:
int[4] o = [1,2,3,4];
int4 v1;
v1.array = o;
Comment #7 by turkeyman — 2012-05-01T07:44:29Z
(In reply to comment #6)
> Note that for the moment you can do things like:
>
> int[4] o = [1,2,3,4];
> int4 v1;
> v1.array = o;
Indeed, but that's not something you can just type conveniently in user code.
I wrap that in a little constructor function, which is fine for now.
Proper literals are blocking std.simd though (which I'd really like to finish). I don't think it's fair to say it's done without literals and unit tests, which need an awful lot of literals.
If it's not likely to make the short list though, I'll finish it off with some sort of constructor function in the meantime.
Comment #8 by bugzilla — 2012-05-01T10:40:39Z
I'm looking into it; I just thought that the workaround could keep you going for the moment.
Comment #9 by turkeyman — 2012-05-01T10:51:12Z
(In reply to comment #8)
> I'm looking in to it, I just thought that the workaround could keep you going
> for the moment.
Yeah, no worries. Already on it. Cheers :)
Comment #10 by github-bugzilla — 2012-05-02T02:21:17Z
Haven't done the special case optimizations for constant loading.
Comment #12 by turkeyman — 2012-05-02T06:55:58Z
(In reply to comment #11)
> Haven't done the special case optimizations for constant loading.
No problem, I'm using GDC anyway which might detect those in the back end.
An efficient implementation would certainly use at least an xor for 0 initialisation, and the other tricks will get different mileage depending on the length of the surrounding pipeline. Not accessing memory is always better when there are pipeline cycles to soak up the latency.
Comment #13 by clugdbug — 2012-05-02T11:26:26Z
(In reply to comment #12)
> An efficient implementation would certainly use at least an xor for 0
> initialisation, and the other tricks will get different mileage depending on
> the length of the surrounding pipeline. Not accessing memory is always better
> when there are pipeline cycles to soak up the latency.
The -1 trick is always worth doing, I think. Agner Fog has a nice list in his optimisation manuals, but the only ones _always_ worth doing are the 0 and -1 integer cases, and the 0.0 floating point case (also using xor).
Comment #14 by turkeyman — 2012-05-02T13:06:19Z
(In reply to comment #13)
> The -1 trick is always worth doing, I think. Agner Fog has a nice list in his
> optimisation manuals, but the only ones _always_ worth doing are the 0 and -1
> integer cases, and the 0.0 floating point case (also using xor).
If the compiler knows anything about the pipeline around the code, it should be able to make the best choice about the others.
Comment #15 by clugdbug — 2012-05-02T15:12:18Z
(In reply to comment #14)
> If the compiler knows anything about the pipeline around the code, it should be
> able to make the best choice about the others.
My guess is that it's pretty rare that the alternative sequences are favoured just on the basis of the pipeline, since MOVDQA only uses a load port, and nothing else. Especially on Sandy Bridge or AMD, where there are two load ports.
So I doubt there's much benefit to be had.
By contrast, if there's _any_ chance of a cache miss, they'd be a huge win, but unfortunately that's far beyond the compiler's capabilities.
Comment #16 by turkeyman — 2012-05-03T03:25:43Z
(In reply to comment #15)
> My guess is that it's pretty rare that the alternative sequences are favoured
> just on the basis of the pipeline, since MOVDQA only uses a load port, and
> nothing else. Especially on Sandy Bridge or AMD, where there are two load
> ports.
> So I doubt there's much benefit to be had.
>
> By contrast, if there's _any_ chance of a cache miss, they'd be a huge win, but
> unfortunately that's far beyond the compiler's capabilities.
And that's precisely my reasoning.
If the compiler knows the state of the pipeline around the load and there are no conflicts, i.e. it can slip the instructions in for free between other pipeline stalls, then generating an immediate is always better than touching memory. Schedulers usually do have this information while performing code generation, so it may be possible.
These sorts of considerations are obviously much more critical for non-x86 architectures though, as with basically all optimisations ;)