Bug 6829 – Unsigned rotate standard function in Phobos, or rotate intrinsic in core.bitop
Status
RESOLVED
Resolution
WORKSFORME
Severity
enhancement
Priority
P2
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2011-10-18T15:10:25Z
Last change time
2021-06-06T16:21:43Z
Assigned to
No Owner
Creator
bearophile_hugs
Comments
Comment #0 by bearophile_hugs — 2011-10-18T15:10:25Z
I think Phobos needs one or two functions to rotate the bits of an unsigned integral value. DMD is now able to optimize a specific coding pattern into the CPU's rotate instructions, but it's easy to write the operation in a way that keeps the pattern from being recognized.
A Phobos template function (with a constraint that lets it accept only the right input types) would let programmers explicitly request this operation from every present and future D compiler, with no risk of mistakes.
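As an illustration, here is a hypothetical sketch (not code from this report) of a perfectly correct rotate that a simple pattern matcher is likely to miss, because a branch hides the two-shift-plus-or shape:
// Hypothetical example: a correct rotate-left that special-cases n == 0.
// The branch breaks the (x << n) | (x >> (32 - n)) shape that a back-end
// pattern matcher looks for, so shifts and an OR may be emitted instead
// of a single rol instruction.
uint rolBranchy(uint x, uint n) pure nothrow
{
    if (n == 0)
        return x;
    return (x << n) | (x >> (32 - n));
}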
See also:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=146892
Comment #1 by hsteoh — 2013-07-09T09:05:22Z
The implementation should probably go into std.bitmanip or core.bitop.
What's the pattern that DMD recognizes for rotate instructions?
Comment #2 by bearophile_hugs — 2013-07-09T10:13:38Z
(In reply to comment #1)
> What's the pattern that DMD recognizes for rotate instructions?
Walter offers this example of recognizable rotation:
void test236() {
    uint a;
    int shift;
    a = 7;
    shift = 1;
    int r;
    r = (a >> shift) | (a << (int.sizeof * 8 - shift));
    assert(r == 0x8000_0003);
    r = (a << shift) | (a >> (int.sizeof * 8 - shift));
    assert(a == 7);
}
Comment #3 by hsteoh — 2013-07-09T12:06:57Z
Hmph. That doesn't work. Compiling without -O just makes DMD translate it literally into shl/shr followed by or. Compiling with -O constant-folds the values away, so I changed the code to make a and shift parameters, but DMD still uses shl/shr followed by or instead of replacing it with rol/ror, as claimed.
So basically, the pattern either isn't there, or isn't working properly.
Comment #4 by hsteoh — 2013-07-09T12:20:16Z
In fact, not even gdc -O3 recognizes this pattern.
Comment #5 by bearophile_hugs — 2013-07-10T02:11:50Z
I think this should be a recognizable rotate left function:
private static uint rol(in uint x, in uint y) pure nothrow {
    return (x << y) | (x >> (32 - y));
}
Comment #6 by ibuclaw — 2013-07-10T02:38:38Z
(In reply to comment #5)
> I think this should be a recognizable rotate left function:
>
> private static uint rol(in uint x, in uint y) pure nothrow {
> return (x << y) | (x >> (32 - y));
> }
It is (in gdc with -O :)
_D3rol3rolFNaNbxkxkZk:
mov x, %eax
mov y, %ecx
rol %cl, %eax
ret
Comment #7 by ibuclaw — 2013-07-10T02:50:07Z
(In reply to comment #5)
> I think this should be a recognizable rotate left function:
>
> private static uint rol(in uint x, in uint y) pure nothrow {
> return (x << y) | (x >> (32 - y));
> }
And likewise for the rotate right function:
uint
ror (in uint x, in uint y) pure nothrow
{
    return (x >> y) | (x << (32 - y));
}
Comment #8 by bearophile_hugs — 2013-07-10T03:28:19Z
(In reply to comment #6)
> (In reply to comment #5)
> > I think this should be a recognizable rotate left function:
> >
> > private static uint rol(in uint x, in uint y) pure nothrow {
> > return (x << y) | (x >> (32 - y));
> > }
>
> It is (in gdc with -O :)
>
> _D3rol3rolFNaNbxkxkZk:
> mov x, %eax
> mov y, %ecx
> rol %cl, %eax
> ret
Both dmd and ldc2 recognize the patterns:
uint rol(in uint x, in uint y) pure nothrow {
    return (x << y) | (x >> (32 - y));
}
uint ror(in uint x, in uint y) pure nothrow {
    return (x >> y) | (x << (32 - y));
}
extern(C) uint rol2(in uint x, in uint y) pure nothrow {
    return (x << y) | (x >> (32 - y));
}
void main() {}
/*
dmd -O
_D5temp23rolFNaNbxkxkZk:
push EAX
mov EAX,8[ESP]
mov ECX,[ESP]
rol EAX,CL
pop ECX
ret 4
_D5temp23rorFNaNbxkxkZk:
push EAX
mov EAX,8[ESP]
mov ECX,[ESP]
ror EAX,CL
pop ECX
ret 4
_rol2 :
mov EAX,4[ESP]
mov ECX,8[ESP]
rol EAX,CL
ret
----------------
ldmd2 -O -output-s
__D5temp23rolFNaNbxkxkZk:
movl 4(%esp), %edx
movb %al, %cl
roll %cl, %edx
movl %edx, %eax
ret $4
__D5temp23rorFNaNbxkxkZk:
movl 4(%esp), %edx
movb %al, %cl
rorl %cl, %edx
movl %edx, %eax
ret $4
_rol2:
movb 8(%esp), %cl
movl 4(%esp), %eax
roll %cl, %eax
ret
*/
Comment #9 by ibuclaw — 2013-07-10T03:49:49Z
(In reply to comment #8)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > I think this should be a recognizable rotate left function:
> > >
> > > private static uint rol(in uint x, in uint y) pure nothrow {
> > > return (x << y) | (x >> (32 - y));
> > > }
> >
> > It is (in gdc with -O :)
> >
> > _D3rol3rolFNaNbxkxkZk:
> > mov x, %eax
> > mov y, %ecx
> > rol %cl, %eax
> > ret
>
>
> Both dmd and ldc2 recognize the patterns:
>
Excellent, so we could put these as templates then. :)
T rol (T)(in T x, in uint y) pure nothrow
{
    return cast(T)((x << y) | (x >> ((T.sizeof * 8) - y)));
}
T ror (T)(in T x, in uint y) pure nothrow
{
    return cast(T)((x >> y) | (x << ((T.sizeof * 8) - y)));
}
// Tests to check for assembly output.
extern ubyte xb;
extern ushort xs;
extern uint xi;
extern ulong xl;
extern uint yi;
void main()
{
    rol (xb, yi); // rolb
    ror (xb, yi); // rorb
    rol (xs, yi); // rolw
    ror (xs, yi); // rorw
    rol (xi, yi); // roll
    ror (xi, yi); // rorl
    rol (xl, yi); // version(X86_64) rolq
    ror (xl, yi); // version(X86_64) rorq
}
Comment #10 by bearophile_hugs — 2013-07-10T04:15:16Z
(In reply to comment #9)
> Excellent, so could put these as templates then. :)
I have improved your code a little. The 32-bit asm from ldc2 and dmd follows:
import std.traits: isIntegral, isUnsigned;

T rol(T)(in T x, in uint y) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
{
    return cast(T)((x << y) | (x >> ((T.sizeof * 8) - y)));
}

T ror(T)(in T x, in uint y) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
{
    return cast(T)((x >> y) | (x << ((T.sizeof * 8) - y)));
}
void main() {
    // Tests to check for assembly output.
    {
        __gshared static ubyte xb;
        __gshared static ushort xs;
        __gshared static uint xi;
        __gshared static ulong xl;
        __gshared static uint yi;
        rol(xb, yi); // rolb
        ror(xb, yi); // rorb
        rol(xs, yi); // rolw
        ror(xs, yi); // rorw
        rol(xi, yi); // roll
        ror(xi, yi); // rorl
        rol(xl, yi); // version(X86_64) rolq
        ror(xl, yi); // version(X86_64) rorq
    }
}
/*
ldmd2 -O -output-s
__D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
pushl %esi
movzbl 8(%esp), %edx
movb %al, %cl
movl %edx, %esi
shll %cl, %esi
movl $8, %ecx
subl %eax, %ecx
shrl %cl, %edx
orl %esi, %edx
movzbl %dl, %eax
popl %esi
ret $4
__D5temp210__T3rorThZ3rorFNaNbNfxhxkZh:
pushl %esi
movzbl 8(%esp), %edx
movb %al, %cl
movl %edx, %esi
shrl %cl, %esi
movl $8, %ecx
subl %eax, %ecx
shll %cl, %edx
orl %esi, %edx
movzbl %dl, %eax
popl %esi
ret $4
__D5temp210__T3rolTtZ3rolFNaNbNfxtxkZt:
pushl %esi
movzwl 8(%esp), %edx
movb %al, %cl
movl %edx, %esi
shll %cl, %esi
movl $16, %ecx
subl %eax, %ecx
shrl %cl, %edx
orl %esi, %edx
movzwl %dx, %eax
popl %esi
ret $4
__D5temp210__T3rorTtZ3rorFNaNbNfxtxkZt:
pushl %esi
movzwl 8(%esp), %edx
movb %al, %cl
movl %edx, %esi
shrl %cl, %esi
movl $16, %ecx
subl %eax, %ecx
shll %cl, %edx
orl %esi, %edx
movzwl %dx, %eax
popl %esi
ret $4
__D5temp210__T3rolTkZ3rolFNaNbNfxkxkZk:
movl 4(%esp), %edx
movb %al, %cl
roll %cl, %edx
movl %edx, %eax
ret $4
__D5temp210__T3rorTkZ3rorFNaNbNfxkxkZk:
movl 4(%esp), %edx
movb %al, %cl
rorl %cl, %edx
movl %edx, %eax
ret $4
__D5temp210__T3rolTmZ3rolFNaNbNfxmxkZm:
pushl %ebp
pushl %ebx
pushl %edi
pushl %esi
movl %eax, %ebx
movl $64, %ecx
subl %ebx, %ecx
movl 24(%esp), %edx
movl 20(%esp), %eax
movl %eax, %esi
shrdl %cl, %edx, %esi
movl %edx, %edi
shrl %cl, %edi
xorl %ebp, %ebp
testb $32, %cl
cmovnel %edi, %esi
cmovnel %ebp, %edi
movb %bl, %cl
shldl %cl, %eax, %edx
movb %bl, %cl
shll %cl, %eax
testb $32, %bl
cmovnel %eax, %edx
cmovnel %ebp, %eax
orl %esi, %eax
orl %edi, %edx
popl %esi
popl %edi
popl %ebx
popl %ebp
ret $8
__D5temp210__T3rorTmZ3rorFNaNbNfxmxkZm:
pushl %ebp
pushl %ebx
pushl %edi
pushl %esi
movl %eax, %ecx
movl 24(%esp), %edx
movl 20(%esp), %eax
movl %eax, %esi
shrdl %cl, %edx, %esi
movl %edx, %edi
shrl %cl, %edi
xorl %ebp, %ebp
testb $32, %cl
cmovnel %edi, %esi
cmovnel %ebp, %edi
movl $64, %ebx
subl %ecx, %ebx
movb %bl, %cl
shldl %cl, %eax, %edx
movb %bl, %cl
shll %cl, %eax
testb $32, %bl
cmovnel %eax, %edx
cmovnel %ebp, %eax
orl %esi, %eax
orl %edi, %edx
popl %esi
popl %edi
popl %ebx
popl %ebp
ret $8
--------------------------------
dmd -O
_D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
push EAX
push EAX
movzx ECX,byte ptr 0Ch[ESP]
mov EAX,ECX
mov 0[ESP],ECX
mov CL,4[ESP]
mov EDX,0[ESP]
shl AL,CL
mov ECX,8
sub ECX,4[ESP]
sar EDX,CL
add ESP,8
or AL,DL
ret 4
_D5temp210__T3rorThZ3rorFNaNbNfxhxkZh:
push EAX
push EAX
movzx ECX,byte ptr 0Ch[ESP]
mov EAX,ECX
mov 0[ESP],ECX
mov ECX,8
sub CL,4[ESP]
mov EDX,0[ESP]
shl AL,CL
mov ECX,4[ESP]
sar EDX,CL
add ESP,8
or AL,DL
ret 4
_D5temp210__T3rolTtZ3rolFNaNbNfxtxkZt:
push EAX
mov AX,8[ESP]
mov CX,[ESP]
mov DX,8[ESP]
shl EAX,CL
mov ECX,010h
sub ECX,[ESP]
and EDX,0FFFFh
sar EDX,CL
pop ECX
or EAX,EDX
ret 4
_D5temp210__T3rorTtZ3rorFNaNbNfxtxkZt:
push EAX
mov ECX,010h
mov AX,8[ESP]
sub ECX,[ESP]
mov DX,8[ESP]
shl EAX,CL
mov ECX,[ESP]
and EDX,0FFFFh
sar EDX,CL
or EAX,EDX
pop ECX
ret 4
_D5temp210__T3rolTkZ3rolFNaNbNfxkxkZk:
push EAX
mov EAX,8[ESP]
mov ECX,[ESP]
rol EAX,CL
pop ECX
ret 4
_D5temp210__T3rorTkZ3rorFNaNbNfxkxkZk:
push EAX
mov EAX,8[ESP]
mov ECX,[ESP]
ror EAX,CL
pop ECX
ret 4
_D5temp210__T3rolTmZ3rolFNaNbNfxmxkZm:
push EAX
mov ECX,0[ESP]
mov EDX,0Ch[ESP]
push EBX
mov EAX,0Ch[ESP]
test CL,020h
jne L1A
shld EDX,EAX,CL
shl EAX,CL
jmp short L20
L1A: shl EAX,CL
mov EDX,EAX
xor EAX,EAX
L20: mov ECX,040h
mov EBX,0Ch[ESP]
push EDX
mov EDX,014h[ESP]
sub ECX,8[ESP]
test CL,020h
jne L3E
shrd EBX,EDX,CL
shr EDX,CL
jmp short L44
L3E: shr EDX,CL
mov EBX,EDX
xor EDX,EDX
L44: mov ECX,EDX
or EAX,EBX
pop EDX
or EDX,ECX
pop EBX
pop ECX
ret 8
_D5temp210__T3rorTmZ3rorFNaNbNfxmxkZm:
push EAX
mov ECX,0[ESP]
mov EDX,0Ch[ESP]
push EBX
mov EAX,0Ch[ESP]
test CL,020h
jne L1A
shrd EAX,EDX,CL
shr EDX,CL
jmp short L20
L1A: shr EDX,CL
mov EAX,EDX
xor EDX,EDX
L20: mov ECX,040h
mov EBX,0Ch[ESP]
push EDX
mov EDX,014h[ESP]
sub ECX,8[ESP]
test CL,020h
jne L3E
shld EDX,EBX,CL
shl EBX,CL
jmp short L44
L3E: shl EBX,CL
mov EDX,EBX
xor EBX,EBX
L44: mov ECX,EDX
or EAX,EBX
pop EDX
or EDX,ECX
pop EBX
pop ECX
ret 8
*/
Comment #11 by ibuclaw — 2013-07-10T04:57:05Z
(In reply to comment #10)
> (In reply to comment #9)
>
> > Excellent, so could put these as templates then. :)
>
> I have improved your code a little. The 32-bit asm from ldc2 and dmd follows:
>
gdc > (ldc && dmd);
It has no problem detecting all those cases. :o)
Comment #12 by bearophile_hugs — 2013-07-10T06:45:05Z
A bit better, with a pre-condition to catch some bugs:
import std.traits: isIntegral, isUnsigned;

T rol(T)(in T x, in uint n) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
in {
    assert(n < (T.sizeof * 8));
} body {
    return cast(T)((x << n) | (x >> ((T.sizeof * 8) - n)));
}

T ror(T)(in T x, in uint n) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
in {
    assert(n < (T.sizeof * 8));
} body {
    return cast(T)((x >> n) | (x << ((T.sizeof * 8) - n)));
}
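Note that the contract still admits n == 0, for which the complementary shift amount equals the full bit width of T, and shifting by the full width is outside the defined range of D's shift operators. A branch-free variant that masks both shift amounts (my addition, not from the thread; in my experience this masked shape is also widely recognized by optimizers as a rotate):
import std.traits: isUnsigned;

// Sketch: mask both shift amounts into [0, bits). For n in [1, bits),
// -n & mask equals bits - n, so this is the same rotate as above; for
// n == 0 both shifts become 0 and x is returned unchanged.
T rolSafe(T)(in T x, in uint n) @safe pure nothrow
if (isUnsigned!T)
{
    enum uint mask = T.sizeof * 8 - 1;
    return cast(T)((x << (n & mask)) | (x >> (-n & mask)));
}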
--------------
(In reply to comment #11)
> gdc > (ldc && dmd);
>
> It has no problem detecting all those cases. :o)
But is that asm generated by gdc actually faster than this asm generated by ldc2?
One of the main points of adding these two templated functions to core.bitop is to have a single, standard pair of rotation functions that all present and future D compilers can recognize and optimize well.
Comment #13 by bearophile_hugs — 2013-07-10T06:46:44Z
(In reply to comment #11)
> It has no problem detecting all those cases. :o)
Perhaps you want to show the asm generated by gdc for those functions?
(Perhaps here there's material for a small enhancement request for LLVM.)
Comment #14 by ibuclaw — 2013-07-10T06:53:27Z
(In reply to comment #12)
> (In reply to comment #11)
>
> > gdc > (ldc && dmd);
> >
> > It has no problem detecting all those cases. :o)
>
> But is that asm generated by gdc actually faster than this asm generated by
> ldc2?
>
Well, I would have to assume that ro{lr}{bwlq} is faster than shl/shr. Someone should get speed reports in. =)
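For the record, a rough micro-benchmark sketch one could use for such a speed report (my addition; std.datetime.stopwatch is today's API and postdates this thread):
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

// The recognizable rotate-left pattern from comment #5.
uint rolFast(uint x, uint n) pure nothrow
{
    return (x << n) | (x >> (32 - n));
}

void main()
{
    uint acc = 2_463_534_242u;
    auto sw = StopWatch(AutoStart.yes);
    foreach (uint i; 0 .. 100_000_000)
        acc = rolFast(acc + i, i % 31 + 1); // shift amounts 1..31
    sw.stop();
    // Print acc so the optimizer cannot discard the loop entirely.
    writeln(sw.peek(), " acc=", acc);
}
Comparing a build where the back-end emits rol against one where it emits shl/shr/or (e.g. by rewriting rolFast in an unrecognized form) would give the numbers asked for here.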
Comment #15 by ibuclaw — 2013-07-10T06:55:08Z
(In reply to comment #13)
> (In reply to comment #11)
>
> > It has no problem detecting all those cases. :o)
>
> Perhaps you want to show the asm generated by gdc for those functions?
>
> (Perhaps here there's material for a small enhancement request for LLVM.)
You've seen the assembler output for roll/rorl. It's identical except for the operand suffix, e.g. rolb/rorb for ubyte.
But to give you a comparison:
// LDC
__D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
pushl %esi
movzbl 8(%esp), %edx
movb %al, %cl
movl %edx, %esi
shll %cl, %esi
movl $8, %ecx
subl %eax, %ecx
shrl %cl, %edx
orl %esi, %edx
movzbl %dl, %eax
popl %esi
ret $4
vs.
// GDC
__D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
movl %edi, %eax
movl %esi, %ecx
rolb %cl, %al
ret
Comment #16 by ibuclaw — 2013-07-10T06:59:38Z
(In reply to comment #13)
> (In reply to comment #11)
>
> > It has no problem detecting all those cases. :o)
>
> Perhaps you want to show the asm generated by gdc for those functions?
>
> (Perhaps here there's material for a small enhancement request for LLVM.)
Full listing:
_D4temp10__T3rolThZ3rolFNaNbNfxhxkZh:
movl %edi, %eax
movl %esi, %ecx
rolb %cl, %al
ret
_D4temp10__T3rorThZ3rorFNaNbNfxhxkZh:
movl %edi, %eax
movl %esi, %ecx
rorb %cl, %al
ret
_D4temp10__T3rolTtZ3rolFNaNbNfxtxkZt:
movl %edi, %eax
movl %esi, %ecx
rolw %cl, %ax
ret
_D4temp10__T3rorTtZ3rorFNaNbNfxtxkZt:
movl %edi, %eax
movl %esi, %ecx
rorw %cl, %ax
ret
_D4temp10__T3rolTkZ3rolFNaNbNfxkxkZk:
movl %edi, %eax
movl %esi, %ecx
roll %cl, %eax
ret
_D4temp10__T3rorTkZ3rorFNaNbNfxkxkZk:
movl %edi, %eax
movl %esi, %ecx
rorl %cl, %eax
ret
_D4temp10__T3rolTmZ3rolFNaNbNfxmxkZm:
movq %rdi, %rax
movl %esi, %ecx
rolq %cl, %rax
ret
_D4temp10__T3rorTmZ3rorFNaNbNfxmxkZm:
movq %rdi, %rax
movl %esi, %ecx
rorq %cl, %rax
ret
Comment #17 by bearophile_hugs — 2013-07-10T09:29:51Z
(In reply to comment #15)
> But to give you a comparison.
>
> // LDC
>
> __D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
> pushl %esi
> movzbl 8(%esp), %edx
> movb %al, %cl
> movl %edx, %esi
> shll %cl, %esi
> movl $8, %ecx
> subl %eax, %ecx
> shrl %cl, %edx
> orl %esi, %edx
> movzbl %dl, %eax
> popl %esi
> ret $4
>
> vs.
>
> // GDC
>
> __D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
> movl %edi, %eax
> movl %esi, %ecx
> rolb %cl, %al
> ret
I have asked in the LLVM IRC channel, and it seems to be a limitation of the LLVM back-end. So maybe it's worth filing an LLVM enhancement request.
Comment #18 by hsteoh — 2013-07-11T21:37:37Z
Hmph. I'm using bearophile's code with DMD git HEAD, running as dmd -O, and *none* of the cases produce a rotate instruction. :-( Could it be an issue with 32-bit vs. 64-bit? I'm running in a purely 64-bit environment.
Comment #19 by hsteoh — 2013-07-11T21:40:34Z
I can't get GDC git HEAD to recognize this pattern either. What am I doing wrong??
Comment #20 by hsteoh — 2013-07-11T21:46:05Z
Could it be because amd64 doesn't support this optimization? Seems odd, though. I'd expect it to work at least for the 16-bit-and-smaller argument forms, since those are common to both x86 and amd64.
Comment #21 by ibuclaw — 2013-07-12T00:58:49Z
1. Not my problem. :)
2. When comparing gdc and dmd, make sure you're actually looking at object files generated by gdc, and not dmd. :)
3. That's highly unlikely as I've tested on x86_64. :)
Comment #22 by ibuclaw — 2013-07-12T01:00:57Z
(In reply to comment #21)
> 1. Not my problem. :)
> 2. When comparing gdc and dmd, make sure you're actually looking at object files
> generated by gdc, and not dmd. :)
> 3. That's highly unlikely as I've tested on x86_64. :)
and 4. Under simple test conditions where all parameter values are const/known, templated function calls tend to be inlined / optimised away.
Comment #23 by ibuclaw — 2013-07-12T01:02:52Z
(In reply to comment #22)
> (In reply to comment #21)
> > 1. Not my problem. :)
> > 2. When comparing gdc and dmd, make sure you're actually looking at object files
> > generated by gdc, and not dmd. :)
> > 3. That's highly unlikely as I've tested on x86_64. :)
>
> and 4. Under simple test conditions where all parameter values are const/known,
> templated function calls tend to be inlined / optimised away.
and 5. Make sure that you use bearophile's latest implementation example. ;)
http://d.puremagic.com/issues/show_bug.cgi?id=6829#c12
Comment #24 by bearophile_hugs — 2013-07-12T01:59:52Z
(In reply to comment #23)
> and 5. Make sure that you use bearophiles last implementation example. ;)
Updated code, lacks unittests:
import std.traits: isIntegral, isUnsigned;

/// Rotate x left by nBits bits.
T rol(T)(in T x, in uint nBits) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
in {
    assert(nBits < (T.sizeof * 8));
} body {
    return cast(T)((x << nBits) | (x >> ((T.sizeof * 8) - nBits)));
}

/// Rotate x right by nBits bits.
T ror(T)(in T x, in uint nBits) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
in {
    assert(nBits < (T.sizeof * 8));
} body {
    return cast(T)((x >> nBits) | (x << ((T.sizeof * 8) - nBits)));
}
void main() {
    // Tests to check for assembly output.
    {
        __gshared static ubyte xb;
        __gshared static ushort xs;
        __gshared static uint xi;
        __gshared static ulong xl;
        __gshared static uint yi;
        rol(xb, yi); // rolb
        ror(xb, yi); // rorb
        rol(xs, yi); // rolw
        ror(xs, yi); // rorw
        rol(xi, yi); // roll
        ror(xi, yi); // rorl
        rol(xl, yi); // version(X86_64) rolq
        ror(xl, yi); // version(X86_64) rorq
    }
}
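A sketch of the missing unittests (my addition; the expected values follow from the definitions above and from Walter's 7 -> 0x8000_0003 example earlier in this thread):
unittest {
    // 8 bits: the high bit wraps around to the low end and vice versa.
    assert(rol(ubyte(0b1000_0001), 1) == 0b0000_0011);
    assert(ror(ubyte(0b1000_0001), 1) == 0b1100_0000);

    // 32 bits.
    assert(ror(7u, 1) == 0x8000_0003);
    assert(rol(0x8000_0003u, 1) == 7);

    // rol and ror by the same amount are inverses.
    assert(ror(rol(0xDEAD_BEEFu, 13), 13) == 0xDEAD_BEEF);
}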
Comment #25 by hsteoh — 2013-07-12T08:13:16Z
Nope, it's still not working. I copied and pasted exactly the code posted above, compiled with gdc -frelease -O3 test.d, and here is the disassembly output:
00000000004042d0 <_D4test10__T3rolThZ3rolFNaNbNfxhxkZh>:
4042d0: 40 0f b6 ff movzbl %dil,%edi
4042d4: b9 08 00 00 00 mov $0x8,%ecx
4042d9: 29 f1 sub %esi,%ecx
4042db: 89 f8 mov %edi,%eax
4042dd: d3 f8 sar %cl,%eax
4042df: 89 f1 mov %esi,%ecx
4042e1: d3 e7 shl %cl,%edi
4042e3: 09 f8 or %edi,%eax
4042e5: c3 retq
4042e6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4042ed: 00 00 00
00000000004042f0 <_D4test10__T3rorThZ3rorFNaNbNfxhxkZh>:
4042f0: 40 0f b6 ff movzbl %dil,%edi
4042f4: b9 08 00 00 00 mov $0x8,%ecx
4042f9: 29 f1 sub %esi,%ecx
4042fb: 89 f8 mov %edi,%eax
4042fd: d3 e0 shl %cl,%eax
4042ff: 89 f1 mov %esi,%ecx
404301: d3 ff sar %cl,%edi
404303: 09 f8 or %edi,%eax
404305: c3 retq
404306: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40430d: 00 00 00
0000000000404310 <_D4test10__T3rolTtZ3rolFNaNbNfxtxkZt>:
404310: 0f b7 ff movzwl %di,%edi
404313: b9 10 00 00 00 mov $0x10,%ecx
404318: 29 f1 sub %esi,%ecx
40431a: 89 f8 mov %edi,%eax
40431c: d3 f8 sar %cl,%eax
40431e: 89 f1 mov %esi,%ecx
404320: d3 e7 shl %cl,%edi
404322: 09 f8 or %edi,%eax
404324: c3 retq
404325: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40432c: 00 00 00
40432f: 90 nop
0000000000404330 <_D4test10__T3rorTtZ3rorFNaNbNfxtxkZt>:
404330: 0f b7 ff movzwl %di,%edi
404333: b9 10 00 00 00 mov $0x10,%ecx
404338: 29 f1 sub %esi,%ecx
40433a: 89 f8 mov %edi,%eax
40433c: d3 e0 shl %cl,%eax
40433e: 89 f1 mov %esi,%ecx
404340: d3 ff sar %cl,%edi
404342: 09 f8 or %edi,%eax
404344: c3 retq
404345: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40434c: 00 00 00
40434f: 90 nop
0000000000404350 <_D4test10__T3rolTkZ3rolFNaNbNfxkxkZk>:
404350: b9 20 00 00 00 mov $0x20,%ecx
404355: 89 f8 mov %edi,%eax
404357: 29 f1 sub %esi,%ecx
404359: d3 e8 shr %cl,%eax
40435b: 89 f1 mov %esi,%ecx
40435d: d3 e7 shl %cl,%edi
40435f: 09 f8 or %edi,%eax
404361: c3 retq
404362: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
404369: 00 00 00
40436c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000404370 <_D4test10__T3rorTkZ3rorFNaNbNfxkxkZk>:
404370: b9 20 00 00 00 mov $0x20,%ecx
404375: 89 f8 mov %edi,%eax
404377: 29 f1 sub %esi,%ecx
404379: d3 e0 shl %cl,%eax
40437b: 89 f1 mov %esi,%ecx
40437d: d3 ef shr %cl,%edi
40437f: 09 f8 or %edi,%eax
404381: c3 retq
404382: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
404389: 00 00 00
40438c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000404390 <_D4test10__T3rolTmZ3rolFNaNbNfxmxkZm>:
404390: b9 40 00 00 00 mov $0x40,%ecx
404395: 48 89 f8 mov %rdi,%rax
404398: 29 f1 sub %esi,%ecx
40439a: 48 d3 e8 shr %cl,%rax
40439d: 89 f1 mov %esi,%ecx
40439f: 48 d3 e7 shl %cl,%rdi
4043a2: 48 09 f8 or %rdi,%rax
4043a5: c3 retq
4043a6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4043ad: 00 00 00
00000000004043b0 <_D4test10__T3rorTmZ3rorFNaNbNfxmxkZm>:
4043b0: b9 40 00 00 00 mov $0x40,%ecx
4043b5: 48 89 f8 mov %rdi,%rax
4043b8: 29 f1 sub %esi,%ecx
4043ba: 48 d3 e0 shl %cl,%rax
4043bd: 89 f1 mov %esi,%ecx
4043bf: 48 d3 ef shr %cl,%rdi
4043c2: 48 09 f8 or %rdi,%rax
4043c5: c3 retq
4043c6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4043cd: 00 00 00
It's still using shift + bitwise OR instead of substituting a rotate instruction. I'm also on Linux x86_64 (so says uname -m), so I've no idea what I'm doing wrong. And gdc --version says it's gdc (GCC) 4.8.1. Did something go wrong/missing in my gdc build??
Comment #26 by ibuclaw — 2013-07-12T08:21:21Z
I don't currently have a gdc 4.8 compiler at hand to test with...
Comment #27 by hsteoh — 2013-07-12T08:40:43Z
Huh? So which gdc have you been using to test this earlier?
Comment #28 by bugzilla — 2013-07-13T00:32:38Z
(In reply to comment #2)
> (In reply to comment #1)
>
> > What's the pattern that DMD recognizes for rotate instructions?
>
> Walter offers this example of recognizable rotation:
Ack, the example is bad. This one generates rol/ror:
-----------
void test(int shift)
{
    uint a = 7;
    uint r;
    r = (a >> shift) | (a << (int.sizeof * 8 - shift));
    assert(r == 0x8000_0003);
    r = (r << shift) | (r >> (int.sizeof * 8 - shift));
    assert(r == 7);
}
-----------
compiling with -O:
_D3foo4testFiZv comdat
assume CS:_D3foo4testFiZv
L0: push EAX
mov EDX,7
mov ECX,EAX
push EAX
ror EDX,CL
cmp EDX,080000003h
push EBX
mov 4[ESP],EDX
je L22
mov EAX,6
call near ptr _D3foo8__assertFiZv
L22: mov EBX,4[ESP]
mov ECX,8[ESP]
rol EBX,CL
cmp EBX,7
je L3B
mov EAX,8
call near ptr _D3foo8__assertFiZv
L3B: pop EBX
add ESP,8
ret
Comment #30 by bearophile_hugs — 2013-07-13T02:37:33Z
(In reply to comment #28)
> Ack, the example is bad. This one generates rol/ror:
For normal D programmers it's easy to write D "rotation" code that the D compiler doesn't recognize. So here I have proposed to add two standard rol/ror template functions to core.bitop that all D programmers can use safely and all D compilers can optimize well.
(In reply to comment #31)
> I don't understand what I'm doing wrong. Is this a platform-specific issue? I'm
> running Linux/x86_64.
It works with 32 bit code generation, not with 64. I didn't realize that. I'll look into fixing it.
Comment #33 by hsteoh — 2013-07-13T14:27:53Z
Interesting. Running dmd -m32 -O works, produces the rotate instructions.
For some reason, I still can't coax gdc to do this. I've tried all combinations of -O, -O2, -O3 and -m32, -m64, and it still only produces shift + bitwise OR. Any clues, Iain? :)
Comment #34 by ibuclaw — 2013-07-14T02:47:24Z
(In reply to comment #33)
> Interesting. Running dmd -m32 -O works, produces the rotate instructions.
>
> For some reason, I still can't coax gdc to do this. I've tried all combinations
> of -O, -O2, -O3 and -m32, -m64, and it still only produces shift + bitwise OR.
> Any clues, Iain? :)
Works for me (gcc vanilla development) - so I'm shrugging my shoulders on this.
Comment #35 by hsteoh — 2013-07-14T07:45:05Z
Interestingly, translating the code into C and compiling with gcc 4.8.1 does produce the rotate instructions. But compiling the D version with gdc 4.8 doesn't. I've no idea why.
Comment #36 by ibuclaw — 2013-07-14T08:10:23Z
(In reply to comment #35)
> Interestingly, translating the code into C and compiling with gcc 4.8.1 does
> produce the rotate instructions. But compiling the D version with gdc 4.8
> doesn't. I've no idea why.
You can look at output of -fdump-tree-gimple and compare gcc to gdc generated code.
But there should be no difference. :)
Comment #37 by ibuclaw — 2013-07-14T08:38:50Z
(In reply to comment #36)
> (In reply to comment #35)
> > Interestingly, translating the code into C and compiling with gcc 4.8.1 does
> > produce the rotate instructions. But compiling the D version with gdc 4.8
> > doesn't. I've no idea why.
>
> You can look at output of -fdump-tree-gimple and compare gcc to gdc generated
> code.
>
> But there should be no difference. :)
gdc 4.9.0 20130707 - produces rol/ror instructions.
gdc 4.9.0 20130616 - produces rol/ror instructions.
gdc 4.9.0 20130505 - produces shl/shr instructions.
Related commits between 10/05 and 14/05:
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=198770
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=198823
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=198864
Comment #38 by bearophile_hugs — 2013-07-14T09:10:08Z
(In reply to comment #37)
> gdc 4.9.0 20130707 - produces rol/ror instructions.
> gdc 4.9.0 20130616 - produces rol/ror instructions.
> gdc 4.9.0 20130505 - produces shl/shr instructions.
I am changing my mind, maybe now I want rol/ror _intrinsics_ in core.bitop... :-)
Comment #39 by ibuclaw — 2013-07-14T10:28:05Z
(In reply to comment #38)
> (In reply to comment #37)
>
> > gdc 4.9.0 20130707 - produces rol/ror instructions.
> > gdc 4.9.0 20130616 - produces rol/ror instructions.
> > gdc 4.9.0 20130505 - produces shl/shr instructions.
>
> I am changing my mind, maybe now I want rol/ror _intrinsics_ in core.bitop...
> :-)
core.bitop might be a good idea... then I could map the functions to gcc's builtin lrotate and rrotate expressions. :o)
Slightly off the mark, but we could also add vector rotate intrinsics...
Comment #40 by bearophile_hugs — 2013-07-28T10:19:00Z
(In reply to comment #13)
> (Perhaps here there's material for a small enhancement request for LLVM.)
Here it is, in the LLVM Bugzilla:
http://llvm.org/bugs/show_bug.cgi?id=16726
Comment #41 by hsteoh — 2013-08-13T09:42:34Z
Related: issue #1116.
Comment #42 by temtaime — 2013-08-13T10:47:52Z
Guys, please fix the bugs instead of improving the performance of DMD's backend.
For that purpose there are LDC and GDC.
Comment #43 by bearophile_hugs — 2013-08-13T11:20:41Z
(In reply to comment #42)
> Guys, please fix the bugs instead of improving the performance of DMD's backend.
> For that purpose there are LDC and GDC.
Walter is free to do what he likes with his time, including enhancing the DMD back-end. And currently the LDC2 back-end is not optimizing the rotations well enough. I now think that having two rotation intrinsics is a good solution.
Comment #44 by ibuclaw — 2013-08-13T12:05:05Z
(In reply to comment #42)
> Guys, please fix the bugs instead of improving the performance of DMD's backend.
> For that purpose there are LDC and GDC.
Intrinsics would benefit GDC code generation too...
Comment #45 by kai — 2013-09-09T10:58:01Z
(In reply to comment #43)
> And currently the LDC2 back-end is not optimizing the rotations well
> enough.
I had a closer look at this. The root cause is the uint nBits declaration: because of it, the x value is promoted to uint when T is ubyte or ushort (per TDPL, 2.3.3), and LLVM is not smart enough to deduce that the zero extension is unnecessary. If both types are ubyte or ushort, then the rolb/rolw and rorb/rorw instructions are generated.
I'll try to make LLVM smarter. :-)
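A minimal illustration of the promotion kai describes (my sketch, not from the comment):
void main()
{
    ubyte x = 0x81;
    uint n = 1;
    // The ubyte operand is promoted to int before the shift, so the
    // intermediate value is 32 bits wide; the back-end must prove the
    // zero extension redundant before it can emit rolb.
    static assert(is(typeof(x << n) == int));
}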
Comment #46 by code — 2013-11-01T19:06:39Z
*** Issue 1116 has been marked as a duplicate of this issue. ***
Comment #47 by moonlightsentinel — 2021-06-06T16:21:43Z
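druntime's core.bitop has since gained rol and ror templates, which is presumably why this issue was resolved WORKSFORME. A minimal usage sketch against that API (assuming a current compiler; check the core.bitop documentation for the exact signatures):
import core.bitop : rol, ror;

void main()
{
    // The same values as Walter's example, via the library functions.
    uint x = 7;
    assert(ror(x, 1) == 0x8000_0003);
    assert(rol(ror(x, 1), 1) == 7);

    // Narrow types rotate at their own width.
    ubyte b = 0b1000_0001;
    assert(rol(b, 1) == 0b0000_0011);
}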