Comment #0 by safety0ff.bugz — 2020-07-12T02:33:33Z
ROL/ROR should provide better performance and fewer constraints on register allocation.
The claim of better performance is based on:
- https://www.agner.org/optimize/instruction_tables.pdf
- Looking at gcc & llvm compiler output
The only disadvantage I see is that the instruction is longer.
Comment #1 by bcarneal11 — 2020-07-12T05:28:04Z
I didn't find a 'byteswap' in the core.bitop documentation. There is a bswap but only for uints and ulongs AFAICT. Regardless, here's a byteswap implementation for discussion:
auto byteswap(ushort x) { return cast(ushort)(x >> 8 | x << 8); }
For the above code ldc at -O or above generates:
movl %edi, %eax
rolw $8, %ax
retq
With ldc you can also get the above sequence using core.bitop.rol!8 explicitly.
Current dmd -O emits 7 instructions to accomplish the rolw in the code body. The code emitted by dmd -O for the explicit call to core.bitop.rol is even worse, which is strange.
So, yes, there's room here for DMD code gen improvement but ldc is right there.
Comment #2 by safety0ff.bugz — 2020-07-12T13:06:39Z
(In reply to Bruce Carneal from comment #1)
> I didn't find a 'byteswap' in the core.bitop documentation. There is a
> bswap but only for uints and ulongs AFAICT.
The intrinsic in question was added in the master branch here: https://github.com/dlang/dmd/pull/11388
Also the 64 bit version is to be added here: https://github.com/dlang/dmd/pull/11408
> For the above code ldc at -O or above generates:
> movl %edi, %eax
> rolw $8, %ax
> retq
I'd expect that, since clang emits the same sequence for the equivalent C/C++ code.
Comment #3 by safety0ff.bugz — 2020-07-12T13:32:02Z
(In reply to Bruce Carneal from comment #1)
> Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
D converts many operations on narrow types to int, which DMD's backend then fails to optimize away when it is possible/advantageous.
Comment #4 by safety0ff.bugz — 2020-07-12T13:52:35Z
(In reply to safety0ff.bugz from comment #3)
> (In reply to Bruce Carneal from comment #1)
> > Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
>
> D converts many operations on narrow types to int, which DMD's backend then
> fails to optimize away when it is possible/advantageous.
Further investigation: the function cdshift in dmd/backend/cod2.d also converts a rotate by 8 of a word (i.e., swapping its upper and lower 8 bits) into an XCHG.
Comment #5 by bcarneal11 — 2020-07-12T15:39:45Z
(In reply to safety0ff.bugz from comment #3)
> (In reply to Bruce Carneal from comment #1)
> > Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
>
> D converts many operations on narrow types to int, which DMD's backend then
> fails to optimize away when it is possible/advantageous.
Yes. DMD's back end is quick, but the code it generates is not state-of-the-art.
That said, optimizing the DMD code gen for core.bitop rotations seems more useful than a ushort byteswap improvement. The latter could be implemented as an "inline" of the former.
Recognizing the rotation patterns generally, à la LLVM, would be even better but quite a bit of work, I'd imagine. Probably not worth it given current resource constraints (Walter's time). Lots of big front-end fish to fry.
Comment #6 by robert.schadek — 2024-12-13T19:10:07Z