Bug 21041 – core.bitop.byteswap(ushort) should use ROL/ROR instead of XCHG

Status: NEW
Severity: enhancement
Priority: P4
Component: dmd
Product: D
Version: D2
Platform: x86_64
OS: All
Creation time: 2020-07-12T02:33:33Z
Last change time: 2024-12-13T19:10:07Z
Keywords: backend, performance
Assigned to: No Owner
Creator: safety0ff.bugz
Moved to GitHub: dmd#19745

Comments

Comment #0 by safety0ff.bugz — 2020-07-12T02:33:33Z
ROL/ROR should provide better performance and fewer constraints for register allocation.
The claim of better performance is based on:
- https://www.agner.org/optimize/instruction_tables.pdf
- Looking at gcc & llvm compiler output
The only disadvantage I see is that the instruction is longer.
Comment #1 by bcarneal11 — 2020-07-12T05:28:04Z
I didn't find a 'byteswap' in the core.bitop documentation. There is a bswap but only for uints and ulongs AFAICT.
Regardless, here's a byteswap implementation for discussion:
    auto byteswap(ushort x) { return cast(ushort)(x >> 8 | x << 8); }
For the above code ldc at -O or above generates:
    movl %edi, %eax
    rolw $8, %ax
    retq
With ldc you can also get the above sequence using core.bitop.rol!8 explicitly.
Current dmd -O emits 7 instructions to accomplish the rolw in the code body. The code emitted by dmd -O for the explicit call to core.bitop.rol is even worse, which is strange.
So, yes, there's room here for DMD code gen improvement but ldc is right there.
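As a rough illustration of the two forms mentioned in this comment, the D sketch below checks that the shift-and-or expression and core.bitop.rol!8 produce the same result for every ushort value. The function names byteswapShift and byteswapRol are hypothetical, chosen only for this example. The sketch demonstrates equivalence of results only; it makes no claim about the instructions any particular compiler emits.

```d
import core.bitop : rol;

// Hypothetical helper: the shift-and-or form from the comment above.
ushort byteswapShift(ushort x)
{
    // The operands are promoted to int, so the result is cast back to ushort.
    return cast(ushort)(x >> 8 | x << 8);
}

// Hypothetical helper: the same swap expressed as a rotate left by 8 bits.
ushort byteswapRol(ushort x)
{
    return rol!8(x);
}

void main()
{
    assert(byteswapShift(0x1234) == 0x3412);
    assert(byteswapRol(0x1234) == 0x3412);

    // Exhaustively compare both forms over the full 16-bit range.
    foreach (uint i; 0 .. 0x1_0000)
    {
        auto x = cast(ushort) i;
        assert(byteswapShift(x) == byteswapRol(x));
    }
}
```

For a 16-bit value, rotating left by 8 and rotating right by 8 give the same result, which is why either ROL or ROR can express the swap.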
Comment #2 by safety0ff.bugz — 2020-07-12T13:06:39Z
(In reply to Bruce Carneal from comment #1)
> I didn't find a 'byteswap' in the core.bitop documentation. There is a
> bswap but only for uints and ulongs AFAICT.
The intrinsic in question was added in the master branch here: https://github.com/dlang/dmd/pull/11388
Also the 64 bit version is to be added here: https://github.com/dlang/dmd/pull/11408
> For the above code ldc at -O or above generates:
> movl %edi, %eax
> rolw $8, %ax
> retq
I'd expect that, since clang emits the same for C/C++.
Comment #3 by safety0ff.bugz — 2020-07-12T13:32:02Z
(In reply to Bruce Carneal from comment #1)
> Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
D converts many operations on narrow types to int, which DMD's backend then fails to optimize away when it is possible/advantageous.
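To make the promotion to int concrete, here is a minimal D sketch using the same shift-and-or expression from comment #1. The variable name and the sample value are assumptions picked for illustration. It shows only the language-level typing, not what DMD's backend does with it.

```d
void main()
{
    ushort x = 0x1234;

    // Shifts and bitwise OR on ushort operands are carried out in int.
    static assert(is(typeof(x >> 8) == int));
    static assert(is(typeof(x >> 8 | x << 8) == int));

    // Before truncation, the high byte of x is still present above bit 15.
    assert((x >> 8 | x << 8) == 0x12_3412);

    // The explicit cast discards those upper bits, leaving the swapped bytes.
    assert(cast(ushort)(x >> 8 | x << 8) == 0x3412);
}
```

The backend therefore sees 32-bit shifts followed by a truncation, and has to recognize that the whole sequence is equivalent to a 16-bit rotate.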
Comment #4 by safety0ff.bugz — 2020-07-12T13:52:35Z
(In reply to safety0ff.bugz from comment #3)
> (In reply to Bruce Carneal from comment #1)
> > Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
>
> D converts many operations on narrow types to int, which DMD's backend then
> fails to optimize away when it is possible/advantageous.
Further investigation: the function cdshift in dmd/backend/cod2.d also converts rotates by 8 of a 16-bit word into XCHGs of its upper and lower bytes.
Comment #5 by bcarneal11 — 2020-07-12T15:39:45Z
(In reply to safety0ff.bugz from comment #3)
> (In reply to Bruce Carneal from comment #1)
> > Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
>
> D converts many operations on narrow types to int, which DMD's backend then
> fails to optimize away when it is possible/advantageous.
Yes. DMD's back end is quick, but the code it generates is not state-of-the-art.
That said, optimizing the DMD code gen for core.bitop rotations seems more useful than a ushort byteswap improvement. The latter could be implemented as an "inline" of the former.
Recognizing the rotation patterns generally, à la LLVM, would be even better, but quite a bit of work, I'd imagine. Probably not worth it given current resource constraints (Walter's time). Lots of big front-end fish to fry.
Comment #6 by robert.schadek — 2024-12-13T19:10:07Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/dmd/issues/19745 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT