Comment #0 by safety0ff.bugz — 2020-07-12T02:33:33Z
ROL/ROR should provide better performance and fewer constraints on register allocation.
The claim of better performance is based on:
- https://www.agner.org/optimize/instruction_tables.pdf
- Looking at gcc & llvm compiler output
The only disadvantage I see is that the instruction is longer.
Comment #1 by bcarneal11 — 2020-07-12T05:28:04Z
I didn't find a 'byteswap' in the core.bitop documentation. There is a bswap but only for uints and ulongs AFAICT. Regardless, here's a byteswap implementation for discussion:
auto byteswap(ushort x) { return cast(ushort)(x >> 8 | x << 8); }
For the above code ldc at -O or above generates:
movl %edi, %eax
rolw $8, %ax
retq
With ldc you can also get the above sequence using core.bitop.rol!8 explicitly.
Current dmd -O emits 7 instructions to accomplish the rolw in the code body. The code emitted by dmd -O for the explicit call to core.bitop.rol is even worse, which is strange.
So, yes, there's room here for DMD code gen improvement but ldc is right there.
Comment #2 by safety0ff.bugz — 2020-07-12T13:06:39Z
(In reply to Bruce Carneal from comment #1)
> I didn't find a 'byteswap' in the core.bitop documentation. There is a
> bswap but only for uints and ulongs AFAICT.
The intrinsic in question was added in the master branch here: https://github.com/dlang/dmd/pull/11388
Also the 64 bit version is to be added here: https://github.com/dlang/dmd/pull/11408
> For the above code ldc at -O or above generates:
> movl %edi, %eax
> rolw $8, %ax
> retq
I'd expect that, since clang emits the same sequence for the equivalent C/C++ code.
Comment #3 by safety0ff.bugz — 2020-07-12T13:32:02Z
(In reply to Bruce Carneal from comment #1)
> Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
D converts many operations on narrow types to int, which DMD's backend then fails to optimize away when it is possible/advantageous.
Comment #4 by safety0ff.bugz — 2020-07-12T13:52:35Z
(In reply to safety0ff.bugz from comment #3)
> (In reply to Bruce Carneal from comment #1)
> > Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
>
> D converts many operations on narrow types to int, which DMD's backend then
> fails to optimize away when it is possible/advantageous.
Further investigation: the function cdshift in dmd/backend/cod2.d also converts a rotate by 8 of a word (i.e., swapping its upper and lower 8 bits) into an XCHG.
Comment #5 by bcarneal11 — 2020-07-12T15:39:45Z
(In reply to safety0ff.bugz from comment #3)
> (In reply to Bruce Carneal from comment #1)
> > Current dmd -O emits 7 instructions to accomplish the rolw in the code body.
>
> D converts many operations on narrow types to int, which DMD's backend then
> fails to optimize away when it is possible/advantageous.
Yes. DMD's back end is quick, but the code it generates is not state-of-the-art.
That said, optimizing the DMD code gen for core.bitop rotations seems more useful than a ushort byteswap improvement. The latter could be implemented as an "inline" of the former.
Recognizing the rotation patterns generally, à la LLVM, would be even better but quite a bit of work, I'd imagine. Probably not worth it given current resource constraints (Walter's time). Lots of big front-end fish to fry.
Comment #6 by robert.schadek — 2024-12-13T19:10:07Z