Tracked down a severe performance issue in my new AA (associative array) implementation to the code that zeroes a freshly allocated entry.
DMD generates the following code for the assignment.
----
void zero(ubyte[] ary) { ary[] = 0; }
----
mov rcx, rdi ; 0008 _ 48: 89. F9
xor rax, rax ; 000B _ 48: 31. C0
mov rdi, rsi ; 000E _ 48: 8B. FE
rep stosb ; 0011 _ F3: AA
----
This is a byte-wise store of 0. For sz >= 4 it is about 4x slower than memset; for sz < 4 it is slightly faster.
Not sure why `rep stosb` suddenly becomes 4x slower when sz increases from 3 to 4 bytes. In any case, the compiler should either lower the small case to direct assignments and the large case to a memset call, or always use memset.
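Until the codegen is fixed, a user-level workaround along the lines suggested above might look like the following. This is only a sketch: `zeroFill` and `smallCutoff` are hypothetical names, not part of druntime or DMD, and the cutoff of 4 bytes is simply taken from the measurement above rather than tuned.

```d
import core.stdc.string : memset;

/// Hypothetical helper (not a druntime or DMD API): zero a byte slice
/// without relying on the code generated for `ary[] = 0`.
void zeroFill(ubyte[] ary) @trusted nothrow @nogc
{
    // Assumed cutoff, taken from the sz < 4 observation above; not tuned.
    enum size_t smallCutoff = 4;

    if (ary.length < smallCutoff)
    {
        // Small case: plain per-element stores.
        foreach (ref b; ary)
            b = 0;
    }
    else
    {
        // Large case: defer to the C library's memset.
        memset(ary.ptr, 0, ary.length);
    }
}

void main()
{
    // Exercise both branches: one slice below the cutoff, one above.
    auto small = new ubyte[](3);
    auto large = new ubyte[](64);
    small[] = 0xFF;
    large[] = 0xFF;

    zeroFill(small);
    zeroFill(large);

    foreach (b; small) assert(b == 0);
    foreach (b; large) assert(b == 0);
}
```

Whether the small-case loop actually beats memset depends on the target CPU and C library, so the cutoff should be measured rather than assumed.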
Not going to bother with OPstrcpy or OPstrcmp as they are never generated by DMD.
Comment #3 by dlang-bot — 2020-07-21T01:10:28Z
@WalterBright created dlang/dmd pull request #11437 "fix Issue 14458 - very slow ubyte[] assignment (dmd doesn't use memset)" fixing this issue:
- fix Issue 14458 - very slow ubyte[] assignment (dmd doesn't use memset)
https://github.com/dlang/dmd/pull/11437
Comment #4 by dlang-bot — 2020-07-21T13:23:03Z
dlang/dmd pull request #11437 "fix Issue 14458 - very slow ubyte[] assignment (dmd doesn't use memset)" was merged into master:
- b8f31faeb720f25cfa672dcb7ae0d72d8efd2a0c by Walter Bright:
fix Issue 14458 - very slow ubyte[] assignment (dmd doesn't use memset)
https://github.com/dlang/dmd/pull/11437