Bug 17484 – high penalty for vbroadcastsd with -mcpu=avx
Status: RESOLVED
Resolution: FIXED
Severity: normal
Priority: P3
Component: dmd
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2017-06-09T03:58:02Z
Last change time: 2017-08-16T13:23:43Z
Assigned to: No Owner
Creator: Martin Nowak
Comments
Comment #0 by code — 2017-06-09T03:58:02Z
With -mcpu=avx, the compiler emits
vbroadcastsd ymm2, qword ptr [rsp]
even when initializing only 128-bit wide double2 variables.
This causes a high penalty of 50-80 cycles when a legacy SSE instruction later uses that register value (or a value derived from it). The CPU does not know that the upper 128 bits are zero, and apparently preserves them in an internal register buffer.
https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx
We should (a) not write to 256-bit wide YMM registers when only 128-bit wide XMM registers are used, and (b) avoid mixing legacy-encoded SSE instructions (movsd) with VEX-encoded AVX-128 instructions, i.e. use vmovsd instead of movsd.
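A minimal sketch of the kind of source that might trigger the pattern described above, assuming a scalar-to-double2 initialization is what lowers to the broadcast. The function name splat is hypothetical and not taken from the original report; whether vbroadcastsd actually appears depends on the compiler version, so the disassembly should be checked after building with -mcpu=avx.

```d
import core.simd;

// Hypothetical reproduction: fill a 128-bit vector from one scalar.
// Assumption: on affected compilers built with -mcpu=avx this
// initialization is where the 256-bit vbroadcastsd would show up.
double2 splat(double x)
{
    double2 v = x; // scalar is broadcast into both 64-bit lanes
    return v;
}

void main()
{
    double2 v = splat(1.5);
    // Both lanes hold the scalar; only XMM-width state is needed here.
    assert(v.array[0] == 1.5 && v.array[1] == 1.5);
}
```

Only the low 128 bits are ever meaningful in this example, so any write to the full YMM register is unnecessary work that leaves the upper half in a non-zero state.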
Comment #1 by github-bugzilla — 2017-07-17T19:52:40Z
Commit pushed to master at https://github.com/dlang/dmd
https://github.com/dlang/dmd/commit/1f11aa0eb8f6087b7dbadeb770e4526ec9808ccc
fix Issue 17484 - high penalty for AVX-256 instructions with AVX-128 regs
- as the upper 128 bits are no longer zero, the CPU will save/restore
them when that register is used with legacy SSE instructions
- avoid using vbroadcastsd, which is an AVX-256-only instruction, to
initialize 128-bit XMM vectors
Comment #2 by github-bugzilla — 2017-08-07T13:17:30Z