Bug 15873 – In order to implement std.simd, compile time info about CPU specifics is needed

Status
RESOLVED
Resolution
INVALID
Severity
blocker
Priority
P1
Component
dmd
Product
D
Version
D2
Platform
All
OS
All
Creation time
2016-04-04T17:22:23Z
Last change time
2020-12-21T08:45:02Z
Keywords
CTFE, SIMD
Assigned to
No Owner
Creator
Jack Stouffer

Comments

Comment #0 by jack — 2016-04-04T17:22:23Z
To quote Manu: "I still have no way to detect what simd version was supplied to the compiler on the command line on GCC/Clang, and DMD has no such concept. The library can't emit opcodes that violate the simd level request made to the compiler; I need to know the level requested and then I can produce the best code for that level using static if."
This is also blocking std.blas. To quote Ilya: "I am working on BLAS from scratch implementation. And it is no hope to create something useable without CT information about target. Target cpu configuration:
- CPU architecture (done)
- Count of FP/Integer registers
- Allowed sets of instructions: for example, AVX2, FMA4
- Compiler optimization options (for math)"
Comment #1 by bugzilla — 2016-04-04T18:34:26Z
For DMD, the minimum SIMD level can be ascertained by:
1. the operating system - for example, OSX is only sold on certain CPUs and above. Also, Linux assumes SIMD in the default behavior of gcc.
2. whether 32 or 64 bit code is being generated
The DMD compiler assumes the existence of that minimum SIMD level, and generates SIMD code accordingly.
The SIMD capabilities can be tested at runtime: http://dlang.org/phobos/core_cpuid.html
This is used, for example, here: https://github.com/D-Programming-Language/druntime/blob/master/src/rt/arraydouble.d#L33
The idea is to use a template to statically generate code for each supported SIMD level. Then, test the capabilities at a high level, and select the right branch at the high level. Then each level's implementation runs at full speed with custom code for that level.
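A minimal sketch of the template-plus-runtime-dispatch idea described above. The `Level` enum and the `sum`/`sumDispatch` functions are hypothetical names made up for illustration; only `core.cpuid.sse2` and `core.cpuid.avx2` come from druntime. The per-level bodies are plain loops here, where real code would branch on `level` with `static if` to emit level-specific instructions.

```d
import core.cpuid : avx2, sse2;

// Hypothetical SIMD levels, used only for this illustration.
enum Level { scalar, sse2, avx2 }

// One instantiation is generated per level. A real implementation would
// use `static if (level == Level.avx2)` etc. to select custom code;
// this sketch uses the same scalar loop for every level.
float sum(Level level)(const(float)[] data)
{
    float total = 0;
    foreach (x; data)
        total += x;
    return total;
}

// Test the CPU capabilities once, at a high level, and pick the branch.
float sumDispatch(const(float)[] data)
{
    if (avx2)
        return sum!(Level.avx2)(data);
    if (sse2)
        return sum!(Level.sse2)(data);
    return sum!(Level.scalar)(data);
}
```

The capability check runs once per call to `sumDispatch`, so it only pays off when the work behind it is large; that is the outer-loop case, not the inner-loop case.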
Comment #2 by bugzilla — 2016-04-04T18:39:34Z
Comment #3 by bugzilla — 2016-04-04T19:29:54Z
DMD predefines "D_SIMD" for:
1. all 64 bit code generation
2. OSX 32 bit code generation
and does generate SIMD instructions for those platforms. DMD does not have compiler switches to select SIMD levels.
Comment #4 by bugzilla — 2016-04-04T19:44:40Z
Comment #5 by aliloko — 2016-04-04T19:53:40Z
Could DMD also generate SSE code for 32-bit targets (easily)? SSE2 is very common. I see two main advantages:
- it can also avoid some divergence in results between 32-bit and 64-bit related to the unexpected higher precision of FPU operations. Using the FPU you might think that floats are sufficient for one task when they aren't, because they were promoted to 80-bit float internally.
- avoiding denormals. It is a recurring concern in audio code though not that bad.
MSVC generates SSE2 in 32-bit by default I think.
Comment #6 by turkeyman — 2016-04-06T12:19:21Z
DMD really needs some way to select the simd level to target from the command line. Runtime selection is appropriate at the outer loop, but runtime selection is not practical for small occurrences of SIMD appearing littered around, or where the selection would be made in the inner loop.
Comment #7 by Marco.Leise — 2016-04-11T18:02:11Z
My concern is with "fast.json" where the call site reads
    auto json = parseJSON(...);
and I feel that
    import core.cpuid;
    if (sse42) handleJson!true(); else handleJson!false();
    void handleJson(bool sse42)() { auto json = parseJSON!sse42(...); }
is just not palatable. ('handleJson' being needed, since the return value would be a RAII struct with compile-time specialization.) Importing core.cpuid, figuring out which flag to use and set as a template argument, and writing a switch-case or if-else is not economically reasonable, so to speak, when you could enable SSE4 globally and often implicitly (-march=native). Also, in my case DMD won't profit, because its inline assembly doesn't inline (making it too slow), and GDC won't profit because it is not supported by core.cpuid, leaving only LDC - but that's another story.
My argument here is that the one writing SIMD code is not necessarily the one calling it. Compile-time information about the (implied) target enables us to reduce the cognitive load for library users, and still make use of the latest CPU features. This is working to great benefit with intrinsics in other compilers (for popcnt, memcpy, etc.), but we can't imitate that. So we ended up with runtime checks against a global variable in popcnt for what should be a single instruction on recent CPUs and an additional "SSE4 only" _popcnt in http://dlang.org/phobos/core_bitop.html#.popcnt
Comment #8 by bugzilla — 2020-12-21T08:45:02Z
DMD predefines some version identifiers based on SIMD level:
version (D_SIMD) - for SSE2 instruction sets
version (D_AVX) - for SSE2..AVX instruction sets
version (D_AVX2) - for SSE2..AVX2 instruction sets
which should do the job.
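A minimal sketch of how a library might branch on these identifiers at compile time. The names `simdLevel` and `floatsPerVector` are hypothetical and exist only for this illustration. Which identifiers are set depends on the target the compiler was asked to build for (with DMD this is, as far as I know, controlled by the `-mcpu=` switch, but treat that as an assumption here).

```d
// Pick a label and a vector width from the predefined version identifiers.
// D_AVX2 is tested first because it implies the lower levels.
version (D_AVX2)
{
    enum simdLevel = "AVX2";
    enum floatsPerVector = 8; // 256-bit registers
}
else version (D_AVX)
{
    enum simdLevel = "AVX";
    enum floatsPerVector = 8; // 256-bit registers
}
else version (D_SIMD)
{
    enum simdLevel = "SSE2";
    enum floatsPerVector = 4; // 128-bit registers
}
else
{
    enum simdLevel = "none";
    enum floatsPerVector = 1; // scalar fallback
}

// Printed during compilation, so the selected level is visible in the build log.
pragma(msg, "SIMD level: ", simdLevel);
```

Library code can then use `static if (floatsPerVector >= 8)` and similar tests to choose an implementation, without the caller importing core.cpuid or passing template flags.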