LDC, GDC and DMD implement int4 differently when it comes to multiplication.
------ test-case.d ---------
import core.simd;

int4 mul_4_ints(int4 a, int4 b)
{
    return a * b; // ok with LDC and GDC, but not DMD
}
----------------------------
An efficient int4 * int4 requires Neon or SSE4.1 with the pmulld instruction.
- DMD doesn't implement int4 * int4.
- GDC and LDC implement it with a replacement sequence using two multiply instructions; GDC gained that lowering at some point.
In intel-intrinsics, I now tell people to use _mm_mullo_epi32 to stay portable; it applies the workarounds where needed. Having this operation in core.simd is a bit of a portability trap.
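For reference, the pre-SSE4.1 fallback that a portable _mm_mullo_epi32 has to perform can be sketched like this (a C sketch using SSE2 intrinsics; the function name mullo_epi32_sse2 is made up for this example, but the two-pmuludq sequence is the standard workaround, i.e. the "replacement sequence and two multiply instructions" mentioned above):

```c
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical fallback: multiply four packed 32-bit ints without pmulld.
// Uses two pmuludq (64-bit widening multiplies) plus shuffles.
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    // pmuludq multiplies lanes 0 and 2; shift by 4 bytes to reach lanes 1 and 3
    __m128i even = _mm_mul_epu32(a, b);
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // Keep the low 32 bits of each 64-bit product and re-interleave the lanes
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}
```

The low 32 bits of a product are the same for signed and unsigned operands, so this works for int4 even though pmuludq is an unsigned multiply.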
Two solutions I could see here:
- A. Remove support from LDC and GDC, since there is no dedicated hardware support below SSE4.1. The user is forced to think about portability.
- B. Add support for int4 * int4 in DMD to match the other compilers. Users can then use core.simd without unknowingly breaking compatibility.
Personally, I have no idea which is best.
Comment #1 by aliloko — 2023-01-19T12:04:40Z
> requires Neon or SSE4.1 with the
requires Neon or SSE4.1 with the pmulld instruction
Comment #2 by ibuclaw — 2023-01-19T20:07:29Z
(In reply to ponce from comment #0)
> LDC, GDC and DMD implement int4 differently when it comes to multiplication.
>
With DMD, you need to explicitly pass -mcpu=avx when compiling. It uses a strict gate at compile-time to determine whether or not the expression would map to a single opcode in the dmd backend for the given type mode.
GDC and LDC ignore this gate - even though the information is there and can be queried against GCC or LLVM respectively - and just permissively allow the operation. This does mean that when the expression is passed down to the backend, the vector op may be split into narrower modes when the target being compiled for doesn't have an available opcode.
This behaviour is justified because, strictly speaking, we don't know whether the optimizer might rewrite the expression in such a way that there *is* a supported opcode.
For example: `a / b` has no vector op, but `a >> b` does.
https://d.godbolt.org/z/vrn77GG9f
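The `a / b` vs `a >> b` point can be illustrated with a small sketch (written with GCC/Clang vector extensions rather than core.simd; the names uint4 and div_by_4 are made up for this example). No packed 32-bit integer division opcode exists, yet division by an unsigned power-of-two constant is routinely strength-reduced to a vector shift, which does have an opcode:

```c
// GCC/Clang vector extension: four unsigned 32-bit lanes
typedef unsigned int uint4 __attribute__((vector_size(16)));

// There is no hardware opcode for packed 32-bit division, but for an
// unsigned power-of-two divisor the optimizer can rewrite this division
// as `a >> 2`, for which a vector opcode does exist.
uint4 div_by_4(uint4 a)
{
    return a / 4;
}
```

A strict compile-time gate would have to reject `a / 4` even though the generated code may end up being a single shift instruction.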
(FYI, in gdc-13, `-Wvector-operation-performance` will be turned on by default so you'll at least get a non-blocking warning about expressions that have been expanded at narrower modes).
Comment #3 by bugzilla — 2023-01-29T08:43:38Z
This behavior of DMD is as designed. (As mentioned here, it will work if the -mcpu=avx switch is used.)
Workarounds can be much, much slower than the native instructions, and the user may not realize that a slow workaround is happening. By notifying the user that the native instruction does not exist, the user can deliberately choose the workaround that works best for their particular application. In particular, the user may not actually need the full capability of the native instruction, so a full semantic workaround is a pessimization. Or a different algorithm can be selected that does not require the missing native instruction.
This behavior comes at the request of Manu Evans, who spends a lot of time coding high performance vector code.
GDC and LDC have a different philosophy about this, which is their prerogative.
Therefore, I'm going to mark this as INVALID, as the behavior is deliberate and not a bug.