On modern (x86_64) CPUs, dmd emits cmpxchg8b instead of cmpxchg16b
Comments
Comment #0 by mingwu — 2020-05-17T19:35:01Z
Created attachment 1788
On modern (x86_64) CPUs, dmd emits cmpxchg8b instead of cmpxchg16b
$ cat c.d
--------------------------------------------------------------------------------
import std.stdio;
import core.atomic;
struct N {
    N* prev;
    N* next;
}
shared(N) n;
void main() {
    cas(&n, n, n);
    writeln(size_t.sizeof*2, N.sizeof); // prints 1616 (16 followed by 16)
}
--------------------------------------------------------------------------------
$ dmd -m64 c.d
$ /usr/bin/obj2asm c.o > c.o.asm
$ grep -i xchg c.o.asm
cmpxchg8b [R8]
However, the CPU does report CMPXCHG16B support:
$ grep flags /proc/cpuinfo | head -1 | grep 16
--------------------------------------------------------------------------------
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
--------------------------------------------------------------------------------
In particular, the cx16 flag is present:
CX16 * Supports CMPXCHG16B instruction
https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo
$ uname -a
Linux titan 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ dmd --version
DMD64 D Compiler v2.092.0
Copyright (C) 1999-2020 by The D Language Foundation, All Rights Reserved written by Walter Bright
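The same CPU capability can also be queried from D itself rather than from /proc/cpuinfo. The following is a minimal sketch, assuming druntime's core.cpuid module and its hasCmpxchg8b / hasCmpxchg16b properties; it reports what the CPU supports and says nothing about which instruction a given compiler emits.

```d
import core.cpuid : hasCmpxchg8b, hasCmpxchg16b;
import std.stdio : writeln;

void main()
{
    // Corresponds to the "cx8" flag in /proc/cpuinfo.
    writeln("cmpxchg8b  supported: ", hasCmpxchg8b);
    // Corresponds to the "cx16" flag in /proc/cpuinfo.
    writeln("cmpxchg16b supported: ", hasCmpxchg16b);
}
```

On the machine described in this report, the second line would be expected to agree with the cx16 flag shown above.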
Comment #1 by mingwu — 2020-05-17T19:37:48Z
BTW, I only tested on x86_64 Linux. I think this bug also occurs on other platforms such as macOS and Windows (on modern x86_64 CPUs with CX16 support).
Comment #2 by pro.mathias.lang — 2020-05-17T20:26:18Z
Doesn't affect OSX (used objdump, grepped for cmpxchg).
Comment #3 by mingwu — 2020-05-17T22:40:38Z
Yes, verified: macOS is not affected.
Thank you (at least there is a system I can use now).
$ objdump -disassemble-all c.o > c.o.asm
$ grep -i cmpxchg c.o.asm
80a: 49 0f c7 08 cmpxchg16b (%r8)
$ uname -a
Darwin 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64
Comment #4 by greeenify — 2020-05-17T23:38:37Z
> at least there is a system I can use now).
Well, you could just use LDC like almost everyone else ;-)
Comment #5 by mingwu — 2020-05-18T02:15:27Z
LDC? Did I miss something?
--------------------------------------------------------------------------------
$ ldc2 -m64 -c c.d
$ obj2asm c.o > c.o.asm
$ grep -i xchg c.o.asm
cmpxchg8b [RSI]
cmpxchg8b [RSI]
cmpxchg8b [RSI]
cmpxchg8b [RSI]
cmpxchg8b [RSI]
.data._D100TypeInfo_S3ldc10intrinsics__T13CmpXchgResultTS4core8internal6atomic__T11_AtomicTypeTS1c1NZ5UCentZQCq6__initZ segment
_D100TypeInfo_S3ldc10intrinsics__T13CmpXchgResultTS4core8internal6atomic__T11_AtomicTypeTS1c1NZ5UCentZQCq6__initZ:
.data._D100TypeInfo_S3ldc10intrinsics__T13CmpXchgResultTS4core8internal6atomic__T11_AtomicTypeTS1c1NZ5UCentZQCq6__initZ ends
$ ldc2 --version
LDC - the LLVM D compiler (1.21.0):
based on DMD v2.091.1 and LLVM 10.0.0
built with LDC - the LLVM D compiler (1.21.0)
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
http://dlang.org - http://wiki.dlang.org/LDC
--------------------------------------------------------------------------------
Want to file a bug against LDC ?
Comment #6 by kinke — 2020-05-18T15:27:00Z
(In reply to mw from comment #5)
> Want to file a bug against LDC ?
No need, cmpxchg16 is used for all x86_64 CPUs: https://d.godbolt.org/z/HesA24
For the few old CPUs not supporting it, it can be disabled via `-mattr=-cx16` (but then it doesn't fall back to cmpxchg8 anyway, so no idea how your results came about).
Comment #7 by mingwu — 2020-05-18T16:45:26Z
Hi kinke,
> so no idea how your results came about
I downloaded directly from:
https://github.com/ldc-developers/ldc/releases/download/v1.21.0/ldc2-1.21.0-linux-x86_64.tar.xz
And I just downloaded 1.20, which is on the d.godbolt.org page you mentioned, but the result is the same:
--------------------------------------------------------------------------------
$ wget https://github.com/ldc-developers/ldc/releases/download/v1.20.0/ldc2-1.20.0-linux-x86_64.tar.xz
$ ldc2 -m64 -c c.d
$ obj2asm c.o > c.o.asm
$ grep -i xchg c.o.asm
cmpxchg8b [RSI]
cmpxchg8b [RSI]
cmpxchg8b [RSI]
cmpxchg8b [RSI]
cmpxchg8b [RSI]
.data._D100TypeInfo_S3ldc10intrinsics__T13CmpXchgResultTS4core8internal6atomic__T11_AtomicTypeTS1c1NZ5UCentZQCq6__initZ segment
_D100TypeInfo_S3ldc10intrinsics__T13CmpXchgResultTS4core8internal6atomic__T11_AtomicTypeTS1c1NZ5UCentZQCq6__initZ:
.data._D100TypeInfo_S3ldc10intrinsics__T13CmpXchgResultTS4core8internal6atomic__T11_AtomicTypeTS1c1NZ5UCentZQCq6__initZ ends
$ ldc2 --version
LDC - the LLVM D compiler (1.20.0):
based on DMD v2.090.1 and LLVM 9.0.1
built with LDC - the LLVM D compiler (1.20.0)
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
http://dlang.org - http://wiki.dlang.org/LDC
$ uname -a
Linux titan 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
--------------------------------------------------------------------------------
What else should I check?
Or could you try my steps on a Linux box?
Comment #8 by mingwu — 2020-05-18T16:56:00Z
And for GDC:
--------------------------------------------------------------------------------
$ gdc-10 -m64 -c c.d
$ obj2asm c.o > c.o.asm
$ grep -i xch c.o.asm
extrn __atomic_compare_exchange
call __atomic_compare_exchange@PLT32
--------------------------------------------------------------------------------
On this page: https://d.godbolt.org/z/HesA24
I changed the compiler to "gdc 9.2.0" and searched the output window for 'xch':
mov rsi, rax
mov edi, 16
call __atomic_compare_exchange
mov BYTE PTR [rbp-1], al
.loc 3 1413 9
movzx eax, BYTE PTR [rbp-1]
I'm not an asm guy. Can someone help me find where this __atomic_compare_exchange is defined, and confirm that it ends up using the correct CMPXCHG16B instruction?
Thanks.
Comment #9 by mingwu — 2020-05-18T16:56:37Z
Oh,
$ gdc-10 --version
gdc-10 (Debian 10.1.0-1) 10.1.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Comment #10 by kinke — 2020-05-18T17:27:44Z
(In reply to mw from comment #7)
> What else should I check?
Wrt. LDC, I'm almost certain this is solely an issue with your 'workflow' involving obj2asm. - Godbolt runs on Linux (but you can inspect the produced assembly for any target using LDC's -mtriple option).
Comment #11 by mingwu — 2020-05-18T21:37:46Z
> this is solely an issue with your 'workflow' involving obj2asm.
Thank you. It is indeed a problem with obj2asm. With GNU objdump instead:
--------------------------------------------------------------------------------
$ ldc2 -m64 c.d
$ objdump -S --disassemble c > c.asm
$ grep -i cmpxch c.asm
f5b3: f0 48 0f c7 0e lock cmpxchg16b (%rsi)
1ecc3: f0 48 0f b1 3a lock cmpxchg %rdi,(%rdx)
1ed32: f0 40 0f b0 3a lock cmpxchg %dil,(%rdx)
1ed42: 66 f0 0f b1 3a lock cmpxchg %di,(%rdx)
30563: 48 8d 15 ba ea 01 00 lea 0x1eaba(%rip),%rdx # 4f024 <_D4core5cpuid13_hasCmpxchg8byb>
30573: 48 8d 15 ab ea 01 00 lea 0x1eaab(%rip),%rdx # 4f025 <_D4core5cpuid14_hasCmpxchg16byb>
30da6: f0 48 0f b1 3c ce lock cmpxchg %rdi,(%rsi,%rcx,8)
--------------------------------------------------------------------------------
I found the cmpxchg16b instruction.
But I'm not sure what the other 'cmpxchg' is. Can some asm expert help explain?
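The entries in the listing can be told apart by their opcode bytes. In 64-bit mode, `0F C7 /1` is cmpxchg8b, and the same opcode with a REX prefix whose W bit is set (0x48 to 0x4F) is cmpxchg16b. `0F B0` and `0F B1` are the ordinary single-operand-width cmpxchg for 1-byte and 2/4/8-byte values. These presumably come from other atomic operations in the linked runtime rather than from the `cas` call on `N`. Below is a rough sketch of that distinction. The `classify` function is a hypothetical helper written for illustration, not part of any tool, and it handles only the prefixes that appear in the listing (0xF0 lock, 0x66 operand-size, REX).

```d
import std.stdio : writeln;

/// Hypothetical helper: names the compare-exchange variant encoded by the
/// bytes of one instruction. Handles only the lock (0xF0), operand-size
/// (0x66) and REX (0x40..0x4F) prefixes.
string classify(const(ubyte)[] code)
{
    size_t i = 0;
    bool rexW = false;
    bool opSize16 = false;

    // Skip the legacy prefixes seen in the listing.
    while (i < code.length && (code[i] == 0xF0 || code[i] == 0x66))
    {
        if (code[i] == 0x66)
            opSize16 = true;
        ++i;
    }
    // In 64-bit mode, 0x40..0x4F is a REX prefix; bit 3 is REX.W.
    if (i < code.length && (code[i] & 0xF0) == 0x40)
    {
        rexW = (code[i] & 0x08) != 0;
        ++i;
    }
    // Need the 0x0F escape byte, the opcode byte and a ModRM byte.
    if (i + 2 >= code.length || code[i] != 0x0F)
        return "other";

    immutable opcode = code[i + 1];
    immutable regField = (code[i + 2] >> 3) & 7; // the "/digit" in the ModRM byte

    switch (opcode)
    {
        case 0xB0:
            return "cmpxchg (8-bit)";
        case 0xB1:
            return rexW ? "cmpxchg (64-bit)"
                 : opSize16 ? "cmpxchg (16-bit)" : "cmpxchg (32-bit)";
        case 0xC7:
            if (regField == 1)
                return rexW ? "cmpxchg16b" : "cmpxchg8b";
            return "other";
        default:
            return "other";
    }
}

void main()
{
    writeln(classify([0xF0, 0x48, 0x0F, 0xC7, 0x0E])); // cmpxchg16b
    writeln(classify([0x49, 0x0F, 0xC7, 0x08]));       // cmpxchg16b
    writeln(classify([0x0F, 0xC7, 0x08]));             // cmpxchg8b
    writeln(classify([0xF0, 0x48, 0x0F, 0xB1, 0x3A])); // cmpxchg (64-bit)
    writeln(classify([0xF0, 0x40, 0x0F, 0xB0, 0x3A])); // cmpxchg (8-bit)
    writeln(classify([0x66, 0xF0, 0x0F, 0xB1, 0x3A])); // cmpxchg (16-bit)
}
```

Note that the bytes `49 0f c7 08` from the macOS objdump output in comment #3 carry the REX.W bit, so a disassembler that ignores the prefix would show them as cmpxchg8b.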
BTW, I found another issue with LDC, using the code from https://d.godbolt.org/z/HesA24
(i.e. with the import std.stdio and writeln removed):
--------------------------------------------------------------------------------
$ cat c.d
//import std.stdio;
import core.atomic;
struct N {
    N* prev;
    N* next;
}
shared(N) n;
void main() {
    cas(&n, n, n);
    //writeln(size_t.sizeof*2, N.sizeof);
}
$ ldc2 -m64 c.d
$ ./c
Segmentation fault (core dumped)
$ ldc2 --version
LDC - the LLVM D compiler (1.20.0):
based on DMD v2.090.1 and LLVM 9.0.1
built with LDC - the LLVM D compiler (1.20.0)
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
http://dlang.org - http://wiki.dlang.org/LDC
--------------------------------------------------------------------------------
On https://d.godbolt.org/z/HesA24
the "Output" dropdown has an option "Run the compiled binary"; I selected it but didn't see any result.
With import std.stdio and writeln, the LDC-compiled binary behaves normally (no segfault):
--------------------------------------------------------------------------------
$ ldc2 -m64 c.d
$ ./c
1616
--------------------------------------------------------------------------------
Can you try if you can reproduce this segfault on a local Linux box?
Comment #12 by kinke — 2020-05-18T22:09:40Z
(In reply to mw from comment #11)
> Can you try if you can reproduce this segfault on a local Linux box?
We're abusing DMD's bug tracker, but anyway: you need to manually take care of required 16-bytes alignment:
align(16) shared(N) n; // or `align(2 * size_t.sizeof)`
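Applied to the test program from comment #11, the suggested fix might look like the sketch below. The runtime `assert` on the address is an addition for illustration only. It is there because cmpxchg16b faults on a memory operand that is not 16-byte aligned, which would be consistent with the segfault reported above.

```d
import core.atomic;

struct N {
    N* prev;
    N* next;
}

// Request 16-byte alignment explicitly; a struct of two pointers
// only requires pointer (8-byte) alignment by default.
align(16) shared(N) n;

void main() {
    // Illustrative sanity check: the 16-byte CAS needs a 16-byte aligned address.
    assert((cast(size_t) &n) % 16 == 0, "n is not 16-byte aligned");
    cas(&n, n, n);
}
```

Without the `align(16)`, whether `n` happens to land on a 16-byte boundary depends on what else ends up in the data segment, which may be why adding or removing the `writeln` changed the behaviour.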
Comment #13 by mingwu — 2020-05-18T22:26:23Z
> you need to manually take care of required 16-bytes alignment:
> align(16) shared(N) n; // or `align(2 * size_t.sizeof)`
Thank you again!
(I'm a newbie to D and not sure where the best place is to continue this discussion; please let me know.)
BUT: can the DMD compiler (after seeing the 'cas' call) take care of this alignment, either silently or by issuing a warning message to the programmer?
Can I log another bug for this suggestion of DMD compiler improvement?
The current behavior that I just discovered is definitely puzzling for a D newbie like me. A smarter compiler would help new users.