Bug 24254 – LDC crash on Epyc Bergamo

Status
NEW
Severity
major
Priority
P1
Component
druntime
Product
D
Version
D2
Platform
x86_64
OS
All
Creation time
2023-11-21T11:28:12Z
Last change time
2024-12-07T13:43:03Z
Assigned to
No Owner
Creator
Jure Pečar
Moved to GitHub: dmd#17211 →

Comments

Comment #0 by jurij.pecar — 2023-11-21T11:28:12Z
Trying to figure out why Sambamba is crashing on Bergamo I noticed that issue is already present with LDC. Just trying to start it results in a stack trace:

LDC 1.24 binary build:

# ldc2
ldc2[0x33f03d4]
Floating point exception (core dumped)

LDC 1.35 built from source with Easybuild, LLVM 16.0.6, GCC 12.3

# ldc2
#0 0x00007ffff43c6bbe llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/g/easybuild/x86_64/Rocky/8/genoa/software/LLVM/16.0.6-GCCcore-12.3.0/lib/libLLVM-16.so+0x918bbe)
#1 0x00007ffff43c459b SignalHandler(int) Signals.cpp:0:0
#2 0x00007ffff3494cf0 __restore_rt (/lib64/libpthread.so.0+0x12cf0)
#3 0x0000000000c82ac4 _D4core5cpuid8cpuidX86FNbNiNeZv (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0xc82ac4)
#4 0x0000000000c82c09 _D4core5cpuid26_sharedStaticCtor_L1065_C1FNbNiNeZv (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0xc82c09)
#5 0x0000000000c9cbc9 _D2rt5minfo13rt_moduleCtorUZ14__foreachbody1MFKSQBu19sections_elf_shared3DSOZi (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0xc9cbc9)
#6 0x0000000000c9db6a _D2rt19sections_elf_shared3DSO7opApplyFMDFKSQBqQBqQyZiZi (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0xc9db6a)
#7 0x0000000000c93428 rt_init (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0xc93428)
#8 0x0000000000c939ad _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0xc939ad)
#9 0x0000000000c93808 _d_run_main2 (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0xc93808)
#10 0x0000000000c9365e _d_run_main (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0xc9365e)
#11 0x0000000000797e4d main (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0x797e4d)
#12 0x00007ffff2b4cd85 __libc_start_main (/lib64/libc.so.6+0x3ad85)
#13 0x000000000079a12e _start (/g/easybuild/x86_64/Rocky/8/znver4/software/LDC/1.35.0-GCCcore-12.3.0/bin/ldc2+0x79a12e)

Can't run binary release 1.35 as it requires glibc 2.29 (el8 only has 2.28).

Does this stack trace ring any bells? Otherwise please let me know what further info would be useful to provide in order to fix this issue. It might be that root cause is further still, in llvm.

Thanks,
Comment #1 by kinke — 2023-11-21T13:55:44Z
According to the backtrace, the problem is in druntime's `core.cpuid` - of the **host compiler's** druntime used to build LDC. Which one did you use? The issue might have been fixed in recent druntime already.
Comment #2 by jurij.pecar — 2023-11-21T14:20:53Z
I don't know what the official binaries are built with; my build used gcc 12.3, llvm 16.0.6 and ldc 1.24 to build ldc 1.35.
Comment #3 by kinke — 2023-11-21T14:47:33Z
Please leave my component and hardware changes in place; this has absolutely nothing to do with the DMD compiler. The official LDC binaries are compiled with LDC itself, so v1.35 is built with v1.35. So from your description, we only know that `core.cpuid` of the LDC v1.24 druntime, i.e. druntime v2.094, doesn't support your CPU. But as your LDC v1.24 host compiler works, this means that the host druntime used for building that LDC v1.24 works.
Comment #4 by jurij.pecar — 2023-11-21T14:58:21Z
Sorry this is my first time meeting D ecosystem, I'm having trouble following your feedback. So by "host compiler" you mean the previous version of LDC that was used to build current version of LDC? If that's the case, can you tell me which version of LDC started recognizing and working with zen4c cpus?
Comment #5 by kinke — 2023-11-21T16:47:20Z
(In reply to Jure Pečar from comment #4)
> Sorry this is my first time meeting D ecosystem, I'm having trouble following your feedback.

No worries.

> So by "host compiler" you mean the previous version of LDC that was used to build current version of LDC?

Yes.

> If that's the case, can you tell me which version of LDC started recognizing and working with zen4c cpus?

I don't know if it is working in current druntime. You could e.g. launch some Ubuntu/Debian container (min Ubuntu 20.04 for glibc) and try to run the official v1.35 in there. If there's no startup error, druntime v2.105 probably works.

FWIW, the problematic module is https://github.com/dlang/dmd/blob/master/druntime/src/core/cpuid.d. As the name suggests, it uses/depends on the CPUID instruction. I can only tell you that everything works on my workstation, a Threadripper 3960X.
Comment #6 by jurij.pecar — 2023-11-21T16:58:43Z
I'll try to wade through this cpuid detection logic in an attempt to spot something. How would I narrow down the approximate location in the code where the crash happens? FYI, Sambamba (and LDC) works fine on zen4 cpus such as Genoa and Genoa-X. /proc/cpuinfo reports identical cpuid level and flags for all three. The only difference for zen4c (Bergamo) should be a smaller cache. Does that help us narrow down the issue?
Comment #7 by jurij.pecar — 2023-11-21T17:53:46Z
Here's a diff of `cpuid -1` output from 32c Genoa (-) and 128c Bergamo (+):

@@ -3,16 +3,16 @@
 version information (1/eax):
 processor type = primary processor (0)
 family = 0xf (15)
-model = 0x1 (1)
-stepping id = 0x1 (1)
+model = 0x0 (0)
+stepping id = 0x2 (2)
 extended family = 0xa (10)
-extended model = 0x1 (1)
+extended model = 0xa (10)
 (family synth) = 0x19 (25)
-(model synth) = 0x11 (17)
-(simple synth) = AMD EPYC (4th Gen) (Genoa B1) [Zen 4], 5nm
+(model synth) = 0xa0 (160)
+(simple synth) = AMD Ryzen (Bergamo) [Zen 4c], 5nm
 miscellaneous (1/ebx):
-process local APIC physical ID = 0x10 (16)
-maximum IDs for CPUs in pkg = 0x40 (64)
+process local APIC physical ID = 0xd6 (214)
+maximum IDs for CPUs in pkg = 0xff (255)
 CLFLUSH line size = 0x8 (8)
 brand index = 0x0 (0)
 brand id = 0x00 (0): unknown
@@ -80,7 +80,7 @@
 RDRAND instruction = true
 hypervisor guest status = false
 cache and TLB information (2):
-processor serial number = 00A1-0F11-0000-0000-0000-0000
+processor serial number = 00AA-0F02-0000-0000-0000-0000
 deterministic cache parameters (4):
 --- cache 0 ---
 cache type = no more caches (0)
@@ -287,7 +287,7 @@
 bit width of fixed counters = 0x0 (0)
 anythread deprecation = false
 x2APIC features / processor topology (0xb):
-extended APIC ID = 16
+extended APIC ID = 214
 --- level 0 ---
 level number = 0x0 (0)
 level type = thread (1)
@@ -296,8 +296,8 @@
 --- level 1 ---
 level number = 0x1 (1)
 level type = core (2)
-bit width of level & previous levels = 0x6 (6)
-number of logical processors at level = 0x40 (64)
+bit width of level & previous levels = 0x8 (8)
+number of logical processors at level = 0x100 (256)
 --- level 2 ---
 level number = 0x2 (2)
 level type = invalid (0)
@@ -401,13 +401,13 @@
 highest COS number supported = 0xf (15)
 extended processor signature (0x80000001/eax):
 family/generation = 0xf (15)
-model = 0x1 (1)
-stepping id = 0x1 (1)
+model = 0x0 (0)
+stepping id = 0x2 (2)
 extended family = 0xa (10)
-extended model = 0x1 (1)
+extended model = 0xa (10)
 (family synth) = 0x19 (25)
-(model synth) = 0x11 (17)
-(simple synth) = AMD EPYC (4th Gen) (Genoa B1) [Zen 4], 5nm
+(model synth) = 0xa0 (160)
+(simple synth) = AMD Ryzen (Bergamo) [Zen 4c], 5nm
 extended feature flags (0x80000001/edx):
 x87 FPU on chip = true
 virtual-8086 mode enhancement = true
@@ -469,7 +469,7 @@
 LLC performance counter extensions = true
 MWAITX/MONITORX supported = true
 Address mask extension support = true
-brand = "AMD EPYC 9334 32-Core Processor "
+brand = "AMD EPYC 9754 128-Core Processor "
 L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
 instruction # entries = 0x40 (64)
 instruction associativity = 0xff (255)
@@ -509,7 +509,7 @@
 line size (bytes) = 0x40 (64)
 lines per tag = 0x1 (1)
 associativity = 0x9 (9)
-size (in 512KB units) = 0x100 (256)
+size (in 512KB units) = 0x200 (512)
 RAS Capability (0x80000007/ebx):
 MCA overflow recovery support = true
 SUCCOR support = true
@@ -566,8 +566,8 @@
 branch sampling feature support = false
 (vuln to branch type confusion synth) = false
 Size Identifiers (0x80000008/ecx):
-number of threads = 0x40 (64)
-ApicIdCoreIdSize = 0x6 (6)
+number of threads = 0x100 (256)
+ApicIdCoreIdSize = 0x8 (8)
 performance time-stamp counter size = 40 bits (0)
 Feature Extended Size (0x80000008/edx):
 max page count for INVLPGB instruction = 0x7 (7)
@@ -714,13 +714,13 @@
 line size in bytes = 0x40 (64)
 physical line partitions = 0x1 (1)
 number of ways = 0x10 (16)
-number of sets = 32768
+number of sets = 16384
 write-back invalidate = true
 cache inclusive of lower levels = false
-(synth size) = 33554432 (32 MB)
-extended APIC ID = 16
+(synth size) = 16777216 (16 MB)
+extended APIC ID = 214
 Core Identifiers (0x8000001e/ebx):
-core ID = 0x8 (8)
+core ID = 0x6b (107)
 threads per core = 0x2 (2)
 Node Identifiers (0x8000001e/ecx):
 node ID = 0x0 (0)
@@ -799,14 +799,14 @@
 number of LBR stack entries = 0x10 (16)
 number of avail Northbridge perf ctrs = 0x10 (16)
 number of available UMC PMCs = 0x20 (32)
-active UMCs bitmask = 0x6db
+active UMCs bitmask = 0xfff
 Multi-Key Encrypted Memory Capabilities (0x80000023):
 secure host multi-key memory support = true
 number of encryption key IDs = 0x3f (63)
 0x80000024 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
 0x80000025 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
 AMD Extended CPU Topology (0x80000026):
-extended APIC ID = 16
+extended APIC ID = 214
 --- level 0 ---
 level number = 0x0 (0)
 level type = core (1)
@@ -821,9 +821,9 @@
 CMPXCHG8B = true
 conditional move/compare = true
 PREFETCH/PREFETCHW = true
-(multi-processing synth) = multi-core (c=32), hyper-threaded (t=2)
+(multi-processing synth) = multi-core (c=128), hyper-threaded (t=2)
 (multi-processing method) = AMD leaf 0xb
-(APIC widths synth): CORE_width=5 SMT_width=1
-(APIC synth): PKG_ID=0 CORE_ID=8 SMT_ID=0
-(uarch synth) = AMD Zen 4, 5nm
-(synth) = AMD EPYC (4th Gen) (Genoa B1) [Zen 4], 5nm
+(APIC widths synth): CORE_width=7 SMT_width=1
+(APIC synth): PKG_ID=0 CORE_ID=107 SMT_ID=0
+(uarch synth) = AMD Zen 4c, 5nm
+(synth) = AMD Ryzen (Bergamo) [Zen 4c], 5nm

Since that cpuid.d is mostly poking around these register values, I'm pretty sure that the key to fixing this issue is hiding in here.
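
For readers comparing the two machines, the "synth" values in the diff above can be reproduced from the raw fields with a little arithmetic. The following is a minimal Python sketch, not code from the `cpuid` tool or druntime. It assumes the usual CPUID convention that the extended family and extended model fields are combined with the base fields when the base family is 0xf, and that a cache's size is sets × ways × line size × partitions. The function names are hypothetical.

```python
# Hypothetical helpers; only the base-family-0xf case seen in the diff is handled.

def display_family(base_family: int, ext_family: int) -> int:
    """Combine base and extended family (added together when base is 0xf)."""
    if base_family == 0xF:
        return base_family + ext_family
    return base_family


def display_model(base_family: int, base_model: int, ext_model: int) -> int:
    """Combine base and extended model (extended model forms the high nibble)."""
    if base_family == 0xF:
        return (ext_model << 4) | base_model
    return base_model


def cache_size_bytes(sets: int, ways: int, line_size: int, partitions: int) -> int:
    """Total cache size from the deterministic cache parameters."""
    return sets * ways * line_size * partitions


# Values copied from the diff: Genoa (-) and Bergamo (+).
print(hex(display_family(0xF, 0xA)))        # family synth, same on both
print(hex(display_model(0xF, 0x1, 0x1)))    # Genoa model synth
print(hex(display_model(0xF, 0x0, 0xA)))    # Bergamo model synth
print(cache_size_bytes(32768, 16, 64, 1))   # Genoa cache from the -714 hunk
print(cache_size_bytes(16384, 16, 64, 1))   # Bergamo cache from the +714 hunk
```

The family and model differences only identify the part. The fields that change in magnitude between the two CPUs are the thread counts and APIC ID widths, which cross the 8-bit boundary on Bergamo (0x40 versus 0x100).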
Comment #8 by jurij.pecar — 2023-11-23T08:16:04Z
Keen eyes in easybuild community noticed that cpuid.d uses ubyte for numcores in a couple of places. For example, function getcacheinfoCPUID4 uses uint for numcores, but function getAMDcacheinfo uses ubyte. It also doesn't differentiate between cores and threads so I assume it walks all logical cpus there in the loop on lines 633-641. Bergamo has 128 cores, 256 threads, ubyte rolls over and then on line 659 you divide something by numcores. Boom.

To test this hypothesis, I disabled SMT on one of the Bergamo nodes. Indeed, LDC then works as expected:

# ldc2
Error: No source files

So I'd say the fix is to just s/ubyte/uint/g on cpuid.d. And check if you do any similar things elsewhere.

Thanks,
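
The rollover described in this comment can be illustrated with a short emulation. This is a sketch in Python, not the druntime source: it assumes the raw CPUID field holds the count minus one (as the pull request description later in this thread implies) and that the code increments it in an 8-bit variable before dividing by it. The 8-bit and 32-bit widths are emulated with bit masks, and all function names are hypothetical.

```python
# Hypothetical emulation of the suspected overflow; not the actual cpuid.d code.

def numcores_as_ubyte(raw_count_minus_one: int) -> int:
    """Emulate an 8-bit counter: store the raw value, then increment with wraparound."""
    return (raw_count_minus_one + 1) & 0xFF


def numcores_as_uint(raw_count_minus_one: int) -> int:
    """Same computation with a 32-bit counter, as suggested in the comment."""
    return (raw_count_minus_one + 1) & 0xFFFFFFFF


def per_core_share(total: int, numcores: int) -> int:
    """Stand-in for the division by numcores mentioned in the comment."""
    return total // numcores


GENOA_RAW = 0x40 - 1     # 64 logical CPUs reported in the diff, stored as count - 1
BERGAMO_RAW = 0x100 - 1  # 256 logical CPUs reported in the diff, stored as count - 1

print(numcores_as_ubyte(GENOA_RAW))    # fits in 8 bits
print(numcores_as_ubyte(BERGAMO_RAW))  # wraps around to zero
print(numcores_as_uint(BERGAMO_RAW))   # no wraparound with a wider type

try:
    per_core_share(1024, numcores_as_ubyte(BERGAMO_RAW))
except ZeroDivisionError:
    print("division by zero")
```

On x86, an integer division by zero raises a hardware exception that Linux delivers as SIGFPE, which would be consistent with the "Floating point exception" message in comment #0. Disabling SMT halves the logical CPU count to 128, so the incremented value fits in 8 bits again, matching the observation that LDC then starts normally.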
Comment #9 by dlang-bot — 2023-11-23T22:19:53Z
@kinke created dlang/dmd pull request #15859 "core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores" mentioning this issue:

- core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores

See: https://en.wikipedia.org/wiki/CPUID#EAX=80000008h:_Virtual_and_Physical_address_Sizes

This *might* fix Issue 24254, although I'd expect the read value for that CPU to be 127 (*physical* cores minus 1), not the problematic 255.

https://github.com/dlang/dmd/pull/15859
Comment #10 by kinke — 2023-11-23T22:29:54Z
(In reply to Jure Pečar from comment #8)
> Keen eyes in easybuild community noticed that cpuid.d uses ubyte for numcores in a couple of places […]

Thank you, and please send those keen eyes my regards. :)
Comment #11 by dlang-bot — 2023-11-24T15:08:17Z
dlang/dmd pull request #15859 "[stable] core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores" was merged into stable:

- edfa13e57de7b8597df2a95475179288c98cb25e by Martin Kinkelin: core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores

See: https://en.wikipedia.org/wiki/CPUID#EAX=80000008h:_Virtual_and_Physical_address_Sizes

This *might* fix Issue 24254, although I'd expect the read value for that CPU to be 127 (*physical* cores minus 1), not the problematic 255.

https://github.com/dlang/dmd/pull/15859
Comment #12 by dlang-bot — 2023-11-26T10:05:23Z
@WalterBright updated dlang/dmd pull request #15864 "fix Issue 24262 - Assert error with bit fields" mentioning this issue:

- core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores

See: https://en.wikipedia.org/wiki/CPUID#EAX=80000008h:_Virtual_and_Physical_address_Sizes

This *might* fix Issue 24254, although I'd expect the read value for that CPU to be 127 (*physical* cores minus 1), not the problematic 255.

https://github.com/dlang/dmd/pull/15864
Comment #13 by dlang-bot — 2023-11-26T10:52:34Z
@dkorpel created dlang/dmd pull request #15865 "Merge stable into master" mentioning this issue:

- core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores

See: https://en.wikipedia.org/wiki/CPUID#EAX=80000008h:_Virtual_and_Physical_address_Sizes

This *might* fix Issue 24254, although I'd expect the read value for that CPU to be 127 (*physical* cores minus 1), not the problematic 255.

https://github.com/dlang/dmd/pull/15865
Comment #14 by dlang-bot — 2023-11-26T12:12:05Z
dlang/dmd pull request #15865 "Merge stable into master" was merged into master:

- 27b891c0d810d1fcf88ff5e702e9d049232e8f8d by Martin Kinkelin: core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores

See: https://en.wikipedia.org/wiki/CPUID#EAX=80000008h:_Virtual_and_Physical_address_Sizes

This *might* fix Issue 24254, although I'd expect the read value for that CPU to be 127 (*physical* cores minus 1), not the problematic 255.

https://github.com/dlang/dmd/pull/15865
Comment #15 by robert.schadek — 2024-12-07T13:43:03Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/dmd/issues/17211 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB