Bug 4890 – GC.collect() deadlocks multithreaded program.

Status
RESOLVED
Resolution
FIXED
Severity
regression
Priority
P2
Component
dmd
Product
D
Version
D2
Platform
x86_64
OS
Linux
Creation time
2010-09-18T14:31:00Z
Last change time
2014-10-06T20:47:18Z
Keywords
industry
Assigned to
nobody
Creator
tomash.brechko

Attachments

ID    Filename       Summary      Content-Type  Size
1416  backtrace.txt  stack trace  text/plain    1246

Comments

Comment #0 by tomash.brechko — 2010-09-18T14:31:41Z
The following program deadlocks with dmd 2.049 (consistently, though it may be subject to a race):

import core.thread;
import core.memory;
import std.stdio;

class Job : Thread
{
    this() { super(&run); }

private:
    void run() { while (true) write("*"); }
}

void main()
{
    Job j = new Job;
    j.start();
    //j.sleep(1);
    GC.collect();
    while (true) write(".");
}

Uncommenting j.sleep(1); avoids the deadlock.
Comment #1 by sean — 2010-09-21T11:35:29Z
Fixed in druntime changeset 392. Will be in DMD-2.050.
Comment #2 by sean — 2011-01-04T13:41:41Z
It turns out that the fix I applied produces a race condition with the GC. I'll have to re-wrap Thread.start() in a synchronized block as per the code prior to rev 392. This may re-introduce the deadlock, in which case it will be necessary to replace the isRunning flag with a state field that distinguishes starting from running. A starting thread should be suspended/resumed but not scanned. Or perhaps something else can be sorted out to deal with a thread being in the list that doesn't have its TLS section set, getThis() doesn't work, etc.
Comment #3 by code — 2011-01-21T15:12:13Z
I've also stumbled over the race condition in thread_processGCMarks() where a thread was already added to the global thread list but didn't have its m_tls set yet. It seems fine to test for m_tls being null at that specific place.
Comment #4 by schveiguy — 2011-01-24T06:03:14Z
(In reply to comment #3)
> I've also stumbled over the race condition in thread_processGCMarks() where a
> thread was already added to the global thread list but didn't have its m_tls
> set yet. It seems fine to test for m_tls being null at that specific place.

That's something I recently added. Sean, can you confirm that if a thread's m_tls is not yet set, then its actual TLS cannot have been used yet? It seems reasonable to check the TLS block for null at that point. (will have to start using github soon...)
Comment #5 by wallbraker — 2011-07-11T18:01:31Z
This looks fixed with 2.054 on Mac OS X; at least I can no longer repro it.
Comment #6 by sean — 2011-09-06T11:39:15Z
A thread will be added to the global thread list before its TLS range is set, but the range will be set before the thread ever actually uses TLS data. I think this one can be closed.
Comment #7 by stanislav.blinov — 2014-02-08T14:05:36Z
A quick search led me to this issue. It would appear the deadlock still occurs. I've been encountering it now and then, first when running singleton tests from http://forum.dlang.org/thread/[email protected], then when running druntime unittests (more specifically, test/shared/host) while working on providing shared qualifiers for core.sync primitives. At first I thought it had to do with my changes to druntime, but after testing on a clean druntime I encountered it as well. The deadlock doesn't happen on every run though, so it may be tricky to track down. It's in this piece of test/shared/src/plugin.d:

23 launchThread();
24 GC.collect();
25 joinThread();

GC.collect() simply doesn't return. I haven't investigated deeper yet. Maybe it has something to do with the GC trying to pause/resume an exiting/finished thread? This is on 64-bit Linux.
Comment #8 by safety0ff.bugz — 2014-02-08T15:04:51Z
(In reply to comment #7)
> A quick search lead me to this issue. It would appear the deadlock still
> occurs.
> [...SNIP...]
> The deadlock doesn't happen on every run though, so may be tricky to track
> down. It's in this piece of test/shared/src/plugin.d:
>
> 23 launchThread();
> 24 GC.collect();
> 25 joinThread();
>
> GC.collect() simply doesn't return. I haven't investigated deeper yet. Maybe
> it has something to do with GC trying to pause/resume an exiting/finished
> thread? This is on 64-bit Linux.

Does the code in the first post deadlock for you? If not, then issues #11981 / #10351 also look relevant. For #11981 / #10351, we REALLY need a way to reproduce the deadlock, along with information about the system it is running on (glibc version, Linux kernel version, etc., as much as we can get).
Comment #9 by stanislav.blinov — 2014-02-08T15:09:32Z
(In reply to comment #8)
> Does the code in the first post deadlock for you?

No it doesn't.

> If not then issues #11981 / #10351 also look relevant.
>
> For #11981 / #10351, we REALLY need a way to reproduce the deadlock, along
> with information about the system it is running on (glibc version, linux
> kernel version, etc, as much as we can get.)

I'll go look at those issues then.
Comment #10 by andrea.9940 — 2014-09-04T15:16:34Z
This bug is present in DMD 2.066 on Arch Linux 3.14.17-1-lts x86_64 (GNU libc 2.19). The code posted originally still deadlocks (and even with j.sleep uncommented, it never prints a "." which means GC.collect never returns):

import core.thread, core.memory, std.stdio;

class Job : Thread
{
    this() { super(&run); }
    private void run() { while (true) write("*"); }
}

void main()
{
    Job j = new Job;
    j.start();
    //j.sleep(dur!"msecs"(1));
    GC.collect();
    while (true) write(".");
}
Comment #11 by sean — 2014-09-04T16:38:05Z
My initial guess is that this has something to do with the changes for critical regions, as the algorithm for collection before that seemed quite solid. I'll try for a repro on my end though. What would be really useful from whoever encounters this is to trap it in a debugger and include stack traces of all relevant threads. Something has to be blocked on a lock or signal somewhere, but without knowing which one there's little that can be done.
Comment #12 by sean — 2014-09-04T16:49:43Z
Um... I may be wrong in what I just said. It looks like someone added a delegate call within the signal handler for coordinating collections on Linux. There's a decent chance that a dynamic stack frame is being allocated by the GC within that signal handler, which would be Very Bad.
Comment #13 by andrea.9940 — 2014-09-04T17:10:48Z
Just tested: the bug is not present on Windows (DMD 2.066).
Comment #14 by sean — 2014-09-04T17:15:04Z
It's likely as I said. The way GC collections work is different on different platforms. Both Windows and OSX use a kernel call to suspend threads and inspect their stacks. On other Unix platforms (like Linux), the suspending is done via signals, and signal handlers are VERY restrictive in what can safely be done inside them. And either way, having one thread try to allocate something from the GC inside this suspend handler is a guaranteed deadlock. If this is really what's going on I'm amazed that D on Linux works at all. Maybe it really is something else... I'm setting up a new Linux VM and so should hopefully be able to repro this shortly.
Comment #15 by sean — 2014-09-04T20:15:29Z
Okay, I can't reproduce this using the provided code on Oracle Linux 64-bit. If someone has a reliable repro, please let me know.
Comment #16 by andrea.9940 — 2014-09-05T09:50:27Z
(In reply to Sean Kelly from comment #15)
> Okay, I can't reproduce this using the provided code on Oracle Linux 64-bit.
> If someone has a reliable repro, please let me know.

My Linux machine is using Arch Linux, 3.14.17-1-lts x86_64 kernel, GNU libc 2.19. Oracle Linux is completely different, as it uses the 3.8.13 x86_64 kernel and glibc 2.17 (http://www.oracle.com/us/technologies/linux/product/specifications/index.html). Try Manjaro Linux, which is based on Arch but comes with a ready desktop environment (just run `pacman -S dlang-dmd` to get DMD).
Comment #17 by andrea.9940 — 2014-09-05T10:51:01Z
Created attachment 1416
stack trace
Comment #18 by Marco.Leise — 2014-09-05T11:00:14Z
*** Issue 10351 has been marked as a duplicate of this issue. ***
Comment #19 by dfj1esp02 — 2014-09-05T18:44:16Z
(In reply to badlink from comment #17)
> stack trace

Hmm... if a thread hangs on a mutex, does it handle signals?
Comment #20 by sean — 2014-09-05T18:52:59Z
It should. Not doing so seems pretty broken. But in this particular kernel it seems like maybe signals are ignored in this situation. What's happening specifically is that one thread is blocked on the mutex protecting the GC, and another thread holds that lock and is attempting a collection. I could change this code to use a spin lock instead, but the same problem could crop up with any mutex if I understand the problem correctly. I'm kind of curious to see whether the Boehm GC deadlocks in a similar situation with this kernel. It should, since last time I checked it coordinated collections the exact same way on Linux.
Comment #21 by dfj1esp02 — 2014-09-05T19:53:00Z
This mutex protects various global data like the list of threads in core.thread, not GC.
Comment #22 by sean — 2014-09-05T20:12:27Z
Yes I misspoke somewhat. The GC acquires the lock to the global thread list while collecting to ensure that everything remains in a consistent state while the collection takes place. In this case the GC already holds this lock and Thread.start() is blocked on it waiting to add the new thread to the list.
Comment #23 by tomash.brechko — 2014-09-06T13:01:54Z
I think the order of events is such that pthread_create() is followed by pthread_kill() from the main thread before the new thread has had any chance to run. In this case there are reports that the new thread may miss signals on Linux: http://stackoverflow.com/questions/14827509/does-the-new-thread-exist-when-pthread-create-returns . I think the POSIX intent is that pthread_kill() should work once you have the thread ID, i.e. it's a bug in (some versions of) the Linux kernel (maybe the signal is first raised and then pending signals are cleared (per POSIX) for the new thread when it starts, or the signal never becomes pending because it is not blocked, but is not delivered either because the thread is not really running yet; though on my 3.15.10 pthread_kill() after pthread_create() always works in C, and I don't have a D compiler at the moment to check whether I can still reproduce the original problem). OTOH issue 10351 is marked as a duplicate, but it's not clear whether the threads involved there are newly created.

On a side note, in thread_entryPoint() there's a place:

    // NOTE: isRunning should be set to false after the thread is
    //       removed or a double-removal could occur between this
    //       function and thread_suspendAll.
    Thread.remove( obj );
    obj.m_isRunning = false;

Note that if thread_suspendAll() is called after remove() but before the assignment you will still have a double removal. This shouldn't relate to the bug in question, however.
Comment #24 by tomash.brechko — 2014-09-06T13:15:01Z
Now I see that I was wrong about double removal, please ignore that part.
Comment #25 by sean — 2014-09-06T18:58:49Z
Hrm... at one point thread_entryPoint called Thread.add to add itself, but I think the add was moved to Thread.start at some point to deal with a race. I had a comment in Thread.start explaining the rationale, but it looks like Thread.start has been heavily edited and the comment is gone. Either way, having Thread.start call Thread.add *after* pthread_create is totally wrong, as it leaves a window for the thread to exist and be allocating memory but be unknown to the GC. I think I'll have to roll back thread.d to find my original comments and see how it used to be implemented. Something was clearly changed here, but there's no longer enough info to tell exactly what. I've got to say that seeing these and other changes in core.thread without careful documentation of what was changed and why it was done is very frustrating. There's simply no way to unit test for the existence or lack of deadlocks, and the comments in this module were built up over years of bug fixes to explain each situation and why the code was the way it was. If someone changes the code in this module they *must* be absolutely sure of what they are doing and document accordingly.
Comment #26 by safety0ff.bugz — 2014-09-06T19:14:25Z
(In reply to Sean Kelly from comment #25)
> I think I'll have to roll back thread.d to find my original comments and see
> how it used to be implemented. Something was clearly changed here, but
> there's no longer enough info to tell exactly what.

This change? https://github.com/D-Programming-Language/druntime/commit/7a731ffe0869dc
Comment #27 by sean — 2014-09-08T16:30:04Z
Earlier than that.
Comment #28 by braddr — 2014-09-10T09:20:48Z
Might not be related, but for reference, bug 13416
Comment #29 by andrea.9940 — 2014-10-01T08:06:49Z
Also present in DMD 2.067.0-b1. Stacktrace of the sample program in comment 10: http://pastebin.com/4mudSeEX
Comment #30 by andrea.9940 — 2014-10-01T08:29:40Z
*** Issue 11806 has been marked as a duplicate of this issue. ***
Comment #31 by dfj1esp02 — 2014-10-02T11:28:40Z
Hmm... stack trace in issue 11806 is quite different.
Comment #32 by code — 2014-10-04T20:58:08Z
Can we first confirm that this is a regression?
Comment #33 by sean — 2014-10-05T02:52:41Z
The GC was in use for probably 5 years without a reported deadlock. Though history isn't exactly proof. I don't suppose someone wants to regress this and find the offending release? Isn't there a D tool for this? Or would we be stuck with git bisect?
Comment #34 by andrea.9940 — 2014-10-05T10:04:22Z
(In reply to Sean Kelly from comment #33)
> The GC was in use for probably 5 years without a reported deadlock. Though
> history isn't exactly proof. I don't suppose someone wants to regress this
> and find the offending release? Isn't there a D tool for this? Or would we
> be stuck with git bisect?

I have just tried a few old versions:
- 2.054 (self-compiled) deadlocks
- 2.042 (self-compiled) deadlocks
- 1.076 (http://dlang.org/download.html) fullCollect never returns
- 1.030 (http://dlang.org/download.html) fullCollect never returns

Code used for D1: http://pastebin.com/wu9guHA6
Stacktrace (DMD v1.030): http://pastebin.com/CfNStRvm

Does anyone else have this issue? Right now I'm on Arch Linux 3.14.19-1-lts x86_64, GNU libc 2.20.
Comment #35 by andrea.9940 — 2014-10-06T20:47:18Z
I took my time and started digging through the druntime source to find the problem. I discovered that core.thread.suspend() sends SIGUSR1 to the thread to suspend, so I launched GDB and tried to catch the signal, but GDB never caught it. I couldn't explain that behavior, so I googled "linux sigusr1 not sent" and bam: https://bbs.archlinux.org/viewtopic.php?id=181142 The people there had already figured out that it's a bug in the GNOME Display Manager, which blocks SIGUSR1 for all child applications! The bug is present in the current package gdm-3.12.2-1, which I am using, and is causing this nasty deadlock in the D garbage collector. As a quick test I ran the testcase in a tty and it worked as expected. Hopefully the problem will solve itself in the next gdm update.