The deadlock is extremely rare, but this time I managed to remember to capture the stack traces of an example before killing the process. It looks like the suspend signal arrived while the allocator was in use, and the signal handler subsequently called a function that needed to allocate. I'll bet that tls_get_addr_tail isn't legal to call from a signal handler. Filed with OS 'all', but quite likely it really affects all POSIX systems.
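For illustration, here's a minimal D sketch of the suspected failure mode. This is my own reconstruction with made-up names, not actual druntime code:

import core.sys.posix.signal;

int tlsCounter; // module-level variables are thread-local in D

extern (C) void suspendHandler(int sig) nothrow @nogc
{
    // If this module lives in a dlopen'ed .so, glibc allocates the
    // thread's TLS block lazily, so this first access may go through
    // __tls_get_addr -> allocate_and_init -> malloc(). If the signal
    // interrupted code that holds malloc's internal lock, we deadlock.
    ++tlsCounter;
}

void installHandler()
{
    sigaction_t sa;
    sa.sa_handler = &suspendHandler;
    sigaction(SIGUSR1, &sa, null); // druntime suspends threads via SIGUSR1
}

The captured stack traces: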
Thread 1 (Thread 0x2b40440adf40 (LWP 5616)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x00002b4044abcdb9 in core.thread.suspend() () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#2 0x00002b4044abcfac in thread_suspendAll () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#3 0x00002b4044ac7c73 in gc.gc.Gcx.fullcollect() () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#4 0x00002b4044ac629b in gc.gc.GC.fullCollect() () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#5 0x00002b4044ac9867 in gc_collect () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#6 0x00002b4044aba5f9 in core.memory.GC.collect() () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#7 0x00002b4045434469 in ?? ()
#8 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x2b4045836700 (LWP 5620)):
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00002b4044517b8c in _L_lock_152 () at malloc.c:5151
#2 0x00002b404451076b in get_free_list () at arena.c:774
#3 0x00002b4044515535 in arena_get2 (avoid_arena=<optimized out>, size=<optimized out>, a_tsd=<optimized out>) at arena.c:838
#4 __GI___libc_malloc (bytes=88) at malloc.c:2856
#5 0x00002b404406a2ad in allocate_and_init (map=0xe25640) at dl-tls.c:529
#6 tls_get_addr_tail (ti=0x2b4044cfabd0, dtv=0xe344d0, the_map=0xe25640) at dl-tls.c:742
#7 0x00002b4044abb6ec in core.thread.thread_suspendHandler() () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#8 0x00002b4044abcd00 in core.thread.callWithStackShell() () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#9 0x00002b4044abb6ca in thread_suspendHandler () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#10 <signal handler called>
#11 0x00002b404451077e in get_free_list () at arena.c:777
#12 0x00002b4044515535 in arena_get2 (avoid_arena=<optimized out>, size=<optimized out>, a_tsd=<optimized out>) at arena.c:838
#13 __GI___libc_malloc (bytes=32) at malloc.c:2856
#14 0x00002b4044f1bb9f in pthread_getattr_np (thread_id=47555044140800, attr=0x2b4045835e20) at pthread_getattr_np.c:167
#15 0x00002b4044abb3f5 in thread_entryPoint () from /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so
#16 0x00002b4044f19f6e in start_thread (arg=0x2b4045836700) at pthread_create.c:311
#17 0x00002b404458c9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Comment #1 by safety0ff.bugz — 2014-02-05T19:28:43Z
Comment #2 by stanislav.blinov — 2014-02-08T16:33:53Z
Don't know how useful this could be, but...
This code:
import std.concurrency;
import core.memory;
import core.thread;
import core.atomic;
import std.stdio;
// Reduced from host unittest
version(all) // version(none) avoids deadlock
{
    shared uint tctor, tdtor;
    static this() { auto v = atomicOp!"+="(tctor,1); writefln(" new thread %s", v); }
    static ~this() { auto v = atomicOp!"+="(tdtor,1); writefln(" cleanup %s", v); }
}

void main()
{
    enum uint numThreads = 16;
    long i = 0;
    while(true)
    {
        writefln("Starting run %s", i);
        shared uint finished;
        foreach(immutable t; 0..numThreads)
        {
            spawn((shared uint* f) { atomicOp!"+="(*f, 1); }, &finished);
        }
        long n = 0;
        while (atomicLoad!(MemoryOrder.acq)(finished) != numThreads)
        {
            writefln("suspend %s", n);
            thread_suspendAll();
            writefln("resume %s", n);
            thread_resumeAll();
            ++n;
        }
        thread_joinAll();
        writefln("Run %s done", i);
        ++i;
    }
}
deadlocks in every single run on my machine (though at a random point in each run).
$uname -a
Linux 3.8.0-26-generic #38-Ubuntu SMP Mon Jun 17 21:43:33 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ldd --version
ldd (Ubuntu EGLIBC 2.17-0ubuntu5.1) 2.17
Comment #3 by safety0ff.bugz — 2014-02-08T17:07:19Z
That program deadlocks because of the writefln between thread_suspendAll and thread_resumeAll (removing it removes the deadlock).
A notable characteristic of the GC-related deadlock is that one thread is (or should be) stuck in thread_suspendAll.
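To make the writefln point concrete, here's the problematic pattern as a sketch (illustrative only, reusing the names from the test above):

import core.thread : thread_suspendAll, thread_resumeAll;
import std.stdio : writefln;

void badStopTheWorld(long n)
{
    thread_suspendAll();
    // BUG (illustrative): writefln may block on stdout's lock or allocate
    // via the GC; a suspended thread can hold either, so this can deadlock.
    writefln("suspend %s", n);
    thread_resumeAll();
}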
Comment #4 by stanislav.blinov — 2014-02-08T17:16:28Z
(In reply to comment #3)
> That program deadlocks because of the writefln between thread_suspendAll and
> thread_resumeAll (removing it removes the deadlock).
Yes, I've just noticed. My bad :( Will try to look more into the original unittest.
Comment #5 by stanislav.blinov — 2014-02-08T18:07:14Z
Ok, I've built druntime with -g -debug and built a modified host.c that loops forever, to eventually hit that deadlock:
--- host.c 2014-02-09 01:45:36.399539870 +0400
+++ host_.c 2014-02-09 05:40:45.885080082 +0400
@@ -5,6 +5,7 @@
 int main(int argc, char* argv[])
 {
+while(1) {
     const size_t pathlen = strrchr(argv[0], '/') - argv[0] + 1;
     char *name = malloc(pathlen + sizeof("plugin1.so"));
     memcpy(name, argv[0], pathlen);
@@ -46,5 +47,6 @@
     assert(dlclose(plugin1) == 0);
     free(name);
+}
 
     return EXIT_SUCCESS;
 }
Here's the stack trace:
Thread 4 (Thread 0x7ffff6a05700 (LWP 16188)):
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff7675fcc in _L_lock_11850 () at malloc.c:5151
#2 0x00007ffff7673575 in __GI___libc_malloc (bytes=88) at malloc.c:2856
#3 0x00007ffff7ddb2cd in allocate_and_init (map=0x603610) at dl-tls.c:529
#4 tls_get_addr_tail (ti=0x7ffff73dcb28, dtv=0x611d00, the_map=0x603610) at dl-tls.c:742
#5 0x00007ffff71816e4 in core.thread.thread_suspendHandler() (this=0x0, sp=0x7ffff6a046c0) at src/core/thread.d:1069
#6 0x00007ffff718387f in core.thread.callWithStackShell() (fn=...) at src/core/thread.d:2085
#7 0x00007ffff71816bd in thread_suspendHandler (sig=10) at src/core/thread.d:411
#8 <signal handler called>
#9 0x00007ffff7673564 in __GI___libc_malloc (bytes=16) at malloc.c:2856
#10 0x00007ffff71811c7 in thread_entryPoint (arg=0x0) at src/rt/tlsgc.d:34
#11 0x00007ffff79c0f8e in start_thread (arg=0x7ffff6a05700) at pthread_create.c:311
#12 0x00007ffff76eaa0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 1 (Thread 0x7ffff7fc8740 (LWP 16182)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x00007ffff7183940 in core.thread.suspend() (t=0x7ffff7ec8c00) at src/core/thread.d:2244
#2 0x00007ffff7183bcd in thread_suspendAll () at src/core/thread.d:2322
#3 0x00007ffff7194300 in gc.gc.Gcx.fullcollect() (this=0x6056d0) at src/gc/gc.d:2389
#4 0x00007ffff71919cd in gc.gc.GC.fullCollect() (this=0x6056a0) at src/gc/gc.d:1156
#5 0x00007ffff71971fa in gc_collect () at src/gc/proxy.d:165
#6 0x00007ffff717fb79 in core.memory.GC.collect() () at src/core/memory.d:168
#7 0x00007ffff6a07449 in runTests () at plugin.d:24
#8 0x0000000000400a6b in main (argc=1, argv=0x7fffffffdcc8) at host.c:31
Reposting machine details:
$uname -a
Linux 3.8.0-26-generic #38-Ubuntu SMP Mon Jun 17 21:43:33 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ldd --version
ldd (Ubuntu EGLIBC 2.17-0ubuntu5.1) 2.17
Comment #6 by stanislav.blinov — 2014-02-08T18:15:46Z
Another run; this time I replaced GC.collect() with thread_suspendAll()/thread_resumeAll() in plugin.d:
--- plugin.d 2014-02-09 01:52:48.944214965 +0400
+++ plugin_.d 2014-02-09 06:11:44.677437500 +0400
@@ -21,7 +21,9 @@
         assert(tdtor >= 0);
         // test some runtime functionality
         launchThread();
-        GC.collect();
+        //GC.collect();
+        thread_suspendAll();
+        thread_resumeAll();
         joinThread();
     }
     catch (Throwable)
Stack trace:
Thread 19 (Thread 0x7ffff6a05700 (LWP 16354)):
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff7675c3c in _L_lock_152 () at malloc.c:5151
#2 0x00007ffff766e81b in get_free_list () at arena.c:774
#3 0x00007ffff76735e5 in arena_get2 (avoid_arena=<optimized out>, size=<optimized out>, a_tsd=<optimized out>) at arena.c:838
#4 __GI___libc_malloc (bytes=88) at malloc.c:2856
#5 0x00007ffff7ddb2cd in allocate_and_init (map=0x603080) at dl-tls.c:529
#6 tls_get_addr_tail (ti=0x7ffff73dcb28, dtv=0x611d00, the_map=0x603080) at dl-tls.c:742
#7 0x00007ffff71816e4 in core.thread.thread_suspendHandler() (this=0x0, sp=0x7ffff6a04640) at src/core/thread.d:1069
#8 0x00007ffff718387f in core.thread.callWithStackShell() (fn=...) at src/core/thread.d:2085
#9 0x00007ffff71816bd in thread_suspendHandler (sig=10) at src/core/thread.d:411
#10 <signal handler called>
#11 get_free_list () at arena.c:776
#12 0x00007ffff76735e5 in arena_get2 (avoid_arena=<optimized out>, size=<optimized out>, a_tsd=<optimized out>) at arena.c:838
#13 __GI___libc_malloc (bytes=32) at malloc.c:2856
#14 0x00007ffff79c2bbf in pthread_getattr_np (thread_id=140737331091200, attr=0x7ffff6a04d88) at pthread_getattr_np.c:167
#15 0x00007ffff718116a in thread_entryPoint (arg=0x0) at src/core/thread.d:2715
#16 0x00007ffff79c0f8e in start_thread (arg=0x7ffff6a05700) at pthread_create.c:311
#17 0x00007ffff76eaa0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 1 (Thread 0x7ffff7fc8740 (LWP 16333)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x00007ffff7183940 in core.thread.suspend() (t=0x7ffff7ec8900) at src/core/thread.d:2244
#2 0x00007ffff7183bcd in thread_suspendAll () at src/core/thread.d:2322
#3 0x00007ffff73ee489 in runTests () at plugin.d:25
#4 0x0000000000400bc2 in main (argc=1, argv=0x7fffffffdcc8) at host.c:44
Comment #7 by safety0ff.bugz — 2014-02-08T18:24:18Z
Here's the sequence of events I suspect causes this:
3 Threads:
1. Thread 2 acquires malloc's internal lock.
2. Thread 1 causes a GC collection and therefore thread_suspendAll; Thread 2 is suspended while still holding the lock.
3. Thread 3 enters the signal handler for suspendAll, which somehow calls tls_get_addr_tail, which calls malloc. Since the suspended Thread 2 holds the lock, Thread 3 blocks inside the handler, Thread 1 never gets its suspend acknowledgement, and the process deadlocks.
Comment #8 by stanislav.blinov — 2014-02-08T18:30:16Z
(In reply to comment #7)
> Here's the sequence of events I suspect causes this:
> 3 Threads:
But there are only 2, unless I am mistaken?
Comment #9 by safety0ff.bugz — 2014-02-08T18:35:18Z
(In reply to comment #8)
> (In reply to comment #7)
> > Here's the sequence of events I suspect causes this:
> > 3 Threads:
>
> But there are only 2, unless I am mistaken?
I assumed there were other threads because the stacktraces were for Threads 1, 4 and 19.
On a separate note: Async-safe TLS access is scheduled for glibc 2.19 (upcoming release.)
Comment #10 by stanislav.blinov — 2014-02-08T18:40:06Z
(In reply to comment #9)
> I assumed there were other threads because the stacktraces were for Threads 1,
> 4 and 19.
There are only 2: the main thread and the one spawned by launchThread(). Looks like the spawned one locks itself up when calling malloc inside the suspend handler. What does it allocate there?
> On a separate note: Async-safe TLS access is scheduled for glibc 2.19 (upcoming
> release.)
Comment #11 by stanislav.blinov — 2014-02-08T19:09:56Z
Oh, I see. SIGUSR1 is being sent to a thread while it's still being constructed.
Either getStackBottom() or rt.tlsgc.init() calls malloc(). When SIGUSR1 is received, the handler calls Thread.getThis(), and since that TLS reference has not yet been allocated for this thread, it calls malloc() again and locks up.
Comment #12 by stanislav.blinov — 2014-02-08T20:23:11Z
So the failing sequence boils down to this:
Main:                        | t:
                             |
// t.start():                |
                             |
slock.lock();                |
m_isRunning = true;          |
pthread_create();            |
// at this point,            |
// t.isRunning == true       |
add(this);                   |
// at this point,            |
// t is subject to           |
// thread_suspendAll         |
slock.unlock();              |
                             |
// thread_suspendAll():      |
slock.lock();                |
                             |
// suspend(t):               | // thread_entryPoint():
pthread_kill();              | getStackBottom();
                             | // or rt.tlsgc.init()
                             | // or (maybe?) Thread.setThis()
At this point, t *may* be inside malloc(). It catches SIGUSR1 and thread_suspendHandler() is called. Eventually the handler gets to Thread.getThis(), which calls malloc() again. If t was already inside malloc(), it deadlocks.
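One conceivable way to close that window (just a sketch under my own assumptions, not an actual druntime patch; runStartupShielded is a made-up helper) would be to block the suspend signal while the allocating startup code runs:

import core.sys.posix.signal;

// Block druntime's suspend signal (SIGUSR1) around startup code that may
// call malloc(), so the suspend handler cannot interrupt it mid-lock; a
// pending SIGUSR1 is then delivered when the old mask is restored.
void runStartupShielded(scope void delegate() startup)
{
    sigset_t blockSet, oldSet;
    sigemptyset(&blockSet);
    sigaddset(&blockSet, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &blockSet, &oldSet);
    startup(); // e.g. getStackBottom(), rt.tlsgc.init(), Thread.setThis()
    pthread_sigmask(SIG_SETMASK, &oldSet, null);
}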
Comment #13 by stanislav.blinov — 2014-02-08T22:37:21Z
Comment #14 by safety0ff.bugz — 2014-02-08T23:25:08Z
(In reply to comment #13)
> Not pretty, but possible solution:
The thread start-up code looks very race-prone; I hadn't looked at it before.
Comment #15 by code — 2014-02-09T02:27:31Z
(In reply to comment #14)
> The thread start-up code looks very race-prone; I hadn't looked at it before.
Yeah, it's pretty subtle. I'd really like to get rid of the Fiber context stack, for example, but that's not trivial. One reason is that we're using suspend/resume on OSX and Windows but signals on Posix.
> Async-safe TLS access is scheduled for glibc 2.19 (upcoming release.)
As I wrote on the pull request, a correct fix would be to let Thread.sm_this use pthread_getspecific again.
https://github.com/D-Programming-Language/druntime/pull/718#discussion_r9568348
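Roughly, the shape of that approach (my sketch of it here, not the actual patch):

import core.sys.posix.pthread;
import core.thread : Thread;

// Keep the per-thread Thread reference in a pthread key instead of D TLS.
// pthread_getspecific never allocates, so the suspend handler can look up
// the current Thread without touching malloc.
__gshared pthread_key_t sm_thisKey; // stand-in for Thread.sm_this

shared static this()
{
    pthread_key_create(&sm_thisKey, null);
}

void setThis(Thread t) // hypothetical counterpart of Thread.setThis
{
    pthread_setspecific(sm_thisKey, cast(void*) t);
}

Thread getThis() // hypothetical counterpart of Thread.getThis
{
    return cast(Thread) pthread_getspecific(sm_thisKey);
}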
Comment #16 by safety0ff.bugz — 2014-02-09T03:13:27Z
(In reply to comment #15)
> > Async-safe TLS access is scheduled for glibc 2.19 (upcoming release.)
> As I wrote on the pull request, a correct fix would be to let Thread.sm_this
> use pthread_getspecific again.
> https://github.com/D-Programming-Language/druntime/pull/718#discussion_r9568348
Looks like they've delayed async-safe TLS until a later release.
I'm not familiar with pthread_getspecific, but as usual I'm learning as I go to try to help with druntime issues.
Comment #17 by stanislav.blinov — 2014-02-09T07:03:20Z
I'm starting to wonder if it's really posix-specific. The issue only manifests when using shared libraries (second malloc call is made from dl-tls.c).
Can Windows DLL TLS handling be susceptible to this too? Is it currently possible to replicate that unittest for Windows?
Comment #18 by code — 2014-02-09T10:00:29Z
(In reply to comment #17)
> I'm starting to wonder if it's really posix-specific. The issue only manifests
> when using shared libraries (second malloc call is made from dl-tls.c).
>
Yep, because the TLS segment for dlopen'ed shared libraries is allocated lazily, while the TLS segments of the executable and of shared libraries linked at load time are allocated eagerly.
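As a concrete illustration (simplified; the module name is hypothetical):

// TLS of the executable and of load-time-linked libraries lives in the
// static TLS area reserved at thread creation, so the first access
// allocates nothing. TLS of a dlopen'ed library is allocated per thread
// on first access: __tls_get_addr -> allocate_and_init -> malloc().
module plugin; // assume: compiled into plugin.so, loaded with dlopen()

int pluginState; // thread-local in D; first access per thread may malloc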
> Can Windows DLL TLS handling be susceptible to this too? Is it currently
> possible to replicate that unittest for Windows?
The same problem cannot happen on OSX or Windows, because we're using suspend APIs there, i.e. no signal handler code that does the TLS access is run.
Comment #19 by github-bugzilla — 2014-02-10T15:16:28Z
I'll go ahead and mark this resolved. Given its relative rarity, it's hard to state conclusively that it's fixed, but based on the change description it certainly sounds like it might be.
That said, we need to be MUCH more careful about what code is executed in the context of signal handlers, as only a very narrow set of APIs is valid there. I'll bet that we're technically violating the list in a number of ways and are just getting away with it most of the time.
Comment #21 by code — 2014-02-17T13:35:31Z
*** Issue 11630 has been marked as a duplicate of this issue. ***
Comment #22 by sean — 2014-10-24T17:57:37Z
I'm unclear why you're calling thread_suspendAll/thread_resumeAll in those examples. These routines are for the GC to coordinate collections and have very strict requirements imposed on their use, fairly similar to those for code within signal handlers. When all threads are suspended, you can't do anything that might block (such as call malloc) or the app could deadlock. Was this simply for illustrative purposes?
Upon further reading, it seems like the problem was that the folks working on dynamic library support weren't aware of this limitation. Yes, the thread-GC interface is very race-prone, and heavily commented as a result. If someone could come up with an alternative to using signals for coordinating collections on linux, that would be awesome.
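For reference, the usage contract might look like this in code (a sketch only; stopTheWorld is a hypothetical helper, not druntime's implementation):

import core.thread : thread_suspendAll, thread_resumeAll;

// Between suspend and resume, run only code that cannot block: no I/O,
// no malloc, no GC allocation, no locks a suspended thread might hold.
void stopTheWorld(scope void delegate() nothrow @nogc work)
{
    thread_suspendAll();
    scope (exit) thread_resumeAll();
    work(); // e.g. scan thread stacks and registers
}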
Comment #23 by github-bugzilla — 2015-01-26T07:11:51Z