Comment #0 by apreobrazhensky — 2016-04-18T18:58:33Z
I have a multi-threaded application with worker threads doing memory-intensive work and the main thread cleaning up garbage periodically by calling GC.collect manually. Sometimes GC.collect causes a deadlock. I don't have a simple example, but I do have stack traces of the threads at the moment of the deadlock.
It happens with both dmd 2.071.0 and dmd 2.070.* (so it is not related to the recent GC spinlock change).
This seems like a blocker to me: if it happens when I call GC.collect manually, I suspect it could also happen during normal collections. I'm not familiar with the runtime code, but judging from the stack traces below I would expect some sort of race condition.
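An illustrative sketch of the pattern described above (hypothetical code, not the actual application):
import core.memory : GC;
import core.thread : Thread;
import core.time : seconds;

void main()
{
    // Worker threads doing memory-intensive work (allocation-heavy loop).
    foreach (i; 0 .. 8)
    {
        auto worker = new Thread({
            for (;;)
            {
                auto chunks = new ubyte[][](64);
                foreach (ref chunk; chunks)
                    chunk = new ubyte[](64 * 1024);
            }
        });
        worker.isDaemon = true;
        worker.start();
    }

    // Main thread triggering collections manually.
    for (;;)
    {
        Thread.sleep(1.seconds);
        GC.collect();
    }
}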
Configuration:
dmd 2.071.0 with -O -release -inline -boundscheck=off
Linux 3.2.0-95-generic #135-Ubuntu SMP Tue Nov 10 13:33:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
This is the main thread's stack trace:
Thread 1 (Thread 0x7ff6653bb6c0 (LWP 6857)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1 0x00000000007b3ff6 in thread_suspendAll ()
#2 0x000000000079980d in gc.gc.Gcx.fullcollect() ()
#3 0x000000000079c2b2 in gc.gc.GC.__T9runLockedS49_D2gc2gc2GC11fullCollectMFNbZ2goFNbPS2gc2gc3GcxZmTPS2gc2gc3GcxZ.runLocked() ()
#4 0x0000000000796535 in gc.gc.GC.fullCollect() ()
#5 0x000000000076091c in gc_collect ()
...application stack
This is how the stack trace looks for the threads that were suspended correctly:
Thread XX (Thread 0x7ff5c6ffd700 (LWP 6897)):
#0 0x00007ff6640e6454 in do_sigsuspend (set=0x7ff5c6ff9bc0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1 __GI___sigsuspend (set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:78
#2 0x00000000007c0401 in core.thread.thread_suspendHandler() ()
#3 0x00000000007c045c in core.thread.callWithStackShell() ()
#4 0x00000000007c038f in thread_suspendHandler ()
#5 <signal handler called>
... application stack
This is how the stack trace looks for the threads that weren't suspended:
Thread YY (Thread 0x7ff5c67fc700 (LWP 6898)):
#0 0x00007ff664d9b52d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1 0x000000000075dfde in core.thread.Thread.sleep() ()
#2 0x00000000007b46e0 in core.internal.spinlock.SpinLock.yield() ()
#3 0x00000000007b467c in core.internal.spinlock.SpinLock.lock() ()
#4 0x000000000079bc21 in gc.gc.GC.__T9runLockedS46_D2gc2gc2GC12extendNoSyncMFNbPvmmxC8TypeInfoZmS21_D2gc2gc10extendTimelS21_D2gc2gc10numExtendslTPvTmTmTxC8TypeInfoZ.runLocked() ()
#5 0x0000000000760bcc in gc_extend ()
#6 0x0000000000763c85 in _d_arraysetlengthT ()
... application stack
Thread ZZ (Thread 0x7ff566ffd700 (LWP 6918)):
#0 0x00007ff664d9b52d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1 0x000000000075dfde in core.thread.Thread.sleep() ()
#2 0x00000000007b46e0 in core.internal.spinlock.SpinLock.yield() ()
#3 0x00000000007b467c in core.internal.spinlock.SpinLock.lock() ()
#4 0x000000000079ba3c in gc.gc.GC.__T9runLockedS47_D2gc2gc2GC12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS21_D2gc2gc10mallocTimelS21_D2gc2gc10numMallocslTmTkTmTxC8TypeInfoZ.runLocked() ()
#5 0x00000000007953be in gc.gc.GC.malloc() ()
#6 0x0000000000760a04 in gc_malloc ()
#7 0x0000000000762c43 in _d_newclass ()
... application stack
Comment #1 by apreobrazhensky — 2016-04-18T22:16:25Z
I wasn't able to reproduce the issue with simpler code that uses GC operations only. I noticed that nanosleep is a syscall which should be interrupted by the GC signal, so there is probably something else involved besides the GC. I use only the standard library and have no custom signal-related code.
Comment #2 by ilyayaroshenko — 2016-04-19T09:30:34Z
+1
I had the same problems
Comment #3 by Marco.Leise — 2016-04-19T22:31:00Z
This issue has a smell of https://issues.dlang.org/show_bug.cgi?id=10351
In the absence of a repro case that works without the profiler I just kept that one open for future reference. Note how the GC hangs in thread_suspendAll() in both cases.
Comment #4 by dfj1esp02 — 2016-04-20T09:21:21Z
(In reply to Aleksei Preobrazhenskii from comment #0)
> dmd 2.071.0 with -O -release -inline -boundscheck=off
Do these flags affect the hang?
Comment #5 by dfj1esp02 — 2016-04-20T09:26:51Z
Also what about 32-bit mode?
Comment #6 by apreobrazhensky — 2016-04-20T19:59:25Z
I think I saw the same behaviour in debug builds; I will try to verify that. As for the 32-bit question, due to the nature of the program I can't test it in a 32-bit environment.
After investigating the problem a little further, I think the issue might be in the GC relying on traditional POSIX signals. One way to get such stack traces is if the suspend signal (SIGUSR1 by default) wasn't delivered, which could happen with traditional POSIX signals if they occur in quick succession, e.g. if thread_suspendAll happens while some threads are still in thread_suspendHandler (they have already handled the resume signal but haven't left the suspend handler yet).
Real-time POSIX signals (SIGRTMIN .. SIGRTMAX) have stronger delivery guarantees; I'm going to try the same code with thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1).
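A minimal sketch of that change (assuming glibc, where SIGRTMIN is resolved at runtime via __libc_current_sigrtmin, and assuming the call happens at startup before any other threads are spawned):
import core.thread : thread_setGCSignals;

// glibc's runtime value of SIGRTMIN (the C macro expands to this call).
extern (C) int __libc_current_sigrtmin() nothrow @nogc;

void main()
{
    immutable rtmin = __libc_current_sigrtmin();
    thread_setGCSignals(rtmin, rtmin + 1); // suspend signal, resume signal

    // ... spawn worker threads and run the application as usual ...
}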
Comment #7 by apreobrazhensky — 2016-04-25T18:37:06Z
I have been running tests for the past five days and I haven't seen any deadlocks since I switched the GC to real-time POSIX signals (thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1)). I would recommend changing the default signals accordingly.
Comment #8 by bugzilla — 2016-04-27T04:56:23Z
(In reply to Aleksei Preobrazhenskii from comment #7)
> I was running tests for past five days, I didn't see any deadlocks since I
> switched GC to using real-time POSIX signals (thread_setGCSignals(SIGRTMIN,
> SIGRTMIN + 1)). I would recommend to change default signals accordingly.
Since you've written the code to fix it, please write a Pull Request for it. That way you get the credit!
Comment #9 by safety0ff.bugz — 2016-04-27T23:31:05Z
Could you run strace to get a log of the signal usage?
For example:
strace -f -e signal -o signals.log command_to_run_program
Then attach the resulting signals.log to the bug report?
I don't know if it'll be useful, but it would give us something more to look through for hints.
I'm wondering if there are any other signal handler invocations in the "...application stack" part of your stack traces.
I've seen a deadlock caused by an assert firing within thread_suspendHandler, which then deadlocks on the GC lock.
(In reply to Aleksei Preobrazhenskii from comment #6)
> Like, if thread_suspendAll happens while some threads are still in the
> thread_suspendHandler (already handled resume signal, but still didn't leave
> the suspend handler).
What should happen in this case: since the signal is masked upon signal handler invocation, the new suspend signal is marked as pending and delivered once thread_suspendHandler returns and the signal is unblocked.
The suspended thread cannot receive another resume or suspend signal until after the sem_post in thread_suspendHandler.
I've mocked up the suspend/resume code and it does not deadlock in the situation you've described.
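Roughly, such a mock could look like the following (an illustrative sketch of the protocol described above, using raw pthreads and SIGUSR1/SIGUSR2; not the exact code that was tested):
import core.atomic : atomicLoad, atomicOp;
import core.stdc.stdio : printf;
import core.sys.posix.pthread : pthread_create;
import core.sys.posix.semaphore : sem_init, sem_post, sem_t, sem_wait;
import core.sys.posix.signal : pthread_kill, sigaction, sigaction_t,
    sigdelset, sigfillset, sigset_t, sigsuspend, SIGUSR1, SIGUSR2;
import core.sys.posix.sys.types : pthread_t;
import core.sys.posix.unistd : usleep;

__gshared sem_t suspendSem;  // posted by the handler once the worker is stopped
shared size_t work;          // dummy workload counter

extern (C) void suspendHandler(int) nothrow @nogc
{
    // Acknowledge the suspend, then wait for the resume signal. SIGUSR2 is
    // blocked while this handler runs (sa_mask below), so a resume that
    // arrives early stays pending until sigsuspend atomically unblocks it.
    sem_post(&suspendSem);
    sigset_t waitMask;
    sigfillset(&waitMask);
    sigdelset(&waitMask, SIGUSR2);
    sigsuspend(&waitMask);
}

extern (C) void resumeHandler(int) nothrow @nogc
{
    // Nothing to do; delivery alone wakes sigsuspend in suspendHandler.
}

extern (C) void* workerMain(void*) nothrow @nogc
{
    for (;;)
    {
        atomicOp!"+="(work, 1);
        usleep(100);
    }
}

void main()
{
    sem_init(&suspendSem, 0, 0);

    sigaction_t suspend;
    suspend.sa_handler = &suspendHandler;
    sigfillset(&suspend.sa_mask);  // block everything, incl. SIGUSR2, in the handler
    sigaction(SIGUSR1, &suspend, null);

    sigaction_t resume;
    resume.sa_handler = &resumeHandler;
    sigaction(SIGUSR2, &resume, null);

    pthread_t worker;
    pthread_create(&worker, null, &workerMain, null);

    foreach (cycle; 0 .. 100_000)
    {
        pthread_kill(worker, SIGUSR1);  // "stop the world"
        sem_wait(&suspendSem);          // wait for the acknowledgement
        pthread_kill(worker, SIGUSR2);  // resume
    }
    printf("no deadlock after 100000 suspend/resume cycles, work = %zu\n",
           atomicLoad(work));
}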
> Real-time POSIX signals (SIGRTMIN .. SIGRTMAX) have stronger delivery
> guarantees
Their queuing and ordering guarantees should be irrelevant due to synchronization and signal masks.
I don't see any other benefits of RT signals.
(In reply to Walter Bright from comment #8)
>
> Since you've written the code to fix it, please write a Pull Request for it.
> That way you get the credit!
He modified his code to use the thread_setGCSignals function:
https://dlang.org/phobos/core_thread.html#.thread_setGCSignals
P.S.: I don't mean to sound doubtful, I just want a sound explanation of the deadlock so it can be properly addressed at the cause.
Comment #10 by apreobrazhensky — 2016-04-28T01:43:01Z
(In reply to safety0ff.bugz from comment #9)
> Could you run strace to get a log of the signal usage?
I tried that before in order to catch the deadlock, but I wasn't able to reproduce it while strace was running. And, unfortunately, I no longer have the original code running in production.
> I'm wondering if there are any other signal handler invocations in the
> "...application stack" part of your stack traces.
No, there was no signal-related code in the hidden parts of the stack traces.
> I've seem a deadlock caused by an assert firing within the
> thread_suspendHandler, which deadlocks on the GC lock.
In my case that was a release build, so I assume no asserts.
> What should happen in this case is since the signal is masked upon signal
> handler invocation, the new suspend signal is marked as "pending" and run
> once thread_suspendHandler returns and the signal is unblocked.
Yeah, my reasoning was wrong. I did a quick test and saw that signals weren't delivered, but apparently I forgot that pthread_kill is asynchronous, so the signals should have coalesced in my test.
> Their queuing and ordering guarantees should be irrelevant due to
> synchronization and signal masks.
Ideally, yes, but as I said, I just changed SIGUSR1/SIGUSR2 to SIGRTMIN/SIGRTMIN + 1 and haven't seen any deadlocks for a long time, whereas I saw them pretty consistently before. So either the "irrelevant" part is wrong, or there is something else that is different and relevant (and probably not documented) about real-time signals. The other explanation is that the bug is still there and real-time signals just somehow reduced the probability of it happening.
Also, I have no other explanation for why the stack traces look like that; the simplest one is that the signal wasn't delivered.
Comment #11 by code — 2016-05-08T11:32:35Z
Having the main thread hang while waiting for semaphore posts in thread_suspendAll is a good indication that the signal was lost.
Did you have gdb attached while the signal was sent? That sometimes causes issues w/ signal delivery.
The setup sounds fairly simple (a few threads allocating classes and extending arrays), and it only needs to run for a few days, so maybe we can reproduce the problem.
Are there any other reasons for switching to real-time signals?
Which real-time signals are usually not used for other purposes?
Comment #12 by apreobrazhensky — 2016-05-09T19:32:04Z
(In reply to Martin Nowak from comment #11)
> Did you have gdb attached while the signal was send? That sometime causes
> issues w/ signal delivery.
No, I didn't. I attached gdb to investigate the deadlock, which had already happened at that point.
> Are there any other reasons for switching to real-time signals?
I read that traditional signals are internally mapped to real-time signals. If that's true, I see no reason to stick with an inferior emulated mechanism with weaker guarantees.
> Which real-time signals are usually not used for other purposes?
Basically all real-time signals in the range SIGRTMIN .. SIGRTMAX are intended for application-defined use (SIGRTMIN may vary from platform to platform, though, because of things like NPTL and LinuxThreads).
Comment #13 by apreobrazhensky — 2016-05-11T22:13:14Z
I saw a new deadlock with different symptoms today.
Stack trace of the collecting thread:
Thread XX (Thread 0x7fda6ffff700 (LWP 32383)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1 0x00000000007b4046 in thread_suspendAll ()
#2 0x00000000007998dd in gc.gc.Gcx.fullcollect() ()
#3 0x0000000000797e24 in gc.gc.Gcx.bigAlloc() ()
#4 0x000000000079bb5f in gc.gc.GC.__T9runLockedS47_D2gc2gc2GC12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS21_D2gc2gc10mallocTimelS21_D2gc2gc10numMallocslTmTkTmTxC8TypeInfoZ.runLocked() ()
#5 0x000000000079548e in gc.gc.GC.malloc() ()
#6 0x0000000000760ac7 in gc_qalloc ()
#7 0x000000000076437b in _d_arraysetlengthT ()
...application stack
Stack traces of other threads:
Thread XX (Thread 0x7fda5cff9700 (LWP 32402)):
#0 0x00007fda78927454 in do_sigsuspend (set=0x7fda5cff76c0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1 __GI___sigsuspend (set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:78
#2 0x000000000075d979 in core.thread.thread_suspendHandler() ()
#3 0x000000000075e220 in core.thread.callWithStackShell() ()
#4 0x000000000075d907 in thread_suspendHandler ()
#5 <signal handler called>
#6 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:160
#7 0x0000000000760069 in core.sync.condition.Condition.wait() ()
...application stack
All suspend signals were delivered, but it seems that the number of calls to sem_wait differed from the number of calls to sem_post (or something similar). I have no reasonable explanation for that.
It doesn't invalidate the hypothesis that RT signals helped with the original deadlock, though.
Comment #14 by safety0ff.bugz — 2016-05-12T16:50:49Z
(In reply to Aleksei Preobrazhenskii from comment #13)
>
> All suspending signals were delivered, but it seems that number of calls to
> sem_wait was different than number of calls to sem_post (or something
> similar). I have no reasonable explanation for that.
>
> It doesn't invalidate the hypothesis that RT signals helped with original
> deadlock though.
I haven't looked too closely at whether there are any races around thread termination.
My suspicion is still a low-level synchronization bug.
Have you tried a more recent kernel (3.19+) or a newer glibc?
I'm aware of this bug [1], which was supposed to affect kernels 3.14 - 3.18, but perhaps there's a pre-existing bug which affects your machine?
[1] https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
Comment #15 by lomereiter — 2016-05-20T18:05:29Z
I'm apparently running into the same problem. Here's the latest stack trace I've received from a user, very similar to the ones posted here: https://gist.github.com/rtnh/e2eab6afa7c0a37dbc96578d0f73c540
The prominent kernel bug mentioned here has already been ruled out. Another hint I have is that reportedly 'error doesn't happen on XenServer hypervisors, only on KVM' (the full discussion is taking place at https://github.com/lomereiter/sambamba/issues/189).
Comment #16 by code — 2016-08-09T16:25:52Z
(In reply to Aleksei Preobrazhenskii from comment #13)
> All suspending signals were delivered, but it seems that number of calls to
> sem_wait was different than number of calls to sem_post (or something
> similar). I have no reasonable explanation for that.
>
> It doesn't invalidate the hypothesis that RT signals helped with original
> deadlock though.
For it to be a hypothesis it must be verifiable, but since we can't explain why RT signals would help, it's not a real hypothesis. Can anyone reproduce the issue somewhat reliably?
I suspect that this issue came with the recent parallel suspend feature, https://github.com/dlang/druntime/pull/1110, which would affect dmd >= 2.070.0.
Could someone test their code with 2.069.2?
Comment #17 by ilyayaroshenko — 2016-08-09T16:38:48Z
(In reply to Martin Nowak from comment #16)
> (In reply to Aleksei Preobrazhenskii from comment #13)
> > All suspending signals were delivered, but it seems that number of calls to
> > sem_wait was different than number of calls to sem_post (or something
> > similar). I have no reasonable explanation for that.
> >
> > It doesn't invalidate the hypothesis that RT signals helped with original
> > deadlock though.
>
> To be hypothesis it must verifyable, but as we can't explain why RT signals
> would help, it's not a real hypothesis. Can anyone somewhat repeatedly
> reproduce the issue?
It is not easy to catch on a PC. The bug was found when the program was running on multiple CPUs on multiple servers over the course of a day.
> I would suspect that this issue came with the recent parallel suspend
> feature.
> https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> 2.070.0.
> Could someone test their code with 2.069.2?
Yes, the bug was found first for 2.069.
(In reply to Илья Ярошенко from comment #17)
> > https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> > 2.070.0.
> > Could someone test their code with 2.069.2?
>
> Yes, the bug was found first for 2.069.
But that change is not in 2.069.x, only in 2.070.0 and later.
Can you reproduce it somewhat reliably? That would simplify my life a lot.
Following my hypothesis, it should be fairly simple to trigger with one thread continuously looping on GC.collect() while concurrently spawning many short-lived threads, to increase the chance of triggering the race between signal delivery and the thread exiting (see the sketch below).
If real-time signals are delivered faster (before pthread_kill returns), then they might indeed avoid the race condition by pure chance.
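A rough sketch of such a stress test (an illustrative construction of the suggestion above, not a confirmed reproduction):
import core.memory : GC;
import core.thread : Thread;

void main()
{
    // One daemon thread hammers collections, i.e. stops the world continuously.
    auto collector = new Thread({
        for (;;)
            GC.collect();
    });
    collector.isDaemon = true;
    collector.start();

    // Keep spawning short-lived threads to race thread exit against
    // the delivery of the suspend signal.
    foreach (i; 0 .. 1_000_000)
    {
        auto t = new Thread({
            auto buf = new ubyte[](1024); // a small GC allocation
            buf[0] = 1;
        });
        t.start();
        t.join();
    }
}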
Comment #20 by ilyayaroshenko — 2016-08-10T10:17:01Z
I don't have access to the source code anymore :/
Comment #21 by code — 2016-08-11T10:13:07Z
Nope, that doesn't seem to be the problem.
All the thread exit code synchronizes on Thread.slock_nothrow.
It shouldn't even be possible to send a signal to an exiting thread, because threads get removed from the thread list before that, and that is synchronized around the suspend loop.
It might still be a problem with the synchronization of m_isRunning and/or thread_cleanupHandler. Did your apps by any chance use thread cancellation or pthread_exit?
Comment #22 by ilyayaroshenko — 2016-08-11T10:18:17Z
(In reply to Martin Nowak from comment #21)
> Nope, that doesn't seem to be the problem.
> All the thread exit code synchronizes on Thread.slock_nothrow.
> It shouldn't even be possible to send a signal to an exiting thread, b/c
> they get removed from the thread list before that, and that is synchronized
> around the suspend loop.
>
> Might still be a problem with the synchronization of m_isRunning and/or
> thread_cleanupHandler. Did your apps by any chance use thread cancellation
> or pthread_exit?
No, but an Exception may be thrown in a thread.
Comment #23 by code — 2016-09-22T11:04:31Z
Is anyone still experiencing this issue? I can't seem to fix it without being able to reproduce it.
Comment #24 by apreobrazhensky — 2016-09-22T19:19:23Z
(In reply to Martin Nowak from comment #23)
> Anyone still experiencing this issue? Can't seem to fix it w/o reproducing
> it.
Since I changed the signals to real-time and migrated to a recent kernel I haven't seen the issue in release builds. However, I recently tried running a profile build (unfortunately only on the old kernel) and it got stuck consistently every time. It might be related to this issue; I will try to reproduce it with simpler code when I have time.