I've been chasing this on and off for a couple months now. Basically, trying to access the RefCounted.refCountedStore.isInitialized of a non-initialized RefCounted in a field of a temporary will create a stack overflow. I know that's not clear, but here is the reduced usecase:
//----
import std.container, std.stdio, std.typecons, std.exception;
struct S
{
RefCounted!int _data;
this(int)
{_data.refCountedStore.ensureInitialized();}
int get() @property
{
writeln("here");
enforce(_data.refCountedStore.isInitialized); //OH NOES!!!
writeln("there");
return _data.refCountedPayload;
}
}
void main()
{
// 1)
writeln(S(1).get);
// 2)
S s;
writeln(s.get).collectException();
// 3)
writeln(S().get);
}
//----
1) This will create a temporary S, and intialize the ref counted. the writeln works.
2) The creates a non-temporary S. Trying to access the ref counted will (correctly) throw an exception.
3) This will stack overflow at the "//OH NOES!!!" line: It will first call:
ref inout(RefCountedStore) refCountedStore() inout
To get the store, and then will recursively call "isInitialized" until the program stack overflows. I have no idea why:
//----
@property nothrow @safe
bool isInitialized() const
{
return _store !is null;
}
//----
This seems to me like the tip of a more serious bug somewhere. I would be very pleased if someone with more knowledge than me could try to look into it?
I think it might also create problems with things such as arrays of arrays: Every time I've tried to fix http://d.puremagic.com/issues/show_bug.cgi?id=6153 I've had crashes (NOT asserts/enforeces), and I think this might be the reason.
Originally found with this code:
//----
void main()
{
writeln(Array!int()[0]);
}
//----
Yes, the code is wrong, but it should *assert*. Currently, it just dies.
Comment #1 by maxim — 2013-02-01T08:11:50Z
The situation is more complicated.
import std.container, std.stdio, std.typecons, std.exception;
struct S
{
RefCounted!int _data;
this(int)
{_data.refCountedStore.ensureInitialized();}
int get() @property
{
writeln("here");
enforce(_data.refCountedStore.isInitialized); //OH NOES!!! //13
writeln("there");
return _data.refCountedPayload;
}
}
void main()
{
version (A) {
writeln(S(1).get);
}
version (B) {
S s;
writeln(s.get).collectException();
}
version (C) {
writeln(S().get);
}
}
When compiling with version A or B, everything is fine. Version C fails enforcement on line 13. Both A and C throws failed enforcement. Both B and C segfault in Refcounted dtor. Tested on linux64 git head.
Comment #2 by maxim — 2013-02-01T08:25:53Z
Oh, it depends on compiler switches. Reduced from previous by combining B and C:
import std.container, std.stdio, std.typecons, std.exception;
struct S
{
RefCounted!int _data;
int get() @property
{
writeln("here");
enforce(_data.refCountedStore.isInitialized); //OH NOES!!!
writeln("there");
return _data.refCountedPayload;
}
}
void main()
{
S s;
writeln(s.get).collectException();
writeln(S().get);
}
When compiling with -g, it prints "here", "here" and enforcement failure, without -g it segfaults as like above. Since valgrind does not detect for -g version memory errors, the issue may be dmd codegen bug.
Comment #3 by monarchdodra — 2013-02-01T09:09:48Z
I wanted to try to get RefCounted out of the way, and I was able to get a massively reduced test case. That said, the result is mind bogglingly wtf...
FYI: win32 on win7.64
//----
import std.stdio, std.exception;
//S is merely a struct with a pointer. And a destructor.
struct S
{
int* _payload = null;
~this()
{
writeln("~this: ", _payload);
if (!_payload) return;
writeln("Should not be here...");
}
}
void foo(int* p)
{
throw new Exception("bla");
}
void main()
{
int* p = S()._payload;
writeln("wait for it...");
foo(S()._payload);
}
//----
~this: null
wait for it...
~this: FFFF05E8
Should not be here...
[email protected](17): bla
//----
Apparently, we create a temporary, an exception is thrown, the temporary gets corrupted, and then things start getting awry in the destructor.
I'm getting this also for as far back as 2.55 (didn't try anything earlier), so it doesn't seem to be a regression.
In any case, exceptions silently corrupting stack temporaries? That's a critical in my book.
Who was it again that said our destructors weren't very well tested?
Comment #4 by monarchdodra — 2013-02-01T09:17:05Z
(In reply to comment #3)
> I wanted to try to get RefCounted out of the way, and I was able to get a
> massively reduced test case. That said, the result is mind bogglingly wtf...
Just to add, I just tried compiling with different flags, including -O -release -debug -g:
The scenario occurs in all cases, except when the flag "-g" is set, in which case things work correctly.
Comment #5 by maxim — 2013-02-01T09:18:14Z
(In reply to comment #3)
> I wanted to try to get RefCounted out of the way, and I was able to get a
> massively reduced test case. That said, the result is mind bogglingly wtf...
>
> FYI: win32 on win7.64
>
> //----
Cannot reproduce on linux 64 githead.
Comment #6 by maxim — 2013-02-01T10:07:03Z
(In reply to comment #3)
> I wanted to try to get RefCounted out of the way, and I was able to get a
> massively reduced test case. That said, the result is mind bogglingly wtf...
>
> FYI: win32 on win7.64
>
Can reproduce. And -g fixes the program.
Comment #7 by maxim — 2013-02-01T12:56:04Z
Reduced for linux
struct RefCounted
{
void *p;
~this()
{
p = null;
}
}
struct S
{
RefCounted _data;
int get() @property
{
throw new Exception("");
}
}
void main()
{
S s;
S().get;
}
With -g it throws, without -g segfaults.
Comment #8 by maxim — 2013-02-01T13:24:49Z
It really seems to be codegen bug. The problem is that presence of code like in main function (struct temporary + simple stack struct) makes dmd generate wrong exception handler table.
If you compile main.d one version with -release -O -noboundcheck and other version with the same switches and additionally with -g, you will have absolutely identical asm (obj2asm output) except the single difference is in data segment.
In segfaulting version you have
.data segment
_HandlerTable0:
db 050h,000h,000h,000h,063h,000h,000h,000h ;P...c...
db 002h,000h,000h,000h,000h,000h,000h,000h ;........
db 019h,000h,000h,000h,048h,000h,000h,000h ;....H...
db 0ffffffffh,0ffffffffh,0ffffffffh,0ffffffffh,000h,000h,000h,000h ;........
db 057h,000h,000h,000h,000h,000h,000h,000h ;W.......
db 02bh,000h,000h,000h,037h,000h,000h,000h ;+...7...
db 000h,000h,000h,000h,000h,000h,000h,000h ;........
db 042h,000h,000h,000h,000h,000h,000h,000h ;B....... // 42h
and in throwing version you will have
_HandlerTable0:
db 050h,000h,000h,000h,063h,000h,000h,000h ;P...c...
db 002h,000h,000h,000h,000h,000h,000h,000h ;........
db 019h,000h,000h,000h,048h,000h,000h,000h ;....H...
db 0ffffffffh,0ffffffffh,0ffffffffh,0ffffffffh,000h,000h,000h,000h ;........
db 057h,000h,000h,000h,000h,000h,000h,000h ;W.......
db 02bh,000h,000h,000h,037h,000h,000h,000h ;+...7...
db 000h,000h,000h,000h,000h,000h,000h,000h ;........
db 03eh,000h,000h,000h,000h,000h,000h,000h ;>....... //3eh
If you patch incorrect binary, the bug goes away.
Corrupted handler table leads to following problem (asm snippet from main):
0x0000000000418888 <+60>: jmp <_Dmain+72>
0x000000000041888a <+62>: lea -0x10(%rbp),%rdi //3Eh
0x000000000041888e <+66>: callq <_D4main1S11__fieldDtorMFZv> //42h
0x0000000000418893 <+71>: retq
0x0000000000418894 <+72>: sub $0x8,%rsp
0x0000000000418898 <+76>: callq 0x4188a3 <_Dmain+87>
0x000000000041889d <+81>: add $0x8,%rsp
0x00000000004188a1 <+85>: jmp 0x4188ad <_Dmain+97>
0x00000000004188a3 <+87>: lea -0x18(%rbp),%rdi
0x00000000004188a7 <+91>: callq 0x418810 <_D4main1S11__fieldDtorMFZv>
0x00000000004188ac <+96>: retq
0x00000000004188ad <+97>: xor %eax,%eax
0x00000000004188af <+99>: pop %r15
In segfaulting version druntime unwinds up to _Dmain+66, after instruction which sets into %rdi this reference, hence dtor receives corrupted pointer. In correct version druntime unwinds up to _Dmain+62, so the this pointer is correct.
Comment #9 by monarchdodra — 2013-02-03T09:26:40Z
(In reply to comment #8)
> It really seems to be codegen bug.
Thankyou for investigating this. As I said, it really is out of my league. Do you know what the next step is for fixing this? Who we should forward it to?
We should really get this fixed. Stack corruption when an exception is thrown? Nothing good can come out of this.
Comment #10 by maxim — 2013-02-04T01:11:11Z
(In reply to comment #9)
> (In reply to comment #8)
> > It really seems to be codegen bug.
>
> Thankyou for investigating this. As I said, it really is out of my league. Do
> you know what the next step is for fixing this? Who we should forward it to?
This does not work. Until somebody who knows dmd source faces the issue, the bug will not be fixed simply because there are too much problems and too few people.
> We should really get this fixed. Stack corruption when an exception is thrown?
> Nothing good can come out of this.