Bug 15586 – std.utf.toUTF8() segfaults when fed an invalid dchar

Status
RESOLVED
Resolution
FIXED
Severity
normal
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86_64
OS
Linux
Creation time
2016-01-20T19:38:00Z
Last change time
2016-03-19T20:22:34Z
Assigned to
nobody
Creator
thomas.bockman

Comments

Comment #0 by thomas.bockman — 2016-01-20T19:38:23Z
Repro: void main() { import std.utf : toUTF8; char[4] buf = void; auto b = toUTF8(buf, cast(dchar)0x110000); import std.stdio; writeln(b); } Cause: toUTF8() asserts that invalid dchars just *don't exist*. It then proceeds to return *nothing at all* when fed one, which violates memory safety. Because of https://issues.dlang.org/show_bug.cgi?id=15585 the compiler may optimize away attempts to fix this entirely.
Comment #1 by thomas.bockman — 2016-01-20T19:52:54Z
Comment #2 by hsteoh — 2016-01-20T20:12:03Z
I don't understand something here. The in-contract of toUTF8 asserts that the dchar must be valid... but why does the assert not trigger at runtime (even in spite of not compiling with -release). This doesn't seem like a Phobos bug; it seems to be a bug in the compiler for "optimizing" away the assert just because it thinks that dchar's cannot have invalid values.
Comment #3 by thomas.bockman — 2016-01-20T22:22:23Z
> This doesn't seem like a Phobos bug; it seems to be a bug in the compiler for "optimizing" away the assert just because it thinks that dchar's cannot have invalid values. That's the compiler bug I linked to in the OP (15585).
Comment #4 by hsteoh — 2016-01-20T22:38:53Z
Keep in mind, though, that one *could* argue that casting an arbitrary value to dchar already constitutes UB, if dchars are deemed to only contain valid Unicode codepoints. If you need to work with incoming character data of unknown validity, you're probably better off working with uint (or uint[], ubyte[], etc.) instead, and only convert to dchar (string, etc.) after explicit validation. Generally, you probably shouldn't be casting stuff unless you know what you're doing and are ready to handle the consequences when things go wrong.
Comment #5 by thomas.bockman — 2016-01-20T23:07:49Z
> Keep in mind, though, that one *could* argue that casting an arbitrary value > to dchar already constitutes UB This has already been settled. It's not: http://forum.dlang.org/post/[email protected] > Generally, you probably shouldn't be casting stuff unless you know what > you're doing and are ready to handle the consequences when things go wrong. 1) No casting is required to trigger this bug. I was just giving you a simplified test case. Here's a slightly less simple one, without any casts: import std.stdio; void main() { import std.utf : toUTF8; dchar a = dchar.max; a += 1; char[4] buf = void; auto b = toUTF8(buf, a); import std.stdio; writeln(b); } 2) "If you don't want segfaults, don't cast stuff" is an awful solution to a hard crash in @safe code. There is no good reason that casts of basic numeric or character types should cause this kind of failure.
Comment #6 by thomas.bockman — 2016-01-21T23:07:07Z
I may have reduced this one too far: https://github.com/D-Programming-Language/phobos/pull/3943#issuecomment-173381348 Arguably, the real bug is that certain other functions in Phobos call `toUTF8()` without verifying that the input they are supplying satisfies the contract. This will be a bit more work to fix, though.
Comment #7 by github-bugzilla — 2016-02-11T02:51:31Z
Commits pushed to master at https://github.com/D-Programming-Language/phobos https://github.com/D-Programming-Language/phobos/commit/2b09b8b59cc94ff23f52f3d18212c727d3e89d7b Fix Phobos issue #15586 - std.utf.toUTF8() halts on invalid dchar. https://github.com/D-Programming-Language/phobos/commit/cadc8197614dac7313ffc44dd4e81c51c17ee8f3 Merge pull request #3943 from tsbockman/dchar-crash Fix Phobos Issue 15586
Comment #8 by github-bugzilla — 2016-03-19T20:22:34Z