Bug 16090 – popFront generates out-of-bounds array index on corrupted utf-8 strings

Status
RESOLVED
Resolution
FIXED
Severity
normal
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86
OS
Mac OS X
Creation time
2016-05-29T05:17:00Z
Last change time
2016-10-01T11:44:54Z
Keywords
pull
Assigned to
ag0aep6g
Creator
jrdemail2000-dlang

Comments

Comment #0 by jrdemail2000-dlang — 2016-05-29T05:17:24Z
If a UTF-8 string is chopped (terminated) in the middle of a multi-byte UTF-8 character, popFront will generate an out-of-bounds array index. If compiled with -boundscheck=on, popFront throws a core.exception.RangeError. With -boundscheck=off, the behavior is undefined. With the program below, in my tests the while loop ran forever until it hit a bus error.

    void main(string[] args)
    {
        import std.stdio;
        import std.range;

        auto s = "aä";
        auto corrupted = s[0 .. $-1];
        auto n = 0;
        while (!corrupted.empty)
        {
            corrupted.popFront;
            n++;
        }
        writeln(n);
    }

In this program, the 'ä' character is a two-byte UTF-8 sequence. Dropping the last byte leaves an incomplete UTF-8 code point.

The reason this is so problematic is that string processing often involves corrupted strings, in particular strings read at run time from input sources. In the sample program above it can be said that this is a programmer error. However, if the string is read from an outside source, the program needs to be able to defend against corrupted strings.

It appears this problem arises from this code in popFront (isNarrowString), currently line 2076 in std/range/primitives.d:

    import core.bitop : bsr;
    auto msbs = 7 - bsr(~c);
    if ((msbs < 2) | (msbs > 6))
    {
        //Invalid UTF-8
        msbs = 1;
    }
    str = str[msbs .. $];

The msbs variable holds the length of the UTF-8 code point as indicated by the first byte. The 'str[msbs .. $]' expression assumes the string is long enough to hold the full code point, which is not true for a truncated sequence.

Besides being problematic for practical applications, this is inconsistent with other auto-decoding behavior. The 'front' routine will throw a std.utf.UTFException in this situation. And popFront itself handles the case of an invalid first byte differently, by simply moving past it.
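One way to avoid the out-of-bounds slice described above is to clamp the declared sequence length to the number of bytes actually remaining. The following is only a minimal sketch under that assumption, not the change that was merged into Phobos: `declaredWidth` and `safePopFront` are hypothetical helper names, and the width is derived with plain comparisons rather than the `bsr` expression quoted in the report.

```d
import std.algorithm.comparison : min;

// Hypothetical helper (not part of Phobos): sequence length announced by the
// first byte of a UTF-8 sequence. Invalid lead bytes are treated as width 1,
// mirroring the "simply move past it" behavior mentioned in the report.
size_t declaredWidth(char c) @safe pure nothrow @nogc
{
    if (c < 0x80) return 1; // ASCII
    if (c < 0xC0) return 1; // stray continuation byte: invalid, skip one byte
    if (c < 0xE0) return 2; // 110xxxxx
    if (c < 0xF0) return 3; // 1110xxxx
    if (c < 0xF8) return 4; // 11110xxx
    return 1;               // invalid lead byte
}

// Hypothetical helper (not part of Phobos): drop one sequence from the front,
// never slicing past the end even if the sequence is truncated.
void safePopFront(ref string str) @safe pure nothrow @nogc
{
    assert(str.length > 0, "cannot pop from an empty string");
    immutable width = declaredWidth(str[0]);
    str = str[min(width, str.length) .. $];
}

// Counts how many pops are needed to exhaust the string.
size_t countPops(string str) @safe pure nothrow @nogc
{
    size_t n = 0;
    while (str.length > 0)
    {
        safePopFront(str);
        ++n;
    }
    return n;
}

void main()
{
    import std.stdio : writeln;

    auto s = "aä";
    auto corrupted = s[0 .. $ - 1]; // cuts the two-byte 'ä' in half
    writeln(countPops(corrupted));
}
```

With the clamp in place, the loop from the report terminates on the truncated input instead of indexing past the end. Code that must reject malformed input rather than skip it could call `std.utf.validate` on the string first, which throws a `UTFException` for invalid UTF-8.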
Comment #1 by ag0aep6g — 2016-05-31T15:54:13Z
Comment #2 by github-bugzilla — 2016-05-31T19:43:24Z
Commits pushed to master at https://github.com/dlang/phobos

https://github.com/dlang/phobos/commit/e1af1b0b51ea9f29d4ff8076d73c03ba10bfc73c
fix issue 16090 - popFront generates out-of-bounds array index on corrupted utf-8 strings

https://github.com/dlang/phobos/commit/279ccd7c5c8cebfb21a3138aecf7f3a85444e538
Merge pull request #4387 from aG0aep6G/16090

fix issue 16090 - popFront generates out-of-bounds array index on cor…
Comment #3 by github-bugzilla — 2016-10-01T11:44:54Z
Commits pushed to stable at https://github.com/dlang/phobos

https://github.com/dlang/phobos/commit/e1af1b0b51ea9f29d4ff8076d73c03ba10bfc73c
fix issue 16090 - popFront generates out-of-bounds array index on corrupted utf-8 strings

https://github.com/dlang/phobos/commit/279ccd7c5c8cebfb21a3138aecf7f3a85444e538
Merge pull request #4387 from aG0aep6G/16090