Bug 18844 – std.utf.decode skips valid character on invalid multibyte sequence

Status
NEW
Severity
enhancement
Priority
P4
Component
phobos
Product
D
Version
D2
Platform
x86_64
OS
Linux
Creation time
2018-05-09T10:53:17Z
Last change time
2024-12-01T16:33:39Z
Assigned to
No Owner
Creator
FeepingCreature
Moved to GitHub: phobos#9757

Comments

Comment #0 by default_357-line — 2018-05-09T10:53:17Z
When decoding an invalid UTF-8 string such as cast(string) [cast(ubyte) 'ä', 't'] with Yes.useReplacementDchar, std.utf.decode advances the index past the byte at which the multibyte sequence turned out to be invalid, even when that byte is itself a valid start of a new sequence. Here cast(ubyte) 'ä' is 0xE4, the lead byte of a three-byte sequence, and 't' is not a continuation byte. decode nevertheless advances the index to 2, so the string decodes as "�" when it should decode as "�t".
Comment #1 by default_357-line — 2018-05-09T10:54:41Z
Repro:

import std.stdio : writefln;
import std.utf : decode, UseReplacementDchar;

void main()
{
    string s = cast(string) [cast(ubyte) 'ä', 't'];
    size_t i = 0;
    auto ch = decode!(UseReplacementDchar.yes, string)(s, i);
    writefln("ch = %s, i = %s, should be 1", ch, i);
}

Output: ch = �, i = 2, should be 1
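For callers who need the expected index today, one possible stopgap is to resynchronise by hand after decode returns the replacement character. The following is only a minimal sketch under assumptions: decodeResync is a hypothetical helper name (not a Phobos function), and the policy it applies (keep the lead byte plus any continuation bytes that decode already consumed, never a byte that could start a new sequence) is one reasonable choice rather than the behaviour any fix is committed to.

```d
import std.stdio : writefln;
import std.utf : decode, replacementDchar, UseReplacementDchar;

// Hypothetical helper, not part of Phobos: decode one code point with
// replacement, but never consume a byte that could start a new sequence.
dchar decodeResync(string s, ref size_t index)
{
    immutable start = index;
    immutable ch = decode!(UseReplacementDchar.yes, string)(s, index);
    if (ch == replacementDchar)
    {
        // decode may have skipped too far. Rewind to just after the first
        // byte, then re-accept only continuation bytes (10xxxxxx) that
        // decode itself had already consumed.
        immutable end = index;
        index = start + 1;
        while (index < end && (s[index] & 0xC0) == 0x80)
            ++index;
    }
    return ch;
}

void main()
{
    // 0xE4 is cast(ubyte) 'ä' (a three-byte lead byte); 0x74 is 't'.
    immutable ubyte[] bytes = [0xE4, 0x74];
    string s = cast(string) bytes;

    size_t i = 0;
    while (i < s.length)
    {
        immutable ch = decodeResync(s, i);
        writefln("ch = U+%04X, i = %s", cast(uint) ch, i);
    }
}
```

With this helper the two-byte input from the report yields U+FFFD with i = 1, followed by U+0074 with i = 2, regardless of whether decode itself stops at index 1 or 2. A validly encoded U+FFFD (EF BF BD) is unaffected, because both of its trailing bytes are continuation bytes and are re-accepted by the loop.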
Comment #2 by Ajieskola — 2023-04-26T09:40:22Z
Additional observation: the documentation is misleading under both the present and the proposed behaviour. It says "If the code point is not well-formed, then a UTFException is thrown and index remains unchanged." We don't throw here, since we're using replacement characters, so maybe that's a hint that the part about index remaining unchanged doesn't apply either. On the other hand, the documentation doesn't say what happens to the index instead. At least for me, it gave the wrong impression that index wouldn't be advanced.
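To see the two modes side by side, a small program along these lines can be used. It is a sketch only: it prints whatever the installed Phobos does rather than asserting a particular result, and the variable names are illustrative.

```d
import std.stdio : writefln;
import std.utf : decode, UseReplacementDchar, UTFException;

void main()
{
    // Same input as the report: 0xE4 (cast(ubyte) 'ä') followed by 't'.
    immutable ubyte[] bytes = [0xE4, 0x74];
    string s = cast(string) bytes;

    // Default (throwing) mode: the documentation says index is unchanged.
    size_t i = 0;
    try
    {
        auto ch = decode(s, i);
        writefln("throwing mode: no exception, ch = U+%04X, i = %s", cast(uint) ch, i);
    }
    catch (UTFException e)
    {
        writefln("throwing mode: UTFException, i = %s", i);
    }

    // Replacement mode: the documentation does not say where index ends up.
    size_t j = 0;
    auto rch = decode!(UseReplacementDchar.yes, string)(s, j);
    writefln("replacement mode: ch = U+%04X, j = %s", cast(uint) rch, j);
}
```

The first line of output shows whether the "index remains unchanged" guarantee holds when an exception is thrown; the second shows the index the replacement path actually leaves behind, which is the value this report is about.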
Comment #3 by robert.schadek — 2024-12-01T16:33:39Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/phobos/issues/9757 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB