
Bug 18844 – std.utf.decode skips valid character on invalid multibyte sequence

Status: NEW
Severity: enhancement
Priority: P4
Component: phobos
Product: D
Version: D2
Platform: x86_64
OS: Linux
Creation time: 2018-05-09T10:53:17Z
Last change time: 2024-12-01T16:33:39Z
Assigned to: No Owner
Creator: FeepingCreature

Comments

Comment #0 by default_357-line — 2018-05-09T10:53:17Z

When decoding an invalid UTF-8 string, like cast(string) [cast(ubyte) 'ä', 't'], with Yes.useReplacementDchar, std.utf.decode advances the index past the byte at which the multibyte sequence turned out to be invalid, even if that byte is in itself a valid start of a new sequence. As a result, decode advances the index to 2, so the string decodes as "�" when it should decode as "�t".
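To make the expected behaviour concrete, here is a minimal sketch of a decoder that leaves the offending byte unconsumed. decodeLenient is a hypothetical helper written for illustration, not a Phobos function and not a proposed patch. It assumes the only goal is to show where the index should stop, so it does not reject overlong forms, surrogates, or out-of-range values.

```d
import std.stdio : writeln;

/// Hypothetical helper (not part of Phobos): decodes one code point from `s`
/// starting at `index`. A malformed sequence yields U+FFFD, and the byte that
/// broke the sequence is left unconsumed.
/// Simplification: overlong forms, surrogates and values above U+10FFFF
/// are not rejected here.
dchar decodeLenient(const(char)[] s, ref size_t index)
{
    enum dchar replacement = '\uFFFD';
    immutable ubyte lead = cast(ubyte) s[index++];

    if (lead < 0x80)
        return cast(dchar) lead; // plain ASCII

    // Number of continuation bytes announced by the lead byte.
    size_t needed;
    uint code;
    if ((lead & 0xE0) == 0xC0)      { needed = 1; code = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { needed = 2; code = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { needed = 3; code = lead & 0x07; }
    else
        return replacement; // stray continuation byte or invalid lead byte

    foreach (_; 0 .. needed)
    {
        // Stop in front of a byte that is not a continuation byte, so the
        // caller can decode it as the start of the next sequence.
        if (index >= s.length || (s[index] & 0xC0) != 0x80)
            return replacement;
        code = (code << 6) | (s[index++] & 0x3F);
    }
    return cast(dchar) code;
}

void main()
{
    // 0xE4 announces a three-byte sequence, but 't' is not a continuation byte.
    auto s = cast(string) [cast(ubyte) 0xE4, cast(ubyte) 't'];
    size_t i = 0;
    dchar[] decoded;
    while (i < s.length)
        decoded ~= decodeLenient(s, i);
    writeln(decoded); // two code points: U+FFFD followed by 't'
}
```

With this sketch the first call stops at index 1, so the following 't' is decoded on the next call instead of being swallowed.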

Comment #1 by default_357-line — 2018-05-09T10:54:41Z

Repro:

    string s = cast(string) [cast(ubyte) 'ä', 't'];
    size_t i = 0;
    auto ch = decode!(UseReplacementDchar.yes, string)(s, i);
    writefln("ch = %s, i = %s, should be 1", ch, i);

Output:

    ch = �, i = 2, should be 1

Comment #2 by Ajieskola — 2023-04-26T09:40:22Z

Additional observation: the documentation is misleading with both the present and proposed behaviour. It says "If the code point is not well-formed, then a UTFException is thrown and index remains unchanged." Well, we don't throw here since we're using replacement characters, so maybe that's a hint that the part about index remaining unchanged does not apply either. On the other hand, the documentation doesn't say what happens to the index instead. At least for me, it gave the wrong impression that index wouldn't be advanced.
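A small sketch that puts the two variants side by side may help when rewording the documentation. It uses only std.utf.decode and the input from the repro. No expected index is given for the replacement variant, because that is exactly the behaviour under discussion and it may differ between Phobos versions.

```d
import std.stdio : writefln;
import std.typecons : Yes;
import std.utf : decode, UTFException;

void main()
{
    // Same input as the repro: a three-byte lead byte followed by ASCII 't'.
    auto s = cast(string) [cast(ubyte) 0xE4, cast(ubyte) 't'];

    // Default (throwing) variant: the documentation says index is left unchanged.
    size_t i = 0;
    try
    {
        decode(s, i);
    }
    catch (UTFException)
    {
        writefln("throwing variant: UTFException, index = %s", i);
    }

    // Replacement variant: the documentation does not say where index ends up.
    size_t j = 0;
    dchar ch = decode!(Yes.useReplacementDchar)(s, j);
    writefln("replacement variant: U+%04X, index = %s", cast(uint) ch, j);
}
```

Running both variants on the same input shows which part of the quoted sentence applies to which mode, so the documentation could state it for each mode separately.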

Comment #3 by robert.schadek — 2024-12-01T16:33:39Z

THIS ISSUE HAS BEEN MOVED TO GITHUB: https://github.com/dlang/phobos/issues/9757 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT.