Comment #0 by default_357-line — 2018-05-09T10:53:17Z
When decoding an invalid UTF-8 string such as cast(string) [cast(ubyte) 'ä', 't'] with Yes.useReplacementDchar, std.utf.decode advances the index past the code unit at which the multibyte sequence turned out to be invalid, even if that code unit is itself a valid start of a new sequence. Here cast(ubyte) 'ä' is 0xE4, the lead byte of a three-byte sequence, and 't' is not a continuation byte. decode nevertheless advances the index to 2, so the string decodes as "�" when it should decode as "�t".
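The behaviour the report asks for can be sketched with a small hand-written decoder. This is only an illustration under stated assumptions, not the Phobos implementation: `decodeLossy` is a hypothetical helper, and it skips the overlong, surrogate and range checks a real decoder needs. The point it shows is that a byte which fails the continuation check is left unconsumed, so the next call can decode it on its own.

```d
// Hypothetical helper, not part of Phobos: decodes one code point from `s`
// starting at `i` and substitutes U+FFFD for malformed input.
// Assumes i < s.length. Overlong forms and surrogates are NOT rejected here.
dchar decodeLossy(const(char)[] s, ref size_t i) @safe pure nothrow @nogc
{
    enum dchar replacement = '\uFFFD';
    immutable ubyte lead = cast(ubyte) s[i++];

    if (lead < 0x80)
        return cast(dchar) lead;          // ASCII

    size_t extra;                         // number of continuation bytes expected
    uint cp;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else
        return replacement;               // stray continuation or invalid lead byte

    foreach (_; 0 .. extra)
    {
        // Stop *before* a byte that is not a continuation byte, so that it
        // can be decoded as the start of the next sequence.
        if (i >= s.length || (s[i] & 0xC0) != 0x80)
            return replacement;
        cp = (cp << 6) | (s[i++] & 0x3F);
    }
    return cast(dchar) cp;
}

unittest
{
    // Same input as in the report: 0xE4 followed by 't'.
    auto s = cast(string) [cast(ubyte) 0xE4, cast(ubyte) 't'];
    size_t i = 0;
    assert(decodeLossy(s, i) == '\uFFFD' && i == 1);
    assert(decodeLossy(s, i) == 't' && i == 2);
}
```

With this policy the sample input yields U+FFFD followed by 't', which is the "�t" result described above.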
Comment #1 by default_357-line — 2018-05-09T10:54:41Z
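Repro:
import std.stdio, std.utf;

void main()
{
    // cast(ubyte) 'ä' is 0xE4, a three-byte lead byte; 't' is not a continuation byte
    string s = cast(string) [cast(ubyte) 'ä', 't'];
    size_t i = 0;
    auto ch = decode!(UseReplacementDchar.yes, string)(s, i);
    writefln("ch = %s, i = %s, should be 1", ch, i);
}
Output:
ch = �, i = 2, should be 1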
Comment #2 by Ajieskola — 2023-04-26T09:40:22Z
Additional observation: the documentation is misleading under both the present and the proposed behaviour. It says "If the code point is not well-formed, then a UTFException is thrown and index remains unchanged."
Well, we don't throw here since we're using replacement characters, so maybe that's a hint that the part about index remaining unchanged doesn't apply either. On the other hand, the documentation doesn't say what happens to the index instead. At least for me, it gave the wrong impression that index wouldn't be advanced.
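To make the two cases concrete, here is a minimal sketch contrasting the throwing mode, where the quoted sentence applies, with the replacement mode, where the documentation is silent. It only asserts that the index moves forward in replacement mode. The exact value is the subject of this report, so it is deliberately left unchecked.

```d
import std.exception : assertThrown;
import std.typecons : Yes;
import std.utf : decode, UTFException;

unittest
{
    // 0xE4 starts a three-byte sequence; 't' is not a continuation byte.
    auto s = cast(string) [cast(ubyte) 0xE4, cast(ubyte) 't'];

    // Default (throwing) mode: the documented contract applies.
    size_t i = 0;
    assertThrown!UTFException(decode(s, i));
    assert(i == 0);                  // index remains unchanged

    // Replacement mode: no exception is thrown and the index moves forward.
    size_t j = 0;
    immutable ch = decode!(Yes.useReplacementDchar)(s, j);
    assert(ch == '\uFFFD');
    assert(j > 0);                   // reported as 2; the reporter expects 1
}
```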
Comment #3 by robert.schadek — 2024-12-01T16:33:39Z