Bug 6791 – std.algorithm.splitter random indexes utf strings
Status
RESOLVED
Resolution
FIXED
Severity
normal
Priority
P2
Component
phobos
Product
D
Version
D2
Platform
Other
OS
All
Creation time
2011-10-07T22:51:00Z
Last change time
2014-02-27T04:15:52Z
Assigned to
monarchdodra
Creator
code
Comments
Comment #0 by code — 2011-10-07T22:51:09Z
Throws a UTFException.
string s = `là dove terminava quella valle`;
foreach(word; std.array.splitter(s))
    writeln(word);
---
The second UTF-8 code unit of 'à' (U+00E0, encoded as 0xC3 0xA0) is 0xA0, for which isWhite is true: read as a code point on its own, 0xA0 is U+00A0 (no-break space).
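A minimal sketch of the observation above, for anyone who wants to check it locally. The helper name unitLooksWhite is made up for illustration and is not part of Phobos; it only mimics what happens when a single char from a narrow string is handed to a predicate that expects a dchar.

```d
import std.uni : isWhite;

/// Hypothetical helper: true when the code unit at index `i`, misread as a
/// complete character, counts as whitespace.
bool unitLooksWhite(string s, size_t i)
{
    // A char widens implicitly to dchar, so the trailing byte 0xA0 is
    // treated as the code point U+00A0 (no-break space).
    return isWhite(s[i]);
}

unittest
{
    string s = "\u00E0"; // 'à'
    assert(s.length == 2);         // two UTF-8 code units
    assert(s[0] == 0xC3);
    assert(s[1] == 0xA0);
    assert(!unitLooksWhite(s, 0)); // 0xC3 read alone is U+00C3, not white
    assert(unitLooksWhite(s, 1));  // 0xA0 read alone is U+00A0, white
}
```

Splitting at index 1 cuts 'à' in half, which would explain the UTFException when the resulting slices are decoded later.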
Comment #1 by hsteoh — 2013-08-18T22:22:41Z
This is caused by struct SplitterResult in std.algorithm, which uses array slicing and array indexing and therefore passes a char (not a dchar!) to the lambda. SplitterResult appears to have multiple issues: it uses array slicing without a proper hasSlicing signature constraint, and it doesn't work for narrow strings, because indexing a narrow string returns a single code unit rather than a decoded multibyte UTF-8 sequence.
It appears to need a rewrite that uses only forward range primitives or, at least, an overload for narrow strings that properly takes multibyte characters into account.
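To make the narrow-string idea concrete, here is a rough sketch of a splitter that decodes before testing the predicate. It is an assumption-laden illustration, not the Phobos code: splitDecoded is a hypothetical name, and unlike splitter it is eager and returns an array instead of a lazy range.

```d
import std.uni : isWhite;
import std.utf : codeLength;

/// Hypothetical helper (not the Phobos implementation): splits `s` wherever
/// `pred` holds for a decoded code point.
string[] splitDecoded(alias pred)(string s)
{
    string[] parts;
    size_t start = 0;
    // foreach with a dchar loop variable decodes UTF-8; `i` is the index of
    // the first code unit of each code point, so slices stay on boundaries.
    foreach (i, dchar c; s)
    {
        if (pred(c))
        {
            parts ~= s[start .. i];
            start = i + codeLength!char(c); // skip the whole separator
        }
    }
    parts ~= s[start .. $];
    return parts;
}

unittest
{
    auto words = splitDecoded!isWhite("l\u00E0 dove terminava quella valle");
    assert(words == ["l\u00E0", "dove", "terminava", "quella", "valle"]);
}
```

Because the predicate only ever sees whole code points, the 0xA0 trailing byte of 'à' is never tested on its own. Adjacent separators produce empty slices in this sketch.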
Comment #2 by monarchdodra — 2013-08-18T23:25:05Z
(In reply to comment #1)
> This is caused by struct SplitterResult in std.algorithm using array slicing
> and array indexing to pass char (not dchar!) to the lambda. SplitterResult
> appears to have multiple issues: it uses array slicing without a proper
> signature constraint on hasSlicing, and doesn't work properly for narrow
> strings because it uses indexing which for narrow strings doesn't handle
> multibyte UTF-8 sequences properly.
>
> It appears to be wanting a rewrite that uses only forward range primitives, or
> at least, an overload for narrow strings that properly takes multibyte
> characters into account.
I submitted a correction for this about a year ago, but it ended up being too big in scope (*all* splitter flavors have bugs). It also ended up being messy due to (trying to avoid) code duplication.
It might be better to just fix things little by little though, rather than not at all.
I'll fix *just* "splitter!pred": It's the easiest to fix. We'll see where we go from there.
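For reference, a small check of the "splitter!pred" form against the string from the original report. This assumes a Phobos release that includes the fix and the std.algorithm.iteration module layout; wordsOf is a hypothetical wrapper added only so the result can be compared as an array.

```d
import std.algorithm.iteration : splitter;
import std.array : array;
import std.uni : isWhite;

/// Hypothetical wrapper: collects the lazy splitter!pred result into an array.
string[] wordsOf(string s)
{
    return s.splitter!isWhite.array;
}

unittest
{
    auto words = wordsOf("l\u00E0 dove terminava quella valle");
    assert(words == ["l\u00E0", "dove", "terminava", "quella", "valle"]);
}
```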