Bug 13348 – std.uni.Grapheme is impure due to using C malloc and friends

Status
RESOLVED
Resolution
FIXED
Severity
blocker
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86_64
OS
Linux
Creation time
2014-08-21T04:46:44Z
Last change time
2018-01-05T13:31:21Z
Assigned to
No Owner
Creator
hsteoh
Blocks
7054

Comments

Comment #0 by hsteoh — 2014-08-21T04:46:44Z
std.uni.byGrapheme is impure, making it impossible to implement correct iteration over graphemes in pure code. Blocks: issue #7054.
Comment #1 by hsteoh — 2014-08-21T04:52:17Z
Ditto for graphemeStride.
Comment #2 by hsteoh — 2014-08-28T22:37:31Z
Upgrade to blocker since the fix for issue #7054 is blocked by this.
Comment #3 by astrothayne — 2016-06-02T06:15:41Z
I don't think byGrapheme can be pure in general, because, in general, the input range used doesn't necessarily have pure input methods. For example the input range could read from I/O. The solution would probably be to add pure if and only if the front, popFront, and empty methods of the inputRange are themselves pure. I'm not sure what the best way to do that would be.
Comment #4 by hsteoh — 2016-06-02T15:13:37Z
It's not a problem that byGrapheme is sometimes impure, but it should not be the one introducing the impurity. I.e., if the incoming range is pure, then byGrapheme ought to be also pure. This can be enforced by not attributing byGrapheme directly, but using a pure nothrow etc. unittest to ensure any impure/throwing/etc. code actually comes from the incoming range, not from byGrapheme itself.
Comment #5 by schveiguy — 2016-06-02T15:48:43Z
Something doesn't seem right here. byGrapheme is a template function, including a templated struct. purity should be inferred everywhere here, no? The code itself seems pretty simple, seems like all the impurity comes from Grapheme and possibly decodeGrapheme.
Comment #6 by schveiguy — 2016-06-02T16:05:06Z
OK, so pretty much all the problems stem from malloc/realloc/free being used to manage the memory of the Grapheme type. There is one obvious fix: isRegionalIndicator can be marked pure nothrow. we could move the allocation scheme to use the GC instead of C malloc. But I don't know what this does for performance. As the comments say, most uses will never use the allocation. CC Dmitry.
Comment #7 by dmitry.olsh — 2016-06-02T17:36:24Z
> OK, so pretty much all the problems stem from malloc/realloc/free being used to manage the memory of the Grapheme type. Should I just use GC?
Comment #8 by hsteoh — 2016-06-02T17:44:01Z
Here's a thought, I don't know how feasible it is, but what about using slices of the input string, and decode on demand? Of course, this doesn't work for general input ranges (have to allocate somehow in that case), but conceivably it could be done for ranges with hasSlicing? Or, for forward ranges, use .save. Obviously, this will depend on whether .save allocates, but then it's the user's problem, not ours. The advantage is that we don't have to allocate at all, at least in some cases. The disadvantage is that you'll need to decode again every time grapheme members are accessed. It may cause poor performance.
Comment #9 by schveiguy — 2016-06-02T18:16:04Z
(In reply to Dmitry Olshansky from comment #7) > Should I just use GC? At this point, if we were to fix purity, I think this is the way to go. From your comments in the code, you said that most of the time allocations will not happen. I'm sure this varies with the language being processed, but it's probably mostly true. One other nice benefit about using the GC is you can remove the requirement to copy the current grapheme if it's on the heap (and make that immutable also, so Grapheme can be passed to another thread if the wrapped range allows it), and copying the struct itself would be less expensive. If we were to wait for pure-compatible reference counting to materialize, this would be the ideal mechanism. (In reply to hsteoh from comment #8) > Here's a thought, I don't know how feasible it is, but what about using > slices of the input string, and decode on demand? Not a big fan of this idea. It puts extra demand on the wrapped range type (must be sliceable or at least forward range), and might make things slower in the case where the stack-allocated part of the union would be sufficient.
Comment #10 by greensunny12 — 2016-06-06T01:12:43Z
>> Should I just use GC? > At this point, if we were to fix purity, I think this is the way to go. From your comments in the code, you said that most of the time allocations will not happen. I'm sure this varies with the language being processed, but it's probably mostly true. How about doing a simple benchmark with a multi-lingual text and see whether there's any performance difference?
Comment #11 by dmitry.olsh — 2016-06-06T18:37:45Z
> At this point, if we were to fix purity, I think this is the way to go. From your comments in the code, you said that most of the time allocations will not happen. I'm sure this varies with the language being processed, but it's probably mostly true. Problem is there is @nogc crowd and then there is pure @safe crowd. I can't satisfy both.
Comment #12 by schveiguy — 2016-06-06T18:47:35Z
(In reply to Dmitry Olshansky from comment #11) > > At this point, if we were to fix purity, I think this is the way to go. From your comments in the code, you said that most of the time allocations will not happen. I'm sure this varies with the language being processed, but it's probably mostly true. > > Problem is there is @nogc crowd and then there is pure @safe crowd. I can't > satisfy both. Haha! good point. Until we have allocator being configurable, we can't solve both. I say leave it the way it is until we have that. @nogc is currently way more important than purity I think, and breaking existing @nogc code by making the default all of a sudden use gc, would be disruptive.
Comment #13 by github-bugzilla — 2017-09-07T09:15:05Z
Commits pushed to stable at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/187e2b769b7ab1e2509d54e31cddc007b423fc8a fix issue 13348 https://github.com/dlang/phobos/commit/fde471f8f21182b875bfbe0a6fdcef58bfa77c78 Merge pull request #5723 from DmitryOlshansky/fix-issue-13348 Fix issue 13348 - byGrapheme is not pure merged-on-behalf-of: unknown
Comment #14 by github-bugzilla — 2017-10-01T16:36:47Z
Comment #15 by github-bugzilla — 2018-01-05T13:31:21Z