← Back to index | Original Bugzilla link

Bug 13936 – groupBy must be redone

Status: RESOLVED
Resolution: FIXED
Severity: normal
Priority: P1
Component: dmd
Product: D
Version: D2
Platform: x86
OS: Mac OS X
Creation time: 2015-01-05T07:51:00Z
Last change time: 2015-08-21T01:14:41Z
Keywords: pull
Assigned to: hsteoh
Creator: andrei

Comments

Comment #0 by andrei — 2015-01-05T07:51:18Z

Comment #1 by andrei — 2015-01-05T07:54:53Z

Sorry, pressed Enter too soon. The implementation of groupBy in https://github.com/D-Programming-Language/phobos/pull/2654 must be redone from the ground up. The main use case for groupBy is to iterate each element in each group and do something. That pattern must be supported with no or minimum overhead. Currently, groupBy iterates the same range twice - once inside each group, and once again to find the next group in the range of ranges. One detail - this: in { import core.exception : RangeError; if (r.empty) throw new RangeError(); } seems superfluous seeing as r itself has whatever protection mechanism in its own implementation. Just defer to it; add an assert() at the maximum.

Comment #2 by andrei — 2015-01-05T07:58:34Z

With regards to the equivalence relation etc. - there's just too much fuss about it. Define it simple and robust: there's one quick moving range "fast" and one slow moving range "slow". Initially "slow" and "fast" are in the same position. Then move "fast" forward without moving "slow" until !pred(slow.front, fast.front). At that point fast-forward slow to fast and continue. Only "fast" needs to call popFront. For now let the caller worry about transitory vs. non transitory ranges or equivalence vs. non-equivalence predicates.

Comment #3 by bearophile_hugs — 2015-01-05T11:09:05Z

Very nice comments, Andrei.

Comment #4 by hsteoh — 2015-01-05T21:22:02Z

@andrei: The current implementation of groupBy only traverses the range once if it's an input range (non-forward). The difference between equivalence and non-equivalence relations is actually important; it's not a "fussy" issue. A non-equivalence relation requires that you always evaluate the predicate on two adjacent elements, because you cannot assume transitivity: the fact that pred(r[0], r[1]) and pred(r[1], r[2]) holds, does not imply pred(r[0], r[2]) also holds, so you have to .save every previous position of the range instead of just once per subrange (determining the end of the subrange simply by evaluating the predicate with the head element of that subrange), when you know that pred is an equivalence relation. Your fast/slow range idea is flawed, because the outer range's .front must return an independent instance of the subrange. This is an artifact of the range API. There is no way for the outer range to keep track of how far the subrange has moved ahead without introducing some kind of reference semantics to it, which becomes messy if the caller invokes .front multiple times. For example, if the user calls .front twice and .save's both subranges, then randomly iterates one or the other, what should happen? Should the outer range's 'fast' range track the first subrange, the second, or both, or neither (because they are .save'd copies)? Does advancing the first copy of the subrange also advance the second? (Which, btw, breaks the semantics of .save)? The only way for your fast/slow range idea to work is to make subranges non-forward, so that there is only one copy of the 'fast' range that gets advanced no matter which copy of it you call popFront on. But this means users can no longer call .save on subranges, which sux if they handed you a forward range precisely for that purpose.

Comment #5 by andrei — 2015-01-06T02:19:21Z

(In reply to hsteoh from comment #4) > @andrei: Thanks for answering. OK, let's see. > The current implementation of groupBy only traverses the range once if it's > an input range (non-forward). Yes - couldn't even do otherwise. Input non-forward ranges move along all together (that's somewhat implied but let's assume it's always the case). So groupBy relies on the fact that if r1 and r2 are copies of the same input/non-forward range, calling popFront in one also pops the front in the other. That's efficient and all. The challenge here is to make sure the same goes for forward ranges. The simplest way to achieve that is by using a pointer, i.e. instead of using `Range`, use `Range*`. Then copying the pointer around will refer to the same actual range. Calling popFront through one will also pop the other. But that produces garbage so a RefCounted!Range comes to mind. That would probably be a nice solution. In my opinion what we have right now penalizes forward ranges too much to be workable. > The difference between equivalence and non-equivalence relations is actually > important; it's not a "fussy" issue. A non-equivalence relation requires > that you always evaluate the predicate on two adjacent elements, because you > cannot assume transitivity: the fact that pred(r[0], r[1]) and pred(r[1], > r[2]) holds, does not imply pred(r[0], r[2]) also holds, so you have to > .save every previous position of the range instead of just once per subrange > (determining the end of the subrange simply by evaluating the predicate with > the head element of that subrange), when you know that pred is an > equivalence relation. We must go with efficiency for the common classic relational case. I think a distinct place we do not want to be in is: "Well for cool relational stuff you can use groupBy, it's very general and actually goes outside what relational algebra can do. However, if you want speed don't use it; you gotta use manual loops." We must handle the typical relational algebra treatment, and handle it well enough that handmade approaches are unnecessary. If someone has a weird predicate and a weird requirement they can use adjacentFind or handmade loops etc. We don't need to cater for them. Pushing two ranges along is inefficient and therefore unacceptable. > Your fast/slow range idea is flawed, because the outer range's .front must > return an independent instance of the subrange. This is an artifact of the > range API. There is no way for the outer range to keep track of how far the > subrange has moved ahead without introducing some kind of reference > semantics to it, which becomes messy if the caller invokes .front multiple > times. So there is no way, or there's a messy way? Big difference. I say there is a way. > For example, if the user calls .front twice and .save's both > subranges, then randomly iterates one or the other, what should happen? > Should the outer range's 'fast' range track the first subrange, the second, > or both, or neither (because they are .save'd copies)? Does advancing the > first copy of the subrange also advance the second? (Which, btw, breaks the > semantics of .save)? > > The only way for your fast/slow range idea to work is to make subranges > non-forward, so that there is only one copy of the 'fast' range that gets > advanced no matter which copy of it you call popFront on. But this means > users can no longer call .save on subranges, which sux if they handed you a > forward range precisely for that purpose. This can be handled, it's indeed not simple. There is an implementation in https://github.com/D-Programming-Language/phobos/pull/1186 that is working but a bit oddly and is probably more complicated than it needs to. Anyhow, the larger point is we want groupBy to work as close to native speed for the common case of going once through all groups and with an equivalence relation. For more complex uses, Just Fine(tm) at a small extra cost should be okay. Is this reasonable?

Comment #6 by andrei — 2015-01-08T01:21:10Z

ping pliz pliz :o)

Comment #7 by hsteoh — 2015-01-08T03:48:18Z

I don't understand your beef against isEquivRelation. It's there precisely to give users the choice of having maximal expressive power at a slight performance cost, or being restricted to only equivalence relations but have better performance. As for only iterating things once, a thought occurred to me today that instead of creating a new subrange every time the outer .front is called, we can simply keep *both* inner and outer ranges in the same struct, and .front simply returns a reference to the inner range. Multiple calls to .front returns (a reference to) the same inner range, so popping any of them will pop all of them at the same time. The inner range is then only valid until the next call to popFront() -- i.e., groupBy will return a transient range. This will give you maximum performance. However, it forces the inner range to be non-forward -- there is no way to .save it without making a copy of both the outer and inner ranges, and once you .save it, it detaches from the outer range and no longer influences it when popFront is called. There is no way to make .save work without completely breaking the idea of single traversal, because if I use .save to make n copies of an inner range, then which copy's popFront will influence the outer range? And what happens if I call popFront on the outer range and then go back to a .save'd copy of the inner range afterwards? There are endless nasty corner cases that make this extremely messy. So there you have it: a fast solution. But it comes at the cost of messy semantics (transient outer range, inner ranges have reference semantics and cannot be .save'd). The transience of the outer range is quite a major concern IMO, because many algorithms in Phobos do not deal with them correctly. Expect all the same problems with byLine to turn up all over again. Honestly, I'd rather make this single-pass implementation a *different* function (or use a different compile-time argument) that is opt-in rather than opt-out. D's motto after all is "correct by default, fast if you ask for it". Naïve use cases of groupBy should return the "safest" behaviour -- i.e., the current behaviour where nothing is transient and .save has the usual semantics. If people find that the performance is too slow, recommend groupByFast (or equivalent), with the caveat that they will have to deal with the more unconventional semantics in their code.

Comment #8 by andrei — 2015-01-08T08:07:13Z

Great going, thx. Let's beat this into shape. (In reply to hsteoh from comment #7) > I don't understand your beef against isEquivRelation. It's there precisely > to give users the choice of having maximal expressive power at a slight > performance cost, or being restricted to only equivalence relations but have > better performance. I'd say offering non-equivalence relations would be desirable only if (a) the default is to assume the predicate is an equivalence relation, and (b) implementing it doesn't take focus away from the matter at hand, which is an efficient groupBy with equivalence relations against input and forward ranges. > As for only iterating things once, a thought occurred to me today that > instead of creating a new subrange every time the outer .front is called, we > can simply keep *both* inner and outer ranges in the same struct, and .front > simply returns a reference to the inner range. Multiple calls to .front > returns (a reference to) the same inner range, so popping any of them will > pop all of them at the same time. The inner range is then only valid until > the next call to popFront() -- i.e., groupBy will return a transient range. > This will give you maximum performance. Yah, but it's not sufficient as you note below. > However, it forces the inner range to be non-forward -- there is no way to > .save it without making a copy of both the outer and inner ranges, and once > you .save it, it detaches from the outer range and no longer influences it > when popFront is called. There is no way to make .save work without > completely breaking the idea of single traversal, because if I use .save to > make n copies of an inner range, then which copy's popFront will influence > the outer range? And what happens if I call popFront on the outer range and > then go back to a .save'd copy of the inner range afterwards? There are > endless nasty corner cases that make this extremely messy. > > So there you have it: a fast solution. But it comes at the cost of messy > semantics (transient outer range, inner ranges have reference semantics and > cannot be .save'd). The transience of the outer range is quite a major > concern IMO, because many algorithms in Phobos do not deal with them > correctly. Expect all the same problems with byLine to turn up all over > again. I agree that's not a good solution. We shouldn't code it. > Honestly, I'd rather make this single-pass implementation a *different* > function (or use a different compile-time argument) that is opt-in rather > than opt-out. D's motto after all is "correct by default, fast if you ask > for it". Naïve use cases of groupBy should return the "safest" behaviour -- > i.e., the current behaviour where nothing is transient and .save has the > usual semantics. If people find that the performance is too slow, recommend > groupByFast (or equivalent), with the caveat that they will have to deal > with the more unconventional semantics in their code. "It can't be done satisfactorily well" is in a way a given - it's the baseline we're starting from. It's also not terribly interesting because it precludes continuing to look into the matter. The trick is to actually find a way to do it, and here's where I'm relying on your wits. I think it can be done in a satisfactory manner. There needs to be communication between individual groups and the mothership (range of ranges). Did you take a look at that pull request of mine? In there, each group has a serial number, and the mothership also has a serial number. That way it's easy to figure which group the mothership is currently in. Once a group reaches its end, it "calls home" (the mothership) and checks whether the serial number of the mothership is the same as the serial number of the group. If so, great - it offers the mothership the O(1) answer for the mothership's next popFront. If not, it means the mothership has already moved forward so there's nothing to do. I feel I'm not explaining this very well. Basically the proper iteration can be achieved by establishing a simple protocol between the range of ranges and each individual group. Does this make sense?

Comment #9 by hsteoh — 2015-01-09T00:21:06Z

The nice thing about assuming non-equivalence relation by default is that it's the most general: it will produce correct behaviour in all use cases with no further work required from the user. In my initial implementation of groupBy, it assumed equivalence relation by default, and led to buggy behaviour like non-terminating empty ranges or "wrong" groupings because either transitivity or reflexivity failed to hold, but the code assumed they did. My initial reaction was to define it away -- state that only equivalence relations are supported and close the bug. But upon further thought, I decided that supporting non-equivalence relations opens up more possibilities: you can now do things like grouping by maximum adjacent difference, for example, which can be useful as a kind of edge detector in a stream of data, but maximum adjacent difference is not an equivalence relation due to non-transitivity. It makes groupBy useful beyond the usual mundane equivalence relation based predicates. In any case, this is a (mostly) orthogonal issue to performance. The problem I see with your "call home" proposal is that it introduces a lot of hidden complexities that either may offset the performance gain, or introduces subtle semantic corner cases that are ugly to deal with. First of all, this would require the subranges to retain a reference to the outer range. There are several problems with this: (1) the outer range can no longer be a by-value type, because if it's a struct, structs are liable to being moved around, invalidating references to it; (2) so we have to allocate on the heap, which is already a minus given the current push for reducing Phobos allocations. (3) Which in turn means GC load, unless we introduce reference counting, which adds additional complication (plus overhead to maintain the ref count). (4) The subranges may need extra care to implement -- if they are a by-value struct type, the struct may get copied around, leaving stray references to the outer range, so either you need GC to clean up, or you need to tread into dangerous territory of struct dtors (which leads into areas with many compiler bugs and/or limitations, and fragile behaviour -- a lot of Phobos code doesn't work well with structs that have dtors because they don't have nice semantics for things like "T tmp; return tmp;" to get an empty range of the same type, for example). (5) This *still* doesn't fully solve the performance issue -- if I have a 1000-element subrange and I iterate over 900 of them and then give up and decide to move on to the next subrange, the parent range will still repeat the 1000 popFront's from the beginning of the subrange because the subrange never called home. (6) Your call-home trick falls flat on its face if somebody calls .save on the outer range. Which copy of the outer range should the subranges call home to? Using the single-subrange idea I had, does solve (5), but then it introduces poor API because the subranges can no longer be forward ranges. (In retrospect this is not surprising, since we *are* trying to optimize for the single-pass use case, so input-only ranges would be the natural solution. It's wanting *both* single-pass performance *plus* the full semantics of range-of-ranges that makes things complicated.) One idea that occurs to me is to store the subrange as a struct inside the parent range (so it never needs to call home, the parent range just picks up on where it left off and continues scanning for the next subrange), and have .save just return an independent copy of the subrange. However, this isn't a foolproof solution either, since a common idiom with forward ranges is to always work with a .save'd copy of a range, so then the user never actually uses the embedded copy of the subrange and any popFront()'s called on the .save'd copy doesn't benefit the outer range. One possible solution I can think of, that still requires the extra complexity of heap-allocating the outer range, but does solve (5): one of the possibly multiple copies of the subranges is designated as the "furthest subrange", and gets the special privilege of having a reference to it from the outer range. Each copy of the subrange "competes" for this position by keeping a pop-counter that is bumped every time popFront is called. If the pop-counter exceeds the pop-counter of the current "furthest subrange", the range that went further installs itself as the new "furthest subrange" in the outer range, replacing the old champion. (The old champion can snatch the position back if it later on has more popFront's called.) When popFront on the outer range is called, it .save's the current furthest subrange and scans for the next subrange boundary. It also bumps the serial number, which indicates to all previous subranges that they've all been demoted and can no longer compete for the furthest subrange trophy. This ensures that the outer range always does the minimum necessary work to find the next subrange. However, this requires extra overhead in popFront for each subrange (a pointer dereference, a comparison, and potentially one or more pointer assignments). At the end of the day, this may turn out making performance worse, because the optimizer is hampered by the aliasing caused by all these references and probably won't be able to generate optimal code or elide all this extra work that isn't needed in the single-pass use case. So sure, we save on the big-O characteristics (O(n) where n=length of range, instead of O(n*r) where r=number of subranges)), but it may not actually *run* faster in practice compared to a plain old for-loop. And (6) is still not solved, which I surmise is a pretty major showstopper. I still think that if you want maximal performance for the single-pass case, you should just return an input range -- and make this something chosen by the user. We can have groupBy() return a fully-general forward range, and have groupByFast() (or groupBy!(Yes.singlePass)) implement a truly optimal single-pass algorithm where .save is not implemented, thus side-stepping the complexity and reducing to basically a manually-written for-loop except it has a range API. Well actually, you *can* implement .save for the outer range if you use the embedded-subrange idea I proposed: the same struct holds both the outer range and the inner range, so .save can just copy the struct and now you have sane semantics on both copies. You just can't .save the inner range, that's all. (Well actually, you *can*... by .save'ing the outer range, since the inner range has reference semantics so .front will resume where the inner range left off. You just can't save the inner range without also saving the outer range.) In retrospect, this seems to confirm my observation that basically the current state of things is an artifact of subranges having the full (forward) range API. If the outer range only allowed a "current subrange element" cursor/iterator/what-have-you, we wouldn't have this problem, since it'd be clear that there is only a single outer range coupled with a single subrange at any time, so you don't have the many-to-many problem where somebody .save's both outer and inner ranges and now you don't know how to have them talk to each other in a sane way. By making the inner iteration a full-fledged range, all of this baggage gets inherited. Even in my proposed single-subrange idea, if you restrict the inner range to be input-only, the problem is still somewhat tractable. But once you allow .save in the inner range, all hell breaks loose. The thing is, if you knew ahead of time how the user is going to traverse the range, you can optimize for that, but since we can't know that ahead of time, either we need the user to tell us (I want full .save semantics on everything, so give me GC outer/inner ranges; or, I want single-pass only, so make everything a non-copyable input range, I don't care), or we have to cater to the general case, which unfortunately means poor performance.

Comment #10 by andrei — 2015-01-09T00:42:50Z

I'd say we reached the point where some code would help a lot more than additional discourse. A few thoughts: 1. I think we shouldn't care about non-equivalence relations. Most tasks people do are relational-inspired operations using equivalence relations. All APIs I know of only support equivalence relations. It would take a lot of evidence to default the other way. For the time being I am explicitly opposed to supporting non-relational predicates if that complicates anything in any way. 2. Yes I am thinking reference counting would be do the trick. One allocation and deallocation per big range use should be amortized nicely. 3. I am somewhat immune to arguments based on "it's complicated" or "it runs into compiler bugs". Sorry. It seems to me this is exactly the kinds of stuff we need to explore more. 4. Random-access ranges should be a lot easier to handle, so I guess a specialization would be in order. 5. I need to think about .save on the outer range, thanks. Overall I think we shouldn't continue to find reasons for which it can't be done, but instead explore and see how it can be done. As of now I consider groupBy in it current form unacceptable and a blocker for the next release.

Comment #11 by bearophile_hugs — 2015-01-09T00:47:32Z

(In reply to Andrei Alexandrescu from comment #10) > 1. I think we shouldn't care about non-equivalence relations. Most tasks > people do are relational-inspired operations using equivalence relations. > All APIs I know of only support equivalence relations. It would take a lot > of evidence to default the other way. For the time being I am explicitly > opposed to supporting non-relational predicates if that complicates anything > in any way. This was discussed in the pull request. I suggest you to take a look at those discussions. You can't ignore those discussions.

Comment #12 by hsteoh — 2015-01-09T01:01:14Z

I don't see what's the big problem with supporting non-equivalence relations, all it means is that you have to evaluate the predicate only between adjacent elements rather than between an element and the head of its subrange (which was what was done in the initial implementation). But whatever, I don't actually have any current code that relies on groupBy supporting non-equivalence relations, so I don't really care. I just don't understand why this has suddenly become such a major issue. By "it's complicated" I do not mean "it's complicated to write code for" -- that's a non-issue. What I mean is "it introduces complicated semantics, which inevitably results in obscure corner cases that have unexpected/pathological behaviour". This includes effects of similar calibre as transient ranges (they technically conform 100% to the range API but have unusual behaviour that cannot be easily dealt with). One such effect is the semantics of .save. The compiler bugs bit, I admit, is a bit a of red herring, though. :-P Don't forget that random-access ranges are still forward ranges, so you still have to deal with the semantics of .save. And I wasn't trying to find reasons it can't be done; I was just pointing out obvious flaws in your proposed solution that would introduce more problems than it solves.

Comment #13 by andrei — 2015-01-09T01:06:45Z

It goes without saying I read through the comments. Your examples are nice but are unlikely to yield a bulk of the use cases. For non-equivalence relations, I think groupByAdjacent is a fine notion and name (especially since we have findAdjacent already). Would that make sense?

Comment #14 by andrei — 2015-01-09T01:08:23Z

(In reply to hsteoh from comment #12) > By "it's complicated" I do not mean "it's complicated to write code for" -- > that's a non-issue. What I mean is "it introduces complicated semantics, > which inevitably results in obscure corner cases that have > unexpected/pathological behaviour". This includes effects of similar calibre > as transient ranges (they technically conform 100% to the range API but have > unusual behaviour that cannot be easily dealt with). One such effect is the > semantics of .save. Again: let's explore and if the semantic complications are too big we're not nailed to the floor. I'll come up with code during the weekend.

Comment #15 by hsteoh — 2015-01-09T01:16:08Z

I'm OK with breaking up the current somewhat messy code to make a separate groupByAdjacent for the non-equivalence relation implementation. In fact, I had actually implemented that, but reverted it because people objected. *shrug*

Comment #16 by public — 2015-01-09T11:23:45Z

I will need some time to read (and understand :)) all comments here and being on vacation does not really help. Can any action wait one more week please? ^_^

Comment #17 by andrei — 2015-01-10T20:05:49Z

I recreated GroupBy as I envision it from scratch in http://dpaste.dzfl.pl/93a13ee08cc1. So: * Each group has a group counter that "stamps" it to identify it from the other groups. * Within a group iteration is fast: there's no additional indirection and one iteration evaluates one empty, one predicate, and one popBack. * At the end of each group, the next group for the mothership is primed. That way the repeated O(groupLength) iteration is saved. * Each mothership allocates memory for one payload, which will be then freed when the mothership and all of its groups go away. * Copying one mothership will work but groups originating from one given mothership only help that particular mothership. The code is only a proof of concept - it doesn't account for input vs. forward ranges, has scant testing etc. Does my vision make sense?

Comment #18 by public — 2015-01-11T13:38:08Z

Some quick notes from a very quick look: > I think we shouldn't care about non-equivalence relations This is something I simply can't accept. groupBy is not a specialized linear algebra thing - it is a quite common utility that will be widely used and widely misused. In fact the reason why non-equivalence implementation has become default in second PR was exactly that some of first attempts of bearophile to experiment with predicate quickly triggered the infinite empty range behavior. I don't thing it is an acceptable for default implementation at all. Some of proposed implementation ideas make lot of sense but I feel very strong about two points: 1) non-equivalence relation support must be default groupBy behavior 2) more efficient versions should be enabled by CT argument as opposed to new function name - it is very hard to invent short names that make sense here (it can still forward to separate private implementation)

Comment #19 by public — 2015-01-11T13:38:48Z

See also original bug report https://issues.dlang.org/show_bug.cgi?id=13595

Comment #20 by andrei — 2015-01-11T17:15:37Z

(In reply to Dicebot from comment #18) > Some quick notes from a very quick look: > > > I think we shouldn't care about non-equivalence relations > > This is something I simply can't accept. groupBy is not a specialized linear > algebra thing - it is a quite common utility that will be widely used and > widely misused. groupByAdjacent should take care of that.

Comment #21 by public — 2015-01-11T17:23:59Z

And it is not good exactly for a reason mentioned above - seeing function named groupBy one would never keep looking for something named like groupByAdjacent. Resulting in broken app and frustration. Going other way around is as trivial as mentioning "use this flag for pure equivalence relation" in docs as those who know linear algebra will immediately recognize it (and even of they don't, program will still be correct, just slower)

Comment #22 by andrei — 2015-01-11T17:25:49Z

(In reply to Dicebot from comment #21) > And it is not good exactly for a reason mentioned above - seeing function > named groupBy one would never keep looking for something named like > groupByAdjacent. Resulting in broken app and frustration. > > Going other way around is as trivial as mentioning "use this flag for pure > equivalence relation" in docs as those who know linear algebra will > immediately recognize it (and even of they don't, program will still be > correct, just slower) groupBy evokes the linear algebra groupBy. groupByAdjacent is special purpose. We can't hurt everybody for the benefit of very few.

Comment #23 by andrei — 2015-01-11T17:26:21Z

s/linear/relational/

Comment #24 by public — 2015-01-11T17:30:36Z

If you consider those who know/care about math "everyone" and everyone else "few" you are of much better opinion of people than I am. Also performance penalty is not even remotely as harmful as infinite loop /broken logic so this argument goes against your proposal. Anyway, this is up to you to decide. You have been asking for opinion - here it is.

Comment #25 by andrei — 2015-01-12T18:34:54Z

ping @hsteoh

Comment #26 by hsteoh — 2015-01-12T19:48:08Z

For some reason I couldn't access the dpaste yesterday. But anyway, it works now. The first thing I note, is that this still does not address the situation where the group is incompletely iterated, then the user decides to pop the outer range. In that case you duplicate almost all of the effort of popping the entire group again. The choice of implementation also imposes an artificial limitation on group length to uint.max, and would insert an artificial group boundary at that point even if predicate() still holds true. This is unlikely to ever happen, but still, it would be nice to have some insurance for it in documentation and in throwing an exception should this case get triggered (but unfortunately that breaks nothrow, which could be a show-stopper). Or use size_t, which defers the problem to the more unlikely distant future. A similar thing happens with wrapped-around group serial numbers, but that's a far less likely case to cause problems 'cos the user would have to keep a reference to a group around for uint.max groups, which realistically won't happen unless he was deliberately trying to shoot his own foot. Now on to the implementation: - What on earth is going on with GroupBy.groupStart and GroupBy.groupCurrent?! Shouldn't there just be a single reference to the underlying range, which gets advanced to groupNext once the current Group is finished? Currently this code looks really suspect because .groupStart and .groupCurrent seem to be redundant with each other, yet .empty checks .groupCurrent whereas .front copies .groupStart. - GroupBy.front passes the current range by value which Group stores by value. This will cause broken behaviour if the range has reference semantics: Group.popFront will cause subsequent calls to GroupBy.front to return a truncated group. Previously-saved Group's will also break once the mothership advances to the next group, because they use groupStart to evaluate the predicate, which would have changed if the underlying range has reference semantics. Basically, .save should be used everywhere except where you can justify not using it, otherwise you will almost certainly get broken boundary cases. - OTOH, if the range has very short groups, then most of the performance will be bottlenecked on all those .save's every time .front is accessed. We should probably profile this thing, with at least one test case with very long groups, and one test case with very short (1-2 element) groups. And of course, test it to death. - Group doesn't implement .save? Shouldn't that be possible with this design? - Group.this assumes groupStart is non-empty. Should assert in case future changes break GroupBy.front. There are probably more issues, but I haven't looked closely enough yet. :-P I think the design seems OK (barring the issue with incompletely-iterated Groups, but I'm not sure if fixing that wouldn't introduce too much additional complexity), but the implementation is atrocious! :-P Plus, it totally doesn't work with input ranges, as you noted. In fact, input ranges totally don't need refcounting or any of that serial number fanciness -- those would just be useless overhead. I'd even propose that for input ranges we should just use my current implementation. :-D I'm almost ready to bet that performance is better in that case. They could just be side-by-side overloads.

Comment #27 by hsteoh — 2015-01-12T19:58:59Z

Anyway, another design occurred to me. For ranges with slicing, we could scan Groups in advance and just return slices from .front. The predicate only needs to be evaluated once per element pair. This can be extended to non-sliceable ranges by precomputing the length of each Group in advance: then Group.popFront won't need to evaluate the predicate again (just decrement the length), whereas now if you iterate over 10 copies of each Group, each one will have to evaluate the predicate each time.

Comment #28 by hsteoh — 2015-01-12T20:03:44Z

Another note: your current implementation looks like it could be easily extended to handle non-equivalence predicates. All you need to do is to evaluate the predicate on adjacent elements vs. with the head of the group (IOW, just advance groupStart each time in Group.popFront). I doubt it would cause too much performance hit.

Comment #29 by andrei — 2015-01-12T21:23:17Z

(In reply to hsteoh from comment #26) > The first thing I note, is that this still does not address the situation > where the group is incompletely iterated, then the user decides to pop the > outer range. In that case you duplicate almost all of the effort of popping > the entire group again. It's trivial - just add the adjustment in the Group destructor as well. > The choice of implementation also imposes an artificial limitation on group > length to uint.max, and would insert an artificial group boundary at that > point even if predicate() still holds true. This is unlikely to ever happen, > but still, it would be nice to have some insurance for it in documentation > and in throwing an exception should this case get triggered (but > unfortunately that breaks nothrow, which could be a show-stopper). Or use > size_t, which defers the problem to the more unlikely distant future. Come on, that's nitpicking. Make that size_t or ulong. Implementation may use an extra bool, too. It's not a golden standard, it's a proof of concept! > A similar thing happens with wrapped-around group serial numbers, but that's > a far less likely case to cause problems 'cos the user would have to keep a > reference to a group around for uint.max groups, which realistically won't > happen unless he was deliberately trying to shoot his own foot. > > Now on to the implementation: > > - What on earth is going on with GroupBy.groupStart and > GroupBy.groupCurrent?! Shouldn't there just be a single reference to the > underlying range, which gets advanced to groupNext once the current Group is > finished? Currently this code looks really suspect because .groupStart and > .groupCurrent seem to be redundant with each other, yet .empty checks > .groupCurrent whereas .front copies .groupStart. groupStart is there only for its front. Calling front doesn't use groupStart. > - GroupBy.front passes the current range by value which Group stores by > value. This will cause broken behaviour if the range has reference > semantics: Group.popFront will cause subsequent calls to GroupBy.front to > return a truncated group. Previously-saved Group's will also break once the > mothership advances to the next group, because they use groupStart to > evaluate the predicate, which would have changed if the underlying range has > reference semantics. Basically, .save should be used everywhere except where > you can justify not using it, otherwise you will almost certainly get broken > boundary cases. The implementation is intended for forward ranges, which are the hard case here. If there are other issues, could you give an example? > - OTOH, if the range has very short groups, then most of the performance > will be bottlenecked on all those .save's every time .front is accessed. We > should probably profile this thing, with at least one test case with very > long groups, and one test case with very short (1-2 element) groups. And of > course, test it to death. That would be great indeed. > - Group doesn't implement .save? Shouldn't that be possible with this design? It should. > - Group.this assumes groupStart is non-empty. Should assert in case future > changes break GroupBy.front. That seems like a safe assumption to me. Of course assert is appropriate. > There are probably more issues, but I haven't looked closely enough yet. :-P I'm surprised. It's a very short codebase - 128 lines all told and uses no trick and no subterfuge. All straight code. > I think the design seems OK (barring the issue with incompletely-iterated > Groups, but I'm not sure if fixing that wouldn't introduce too much > additional complexity), but the implementation is atrocious! :-P I find that hard to agree with, especially since you had predicted dire difficulties and contortions several times in your long posts dedicated to explaining how it can't be done. Frankly after 128 lines of code destroy assertions that took several walls of text, I was shooting for "Oh okay, I see what you mean." > Plus, it > totally doesn't work with input ranges, as you noted. Input non-forward ranges are the easy case. > In fact, input ranges > totally don't need refcounting or any of that serial number fanciness -- > those would just be useless overhead. I'd even propose that for input ranges > we should just use my current implementation. :-D I'm almost ready to bet > that performance is better in that case. They could just be side-by-side > overloads. That was my intent as well.

Comment #30 by andrei — 2015-01-12T21:24:18Z

(In reply to hsteoh from comment #27) > Anyway, another design occurred to me. For ranges with slicing, we could > scan Groups in advance and just return slices from .front. The predicate > only needs to be evaluated once per element pair. This can be extended to > non-sliceable ranges by precomputing the length of each Group in advance: > then Group.popFront won't need to evaluate the predicate again (just > decrement the length), whereas now if you iterate over 10 copies of each > Group, each one will have to evaluate the predicate each time. I think that's a slightly worse design. Lazy is as lazy does.

Comment #31 by andrei — 2015-01-12T21:25:19Z

(In reply to hsteoh from comment #28) > Another note: your current implementation looks like it could be easily > extended to handle non-equivalence predicates. All you need to do is to > evaluate the predicate on adjacent elements vs. with the head of the group > (IOW, just advance groupStart each time in Group.popFront). I doubt it would > cause too much performance hit. That would popFront two ranges instead of one. I still very strongly suggest we define groupBy and groupByAdjancent.

Comment #32 by hsteoh — 2015-01-12T21:35:42Z

(In reply to Andrei Alexandrescu from comment #31) > (In reply to hsteoh from comment #28) > > Another note: your current implementation looks like it could be easily > > extended to handle non-equivalence predicates. All you need to do is to > > evaluate the predicate on adjacent elements vs. with the head of the group > > (IOW, just advance groupStart each time in Group.popFront). I doubt it would > > cause too much performance hit. > > That would popFront two ranges instead of one. I still very strongly suggest > we define groupBy and groupByAdjancent. No it won't. You just assign groupNext.save to groupCurrent, then call groupNext.popFront.

Comment #33 by andrei — 2015-01-12T21:39:55Z

(In reply to hsteoh from comment #32) > (In reply to Andrei Alexandrescu from comment #31) > > (In reply to hsteoh from comment #28) > > > Another note: your current implementation looks like it could be easily > > > extended to handle non-equivalence predicates. All you need to do is to > > > evaluate the predicate on adjacent elements vs. with the head of the group > > > (IOW, just advance groupStart each time in Group.popFront). I doubt it would > > > cause too much performance hit. > > > > That would popFront two ranges instead of one. I still very strongly suggest > > we define groupBy and groupByAdjancent. > > No it won't. You just assign groupNext.save to groupCurrent, then call > groupNext.popFront. Then I must be misunderstanding something.

Comment #34 by andrei — 2015-01-14T20:15:14Z

reping @hstehoh jiust because I see him active in the forum and we need to get this and aggregate() done before the next release. @hsteoh please let me know if you want to work on this or I should. Thanks!

Comment #35 by hsteoh — 2015-01-14T20:27:33Z

My impression was that you wanted to take this over because you were less than impressed with my implementation of groupBy? Or did I misunderstand your intentions?

Comment #36 by andrei — 2015-01-14T21:02:49Z

(In reply to hsteoh from comment #35) > My impression was that you wanted to take this over because you were less > than impressed with my implementation of groupBy? Or did I misunderstand > your intentions? Oh, sorry I misunderstood. I appreciate your work, and I'm not saying this perfunctorily; this important facility has been missing for years until you brought it about, and from a working implementation no less. To put this into perspective, the existing groups() completely sucks. That said the long and short of it is it's obvious to me that http://dpaste.dzfl.pl/93a13ee08cc1 is the right design; I don't see how the extant implementation or the variants discussed herein have any notable advantage over it, but they do have a bunch of disadvantages. I was hoping I'd convince you to take it to implementation for forward (and better) ranges whilst keeping your implementation for input ranges. There's also the matter of groupByAdjacent. Please advise, thanks.

Comment #37 by hsteoh — 2015-01-14T21:16:36Z

As far as groupByAdjacent is concerned, I am still of the opinion that it belongs as part of groupBy. But since you apparently regard groupBy as release-critical, perhaps it may be wiser to defer this issue to later -- we can always extend groupBy later to encompass the additional functionality, even if it involves unhappy users filing bugs about why groupBy can't handle certain kinds of predicates. Whereas if we release with support for both now, it will be hard to retract one later (at the very least it will involve the dreaded deprecation cycle and consequent unhappy users filing regressions about their predicates breaking). So perhaps for this release we should only include groupBy with support only for equivalence relations, and then we'll have more time to fight it out over whether groupBy should be extended or groupByAdjacent should be introduced. :-P But, that aside, since you have already written most of the code, I may just copy-n-paste it and fix some of the most egregious flaws in it before submitting it. :-P No predicting how the reviewers will react, though.

Comment #38 by andrei — 2015-01-14T21:30:06Z

(In reply to hsteoh from comment #37) > As far as groupByAdjacent is concerned, I am still of the opinion that it > belongs as part of groupBy. But since you apparently regard groupBy as > release-critical, perhaps it may be wiser to defer this issue to later -- we > can always extend groupBy later to encompass the additional functionality, > even if it involves unhappy users filing bugs about why groupBy can't handle > certain kinds of predicates. Whereas if we release with support for both > now, it will be hard to retract one later (at the very least it will involve > the dreaded deprecation cycle and consequent unhappy users filing > regressions about their predicates breaking). So perhaps for this release we > should only include groupBy with support only for equivalence relations, and > then we'll have more time to fight it out over whether groupBy should be > extended or groupByAdjacent should be introduced. :-P > > But, that aside, since you have already written most of the code, I may just > copy-n-paste it and fix some of the most egregious flaws in it before > submitting it. :-P No predicting how the reviewers will react, though. Sounds like a plan. Thanks much! I'll be looking for your PR. -- Andrei

Comment #39 by hsteoh — 2015-01-15T01:16:32Z

I've run into some roadblocks: RefCounted is @system which makes groupBy @system. This can be mostly swept under the carpet by making the GroupBy ctor @trusted, but there's a further problem: RefCounted is also (inherently) impure, which makes this design unusable in pure code. :-(

Comment #40 by andrei — 2015-01-15T01:32:19Z

RefCounted can be made observably pure (by means of casts) but making it safe must wait for the scoped work. Could you please file a bug against RefCounted?

Comment #41 by hsteoh — 2015-01-15T01:34:56Z

RefCounted is also not nothrow.

Comment #42 by hsteoh — 2015-01-15T01:38:00Z

Filed: https://issues.dlang.org/show_bug.cgi?id=13983

Comment #43 by andrei — 2015-01-15T01:40:50Z

Wouldn't nothrow be deduced?

Comment #44 by hsteoh — 2015-01-15T01:46:57Z

I didn't look further into it, but the current groupBy code was inferred nothrow, but the new code was inferred as throwing, so the nothrow unittests refused to compile. The compiler error indicated that it was the RefCounted ctor/dtor that broke nothrow.

Comment #45 by andrei — 2015-01-15T01:52:42Z

Guess that counts for a new bug or an addition to the one you added. Thanks!

Comment #46 by bearophile_hugs — 2015-01-15T01:53:47Z

(In reply to Andrei Alexandrescu from comment #40) > RefCounted can be made observably pure (by means of casts) Sounds like a "trusted pure" could be useful.

Comment #47 by hsteoh — 2015-01-16T23:20:37Z

https://github.com/D-Programming-Language/phobos/pull/2878

Comment #48 by andrei — 2015-03-22T01:20:32Z

Thanks everybody for their work! Is this done now?

Comment #49 by hsteoh — 2015-06-30T15:25:16Z

Sounds like this is done now. Should we resolve this bug?

Comment #50 by hsteoh — 2015-08-21T01:14:41Z

Since nobody objects, I'm going to resolve this now.