Bug 13595 – Extend std.algorithm.groupBy to support non-equivalence relations

Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86
OS
All
Creation time
2014-10-10T10:17:07Z
Last change time
2021-03-18T14:27:42Z
Keywords
pull
Assigned to
No Owner
Creator
bearophile_hugs

Comments

Comment #0 by bearophile_hugs — 2014-10-10T10:17:07Z
Perhaps I have not fully understood the usage of std.algorithm.groupBy. This is Haskell: Prelude> import Data.List Prelude Data.List> groupBy (\x y -> (mod (x*y) 3) == 0) [1,2,3,4,5,6,7,8,9] [[1],[2,3],[4],[5,6],[7],[8,9]] D code: void main() { import std.stdio, std.algorithm, std.array, std.range; [1,2,3,4,5,6,7,8,9] .groupBy!((x, y) => ((x * y) % 3) == 0) .take(20) .writeln; } Output of the D code: [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []] Please close down this issue if this behavour is expected and correct.
Comment #1 by hsteoh — 2014-10-24T15:10:26Z
Definitely a bug, the returned subranges should not have empty elements.
Comment #2 by hsteoh — 2014-10-24T15:21:13Z
Actually, I take that back. This is not a bug (w.r.t. the current documentation) because the predicate is expected to be an equivalence relation, but (x,y)=>((x*y)%3)==0 is not a reflexive relation. So the issue is more, should groupBy be extended to support non-equivalence relations? I have considered this before but decided against it because it would be more inefficient to implement. However, allowing non-equivalence relations between adjacent elements *does* allow for much more interesting groupings (e.g., you could implement word breaks with it), so arguably it should be supported somehow. Marking this as as enhancement request.
Comment #3 by hsteoh — 2014-11-04T19:18:48Z
https://github.com/D-Programming-Language/phobos/pull/2654 Note that the Haskell example is actually wrong, because the Haskell groupBy function also assumes that the predicate is an equivalence relation. It just so happens that Haskell's groupBy only assumes transitivity, whereas the original implementation of std.algorithm.groupBy also assumes reflexivity, which causes an infinite loop when that assumption fails to hold. In the above referenced PR, groupBy now defaults to assuming a non-equivalence relation, meaning that the predicate is evaluated for every pair of adjacent elements, so the OP's predicate (x,y) => (x*y)%3 == 0 will produce the grouping [[1], [2,3,4], [5,6,7], [8,9]], because both 2*3 and 3*4 are divisible by 3, as are 5*6 and 6*7. (Notice that Haskell's groupBy evaluates the predicate against the first element of the subrange, thus it checks 2*3 and 2*4 instead of 2*3 and 3*4 for the second subrange.)
Comment #4 by hsteoh — 2014-11-04T19:21:22Z
P.S. Due to the extra performance overhead in evaluating the predicate for every adjacent pair of elements (and also additional logic to ensure correctness), groupBy now takes a compile-time parameter that can be used to get the original performance when the predicate is an equivalence relation. This way we have the best of both worlds. :-)
Comment #5 by bearophile_hugs — 2014-11-04T23:18:34Z
(In reply to hsteoh from comment #4) > P.S. Due to the extra performance overhead in evaluating the predicate for > every adjacent pair of elements (and also additional logic to ensure > correctness), groupBy now takes a compile-time parameter that can be used to > get the original performance when the predicate is an equivalence relation. > This way we have the best of both worlds. :-) Thank you. But there's a risk of "Boolean Blindness" (http://existentialtype.wordpress.com/2011/03/15/boolean-blindness/ ), so perhaps it's better to use std.typecons.Flag instead of a bool. Regarding groupBy, I hope someday we'll also have a hashGroupBy or something similar (Issue 9842 ).
Comment #6 by github-bugzilla — 2014-12-01T02:34:13Z
Commit pushed to master at https://github.com/D-Programming-Language/phobos https://github.com/D-Programming-Language/phobos/commit/5986e740873d6af793610381a942df78ab41f137 Merge pull request #2654 from quickfur/issue13595b Issue 13595: Extend groupBy to handle non-equivalence relations.
Comment #7 by bearophile_hugs — 2014-12-01T10:50:50Z
As second try I've used groupBy to find the longest anagrams in a dictionary of words. "words.txt" is a text file that contains one different word each line. void main() { import std.stdio, std.algorithm, std.string, std.file, std.array; auto anags = "words.txt" .readText .split .schwartzSort!((string w) => w.dup.representation.sort().release) .groupBy!((w1, w2) => w1.dup.representation.sort().equal(w2.dup.representation.sort())) .map!array .array .sort!((g1, g2) => g1.length > g2.length) .release .groupBy!((g1, g2) => g1.length == g2.length) .front; writefln("%(%s\n%)", anags); } If I replace "(string w)" with "w" in the first schwartzSort I receive the errors: ...\dmd2\src\phobos\std\algorithm.d(4659,5): Error: field _prev must be initialized in constructor, because it is nested struct ...\dmd2\src\phobos\std\algorithm.d(4770,51): Error: template instance anagrams1c.main.groupBy!(__lambda2, cast(Flag)false, SortedRange!(string[], (a, b) => binaryFun!less(unaryFun!transform(a), unaryFun!transform(b)))) error instantiating anagrams1c.d(9,10): instantiated from here: groupBy!((w1, w2) => w1.dup.representation.sort().equal(w2.dup.representation.sort()), SortedRange!(string[], (a, b) => binaryFun!less(unaryFun!transform(a), unaryFun!transform(b)))) If I remove the last ".release" (at line 13) I receive the similar errors: ...\dmd2\src\phobos\std\algorithm.d(4659,5): Error: field _prev must be initialized in constructor, because it is nested struct ...\dmd2\src\phobos\std\algorithm.d(4770,51): Error: template instance anagrams1c.main.groupBy!(__lambda4, cast(Flag)false, SortedRange!(string[][], __lambda3)) error instantiating anagrams1c.d(14,10): instantiated from here: groupBy!((g1, g2) => g1.length == g2.length, SortedRange!(string[][], __lambda3)) Is this a problem of groupBy?
Comment #8 by hsteoh — 2014-12-01T15:48:44Z
It looks like a compiler bug, in both cases the type returned by schwartzSort is exactly the same, but somehow the compiler treats it differently.
Comment #9 by bearophile_hugs — 2014-12-01T16:03:40Z
(In reply to hsteoh from comment #8) > It looks like a compiler bug, in both cases the type returned by > schwartzSort is exactly the same, but somehow the compiler treats it > differently. Yes, it looks like a compiler bug worth a new different bug report. But so far it looks not easy to pinpoint. later I'll open a new bug report and I'll try to reduce the problem a little.
Comment #10 by hsteoh — 2014-12-01T16:09:42Z
Here's a somewhat reduced case: ----- void main() { import std.algorithm, std.string; auto anags = ["abc", "def"] .map!((a) => a) .groupBy!((w1, w2) => w1.dup.representation.sort().equal(w2.dup.representation.sort())) ; } ----- Changing .map!((a) => a) to .map!((string a) => a) fixes the problem.
Comment #11 by hsteoh — 2014-12-01T16:10:30Z
P.S. the groupBy predicate can also be simplified to just: .groupBy!((w1, w2) => w1 == w2)
Comment #12 by github-bugzilla — 2015-02-18T03:40:11Z
Comment #13 by dlang-bot — 2021-03-18T14:27:42Z
dlang/phobos pull request #7794 " Fix Issue 13595 - added splitWhen" was merged into master: - 8122718d96167b84e2e073191e9c63766c7d08c0 by Ate Eskola: Fix Issue 13595 - Non-amended commit just to trigger the bot as rebasing didn't work https://github.com/dlang/phobos/pull/7794