Comment #0 by bearophile_hugs — 2014-09-01T09:14:18Z
This is a reduction of a performance problem (ketmar found it on the D.learn newsgroup). This is the D version:
- - - - - - - - - - -
import core.stdc.stdio;

int main() {
    long[int] aa;
    for (int i = 0; i < 50000; i++)
        aa[i] = i;
    long total = 0;
    while (aa.length) {
        int k = aa.byKey.front;
        long v = aa.byValue.front;
        aa.remove(k);
        total += k + v;
    }
    printf("%lld\n", total);
    return 0;
}
- - - - - - - - - - -
The C++ version:
#include <stdio.h>
#include <map>

int main() {
    std::map<int, long long> aa;
    for (int i = 0; i < 50000; i++)
        aa[i] = i;
    long long total = 0;
    while (!aa.empty()) {
        int k = aa.begin()->first;
        long long v = aa.begin()->second;
        aa.erase(aa.begin());
        total += k + v;
    }
    printf("%lld\n", total);
    return 0;
}
- - - - - - - - - - -
The run-time of the C++ version is about 0.05 seconds (the output is 2499950000), while the D code runs in about 3.8 seconds (76 times slower).
To show that the performance problem doesn't come from the associative array creation, from the removes, or from the long sum, here is a very similar version that simply avoids byKey/byValue (an idea suggested by ketmar); note that this code allocates an eager array with aa.keys. It runs in about 0.06 seconds, sufficiently similar to the C++ version:
- - - - - - - - - - -
import core.stdc.stdio;

int main() {
    long[int] aa;
    for (int i = 0; i < 50000; i++)
        aa[i] = i;
    long total = 0;
    foreach (int k; aa.keys) {
        long v = aa[k];
        aa.remove(k);
        total += k + v;
    }
    printf("%lld\n", total);
    return 0;
}
- - - - - - - - - - -
Comment #1 by bearophile_hugs — 2014-09-01T09:17:01Z
The exact problem was found by Daniel Kozak.
Comment #2 by ketmar — 2014-09-01T10:48:59Z
Created attachment 1407
simple 'first used bucket' caching for AAs
adding 'first used bucket' caching to the AA halves the execution time for the sample.
it seems we can't do better with the current AAs without hurting overall AA performance. aa.remove() hurts caching, and there is no sense in recalculating the cache on each remove.
p.s. please note that this patch is not heavily tested, it's just a quick 'hack-in'.
Comment #3 by ketmar — 2014-09-01T11:37:43Z
Created attachment 1408
simple 'first used bucket' caching for AAs with '_aaDelX' recaching
here is another patch which tries to revalidate the cache in `_aaDelX()`. this makes aa.remove() a little slower but… execution time for both samples is nearly identical now.
and for bearophile's sample here ( http://forum.dlang.org/thread/[email protected] ) we get a small improvement over the previous patch too (~0.4 seconds).
Comment #4 by ketmar — 2014-09-01T12:01:54Z
Created attachment 1409
simple 'first used bucket' caching for AAs
aaaaaand we have a WINNER!
this final patch should not hurt 'normal' AA usage (it will not update the 'first used' cache on each operation), yet it speeds up this synthetic test (~3 seconds --> ~0.03 seconds) and the code from the forum (~3.5 seconds --> 1.1 seconds).
Comment #5 by bearophile_hugs — 2014-09-01T12:18:12Z
(In reply to Ketmar Dark from comment #4)
> and code from the forum (~3.5 seconds --> 1.1 seconds).
For me the C++ version of the code runs in about 0.20 seconds, so there's still a significant performance difference.
Comment #6 by ketmar — 2014-09-01T12:35:12Z
(In reply to bearophile_hugs from comment #5)
> For me the C++ version of the code runs in about 0.20 seconds, so there's
> still a significant performance difference.
yes, the AA still needs to search for the "first item" many times; caching just makes it faster in some cases. this can't be changed without a complete rewrite of the AA (and a performance hit for other use cases).
"get first item" will always be costly for AAs, that's the way AAs work. there are a lot of 'buckets' for items, many of which are empty. to get a *known* item from the AA we can choose the right bucket with a simple '%' operation, but to find the "first available" item we must loop through the buckets until we find a used one.
my patch caches the "first used bucket" index (and tries to keep it in sync when the AA contents change), but sometimes i have to just "reset" the cache, or get/remove operations would become unnecessarily slow.
besides, a lot of adding/removing between 'byKey' calls would update the cache for no reason at all, doing unnecessary work.
i tried to amortise those bad cases by not updating the cache when there have been no calls to byKey/byValue for the given AA, and by resetting the cache on AA rehashing. this gives almost no slowdown when using AAs without iterating, and we can just call aa.rehash before adding/deleting a big amount of data to turn off caching (besides, adding a lot of data will eventually trigger a rehash anyway).
i can't think of anything better for now. trying to speed up our use case further would only lead to code bloat.
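a minimal toy sketch of the scheme described above (hypothetical names and a boolean bucket array instead of real Entry chains; this is an illustration, not the actual druntime patch):
- - - - - - - - - - -
import std.stdio;

struct ToyAA {
    bool[16] used;     // toy model: each bucket is either used or empty
    size_t firstUsed;  // cached index of the first used bucket

    size_t getFirst() {
        // scan from the cached index instead of from bucket 0
        foreach (i; firstUsed .. used.length) {
            if (used[i]) {
                firstUsed = i;
                return i;
            }
        }
        return used.length; // empty
    }

    void remove(size_t i) {
        used[i] = false;
        // revalidate only when the cached bucket itself becomes empty,
        // as the _aaDelX recaching variant does
        if (i == firstUsed) {
            while (firstUsed < used.length && !used[firstUsed])
                ++firstUsed;
        }
    }
}

void main() {
    ToyAA aa;
    aa.used[3] = aa.used[9] = true;
    writeln(aa.getFirst()); // 3
    aa.remove(3);
    writeln(aa.getFirst()); // 9 (scan resumed from index 3, not from 0)
}
- - - - - - - - - - -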
Comment #7 by blah38621 — 2014-09-01T17:44:57Z
It would be nice if this were done as a PR on github, as it would be much easier to review.
Comment #8 by ketmar — 2014-09-01T17:57:44Z
(In reply to Orvid King from comment #7)
> It would be nice if this were done as a PR on github, as it would be much
> easier to review.
sorry, each time i'm thinking about registering at github god kills one kitten.
Comment #9 by bearophile_hugs — 2014-09-02T01:00:27Z
(In reply to Ketmar Dark from comment #8)
> but to find the "first available" item we must loop through the buckets
> until we find a used one.
Right. (But if the C++ code from the newsgroup uses an unordered_map instead, it only gets a little more than twice slower, which still seems about twice as fast as the patched D code. I don't know the source of that performance difference.)
> sorry, each time i'm thinking about registering at github god kills one
> kitten.
Thank you for your work. Probably someone else will create a pull request with this.
Comment #10 by safety0ff.bugz — 2014-09-02T06:01:24Z
(In reply to bearophile_hugs from comment #9)
> Right. (But if the C++ code from the newsgroup uses an unordered_map
> instead, it only gets a little more than twice slower, which still seems
> about twice as fast as the patched D code. I don't know the source of that
> performance difference.)
C++'s unordered_map uses templates and can do all kinds of things the D AAs currently can't: it can inline/devirtualize the comparison functions, it can store the key and value together, etc.
D's AAs essentially go through a C interface, which limits what they can do.
I don't think there's a great out-of-the-box data structure for this task in D:
- Using std.container's red-black tree isn't ideal because there's no equivalent to std::map's bracket operator (without doing multiple lookups).
- You can't roll a bag (linked list) into the hash table, because key and value are stored separately, so you cannot map a reference to a value to a reference to its key (without storing a copy of the key inside the value and then doing a lookup).
As for ketmar's patch, I don't think we should introduce a slowdown in _aaDelX.
The first entry can simply be treated as a hint, meaning that the first used bucket is greater than or equal to the hint.
I think _aaDelX should leave the cache value alone, since _aaGetFirstEntry checks that the bucket is non-null anyway.
Furthermore, I don't think the plus-one indexing is necessary.
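A minimal sketch of this hint strategy in the same toy-model style (hypothetical names, not actual druntime code): remove() never touches the hint, because getFirst() skips empty buckets anyway, and insert() only lowers it to preserve the lower-bound invariant.
- - - - - - - - - - -
import std.stdio;

struct ToyAA {
    bool[16] used;
    size_t hint;  // invariant: index of the first used bucket >= hint

    void insert(size_t i) {
        used[i] = true;
        if (i < hint) hint = i; // keep the lower-bound invariant
    }

    size_t getFirst() {
        // lazily advance the hint past empty buckets
        while (hint < used.length && !used[hint])
            ++hint;
        return hint; // == used.length when the AA is empty
    }

    void remove(size_t i) {
        used[i] = false; // no bookkeeping: the hint is still a lower bound
    }
}

void main() {
    ToyAA aa;
    aa.insert(3);
    aa.insert(9);
    writeln(aa.getFirst()); // 3
    aa.remove(3);           // no cache update here at all
    writeln(aa.getFirst()); // 9
}
- - - - - - - - - - -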
Comment #11 by ketmar — 2014-09-02T06:10:18Z
(In reply to bearophile_hugs from comment #9)
> Right. (But using an unordered_map in the C++ code from the newsgroup the
> C++ code only gets a little more than twice slower, that seems about twice
> faster than the patched D code. I don't know the source of such performance
> difference.)
we could make the D code a lot faster if we added one pointer to each AA Entry, for example, but i think that's just wasted memory.
i have some other ideas, but they require either additional memory or more support code in the add/remove functions (thus making adding and removing elements slower in general).
but i'm still thinking. maybe i'll do some more patches soon. or maybe someone who already knows the solution, but never bothered to write it 'cause he never saw a use case for it, will. ;-)
i'm still thinking about adding `aa.frontKey` and `aa.frontValue`, 'cause i don't like the fact that using the AA in foreach() turns on caching. but i believe that adding new methods would zero out the chances of merging. "we already have too many methods and runtime hooks, there is no sense in adding two more, blah-blah-blah".
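for illustration, the proposed properties could be approximated today as library-level UFCS helpers (a sketch; `frontKey`/`frontValue` are the hypothetical names from above, and these helpers still pay the full "scan for the first used bucket" cost that the runtime cache is meant to avoid):
- - - - - - - - - - -
import std.stdio;

auto frontKey(K, V)(V[K] aa)   { return aa.byKey.front; }
auto frontValue(K, V)(V[K] aa) { return aa.byValue.front; }

void main() {
    long[int] aa = [1: 10L, 2: 20L];
    writeln(aa.frontKey, " -> ", aa.frontValue);
}
- - - - - - - - - - -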
Comment #12 by ketmar — 2014-09-02T06:27:48Z
(In reply to safety0ff.bugz from comment #10)
> As for ketmar's patch, I don't think we should introduce a slowdown in
> _aaDelX.
cache bookkeeping turns on only when the code has used byKey/byValue at least once (directly or in a foreach loop). these are not very frequent operations for AAs. and the user cannot modify the AA inside `foreach(k; aa.byKey)` or `foreach(v; aa.byValue)` anyway.
the second thing is that cache bookkeeping kicks in only when the "first used" bucket becomes empty.
and any rehash operation will turn off caching too.
without that code in _aaDelX, the synthetic test from this report is A LOT slower (~1 second vs ~0.3 seconds) and the code from the forum is somewhat slower (~1.5 seconds vs ~1.1 seconds).
yet i'm sure the overhead for `aa.remove()` is so small that it can be ignored.
ok, we need some performance tests here. somebody?
> The first entry can be simply treated as a hint, meaning that the first
> bucket is greater or equal to the hint.
i've tried this 'hinting' strategy too, and it gives ~1.1 and ~1.9 seconds on the samples. don't like that.
> I think _aaDelX should leave the cache value alone since _aaGetFirstEntry
> checks that the bucket is non-null anyway.
see above.
> Furthermore, I don't think the plus one indexing is necessary.
it's a matter of taste, i think. i need a 'cache turned off' flag, and zero is fine for that: with the plus-one encoding the update test below is automatically false while caching is off. sure, we could use size_t.max instead, but then we couldn't use this trick:
    if (i+1 < aa.impl.firstUsedBucket) aa.impl.firstUsedBucket = i+1;
and would have to write
    if (aa.impl.firstUsedBucket != size_t.max && i < aa.impl.firstUsedBucket) aa.impl.firstUsedBucket = i;
don't like it. ;-)
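a tiny self-contained demonstration of that encoding (hypothetical, with the cache field reduced to a bare variable, just to spell out the trick): zero doubles as the "cache turned off" flag, so the update needs no separate check.
- - - - - - - - - - -
import std.stdio;

// firstUsedBucket == 0     => cache disabled
// firstUsedBucket == i + 1 => bucket i is the cached first used bucket
void maybeUpdateCache(ref size_t firstUsedBucket, size_t i) {
    if (i + 1 < firstUsedBucket) // never true while firstUsedBucket == 0
        firstUsedBucket = i + 1;
}

void main() {
    size_t cache = 0;           // caching is off
    maybeUpdateCache(cache, 2);
    writeln(cache);             // 0: update skipped, no explicit flag check
    cache = 10 + 1;             // caching is on, pointing at bucket 10
    maybeUpdateCache(cache, 2);
    writeln(cache);             // 3: now pointing at bucket 2
}
- - - - - - - - - - -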
Comment #13 by safety0ff.bugz — 2014-09-02T06:41:11Z
(In reply to Ketmar Dark from comment #12)
> cache bookkeeping turns on only when the code used byKey/byValue at least
> once (i.e. directly or in foreach loop). this is not that frequent
> operations for AAs.
Ah, okay, I missed that it wasn't always on.
And this time, I remembered to set you as author, unlike the previous patches I turned into PRs. I know you said it doesn't matter, but still. :-)
Comment #17 by schveiguy — 2014-09-26T15:27:43Z
(In reply to Ketmar Dark from comment #12)
> (In reply to safety0ff.bugz from comment #10)
> > Furthermore, I don't think the plus-one indexing is necessary.
> it's a matter of taste, i think. i need a 'cache turned off' flag, and zero
> is fine for that. sure, we could use size_t.max instead, but then we
> couldn't use this trick:
>
> if (i+1 < aa.impl.firstUsedBucket) aa.impl.firstUsedBucket = i+1;
>
> and would have to write
>
> if (aa.impl.firstUsedBucket != size_t.max && i < aa.impl.firstUsedBucket)
> aa.impl.firstUsedBucket = i;
>
> don't like it. ;-)
Hm... I think you can just use 0. When caching is off, you need to start looking at bucket 0 anyway.
Comment #18 by ketmar — 2014-09-26T21:38:15Z
(In reply to Steven Schveighoffer from comment #17)
> Hm... I think you can just use 0. When caching is off, you need to start
> looking at bucket 0 anyway.
please read the _aaDelX() code too. the trick is harmless, neat-looking, and yeah, it confuses some readers who haven't read the whole code (so they won't 'fix' a thing they don't fully understand ;-).
to keep a long story short: i like this trick and will not remove it. but that doesn't mean anyone else can't take my code and rewrite it any other way. ;-)
Comment #19 by ketmar — 2014-09-29T16:54:13Z
Created attachment 1441
simple 'first used bucket' caching for AAs
found a bug in remove(), updated patch.
2hsteoh: time to update PR. ;-)
Comment #20 by bearophile_hugs — 2014-09-29T21:14:58Z
(In reply to Ketmar Dark from comment #18)
> i like this trick and will not remove it.
I have not taken a look at the code, but in general, leaving tricks in the code that some people find unintuitive introduces technical debt. It's not the practice of a wise programmer.
Comment #21 by ketmar — 2014-09-30T00:39:47Z
it's a widely known trick. at least among us old farts.
Comment #22 by hsteoh — 2014-09-30T21:14:02Z
PR updated.
Comment #23 by code — 2014-10-01T08:37:53Z
I don't really agree with this patch.
Hash tables are unordered and optimized for random indexing by key.
If you run a tight loop and constantly query front (abusing byKey), then of course that's expensive: O(N^2/2). It's like taking a singly linked list and removing the last element until it's empty.
Instead we should offer a nice, generic remove function that takes a predicate and allows batch-removing entries.
Comment #24 by code — 2014-10-01T08:46:38Z
In other words, this optimizes the following code
    foreach (_; 0 .. 1_000_000)
        aa.byKey();
but pessimizes the following
    aa.remove(key);
It does NOT improve iterating byKey.
So unless there is a valid use case for calling aa.byKey.front in a loop, this would only improve a pointless microbenchmark.
Batch deletion is much better done with a dedicated remove function:
    aa.remove((k, v) => pred(k));
Any volunteers for implementing the latter?
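Such a batch remove can be sketched as a plain library function (a hypothetical removeIf helper, not an existing druntime/Phobos API; it collects the matching keys first because mutating an AA while iterating over it is not allowed):
- - - - - - - - - - -
import std.stdio;

void removeIf(K, V, P)(ref V[K] aa, P pred) {
    K[] doomed;
    foreach (k, v; aa)   // collect everything matching the predicate...
        if (pred(k, v))
            doomed ~= k;
    foreach (k; doomed)  // ...then remove outside the iteration
        aa.remove(k);
}

void main() {
    long[int] aa;
    foreach (i; 0 .. 10)
        aa[i] = i;
    aa.removeIf((int k, long v) => k % 2 == 0); // drop the even keys
    writeln(aa.length); // 5
}
- - - - - - - - - - -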
Comment #25 by ketmar — 2014-10-01T08:50:34Z
and we can make it a lot less expensive without losing speed. and all this with only 4/8 bytes of overhead per AA (not per AA entry, but for the whole AA).
the only questionable thing with this patch is that it assumes the AA is mutable. but we can't have AA literals anyway, AFAIK.
and no, it doesn't really pessimize remove() if no one has used .byKey/.byValue on the AA; and even if those properties were used, the next rehash will turn off caching again. so i daresay the remove() slowdown is not noticeable at all.
Comment #26 by bearophile_hugs — 2014-10-01T09:43:37Z
(In reply to Martin Nowak from comment #24)
> So unless there is a valid use case for calling aa.byKey.front in a loop,
> this would only improve a pointless microbenchmark.
See here, it's not just a pointless microbenchmark:
http://forum.dlang.org/thread/[email protected]
As shown by others, there are ways to rewrite that code to avoid most of the very slow part, but popping the "first" item from an associative array is a sufficiently common operation (I have written and seen plenty of Python code that does it).
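The operation in question (roughly Python's dict.popitem) can be written as a small helper; this is a sketch with a hypothetical popAny name, and it is exactly the byKey.front pattern whose cost this report is about:
- - - - - - - - - - -
import std.stdio;
import std.typecons : tuple;

// pop some arbitrary key/value pair from an AA (hypothetical helper)
auto popAny(K, V)(ref V[K] aa) {
    auto k = aa.byKey.front; // an arbitrary key: the first used bucket
    auto v = aa[k];
    aa.remove(k);
    return tuple(k, v);
}

void main() {
    long[int] aa = [1: 10L, 2: 20L];
    long total = 0;
    while (aa.length) {
        auto kv = aa.popAny();
        total += kv[0] + kv[1];
    }
    writeln(total); // 33
}
- - - - - - - - - - -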
Comment #27 by hsteoh — 2014-10-01T18:07:50Z
@bearophile: perhaps that's an indication that what you want is actually an *ordered* map, not a hash map?
Comment #29 by bearophile_hugs — 2014-10-01T19:46:15Z
(In reply to hsteoh from comment #27)
> @bearophile: perhaps that's an indication that what you want is actually an
> *ordered* map, not a hash map?
Nope. If you take a look at the link, that code doesn't need an ordered container. The "pop" operation just needs to give one arbitrary item.
Comment #30 by schveiguy — 2014-10-01T20:57:29Z
bearophile or ketmar, can you test the new PR? The timings I get on my system are not very different from the original.
Comment #31 by code — 2014-10-01T22:20:28Z
(In reply to bearophile_hugs from comment #29)
> Nope. If you take a look at the link, that code doesn't need an ordered
> container. The "pop" operation just needs to give one arbitrary item.
Take an array or a stack for that.
Comment #32 by bearophile_hugs — 2014-10-01T22:37:36Z
(In reply to Martin Nowak from comment #31)
> (In reply to bearophile_hugs from comment #29)
> > Nope. If you take a look at the link, that code doesn't need an ordered
> > container. The "pop" operation just needs to give one arbitrary item.
>
> Take an array or a stack for that.
Nope, the associative array nature is needed for other reasons by the code. And keeping the same keys in two collections in parallel is a waste.
Comment #33 by ketmar — 2014-10-02T04:04:03Z
(In reply to Steven Schveighoffer from comment #30)
> bearophile or ketmar, can you test the new PR?
for what reason? see #2 and #3, i tried the variant without caching in remove(). i'm saying again that remove() is NOT slowed down by any noticeable amount. and if you're doing a lot of removes which all happen to hit the first used bucket, and each such hit empties that bucket, and you used .byKey/.byValue before… that's a very unusual scenario, and if you *know* your code goes like this, you can force a rehash.
yet i'm not interested in defending the code: i've had it in my local builds since i published the patch and it's ok for me. i don't really care whether it makes it to mainline in one form or another.
Comment #34 by schveiguy — 2014-10-02T10:53:47Z
(In reply to Ketmar Dark from comment #33)
> (In reply to Steven Schveighoffer from comment #30)
> > bearophile or ketmar, can you test the new PR?
> for what reason? see #2 and #3, i tried the variant without caching in
> remove().
This PR should have identical performance to your latest patch for the code in question, but not cause any slowdown for remove otherwise. But for some reason my box doesn't even exhibit the original problem with no patch; the speedup on my system is not measurable. So I need someone who does experience the problem to test it and make sure it's actually fixed.
> i'm saying again that remove() is NOT slowed down by any noticeable amount.
> and if you're doing a lot of removes which all happen to hit the first used
> bucket, and each such hit empties that bucket, and you used .byKey/.byValue
> before… that's a very unusual scenario, and if you *know* your code goes
> like this, you can force a rehash.
As has been discussed, this isn't an acceptable tradeoff. Forcing a rehash for this seems unintuitive; users will not get it. Besides, there is no reason to recalculate the cache unless you need it.
> yet i'm not interested in defending the code: i've had it in my local
> builds since i published the patch and it's ok for me. i don't really care
> whether it makes it to mainline in one form or another.
Hm... people here have made a lot of effort to work with your patch given your objections to github. Normally I would say that unless you submit a PR, you will not get this code included. Your response confirms that that would have been the correct course, and I'll remember it from now on.
Comment #35 by ketmar — 2014-10-02T12:43:33Z
(In reply to Steven Schveighoffer from comment #34)
> This PR should have identical performance as your latest for the code in
> question, but not cause any slowdown for remove otherwise.
sorry, i misread your patch (yes, i looked at it before posting my answer). still, you're doing a slightly different thing and i don't want to argue. that's why i'm saying i don't care: both variants are good, and my own has some good points for my use cases. that's why i'm not interested in defending my code; it's not that i don't care about the changes.
> Hm... people here have made a lot of effort to work with your patch given
> your objections to github.
i can't remember forcing anybody to do anything with this patch. yet if somebody feels bad about it, he can ask for a refund: i'll gladly return his money.
> Normally I would suggest that unless you submit a
> PR, you will not get this code included.
i don't really care. i'm putting my patches here for those who are interested, and they are free to do anything they want with my code. i know that bugzilla patches have very little chance of making it into mainline. i just want to share my vision, in the form of code, for others to decide.
Comment #36 by schveiguy — 2014-10-06T15:30:49Z
I can reproduce the issue. I was accidentally using bearophile's second D implementation, which does not call byKey/byValue every time through the loop but instead uses aa.keys. Oops!
Comment #37 by github-bugzilla — 2014-10-10T21:03:10Z
Comment #38 by bearophile_hugs — 2014-10-10T21:43:30Z
The original D program ( http://forum.dlang.org/thread/[email protected] ) was about 30 times slower than the equivalent C++ code. Now the D code is only 5 times slower than the C++ code, so the situation is much better.
But the two programs are not exactly equivalent: the C++ code uses a tree-based map, while the D code uses the built-in hash-based associative array.
Comment #39 by schveiguy — 2014-10-10T21:59:52Z
Bearophile,
With the tree-based solution, when the front element is removed, the "next" element is one hop away. But with a hash table it can be far away in the table, so you are likely better off using a tree-based map, something like RedBlackTree, if you want to come close to the C++ performance.
There is also the issue that C++ uses a template-based map, so it can optimize the code, inline comparisons, etc. The AA-based solution cannot do this.
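For reference, here is a sketch of the benchmark on top of Phobos's std.container.rbtree (an untimed illustration; the Tuple field names are this sketch's own, not from the thread):
- - - - - - - - - - -
import core.stdc.stdio;
import std.container.rbtree;
import std.typecons : Tuple;

int main() {
    alias KV = Tuple!(int, "key", long, "value");
    auto tree = new RedBlackTree!(KV, "a.key < b.key");
    foreach (i; 0 .. 50_000)
        tree.insert(KV(i, i));
    long total = 0;
    while (!tree.empty) {
        auto e = tree.front; // smallest key: one hop away, as in std::map
        tree.removeFront();
        total += e.key + e.value;
    }
    printf("%lld\n", total);
    return 0;
}
- - - - - - - - - - -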
Comment #40 by ketmar — 2014-10-10T22:39:29Z
2Steven Schveighoffer: thanks for your work.
Comment #41 by bearophile_hugs — 2014-10-10T23:05:22Z
(In reply to Steven Schveighoffer from comment #39)
> With the tree-based solution, when the front element is removed, the "next"
> element is one hop away. But with a hash table it can be far away in the
> table, so you are likely better off using a tree-based map, something like
> RedBlackTree, if you want to come close to the C++ performance.
>
> There is also the issue that C++ uses a template-based map, so it can
> optimize the code, inline comparisons, etc. The AA-based solution cannot do
> this.
I understand all this (the C++ program with a hash_map is 2.5 times faster than the current D situation).
> So you are likely better off using a tree-based map, something like
> RedBlackTree, if you want to come close to the C++ performance.
I have not tried to use the Phobos RedBlackTree in this program; it would be an interesting test.
Comment #42 by github-bugzilla — 2015-02-18T03:38:07Z