Created attachment 918
The patch.
Here's a GC benchmark and its performance on the stock GC.
import std.stdio, std.datetime, core.memory, std.conv;

void main(string[] args) {
    if (args.length < 2) {
        stderr.writeln("Need size.");
        return;
    }

    immutable mul = to!size_t(args[1]);
    auto ptr = GC.malloc(mul * 1_048_576, GC.BlkAttr.NO_SCAN);

    auto sw = StopWatch(AutoStart.yes);
    GC.collect();
    immutable msec = sw.peek.msecs;
    writefln("Collected a %s megabyte heap in %s milliseconds.",
             mul, msec);
}
Outputs for various sizes (pre-patch):
Collected a 10 megabyte heap in 1 milliseconds.
Collected a 50 megabyte heap in 4 milliseconds.
Collected a 200 megabyte heap in 16 milliseconds.
Collected a 500 megabyte heap in 41 milliseconds.
Collected a 1000 megabyte heap in 80 milliseconds.
Collected a 5000 megabyte heap in 397 milliseconds.
Collected a 10000 megabyte heap in 801 milliseconds.
Collected a 30000 megabyte heap in 2454 milliseconds.
Collected a 50000 megabyte heap in 4096 milliseconds.
Attached is a patch that splits the heap into separate large- and small-object pools, stores the GC bits for the large-object pool at page-size offsets rather than 16-byte offsets, and stores, rather than computes, the offsets of B_PAGEPLUS pages from their B_PAGE head. So far, all unit tests for Phobos, dstats and std.parallelism/parallelfuture pass with this enabled.
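As a rough illustration of the two data-structure changes (a minimal sketch with hypothetical names such as LargeObjectPool and findBase; the actual gcx.d differs), keeping GC bits per page and storing B_PAGEPLUS offsets looks like this:

enum Bins : ubyte { B_PAGE, B_PAGEPLUS, B_FREE }

struct LargeObjectPool
{
    Bins[] pagetable;      // one entry per page
    size_t[] pageOffsets;  // for each B_PAGEPLUS page: pages back to its B_PAGE head
    // mark/free/scan bits are kept per page, not per 16-byte unit,
    // shrinking the bit arrays by a factor of PAGESIZE / 16 = 256.

    // O(1): find the page that starts the block containing page pn.
    size_t findBase(size_t pn) const
    {
        return pagetable[pn] == Bins.B_PAGEPLUS ? pn - pageOffsets[pn] : pn;
    }
}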
Here are the new benchmarks (post-patch):
Collected a 10 megabyte heap in 0 milliseconds.
Collected a 50 megabyte heap in 0 milliseconds.
Collected a 250 megabyte heap in 1 milliseconds.
Collected a 500 megabyte heap in 0 milliseconds.
Collected a 1000 megabyte heap in 1 milliseconds.
Collected a 5000 megabyte heap in 3 milliseconds.
Collected a 10000 megabyte heap in 6 milliseconds.
Collected a 30000 megabyte heap in 16 milliseconds.
Collected a 50000 megabyte heap in 26 milliseconds.
Comment #1 by dsimcha — 2011-02-20T12:12:02Z
Created attachment 919
The whole gcx.d file.
Here's the whole gcx.d file, in case you have trouble applying the patch. (These things never quite work for me.)
Comment #2 by bugzilla — 2011-02-20T13:03:57Z
More explanation from David from the n.g.:
I've found a way to speed up the GC massively on large heaps without excessive ripple effects. Technically it's still O(N), but with about a hundredfold smaller constant in the case of large heaps where most of the data is not scanned. The O(N) term (where N is the total size of the heap) now has such a small constant that, for almost all practical purposes, the GC is O(S) (where S is the size of the scanned portion of the heap). It also no longer has the O(N^2) pathological case that I discovered while reading the code.
So far all unittests for Phobos, dstats and std.parallelism/parallelfuture pass with this enabled. Please test some other code so we can wring out the corner cases in time for the next release.
Basically all I did was diverge the Pool struct slightly into large and small object sub-varieties. The large object sub-variety is used to allocate objects of at least a page. It only stores gcbits at page-size offsets, and tracks the offsets of B_PAGEPLUS bins from the nearest B_PAGE bin so that they can be found in O(1).
I also added a field to the Pool struct so that the number of free pages in a pool can be tracked in O(1). This should drastically lessen the time it takes to perform large allocations on large heaps. Right now a free memory region is found by a linear search through the pools in the case of large allocations, and unfortunately I don't see any easy way to fix that. This patch at least allows short-circuiting past a large number of pools when a pool doesn't have enough free space in total, let alone contiguous free space.
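A hedged sketch of that short circuit (hypothetical names; the real allocPages differs):

struct Pool { size_t freepages; /* pagetable, GC bits, ... */ }

// Look for npages of contiguous free space across all pools.
size_t allocPages(Pool[] pools, size_t npages)
{
    foreach (ref pool; pools)
    {
        // O(1) short circuit: this pool lacks npages free in total,
        // let alone contiguously, so skip it without scanning.
        if (pool.freepages < npages)
            continue;
        // ... linear scan of this pool's pagetable for a contiguous run ...
    }
    return size_t.max; // nothing found; caller allocates a new pool
}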
Comment #3 by bearophile_hugs — 2011-02-20T13:19:15Z
A very simple benchmark, in D and Java. You may run the D version with and without your patch, then the Java version, and show the three timings, with N = 15 or 18 or more:
http://codepad.org/yMGK34cb
http://codepad.org/DIOeYn6p
Comment #4 by dsimcha — 2011-02-20T14:48:45Z
(In reply to comment #3)
> A very simple benchmark, in D and Java. You may use the D version with and
> without your patch, and then the Java version too, and show the three timings,
> with N = 15 or 18 or more:
>
> http://codepad.org/yMGK34cb
> http://codepad.org/DIOeYn6p
Using n = 18:
D (with patch): 62 seconds
D (without patch): 68 seconds
Java: 6 seconds
I'm not sure what this told us that we didn't already know, though. We already knew Java's GC is much better than D's. My patch is meant to address issues with large-object allocation, not small-object allocation. (Large is defined as at least one full page. If no large objects are allocated for the duration of the program, my patch has virtually no effect.) This benchmark does tons of small-object allocation and no large-object allocation. The small difference between the patched and unpatched runs may even just be noise.
Comment #5 by bearophile_hugs — 2011-02-20T15:07:17Z
(In reply to comment #4)
> I'm not sure what this told us that we didn't already know, though.
Thank you for the timings. This synthetic benchmark tells us something important: in a different situation (but an important one; the benchmark is quite revealing despite being so simple), your patch doesn't make the GC significantly worse, and may even improve things a bit.
Comment #6 by bearophile_hugs — 2011-02-20T15:11:18Z
(In reply to comment #4)
> We already knew Java's GC is much better than D's.
That's the purpose of a three-legged comparison: the Java timing serves as a rough zero point, allowing an estimate of the absolute difference between the other two timings.
Comment #7 by dsimcha — 2011-02-20T18:29:08Z
Created attachment 920
New patch.
Here's a new patch that adds one small but important optimization I had forgotten. Now that we store the number of pages for each B_PAGE block, allocPages() can skip over large blocks in O(1) time.
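Roughly, and again with hypothetical names (scanForRun and npagesOf are illustrative, not the actual code), the intra-pool scan can now hop over allocated blocks:

enum Bins : ubyte { B_PAGE, B_PAGEPLUS, B_FREE } // as in the earlier sketch

// Scan a pool's pagetable for `want` contiguous free pages; npagesOf[pn]
// holds the block length in pages for each B_PAGE entry.
size_t scanForRun(const Bins[] pagetable, const size_t[] npagesOf, size_t want)
{
    size_t run = 0;
    for (size_t pn = 0; pn < pagetable.length; )
    {
        if (pagetable[pn] == Bins.B_FREE)
        {
            if (++run == want)
                return pn + 1 - want; // start of the free run
            ++pn;
        }
        else
        {
            // The scan only ever lands on block heads or free pages,
            // so npagesOf[pn] is valid here: hop the whole block in O(1).
            run = 0;
            pn += npagesOf[pn];
        }
    }
    return size_t.max;
}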
Comment #8 by dsimcha — 2011-02-20T18:29:43Z
Created attachment 921
New gcx.d
Comment #9 by dsimcha — 2011-02-21T16:12:54Z
Here's another benchmark. This one is designed to resemble reasonably common scientific-computing, allocation-intensive use cases with a moderately large overall heap, rather than to highlight the specific performance problem at hand.
import std.random, core.memory, std.datetime, std.stdio;

enum nIter = 1000;

void main() {
    auto ptrs = new void*[1024];
    auto sw = StopWatch(AutoStart.yes);

    // Allocate 1024 blocks per iteration, with sizes uniformly
    // distributed between 1 KB and 128 KB.
    foreach (i; 0 .. nIter) {
        foreach (ref ptr; ptrs) {
            ptr = GC.malloc(uniform(1024, 128 * 1024 + 1), GC.BlkAttr.NO_SCAN);
        }
    }

    writefln("Done %s iter in %s milliseconds.", nIter, sw.peek.msecs);
}
With patch:
Done 1000 iter in 7410 milliseconds.
Without patch:
Done 1000 iter in 38737 milliseconds.
Memory usage is about the same (based on looking at task manager): About 88 megs w/ patch, about 92 without.
To verify that this patch doesn't have deleterious effects when allocating lots of objects close to the border between "large" and "small", I also tried it with uniform(1024, 8 * 1024 + 1) and got:
With patch:
Done 1000 iter in 2122 milliseconds.
Without patch:
Done 1000 iter in 4153 milliseconds.
The relatively small difference here isn't surprising, as there are no huge allocations being done (max size is 8k). Again, the difference in memory usage was negligible (within epsilon of 12 MB for both).
Comment #10 by sean — 2011-02-22T15:01:51Z
I think the separation of pools for large and small allocations is a good thing. In fact, the current collector will return entirely free pools to the OS at the end of a collection cycle, so the two are already logically separate. I can't think of a case where performance would be worse than before, but I'll give the patch a once-over to be sure.
Comment #11 by schveiguy — 2011-02-23T20:10:03Z
Cursory glance at the patch, it looks like it won't affect array appending.
BTW, a while ago I had a very similar thought about storing a value that lets you jump back to find the start of a B_PAGEPLUS run, but I came up with a different method.
First, a Bins value is already stored for every page; it's an int, and we're using exactly 13 of the 4 billion possible values. My idea was to remove B_PAGEPLUS from the enum: any Bins value other than the given enum members would mean "jump back this many pages", encoded as the jump distance plus B_MAX.
This saves having to keep and update a separate array.
In addition, your objection that this only covers 16 TB doesn't matter: it means the *jump size* is limited to 16 TB. If a block exceeds 16 TB, you just store the maximum; the algorithm jumps back that amount, checks the page at that location (which also knows how far to jump back), and continues.
Can you imagine how awesome the performance would be on a system with a 16TB block with the linear search? ;)
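A rough sketch of that encoding (hypothetical code; the value of B_MAX and the pagetable type are assumed rather than taken from gcx.d):

// Values 0 .. B_MAX-1 are the real bin kinds; any value above B_MAX means
// "B_PAGEPLUS: jump back (value - B_MAX) pages" (the stored jump is always
// at least 1). A block longer than one jump can express stores the maximum,
// and the page landed on chains the jump further back.
enum uint B_MAX = 13; // 13 enum values in use out of 4 billion

size_t findBlockStart(const uint[] pagetable, size_t pn)
{
    while (pagetable[pn] > B_MAX)    // encoded B_PAGEPLUS entry
        pn -= pagetable[pn] - B_MAX; // one hop; chains for blocks > 16 TB
    return pn;
}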
I think this patch should be applied (will be voting shortly).
Comment #12 by schveiguy — 2011-02-23T20:12:09Z
For some reason this was marked as assigned, but not to anyone.
I'm guessing you wanted it assigned to you, Sean, since you changed the bug to Assigned?
Comment #13 by dsimcha — 2011-02-23T20:23:23Z
One logistical point: since rebuilding my mental model of how the GC works, I've come up with a few other small ideas for optimization. These don't have nearly the impact this patch does (only a few percent each). I think we should make this patch a high priority for the next release, since its effect is huge. For the smaller optimizations, I'll fork the druntime git and commit them as I get around to testing and benchmarking, and we can merge them back later.
Comment #14 by dsimcha — 2011-02-23T20:26:47Z
(In reply to comment #11)
> In addition, your statement that we only get 16 TB of space doesn't matter. It
> means the *jump size* is 16 TB. That is, if you exceed 16 TB of space for a
> block, then you just store the maximum. The algorithm just has to be adjusted
> to jump back that amount, then check the page at that location (which will also
> know how far to jump back), and continue on.
>
I'd rather just assume that, in the near future, no one is ever going to allocate 16 TB in a single allocation; in the distant future we can switch to size_t, though hopefully we'll have a better GC by the time anyone has enough RAM for a 16 TB allocation.
Comment #16 by schveiguy
(In reply to comment #15)
> https://github.com/dsimcha/druntime/wiki/Druntime-GC-Optimization-Fork
from that wiki page:
Also note that a Tree2 benchmark also exists, but it seems to run in either 12 or 0 seconds, randomly, no matter what patches are applied, for reasons I don't understand.
This pertains to bug 5650.
I have seen the same anomaly (although mine must be slower hardware; it varies between 20+ and 0.9 seconds).
My theory is that false pointers are keeping the elements in memory, so the GC never frees them. It is *definitely* the GC's fullCollect causing the slowdown: while debugging (and printing every 100 loops), you can ctrl-c to pause, and it's always in the collect cycle.
Basically, the program is so deterministic that the only thing I can think of that changes between good and bad runs is the address space given to the heap by the OS.
I sort of tested this theory by adding this line to the code:
writeln((new int[1]).ptr);
Here are the results of running a bunch of times (the times follow the address printout):
address   real        user        sys
112E40    0m26.723s   0m26.554s   0m0.052s
A50E40    0m0.911s    0m0.908s    0m0.000s
26FE40    0m23.852s   0m23.737s   0m0.096s
112E40    0m20.139s   0m20.065s   0m0.040s
58BE40    0m19.932s   0m19.841s   0m0.080s
EBDE40    0m0.897s    0m0.880s    0m0.012s
724E40    0m25.801s   0m25.762s   0m0.024s
3F2E40    0m0.907s    0m0.904s    0m0.000s
AC9E40    0m0.891s    0m0.884s    0m0.000s
DA4E40    0m0.906s    0m0.888s    0m0.016s
26FE40    0m29.869s   0m29.770s   0m0.084s
799E40    0m0.900s    0m0.896s    0m0.000s
58DE40    0m39.999s   0m39.802s   0m0.152s
138E40    0m34.000s   0m33.906s   0m0.032s
65CE40    0m19.246s   0m19.201s   0m0.032s
1B0E40    0m28.394s   0m28.350s   0m0.028s
D62E40    0m0.910s    0m0.900s    0m0.008s
AB6E40    0m0.904s    0m0.904s    0m0.000s
26FE40    0m38.978s   0m38.834s   0m0.124s
367E40    0m27.100s   0m27.010s   0m0.076s
9DEE40    0m0.899s    0m0.888s    0m0.004s
112E40    0m40.536s   0m40.419s   0m0.088s
401E40    0m0.901s    0m0.896s    0m0.000s
A18E40    0m0.911s    0m0.900s    0m0.004s
7A1E40    0m0.908s    0m0.904s    0m0.004s
112E40    0m26.441s   0m26.330s   0m0.100s
611E40    0m23.135s   0m23.041s   0m0.068s
3D7E40    0m0.905s    0m0.900s    0m0.000s
138E40    0m38.311s   0m38.242s   0m0.044s
112E40    0m24.372s   0m24.314s   0m0.028s
270E40    0m34.142s   0m33.998s   0m0.092s
9ACE40    0m0.911s    0m0.908s    0m0.004s
C8DE40    0m0.898s    0m0.892s    0m0.000s
284E40    0m20.744s   0m20.621s   0m0.096s
3E0E40    0m0.910s    0m0.900s    0m0.004s
386E40    0m20.044s   0m19.921s   0m0.108s
Most of the time, the smaller the address, the more likely the run is to be slow. There are, however, some outliers.
What does this data mean? It may mean nothing, but run time does seem to correlate strongly with address selection. I can't think of any other reason the code would behave so drastically differently from one run to the next. Does this mean false pointers are the problem? Not sure, but it's all I can think of for now.
Comment #17 by dsimcha — 2011-08-27T07:29:44Z
I'm marking this as resolved. My pull request was merged a long time ago and Steve's issue is probably related to false pointers.