Sometimes you know you are going to allocate a large number of blocks from the GC, but you need to deal with them one at a time, and you don't want them tied to one giant array for GC purposes.
A good example is an associative array:
int[int] aa;
foreach(i; 0 .. 20_000_000) aa[i] = i;
Each assignment requires a new GC allocation, which is unavoidable. However, we could streamline the allocation:
1. If we told the GC that in the next few lines we were going to need 20 million AA elements of size X, it could set aside this data in a specialized place that would not be collected, and would not be handed out elsewhere (probably a new page bit is needed).
2. Each allocation for this purpose, until the preallocated cache is exhausted, would simply return the next one, and not need to do any collections.
3. The GC could optimize locking. For instance, it may only need to release elements for possible collection a page at a time, so flipping those bits costs only one lock/unlock per page. The idea is that once you reserve, you plan to use those blocks quickly.
A strawman high level API:
auto preallocate(T)(size_t quantity);
usage:
auto x = GC.preallocate!int(100_000); // give me a block of preallocated ints
foreach(i; 0 .. 100_000)
{
    assert(x.remaining + i == 100_000); // i blocks handed out so far
    int* p = x.allocNext(); // @nogc
}
When x goes out of scope, any remaining preallocated blocks that weren't used are freed.
Comment #1 by andrei — 2017-10-06T20:10:55Z
This seems like a tall order. It's essentially pushing a specific management need onto the GC, complicating everyone's life in the process.
If an application needs to preallocate, there's an API for that already - it's called GC.allocate() :o). Then the application needs to do its own bookkeeping.
Comment #2 by schveiguy — 2017-10-06T20:42:40Z
The use case is different than for GC.allocate.
I want to fill in a structure of nodes; I know I'm going to fill it with 10,000 elements, so I'm going to allocate them all in a loop. But I don't want the GC to keep the ENTIRE thing in memory if just one element is still pointed at. And I don't want to run the GC constantly in my tight loop allocating each node.
Think of a tree constructor, or an AA constructor.
Essentially it's the same as array.reserve, but for individual blocks instead of a contiguous single block.
Comment #3 by andrei — 2017-10-06T20:56:49Z
>The use case is different than for GC.allocate.
The semi-joke was that this is all that's needed.
>I want to fill in a structure of nodes, I know I'm going to fill it in with 10,000 elements, so I'm going to allocate them all in a loop. But I don't want the GC to keep the ENTIRE thing in memory if just one element is still pointed at. And I don't want to run the GC constantly in my tight loop allocating each node.
The solution is to prefill a freelist, then let go of the extra elements remaining.
>Think of a tree constructor, or an AA constructor.
>Essentially it's the same as array.reserve, but for individual blocks instead of a contiguous single block.
That's not the job of the allocator. It's the job of the user of the allocator. The allocator gives you memory. You organize it as you find fit.
Comment #4 by schveiguy — 2017-10-06T21:17:10Z
Stevens-MacBook-Pro:testd steves$ cat testpreallocate.d
struct S
{
    S* next;
}

void main()
{
    version(freelist)
    {
        S* head = null;
        foreach(i; 0 .. 20_000_000)
            head = new S(head);
    }
    version(preallocate)
    {
        S* head = null;
        auto x = new S[20_000_000];
        foreach(i; 0 .. 20_000_000)
        {
            x[i] = S(head);
            head = &(x[i]);
        }
    }
}
Stevens-MacBook-Pro:testd steves$ dmd -O -release testpreallocate.d -version=freelist
Stevens-MacBook-Pro:testd steves$ time ./testpreallocate
real 0m1.869s
user 0m1.750s
sys 0m0.114s
Stevens-MacBook-Pro:testd steves$ dmd -O -release testpreallocate.d -version=preallocate
Stevens-MacBook-Pro:testd steves$ time ./testpreallocate
real 0m0.111s
user 0m0.045s
sys 0m0.062s
The point is that the GC is not well-equipped to handle the tight allocation loop.
The second version has the drawback that all 20 million elements will remain in memory as long as there is one element still alive.
What I'm looking for is something that has the performance (or similar, I realize it won't be as good) of the second, but can collect each block individually like the first.
Comment #5 by r.sagitario — 2017-10-07T07:37:01Z
A slightly simpler API could be one that allocates N same-sized chunks from the GC and returns them in a free-list-like chain.
I agree with Andrei that we should not complicate the GC interface for every possible allocation pattern a user might want to optimize for, though.
If you call GC.reserve(20_000_000*(Key.sizeof+Value.sizeof)) before inserting elements, there should be no collection while filling the AA.
If we add thread local free-lists to the GC, the overhead of allocating these from the GC instead of caching them in the AA would be rather small.
Comment #6 by schveiguy — 2017-10-07T12:39:04Z
(In reply to Rainer Schuetze from comment #5)
> A slightly simpler API could be to add allocating N same-sized chunks from
> the GC that returns them in a free-list-like chain.
I think an array is best, since the data is already an array, and a free-list requires building a linked list beforehand.
> I agree with Andrei that we should not complicate the GC interface for every
> possible allocation pattern a user might want to optimize for, though.
Hm... I'm trying to find the most generic interface for this. It's not necessarily limited to AA, as allocating a bunch of blocks in a loop isn't an uncommon thing.
If there was a way to "re-flavor" the blocks from one giant block into individual ones, then we could do this outside the GC.
> If you call GC.reserve(20_000_000*(Key.sizeof+Value.sizeof)) before
> inserting elements, there should be no collection while filling the AA.
Limiting the collection in the example I posted shaves off about 300-400msec, but it still is 1.5 seconds vs. 0.1 for the local array version.
> If we add thread local free-lists to the GC, the overhead of allocating
> these from the GC instead of caching them in the AA would be rather small.
Agreed, I think keeping the lists inside the GC is the most useful, and does not expose any implementation details to the user.
Comment #7 by safety0ff.bugz — 2017-10-11T18:21:21Z
(In reply to Steven Schveighoffer from comment #6)
> > If we add thread local free-lists to the GC, the overhead of allocating
> > these from the GC instead of caching them in the AA would be rather small.
>
> Agreed, I think keeping the lists inside the GC is the most useful, and does
> not expose any implementation details to the user.
I think this is the only sane way of doing it, i.e. without over-constraining the GC implementation or exposing the user to implementation details.
I think the API should obviously have the size of the elements as a separate argument and provide as much information as malloc, e.g.:
GC.preload(size_t num, size_t size, uint ba = 0, const TypeInfo ti = null)
Otherwise it relies on hope that the GC will pull from the same pool.
Comment #8 by schveiguy — 2017-10-11T18:37:32Z
(In reply to safety0ff.bugz from comment #7)
> I think the API should obviously should have the size of the elements as a
> separate arguments and provide as much information as malloc e.g.:
> GC.preload(size_t num, size_t size, uint ba = 0, const TypeInfo ti = null)
Internally, I think something like this is needed. What I am looking for, though, is the high-level API of "I have 20 million T's I want to allocate, give me them properly allocated and typed one at a time". I don't necessarily even think you need the TypeInfo for the low level API. Perhaps the bit attributes can even be done as you allocate each block.
Comment #9 by safety0ff.bugz — 2017-10-11T20:26:55Z
(In reply to Steven Schveighoffer from comment #8)
> Internally, I think something like this is needed. What I am looking for,
> though, is the high-level API of "I have 20 million T's I want to allocate,
> give me them properly allocated and typed one at a time". I don't
> necessarily even think you need the TypeInfo for the low level API. Perhaps
> the bit attributes can even be done as you allocate each block.
Here's the rough sketch of one idea:
void[] gc_nmalloc(size_t num, size_t size, void[] prev = [], uint ba = 0, const TypeInfo ti = null);
// prev is used for temporarily pinning/unpinning a memory range
struct GC // in core.memory
{
    auto NMalloc(T)(size_t num) // convenience function
    {
        int loop(int delegate(T*) dg)
        {
            void[] prev;
            size_t remain = num;
            scope(exit) gc_nmalloc(0, T.sizeof, prev); // unpin last chunk
            int result;
            while (remain)
            {
                prev = gc_nmalloc(remain, T.sizeof, prev, ...); // plus ba/ti as needed
                auto chunk = (cast(T*)prev.ptr)[0 .. prev.length / T.sizeof]; // cast the pointer, then slice
                foreach (ref e; chunk)
                    if ((result = dg(&e))) return result;
                remain -= chunk.length;
            }
            return result;
        }
        return &loop;
    }
}
So user code could look like:
foreach (e; GC.NMalloc!T(20_000_000))
    // do something with the allocated e
Comment #10 by andrei — 2017-10-24T21:32:55Z
(In reply to Steven Schveighoffer from comment #8)
> What I am looking for,
> though, is the high-level API of "I have 20 million T's I want to allocate,
> give me them properly allocated and typed one at a time".
This issue should be closed. A good allocator should be able to handle repeated allocations of the same size without this kind of hint.
Comment #11 by schveiguy — 2017-10-25T12:52:18Z
(In reply to Andrei Alexandrescu from comment #10)
> This issue should be closed. A good allocator should be able to handle
> repeated allocations of the same size without this kind of hint.
I'm all for a solution that doesn't need hints. As far as I can tell, that solution hasn't been disclosed yet. If you know how to write a "good allocator" I really think you should do it, it would do wonders for D's GC performance. Certainly, feel free to take this ER down in the event that becomes available.
Comment #12 by r.sagitario — 2017-10-28T15:53:28Z
The current GC already does something in that regard as it pre-assigns a new memory page to a free-list of elements of the size of the last request.
The GC locking overhead can be reduced by helping the inliner a bit with some assembly. I have tried that for Win64 recently and the benchmarks showed an improvement of about 5%.
Comment #13 by stanislav.blinov — 2018-10-28T13:30:48Z
Created attachment 1713
Example of paged allocation with GC
Jumping in a year later almost to the hour, I'll start with a disclaimer: I agree with Andrei that such functionality in general is best left for a custom allocator API.
That said, having a straightforward utility for chunked GC allocations in Phobos or DRuntime would benefit users who don't necessarily want to import or deal with custom allocators for "just that one allocation" (scripts, prototyping, etc.).
Attached is an implementation of the sketch suggested in previous comments; see the comments in the code about certain implementation/DRuntime quirks. On my machine this implementation is about 10x faster than piecewise allocation with both dmd and ldc; it's still slower, of course, than a wholesale preallocation, but it should at least be on the same order of magnitude unless you go for astronomical numbers of instances.
Comment #14 by robert.schadek — 2024-12-07T13:37:39Z