Bug 2313 – Poor array ~= append performance

Status: NEW
Severity: normal
Priority: P3
Component: dmd
Product: D
Version: D1 (retired)
Platform: x86
OS: Windows
Creation time: 2008-08-25T10:09:30Z
Last change time: 2024-12-13T17:48:47Z
Keywords: performance
Assigned to: Walter Bright
Creator: Lionello Lunesu
Moved to GitHub: dmd#17774

Attachments

ID   Filename      Summary                                                        Content-Type  Size
272  gcmemcpy.asm  assembly for _d_arrayappendcT using memcpy                     text/plain    1835
273  gcbyte[].asm  assembly for _d_arrayappendcT using the original byte[] copy   text/plain    1932

Comments

Comment #0 by lio+bugzilla — 2008-08-25T10:09:30Z
See original thread. Appending an element to an array is very slow. There are three reasons:

1) _d_arrayappendcT is used for all types of arrays, resulting in less than optimal performance when sizeof(item) > 1.

2) std.gc.capacity calls sizeOfNoSync, which in turn calls findPool. The complexity of this call is O(m + n), where n is the number of pools and m is the size of the block.

3) sizeOfNoSync records the last result in a (global) cache to improve the performance of the case "for() ar ~= item;". When appending to two arrays alternately, this cache is useless, resulting in the O(m + n) code path described in 2).

A possible solution to 1) might be to create custom append routines for each array type (similar to the custom routines for the array operations, comparison, hashing, etc.), so that appending to an int[] can simply store an int. Alternatively, the _d_arrayappendcT code could check the size of the item and invoke different code (possibly using MMX/SSE when applicable). 2) might be solved by using the fact that pooltable is always sorted; this would bring the complexity down to O(m + log n), as sketched below. Ideally, the size of each allocation would be recorded, either in a separate array (per pool) or right before the allocation itself. This would give a complexity of O(log n) or O(1) respectively, minimizing the impact of the cache miss mentioned in 3).
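To make the pooltable idea in 2) concrete, here is a minimal sketch of a binary search over a pool table kept sorted by base address. The Pool layout and the baseAddr/topAddr names are illustrative stand-ins, not the actual gc.d declarations.

struct Pool
{
    void* baseAddr;   // first address managed by this pool
    void* topAddr;    // one past the last managed address
}

Pool*[] pooltable;    // assumed invariant: kept sorted by baseAddr

Pool* findPool(void* p)
{
    size_t lo = 0;
    size_t hi = pooltable.length;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        Pool* pool = pooltable[mid];
        if (p < pool.baseAddr)
            hi = mid;             // p lies below this pool
        else if (p >= pool.topAddr)
            lo = mid + 1;         // p lies above this pool
        else
            return pool;          // baseAddr <= p < topAddr
    }
    return null;                  // p is not in any GC pool
}

The search itself is O(log n); the remaining O(m) term comes from scanning block bits inside the pool, which is what recording per-allocation sizes would eliminate.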
Comment #1 by lio+bugzilla — 2008-08-26T09:41:53Z
Some stats. Using bearophile's test program from the original newsgroup post, with n = 100_000_000 and dmd v2.018 -O -inline -release:

Default Phobos:                    10.72 seconds
gc.d line 915 commented out:        4.26 seconds
Line 915 replaced with memcpy:      5.63 seconds

Line 915 is:

# x.ptr[length * sizeelem .. newsize] = argp[0 .. sizeelem];

where both x and argp are byte[]. Why is byte[] = byte[] slower than memcpy? Perhaps that array assignment should also be part of the run-time library, perhaps just using memcpy? (A sketch of the two variants follows.)
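For reference, a minimal sketch of the comparison being made, with the names x, length, sizeelem, and argp mirroring the gc.d fragment quoted above; the helper function is illustrative, not the actual runtime source.

import std.c.string;

// Copies one appended element of size sizeelem into slot `length` of x.
void copyElem(byte[] x, size_t length, size_t sizeelem, byte[] argp)
{
    // original slice copy (gc.d line 915):
    // x.ptr[length * sizeelem .. (length + 1) * sizeelem] = argp[0 .. sizeelem];

    // memcpy replacement measured above:
    memcpy(x.ptr + length * sizeelem, argp.ptr, sizeelem);
}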
Comment #2 by lio+bugzilla — 2008-08-26T10:58:33Z
For the record, when changing the loop to:

# int count = 0;
# for (int i; i < n; i++) a[count++] = i;

it takes 0.43 seconds (same flags and n as before). Adding std.gc.capacity(a.ptr) back into the loop brings it to 2.73 seconds.
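A self-contained sketch of the fast loop measured here, assuming n is known up front so the array can be allocated once and filled by index, with no runtime call per element:

void fill(int n)
{
    auto a = new int[n];       // single allocation up front
    int count = 0;
    for (int i = 0; i < n; i++)
        a[count++] = i;        // plain indexed store: no _d_arrayappendcT,
                               // no std.gc.capacity lookup per element
}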
Comment #3 by 2korden — 2008-08-26T11:08:04Z
Comment #4 by bugzilla — 2008-08-27T01:05:56Z
Why is byte[] = byte[] slower than memcpy? The answer isn't very simple. Consider this program:

import std.c.string;

long timer()
{
    asm
    {
        naked;
        rdtsc;
        ret;
    }
}

void test1(byte[] a, byte[] b)
{
    a[] = b[];
}

void test2(byte[] a, byte[] b)
{
    memcpy(a.ptr, b.ptr, a.length);
}

void main()
{
    for (int i = 4; i < 100_000_000; i *= 2)
    {
        auto a = new byte[i];
        auto b = new byte[i];

        auto start = timer();
        test1(a, b);
        auto end = timer();
        auto r1 = end - start;

        start = timer();
        test2(a, b);
        end = timer();
        auto r2 = end - start;

        printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2);
    }
}

Running this program produces:

i:        4, []=[]:       144, memcpy:       568
i:        8, []=[]:       144, memcpy:       300
i:       16, []=[]:       172, memcpy:       324
i:       32, []=[]:       204, memcpy:       344
i:       64, []=[]:       288, memcpy:       276
i:      128, []=[]:       288, memcpy:       272
i:      256, []=[]:       352, memcpy:       364
i:      512, []=[]:       372, memcpy:       424
i:     1024, []=[]:       552, memcpy:       564
i:     2048, []=[]:       684, memcpy:      1384
i:     4096, []=[]:      1344, memcpy:      1772
i:     8192, []=[]:      2900, memcpy:      3216
i:    16384, []=[]:      5292, memcpy:      5472
i:    32768, []=[]:     11496, memcpy:     10388
i:    65536, []=[]:     29484, memcpy:     27480
i:   131072, []=[]:    110464, memcpy:     67784
i:   262144, []=[]:    655580, memcpy:    562400
i:   524288, []=[]:   1204124, memcpy:   1107256
i:  1048576, []=[]:   2364588, memcpy:   2272552
i:  2097152, []=[]:   4516440, memcpy:   4417764
i:  4194304, []=[]:   8996992, memcpy:   8817176
i:  8388608, []=[]:  20223908, memcpy:  17717748
i: 16777216, []=[]:  35774952, memcpy:  36094652
i: 33554432, []=[]:  71008068, memcpy:  71246896
i: 67108864, []=[]: 142982284, memcpy: 145473300

There's not much of a consistent conclusion to be drawn from that.
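One caveat, not part of the original comment: each size above is timed in a single shot, so cold caches and page faults on the freshly allocated arrays can swamp the difference between the two copies. A best-of-N wrapper, sketched here assuming the rdtsc-based timer() above is in scope, would reduce that noise:

long bestOf(void delegate() run, int reps)
{
    long best = long.max;
    for (int r = 0; r < reps; r++)
    {
        auto start = timer();
        run();                     // the copy under test
        auto elapsed = timer() - start;
        if (elapsed < best)
            best = elapsed;        // keep the cleanest (shortest) run
    }
    return best;
}

// usage: auto r1 = bestOf(delegate() { test1(a, b); }, 10);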
Comment #5 by lio+bugzilla — 2008-08-27T01:15:11Z
You're right. I'll double check my own results tonight.
Comment #6 by lio+bugzilla — 2008-08-27T06:37:53Z
I've checked my results, and memcpy still beats []=[] by a landslide. Here are the results:

Gold (using 'prior knowledge'):
# (cast(int*)x.ptr)[length] = *cast(int*)argp;
4193 ms.

Silver:
# memcpy(x.ptr + length * sizeelem, argp, sizeelem);
5450 ms.

DNF:
# x.ptr[length * sizeelem .. newsize] = argp[0 .. sizeelem];
10270 ms.

I'll attach the .asm files.
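The "gold" variant wins by hard-coding the element size, which suggests letting _d_arrayappendcT dispatch on sizeelem so common sizes compile to a single typed store. A hedged sketch of that idea; the helper name and shape are mine, not the runtime's (unaligned typed stores are tolerated on the x86 target this bug was filed against):

import std.c.string;

void storeElem(byte* dst, void* argp, size_t sizeelem)
{
    switch (sizeelem)
    {
        case 1:  *dst = *cast(byte*)argp;               break;
        case 2:  *cast(short*)dst = *cast(short*)argp;  break;
        case 4:  *cast(int*)dst = *cast(int*)argp;      break;  // the "gold" case
        case 8:  *cast(long*)dst = *cast(long*)argp;    break;
        default: memcpy(dst, argp, sizeelem);           break;  // odd sizes fall back
    }
}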
Comment #7 by lio+bugzilla — 2008-08-27T06:39:09Z
Created attachment 272 assembly for _d_arrayappendcT using memcpy
Comment #8 by lio+bugzilla — 2008-08-27T06:39:43Z
Created attachment 273 assembly for _d_arrayappendcT using the original byte[] copy
Comment #9 by robert.schadek — 2024-12-13T17:48:47Z
THIS ISSUE HAS BEEN MOVED TO GITHUB: https://github.com/dlang/dmd/issues/17774
DO NOT COMMENT HERE ANYMORE; NOBODY WILL SEE IT.