I've just been trying to read zlib-compressed data from a file and decompress it block by block, but it kept throwing a ZlibException, when the same data successfully decompresses if done in one go. I've spent ages reducing the problem.
----------
import std.stream, std.stdio, std.zlib;
const size_t BLOCK_SIZE = 1024;
void main(string[] a) {
scope File file = new File(a[1], FileMode.In);
scope UnCompress uc = new UnCompress;
void[] ucData;
ubyte[] block = new ubyte[BLOCK_SIZE];
while (!file.eof) {
block.length = file.read(block);
writefln(block.length);
ucData ~= uc.uncompress(block);
}
ucData ~= uc.flush();
writefln("Finished: %d", ucData.length);
}
----------
C:\Users\Stewart\Documents\Programming\D\Tests\bugs>rdmd zlib_blocks.d comp.bin
1024
1024
Error: data error
----------
I then found that, if I move the declaration of block inside the while loop (thereby reallocating it each time), then it works.
----------
1024
1024
335
Finished: 16790
----------
Presumably, when the second block is fed to UnCompress, it tries to read data from the first block as well by re-reading the original memory location. But this memory has been overwritten with the second block. In other words, UnCompress is keeping and relying on a reference to memory it doesn't own.
You could argue that this is a limitation of the zlib implementation and it's the caller's responsibility to keep the blocks separate in memory. But such a limitation would have to be documented.
According to a quick test, alternating between two buffers seems to work. Assuming that it does work in the general case, a possible solution is to add something like this to the documentation for std.zlib.UnCompress.uncompress:
"The contents of buf must not be changed between this call and the next call to uncompress. Thus the buffer may not be immediately re-used for the next block of compressed data; however, alternating between two buffers is permissible."
Comment #1 by smjg — 2009-07-19T15:37:37Z
Created attachment 427
Sample compressed data file
This is the data file I used with the testcase program. For the curious ones among you, it's the content extracted from a PNG file's IDAT chunk.
Comment #2 by andrej.mitrovich — 2011-05-24T22:05:17Z
Greetings, I come from the future.
Here's a modern implementation of your sample:
import std.zlib;
import std.stdio;
const size_t BLOCK_SIZE = 1024;
void main(string[] a)
{
auto file = File(a[1], "r");
auto uc = new UnCompress();
void[] ucData;
ubyte[] block = new ubyte[BLOCK_SIZE];
foreach (ubyte[] buffer; file.byChunk(BLOCK_SIZE))
{
writeln(buffer.length);
ucData ~= uc.uncompress(buffer);
}
ucData ~= uc.flush();
writefln("Finished: %s", ucData.length);
}
Still errors out. But I have a hunch it has something to do with buffer being reused by file.byChunk, and zlib might internally be storing a pointer to the buffer while the GC might deallocate it in the meantime.
Something like that, because if you .dup your buffer, you won't get errors anymore:
import std.zlib;
import std.stdio;
const size_t BLOCK_SIZE = 1024;
void main(string[] a)
{
auto file = File(a[1], "r");
auto uc = new UnCompress();
void[] ucData;
ubyte[] block = new ubyte[BLOCK_SIZE];
foreach (ubyte[] buffer; file.byChunk(BLOCK_SIZE))
{
writeln(buffer.length);
ucData ~= uc.uncompress(buffer.dup);
}
ucData ~= uc.flush();
writefln("Finished: %s", ucData.length);
}
It might just be that zlib expects all data passed in to be valid while you use the UnCompress() class. I have no other explanation.
Comment #3 by justin — 2014-07-11T21:21:10Z
The DEFLATE decompression algorithm relies on the results of previous blocks, as it tries to reuse the encoding tree. From the RFC: "Note that a duplicated string reference may refer to a string in a previous block; i.e., the backward distance may cross one or more block boundaries. However a distance cannot refer past the beginning of the output stream." (http://www.w3.org/Graphics/PNG/RFC-1951#huffman)
So I think the bug should be clarified to this function allowing block reuse in the first place.
Comment #4 by justin — 2014-07-11T21:23:59Z
I should note that some amount of block reuse is possible if the blocks are sufficiently distant; the maximum distance for a back reference is 32,768 bytes.
Comment #5 by smjg — 2014-07-11T23:17:28Z
(In reply to Justin Whear from comment #3)
> The DEFLATE decompression algorithm relies on the results of previous
> blocks, as it tries to reuse the encoding tree. From the RFC: "Note that a
> duplicated string reference may refer to a string in a previous block; i.e.,
> the backward distance may cross one or more block boundaries.
If I'm not mistaken, deflate blocks are independent of chunks that the datastream may be split into for decompression. (OK, maybe "block" was the wrong word in my original bug report.)
That said,
(In reply to Justin Whear from comment #4)
> I should note that some amount of block reuse is possible if the blocks are
> sufficiently distant; the maximum distance for a back reference is 32,768
> bytes.
This says to me that it is a window of 32768 bytes (or maybe 32769 or 32770 bytes) that needs to be kept in memory at a given time, regardless of block or chunk boundaries. So why did I find that alternating between buffers works even if the buffers are much smaller than this? (Indeed, I've a vague recollection of finding that it works if each of the buffers is a single byte.)
Comment #6 by alverste — 2016-02-11T12:10:38Z
(In reply to Stewart Gordon from comment #5)
> (In reply to Justin Whear from comment #3)
> > The DEFLATE decompression algorithm relies on the results of previous
> > blocks, as it tries to reuse the encoding tree. From the RFC: "Note that a
> > duplicated string reference may refer to a string in a previous block; i.e.,
> > the backward distance may cross one or more block boundaries.
>
> If I'm not mistaken, deflate blocks are independent of chunks that the
> datastream may be split into for decompression. (OK, maybe "block" was the
> wrong word in my original bug report.)
>
> That said,
> (In reply to Justin Whear from comment #4)
> > I should note that some amount of block reuse is possible if the blocks are
> > sufficiently distant; the maximum distance for a back reference is 32,768
> > bytes.
>
> This says to me that it is a window of 32768 bytes (or maybe 32769 or 32770
> bytes) that needs to be kept in memory at a given time, regardless of block
> or chunk boundaries. So why did I find that alternating between buffers
> works even if the buffers are much smaller than this? (Indeed, I've a vague
> recollection of finding that it works if each of the buffers is a single
> byte.)
To your knowledge, is this fixed in std/zlib or etc/c/zlib?
Or did you use a workaround?
Comment #7 by github-bugzilla — 2017-09-14T16:52:56Z