Bug 3191 – std.zlib.UnCompress errors if buffer is reused

Status
RESOLVED
Resolution
FIXED
Severity
normal
Priority
P2
Component
phobos
Product
D
Version
D1 (retired)
Platform
x86
OS
Windows
Creation time
2009-07-19T15:32:39Z
Last change time
2017-10-16T09:57:39Z
Assigned to
No Owner
Creator
Stewart Gordon

Attachments

ID   Filename  Summary                      Content-Type              Size
427  comp.bin  Sample compressed data file  application/octet-stream  2383

Comments

Comment #0 by smjg — 2009-07-19T15:32:39Z
I've just been trying to read zlib-compressed data from a file and decompress it block by block, but it kept throwing a ZlibException, when the same data successfully decompresses if done in one go. I've spent ages reducing the problem.

----------
import std.stream, std.stdio, std.zlib;

const size_t BLOCK_SIZE = 1024;

void main(string[] a) {
    scope File file = new File(a[1], FileMode.In);
    scope UnCompress uc = new UnCompress;
    void[] ucData;
    ubyte[] block = new ubyte[BLOCK_SIZE];

    while (!file.eof) {
        block.length = file.read(block);
        writefln(block.length);
        ucData ~= uc.uncompress(block);
    }
    ucData ~= uc.flush();
    writefln("Finished: %d", ucData.length);
}
----------
C:\Users\Stewart\Documents\Programming\D\Tests\bugs>rdmd zlib_blocks.d comp.bin
1024
1024
Error: data error
----------

I then found that, if I move the declaration of block inside the while loop (thereby reallocating it each time), then it works.

----------
1024
1024
335
Finished: 16790
----------

Presumably, when the second block is fed to UnCompress, it tries to read data from the first block as well by re-reading the original memory location. But this memory has been overwritten with the second block. In other words, UnCompress is keeping and relying on a reference to memory it doesn't own.

You could argue that this is a limitation of the zlib implementation and it's the caller's responsibility to keep the blocks separate in memory. But such a limitation would have to be documented. According to a quick test, alternating between two buffers seems to work. Assuming that it does work in the general case, a possible solution is to add something like this to the documentation for std.zlib.UnCompress.uncompress:

"The contents of buf must not be changed between this call and the next call to uncompress. Thus the buffer may not be immediately re-used for the next block of compressed data; however, alternating between two buffers is permissible."
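For comparison, the streaming pattern the reporter expected, where the decompressor keeps its own state so the caller's chunk buffer can be reused freely, can be sketched with Python's standard-library zlib module (its decompressobj copies the bytes it needs, so overwriting the input buffer between calls is safe there). The sample data below is a stand-in for the attached comp.bin:

```python
import zlib

# Build a sample compressed stream (stand-in for the attached comp.bin).
original = b"The quick brown fox jumps over the lazy dog. " * 400
compressed = zlib.compress(original)

BLOCK_SIZE = 1024
d = zlib.decompressobj()
out = bytearray()

# Reuse one buffer for every chunk, as the D testcase does.
block = bytearray(BLOCK_SIZE)
pos = 0
while pos < len(compressed):
    n = min(BLOCK_SIZE, len(compressed) - pos)
    block[:n] = compressed[pos:pos + n]   # overwrite the same buffer each time
    pos += n
    out += d.decompress(bytes(block[:n]))
out += d.flush()

assert bytes(out) == original
print("Finished:", len(out))
```

This is only an analogue of the API contract being requested, not the Phobos implementation; the D1 UnCompress of the time evidently held a reference into the caller's buffer instead of copying.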
Comment #1 by smjg — 2009-07-19T15:37:37Z
Created attachment 427 Sample compressed data file This is the data file I used with the testcase program. For the curious ones among you, it's the content extracted from a PNG file's IDAT chunk.
Comment #2 by andrej.mitrovich — 2011-05-24T22:05:17Z
Greetings, I come from the future. Here's a modern implementation of your sample:

import std.zlib;
import std.stdio;

const size_t BLOCK_SIZE = 1024;

void main(string[] a)
{
    auto file = File(a[1], "r");
    auto uc = new UnCompress();
    void[] ucData;
    ubyte[] block = new ubyte[BLOCK_SIZE];

    foreach (ubyte[] buffer; file.byChunk(BLOCK_SIZE))
    {
        writeln(buffer.length);
        ucData ~= uc.uncompress(buffer);
    }
    ucData ~= uc.flush();
    writefln("Finished: %s", ucData.length);
}

Still errors out. But I have a hunch it has something to do with buffer being reused by file.byChunk, and zlib might internally be storing a pointer to the buffer while the GC might deallocate it in the meantime. Something like that, because if you .dup your buffer, you won't get errors anymore:

import std.zlib;
import std.stdio;

const size_t BLOCK_SIZE = 1024;

void main(string[] a)
{
    auto file = File(a[1], "r");
    auto uc = new UnCompress();
    void[] ucData;
    ubyte[] block = new ubyte[BLOCK_SIZE];

    foreach (ubyte[] buffer; file.byChunk(BLOCK_SIZE))
    {
        writeln(buffer.length);
        ucData ~= uc.uncompress(buffer.dup);
    }
    ucData ~= uc.flush();
    writefln("Finished: %s", ucData.length);
}

It might just be that zlib expects all data passed in to be valid while you use the UnCompress() class. I have no other explanation.
Comment #3 by justin — 2014-07-11T21:21:10Z
The DEFLATE decompression algorithm relies on the results of previous blocks, as it tries to reuse the encoding tree. From the RFC: "Note that a duplicated string reference may refer to a string in a previous block; i.e., the backward distance may cross one or more block boundaries. However a distance cannot refer past the beginning of the output stream." (http://www.w3.org/Graphics/PNG/RFC-1951#huffman) So I think the bug should be reframed: the real question is whether this function should permit buffer reuse in the first place.
Comment #4 by justin — 2014-07-11T21:23:59Z
I should note that some amount of block reuse is possible if the blocks are sufficiently distant; the maximum distance for a back reference is 32,768 bytes.
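The 32,768-byte back-reference window described above belongs to the decompressor's own internal state, not to the caller's input buffers, which is why a correct streaming decompressor accepts input in arbitrarily small pieces. A sketch using Python's stdlib zlib, feeding a stream whose back references span far more than any single input chunk, one byte at a time:

```python
import zlib

# Highly repetitive data: deflate emits back references that
# cross every conceivable input-chunk boundary.
original = b"abcdefgh" * 8192          # 64 KiB of output
compressed = zlib.compress(original, 9)

d = zlib.decompressobj()
# Feed the compressed stream one byte at a time; the decompressor
# maintains the sliding window internally.
out = b"".join(d.decompress(compressed[i:i + 1])
               for i in range(len(compressed)))
out += d.flush()

assert out == original
```

This matches the observation in comment #5 that even single-byte buffers can work: chunk size is irrelevant as long as the decompressor does not hold pointers into the caller's buffer between calls.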
Comment #5 by smjg — 2014-07-11T23:17:28Z
(In reply to Justin Whear from comment #3)
> The DEFLATE decompression algorithm relies on the results of previous
> blocks, as it tries to reuse the encoding tree. From the RFC: "Note that a
> duplicated string reference may refer to a string in a previous block; i.e.,
> the backward distance may cross one or more block boundaries.

If I'm not mistaken, deflate blocks are independent of chunks that the datastream may be split into for decompression. (OK, maybe "block" was the wrong word in my original bug report.)

That said,

(In reply to Justin Whear from comment #4)
> I should note that some amount of block reuse is possible if the blocks are
> sufficiently distant; the maximum distance for a back reference is 32,768
> bytes.

This says to me that it is a window of 32768 bytes (or maybe 32769 or 32770 bytes) that needs to be kept in memory at a given time, regardless of block or chunk boundaries. So why did I find that alternating between buffers works even if the buffers are much smaller than this? (Indeed, I've a vague recollection of finding that it works if each of the buffers is a single byte.)
Comment #6 by alverste — 2016-02-11T12:10:38Z
(In reply to Stewart Gordon from comment #5)
> (In reply to Justin Whear from comment #3)
> > The DEFLATE decompression algorithm relies on the results of previous
> > blocks, as it tries to reuse the encoding tree. From the RFC: "Note that a
> > duplicated string reference may refer to a string in a previous block;
> > i.e., the backward distance may cross one or more block boundaries.
>
> If I'm not mistaken, deflate blocks are independent of chunks that the
> datastream may be split into for decompression. (OK, maybe "block" was the
> wrong word in my original bug report.)
>
> That said,
> (In reply to Justin Whear from comment #4)
> > I should note that some amount of block reuse is possible if the blocks
> > are sufficiently distant; the maximum distance for a back reference is
> > 32,768 bytes.
>
> This says to me that it is a window of 32768 bytes (or maybe 32769 or 32770
> bytes) that needs to be kept in memory at a given time, regardless of block
> or chunk boundaries. So why did I find that alternating between buffers
> works even if the buffers are much smaller than this? (Indeed, I've a vague
> recollection of finding that it works if each of the buffers is a single
> byte.)

To your knowledge, is this fixed in std/zlib or etc/c/zlib? Or did you use a workaround?
Comment #7 by github-bugzilla — 2017-09-14T16:52:56Z
Commits pushed to master at https://github.com/dlang/phobos

https://github.com/dlang/phobos/commit/5cf20bd8773e0f746c74b19137a03d699cdfe28b
Fixed issues 3191 and 9505

std.zlib.UnCompress.uncompress() now consumes as much of the input buffer as possible and extends / reallocates the output buffer accordingly. It also sets inputEnded = 1 when Z_STREAM_END is returned from inflate() so that additional data after the compressed stream is not consumed.

https://github.com/dlang/phobos/commit/8e47bfc54c106b222835db3c3a37e81b52ab2f04
Merge pull request #5720 from kas-luthor/fix-zlib

Fix zlib issues 3191, 9505 and 8779
merged-on-behalf-of: MetaLang <[email protected]>
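The contract the commit describes, consuming as much input as possible and stopping at Z_STREAM_END so that bytes after the compressed stream are left untouched, has a direct stdlib analogue in Python's zlib.decompressobj, which exposes the same two outcomes as eof and unused_data. A minimal sketch of that contract (not the Phobos code itself):

```python
import zlib

original = b"hello, zlib streaming" * 100
# Append junk after the compressed stream, as a file might contain.
stream = zlib.compress(original) + b"TRAILING BYTES"

d = zlib.decompressobj()
out = d.decompress(stream)

assert out == original
assert d.eof                                # Z_STREAM_END was reached
assert d.unused_data == b"TRAILING BYTES"   # bytes after the stream are not consumed
```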
Comment #8 by github-bugzilla — 2017-10-16T09:57:39Z