Bug 1832 – reading/writing an archive causes data loss; std.zip horribly broken

Status
RESOLVED
Resolution
FIXED
Severity
regression
Priority
P2
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2008-02-13T19:09:00Z
Last change time
2014-02-16T12:07:26Z
Keywords
bounty
Assigned to
nobody
Creator
default_357-line

Attachments

IDFilenameSummaryContent-TypeSize
1290patch.txtMy patch as explained in bug's commenttext/plain2840

Comments

Comment #0 by default_357-line — 2008-02-13T19:09:13Z
Consider, if you will, the following source. import std.zip, std.file; void main() { auto arch = new ZipArchive; auto test = new ArchiveMember; test.name = "foob"; test.expandedData = [cast(ubyte) 2, 3, 4, 5]; arch.addMember(test); auto data = arch.build; write("test.zip", data); auto arch2 = new ZipArchive(data); // arch2.expand(arch2.directory["foob"]); // this fixes it auto data2 = arch2.build; write("test2.zip", data2); // test2.zip contains a single file, "foob", zero bytes. It's horribly empty in there. } It seems when Phobos builds the archive data without decompressing in-between, it doesn't consider the already existing compressed data to be of any particular relevance and just recompresses the (empty) uncompressed data. This bug just cost me two days of work. Glee! --feep
Comment #1 by advmail — 2013-11-16T15:13:36Z
It seems it's a bug on build() function. build() function assumes that all ArchiveEntry as expandedData to be compressed. So it cycle over elements and compress expandedData to compressedData. If one of ArchiveEntry is just compressed and never expanded, expandedData will be empty, and build() tries to compress empty data. I've fixed it in this way: For each element I check if compressedSize > 0. If true it means that we have compressed data available, so we can copy it from buffer to field .compressedData and we can skip compression for that element (usually it read data from .expandedData, compress it and then copy to .compressedData). Moreover we have not to check for expandedSize or CRC: we just have them from header and we can't rely on expandedData field that is empty. IMHO std.zip design it's not perfect. Every element copy compressData[] from "global" data[] (so we have memory filled x 2). Moreover it's not useful to store expandedData[] inside every element. It could be memory expensive (hey, we're trying go compress it!), and probably it's just a copy of data we store elsewhere. I think we should compress data on the fly, and don't store internally expandedData. And that we just have to store compressedData one time.
Comment #2 by advmail — 2013-11-16T15:54:04Z
Created attachment 1290 My patch as explained in bug's comment See previous comment
Comment #3 by andrei — 2013-11-16T16:58:10Z
Please redo your patch as a pull request at https://github.com/D-Programming-Language/. Thanks!
Comment #4 by advmail — 2013-11-17T00:53:12Z
(In reply to comment #3) > Please redo your patch as a pull request at > https://github.com/D-Programming-Language/. Thanks! https://github.com/D-Programming-Language/phobos/pull/1697
Comment #5 by github-bugzilla — 2013-11-21T20:04:42Z
Commits pushed to master at https://github.com/D-Programming-Language/phobos https://github.com/D-Programming-Language/phobos/commit/95af4e3c8c78020109af75c98ed5478c1b813ba7 Fixed phobos bug 1832 https://github.com/D-Programming-Language/phobos/commit/289c10521ae8f4c36cf00cbf71dac8ae9265809e Merge pull request #1697 from trikko/master Issue 1832 - reading/writing an archive causes data loss; std.zip horribly broken
Comment #6 by bugzilla — 2013-11-24T15:52:09Z
Resolved for D2, not for D1.
Comment #7 by code — 2013-12-05T20:46:34Z
I now get a segfault for the following code. import std.zip; void main() { auto a1 = new ZipArchive(); auto m1 = new ArchiveMember(); m1.expandedData = new ubyte[](10); a1.addMember(m1); a1.build(); auto a2 = new ZipArchive(); a2.addMember(m1); a2.build(); // segfaults }
Comment #8 by advmail — 2013-12-06T00:47:18Z
This is another bug. m1 is owned by a1. m1._compressedData refers to data inside a1. So you're mixing data from two different objects/archives. I don't think it's a good idea. When you build a2, it tries to read compressed data from a2 data, but it should read from a1. There's a simply workaround, but again, there's a problem on library design. If m1 is just compressed, we can read content it has inside _compressedData, if any. It filled it before with right reference to a1 data. This solve your bug, but probalby don't solve the mixing problem and of course it's not an elegant solution! (In reply to comment #7) > I now get a segfault for the following code. > > import std.zip; > > void main() > { > auto a1 = new ZipArchive(); > auto m1 = new ArchiveMember(); > m1.expandedData = new ubyte[](10); > a1.addMember(m1); > a1.build(); > auto a2 = new ZipArchive(); > a2.addMember(m1); > a2.build(); // segfaults > }
Comment #9 by code — 2013-12-06T01:07:58Z
(In reply to comment #8) > This is another bug. m1 is owned by a1. m1._compressedData refers to data > inside a1. So you're mixing data from two different objects/archives. I don't > think it's a good idea. When you build a2, it tries to read compressed data > from a2 data, but it should read from a1. > I see. > There's a simply workaround, but again, there's a problem on library design. If > m1 is just compressed, we can read content it has inside _compressedData, if > any. It filled it before with right reference to a1 data. This solve your bug, > but probalby don't solve the mixing problem and of course it's not an elegant > solution! > I think m1._compressedData is simply a slice to (constant) GC memory which was allocated when reading a1. So here shouldn't be any problem to copy this data into a new zip archive.
Comment #10 by advmail — 2013-12-06T01:22:46Z
(In reply to comment #9) > (In reply to comment #8) > > This is another bug. m1 is owned by a1. m1._compressedData refers to data > > inside a1. So you're mixing data from two different objects/archives. I don't > > think it's a good idea. When you build a2, it tries to read compressed data > > from a2 data, but it should read from a1. > > > I see. > > > There's a simply workaround, but again, there's a problem on library design. If > > m1 is just compressed, we can read content it has inside _compressedData, if > > any. It filled it before with right reference to a1 data. This solve your bug, > > but probalby don't solve the mixing problem and of course it's not an elegant > > solution! > > > I think m1._compressedData is simply a slice to (constant) GC memory which was > allocated when reading a1. So here shouldn't be any problem to copy this data > into a new zip archive. Yes it is. But we're going to use data compressed/stored in/owned by another archive. That seems a mistake to me. I know we can do it, of course. Tonight I'll fix it then.
Comment #11 by advmail — 2013-12-06T12:54:09Z
I think it's not so easy to fix this "cross-archive" member management. Consider for example expand() function. It uses data[] from ZipArchive, but uses offset from ArchiveMember, that was created from another data[]. BTW it's not the same bug as the original one, I guess. My vote goes for a full replacement of std.zip
Comment #12 by code — 2014-01-12T12:11:55Z
(In reply to comment #11) > I think it's not so easy to fix this "cross-archive" member management. The regression was fixed with https://github.com/D-Programming-Language/phobos/pull/1775.
Comment #13 by mk — 2014-01-12T12:59:44Z
Was this really fixed ? I still get zero length data (code from comment #0) mk@bid:~/work/bugs$ unzip -lv ziptest1.zip Archive: ziptest1.zip Length Method Size Cmpr Date Time CRC-32 Name -------- ------ ------- ---- ---------- ----- -------- ---- 4 Stored 4 0% 1980-00-00 00:00 9d0d9845 ziptest.bin -------- ------- --- ------- 4 4 0% 1 file mk@bid:~/work/bugs$ unzip -lv ziptest2.zip Archive: ziptest2.zip Length Method Size Cmpr Date Time CRC-32 Name -------- ------ ------- ---- ---------- ----- -------- ---- 4 Stored 0 100% 1980-00-00 00:00 00000000 ziptest.bin -------- ------- --- ------- 4 0 100% 1 file
Comment #14 by andrej.mitrovich — 2014-02-16T05:56:28Z
Comment #15 by github-bugzilla — 2014-02-16T12:02:03Z
Commits pushed to master at https://github.com/D-Programming-Language/phobos https://github.com/D-Programming-Language/phobos/commit/1a3f9ab273dd6280cb9b4eda4a0da50d8c89fa88 fix Issue 1832 - rebuilding ZipArchive looses data - Already need to set compressedData slice when reading the directory index not only during expand. https://github.com/D-Programming-Language/phobos/commit/4bdd5f2c6ada5bc0ed37214f0e9d017f50ca8a83 Merge pull request #1870 from MartinNowak/fix1832 fix Issue 1832 - rebuilding ZipArchive looses data
Comment #16 by andrej.mitrovich — 2014-02-16T12:07:26Z
Marking as D2 only since Phobos D1 bugs will not be fixed.