Sunday 8 July 2012

lrzip-0.613

lrzip 0.612 has been out in the wild for a while now and the good news is that there have been very few bug reports in that time. After allowing enough issues to accumulate in my inbox, I've created a pure-bugfix maintenance release in version 0.613:

long-range-zip 0.613

One bug of note was that the md5 calculation was wrong on files that had compressed blocks greater than 4GB in size. This looked very much like a 32 bit overflow error. Indeed, Serge Belyshev did some excellent detective work and found the culprit in the glibc implementation of md5, which lrzip uses. This only affects users of the md5 library components, not the md5sum command line utility, which uses a different rolling algorithm, so glibc userspace never hit it. The bug in question was amusing in the way it shows one of the many naive ways we dealt with 32 bit limitations in the past. It assumed anything larger than a 32 bit chunk was just 2^31 + (chunk size modulo 2^31). That means it would never work with a chunk larger than 2^32. The fix has been pushed upstream and is now incorporated into lrzip.
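
To make the flavour of that kind of bug concrete, here is a small Python sketch. It is not the glibc code, just an illustration of length bookkeeping kept in two 32 bit words where the update assumes the low word can wrap at most once. The function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF  # keeps a value inside an unsigned 32 bit word


def add_length_naive(lo, hi, nbytes):
    """Add nbytes to a (lo, hi) pair of 32 bit counters, allowing only one carry."""
    new_lo = (lo + nbytes) & MASK32  # bits of nbytes above the low word are dropped
    if new_lo < nbytes:              # read as "the low word wrapped exactly once"
        hi = (hi + 1) & MASK32
    return new_lo, hi


def add_length_wide(lo, hi, nbytes):
    """Same bookkeeping, but every bit above the low word is carried into hi."""
    total = ((hi << 32) | lo) + nbytes
    return total & MASK32, (total >> 32) & MASK32


if __name__ == "__main__":
    big = 5 * 2**32 + 7  # a single block a little over 20GB
    print(add_length_naive(0, 0, big))
    print(add_length_wide(0, 0, big))
```

For small additions the two functions agree. For the oversized block the naive version records a high word of 1 where 5 is needed, so any checksum that folds the total length into its final block comes out wrong.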

Another bug, reported on this blog by a commenter, was that very small archives (less than 64 bytes) came out corrupt. This has been fixed by disabling the back end compression when the chunk is less than 64 bytes and just using the rzip first stage.
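
The shape of that fix can be sketched roughly as follows. This is Python rather than lrzip's own code, zlib stands in for whichever back end is selected, and the names MIN_BACKEND_SIZE, pack_chunk and unpack_chunk are hypothetical.

```python
import zlib

MIN_BACKEND_SIZE = 64  # threshold from the post; the constant name is made up


def pack_chunk(data):
    """Return (used_backend, payload) for one chunk of first-stage output."""
    if len(data) < MIN_BACKEND_SIZE:
        # Too small to be worth a second pass: store the first-stage bytes as-is.
        return False, data
    return True, zlib.compress(data)


def unpack_chunk(used_backend, payload):
    """Reverse pack_chunk using the flag stored alongside the payload."""
    return zlib.decompress(payload) if used_backend else payload
```

The important part is that the flag travels with the chunk, so the decompressor never has to guess whether a back end was applied.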

A lot of the other work in this release was just getting it to compile on osx. Numerous issues showed up as always, and I didn't have access to an osx machine on the previous release to fix them. This time I used my wife's laptop ;) . One of the issues, for example, was that osx didn't see itself as #ifdef unix, which I thought was a little amusing. Another surprise was that the default osx filesystem is not case sensitive, which caused a conflict between lrzip.h and Lrzip.h. Alas I have no other BSDs to try compiling it on, so I'm not sure if they're fixed with this.
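
The filename clash is easy to check for ahead of time. A minimal sketch, assuming that folding names with Python's casefold() is a close enough stand-in for how a case-insensitive filesystem compares them (the function name is made up):

```python
from collections import defaultdict


def case_collisions(names):
    """Group file names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Only groups with more than one spelling would clash on such a filesystem.
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(case_collisions(["lrzip.h", "Lrzip.h", "rzip.c"]))
```

Running something like this over a source tree before a release would flag pairs such as lrzip.h and Lrzip.h without needing a Mac to hand.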

Interestingly, I still have to disable md5 calculation on the osx build. The md5 is calculated the same on compression and decompression within lrzip, but it disagrees with the result returned from the ports version of md5! This defeats the whole purpose of including md5 in it since the point of it is to have a command line result to compare to. I'm guessing there's an endianness dispute there somewhere and haven't ever tracked it down, since osx has done an endian flip in the past. lrzip still uses crc32 checking of each block internally so it's not like there isn't any integrity checking.
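
For anyone wanting a third opinion when two md5 results disagree, a chunked md5 using Python's standard hashlib module is one option. This is only a cross-check sketch, not how lrzip computes its md5, and the helper names are made up.

```python
import hashlib


def md5_of_chunks(chunks):
    """Feed an iterable of byte strings into one md5 and return the hex digest."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size pieces so large files never sit in memory at once."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))
```

The hex string from md5_of_file can be compared directly with what the command line tools print for the same file.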

Finally, what would a release be without some new benchmarks? Nothing performance-wise has changed in lrzip since the last version, but I have access to a 12 thread CPU machine with 32GB of RAM now, so I did some quick benchmarks with the classic 10GB virtual image I've been using till now.


Compression  Size           Percentage  Compress Time  Decompress Time
None         10737418240     100.0
gzip          2772899756      25.8       3m56s          2m15s
pbzip2        2705814394      25.2       1m41s          1m46s
lrzip         1095337763      10.2       2m54s          2m21s

Note that with enough RAM and CPU, lrzip is actually faster than gzip (which does compression in place) at compression and comparable on decompression, despite a huge increase in compression. pbzip2 is faster than both but its compression is almost no better than gzip.
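
The percentage column and the timing comparison can be recomputed from the raw numbers in the table. A short sketch, with the figures copied from the table and helper names made up for the example:

```python
ORIGINAL_SIZE = 10737418240  # the uncompressed 10GB image, in bytes

# name: (compressed size in bytes, compress time, decompress time)
RESULTS = {
    "gzip": (2772899756, "3m56s", "2m15s"),
    "pbzip2": (2705814394, "1m41s", "1m46s"),
    "lrzip": (1095337763, "2m54s", "2m21s"),
}


def percent_of_original(size):
    """Compressed size as a percentage of the original, to one decimal place."""
    return round(100 * size / ORIGINAL_SIZE, 1)


def to_seconds(text):
    """Convert a time written like '3m56s' into whole seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


if __name__ == "__main__":
    for name, (size, comp, decomp) in RESULTS.items():
        print(name, percent_of_original(size), to_seconds(comp), to_seconds(decomp))
```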

3 comments:

  1. Hmm, lrztar in version 0.613 gives me an error:

    lrztar -Uz test_directory
    Cannot have -o and -O or -S together
    Fatal error - exiting

    although lrztar from 0.612 worked perfectly.

  2. Darn, did I break it? I guess I should rush a .614 release when I can. Thanks for the bug report.

  3. No problem. Oh, and good job with the BFS and BFQ. I just can't imagine living without them.
