Comparing Compression
Do you benchmark compression tools (like xz or zstd) on your own data,
or do you rely on common wisdom? The best result for an uncompressed
300MB XFS image from the previous post was achieved by bzip2, which is
rarely used nowadays. How does one quickly check a chunk of data
against N popular compressors?
E.g., an unpacked tarball of Emacs 29.2 source code consists of 6791
files with a total size of 276MB. If you were to distribute it as a
.tar.something archive, which compression tool would be the optimal
choice? We can easily write a small utility that answers this
question.
$ ./comprtest ~/opt/src/emacs/emacs-29.2 | tee table
tar: Removing leading `/' from member names
szip 0.59 56.98 126593557
gzip 9.21 72.70 80335332
compress 3.57 57.45 125217137
bzip2 17.28 78.08 64509672
rzip 17.61 79.50 60336377
lzip 113.61 81.67 53935898
lzop 0.67 57.14 126121462
xz 111.03 81.89 53295220
brotli 13.10 78.14 64336399
zstd 1.13 73.77 77179446
comprtest is a 29 LOC shell script. The 2nd column indicates time in
seconds, the 3rd displays space saving in % (higher is better), and
the 4th shows the resulting size in bytes.
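The space-saving column is just 100*(1 - compressed/original); e.g.,
with made-up sizes:

$ echo 1000000 300000 | awk '{print 100*(1-$2/$1)}'
70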
Then we can sort the table by the 3rd column and draw a bar chart:
$ sort -nk3 table | cpp -P plot.gp | gnuplot -persist
If you're wondering how the C preprocessor suddenly became part of
the pipeline, read on.
comprtest expects either a file as an argument or a directory (in
which case it creates a plain .tar of it first; a sketch of that step
follows the example below). Additional optional arguments specify
which compressors to use:
$ ./comprtest /usr/libexec/gdb gzip brotli
gzip 0.60 61.17 6054706
brotli 1.17 65.84 5325408
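The elided preamble that sets up $input, $isize and $output might look
roughly like this (a sketch with hypothetical variable names; the real
script's details may differ):

if [ -d "$1" ]; then
    input=`mktemp`              # tar the directory into a temp file first
    tar -cf "$input" "$1"
else
    input=$1
fi
isize=`wc -c < "$input"`        # original size, feeds the % column
output=`mktemp`
shift                           # any remaining args select compressors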
The gist of the script involves looping over a list of
compressors:
archivers='szip gzip compress bzip2 rzip lzip lzop xz brotli zstd'
…
for c in ${@:-$archivers}; do
    echo $c
    # a few tools deviate from the usual gzip-style "-c file > out" CLI
    case $c in
        szip   ) args='< "$input" > $output' ;;
        rzip   ) args='-k -o $output "$input"' ;;
        brotli ) args='-6 -c "$input" > $output' ;;
        *      ) args='-c "$input" > $output' ;;
    esac
    # print wall-clock seconds of the compression run
    eval "time -p $c $args" 2>&1 | awk '/real/ {print $2}'
    osize=`wc -c < "$output"`
    # space saving in % (a zero-size input counts as 0% saved)
    echo $isize $osize | awk '{print 100*(1-$2/($1==0?$2:$1))}'
    echo $osize
    rm "$output"
done | xargs -n4 printf "%-8s %11.2f %6.2f %15d\n"
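Each loop iteration emits four lines per compressor (name, time, space
saving, size); xargs -n4 then folds every four lines back into a
single printf-formatted row, e.g.:

$ printf 'gzip\n9.21\n72.70\n80335332\n' | xargs -n4 printf "%-8s %11.2f %6.2f %15d\n"
gzip            9.21  72.70        80335332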
- Not every archive tool has a gzip-compatible CLI.
- We use the default compression level for each tool, with the
exception of brotli, as its default level 11 is excruciatingly slow.
- szip is an interface to the Snappy algorithm. Your distro probably
doesn't have it in its repos, hence run cargo install szip. Everything
else should be available via dnf/apt.
Bar charts are generated by a gnuplot script:
$ cat plot.gp
$data <<E
#include "/dev/stdin"
E
set key tmargin
set xtics rotate by -30 left
set y2tics
set ylabel "Seconds"
set y2label "%"
set style data histograms
set style fill solid
plot $data using 2 axis x1y1 title "Time", \
"" using 3:xticlabels(1) axis x1y2 title "Space saving"
Here is where the C preprocessor comes in handy: the plot command
reads the data twice (once per ordinate), which gnuplot cannot do when
the data arrives on stdin, since a pipe cannot be rewound. Injecting
the data as a "datablock" via #include sidesteps that.
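To watch the preprocessing step in isolation, feed cpp a couple of
fake data rows (any whitespace-separated columns will do):

$ printf 'a 1 2\nb 3 4\n' | cpp -P plot.gp | head -4
$data <<E
a 1 2
b 3 4
E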
To demonstrate that xz is not always the best choice, I also
benchmarked a tarball of XML files (314MB):
$ ./comprtest ~/Downloads/emacs.stackexchange.com.tar
szip 0.59 63.70 119429565
gzip 7.18 77.59 73724710
compress 4.03 67.17 108015563
bzip2 21.37 83.36 54751478
rzip 17.42 85.93 46304199
lzip 119.70 85.06 49151518
lzop 0.67 63.63 119667058
xz 125.80 85.55 47559464
brotli 13.56 82.52 57509978
zstd 1.07 79.40 67766890
Tags: IT
Authors: ag