2023-04-07
software
Input: a 10GB file containing 10^10 "a" characters. Your task is to compress it as much as you can. How hard can it be? How do the well-known (and mostly-known) generic compression utilities perform? What's the minimum you can do? What's the best we can do with the a-priori knowledge? What does this tell us? Read on for a small analysis.
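For reference, here is one way to produce such an input file. This is a minimal Python sketch; the `a.txt` filename and the chunk size are my own choices:

```python
# Generate the input: a 10 GB file of 10^10 "a" characters,
# written in fixed-size chunks to keep memory use small.
CHUNK = 10**7            # 10 MB per write
TOTAL = 10**10           # 10^10 bytes overall

with open("a.txt", "wb") as f:
    buf = b"a" * CHUNK
    for _ in range(TOTAL // CHUNK):
        f.write(buf)
```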
Let’s naively compress this file with a few tools, using their default parameters as well as their "maximum compression" flags. The results, on a machine with an SSD, 4 Xeon CPU cores and plenty of RAM, are:
Compressor | Output size (bytes) | Ratio | Compression time (mm:ss) | Compression memory (MB) | Best output size (bytes) | Best ratio | Best compression time (mm:ss) | Best compression memory (MB)
---|---|---|---|---|---|---|---|---
ZIP | 9.704.980 | 99,90295% | 1:22 | ~2 | 9.704.986 | 99,90295% | 1:22 | ~2
gz | 9.704.732 | 99,90295% | 1:21 | ~1 | 9.704.732 | 99,90295% | 1:21 | ~1
7z | 1.467.206 | 99,98533% | 0:56 (4) | ~473 | 1.458.769 | 99,98541% | 0:50 | ~1600
xz | 1.454.652 | 99,98545% | 4:00 | ~97 | 1.454.652 | 99,98545% | 6:36 | ~625
kgb | 1.369.711 | 99,98630% | 324:51 (!) | ~12 | N/A | N/A | N/A | N/A
RAR (5.5) | 517.323 | 99,99483% | 2:58 | ~433 | 517.322 | 99,99483% | 2:50 | ~97
ARJ (3.10) | 51.544 | 99,99948% | 1:43 | ~2 | 51.543 | 99,99948% | 1:49 | ~2
bz2 (lbzip2) | 377.820 | 99,99622% | 0:35 | ~4 | 377.820 | 99,99622% | 0:19 | ~12
bz2 | 6.995 | 99,99993% | 1:56 | ~9 | 6.995 | 99,99993% | 1:54 | ~9
There are a few things to note here:

- kgb's "maximum compression" run is listed as N/A because it crashed with an assertion failure: `kgb_arch_posix_by_slawek.cpp:1195: int Mixer::predict(int): Assertion 'sum>0' failed`
Since fiddling with compression parameters didn’t help, I was curious how much entropy the output files contain. It turns out that, in some cases, they are pretty repetitive. So let’s try to compress the compressed outputs again, with the same tool, and see how far we get!
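As a quick aside, the "how much entropy" question can be approximated from byte frequencies. A minimal sketch, assuming one of the compressed outputs is named `a.txt.gz`; keep in mind that byte-level entropy misses longer-range repetition, which is exactly what a second compression pass can exploit:

```python
import math
from collections import Counter

def byte_entropy(path):
    """Estimate the Shannon entropy of a file, in bits per byte."""
    counts, total = Counter(), 0
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):   # read 1 MB at a time
            counts.update(chunk)
            total += len(chunk)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Close to 8 bits/byte: looks random. Much lower: repetitive and compressible.
print(byte_entropy("a.txt.gz"))
```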
Most of these tools refuse to touch files that appear to be already compressed (i.e. already carry the tool’s own extension), so some renaming was in order. Memory/CPU use and compression times are not very exciting this time, so I’m ignoring them.
I also verified that these double-compressed files, after uncompressing once, result in the original single-compressed file. They do.
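The check itself is just a byte-for-byte comparison; something like the following, where the file names are placeholders:

```python
import filecmp

# "a.txt.bz2" is the single-compressed file; "roundtrip/a.txt.bz2" is the
# double-compressed file after one round of decompression.
# shallow=False forces a byte-by-byte comparison.
assert filecmp.cmp("a.txt.bz2", "roundtrip/a.txt.bz2", shallow=False)
```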
Compressor | Second-round output size (bytes)
---|---
ZIP | 9.705.152 |
xz | 1.454.652 |
kgb | 1.368.801 |
gz | 24.674 |
gz (third compression) | 1.161 |
7z | 878 |
RAR (5.5) | 507 |
ARJ (3.10) | 462 |
bz2 | 207 |
There are several observations here:
Since we know exactly what the output (original file) is, can we write a program that outputs just that? Of course we can! In the worst case our program would just contain the original file, which is not very novel. In this particular case the output is trivial, so the code can be trivial as well. In particular, in Python:
print("a"*10**10, end="")
That is 25 characters (26 with a trailing newline). There’s still a lot of syntactic overhead here.
I’m sure there’s a language out there in which the following is syntactically correct:
"a"*10^10
If so, the code ("compressed") size is 9 bytes. That’s hard to beat!
In this trivial example the main takeaway is that if you know your input well in advance, and can design a compressor for that specific input (or a class of inputs of that kind), then you can do much better than general compression algorithms. That’s not surprising of course, but the real question is whether one can build a compressor that, using whatever it takes to analyse its input, does best.
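As a toy illustration, here is what such a special-purpose compressor could look like for the class of "one repeated byte" inputs. This is just a run-length-encoding sketch, not any of the tools above; it produces 9 bytes of output, the same size as the hypothetical `"a"*10^10` program:

```python
import struct

def compress_single_char(path):
    """If the file is one repeated byte, encode it as (byte, count): 9 bytes.
    Returns None for any other input, i.e. this 'compressor' only handles
    the class of inputs it was designed for."""
    with open(path, "rb") as f:
        first = f.read(1)
        if not first:
            return None
        count = 1
        while chunk := f.read(1 << 20):
            if chunk.count(first[0]) != len(chunk):
                return None
            count += len(chunk)
    return first + struct.pack(">Q", count)   # 1 byte + 8-byte big-endian count

def decompress_single_char(blob):
    """Inverse: expand (byte, count) back into the original content.
    For a 10 GB file you would stream this out instead of building it in RAM."""
    count = struct.unpack(">Q", blob[1:])[0]
    return blob[:1] * count
```

Both this sketch and the 9-character program above beat every general-purpose tool in the tables, precisely because they bake in prior knowledge of the input.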
One fascinating effort in this space is the Hutter Prize, where the task is to compress a chunk of human knowledge (essentially text) as well as you can, by building a compressor specifically for that text. You win a prize if you beat the previous best result. If you’ve read this far, I urge you to check it out!