Compressing A Trivial File

2023-04-07
software

Input: a 10GB file containing 10^10 "a" characters. Your task is to compress it as much as you can. How hard can it be? How do the well-known (and mostly-known) generic compression utilities perform? What's the minimum you can do? What's the best we can do with the a-priori knowledge? What does this tell us? Read on for a small analysis.

The naive approach

Let’s just naively compress these with a few tools, with their default parameters, as well as with "maximum compression" flags. The results on a machine with SSD, 4 Xeon CPU cores and enough RAM is:

Compressor Output size Ratio Compression time (mm:ss) Compression memory (MB) Best output size Best ratio Best compression time (mm:ss) Best compression memory (MB)
ZIP 9.704.980 99,90295% 1:22 ~2 9.704.986 99,90295% 1:22 ~2
gz 9.704.732 99,90295% 1:21 ~1 9.704.732 99,90295% 1:21 ~1
7z 1.467.206 99,98533% 0:56 (4) ~473 1.458.769 99,98541% 0:50 ~1600
xz 1.454.652 99,98545% 4:00 ~97 1.454.652 99,98545% 6:36 ~625
kgb 1.369.711 99,98630% 324:51 (!) ~12 N/A N/A N/A N/A
RAR (5.5) 517.323 99,99483% 2:58 ~433 517.322 99,99483% 2:50 ~97
ARJ (3.10) 51.544 99,99948% 1:43 ~2 51.543 99,99948% 1:49 ~2
bz2 (lbzip2) 377.820 99,99622% 0:35 ~4 377.820 99,99622% 0:19 ~12
bz2 6.995 99,99993% 1:56 ~9 6.995 99,99993% 1:54 ~9

There are a few things to note here:

  • The parameters I used for maximum compression are the following. Granted, likely not all the things that are available.
    • lbzip2 -n4 (uses 4CPUs)
    • bzip2 –best
    • arj -jm
    • rar -m5 -mdg
    • xz -9e
    • 7z -mmt4 -mx9
    • gzip -9
    • zip -9
    • kgb -9
  • (4) means the process used 4 CPU cores
  • (!) means just that: "wow!"
  • 7z and lbzip2 use 4 (all?) CPUs out of the box, the others are single threaded
  • kgb takes its time, the rest are in the same ballpark for execution time
  • kgb with the additional flags aborted with an internal error (kgb: kgb_arch_posix_by_slawek.cpp:1195: int Mixer::predict(int): Assertion 'sum>0' failed)
  • bz2 wins for output size by a far margin - but see below
  • I could not convince lbzip2 to use only one CPU. The output is bigger than bzip2 (unexpected) - though the processing time decreased (expected).
  • In general fiddling with compression flags didn't do much, except:
    • rar: no change (less memory use though?)
    • xz: no change (only more CPU+memory)
    • 7z: negligible change (faster, much more memory use)

Do it again!

Since fiddling with compression parameters didn’t help, I was curious how much entropy the output files contain. It turns out that [in some cases] they are pretty repetitive. So, let’s try to compress the compressed outcomes again - with the same tool - to see how far we get!

Most of these tools refuse to touch files that are seemingly already compressed (i.e. have the same extension) so some renaming was in order. Memory/CPU use and compression time are not very exciting now, so I’m ignoring those.

I also verified that these double-compressed files, after uncompressing once, result in the original single-compressed file. They do.

Compressor Second round output size
ZIP 9.705.152
xz 1.454.652
kgb 1.368.801
gz 24.674
gz (third compression) 1.161
7z 878
RAR (5.5) 507
ARJ (3.10) 462
bz2 207

There are several observations here:

  • zip, xz and kgb are unable to do better
  • The others achieve a lot again, meaning in the first run they didn’t really do their best
  • gz in particular benefits from a third round as well!
  • 7z, rar, arj, bz2 achieve results that are probably close to the optimum for general purpose compressors

Python to the rescue!

Since we know exactly what the output (original file) is, can we write a program that outputs just that? Of course we can! In the worst case our program would just contain the original file, which is not very novel. In this particular case the output is trivial, so the code can be trivial as well. In particular, in Python:

print("a"*10**10, end="")

That is 25 characters (26, with a new line at the end). There’s a lot of overhead here.

Doing even better

I’m sure there’s a language there in which the following is syntactically correct:

"a"*10^10

If so, the code ("compressed") size is 9 bytes. That’s hard to beat!

Zooming out - what a-priori knowledge about the input can do for compression

In this trivial example the main takeaway is that if you know your input well in advance, and can design a compressor for that specific input (or a class of inputs of that kind), then you can do much better than general compression algorithms. That’s not surprising of course, but the real question is whether one can build a compressor that, using whatever it takes to analyse its input, does best.

One fascinating idea in this topic is the Hutter Prize - where your task is to compress part of human knowledge (essentially text) as best as you can, by making a specific compressor for that text. You get a prize if you do better than the ones before you. If you got this far in reading, I urge you to check it out!


© Kistel