| Compressor | bpb | Data size | Parameters | Verified | Notes |
|---|---|---|---|---|---|
| Unigram baseline | 5.04 | 1M | 0 | — | Byte frequency only |
| gzip -9 | ~3.0 | 100M | 0 | No | LZ77 + Huffman |
| PPM-C v1 (baseline) | 3.68 | 500K | 0 | No | Parent + delta, no exclusion |
| PPM-C v2 (+ exclusion) | 3.32 | 500K | 0 | No | −0.36 bpb from exclusion |
| PPM-C v3 (+ escape) | 3.05 | 500K | 0 | No | −0.27 bpb from PPM-C escape |
| PPM-C v4 (+ online) | 2.57 | 500K | 0 | No | −0.48 bpb from online trie update |
| PPM-C v5 (combined + tuned) | 2.26 | 500K | 0 | Yes (Lean) | All three components + tuning |
| xz | ~2.2 | 100M | 0 | No | LZMA2 |
| bzip2 | ~2.1 | 100M | 0 | No | BWT + Huffman |
| PPM-C v6 (+ skip-gram, 1M) | 2.16 | 1M | 0 | Yes (Lean) | Skip-gram context blending |
| PPM-C v6 (full enwik8) | 1.964 | 100M | 0 | Yes (Lean) | Same algorithm, full data |
| BGC0 (Zig runtime) | 2.18 | 2M | 0 | Yes (Lean spec) | Native implementation of Conway byte machine |
| NNCP | ~1.4 | 100M | ~10⁶ | No | Neural network compression |
| cmix | ~1.2 | 100M | ~10⁹ | No | Context mixing ensemble |
| Shannon entropy rate | ~1.0–1.3 | — | — | — | Theoretical limit (estimated) |
Starting from v5 at 2.26 bpb on 500K bytes, 20 controlled single-variable experiments:
| # | Experiment | Δ (bpb) | Result |
|---|---|---|---|
| 1 | Bayesian escape | +0.730 | FAIL |
| 2 | CDF 12-bit quantization | −0.044 | PASS |
| 3 | KT estimator | +2.170 | FAIL |
| 4 | MDL context pruning | −0.058 | PASS |
| 5 | Exclusion order reversal | — | BUGGY |
| 6 | Exact rational arithmetic | 0.000 | NEUTRAL |
| 7 | c-theorem prior on depth | +0.045 | FAIL |
| 8 | Forward filter (γ=0.99) | −0.024 | PASS |
| 9 | Lookahead coding | 0.000 | NEUTRAL |
| 10 | Entry count pruning | +0.075 | FAIL |
| 11 | Context inheritance | +0.031 | FAIL |
| 12 | Count aging (halflife) | +1.050 | FAIL |
| 13 | Skip-gram (α=0.2) | −0.089 | PASS |
| 14 | Dual-parent (α=0.1) | −0.028 | PASS |
| 15 | Local alphabet restriction | 0.000 | NEUTRAL |
| 16 | Bigram trie | +0.034 | FAIL |
| 17 | Elimination coding | — | BUGGY |
| 18 | Second-order escape | −0.006 | PASS |
| 19 | Sliding window (64k) | +0.669 | FAIL |
| 20 | Depth ensemble | 0.000 | NEUTRAL |
Summary: 7 pass, 7 fail, 4 neutral, 2 buggy. The PPM-C escape u/(t+u) is untouchable — every alternative tested makes things worse.
| Component | Δ at 500K | Δ at 1M |
|---|---|---|
| Skip-gram | −0.089 | −0.030 |
| MDL pruning | −0.058 | −0.010 |
| Dual-parent | −0.028 | −0.013 |
| CDF 12-bit | −0.044 | −0.008 |
| Forward filter | −0.024 | −0.005 |
All deltas shrink with more data. At 1M, combining five winners increases bpb by +0.036 relative to skip-gram alone. Improvements that compose at small scale can interfere at large scale.
The PPM-C compressor is formalized in Lean 4 with zero sorry's across three files. The roundtrip theorem (decode ∘ encode = id) is proved over exact rational arithmetic, yielding the first machine-verified Kolmogorov complexity upper bound: K(x) ≤ 2.16|x| + O(1).
@misc{hoekstra2026benchmark,
author = {Hoekstra, Richard},
title = {Byte-Level Compression Benchmark on enwik8},
year = {2026},
url = {https://richardhoekstra.nl/data/compression-benchmark.html}
}