Byte-level compression benchmark

All measurements on enwik8 (English Wikipedia). PPM-C with zero learned parameters.

Main comparison

CompressorbpbData sizeParametersVerifiedNotes
Unigram baseline5.041M0Byte frequency only
gzip -9~3.0100M0NoLZ77 + Huffman
PPM-C v1 (baseline)3.68500K0NoParent + delta, no exclusion
PPM-C v2 (+ exclusion)3.32500K0No−0.36 bpb from exclusion
PPM-C v3 (+ escape)3.05500K0No−0.27 bpb from PPM-C escape
PPM-C v4 (+ online)2.57500K0No−0.48 bpb from online trie update
PPM-C v5 (combined + tuned)2.26500K0Yes (Lean)All three components + tuning
xz~2.2100M0NoLZMA2
bzip2~2.1100M0NoBWT + Huffman
PPM-C v6 (+ skip-gram, 1M)2.161M0Yes (Lean)Skip-gram context blending
PPM-C v6 (full enwik8)1.964100M0Yes (Lean)Same algorithm, full data
BGC0 (Zig runtime)2.182M0Yes (Lean spec)Native implementation of Conway byte machine
NNCP~1.4100M~10⁶NoNeural network compression
cmix~1.2100M~10⁹NoContext mixing ensemble
Shannon entropy rate~1.0–1.3Theoretical limit (estimated)

PPM-C ablation: 20 experiments

Starting from v5 at 2.26 bpb on 500K bytes, 20 controlled single-variable experiments:

#ExperimentΔ (bpb)Result
1Bayesian escape+0.730FAIL
2CDF 12-bit quantization−0.044PASS
3KT estimator+2.170FAIL
4MDL context pruning−0.058PASS
5Exclusion order reversalBUGGY
6Exact rational arithmetic0.000NEUTRAL
7c-theorem prior on depth+0.045FAIL
8Forward filter (γ=0.99)−0.024PASS
9Lookahead coding0.000NEUTRAL
10Entry count pruning+0.075FAIL
11Context inheritance+0.031FAIL
12Count aging (halflife)+1.050FAIL
13Skip-gram (α=0.2)−0.089PASS
14Dual-parent (α=0.1)−0.028PASS
15Local alphabet restriction0.000NEUTRAL
16Bigram trie+0.034FAIL
17Elimination codingBUGGY
18Second-order escape−0.006PASS
19Sliding window (64k)+0.669FAIL
20Depth ensemble0.000NEUTRAL

Summary: 7 pass, 7 fail, 4 neutral, 2 buggy. The PPM-C escape u/(t+u) is untouchable — every alternative tested makes things worse.

Scaling behavior

ComponentΔ at 500KΔ at 1M
Skip-gram−0.089−0.030
MDL pruning−0.058−0.010
Dual-parent−0.028−0.013
CDF 12-bit−0.044−0.008
Forward filter−0.024−0.005

All deltas shrink with more data. At 1M, combining five winners increases bpb by +0.036 relative to skip-gram alone. Improvements that compose at small scale can interfere at large scale.

Formal verification

The PPM-C compressor is formalized in Lean 4 with zero sorry's across three files. The roundtrip theorem (decode ∘ encode = id) is proved over exact rational arithmetic, yielding the first machine-verified Kolmogorov complexity upper bound: K(x) ≤ 2.16|x| + O(1).

Cite this data

@misc{hoekstra2026benchmark,
  author = {Hoekstra, Richard},
  title = {Byte-Level Compression Benchmark on enwik8},
  year = {2026},
  url = {https://richardhoekstra.nl/data/compression-benchmark.html}
}