Papers · Atlas · Data

Byte-level compression benchmark

Name: Byte-Level Compression Benchmark on enwik8
Creator: Richard Hoekstra
Published: 2026-04

All measurements on enwik8 (English Wikipedia). PPM-C with zero learned parameters.

Main comparison

Compressor	bpb	Data size	Parameters	Verified	Notes
Unigram baseline	5.04	1M	0	—	Byte frequency only
gzip -9	~3.0	100M	0	No	LZ77 + Huffman
PPM-C v1 (baseline)	3.68	500K	0	No	Parent + delta, no exclusion
PPM-C v2 (+ exclusion)	3.32	500K	0	No	−0.36 bpb from exclusion
PPM-C v3 (+ escape)	3.05	500K	0	No	−0.27 bpb from PPM-C escape
PPM-C v4 (+ online)	2.57	500K	0	No	−0.48 bpb from online trie update
PPM-C v5 (combined + tuned)	2.26	500K	0	Yes (Lean)	All three components + tuning
xz	~2.2	100M	0	No	LZMA2
bzip2	~2.1	100M	0	No	BWT + Huffman
PPM-C v6 (+ skip-gram, 1M)	2.16	1M	0	Yes (Lean)	Skip-gram context blending
PPM-C v6 (full enwik8)	1.964	100M	0	Yes (Lean)	Same algorithm, full data
BGC0 (Zig runtime)	2.18	2M	0	Yes (Lean spec)	Native implementation of Conway byte machine
NNCP	~1.4	100M	~10⁶	No	Neural network compression
cmix	~1.2	100M	~10⁹	No	Context mixing ensemble
Shannon entropy rate	~1.0–1.3	—	—	—	Theoretical limit (estimated)

PPM-C ablation: 20 experiments

Starting from v5 at 2.26 bpb on 500K bytes, 20 controlled single-variable experiments:

#	Experiment	Δ (bpb)	Result
1	Bayesian escape	+0.730	FAIL
2	CDF 12-bit quantization	−0.044	PASS
3	KT estimator	+2.170	FAIL
4	MDL context pruning	−0.058	PASS
5	Exclusion order reversal	—	BUGGY
6	Exact rational arithmetic	0.000	NEUTRAL
7	c-theorem prior on depth	+0.045	FAIL
8	Forward filter (γ=0.99)	−0.024	PASS
9	Lookahead coding	0.000	NEUTRAL
10	Entry count pruning	+0.075	FAIL
11	Context inheritance	+0.031	FAIL
12	Count aging (halflife)	+1.050	FAIL
13	Skip-gram (α=0.2)	−0.089	PASS
14	Dual-parent (α=0.1)	−0.028	PASS
15	Local alphabet restriction	0.000	NEUTRAL
16	Bigram trie	+0.034	FAIL
17	Elimination coding	—	BUGGY
18	Second-order escape	−0.006	PASS
19	Sliding window (64k)	+0.669	FAIL
20	Depth ensemble	0.000	NEUTRAL

Summary: 7 pass, 7 fail, 4 neutral, 2 buggy. The PPM-C escape u/(t+u) is untouchable — every alternative tested makes things worse.

Scaling behavior

Component	Δ at 500K	Δ at 1M
Skip-gram	−0.089	−0.030
MDL pruning	−0.058	−0.010
Dual-parent	−0.028	−0.013
CDF 12-bit	−0.044	−0.008
Forward filter	−0.024	−0.005

All deltas shrink with more data. At 1M, combining five winners increases bpb by +0.036 relative to skip-gram alone. Improvements that compose at small scale can interfere at large scale.

Formal verification

The PPM-C compressor is formalized in Lean 4 with zero sorry's across three files. The roundtrip theorem (decode ∘ encode = id) is proved over exact rational arithmetic, yielding the first machine-verified Kolmogorov complexity upper bound: K(x) ≤ 2.16|x| + O(1).

Cite this data

@misc{hoekstra2026benchmark,
  author = {Hoekstra, Richard},
  title = {Byte-Level Compression Benchmark on enwik8},
  year = {2026},
  url = {https://richardhoekstra.nl/data/compression-benchmark.html}
}