Start with a baseline PPM-C predictor on enwik8. Run twenty controlled single-variable experiments, each changing exactly one component against a fixed baseline. Document every winner and every honest negative. Then formalize the result in Lean 4 to produce a machine-verified Kolmogorov complexity upper bound.
A compressor that achieves r bits per byte on a string x certifies K(x) ≤ r · |x| + O(1), where K(x) is the Kolmogorov complexity. This makes compression a rigorous, if one-sided, probe of structure.
The path from v1 (3.68 bpb) to v6 (2.16 bpb) involved three discrete core improvements, followed by tuning and a final single-component refinement.
| Version | Change | Delta (bpb) | Cumulative (bpb) | Reduction |
|---|---|---|---|---|
| v1 | Baseline (parent+delta) | — | 3.68 | — |
| v2 | + exclusion | −0.36 | 3.32 | −10% |
| v3 | + PPM-C escape | −0.27 | 3.05 | −17% |
| v4 | + online trie update | −0.48 | 2.57 | −30% |
| v5 | Combined + tuned | −0.31 | 2.26 | −39% |
| v6 | + skip-gram (1M) | −0.10 | 2.16 | −41% |
The combined effect (−1.42 bpb, v1→v5) exceeds the sum of individual improvements (−1.11 bpb) because the components interact positively: exclusion makes escape more accurate, and online update makes both exclusion and escape track the evolving distribution.
The compressor's architecture has three load-bearing components, each with a distinct algebraic character (as intuitive descriptions, not formal properties):
Exclusion (idempotent). Once a byte is excluded at depth d, it stays excluded at all shorter depths. Applying exclusion twice changes nothing. It removes already-predicted probability mass from shorter contexts, preventing double-counting. Worth −0.36 bpb.
PPM-C escape (nilpotent). The escape probability u/(t+u) transfers residual mass from depth d to depth d−1. The escape mass is consumed in a single transfer — a nilpotent operation. This is the maximum-likelihood estimator for the probability that the next byte is novel given the current context. Worth −0.27 bpb.
Online update (involution). The encoder and decoder apply identical trie updates after each byte, reflecting a mirror symmetry that guarantees decodability. The same update applied twice returns to the same relative state. Worth −0.48 bpb.
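To make the interaction of exclusion and escape concrete, here is a minimal single-step sketch. The data layout (a dict from context tuples to byte-count dicts), the function name, and the order-0 uniform fallback are illustrative assumptions, not the project's code:

```python
def ppmc_predict(contexts, history, max_depth, symbol):
    """Return the probability a PPM-C-style model assigns to `symbol`
    after `history`, escaping from deep to shallow contexts."""
    excluded = set()  # exclusion: bytes already predicted at deeper contexts
    prob = 1.0
    for d in range(max_depth, 0, -1):
        ctx = tuple(history[-d:]) if len(history) >= d else None
        counts = contexts.get(ctx)
        if not counts:
            continue  # context never seen: escape costs nothing
        active = {b: c for b, c in counts.items() if b not in excluded}
        if not active:
            continue
        t = sum(active.values())  # total count of non-excluded bytes
        u = len(active)           # number of unique continuations
        if symbol in active:
            # PPM-C: symbol gets count/(t+u) after reserving escape mass
            return prob * active[symbol] / (t + u)
        prob *= u / (t + u)       # PPM-C escape probability u/(t+u)
        excluded.update(active)   # never re-charge mass at shorter depths
    # order-0 fallback: uniform over the bytes not yet excluded
    return prob / (256 - len(excluded))
```

Note how exclusion shrinks both the per-depth denominator and the fallback alphabet: probability mass explained at depth d is never counted again at depth d−1.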
Starting from v5 at 2.26 bpb on 500k bytes, twenty controlled experiments were run in two rounds, each changing exactly one component. The abstract's summary, "seven yield improvements and thirteen produce honest negatives," is consistent with the finer classification in the table below: 7 pass, 7 fail, 4 neutral, 2 buggy (the thirteen negatives comprise the fails, neutrals, and buggy runs).
| # | Experiment | Round | Delta (bpb) | Result |
|---|---|---|---|---|
| 1 | Bayesian escape | 1 | +0.730 | FAIL |
| 2 | CDF 12-bit quantization | 1 | −0.044 | PASS |
| 3 | KT estimator | 1 | +2.170 | FAIL |
| 4 | MDL context pruning | 1 | −0.058 | PASS |
| 5 | Exclusion order reversal | 1 | — | BUGGY |
| 6 | Exact rational arithmetic | 1 | 0.000 | NEUTRAL |
| 7 | c-theorem prior on depth | 1 | +0.045 | FAIL |
| 8 | Forward filter (γ=0.99) | 1 | −0.024 | PASS |
| 9 | Lookahead coding | 1 | 0.000 | NEUTRAL |
| 10 | Entry count pruning | 1 | +0.075 | FAIL |
| 11 | Context inheritance | 2 | +0.031 | FAIL |
| 12 | Count aging (halflife) | 2 | +1.050 | FAIL |
| 13 | Skip-gram (α=0.2) | 2 | −0.089 | PASS |
| 14 | Dual-parent (α=0.1) | 2 | −0.028 | PASS |
| 15 | Local alphabet restriction | 2 | 0.000 | NEUTRAL |
| 16 | Bigram trie | 2 | +0.034 | FAIL |
| 17 | Elimination coding | 2 | — | BUGGY |
| 18 | Second-order escape | 2 | −0.006 | PASS |
| 19 | Sliding window (64k) | 2 | +0.669 | FAIL |
| 20 | Depth ensemble | 2 | 0.000 | NEUTRAL |
The pattern: the PPM-C escape mechanism is a local optimum, and improvements come from the surroundings — which contexts exist, how probabilities are quantized, what additional experts contribute.
Skip-gram context blending (#13, −0.089 bpb). The best single improvement. In addition to the contiguous context c_{i−D}…c_{i−1}, a skip-gram context that omits one position is queried and blended with weight α=0.2. This captures correlations beyond D=5 without the sparsity cost of extending to D=10. English text has strong positional dependencies (word endings predict word beginnings two words later) that contiguous contexts miss.
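A sketch of the blending step under the stated α=0.2. The function names and the exact choice of skipped position are assumptions; the experiment's real query structure may differ:

```python
ALPHA = 0.2  # blend weight from experiment #13

def blended_prob(p_contiguous, p_skipgram):
    """Linear blend of the contiguous-context prediction with the
    skip-gram prediction."""
    return (1 - ALPHA) * p_contiguous + ALPHA * p_skipgram

def skipgram_context(history, depth, skip):
    """A depth-symbol context ending at the current position, formed by
    omitting index `skip` of the (depth+1)-symbol tail of the history."""
    tail = history[-(depth + 1):]
    return tuple(b for i, b in enumerate(tail) if i != skip)
```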
MDL context pruning (#4, −0.058 bpb). Contexts visited fewer than a threshold number of times are pruned, guided by the Minimum Description Length principle: a context is worth keeping only if its predictive benefit exceeds the cost of encoding its existence. This lets D=6 beat D=5 by removing overfitting contexts.
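The keep/prune decision reduces to a one-line MDL comparison. The cost model below (8 bits per context symbol) is an illustrative assumption, not the experiment's actual threshold:

```python
def keep_context(visits, bits_saved_per_use, depth, cost_per_symbol=8.0):
    """MDL rule: keep a context only if its cumulative coding savings
    exceed the description cost of encoding the context's existence
    (cost model illustrative)."""
    return visits * bits_saved_per_use > depth * cost_per_symbol
```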
CDF 12-bit quantization (#2, −0.044 bpb). Restricting the CDF to 12-bit precision provides Laplace-like smoothing as a side effect. When a byte has very small probability, the quantization floor rounds it up to 1/4096, close to what a Laplace prior would assign.
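The smoothing side effect is easy to see in a sketch: quantizing to 12 bits floors every frequency at 1, i.e. at probability 1/4096 (the function name is assumed):

```python
def quantize_freq(p, bits=12):
    """Map a probability to a k-bit integer frequency, flooring at 1 so
    no byte ever gets zero coding mass (Laplace-like smoothing)."""
    scale = 1 << bits  # 4096 for 12 bits
    return max(1, round(p * scale))
```

A byte with probability 10^-6 still receives frequency 1, close to what a Laplace prior would assign.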
Dual-parent blending (#14, −0.028 bpb). In addition to the suffix parent context, a prefix context is consulted. The suffix captures the immediate past; the prefix captures the beginning of the current phrase. Blending with α=0.1.
Forward filter (#8, −0.024 bpb). A Bayesian depth selector maintaining a posterior over which depth is most informative, with exponential forgetting (γ=0.99). Adaptively weights longer contexts in structured regions and falls back in noisy regions.
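One plausible form of such a selector, assuming mix-toward-uniform forgetting followed by Bayesian reweighting (the experiment's exact update rule is not given):

```python
GAMMA = 0.99  # exponential forgetting factor from experiment #8

def update_depth_posterior(posterior, likelihoods):
    """One filter step: mix the posterior toward uniform (forgetting),
    then reweight each depth by how well it predicted the last byte."""
    n = len(posterior)
    mixed = [GAMMA * w + (1 - GAMMA) / n for w in posterior]
    unnorm = [w * p for w, p in zip(mixed, likelihoods)]
    z = sum(unnorm)
    return [w / z for w in unnorm]
```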
Second-order escape (#18, −0.006 bpb). The escape probability is adjusted based on the number of recent unique continuations rather than the lifetime count. Small but consistent.
The most important negative finding: every attempt to replace PPM-C's u/(t+u) escape formula fails.
| Alternative | Delta (bpb) |
|---|---|
| Bayesian escape (Beta posterior) | +0.730 |
| KT estimator | +2.170 |
| c-theorem prior on depth | +0.045 |
PPM-C escape is the ML estimator for the probability that the next byte is novel given the current context. Deviations are penalized immediately. You can surround the escape with better context selection, but you cannot touch the escape itself.
All individual deltas shrink at larger scale — longer data means more context, reducing the marginal value of each trick:
| Component | Delta at 500k | Delta at 1M |
|---|---|---|
| Skip-gram | −0.089 | −0.030 |
| MDL pruning | −0.058 | −0.010 |
| Dual-parent | −0.028 | −0.013 |
| CDF 12-bit | −0.044 | −0.008 |
| Forward filter | −0.024 | −0.005 |
KT estimator (#3, +2.170). The Krichevsky-Trofimov estimator systematically overestimates escape probability for small counts, leading to massive coding overhead.
Count aging (#12, +1.050). Halflife-based count decay destroys the trie's long-term statistics. PPM-C needs lifetime counts; recency hurts.
Bayesian escape (#1, +0.730). Replacing u/(t+u) with a Beta posterior overcomplicates what is already the ML estimator.
Sliding window (#19, +0.669). Restricting the trie to the last 64k bytes removes rare-but-informative long-range contexts.
Entry pruning (#10, +0.075). Pruning low-count entries removes useful singletons.

c-theorem prior (#7, +0.045). Biasing toward shorter contexts fights the data when longer contexts are informative.

Bigram trie (#16, +0.034). A separate bigram model adds noise when the main trie already captures bigrams at depth ≥ 2.

Context inheritance (#11, +0.031). Initializing child counts from parent counts creates correlation that exclusion cannot correct.
Exact rationals (#6): confirms 32-bit integer arithmetic introduces no measurable error.

Lookahead coding (#9): requires information the decoder does not have.

Local alphabet (#15): exclusion already handles unseen bytes.

Depth ensemble (#20): averaging across depths performs identically to PPM-C's sequential fallback, confirming PPM-C is already an optimal depth combination.
A mixture of a global trie and a local sliding window, gated by a curvature signal:
| Model | bpb |
|---|---|
| Trie alone | 3.147 |
| Window alone | 4.945 |
| Curvature-gated mixture | 3.528 |
| Mixture + momentum | 3.517 |
The mixture is 0.38 bpb worse than the trie alone. When one component is much stronger than the other (1.8 bpb quality gap), mixing is pure dilution.
Three Lean 4 files formalize the compressor and its correctness, with zero `sorry`s across all three. (A fourth, pre-existing file, PPMBayes.lean, carries one `sorry` on a Bayes-optimality claim that does not affect the roundtrip chain.)
ArithCoder.lean defines arithmetic coding over exact rationals (ℚ). CoderState is an interval [lo, hi) with 0 ≤ lo < hi ≤ 1. encodeStep narrows the interval by the predicted probability. Key results: encodeStep_width (new width = old width × P(symbol)) and encodeStep_contained (new interval ⊆ old interval). Because both encoder and decoder use rational arithmetic with no rounding, they compute identical intervals at every step.
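The interval-narrowing step and its two key properties can be mirrored in a few lines of exact-rational Python. This is a sketch of the described encodeStep, not the Lean code:

```python
from fractions import Fraction

def encode_step(lo, hi, cum_lo, cum_hi):
    """Narrow [lo, hi) to the sub-interval of a symbol whose cumulative
    probability range is [cum_lo, cum_hi); exact over the rationals."""
    width = hi - lo
    return lo + width * cum_lo, lo + width * cum_hi

# encodeStep_width: new width = old width × P(symbol)
lo, hi = encode_step(Fraction(0), Fraction(1), Fraction(1, 4), Fraction(1, 2))
assert hi - lo == Fraction(1, 4)
# encodeStep_contained: the new interval lies inside the old one
lo2, hi2 = encode_step(lo, hi, Fraction(1, 2), Fraction(3, 4))
assert lo <= lo2 and hi2 <= hi and hi2 - lo2 == (hi - lo) * Fraction(1, 4)
```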
Implements PPM-C prediction with exclusion. Key results: ppmcEscape_bounded (0 ≤ escape ≤ 1), exclusion_zero_on_excluded (excluded symbols receive exactly zero mass), and CountTable.update (online update preserves distribution validity).
The main theorem, compressor_roundtrip in KolmogorovBound.lean:
For any context depth D and any byte sequence, decompressing the compressed output recovers the original. The proof proceeds by induction on the byte list, requiring at each step: (1) exact arithmetic (no rounding drift), (2) deterministic prediction (encoder and decoder see the same distribution), and (3) encoder/decoder symmetry (the decoder finds the unique symbol whose cumulative range contains the current point).
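Stated abstractly, the theorem has roughly the following shape. The names compress and decompress are reconstructed from the prose, not quoted from the file:

```lean
-- Hypothetical statement shape only; the real definitions of
-- compress and decompress are not reproduced here.
theorem compressor_roundtrip (D : Nat) (xs : List UInt8) :
    decompress D (compress D xs) = xs
```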
The CompressionCertificate structure bundles the original data, the compressed size, and the machine-checked roundtrip proof.
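A hypothetical Lean sketch of such a certificate, with field names reconstructed from the description (the actual file may differ):

```lean
structure CompressionCertificate (D : Nat) where
  original   : List UInt8   -- the uncompressed bytes
  compressed : List Bool    -- the coded bitstream
  sizeBits   : Nat          -- compressed size in bits
  roundtrip  : decompress D compressed = original
```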
Combined with the empirical 2.16 bpb, this yields K(enwik8_prefix) ≤ 2.16 · |x| + c, where c = |decompressor| is on the order of 10^4 bits. For |x| = 10^8, the overhead c/|x| < 0.0001.
| Theorem | File | Statement |
|---|---|---|
| ppmcEscape_bounded | PPMExclusion | 0 ≤ escape ≤ 1 |
| exclusion_zero_on_excluded | PPMExclusion | Excluded symbols get zero mass |
| encodeStep_width | ArithCoder | Width shrinks by P(symbol) |
| encodeStep_contained | ArithCoder | New interval ⊆ old |
| compressor_roundtrip | KolmogorovBound | decode ∘ encode = id |
Arithmetic coding over floating-point works in practice, but "sufficient precision" is not a statement Lean can check. Over the rationals, the interval [lo, hi) is represented exactly. Multiplication and addition are exact. The decoder computes (p − lo) / width with no error. The roundtrip is a consequence of arithmetic identities, not precision analysis. Empirically, float64 accumulates an error of ~3.55 × 10^-15 per byte: negligible over 10^8 bytes, but "negligible" is not "zero," and the Lean proof requires zero.
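A small demonstration of the contrast (illustrative only; the per-byte error figure above is the author's empirical measurement):

```python
from fractions import Fraction

def narrow(lo, width, cum, prob):
    """One interval-narrowing step; works for both float and Fraction."""
    return lo + width * cum, width * prob

flo, fw = 0.0, 1.0
rlo, rw = Fraction(0), Fraction(1)
for _ in range(64):
    flo, fw = narrow(flo, fw, 1 / 3, 1 / 3)
    rlo, rw = narrow(rlo, rw, Fraction(1, 3), Fraction(1, 3))

# Over Q the decoder's inverse (p - lo) / width is an exact identity:
point = rlo + rw / 2
assert (point - rlo) / rw == Fraction(1, 2)

# float64 drift is tiny, but it need not be exactly zero
assert abs(flo - float(rlo)) < 1e-12
```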
| Compressor | bpb (enwik8) | Parameters | Verified |
|---|---|---|---|
| gzip | ~3.0 | 0 | No |
| v1 baseline | 3.68 | 0 | No |
| v5 tuned | 2.26 | 0 | Yes |
| bzip2 | ~2.1 | 0 | No |
| v6 final (1M) | 2.16 | 0 | Yes |
| v6 (full enwik8, 100M) | 1.964 | 0 | Yes |
| Neural (NNCP/cmix) | 1.2–1.4 | 10^6–10^9 | No |
| Shannon entropy rate | ~1.0–1.3 | — | — |
Runtime: O(n) in input length, O(256^D) worst-case memory. Single CPU core, no GPU. The bound is loose (the true K(x) for English text is likely near 1.0–1.3 bpb), but it is proved, not estimated. The remaining ~0.9 bpb gap represents non-local information that n-gram models fundamentally cannot capture: long-range topic coherence, syntactic dependencies spanning dozens of words, and world knowledge.