Systematic byte-level compression with a machine-verified Kolmogorov bound

Companion to the Hodge-Epsilon Program
Richard Hoekstra · April 2026

The approach

Start with a baseline PPM-C predictor on enwik8. Run twenty controlled single-variable experiments, each changing exactly one component against a fixed baseline. Document every winner and every honest negative. Then formalize the result in Lean 4 to produce a machine-verified Kolmogorov complexity upper bound.

A compressor that achieves r bits per byte on a string x certifies K(x) ≤ r · |x| + O(1), where K(x) is the Kolmogorov complexity. This makes compression a rigorous, if one-sided, probe of structure.

Main result: 3.68 → 2.16 bpb on enwik8. A 41% reduction with zero learned parameters and O(n) runtime. K(x) ≤ 2.16|x| + O(1), machine-verified in Lean 4 with zero sorry's.

Version history

The path from v1 (3.68 bpb) to v6 (2.16 bpb) involved three discrete core improvements, followed by tuning and a final single-component refinement.

| Version | Change | Delta (bpb) | Cumulative (bpb) | Reduction |
|---|---|---|---|---|
| v1 | Baseline (parent+delta) | — | 3.68 | — |
| v2 | + exclusion | −0.36 | 3.32 | −10% |
| v3 | + PPM-C escape | −0.27 | 3.05 | −17% |
| v4 | + online trie update | −0.48 | 2.57 | −30% |
| v5 | Combined + tuned | −0.31 | 2.26 | −39% |
| v6 | + skip-gram (1M) | −0.10 | 2.16 | −41% |

The combined effect (−1.42 bpb, v1→v5) exceeds the sum of individual improvements (−1.11 bpb) because the components interact positively: exclusion makes escape more accurate, and online update makes both exclusion and escape track the evolving distribution.

Three algebraic components

The compressor's architecture has three load-bearing components, each with a distinct algebraic character (as intuitive descriptions, not formal properties):

Exclusion (idempotent). Once a byte is excluded at depth d, it stays excluded at all shorter depths. Applying exclusion twice changes nothing. It removes already-predicted probability mass from shorter contexts, preventing double-counting. Worth −0.36 bpb.
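A minimal sketch of the idempotence claim (Python; the function name and the normalized-dict representation are ours, the real compressor excludes at the count-table level):

```python
def apply_exclusion(probs, excluded):
    """Drop excluded symbols and renormalize the rest.

    Applying this twice with the same exclusion set is a no-op:
    the second pass finds nothing left to remove and z == 1.
    """
    kept = {s: p for s, p in probs.items() if s not in excluded}
    z = sum(kept.values())
    return {s: p / z for s, p in kept.items()}

once = apply_exclusion({"a": 0.5, "b": 0.3, "c": 0.2}, {"a"})
twice = apply_exclusion(once, {"a"})   # idempotent: same distribution
```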

PPM-C escape (nilpotent). The escape probability u/(t+u) transfers residual mass from depth d to depth d−1. The escape mass is consumed in a single transfer — a nilpotent operation. This is the maximum-likelihood estimator for the probability that the next byte is novel given the current context. Worth −0.27 bpb.
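A sketch of the escape formula at a single context (Python; helper names are ours, the formula u/(t+u) is the one in the text):

```python
from collections import Counter

def ppmc_predict(counts):
    """PPM-C at one context: a seen byte b gets counts[b]/(t+u),
    escape gets u/(t+u), where t = total count, u = distinct bytes seen."""
    t = sum(counts.values())
    u = len(counts)
    probs = {b: c / (t + u) for b, c in counts.items()}
    return probs, u / (t + u)

probs, escape = ppmc_predict(Counter({0x61: 3, 0x62: 1}))
# t = 4, u = 2: P(0x61) = 3/6, P(0x62) = 1/6, escape = 2/6
```

The escape mass then flows once to the depth-(d−1) context, which is the "single transfer" the nilpotency description refers to.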

Online update (involution). The encoder and decoder apply identical trie updates after each byte, reflecting a mirror symmetry that guarantees decodability. The same update applied twice returns to the same relative state. Worth −0.48 bpb.
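A minimal stand-in for the mirrored update, assuming a dict-of-dicts trie (the real structure is richer; only the encoder/decoder symmetry is the point):

```python
def update_trie(trie, context, byte):
    """The shared post-byte update: bump the count of `byte` under `context`.

    Encoder and decoder both call this after every byte, so their models
    stay identical at every step, which is what makes decoding possible.
    """
    trie.setdefault(context, {})
    trie[context][byte] = trie[context].get(byte, 0) + 1

enc_trie, dec_trie = {}, {}
for ctx, b in [(b"th", 0x65), (b"he", 0x20), (b"th", 0x65)]:
    update_trie(enc_trie, ctx, b)   # encoder side
    update_trie(dec_trie, ctx, b)   # decoder side, same call
```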

Twenty experiments

Starting from v5 at 2.26 bpb on 500k bytes, twenty controlled experiments were run in two rounds, each changing exactly one component. The abstract's summary ("seven yield improvements and thirteen produce honest negatives") coarsens the finer classification in the table: 7 pass, 7 fail, 4 neutral, 2 buggy. Both counts are presented here.

| # | Experiment | Round | Delta (bpb) | Result |
|---|---|---|---|---|
| 1 | Bayesian escape | 1 | +0.730 | FAIL |
| 2 | CDF 12-bit quantization | 1 | −0.044 | PASS |
| 3 | KT estimator | 1 | +2.170 | FAIL |
| 4 | MDL context pruning | 1 | −0.058 | PASS |
| 5 | Exclusion order reversal | 1 | — | BUGGY |
| 6 | Exact rational arithmetic | 1 | 0.000 | NEUTRAL |
| 7 | c-theorem prior on depth | 1 | +0.045 | FAIL |
| 8 | Forward filter (γ=0.99) | 1 | −0.024 | PASS |
| 9 | Lookahead coding | 1 | 0.000 | NEUTRAL |
| 10 | Entry count pruning | 1 | +0.075 | FAIL |
| 11 | Context inheritance | 2 | +0.031 | FAIL |
| 12 | Count aging (halflife) | 2 | +1.050 | FAIL |
| 13 | Skip-gram (α=0.2) | 2 | −0.089 | PASS |
| 14 | Dual-parent (α=0.1) | 2 | −0.028 | PASS |
| 15 | Local alphabet restriction | 2 | 0.000 | NEUTRAL |
| 16 | Bigram trie | 2 | +0.034 | FAIL |
| 17 | Elimination coding | 2 | — | BUGGY |
| 18 | Second-order escape | 2 | −0.006 | PASS |
| 19 | Sliding window (64k) | 2 | +0.669 | FAIL |
| 20 | Depth ensemble | 2 | 0.000 | NEUTRAL |

The pattern: the PPM-C escape mechanism is a local optimum, and improvements come from the surroundings — which contexts exist, how probabilities are quantized, what additional experts contribute.

What works

Skip-gram context blending (#13, −0.089 bpb). The best single improvement. In addition to the contiguous context c_{i−D} … c_{i−1}, a skip-gram context that omits one position is queried and blended with weight α=0.2. This captures correlations beyond D=5 without the sparsity cost of extending to D=10. English text has strong positional dependencies (word endings predict word beginnings two words later) that contiguous contexts miss.
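A sketch of how such a skip-gram context and the α-blend might look (Python; D, the choice of omitted position, and all names are illustrative assumptions, only α=0.2 is from the text):

```python
def contexts(history, D=5, skip=2):
    """Return the contiguous context and a skip-gram that omits one
    interior position (`skip` counted from the right)."""
    ctx = history[-D:]
    return ctx, ctx[:-skip - 1] + ctx[-skip:]

def blend(p_ctx, p_skip, alpha=0.2):
    """Mix the contiguous-context prediction with the skip-gram one."""
    syms = set(p_ctx) | set(p_skip)
    return {s: (1 - alpha) * p_ctx.get(s, 0.0) + alpha * p_skip.get(s, 0.0)
            for s in syms}

full, skipg = contexts(b"compress")   # b"press" and b"prss" (drops the "e")
```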

MDL context pruning (#4, −0.058 bpb). Contexts visited fewer than a threshold number of times are pruned, guided by the Minimum Description Length principle: a context is worth keeping only if its predictive benefit exceeds the cost of encoding its existence. This lets D=6 beat D=5 by removing overfitting contexts.

CDF 12-bit quantization (#2, −0.044 bpb). Restricting the CDF to 12-bit precision provides Laplace-like smoothing as a side effect. When a byte has very small probability, the quantization floor rounds it up to 1/4096, close to what a Laplace prior would assign.
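One way the quantization floor can be realized (a sketch; the tie-breaking scheme is our assumption, only the 12-bit grid is from the text):

```python
def quantize_cdf(probs, bits=12):
    """Quantize probabilities to a 2**bits grid, flooring at one tick.

    The max(1, ...) floor is the Laplace-like smoothing: no symbol can
    fall below 1/4096. Rounding error is absorbed by the most probable
    symbol so the ticks still sum to the full range.
    """
    total = 1 << bits                                    # 4096 ticks
    ticks = {s: max(1, round(p * total)) for s, p in probs.items()}
    top = max(ticks, key=ticks.get)
    ticks[top] -= sum(ticks.values()) - total            # fix the sum
    return ticks

ticks = quantize_cdf({"a": 0.99999, "b": 0.00001})
# "b" is floored up to 1 tick = 1/4096, roughly what a Laplace prior gives
```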

Dual-parent blending (#14, −0.028 bpb). In addition to the suffix parent context, a prefix context is consulted. The suffix captures the immediate past; the prefix captures the beginning of the current phrase. Blending with α=0.1.

Forward filter (#8, −0.024 bpb). A Bayesian depth selector maintaining a posterior over which depth is most informative, with exponential forgetting (γ=0.99). Adaptively weights longer contexts in structured regions and falls back in noisy regions.
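One plausible form of such a selector (the update rule is our guess at the "forward filter"; only γ=0.99 is from the text):

```python
def filter_step(weights, likelihoods, gamma=0.99):
    """One step of a forgetful Bayesian depth selector.

    weights: current posterior over depths. likelihoods: how well each
    depth's model predicted the byte just observed. Raising old weights
    to gamma < 1 implements exponential forgetting before the Bayes
    update, so the posterior can re-adapt in noisy regions.
    """
    new = [(w ** gamma) * l for w, l in zip(weights, likelihoods)]
    z = sum(new)
    return [w / z for w in new]

w = [1 / 3] * 3
for _ in range(50):                    # depth 2 keeps predicting best
    w = filter_step(w, [0.1, 0.3, 0.6])
# posterior mass concentrates on the most informative depth
```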

Second-order escape (#18, −0.006 bpb). The escape probability is adjusted based on the number of recent unique continuations rather than the lifetime count. Small but consistent.

The untouchable escape

The most important negative finding: every attempt to replace PPM-C's u/(t+u) escape formula fails.

| Alternative | Delta (bpb) |
|---|---|
| Bayesian escape (Beta posterior) | +0.730 |
| KT estimator | +2.170 |
| c-theorem prior on depth | +0.045 |

PPM-C escape is the ML estimator for the probability that the next byte is novel given the current context. Deviations are penalized immediately. You can surround the escape with better context selection, but you cannot touch the escape itself.

Scaling: 500k vs 1M

All individual deltas shrink at larger scale — longer data means more context, reducing the marginal value of each trick:

| Component | Delta at 500k | Delta at 1M |
|---|---|---|
| Skip-gram | −0.089 | −0.030 |
| MDL pruning | −0.058 | −0.010 |
| Dual-parent | −0.028 | −0.013 |
| CDF 12-bit | −0.044 | −0.008 |
| Forward filter | −0.024 | −0.005 |
The central scaling lesson: improvements that compose at small scale can interfere at large scale. At 500k, the five winners combine to −0.127 bpb (less than their sum of −0.243, indicating mild negative interaction). At 1M, the combination increases bpb by +0.036 relative to skip-gram alone. The mechanism is competition for the same signal: MDL prunes contexts the skip-gram expert relies on, and the forward filter's depth selection conflicts with dual-parent blending because both attempt to reweight the same escape cascade. Exhaustive ablation over all 2⁵ = 32 subsets at 1M confirmed that skip-gram alone is the best single-component configuration. The final v6 result (2.16 bpb) uses only skip-gram.
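The ablation's configuration space is just every on/off subset of the five winners; a sketch of the enumeration (component labels are ours):

```python
from itertools import combinations

components = ["skip-gram", "mdl-prune", "dual-parent", "cdf-12bit", "fwd-filter"]

# All on/off subsets of the five winning components: 2**5 configurations.
subsets = [frozenset(c)
           for r in range(len(components) + 1)
           for c in combinations(components, r)]
```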

Negative results

Catastrophic failures (delta > +0.5 bpb)

KT estimator (#3, +2.170). The Krichevsky-Trofimov estimator systematically overestimates escape probability for small counts, leading to massive coding overhead.

Count aging (#12, +1.050). Halflife-based count decay destroys the trie's long-term statistics. PPM-C needs lifetime counts; recency hurts.

Bayesian escape (#1, +0.730). Replacing u/(t+u) with a Beta posterior overcomplicates what is already the ML estimator.

Sliding window (#19, +0.669). Restricting the trie to the last 64k bytes removes rare-but-informative long-range contexts.

Moderate failures (delta +0.01 to +0.1)

Entry pruning (#10, +0.075). Pruning low-count entries removes useful singletons.

c-theorem prior (#7, +0.045). Biasing toward shorter contexts fights the data when longer contexts are informative.

Bigram trie (#16, +0.034). A separate bigram model adds noise when the main trie already captures bigrams at depth ≥ 2.

Context inheritance (#11, +0.031). Initializing child counts from parent counts creates correlation that exclusion cannot correct.

Neutral experiments (delta = 0.000)

Exact rationals (#6): confirms 32-bit integer arithmetic introduces no measurable error.

Lookahead coding (#9): requires information the decoder does not have.

Local alphabet (#15): exclusion already handles unseen bytes.

Depth ensemble (#20): averaging across depths performs identically to PPM-C's sequential fallback, confirming PPM-C is already an optimal depth combination.

Curvature-gated mixture failure

A mixture of a global trie and a local sliding window, gated by a curvature signal:

| Model | bpb |
|---|---|
| Trie alone | 3.147 |
| Window alone | 4.945 |
| Curvature-gated mixture | 3.528 |
| Mixture + momentum | 3.517 |

The mixture is 0.38 bpb worse than the trie alone. When one component is much stronger than the other (1.8 bpb quality gap), mixing is pure dilution.

The Lean certificate

Three Lean 4 files formalize the compressor and its correctness, with zero sorry's across all three. (A fourth pre-existing file, PPMBayes.lean, carries one sorry on a Bayes-optimality claim that does not affect the roundtrip chain.)

ArithCoder.lean (0 sorry's)

Defines arithmetic coding over exact rationals (ℚ). CoderState is an interval [lo, hi) with 0 ≤ lo < hi ≤ 1. encodeStep narrows the interval by the predicted probability. Key results: encodeStep_width (new width = old width × P(symbol)) and encodeStep_contained (new interval ⊆ old interval). Because both encoder and decoder use rational arithmetic with no rounding, they compute identical intervals at every step.
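A Python transliteration of what encodeStep does over exact rationals (a sketch mirroring the text's description, not the Lean code):

```python
from fractions import Fraction

def encode_step(lo, hi, cum, p):
    """Narrow [lo, hi) to one symbol's sub-interval, exactly over Q.

    cum = cumulative probability of the symbols before this one,
    p = this symbol's probability. New width is exactly width * p.
    """
    width = hi - lo
    return lo + width * cum, lo + width * (cum + p)

lo, hi = Fraction(0), Fraction(1)
lo, hi = encode_step(lo, hi, Fraction(1, 2), Fraction(1, 4))
# encodeStep_width: new width equals old width times P(symbol), exactly
```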

PPMExclusion.lean (0 sorry's)

Implements PPM-C prediction with exclusion. Key results: ppmcEscape_bounded (0 ≤ escape ≤ 1), exclusion_zero_on_excluded (excluded symbols receive exactly zero mass), and CountTable.update (online update preserves distribution validity).

KolmogorovBound.lean (0 sorry's)

The main theorem:

```lean
theorem compressor_roundtrip (D : Nat) (data : List Byte) :
    let compressed := compressBytes (D := D) data
    decompressBytes (D := D) compressed.coder data.length = data
```

For any context depth D and any byte sequence, decompressing the compressed output recovers the original. The proof proceeds by induction on the byte list, requiring at each step: (1) exact arithmetic (no rounding drift), (2) deterministic prediction (encoder and decoder see the same distribution), and (3) encoder/decoder symmetry (the decoder finds the unique symbol whose cumulative range contains the current point).

The CompressionCertificate structure bundles original data, compressed size, and machine-checked roundtrip proof:

```lean
structure CompressionCertificate (D : Nat) where
  original       : List Byte
  compressedBits : Nat
  roundtripProof :
    decompressBytes (D := D) (compressBytes (D := D) original).coder
      original.length = original
  bitsPerByte    : ℚ := compressedBits / original.length
```

Combined with the empirical 2.16 bpb, this yields K(enwik8_prefix) ≤ 2.16 · |x| + c, where c = |decompressor| is on the order of 10⁴ bits. For |x| = 10⁸, the overhead c/|x| < 0.0001.

What is proved vs what is stated

| Theorem | File | Statement |
|---|---|---|
| ppmcEscape_bounded | PPMExclusion | 0 ≤ escape ≤ 1 |
| exclusion_zero_on_excluded | PPMExclusion | Excluded symbols get zero mass |
| encodeStep_width | ArithCoder | Width shrinks by P(symbol) |
| encodeStep_contained | ArithCoder | New interval ⊆ old |
| compressor_roundtrip | KolmogorovBound | decode ∘ encode = id |

Why exact rationals

Arithmetic coding over floating-point works in practice, but "sufficient precision" is not a statement Lean can check. Over the rationals, the interval [lo, hi) is represented exactly. Multiplication and addition are exact. The decoder computes (p − lo) / width with no error. The roundtrip is a consequence of arithmetic identities, not precision analysis. Empirically, float64 accumulates an error of ~3.55 × 10⁻¹⁵ per byte. That is negligible over 10⁸ bytes, but "negligible" is not "zero," and the Lean proof requires zero.
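A standard illustration of the difference that matters for the proof (not from the paper; the 3.55 × 10⁻¹⁵ figure above comes from the actual coder):

```python
from fractions import Fraction

# Ten additions of one tenth: already inexact in float64, exact over Q.
f = sum([0.1] * 10)               # float64 accumulates rounding error
q = sum([Fraction(1, 10)] * 10)   # rationals leave nothing to analyze

# f is 0.9999999999999999...: "negligible" error, but not zero.
# q is exactly 1, which is what a machine-checked roundtrip needs.
```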

Final numbers in context

| Compressor | bpb (enwik8) | Parameters | Verified |
|---|---|---|---|
| gzip | ~3.0 | 0 | No |
| v1 baseline | 3.68 | 0 | No |
| v5 tuned | 2.26 | 0 | Yes |
| bzip2 | ~2.1 | 0 | No |
| v6 final (1M) | 2.16 | 0 | Yes |
| v6 (full enwik8, 100M) | 1.964 | 0 | Yes |
| Neural (NNCP/cmix) | 1.2–1.4 | 10⁶–10⁹ | No |
| Shannon entropy rate | ~1.0–1.3 | — | — |
Update (April 2026): On the full enwik8 (100M bytes), the same v6 compressor achieves 1.964 bpb — beating bzip2 and approaching the 1.2–1.4 range of neural compressors, with zero learned parameters.

Runtime: O(n) in input length, O(256^D) worst-case memory. Single CPU core, no GPU. The bound is loose (the true K(x) for English text is likely near 1.0–1.3 bpb), but it is proved, not estimated. The remaining ~0.9 bpb gap represents non-local information that n-gram models fundamentally cannot capture: long-range topic coherence, syntactic dependencies spanning dozens of words, and world knowledge.

Competitive with bzip2. Well behind neural compressors. The value lies in the certificate, not the rate. Each future improvement to the compressor yields a tighter certified bound through the same certificate structure.