Start with a baseline PPM-C predictor on enwik8. Run twenty controlled single-variable experiments, each changing exactly one component against a fixed baseline. Document every winner and every honest negative. Then formalize the result in Lean 4 to produce a machine-verified Kolmogorov complexity upper bound.
A compressor that achieves r bits per byte on a string x certifies K(x) ≤ r · |x| + O(1), where K(x) is the Kolmogorov complexity. This makes compression a rigorous, if one-sided, probe of structure.
The path from v1 (3.68 bpb) to v6 (2.16 bpb) involved three discrete core improvements, followed by tuning and a final single-component refinement.
| Version | Change | Delta (bpb) | Cumulative (bpb) | Reduction |
|---|---|---|---|---|
| v1 | Baseline (parent+delta) | — | 3.68 | — |
| v2 | + exclusion | −0.36 | 3.32 | −10% |
| v3 | + PPM-C escape | −0.27 | 3.05 | −17% |
| v4 | + online trie update | −0.48 | 2.57 | −30% |
| v5 | Combined + tuned | −0.31 | 2.26 | −39% |
| v6 | + skip-gram (1M) | −0.10 | 2.16 | −41% |
The combined effect (−1.42 bpb, v1→v5) exceeds the sum of individual improvements (−1.11 bpb) because the components interact positively: exclusion makes escape more accurate, and online update makes both exclusion and escape track the evolving distribution.
The compressor's architecture has three load-bearing components, each with a distinct algebraic character (as intuitive descriptions, not formal properties):
Exclusion (idempotent). Once a byte is excluded at depth d, it stays excluded at all shorter depths. Applying exclusion twice changes nothing. It removes already-predicted probability mass from shorter contexts, preventing double-counting. Worth −0.36 bpb.
PPM-C escape (nilpotent). The escape probability u/(t+u) transfers residual mass from depth d to depth d−1. The escape mass is consumed in a single transfer — a nilpotent operation. This is the maximum-likelihood estimator for the probability that the next byte is novel given the current context. Worth −0.27 bpb.
Online update (involution). The encoder and decoder apply identical trie updates after each byte, reflecting a mirror symmetry that guarantees decodability. The same update applied twice returns to the same relative state. Worth −0.48 bpb.
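To make the interaction of exclusion and escape concrete, here is a minimal single-step sketch. The data layout (a dict from context tuples to byte-count dicts), the function name, and the order-0 uniform fallback are illustrative assumptions, not the project's code:

```python
def ppmc_predict(contexts, history, max_depth, symbol):
    """Return the probability a PPM-C-style model assigns to `symbol`
    after `history`, escaping from deep to shallow contexts."""
    excluded = set()  # exclusion: bytes already predicted at deeper contexts
    prob = 1.0
    for d in range(max_depth, 0, -1):
        ctx = tuple(history[-d:]) if len(history) >= d else None
        counts = contexts.get(ctx)
        if not counts:
            continue  # context never seen: escape costs nothing
        active = {b: c for b, c in counts.items() if b not in excluded}
        if not active:
            continue
        t = sum(active.values())  # total count of non-excluded bytes
        u = len(active)           # number of unique continuations
        if symbol in active:
            # PPM-C: symbol gets count/(t+u) after reserving escape mass
            return prob * active[symbol] / (t + u)
        prob *= u / (t + u)       # PPM-C escape probability u/(t+u)
        excluded.update(active)   # never re-charge mass at shorter depths
    # order-0 fallback: uniform over the bytes not yet excluded
    return prob / (256 - len(excluded))
```

Note how exclusion shrinks both the per-depth denominator and the fallback alphabet: probability mass explained at depth d is never counted again at depth d−1.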
Starting from v5 at 2.26 bpb on 500k bytes, twenty controlled experiments were run in two rounds, each changing exactly one component. The abstract's summary, "seven yield improvements and thirteen produce honest negatives," is consistent with the finer classification in the table below: 7 pass, 7 fail, 4 neutral, 2 buggy (the thirteen negatives comprise the fails, neutrals, and buggy runs).
| # | Experiment | Round | Delta (bpb) | Result |
|---|---|---|---|---|
| 1 | Bayesian escape | 1 | +0.730 | FAIL |
| 2 | CDF 12-bit quantization | 1 | −0.044 | PASS |
| 3 | KT estimator | 1 | +2.170 | FAIL |
| 4 | MDL context pruning | 1 | −0.058 | PASS |
| 5 | Exclusion order reversal | 1 | — | BUGGY |
| 6 | Exact rational arithmetic | 1 | 0.000 | NEUTRAL |
| 7 | c-theorem prior on depth | 1 | +0.045 | FAIL |
| 8 | Forward filter (γ=0.99) | 1 | −0.024 | PASS |
| 9 | Lookahead coding | 1 | 0.000 | NEUTRAL |
| 10 | Entry count pruning | 1 | +0.075 | FAIL |
| 11 | Context inheritance | 2 | +0.031 | FAIL |
| 12 | Count aging (halflife) | 2 | +1.050 | FAIL |
| 13 | Skip-gram (α=0.2) | 2 | −0.089 | PASS |
| 14 | Dual-parent (α=0.1) | 2 | −0.028 | PASS |
| 15 | Local alphabet restriction | 2 | 0.000 | NEUTRAL |
| 16 | Bigram trie | 2 | +0.034 | FAIL |
| 17 | Elimination coding | 2 | — | BUGGY |
| 18 | Second-order escape | 2 | −0.006 | PASS |
| 19 | Sliding window (64k) | 2 | +0.669 | FAIL |
| 20 | Depth ensemble | 2 | 0.000 | NEUTRAL |
The pattern: the PPM-C escape mechanism is a local optimum, and improvements come from the surroundings — which contexts exist, how probabilities are quantized, what additional experts contribute.
Skip-gram context blending (#13, −0.089 bpb). The best single improvement. In addition to the contiguous context c_{i−D}…c_{i−1}, a skip-gram context that omits one position is queried and blended with weight α=0.2. This captures correlations beyond D=5 without the sparsity cost of extending to D=10. English text has strong positional dependencies (word endings predict word beginnings two words later) that contiguous contexts miss.
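A sketch of the blending step under the stated α=0.2. The function names and the exact choice of skipped position are assumptions; the experiment's real query structure may differ:

```python
ALPHA = 0.2  # blend weight from experiment #13

def blended_prob(p_contiguous, p_skipgram):
    """Linear blend of the contiguous-context prediction with the
    skip-gram prediction."""
    return (1 - ALPHA) * p_contiguous + ALPHA * p_skipgram

def skipgram_context(history, depth, skip):
    """A depth-symbol context ending at the current position, formed by
    omitting index `skip` of the (depth+1)-symbol tail of the history."""
    tail = history[-(depth + 1):]
    return tuple(b for i, b in enumerate(tail) if i != skip)
```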
MDL context pruning (#4, −0.058 bpb). Contexts visited fewer than a threshold number of times are pruned, guided by the Minimum Description Length principle: a context is worth keeping only if its predictive benefit exceeds the cost of encoding its existence. This lets D=6 beat D=5 by removing overfitting contexts.
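The keep/prune decision reduces to a one-line MDL comparison. The cost model below (8 bits per context symbol) is an illustrative assumption, not the experiment's actual threshold:

```python
def keep_context(visits, bits_saved_per_use, depth, cost_per_symbol=8.0):
    """MDL rule: keep a context only if its cumulative coding savings
    exceed the description cost of encoding the context's existence
    (cost model illustrative)."""
    return visits * bits_saved_per_use > depth * cost_per_symbol
```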
CDF 12-bit quantization (#2, −0.044 bpb). Restricting the CDF to 12-bit precision provides Laplace-like smoothing as a side effect. When a byte has very small probability, the quantization floor rounds it up to 1/4096, close to what a Laplace prior would assign.
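The smoothing side effect is easy to see in a sketch: quantizing to 12 bits floors every frequency at 1, i.e. at probability 1/4096 (the function name is assumed):

```python
def quantize_freq(p, bits=12):
    """Map a probability to a k-bit integer frequency, flooring at 1 so
    no byte ever gets zero coding mass (Laplace-like smoothing)."""
    scale = 1 << bits  # 4096 for 12 bits
    return max(1, round(p * scale))
```

A byte with probability 10^-6 still receives frequency 1, close to what a Laplace prior would assign.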
Dual-parent blending (#14, −0.028 bpb). In addition to the suffix parent context, a prefix context is consulted. The suffix captures the immediate past; the prefix captures the beginning of the current phrase. Blending with α=0.1.
Forward filter (#8, −0.024 bpb). A Bayesian depth selector maintaining a posterior over which depth is most informative, with exponential forgetting (γ=0.99). Adaptively weights longer contexts in structured regions and falls back in noisy regions.
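One plausible form of such a selector, assuming mix-toward-uniform forgetting followed by Bayesian reweighting (the experiment's exact update rule is not given):

```python
GAMMA = 0.99  # exponential forgetting factor from experiment #8

def update_depth_posterior(posterior, likelihoods):
    """One filter step: mix the posterior toward uniform (forgetting),
    then reweight each depth by how well it predicted the last byte."""
    n = len(posterior)
    mixed = [GAMMA * w + (1 - GAMMA) / n for w in posterior]
    unnorm = [w * p for w, p in zip(mixed, likelihoods)]
    z = sum(unnorm)
    return [w / z for w in unnorm]
```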
Second-order escape (#18, −0.006 bpb). The escape probability is adjusted based on the number of recent unique continuations rather than the lifetime count. Small but consistent.
The most important negative finding: every attempt to replace PPM-C's u/(t+u) escape formula fails.
| Alternative | Delta (bpb) |
|---|---|
| Bayesian escape (Beta posterior) | +0.730 |
| KT estimator | +2.170 |
| c-theorem prior on depth | +0.045 |
PPM-C escape is the ML estimator for the probability that the next byte is novel given the current context. Deviations are penalized immediately. You can surround the escape with better context selection, but you cannot touch the escape itself.
All individual deltas shrink at larger scale — longer data means more context, reducing the marginal value of each trick:
| Component | Delta at 500k | Delta at 1M |
|---|---|---|
| Skip-gram | −0.089 | −0.030 |
| MDL pruning | −0.058 | −0.010 |
| Dual-parent | −0.028 | −0.013 |
| CDF 12-bit | −0.044 | −0.008 |
| Forward filter | −0.024 | −0.005 |
KT estimator (#3, +2.170). The Krichevsky-Trofimov estimator systematically overestimates escape probability for small counts, leading to massive coding overhead.
Count aging (#12, +1.050). Halflife-based count decay destroys the trie's long-term statistics. PPM-C needs lifetime counts; recency hurts.
Bayesian escape (#1, +0.730). Replacing u/(t+u) with a Beta posterior overcomplicates what is already the ML estimator.
Sliding window (#19, +0.669). Restricting the trie to the last 64k bytes removes rare-but-informative long-range contexts.
Entry pruning (#10, +0.075). Pruning low-count entries removes useful singletons.

c-theorem prior (#7, +0.045). Biasing toward shorter contexts fights the data when longer contexts are informative.

Bigram trie (#16, +0.034). A separate bigram model adds noise when the main trie already captures bigrams at depth ≥ 2.

Context inheritance (#11, +0.031). Initializing child counts from parent counts creates correlation that exclusion cannot correct.
Exact rationals (#6): confirms 32-bit integer arithmetic introduces no measurable error.

Lookahead coding (#9): requires information the decoder does not have.

Local alphabet (#15): exclusion already handles unseen bytes.

Depth ensemble (#20): averaging across depths performs identically to PPM-C's sequential fallback, confirming PPM-C is already an optimal depth combination.
A mixture of a global trie and a local sliding window, gated by a curvature signal:
| Model | bpb |
|---|---|
| Trie alone | 3.147 |
| Window alone | 4.945 |
| Curvature-gated mixture | 3.528 |
| Mixture + momentum | 3.517 |
The mixture is 0.38 bpb worse than the trie alone. When one component is much stronger than the other (1.8 bpb quality gap), mixing is pure dilution.
Three Lean 4 files formalize the compressor and its correctness, with zero `sorry`s across all three. (A fourth, pre-existing file, PPMBayes.lean, carries one `sorry` on a Bayes-optimality claim that does not affect the roundtrip chain.)
ArithCoder.lean defines arithmetic coding over exact rationals (ℚ). CoderState is an interval [lo, hi) with 0 ≤ lo < hi ≤ 1. encodeStep narrows the interval by the predicted probability. Key results: encodeStep_width (new width = old width × P(symbol)) and encodeStep_contained (new interval ⊆ old interval). Because both encoder and decoder use rational arithmetic with no rounding, they compute identical intervals at every step.
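The interval-narrowing step and its two key properties can be mirrored in a few lines of exact-rational Python. This is a sketch of the described encodeStep, not the Lean code:

```python
from fractions import Fraction

def encode_step(lo, hi, cum_lo, cum_hi):
    """Narrow [lo, hi) to the sub-interval of a symbol whose cumulative
    probability range is [cum_lo, cum_hi); exact over the rationals."""
    width = hi - lo
    return lo + width * cum_lo, lo + width * cum_hi

# encodeStep_width: new width = old width × P(symbol)
lo, hi = encode_step(Fraction(0), Fraction(1), Fraction(1, 4), Fraction(1, 2))
assert hi - lo == Fraction(1, 4)
# encodeStep_contained: the new interval lies inside the old one
lo2, hi2 = encode_step(lo, hi, Fraction(1, 2), Fraction(3, 4))
assert lo <= lo2 and hi2 <= hi and hi2 - lo2 == (hi - lo) * Fraction(1, 4)
```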
Implements PPM-C prediction with exclusion. Key results: ppmcEscape_bounded (0 ≤ escape ≤ 1), exclusion_zero_on_excluded (excluded symbols receive exactly zero mass), and CountTable.update (online update preserves distribution validity).
The main theorem, compressor_roundtrip in KolmogorovBound.lean:
For any context depth D and any byte sequence, decompressing the compressed output recovers the original. The proof proceeds by induction on the byte list, requiring at each step: (1) exact arithmetic (no rounding drift), (2) deterministic prediction (encoder and decoder see the same distribution), and (3) encoder/decoder symmetry (the decoder finds the unique symbol whose cumulative range contains the current point).
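Stated abstractly, the theorem has roughly the following shape. The names compress and decompress are reconstructed from the prose, not quoted from the file:

```lean
-- Hypothetical statement shape only; the real definitions of
-- compress and decompress are not reproduced here.
theorem compressor_roundtrip (D : Nat) (xs : List UInt8) :
    decompress D (compress D xs) = xs
```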
The CompressionCertificate structure bundles the original data, the compressed size, and the machine-checked roundtrip proof.
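A hypothetical Lean sketch of such a certificate, with field names reconstructed from the description (the actual file may differ):

```lean
structure CompressionCertificate (D : Nat) where
  original   : List UInt8   -- the uncompressed bytes
  compressed : List Bool    -- the coded bitstream
  sizeBits   : Nat          -- compressed size in bits
  roundtrip  : decompress D compressed = original
```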
Combined with the empirical 2.16 bpb, this yields K(enwik8_prefix) ≤ 2.16 · |x| + c, where c = |decompressor| is on the order of 10^4 bits. For |x| = 10^8, the overhead c/|x| < 0.0001.
| Theorem | File | Statement |
|---|---|---|
| ppmcEscape_bounded | PPMExclusion | 0 ≤ escape ≤ 1 |
| exclusion_zero_on_excluded | PPMExclusion | Excluded symbols get zero mass |
| encodeStep_width | ArithCoder | Width shrinks by P(symbol) |
| encodeStep_contained | ArithCoder | New interval ⊆ old |
| compressor_roundtrip | KolmogorovBound | decode ∘ encode = id |
Arithmetic coding over floating-point works in practice, but "sufficient precision" is not a statement Lean can check. Over the rationals, the interval [lo, hi) is represented exactly. Multiplication and addition are exact. The decoder computes (p − lo) / width with no error. The roundtrip is a consequence of arithmetic identities, not precision analysis. Empirically, float64 accumulates an error of ~3.55 × 10^-15 per byte: negligible over 10^8 bytes, but "negligible" is not "zero," and the Lean proof requires zero.
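A small demonstration of the contrast (illustrative only; the per-byte error figure above is the author's empirical measurement):

```python
from fractions import Fraction

def narrow(lo, width, cum, prob):
    """One interval-narrowing step; works for both float and Fraction."""
    return lo + width * cum, width * prob

flo, fw = 0.0, 1.0
rlo, rw = Fraction(0), Fraction(1)
for _ in range(64):
    flo, fw = narrow(flo, fw, 1 / 3, 1 / 3)
    rlo, rw = narrow(rlo, rw, Fraction(1, 3), Fraction(1, 3))

# Over Q the decoder's inverse (p - lo) / width is an exact identity:
point = rlo + rw / 2
assert (point - rlo) / rw == Fraction(1, 2)

# float64 drift is tiny, but it need not be exactly zero
assert abs(flo - float(rlo)) < 1e-12
```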
| Compressor | bpb (enwik8) | Parameters | Verified |
|---|---|---|---|
| gzip | ~3.0 | 0 | No |
| v1 baseline | 3.68 | 0 | No |
| v5 tuned | 2.26 | 0 | Yes |
| bzip2 | ~2.1 | 0 | No |
| v6 final (1M) | 2.16 | 0 | Yes |
| v6 (full enwik8, 100M) | 1.964 | 0 | Yes |
| Neural (NNCP/cmix) | 1.2–1.4 | 10^6–10^9 | No |
| Shannon entropy rate | ~1.0–1.3 | — | — |
Runtime: O(n) in input length, O(256^D) worst-case memory. Single CPU core, no GPU. The bound is loose (the true K(x) for English text is likely near 1.0–1.3 bpb), but it is proved, not estimated. The remaining ~0.9 bpb gap represents non-local information that n-gram models fundamentally cannot capture: long-range topic coherence, syntactic dependencies spanning dozens of words, and world knowledge.