PPM escape probabilities are Bayes-optimal mixture weights. When escape equals posterior, the PPM predictor minimises expected KL divergence to the true source — exactly, over the rationals, with zero free parameters. Empirically, a pure D=2..10 trie achieves 1.66 bits per byte on enwik8 with no gradient descent.
Let the alphabet have size A, let D be the maximum context depth, and let the true source be a mixture:
$$P_{\mathrm{true}}(x \mid \mathrm{ctx}) = \sum_{d=0}^{D} w_d^{*} \, P_d\big(x \mid \mathrm{ctx}[D-d : D]\big)$$

where $P_d$ is a Markov source at depth $d$ and $w^*$ is the mixing prior. The PPM predictor with weight vector $w$ is:

$$P_{\mathrm{PPM}}(x \mid \mathrm{ctx}) = \sum_{d=0}^{D} w_d \, P_d\big(x \mid \mathrm{ctx}[D-d : D]\big)$$
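As a sketch, the mixture above can be computed with exact rationals; `depth_preds` and its callable interface are hypothetical stand-ins for the depth-d Markov sources, not the paper's code:

```python
from fractions import Fraction

def ppm_mixture(x, ctx, depth_preds, weights):
    """Blend P_PPM(x | ctx) = sum_d w_d * P_d(x | last d symbols of ctx).

    depth_preds[d] is a callable (x, suffix) -> Fraction standing in for P_d;
    weights[d] is w_d, with sum(weights) == 1. Hypothetical interface.
    """
    total = Fraction(0)
    for d, (w, P_d) in enumerate(zip(weights, depth_preds)):
        suffix = ctx[max(0, len(ctx) - d):]  # last d symbols; "" when d = 0
        total += w * P_d(x, suffix)
    return total

# Toy check: two uniform depth models over {a, b} mix back to uniform.
unif = lambda x, suffix: Fraction(1, 2)
assert ppm_mixture("a", "ab", [unif, unif], [Fraction(1, 2), Fraction(1, 2)]) == Fraction(1, 2)
```

Working over `Fraction` rather than floats mirrors the exactness claim: the blend is computed in ℚ with no rounding.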
The Lean proof is one-line algebra plus the standard non-negativity of KL (`ppm_kl_zero` + `kl_nonneg`).
PPMBayes.lean carries distributions as structures over the rationals:
```lean
structure Dist (A : Nat) where
  prob    : Fin A → ℚ
  nonneg  : ∀ i, 0 ≤ prob i
  sum_one : ∑ i : Fin A, prob i = 1
```
Over ℚ, the identity

$$\sum_x P_{\mathrm{true}}(x) \, \log\frac{P_{\mathrm{true}}(x)}{P_{\mathrm{PPM}}(x)} = 0$$

holds with exact equality when $w = w^*$. No floating-point slack. The escape weight $1 - w_d$ is the posterior on the nose.
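The exact-equality claim is easy to replay with Python's `Fraction` (the two depth distributions below are made-up examples, not from the paper): when the PPM weights equal the posterior, every ratio is exactly 1, so every log term vanishes exactly.

```python
from fractions import Fraction

# Made-up depth-conditioned distributions over a 3-symbol alphabet.
P0 = [Fraction(1, 3)] * 3                              # depth 0: uniform
P1 = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # depth 1
w_star = [Fraction(2, 5), Fraction(3, 5)]              # true mixing prior

def mix(w):
    """The two-component mixture sum_d w_d * P_d, computed in exact rationals."""
    return [w[0] * p0 + w[1] * p1 for p0, p1 in zip(P0, P1)]

P_true = mix(w_star)
P_ppm = mix(w_star)  # PPM weights set to the posterior w*

# Every ratio P_true(x)/P_PPM(x) is exactly 1 over Q ...
assert all(pt / pp == 1 for pt, pp in zip(P_true, P_ppm))
# ... so KL = sum_x P_true(x) * log(1) = 0 with exact equality.
assert sum(P_true) == 1
```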
We run PPM tries at depths D ∈ {2, ..., 10} on 10, 20, and 50 million bytes of enwik8. No learned parameters. Only count-and-blend.
| Corpus size | D∈{2..6} | D∈{2..8} | D∈{2..10} |
|---|---|---|---|
| 10 M | 2.03 bpb | 1.78 bpb | 1.66 bpb |
| 20 M | 2.00 bpb | 1.74 bpb | 1.63 bpb |
| 50 M | 1.96 bpb | 1.70 bpb | 1.60 bpb |
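A minimal "count-and-blend" step can be sketched as follows; the escape rule here is a simple 1/(n+1) (PPM method A style) chosen for brevity, and the exact blending in `deep_trie.py` may differ:

```python
from collections import defaultdict
from fractions import Fraction

class TrieCounter:
    """Toy count-and-blend PPM predictor (hypothetical sketch)."""

    def __init__(self, depths):
        self.depths = sorted(depths)  # e.g. range(2, 11) for D in {2..10}
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count

    def update(self, data, i):
        """Count data[i] under each context suffix ending at position i."""
        for d in self.depths:
            if i >= d:
                self.counts[data[i - d:i]][data[i]] += 1

    def predict(self, data, i, sym, alphabet=256):
        """Blend shallow-to-deep; each seen context keeps escape mass 1/(n+1)."""
        p = Fraction(1, alphabet)  # order -1 fallback: uniform over the alphabet
        for d in self.depths:
            ctx = data[i - d:i] if i >= d else None
            if ctx in self.counts:
                c = self.counts[ctx]
                n = sum(c.values())
                p = Fraction(1, n + 1) * p + Fraction(c.get(sym, 0), n + 1)
        return p

# Toy run: depth-1 contexts on a 2-symbol stream.
t = TrieCounter([1])
data = "aaaa"
for i in range(1, len(data)):
    t.update(data, i)
assert t.predict(data, 4, "a", alphabet=2) == Fraction(7, 8)
```

After three `a`-after-`a` counts, the next-symbol probability is 3/4 from the context plus 1/4 escape mass times the uniform 1/2, i.e. exactly 7/8 in ℚ.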
Against baselines:

| System | Learned parameters | bpb |
|---|---|---|
| zpaq (best tuning) | 0 | 1.38 |
| GPT-2 small (pretrained) | 124 M | 1.17 |
| deep_trie D∈{2..10}, 10 M | 0 | 1.66 |
| deep_trie D∈{2..10}, 50 M | 0 | 1.60 |
| 6-layer Transformer (enwik8) | 20 M | 1.20 |
With zero learned parameters, the trie lands between classical PPM and a small Transformer.
Depth 7–10 contributes 0.12 bpb at 10 M. Most PPM literature caps at D = 6 because the table explodes. Using count-dict storage keyed on arbitrary-length contexts, the 10 M trie at D = 10 fits in 1.8 GB of RAM.
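The count-dict layout can be sketched as one flat hash map keyed by raw context bytes, with no per-node pointers (names here are hypothetical; the 1.8 GB figure is the paper's measurement, not this sketch's):

```python
from collections import Counter

counts: dict[bytes, Counter] = {}  # context bytes -> symbol counts

def observe(data: bytes, i: int, max_depth: int = 10) -> None:
    """Record data[i] under every context of length 2..max_depth ending at i."""
    for d in range(2, max_depth + 1):
        if i < d:
            break  # context would run off the start of the buffer
        counts.setdefault(data[i - d:i], Counter())[data[i]] += 1

# Usage: one pass over the corpus fills all depths at once.
data = b"abcabcabc"
for i in range(len(data)):
    observe(data, i)
assert counts[b"ab"][ord("c")] == 3  # "ab" -> 'c' seen three times
```

Keying the dict directly on context bytes means a context of any length costs one hash lookup, which is what makes D = 10 feasible without an explicit linked trie.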
| Depth added | Marginal gain (bpb) |
|---|---|
| D=7 | 0.08 |
| D=8 | 0.04 |
| D=9 | 0.02 |
| D=10 | 0.01 |
Diminishing returns fit the renormalisation-group picture: at some depth the flow reaches an IR fixed point and new scales add no entropy. The universal beta function puts that fixed point at D* ≈ 4–5.
| File | Lines | Sorrys | Key theorems |
|---|---|---|---|
| PPMBayes.lean | 289 | 0 | ppm_is_bayes_optimal, ppm_kl_zero, escape_eq_one_sub_posterior |
| PPMExclusion.lean | — | 0 | ppm_exclusion_preserves_mass |
| PPMExclusionNormalization.lean | — | 0 | normalised exclusion form |
```shell
# formal
lake build Proof.PPMBayes
lake build Proof.PPMExclusion

# empirical
python3 deep_trie.py  # 10/20/50 M, D=2..10
```
Wall clock on a laptop CPU: ~14 minutes for the 10 M run at D=10.
Full paper: HTML · PDF