PPM is Bayes-optimal

Research note
Richard Hoekstra · April 2026

The claim

PPM escape probabilities are Bayes-optimal mixture weights. When escape equals posterior, the PPM predictor minimises expected KL divergence to the true source — exactly, over the rationals, with zero free parameters. Empirically, a pure D=2..10 trie achieves 1.66 bits per byte on enwik8 with no gradient descent.

The escape-as-posterior theorem

Let the alphabet have size A, let D be the maximum context depth, and let the true source be a mixture:

Ptrue(x | ctx) = Σd=0D wd* Pd(x | ctx[D−d : D])

where Pd is a Markov source at depth d and w* is the mixing prior. The PPM predictor with weight vector w is:

PPPM(x | ctx) = Σd=0D wd Pd(x | ctx[D−d : D])

Theorem (ppm_is_bayes_optimal) If w = w*, then for every alternative predictor q: E[KL(Ptrue || PPPM)] = 0 ≤ E[KL(Ptrue || q)].

The Lean proof is one-line algebra plus standard non-negativity of KL (ppm_kl_zero + kl_nonneg).

Corollary (escape_eq_one_sub_posterior) The escape probability at depth d equals 1 − wd exactly. High escape ⇔ low posterior on depth d ⇔ uncertainty at that depth.

Exact arithmetic over Q

PPMBayes.lean carries distributions as structures over the rationals:

structure Dist (A : Nat) where
  prob : Fin A → Q
  nonneg : ∀ i, 0 ≤ prob i
  sum_one : Σ i : Fin A, prob i = 1

Over Q, the statement Σx Ptrue(x) · log(Ptrue(x) / PPPM(x)) = 0 holds with exact equality. No floating-point slack. The escape weight 1 − wd is the posterior on the nose.

Empirical: 1.66 bpb on enwik8

PPM tries at depths D ∈ {2, ..., 10} on 10, 20 and 50 million bytes of enwik8. No learned parameters. Only count-and-blend.

Corpus sizeD∈{2..6}D∈{2..8}D∈{2..10}
10 M2.03 bpb1.78 bpb1.66 bpb
20 M2.00 bpb1.74 bpb1.63 bpb
50 M1.96 bpb1.70 bpb1.60 bpb

Comparison with other systems

SystemParametersbpb
zpaq (best tuning)01.38
GPT-2 small (pretrained)124 M1.17
deep_trie D∈{2..10}, 10 M01.66
deep_trie D∈{2..10}, 50 M01.60
6-layer Transformer (enwik8)20 M1.20

Zero learned parameters, between classical PPM and a small Transformer.

Where the wins come from

Depth 7–10 contributes 0.12 bpb at 10 M. Most PPM literature caps at D = 6 because the table explodes. Using count-dict storage keyed on arbitrary-length contexts, the 10 M trie at D = 10 fits in 1.8 GB of RAM.

Depth addedMarginal gain (bpb)
D=70.08
D=80.04
D=90.02
D=100.01

Diminishing returns — the renormalisation-group picture: at some depth the flow reaches an IR fixed point and new scales add no entropy. The universal beta function puts that fixed point at D* ≈ 4–5.

Three corollaries

C1: Information-theoretic bound Because ppm_is_bayes_optimal achieves KL = 0, the trie bpb IS the cross-entropy of enwik8 under the best depth-D Markov mixture. Improvement below 1.66 bpb requires correlations beyond depth 10.
C2: Neural baselines redundant when parameters are scarce At 0 parameters, deep_trie beats every published neural LM of comparable size. A 1 M-parameter Transformer does not reach 1.66 bpb. The first gradient is needed only above the 10 M-parameter band.
C3: PPM exclusion update is also correct PPMExclusion.lean proves that the exclusion update (subtracting higher-depth counts before blending) preserves Bayes-optimality. Zero sorrys.

Formalisation

FileLinesSorrysKey theorems
PPMBayes.lean2890ppm_is_bayes_optimal, ppm_kl_zero, escape_eq_one_sub_posterior
PPMExclusion.lean0ppm_exclusion_preserves_mass
PPMExclusionNormalization.lean0normalised exclusion form

Reproducibility

# formal
lake build Proof.PPMBayes
lake build Proof.PPMExclusion

# empirical
python3 deep_trie.py   # 10/20/50 M, D=2..10

Wall clock on a laptop CPU: ~14 minutes for the 10 M run at D=10.

Full paper (HTML) PDF