𝓕 β€” A unified theory of description

Companion to the Hodge-Epsilon Program
Richard Hoekstra · 2026

The functional

The 𝓕-functional measures the net benefit of a model-data pair: the entropy the model saves minus the cost of describing the model itself.

𝓕(m, x) = bits_saved(m, x) − Lmodel(m)

The total description length is Ltotal(m, x) = Lmodel(m) + Lresidual(x | m). Higher 𝓕 is better: a model earns its complexity.
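The two displays above imply a simple accounting identity: if bits_saved is the raw length minus the residual, then 𝓕 is the raw length minus Ltotal. A minimal sketch, with illustrative names (f_value, l_total) that are not from any published API:

```python
# Toy transcription of the bookkeeping above.
# If bits_saved = raw_bits - L_residual, then F = raw_bits - L_total.
def f_value(raw_bits: float, l_residual: float, l_model: float) -> float:
    bits_saved = raw_bits - l_residual
    return bits_saved - l_model

def l_total(l_model: float, l_residual: float) -> float:
    return l_model + l_residual

# 128 raw bits, 40 bits of residual, 30 bits of model:
assert f_value(128.0, 40.0, 30.0) == 128.0 - l_total(30.0, 40.0)
```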

Model length

The model length is measured in trits via TLC (ternary lambda calculus):

Lmodel(m) = |m|trits × K₂

where K₂ = 3 · log₂(φ) ≈ 2.083 bits per trit, and φ = (1 + √5)/2 is the golden ratio. This is the natural bit-cost of a trit in the Kraft-saturating TLC encoding (the same encoding used in Paper 2 for Kolmogorov complexity measurement).
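The constant is a one-liner to compute; model_length_bits is an illustrative helper name, not an established API:

```python
import math

PHI = (1 + math.sqrt(5)) / 2      # golden ratio
K2 = 3 * math.log2(PHI)           # bits per trit, ~2.083

def model_length_bits(model_trits: int) -> float:
    """Lmodel(m) = |m|_trits * K2."""
    return model_trits * K2
```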

Three system classes

Class I: Deterministic, Pm(y|x) ∈ {0, 1}. The model either predicts correctly or not.
𝓕det(m, x) = correct(m, x) × 8 − |m|K₂
Each correct prediction saves a full byte (8 bits). Each wrong prediction costs 8 bits of residual. The model pays |m|K₂ bits to exist.
Class II: Stochastic, Pm(y|x) ∈ [0, 1]. The model outputs a probability distribution over the next byte.
𝓕sto(m, x) = (|x| × 8 − Σt [−log₂ Pm(xt+1 | x≤t)]) − |m|K₂
The residual is now the cross-entropy: each prediction contributes −log₂ P(correct next) bits of surprise. Bits saved = total bytes × 8 minus total surprise.
Class III: Open, Pm(y|x, z) with external context z. The model receives side-information at each step.
𝓕open(m, x, z) = correct(m, x, z) × 8 − |m|K₂
Reduces to Class I when the context z is empty.
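The three class formulas transcribe directly into code. This is a toy sketch under the definitions above; the function names, and the convention that probs[t] is the probability the model assigned to the byte that actually occurred, are assumptions for illustration:

```python
import math

K2 = 3 * math.log2((1 + math.sqrt(5)) / 2)  # bits per trit, ~2.083

def f_det(n_correct: int, model_trits: int) -> float:
    """Class I: each correct byte prediction saves 8 bits;
    the model pays |m| * K2 bits to exist."""
    return n_correct * 8 - model_trits * K2

def f_sto(probs, model_trits: int) -> float:
    """Class II: probs[t] is the probability assigned to the byte
    that actually occurred at step t; surprise is -log2 p."""
    surprise = sum(-math.log2(p) for p in probs)
    return len(probs) * 8 - surprise - model_trits * K2

def f_open(n_correct: int, model_trits: int) -> float:
    """Class III: same accounting as Class I, with correctness
    judged given the external context z."""
    return f_det(n_correct, model_trits)
```

For intuition: a 20-trit model that gets 12 of 16 bytes right earns f_det(12, 20) ≈ 54.3 > 0, while a uniform guesser over 256 byte values (p = 1/256 at every step) saves exactly nothing before it even pays its model cost.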

Three proven properties

Theorem 1: Monotonicity under refinement If m' is a refinement of m (predicts at least as well) and the improvement covers the model growth (Δcorrect × 8 ≥ Δcost), then 𝓕(m', x) ≥ 𝓕(m, x).

Proof. 𝓕(m') − 𝓕(m) = (correct(m') − correct(m)) × 8 − (|m'|K₂ − |m|K₂) = Δcorrect × 8 − Δcost ≥ 0 by assumption.
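A quick numeric check of the theorem using the Class I formula (helper names are illustrative):

```python
import math

K2 = 3 * math.log2((1 + math.sqrt(5)) / 2)  # bits per trit, ~2.083

def f_det(n_correct: int, model_trits: int) -> float:
    return n_correct * 8 - model_trits * K2

# Refinement m': +2 correct predictions at the price of +3 model trits.
# The improvement (16 bits) exceeds the growth (~6.25 bits),
# so F must not decrease.
base = f_det(10, 20)
refined = f_det(12, 23)
assert refined >= base
```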

Theorem 2: Encoding invariance For any bijective recoding σ: Σ → Σ (alphabet permutation): 𝓕(σ(m), σ(x)) = 𝓕(m, x).

Proof. |σ(m)|K₂ = |m|K₂ (a bijection preserves trit-length). predict(σ(m), σ(xt)) = σ(predict(m, xt)) by equivariance. So correct(σ(m), σ(x)) = correct(m, x).

Theorem 3: Composition bound Separable models compose with logarithmic overhead: 𝓕(m₁ ∘ m₂, x) ≥ 𝓕(m₁, x) + 𝓕(m₂, x) − O(log |x|).

What 𝓕 answers

One question: how many bits of saved entropy justify the cost of describing the model? A model with 𝓕 > 0 has earned its complexity: it saves more than it costs. A model with 𝓕 < 0 is too expensive for what it delivers.

The three classes are not separate theories. Class II reduces to Class I when predictions are deterministic. Class III reduces to Class I when context is empty. The formulas differ but the principle is the same: net benefit = savings − cost.
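The Class II to Class I reduction can be checked mechanically in the all-correct deterministic limit, where p = 1 at every step and the surprise term vanishes (names are illustrative):

```python
import math

K2 = 3 * math.log2((1 + math.sqrt(5)) / 2)  # bits per trit, ~2.083

def f_det(n_correct, model_trits):
    return n_correct * 8 - model_trits * K2

def f_sto(probs, model_trits):
    surprise = sum(-math.log2(p) for p in probs)
    return len(probs) * 8 - surprise - model_trits * K2

# Deterministic, all-correct limit: p = 1 everywhere, zero surprise,
# so the stochastic and deterministic functionals coincide.
assert math.isclose(f_sto([1.0] * 5, 7), f_det(5, 7))
```

Note that the check covers the all-correct case: a hard-wrong deterministic prediction (p → 0) has unbounded surprise under Class II, whereas Class I charges a fixed 8-bit residual.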

Connection to the program

𝓕 is the objective that the PPM-C compressor (Paper 2, companion) implicitly optimizes. The model-length term is measured in the same TLC encoding used for Kolmogorov complexity (Paper 2). The stochastic version (Class II) is the natural loss function for the Hodge language model (Paper 4). The harmonic layer's 1.75× parameter efficiency is a statement about 𝓕-efficiency: more bits saved per bit of model description.

The 𝓕-functional makes the tradeoff between model complexity and prediction quality precise. It is not a new idea: MDL (Rissanen 1978) and Bayesian model selection are close relatives. The contribution is the specific instantiation for byte-level prediction with TLC-measured model length, and the three-class unification.