𝓕 β€” A unified theory of description

Companion to the Hodge-Epsilon Program
Richard Hoekstra · 2026

The functional

The 𝓕-functional measures the net benefit of a model-data pair: the entropy the model saves minus the cost of describing the model itself.

𝓕(m, x) = bits_saved(m, x) − Lmodel(m)

The total description length is Ltotal(m, x) = Lmodel(m) + Lresidual(x | m). Higher 𝓕 is better: a model earns its complexity.
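The two displays above imply a simple accounting identity: if bits_saved is the raw length minus the residual, then 𝓕 is the raw length minus Ltotal. A minimal sketch, with illustrative names (f_value, l_total) that are not from any published API:

```python
# Toy transcription of the bookkeeping above.
# If bits_saved = raw_bits - L_residual, then F = raw_bits - L_total.
def f_value(raw_bits: float, l_residual: float, l_model: float) -> float:
    bits_saved = raw_bits - l_residual
    return bits_saved - l_model

def l_total(l_model: float, l_residual: float) -> float:
    return l_model + l_residual

# 128 raw bits, 40 bits of residual, 30 bits of model:
assert f_value(128.0, 40.0, 30.0) == 128.0 - l_total(30.0, 40.0)
```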

Model length

The model length is measured in trits via TLC (ternary lambda calculus):

Lmodel(m) = |m|trits × K₂

where K₂ = 3 · log₂(φ) ≈ 2.083 bits per trit, and φ = (1 + √5)/2 is the golden ratio. This is the natural bit-cost of a trit in the Kraft-saturating TLC encoding (the same encoding used in Paper 2 for Kolmogorov complexity measurement).
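The constant is a one-liner to compute; model_length_bits is an illustrative helper name, not an established API:

```python
import math

PHI = (1 + math.sqrt(5)) / 2      # golden ratio
K2 = 3 * math.log2(PHI)           # bits per trit, ~2.083

def model_length_bits(model_trits: int) -> float:
    """Lmodel(m) = |m|_trits * K2."""
    return model_trits * K2
```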

Three system classes

Class I: Deterministic, Pm(y|x) ∈ {0, 1}. The model either predicts correctly or not.
𝓕det(m, x) = correct(m, x) × 8 − |m|K₂
Each correct prediction saves a full byte (8 bits). Each wrong prediction costs 8 bits of residual. The model pays |m|K₂ bits to exist.
Class II: Stochastic, Pm(y|x) ∈ [0, 1]. The model outputs a probability distribution over the next byte.
𝓕sto(m, x) = (|x| × 8 − Σt [−log₂ Pm(xt+1 | x≤t)]) − |m|K₂
The residual is now the cross-entropy: each prediction contributes −log₂ P(correct next) bits of surprise. Bits saved = total bytes × 8 minus total surprise.
Class III: Open, Pm(y|x, z) with external context z. The model receives side-information at each step.
𝓕open(m, x, z) = correct(m, x, z) × 8 − |m|K₂
Reduces to Class I when the context z is empty.
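The three class formulas transcribe directly into code. This is a toy sketch under the definitions above; the function names, and the convention that probs[t] is the probability the model assigned to the byte that actually occurred, are assumptions for illustration:

```python
import math

K2 = 3 * math.log2((1 + math.sqrt(5)) / 2)  # bits per trit, ~2.083

def f_det(n_correct: int, model_trits: int) -> float:
    """Class I: each correct byte prediction saves 8 bits;
    the model pays |m| * K2 bits to exist."""
    return n_correct * 8 - model_trits * K2

def f_sto(probs, model_trits: int) -> float:
    """Class II: probs[t] is the probability assigned to the byte
    that actually occurred at step t; surprise is -log2 p."""
    surprise = sum(-math.log2(p) for p in probs)
    return len(probs) * 8 - surprise - model_trits * K2

def f_open(n_correct: int, model_trits: int) -> float:
    """Class III: same accounting as Class I, with correctness
    judged given the external context z."""
    return f_det(n_correct, model_trits)
```

For intuition: a 20-trit model that gets 12 of 16 bytes right earns f_det(12, 20) ≈ 54.3 > 0, while a uniform guesser over 256 byte values (p = 1/256 at every step) saves exactly nothing before it even pays its model cost.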

Three proven properties

Theorem 1: Monotonicity under refinement If m' is a refinement of m (predicts at least as well) and the improvement covers the model growth (Δcorrect × 8 ≥ Δcost), then 𝓕(m', x) ≥ 𝓕(m, x).

Proof. 𝓕(m') − 𝓕(m) = (correct(m') − correct(m)) × 8 − (|m'|K₂ − |m|K₂) = Δcorrect × 8 − Δcost ≥ 0 by assumption.
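A quick numeric check of the theorem using the Class I formula (helper names are illustrative):

```python
import math

K2 = 3 * math.log2((1 + math.sqrt(5)) / 2)  # bits per trit, ~2.083

def f_det(n_correct: int, model_trits: int) -> float:
    return n_correct * 8 - model_trits * K2

# Refinement m': +2 correct predictions at the price of +3 model trits.
# The improvement (16 bits) exceeds the growth (~6.25 bits),
# so F must not decrease.
base = f_det(10, 20)
refined = f_det(12, 23)
assert refined >= base
```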

Theorem 2: Encoding invariance For any bijective recoding σ: Σ → Σ (alphabet permutation): 𝓕(σ(m), σ(x)) = 𝓕(m, x).

Proof. |σ(m)|K₂ = |m|K₂ (a bijection preserves trit-length). predict(σ(m), σ(xt)) = σ(predict(m, xt)) by equivariance. So correct(σ(m), σ(x)) = correct(m, x).

Theorem 3: Composition bound Separable models compose with logarithmic overhead: 𝓕(m₁ ∘ m₂, x) ≥ 𝓕(m₁, x) + 𝓕(m₂, x) − O(log |x|).

What 𝓕 answers

One question: how many bits of saved entropy justify the cost of describing the model? A model with 𝓕 > 0 has earned its complexity: it saves more than it costs. A model with 𝓕 < 0 is too expensive for what it delivers.

The three classes are not separate theories. Class II reduces to Class I when predictions are deterministic. Class III reduces to Class I when context is empty. The formulas differ but the principle is the same: net benefit = savings − cost.
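The Class II to Class I reduction can be checked mechanically in the all-correct deterministic limit, where p = 1 at every step and the surprise term vanishes (names are illustrative):

```python
import math

K2 = 3 * math.log2((1 + math.sqrt(5)) / 2)  # bits per trit, ~2.083

def f_det(n_correct, model_trits):
    return n_correct * 8 - model_trits * K2

def f_sto(probs, model_trits):
    surprise = sum(-math.log2(p) for p in probs)
    return len(probs) * 8 - surprise - model_trits * K2

# Deterministic, all-correct limit: p = 1 everywhere, zero surprise,
# so the stochastic and deterministic functionals coincide.
assert math.isclose(f_sto([1.0] * 5, 7), f_det(5, 7))
```

Note that the check covers the all-correct case: a hard-wrong deterministic prediction (p → 0) has unbounded surprise under Class II, whereas Class I charges a fixed 8-bit residual.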

Connection to the program

𝓕 is the objective that the PPM-C compressor (Paper 2, companion) implicitly optimizes. The model-length term is measured in the same TLC encoding used for Kolmogorov complexity (Paper 2). The stochastic version (Class II) is the natural loss function for the Hodge language model (Paper 4). The harmonic layer's 1.75× parameter efficiency is a statement about 𝓕-efficiency: more bits saved per bit of model description.

The 𝓕-functional makes the tradeoff between model complexity and prediction quality precise. It is not a new idea: MDL (Rissanen 1978) and Bayesian model selection are close relatives. The contribution is the specific instantiation for byte-level prediction with TLC-measured model length, and the three-class unification.