The Hodge language model

Paper 4 in the Hodge-Epsilon Program
Richard Hoekstra · April 2026

Architecture from decomposition

Language models implicitly decompose conditional distributions into structure they capture easily and structure they find hard. We make this decomposition explicit by grounding it in the Hodge theory of the underlying Markov graph.

The byte-level de Bruijn graph of order D has as vertices all observed D-byte contexts and as edges the one-step transitions c → c[1:]+b. The edge field A(c → c') = log P_emp(c'|c) is the empirical log-transition probability. The Helmholtz-Hodge decomposition splits this field into three orthogonal components:
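As a sketch, the graph construction looks like the following. This is illustrative code, not the paper's implementation; the function name `build_debruijn` is invented here, and the rare-context threshold used in the experiments (thresh=4) is omitted for brevity.

```python
# Build the order-D de Bruijn graph from a byte string and attach the
# empirical log-transition edge field A(c -> c[1:]+b) = log P_emp(b | c).
import math
from collections import defaultdict

def build_debruijn(data: bytes, D: int):
    # context -> next byte -> count
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(data) - D):
        counts[data[i:i + D]][data[i + D]] += 1
    # Edge field on observed transitions only.
    A = {}
    for c, nxt in counts.items():
        total = sum(nxt.values())
        for b, n in nxt.items():
            A[(c, c[1:] + bytes([b]))] = math.log(n / total)
    return counts, A

counts, A = build_debruijn(b"abababac", D=2)
```

On this toy string, the context b"ba" is followed by 'b' twice and 'c' once, so its outgoing edges carry log(2/3) and log(1/3).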

Decomposition A = d₀φ + A_harm + δ₁ψ

where d₀φ is the exact (gradient) component derivable from a vertex potential φ, A_harm is the cycle-current component that cannot be expressed as any gradient, and δ₁ψ is the co-exact component from 2-cells. On the 1-complex, the co-exact term vanishes. Empirically, even on the clique 2-complex, the co-exact component is 0.03% of total energy. The story is two-component.

The vertex potential φ solves Lφ = d₀ᵀ(W · A), where L = d₀ᵀ W d₀ is the weighted graph Laplacian and W is the diagonal edge-weight matrix (transition counts). One sparse Cholesky factorization, O(|E|) time. The decomposition directly suggests a three-layer architecture:
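A toy dense version of the potential solve, on a 3-vertex cycle. The paper uses one sparse Cholesky factorization; here np.linalg.lstsq stands in, since the Laplacian is singular and φ is only defined up to an additive constant. The graph, weights, and field values are made up for illustration.

```python
import numpy as np

# Tiny graph: 3 vertices, directed edges 0->1, 1->2, 2->0 (a 3-cycle).
edges = [(0, 1), (1, 2), (2, 0)]
n_v, n_e = 3, len(edges)

d0 = np.zeros((n_e, n_v))           # incidence: row e has -1 at tail, +1 at head
for e, (u, v) in enumerate(edges):
    d0[e, u], d0[e, v] = -1.0, 1.0

W = np.diag([5.0, 3.0, 2.0])        # edge weights (transition counts)
A = np.array([1.0, -0.5, 0.2])      # edge field (log-probabilities in the paper)

L = d0.T @ W @ d0                   # weighted graph Laplacian
phi = np.linalg.lstsq(L, d0.T @ W @ A, rcond=None)[0]

A_exact = d0 @ phi                  # exact (gradient) component
A_harm = A - A_exact                # harmonic remainder on this 1-complex
```

The residual A_harm is W-orthogonal to every gradient, i.e. d₀ᵀ W A_harm = 0, which is exactly the defining property of the harmonic component on a 1-complex.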

Architecture Layer 0 (exact): Precompute φ. Prediction via table lookup: logits_exact[b] = φ(c[1:] + bytes([b])). Zero parameters, no gradients, no GPU. Captures the reversible (potential-driven) component — transitions obeying detailed balance.

Layer 1 (harmonic): 60K-parameter MLP learning cycle corrections A_harm(c → c'). Input: D byte indices. Architecture: Embedding(256, 42) → Linear(42·D, 128) → ReLU → Linear(128, 256). Parameters: 10,752 embedding + 49,408 network ≈ 60K total. Targets transitions that no vertex potential can capture — the irreversible currents, directional preferences at function-word boundaries, grammatical asymmetries.

Layer 2 (residual): 200K-parameter MLP for everything beyond context depth D. Input: 32 byte indices (a longer context window). Architecture: Embedding(256, res_dim) → Linear(res_dim·32, 256) → ReLU → Linear(256, 256) → ReLU → Linear(256, 256).

Inference: logits_total = logits_exact + logits_harmonic + logits_residual. The additive combination reflects the orthogonal Hodge decomposition: the three components live in orthogonal subspaces of the edge-field space. Only Layers 1 and 2 are trained; Layer 0 is frozen. Standard cross-entropy on logits_total.
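The inference rule is simple enough to sketch directly. This is illustrative code, not the paper's implementation; the fallback value 0.0 for contexts absent from the φ-table is an assumption made here.

```python
import numpy as np

def layer0_logits(phi: dict, c: bytes) -> np.ndarray:
    # Layer 0 table lookup: logits_exact[b] = φ(c[1:] + bytes([b])).
    # Unseen shifted contexts fall back to 0.0 (an assumption of this sketch).
    return np.array([phi.get(c[1:] + bytes([b]), 0.0) for b in range(256)])

def total_logits(phi, c, logits_harmonic, logits_residual):
    # The orthogonal components combine additively; only the two neural
    # terms carry gradients during training, Layer 0 is frozen.
    return layer0_logits(phi, c) + logits_harmonic + logits_residual
```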

The D-sweep

Trained at depths D = 2, 3, 4, 5 on the first 2M bytes of enwik8 (90% train / 10% test), harmonic_dim=128, residual_dim=256, context_len=32, thresh=4, 10 epochs, lr=0.001:

D   |V|      |E|      f_exact   f_harm   bpb_exact   bpb_total   params
2   3,189    34,136   0.187     0.813    4.21        2.93        ~260K
3   13,642   75,881   0.311     0.689    3.82        2.90        ~260K
4   27,997   95,255   0.472     0.528    3.57        2.87        ~260K
5   38,645   88,232   0.633     0.367    3.44        2.92        ~260K
Main result Optimum at D=4 with 2.87 bpb. The non-monotonicity is significant: D=5 is worse than D=4 despite having more context.

At D=2, the exact component is only 18.7% of field energy — Layer 0 contributes little, and the network must learn almost everything. At D=5, the exact component is 63.3% but the graph has 38,645 vertices with many contexts seen only a few times, leading to noisy φ estimates. The sweet spot at D=4 is where the exact component carries roughly half the structure (47.2%) and the graph is at maximum density (95,255 edges).

Harmonic efficiency

Define parameter efficiency as bpb improvement per million parameters:

Definition efficiency = (bpb_without layer − bpb_with layer) / (params / 10⁶)

At D=4:

Layer      bpb drop               Params   Efficiency (bpb/Mparam)
Harmonic   3.57 → 3.21 (−0.36)    60K      60.7
Residual   3.21 → 2.87 (−0.34)    200K     34.7

The harmonic layer is 1.75× more efficient per parameter than the residual. This is not because the harmonic component is small — at D=3 it carries 69% of total field energy. It is because the harmonic component is structured: it lives on a low-dimensional subspace of the edge space (the first Betti number b₁ = 62,240 at D=3, but the effective dimensionality, measured by eigenvalue concentration of the cycle interaction matrix, is much lower).

When Layer 0 is credited, the effective advantage reaches 12× at the most favorable operating point: the first ~50% of structure costs zero parameters (it is the vertex potential), so the harmonic layer starts from a much better baseline than a model that must learn everything from scratch.

The running coupling

Define g(D) = f_harm(D) / f_exact(D) = f_harm(D) / (1 − f_harm(D)); the second equality holds because the co-exact share is negligible, so f_exact ≈ 1 − f_harm. This is the ratio of harmonic to exact field energy — the strength of the cycle-current "interaction" relative to the free (potential-driven) "propagator."

D   f_harm   f_exact   g(D)
2   0.813    0.187     4.35
3   0.689    0.311     2.22
4   0.528    0.472     1.12
5   0.367    0.633     0.58
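The table is a direct function of the harmonic fractions; this snippet only re-derives the quoted values and locates the crossover:

```python
# Running coupling g(D) = f_harm / (1 - f_harm), valid because the
# co-exact share is negligible (f_exact ≈ 1 - f_harm).
f_harm = {2: 0.813, 3: 0.689, 4: 0.528, 5: 0.367}
g = {D: f / (1.0 - f) for D, f in f_harm.items()}
# The crossover g = 1 falls between D = 4 (g ≈ 1.12) and D = 5 (g ≈ 0.58).
```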

The coupling decreases monotonically. The crossover g(D*) = 1 occurs at D* ≈ 4, defining two regimes:

Two regimes D > D* (g < 1): The exact component dominates. Layer 0 carries most of the prediction. A trie with φ-lookup is a reasonable approximation.

D < D* (g > 1): The harmonic component dominates. No vertex potential suffices. The model needs the full cycle-current machinery.

The D=4 optimum coincides with this crossover. This is not a coincidence: it is the scale where neither perturbative expansion (around the exact component) nor cycle-current expansion (around the harmonic component) alone suffices. In the cross-linguistic atlas of 49 languages, the D* at g=1 correlates with the typological D* (where f_harm = 0.5) with Spearman ρ > 0.99.

The spectral sequence

The D-sweep has the structure of a spectral sequence. As D increases: the exact component grows (0.187 → 0.311 → 0.472 → 0.633); the harmonic component shrinks (0.813 → 0.689 → 0.528 → 0.367). Longer contexts disambiguate — when context is long enough to determine the next byte, the Markov chain becomes deterministic and hence trivially reversible.

The decay is super-exponential, with stretched-exponential fit f_harm(D) = exp(−(D/4.96)^1.91). The exponent k ≈ 2 means the decay is Gaussian-like in D. A second critical depth, D ≈ 7.5 (distinct from the coupling crossover at D* ≈ 4), is where f_harm extrapolates to zero — the fully reversible chain where the vertex potential determines all transitions.
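The reported fit can be checked against the measured fractions. This is only a sanity check of the quoted formula, not the fitting procedure itself:

```python
import math

def f_harm_fit(D, d_scale=4.96, k=1.91):
    # Stretched-exponential fit f_harm(D) = exp(-(D / 4.96) ** 1.91).
    return math.exp(-((D / d_scale) ** k))

measured = {2: 0.813, 3: 0.689, 4: 0.528, 5: 0.367}
errors = {D: abs(f_harm_fit(D) - f) for D, f in measured.items()}
```

The fit tracks the sweep to within a few hundredths across D = 2..5.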

Self-contained variant

The base Hodge LM requires a precomputed Hodge decomposition: build graph, decompose, freeze φ-table, train neural layers. The self-contained variant replaces the static graph with a living graph that grows with each observed byte and periodically recomputes its own Hodge decomposition.

The model regularizes the neural harmonic layer against its own observed harmonic corrections:

Self-consistency loss loss = loss_CE + λ_h · loss_self

where loss_self is the MSE between the neural harmonic logits and the harmonic logits from the living graph.
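A minimal sketch of the combined objective, assuming the self-consistency term is a plain MSE on logit vectors; the value λ_h = 0.1 is illustrative, not taken from the paper:

```python
import numpy as np

def total_loss(logits, target_byte, neural_harm, graph_harm, lam_h=0.1):
    # Cross-entropy of the next byte under the summed logits
    # (numerically stable log-softmax).
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    loss_ce = -log_probs[target_byte]
    # Self-consistency: neural harmonic logits vs. the living graph's.
    loss_self = np.mean((neural_harm - graph_harm) ** 2)
    return loss_ce + lam_h * loss_self
```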

Convergence On 500K bytes of enwik8 at D=3, the living graph's harmonic fraction stabilizes at f_harm = 0.653, within 5% of the static measurement (0.689 on 2M bytes). The model's self-diagnosis agrees with the external measurement. Final bpb: 2.89 (vs 2.87 for the precomputed model).

Crystallization and melting dynamics

The self-contained model exhibits spontaneous phase transitions. Contexts with low harmonic energy — where outgoing transitions are well-approximated by the vertex potential alone — crystallize into the trie as frozen lookup entries. Contexts whose harmonic energy rises above threshold melt back into the neural layer for re-learning.

Metric             Value
Crystallizations   631
Melts              347
Ratio              1.82:1

The system is net-crystallizing: structure progressively solidifies. But melting is substantial (347 melts against 631 crystallizations, about 35% of all phase transitions). The graph is not monotonically freezing — it continually revises as observed statistics evolve. The 1.8:1 ratio is a characteristic of the source, not a hyperparameter: it is identical across three independent implementations (fixed thresholds, learned meta-network, energy functional).

The trie+MLP variant: 2.26 bpb

The crystallization/melting dynamics motivate a dedicated phase model. A meta-network learns when to trust the trie versus the neural predictor:

Phase-learned architecture Trie: Live trie accumulating byte counts per context. Provides logits and per-context features (count, hits, error EMA, entropy).

Neural predictor: MLP, 128-dim, ~65K parameters.

Meta-network: ~15K parameters. Takes context embeddings + per-context features → confidence σ ∈ (0, 1).

Prediction: logits = σ · logits_trie + (1 − σ) · logits_neural
Result 2.26 bpb on 500K bytes of enwik8 at D=3, ~80K total parameters. The confidence distribution is bimodal — contexts are either crystallized (σ → 1) or liquid (σ → 0), confirming the phase picture.
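The gating rule itself is one line; σ here is a placeholder scalar standing in for the meta-network's output:

```python
import numpy as np

def gated_logits(sigma, logits_trie, logits_neural):
    # Confidence-weighted mixture: crystallized contexts (sigma -> 1)
    # follow the trie, liquid contexts (sigma -> 0) follow the MLP.
    return sigma * logits_trie + (1.0 - sigma) * logits_neural
```

The bimodal confidence distribution means most contexts sit near one of the two endpoints of this mixture.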

An energy-based variant replaces the meta-network entirely. Phase transitions are governed by a free energy functional with zero learned parameters: F(c) = |entry_cost| − ΣΔH, where crystallization occurs when F(c) < 0. This achieves the same 2.26 bpb, confirming that the crystallization decision is a consequence of thermodynamics, not optimization.
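A hedged reading of the criterion, with illustrative entry_cost and ΔH values; the paper's exact bookkeeping of the entropy reductions ΔH is not specified here:

```python
def should_crystallize(entry_cost, delta_H):
    # F(c) = |entry_cost| - sum of entropy reductions; crystallize when F < 0,
    # i.e. when the accumulated savings outweigh the fixed cost of a trie entry.
    F = abs(entry_cost) - sum(delta_H)
    return F < 0
```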

Important caveat: The trie+MLP model is architecturally a mixture-of-experts with context-dependent gating. It does not explicitly compute the Hodge decomposition. The decomposition motivated the design (the crystallization/melting dichotomy corresponds to exact/harmonic dominance), but the 2.26 bpb result validates the insight, not the mechanism.

The base model has zero learned parameters in its most important layer. The Hodge decomposition does the heavy lifting; the neural network cleans up what the mathematics leaves behind. The 0.70 bit gap between Layer 0 alone (3.57 bpb at D=4) and the full model (2.87 bpb) is the strong-coupling regime — the irreducible harmonic content that no vertex potential can capture.

Atlas data: topological atlas (49 languages, cycle catalogues), running coupling g(D) (full table).

Full paper (HTML) PDF