Language models implicitly decompose conditional distributions into structure they capture easily and structure they find hard. We make this decomposition explicit by grounding it in the Hodge theory of the underlying Markov graph.
The byte-level de Bruijn graph of order D has as vertices all observed D-byte contexts and as edges the one-step transitions c → c[1:]+b. The edge field A(c → c') = log P_emp(c'|c) is the empirical log-transition probability. The Helmholtz-Hodge decomposition splits this field into three orthogonal components:

A = d₀φ + A_harmonic + δ₁ψ,

where d₀φ is the exact (gradient) component derivable from a vertex potential φ, A_harmonic is the cycle-current component that cannot be expressed as any gradient, and δ₁ψ is the co-exact component sourced by 2-cells. On the 1-complex the co-exact term vanishes identically. Empirically, even on the clique 2-complex, the co-exact component carries only 0.03% of total energy. The story is effectively two-component.
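The graph and edge field above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name `build_edge_field` and the toy input are ours:

```python
from collections import Counter
from math import log

def build_edge_field(data: bytes, D: int):
    """Order-D de Bruijn graph of observed contexts, with the edge field
    A(c -> c') = log P_emp(c' | c) computed from transition counts.
    Sketch only: names are illustrative, not from the paper."""
    counts = Counter()      # (c, c') -> transition count
    out_totals = Counter()  # c -> total outgoing count
    for i in range(len(data) - D):
        c = data[i:i + D]                # current D-byte context
        c_next = data[i + 1:i + D + 1]   # shifted context c[1:] + next byte
        counts[(c, c_next)] += 1
        out_totals[c] += 1
    # Edge field: empirical log-transition probability on each observed edge.
    A = {e: log(n / out_totals[e[0]]) for e, n in counts.items()}
    return A, counts

A, counts = build_edge_field(b"abracadabra", D=2)
```

In a real run the counts would come from the training bytes and feed the weight matrix W below; the thresh hyperparameter would drop rare edges before the decomposition.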
The vertex potential φ solves Lφ = d₀ᵀWA, where L = d₀ᵀWd₀ is the weighted graph Laplacian and W is the diagonal edge-weight matrix (transition counts). One sparse Cholesky factorization, O(|E|) time. The decomposition directly suggests a three-layer architecture:
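A minimal sketch of this solve, using a generic sparse direct solver in place of the sparse Cholesky (a CHOLMOD binding would exploit L's positive semi-definiteness); function names and the toy triangle example are ours:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def hodge_potential(edges, A_vals, w_vals, n_vertices):
    """Solve L phi = d0^T W A for the vertex potential (weighted
    least-squares gradient fit) and return the exact component d0 phi
    plus its energy fraction f_exact. Sketch, not the paper's code."""
    m = len(edges)
    w = np.asarray(w_vals, dtype=float)
    a = np.asarray(A_vals, dtype=float)
    rows = np.repeat(np.arange(m), 2)
    cols = np.array([idx for u, v in edges for idx in (u, v)])
    vals = np.tile([-1.0, 1.0], m)        # d0: (d0 phi)(u -> v) = phi_v - phi_u
    d0 = sp.csr_matrix((vals, (rows, cols)), shape=(m, n_vertices))
    W = sp.diags(w)
    L = (d0.T @ W @ d0).tocsr()
    # Pin phi[0] = 0 to fix the additive gauge freedom (L is singular).
    L = L + sp.csr_matrix(([1.0], ([0], [0])), shape=L.shape)
    phi = spsolve(L, d0.T @ (W @ a))
    grad = d0 @ phi                        # exact component d0 phi
    f_exact = (w * grad**2).sum() / (w * a**2).sum()
    return phi, grad, f_exact

# Toy check: a pure gradient field on a triangle, phi = (0, 1, 3).
edges = [(0, 1), (1, 2), (0, 2)]
phi, grad, f_exact = hodge_potential(edges, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 3)
```

The same `f_exact` quantity (W-weighted energy fraction of the gradient component) is what the D-sweep table below reports.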
Inference: logits_total = logits_exact + logits_harmonic + logits_residual. The additive combination reflects the orthogonality of the Hodge decomposition: the three components live in orthogonal subspaces of the edge-field space. Only Layers 1 and 2 are trained; Layer 0 is frozen. Standard cross-entropy on logits_total.
Trained at depths D = 2, 3, 4, 5 on the first 2M bytes of enwik8 (90% train / 10% test), harmonic_dim=128, residual_dim=256, context_len=32, thresh=4, 10 epochs, lr=0.001:
| D | \|V\| | \|E\| | f_exact | f_harm | bpb_exact | bpb_total | params |
|---|---|---|---|---|---|---|---|
| 2 | 3,189 | 34,136 | 0.187 | 0.813 | 4.21 | 2.93 | ~260K |
| 3 | 13,642 | 75,881 | 0.311 | 0.689 | 3.82 | 2.90 | ~260K |
| 4 | 27,997 | 95,255 | 0.472 | 0.528 | 3.57 | 2.87 | ~260K |
| 5 | 38,645 | 88,232 | 0.633 | 0.367 | 3.44 | 2.92 | ~260K |
At D=2, the exact component is only 18.7% of field energy — Layer 0 contributes little, and the network must learn almost everything. At D=5, the exact component is 63.3% but the graph has 38,645 vertices with many contexts seen only a few times, leading to noisy φ estimates. The sweet spot at D=4 is where the exact component carries roughly half the structure (47.2%) and the graph is at maximum density (95,255 edges).
Define parameter efficiency as bpb improvement per million parameters:
At D=4:
| Layer | bpb drop | Params | Efficiency (bpb/Mparam) |
|---|---|---|---|
| Harmonic | 3.57 → 3.21 (−0.36) | 60K | 60.7 |
| Residual | 3.21 → 2.87 (−0.34) | 200K | 34.7 |
The harmonic layer is 1.75× more efficient per parameter than the residual. This is not because the harmonic component is small — at D=3 it carries 69% of total field energy. It is because the harmonic component is structured: it lives on a low-dimensional subspace of the edge space (the first Betti number b₁ = 62,240 at D=3, but the effective dimensionality, measured by eigenvalue concentration of the cycle interaction matrix, is much lower).
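The quoted Betti number is just the dimension of the graph's cycle space and can be checked directly from the table. A one-line verification (the formula assumes a connected graph, which we take as given here):

```python
def first_betti(n_vertices: int, n_edges: int, n_components: int = 1) -> int:
    """b1 = |E| - |V| + (number of connected components): the dimension
    of the cycle space of the 1-complex, i.e. the number of independent
    cycles the harmonic component can live on."""
    return n_edges - n_vertices + n_components

# D=3 figures from the table: |V| = 13,642, |E| = 75,881
print(first_betti(13_642, 75_881))  # → 62240
```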
When Layer 0 is credited, the effective advantage reaches 12× at the most favorable operating point: the first ~50% of structure costs zero parameters (it is the vertex potential), so the harmonic layer starts from a much better baseline than a model that must learn everything from scratch.
Define g(D) = f_harm(D) / f_exact(D) = f_harm(D) / (1 − f_harm(D)). This is the ratio of harmonic to exact field energy: the strength of the cycle-current "interaction" relative to the free (potential-driven) "propagator."

| D | f_harm | f_exact | g(D) |
|---|---|---|---|
| 2 | 0.813 | 0.187 | 4.35 |
| 3 | 0.689 | 0.311 | 2.22 |
| 4 | 0.528 | 0.472 | 1.12 |
| 5 | 0.367 | 0.633 | 0.58 |
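The g(D) column follows directly from f_harm; recomputing it from the definition reproduces the table:

```python
def running_coupling(f_harm: float) -> float:
    """g = f_harm / f_exact = f_harm / (1 - f_harm)."""
    return f_harm / (1.0 - f_harm)

g = {D: round(running_coupling(f), 2)
     for D, f in [(2, 0.813), (3, 0.689), (4, 0.528), (5, 0.367)]}
print(g)  # → {2: 4.35, 3: 2.22, 4: 1.12, 5: 0.58}
```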
The coupling decreases monotonically. The crossover g(D*) = 1 occurs at D* ≈ 4, defining two regimes: a cycle-dominated regime (g > 1, shallow contexts) in which the harmonic component carries most of the structure, and a potential-dominated regime (g < 1, deep contexts) in which the gradient component does.
The D=4 optimum coincides with this crossover. This is not a coincidence: it is the scale where neither perturbative expansion (around the exact component) nor cycle-current expansion (around the harmonic component) alone suffices. In the cross-linguistic atlas of 49 languages, the D* at g = 1 correlates with the typological D* (where f_harm = 0.5) with Spearman ρ > 0.99.
The D-sweep has the structure of a spectral sequence. As D increases: the exact component grows (0.187 → 0.311 → 0.472 → 0.633); the harmonic component shrinks (0.813 → 0.689 → 0.528 → 0.367). Longer contexts disambiguate — when context is long enough to determine the next byte, the Markov chain becomes deterministic and hence trivially reversible.
The decay is super-exponential, with stretched-exponential fit f_harm(D) = exp(−(D/4.96)^1.91). The exponent k ≈ 2 means the decay is Gaussian-like in D. The critical depth D* ≈ 7.5 is where f_harm extrapolates to zero: the fully reversible chain whose vertex potential determines all transitions.
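Refitting the four tabulated points should recover parameters in the neighborhood of the quoted (4.96, 1.91); a sketch using scipy's nonlinear least squares (the initial guess `p0` is our choice):

```python
import numpy as np
from scipy.optimize import curve_fit

# (D, f_harm) points from the D-sweep table above.
D = np.array([2.0, 3.0, 4.0, 5.0])
f_harm = np.array([0.813, 0.689, 0.528, 0.367])

def stretched_exp(D, D0, k):
    """f_harm(D) = exp(-(D / D0)^k)."""
    return np.exp(-(D / D0) ** k)

(D0, k), _ = curve_fit(stretched_exp, D, f_harm, p0=(5.0, 2.0))
```

Note that two free parameters against four points is a loose fit; the quoted values may come from a larger sweep or a log-space regression, so exact agreement should not be expected.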
The base Hodge LM requires a precomputed Hodge decomposition: build graph, decompose, freeze φ-table, train neural layers. The self-contained variant replaces the static graph with a living graph that grows with each observed byte and periodically recomputes its own Hodge decomposition.
The model regularizes the neural harmonic layer against its own observed harmonic corrections:

loss_total = loss_CE + λ · loss_self-consistency,

where loss_self-consistency is the MSE between the neural harmonic logits and the harmonic logits from the living graph.
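A minimal numpy sketch of this combined objective (the weight `lam` is an illustrative hyperparameter, not a value from the paper):

```python
import numpy as np

def self_consistent_loss(logits_total, targets, neural_harm, graph_harm, lam=0.1):
    """Cross-entropy on the combined logits plus lam * MSE between the
    neural harmonic logits and the harmonic logits recomputed from the
    living graph. Sketch; `lam` is an assumed weighting, not from the paper."""
    # Numerically stable log-softmax for the cross-entropy term.
    z = logits_total - logits_total.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(targets)), targets].mean()
    # Self-consistency: pull the neural layer toward the graph's own field.
    sc = np.mean((neural_harm - graph_harm) ** 2)
    return ce + lam * sc
```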
The self-contained model exhibits spontaneous phase transitions. Contexts with low harmonic energy — where outgoing transitions are well-approximated by the vertex potential alone — crystallize into the trie as frozen lookup entries. Contexts whose harmonic energy rises above threshold melt back into the neural layer for re-learning.
| Metric | Value |
|---|---|
| Crystallizations | 631 |
| Melts | 347 |
| Ratio | 1.82:1 |
The system is net-crystallizing: structure progressively solidifies. But the melting rate is substantial (347 of 978 phase transitions, about 35%). The graph is not monotonically freezing; it continually revises as observed statistics evolve. The 1.8:1 ratio is a characteristic of the source, not a hyperparameter: it is identical across three independent implementations (fixed thresholds, learned meta-network, energy functional).
The crystallization/melting dynamics motivate a dedicated phase model. A meta-network learns when to trust the trie versus the neural predictor:
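The gating can be sketched as a standard mixture-of-experts interpolation; the linear gate, its feature vector, and all names here are illustrative stand-ins for the learned meta-network:

```python
import numpy as np

def gated_logits(trie_logits, neural_logits, gate_features, w, b):
    """Context-dependent gate: per-context features (e.g. visit count,
    harmonic energy) map to a scalar in (0, 1) that interpolates the
    frozen trie prediction and the neural prediction. A linear gate is
    used here for illustration; the paper's meta-network is learned."""
    g = 1.0 / (1.0 + np.exp(-(gate_features @ w + b)))  # sigmoid gate
    return g * trie_logits + (1.0 - g) * neural_logits
```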
An energy-based variant replaces the meta-network entirely. Phase transitions are governed by a free energy functional with zero learned parameters: F(c) = |entry_cost| − ΣΔH, where crystallization occurs when F(c) < 0. This achieves the same 2.26 bpb, confirming that the crystallization decision is a consequence of thermodynamics, not optimization.
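The zero-parameter rule reduces to a one-line decision; the numeric values in the example are illustrative, not measured:

```python
def should_crystallize(entry_cost: float, harmonic_drops: list[float]) -> bool:
    """Free-energy rule F(c) = |entry_cost| - sum(delta_H): crystallize a
    context into the trie when F(c) < 0, i.e. when the accumulated drop
    in harmonic energy pays for a frozen lookup entry. No learned
    parameters. Example values below are illustrative."""
    F = abs(entry_cost) - sum(harmonic_drops)
    return F < 0

print(should_crystallize(1.0, [0.2, 0.5, 0.6]))  # drops total 1.3 > cost → True
```

Melting is the reverse decision: a crystallized context whose harmonic energy rises again makes F positive and returns to the neural layer.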
Important caveat: The trie+MLP model is architecturally a mixture-of-experts with context-dependent gating. It does not explicitly compute the Hodge decomposition. The decomposition motivated the design (the crystallization/melting dichotomy corresponds to exact/harmonic dominance), but the 2.26 bpb result validates the insight, not the mechanism.
Atlas data: topological atlas (49 languages, cycle catalogues), running coupling g(D) (full table).