The irreversibility depth of natural language

Paper 1 in the Hodge-Epsilon Program
Richard Hoekstra · April 2026

The construction

Every information source has a time arrow. Reading "the cat sat on the" forward, "the" after "on" is highly predictable; reading backward, "on" after "the" is far less constrained. This directional asymmetry is fundamental to how language works, yet it has no standard measurement.

Given a byte stream and a fixed context depth D, we construct the order-D de Bruijn graph whose vertices are observed D-grams and whose edges are empirical transitions. The Markov transition field A(c → c′) = log P_emp(c′|c) is a scalar 1-cochain. The Helmholtz-Hodge decomposition splits this field uniquely into an exact (reversible) component A_exact = d₀φ and a harmonic (irreversible) component A_harm. The exact component is the gradient of a vertex potential — transitions that obey detailed balance. The harmonic component is the cycle-current residual — the part of the transition structure that is irreducibly directional.

The harmonic fraction f(D) = ‖A_harm‖² / ‖A‖² measures how much of the transition energy is irreversible at depth D. The irreversibility depth D* is where f extrapolates to zero.
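Concretely, the decomposition is one sparse least-squares solve: project the edge field A onto the image of the vertex-to-edge incidence map d₀ and measure what is left. A minimal sketch, assuming an unweighted Euclidean norm on edges (the paper's solver and any edge weighting may differ):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import lsqr

def harmonic_fraction(text: bytes, D: int) -> float:
    """f(D): share of the log-transition field that is not a gradient."""
    # Count D-gram -> D-gram transitions (windows shifted by one byte).
    counts = {}
    for i in range(len(text) - D):
        edge = (text[i:i + D], text[i + 1:i + 1 + D])
        counts[edge] = counts.get(edge, 0) + 1
    out_tot = {}
    for (c, _), n in counts.items():
        out_tot[c] = out_tot.get(c, 0) + n
    verts = {v: i for i, v in enumerate({x for e in counts for x in e})}
    # Edge field A(c -> c') = log P_emp(c'|c), and incidence matrix d0
    # with (d0 phi)(c -> c') = phi(c') - phi(c).
    rows, cols, vals, A = [], [], [], []
    for k, ((c, cp), n) in enumerate(counts.items()):
        A.append(np.log(n / out_tot[c]))
        rows += [k, k]
        cols += [verts[cp], verts[c]]
        vals += [1.0, -1.0]
    A = np.array(A)
    d0 = coo_matrix((vals, (rows, cols)), shape=(len(A), len(verts)))
    # Exact part: least-squares potential phi; the residual is the
    # cycle-current (harmonic) component.
    phi = lsqr(d0, A, atol=1e-10, btol=1e-10)[0]
    resid = A - d0 @ phi
    return float(resid @ resid / (A @ A))
```

On a graph with no 2-cells, everything orthogonal to the gradient cochains is harmonic, so the least-squares residual is exactly the A_harm of the decomposition.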

The irreversibility profile of English

Main result: D* = 7.5 bytes for English Wikipedia (enwik8, 5M bytes). The harmonic fraction follows a stretched exponential, f(D) = exp(−(D/D₀)^k), with k = 1.91, D₀ = 4.96.
D    f(D)     Vertices    Edges
1    0.964    201         40,000
2    0.808    10,082      632,594
3    0.589    79,473      2,083,420
4    0.500    333,170     4,499,109
5    0.443    902,942     7,249,489

At D=1, nearly all transition energy is directional. At D=5, 44% remains — English retains substantial irreversibility even with 5 bytes of context. Content text is 6× more irreversible than markup (f(3) = 0.45 vs 0.07).
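The fit parameters and an extrapolated depth can be recovered directly from the enwik8 table above. A sketch: the paper's exact extrapolation rule for D* is not spelled out here, so the 0.1 cutoff below is an assumption, chosen because with the quoted k and D₀ the fitted curve gives f(7.5) ≈ 0.11.

```python
import numpy as np
from scipy.optimize import curve_fit

D = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
f = np.array([0.964, 0.808, 0.589, 0.500, 0.443])  # enwik8 f(D)

def stretched_exp(D, D0, k):
    """Stretched-exponential decay f(D) = exp(-(D/D0)^k)."""
    return np.exp(-(D / D0) ** k)

(D0, k), _ = curve_fit(stretched_exp, D, f, p0=(5.0, 2.0))

# Solve f(D*) = thresh for D*:  D* = D0 * (-ln thresh)^(1/k).
thresh = 0.1  # assumed cutoff; see lead-in
D_star = D0 * (-np.log(thresh)) ** (1.0 / k)
```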

[Figure: f(D) decay curves for 10 selected languages]

The strongest harmonic cycles

The harmonic component decomposes into cycle currents. At D=3, the strongest are rotations of common English trigrams:

Cycle              Reads as          |flow|
itl → tli → lit    title, little     29.4
ntu → tun → unt    until, country    28.4
irs → rsi → sir    first, sir        28.2
eft → fte → tef    left, after       27.8
how → owh → who    who, how          26.3

Every top cycle is a common English word read as a directed trigram rotation. The cycle catalogue is a topological description of micro-grammar: the fundamental irreversible units of the language, ranked by directional preference. Full cycle catalogues for all 49 languages are in the topological atlas.

The universal beta function

Define the running coupling g(D) = f_harm(D) / (1 − f_harm(D)) — the ratio of irreversible to reversible field energy at depth D.

Universality: The beta function β(g) ≈ −0.7g is consistent across all 49 languages, independent of word order, morphology, or script. All languages are asymptotically free: β < 0 at every measured depth, no exceptions.

The running coupling follows g(D) = g₀ exp(−0.7D), where g₀ is the only language-specific parameter. The coupling halves every ~1 byte of context.
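Asymptotic freedom is easy to check from the f(D) table: if g(D) = g₀ exp(βD) with β < 0, then ln g is linear in D with negative slope. A sketch on the enwik8 values (the paper's fitting procedure may differ):

```python
import numpy as np

D = np.arange(1, 6, dtype=float)
f = np.array([0.964, 0.808, 0.589, 0.500, 0.443])  # enwik8 f(D)

g = f / (1.0 - f)  # irreversible-to-reversible energy ratio
# g(D) = g0 * exp(slope * D)  =>  ln g is linear in D,
# and beta(g) = dg/dD = slope * g.
slope, ln_g0 = np.polyfit(D, np.log(g), 1)
# slope < 0 at every depth: asymptotically free.
```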

Word order    Mean β    Std     n
SVO           −1.13     0.80    52
SOV           −1.24     0.60    28
VSO           −1.02     0.57    8
free          −1.04     0.76    18

All four word orders share β ≈ −1.1. Typological differences shift where on the g-axis a language sits, not how fast g decreases. The full g(D) table for all 49 languages is in the running coupling data.

[Figure: Running coupling g(D) for 49 languages]

D* across 49 languages

Measured on Wikipedia text (300k bytes each, via the MediaWiki API). D* in characters is D*_bytes / bpc (mean bytes per Unicode character).
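The byte-to-character conversion is just the mean UTF-8 bytes per character of the sample. A minimal sketch (the helper names are hypothetical, not from the paper):

```python
def bytes_per_char(sample: str) -> float:
    """Mean UTF-8 bytes per Unicode character (bpc)."""
    return len(sample.encode("utf-8")) / len(sample)

def dstar_chars(dstar_bytes: float, sample: str) -> float:
    """Convert an irreversibility depth in bytes to characters."""
    return dstar_bytes / bytes_per_char(sample)
```

For pure ASCII text bpc = 1, so the two depths coincide; for CJK text bpc ≈ 3, which is why character-level depths come out much smaller than byte-level ones.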

By word order

Word order    Mean D* (chars)    n
SVO           3.90 ± 1.20        23
SOV           2.63 ± 0.99        14
VSO           4.25 ± 1.11        4
free          3.89 ± 0.58        8

SOV languages resolve directional asymmetry 1.3 characters faster than SVO (Mann-Whitney U=82, p=0.007). In verb-final languages, subject and object constrain the verb strongly; the irreversibility runs out sooner.

[Figure: D* by word order]

By morphological type

Type             Mean D* (chars)    n
Isolating        2.26 ± 1.01        4
Agglutinating    3.56 ± 1.36        15
Fusional         3.74 ± 1.04        30

Isolating languages (Chinese D* = 1.3 chars) resolve fastest — each character-morpheme is self-contained. The complete ranked table with all 49 languages is in the atlas.

CJK U-curves

Character-level validation reveals a qualitative difference between alphabetic and logographic scripts:

Language    f(1)     f(2)     f(3)     f(4)     f(5)     D* (chars)
English     0.821    0.634    0.540    0.477    0.486    3.63
Chinese     0.840    0.822    0.799    0.806    0.835    >5
Japanese    0.782    0.746    0.727    0.771    0.813    >5
Korean      0.771    0.661    0.715    0.759    0.803    >5

CJK f(D) is non-monotone: it decreases to D=2–3 then rises. Each additional character of context reveals new directional asymmetry (compound words, particles) rather than resolving it. Alphabetic scripts decay monotonically. This qualitative difference is invisible to byte-level analysis.

Sprachbünde from bytes

Euclidean distance on the f(D) profile clusters languages by structural similarity — capturing both genetic and contact-induced resemblance.

Distance    Pair                    Relationship
0.008       Norwegian – Finnish     Nordic Sprachbund (Uralic + Germanic)
0.010       Portuguese – Italian    Romance sisters
0.012       English – French        Norman contact (Germanic + Romance)
0.013       Hungarian – Slovak      Kingdom of Hungary (Uralic + Slavic)
0.014       Russian – Ukrainian     East Slavic sisters

Balkan Sprachbund: mean intra-distance 0.032 vs mean to-other 0.068, ratio 0.47. The Balkan languages (Bulgarian, Romanian, Greek, Serbian) are 2× closer to each other than to the rest, despite spanning three genetic families. Full phylogeny in the atlas.
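The clustering metric is nothing more than the Euclidean distance between f(D) vectors. A sketch using the character-level profiles from the CJK table above:

```python
import numpy as np

def profile_distance(fa, fb):
    """Euclidean distance between two f(D) profiles."""
    return float(np.linalg.norm(np.asarray(fa) - np.asarray(fb)))

# Character-level profiles (D = 1..5) from the CJK table:
en = [0.821, 0.634, 0.540, 0.477, 0.486]
zh = [0.840, 0.822, 0.799, 0.806, 0.835]
ja = [0.782, 0.746, 0.727, 0.771, 0.813]
# Chinese sits far closer to Japanese than to English on this metric.
```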

Distributional divergence

A flow graph connecting distributionally similar contexts (KNN on symmetric KL divergence) captures non-local structure. The difference Δf = f_flow − f_deBruijn measures whether non-local connections add or remove irreversibility.

Language    Script        Δf        Interpretation
English     Latin         −0.131    wormholes simplify
Danish      Latin         −0.079    wormholes simplify
Turkish     Latin         −0.009    neutral
Arabic      Arabic        +0.006    neutral
Japanese    CJK           +0.072    wormholes create
Chinese     CJK           +0.105    wormholes create
Hindi       Devanagari    +0.170    wormholes create
Korean      Hangul        +0.171    wormholes create

The sign splits on encoding type: Latin-script languages are negative (distributional connections simplify); multibyte-script languages are positive (distributional connections reveal structure the bytes didn't show). Vietnamese (Latin script, +0.101) is the one exception — its tonal diacritics create byte diversity within distributional classes.
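The similarity measure behind the flow graph can be sketched as follows (the KNN construction and the choice of k are omitted; the eps smoothing is an assumption to keep the divergence finite on sparse counts):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two next-byte distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

Contexts whose next-byte distributions have small sym_kl get connected ("wormholes"), however different their bytes look.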

Cross-domain comparison

Source             f(5)     Domain
Bach MIDI          0.701    Music (strict counterpoint)
Joplin MIDI        0.659    Music (ragtime)
Debussy MIDI       0.570    Music (impressionism)
English text       0.492    Language
Lean proofs        0.464    Formal proofs
E. coli protein    0.882    Protein sequences

Music retains more directional structure than language at every depth — temporal order IS the content. Protein sequences are near-random (f(3) = 0.98 vs 0.59 for English) — biological structure is spatial, not sequential. Cross-domain data: music, protein.

[Figure: Cross-domain formality ladder]

GPT-2 diagnostic

GPT-2 generated text is slightly less irreversible than real English at D ≤ 3 (1–2% lower harmonic fraction), with a crossover at D ~ 4. At D = 5, GPT-2 is 7–8% more irreversible — anomalously rigid. This structural difference is invisible to perplexity. D* provides a new diagnostic for language model quality.

Negative results

Holonomy K(c) is weaker than KL divergence. Partial Spearman correlation +0.24 with held-out loss, controlling for depth. KL divergence achieves +0.39; conditional entropy +0.57.

Scaling exponent α = 0.148, not the predicted 1/3. The power-law form fits well; the specific value was wrong.

Fiedler partition detects script, not grammar. The spectral gap separates character sets, not syntactic categories.

FSI difficulty correlates with bpc, not spectral complexity. Script complexity, not structural complexity, predicts perceived difficulty.

The method requires no parser, no grammar, no domain knowledge — only a byte stream and a sparse linear solve: one conjugate gradient solve per (language, depth) pair, 49 languages × 5 depths in ~40 min on CPU.