The irreversibility depth of natural language

Paper 1 in the Hodge-Epsilon Program
Richard Hoekstra · April 2026

The construction

Every information source has a time arrow. Reading "the cat sat on the" forward, "the" after "on" is highly predictable; reading backward, "on" after "the" is far less constrained. This directional asymmetry is fundamental to how language works, yet it has no standard measurement.

Given a byte stream and a fixed context depth D, we construct the order-D de Bruijn graph whose vertices are observed D-grams and whose edges are empirical transitions. The Markov transition field A(c → c′) = log P_emp(c′|c) is a scalar 1-cochain. The Helmholtz-Hodge decomposition splits this field uniquely into an exact (reversible) component A_exact = d₀φ and a harmonic (irreversible) component A_harm. The exact component is the gradient of a vertex potential — transitions that obey detailed balance. The harmonic component is the cycle-current residual — the part of the transition structure that is irreducibly directional.

The harmonic fraction f(D) = ‖A_harm‖² / ‖A‖² measures how much of the transition energy is irreversible at depth D. The irreversibility depth D* is where f extrapolates to zero.
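Concretely, the decomposition is one sparse least-squares solve: project the edge field A onto the image of the vertex-to-edge incidence map d₀ and measure what is left. A minimal sketch, assuming an unweighted Euclidean norm on edges (the paper's solver and any edge weighting may differ):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import lsqr

def harmonic_fraction(text: bytes, D: int) -> float:
    """f(D): share of the log-transition field that is not a gradient."""
    # Count D-gram -> D-gram transitions (windows shifted by one byte).
    counts = {}
    for i in range(len(text) - D):
        edge = (text[i:i + D], text[i + 1:i + 1 + D])
        counts[edge] = counts.get(edge, 0) + 1
    out_tot = {}
    for (c, _), n in counts.items():
        out_tot[c] = out_tot.get(c, 0) + n
    verts = {v: i for i, v in enumerate({x for e in counts for x in e})}
    # Edge field A(c -> c') = log P_emp(c'|c), and incidence matrix d0
    # with (d0 phi)(c -> c') = phi(c') - phi(c).
    rows, cols, vals, A = [], [], [], []
    for k, ((c, cp), n) in enumerate(counts.items()):
        A.append(np.log(n / out_tot[c]))
        rows += [k, k]
        cols += [verts[cp], verts[c]]
        vals += [1.0, -1.0]
    A = np.array(A)
    d0 = coo_matrix((vals, (rows, cols)), shape=(len(A), len(verts)))
    # Exact part: least-squares potential phi; the residual is the
    # cycle-current (harmonic) component.
    phi = lsqr(d0, A, atol=1e-10, btol=1e-10)[0]
    resid = A - d0 @ phi
    return float(resid @ resid / (A @ A))
```

On a graph with no 2-cells, everything orthogonal to the gradient cochains is harmonic, so the least-squares residual is exactly the A_harm of the decomposition.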

The irreversibility profile of English

Main result: D* = 7.5 bytes for English Wikipedia (enwik8, 5M bytes). The harmonic fraction follows a stretched exponential, f(D) = exp(−(D/D₀)^k), with k = 1.91, D₀ = 4.96.
D    f(D)     Vertices    Edges
1    0.964    201         40,000
2    0.808    10,082      632,594
3    0.589    79,473      2,083,420
4    0.500    333,170     4,499,109
5    0.443    902,942     7,249,489

At D=1, nearly all transition energy is directional. At D=5, 44% remains — English retains substantial irreversibility even with 5 bytes of context. Content text is 6× more irreversible than markup (f(3) = 0.45 vs 0.07).
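The fit parameters and an extrapolated depth can be recovered directly from the enwik8 table above. A sketch: the paper's exact extrapolation rule for D* is not spelled out here, so the 0.1 cutoff below is an assumption, chosen because with the quoted k and D₀ the fitted curve gives f(7.5) ≈ 0.11.

```python
import numpy as np
from scipy.optimize import curve_fit

D = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
f = np.array([0.964, 0.808, 0.589, 0.500, 0.443])  # enwik8 f(D)

def stretched_exp(D, D0, k):
    """Stretched-exponential decay f(D) = exp(-(D/D0)^k)."""
    return np.exp(-(D / D0) ** k)

(D0, k), _ = curve_fit(stretched_exp, D, f, p0=(5.0, 2.0))

# Solve f(D*) = thresh for D*:  D* = D0 * (-ln thresh)^(1/k).
thresh = 0.1  # assumed cutoff; see lead-in
D_star = D0 * (-np.log(thresh)) ** (1.0 / k)
```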

[Figure: f(D) decay curves for 10 selected languages]

The strongest harmonic cycles

The harmonic component decomposes into cycle currents. At D=3, the strongest are rotations of common English trigrams:

Cycle              Reads as          |flow|
itl → tli → lit    title, little     29.4
ntu → tun → unt    until, country    28.4
irs → rsi → sir    first, sir        28.2
eft → fte → tef    left, after       27.8
how → owh → who    who, how          26.3

Every top cycle is a common English word read as a directed trigram rotation. The cycle catalogue is a topological description of micro-grammar: the fundamental irreversible units of the language, ranked by directional preference. Full cycle catalogues for all 49 languages are in the topological atlas.

The universal beta function

Define the running coupling g(D) = f_harm(D) / (1 − f_harm(D)) — the ratio of irreversible to reversible field energy at depth D.

Universality: The beta function β(g) ≈ −0.7g is consistent across all 49 languages, independent of word order, morphology, or script. All languages are asymptotically free: β < 0 at every measured depth, no exceptions.

The running coupling follows g(D) = g₀ exp(−0.7D), where g₀ is the only language-specific parameter. The coupling halves every ~1 byte of context.
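Asymptotic freedom is easy to check from the f(D) table: if g(D) = g₀ exp(βD) with β < 0, then ln g is linear in D with negative slope. A sketch on the enwik8 values (the paper's fitting procedure may differ):

```python
import numpy as np

D = np.arange(1, 6, dtype=float)
f = np.array([0.964, 0.808, 0.589, 0.500, 0.443])  # enwik8 f(D)

g = f / (1.0 - f)  # irreversible-to-reversible energy ratio
# g(D) = g0 * exp(slope * D)  =>  ln g is linear in D,
# and beta(g) = dg/dD = slope * g.
slope, ln_g0 = np.polyfit(D, np.log(g), 1)
# slope < 0 at every depth: asymptotically free.
```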

Word order    Mean β    Std     n
SVO           −1.13     0.80    52
SOV           −1.24     0.60    28
VSO           −1.02     0.57    8
free          −1.04     0.76    18

All four word orders share β ≈ −1.1. Typological differences shift where on the g-axis a language sits, not how fast g decreases. The full g(D) table for all 49 languages is in the running coupling data.

[Figure: Running coupling g(D) for 49 languages]

D* across 49 languages

Measured on Wikipedia text (300k bytes each, via the MediaWiki API). D* in characters is D*_bytes / bpc (mean bytes per Unicode character).
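The byte-to-character conversion is just the mean UTF-8 bytes per character of the sample. A minimal sketch (the helper names are hypothetical, not from the paper):

```python
def bytes_per_char(sample: str) -> float:
    """Mean UTF-8 bytes per Unicode character (bpc)."""
    return len(sample.encode("utf-8")) / len(sample)

def dstar_chars(dstar_bytes: float, sample: str) -> float:
    """Convert an irreversibility depth in bytes to characters."""
    return dstar_bytes / bytes_per_char(sample)
```

For pure ASCII text bpc = 1, so the two depths coincide; for CJK text bpc ≈ 3, which is why character-level depths come out much smaller than byte-level ones.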

By word order

Word order    Mean D* (chars)    n
SVO           3.90 ± 1.20        23
SOV           2.63 ± 0.99        14
VSO           4.25 ± 1.11        4
free          3.89 ± 0.58        8

SOV languages resolve directional asymmetry 1.3 characters faster than SVO (Mann-Whitney U=82, p=0.007). In verb-final languages, subject and object constrain the verb strongly; the irreversibility runs out sooner.

[Figure: D* by word order]

By morphological type

Type             Mean D* (chars)    n
Isolating        2.26 ± 1.01        4
Agglutinating    3.56 ± 1.36        15
Fusional         3.74 ± 1.04        30

Isolating languages (Chinese D* = 1.3 chars) resolve fastest — each character-morpheme is self-contained. The complete ranked table with all 49 languages is in the atlas.

CJK U-curves

Character-level validation reveals a qualitative difference between alphabetic and logographic scripts:

Language    f(1)     f(2)     f(3)     f(4)     f(5)     D* (chars)
English     0.821    0.634    0.540    0.477    0.486    3.63
Chinese     0.840    0.822    0.799    0.806    0.835    >5
Japanese    0.782    0.746    0.727    0.771    0.813    >5
Korean      0.771    0.661    0.715    0.759    0.803    >5

CJK f(D) is non-monotone: it decreases to D=2–3 then rises. Each additional character of context reveals new directional asymmetry (compound words, particles) rather than resolving it. Alphabetic scripts decay monotonically. This qualitative difference is invisible to byte-level analysis.

Sprachbünde from bytes

Euclidean distance on the f(D) profile clusters languages by structural similarity — capturing both genetic and contact-induced resemblance.

Distance    Pair                    Relationship
0.008       Norwegian – Finnish     Nordic Sprachbund (Uralic + Germanic)
0.010       Portuguese – Italian    Romance sisters
0.012       English – French        Norman contact (Germanic + Romance)
0.013       Hungarian – Slovak      Kingdom of Hungary (Uralic + Slavic)
0.014       Russian – Ukrainian     East Slavic sisters

Balkan Sprachbund: mean intra-distance 0.032 vs mean to-other 0.068, ratio 0.47. The Balkan languages (Bulgarian, Romanian, Greek, Serbian) are 2× closer to each other than to the rest, despite spanning three genetic families. Full phylogeny in the atlas.
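The clustering metric is nothing more than the Euclidean distance between f(D) vectors. A sketch using the character-level profiles from the CJK table above:

```python
import numpy as np

def profile_distance(fa, fb):
    """Euclidean distance between two f(D) profiles."""
    return float(np.linalg.norm(np.asarray(fa) - np.asarray(fb)))

# Character-level profiles (D = 1..5) from the CJK table:
en = [0.821, 0.634, 0.540, 0.477, 0.486]
zh = [0.840, 0.822, 0.799, 0.806, 0.835]
ja = [0.782, 0.746, 0.727, 0.771, 0.813]
# Chinese sits far closer to Japanese than to English on this metric.
```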

Distributional divergence

A flow graph connecting distributionally similar contexts (KNN on symmetric KL divergence) captures non-local structure. The difference Δf = f_flow − f_deBruijn measures whether non-local connections add or remove irreversibility.

Language    Script        Δf        Interpretation
English     Latin         −0.131    wormholes simplify
Danish      Latin         −0.079    wormholes simplify
Turkish     Latin         −0.009    neutral
Arabic      Arabic        +0.006    neutral
Japanese    CJK           +0.072    wormholes create
Chinese     CJK           +0.105    wormholes create
Hindi       Devanagari    +0.170    wormholes create
Korean      Hangul        +0.171    wormholes create

The sign splits on encoding type: Latin-script languages are negative (distributional connections simplify); multibyte-script languages are positive (distributional connections reveal structure the bytes didn't show). Vietnamese (Latin script, +0.101) is the one exception — its tonal diacritics create byte diversity within distributional classes.
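The similarity measure behind the flow graph can be sketched as follows (the KNN construction and the choice of k are omitted; the eps smoothing is an assumption to keep the divergence finite on sparse counts):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two next-byte distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

Contexts whose next-byte distributions have small sym_kl get connected ("wormholes"), however different their bytes look.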

Cross-domain comparison

Source             f(5)     Domain
Bach MIDI          0.701    Music (strict counterpoint)
Joplin MIDI        0.659    Music (ragtime)
Debussy MIDI       0.570    Music (impressionism)
English text       0.492    Language
Lean proofs        0.464    Formal proofs
E. coli protein    0.882    Protein sequences

Music retains more directional structure than language at every depth — temporal order IS the content. Protein sequences are near-random (f(3) = 0.98 vs 0.59 for English) — biological structure is spatial, not sequential. Cross-domain data: music, protein.

[Figure: Cross-domain formality ladder]

GPT-2 diagnostic

GPT-2 generated text is slightly less irreversible than real English at D ≤ 3 (1–2% lower harmonic fraction), with a crossover at D ~ 4. At D = 5, GPT-2 is 7–8% more irreversible — anomalously rigid. This structural difference is invisible to perplexity. D* provides a new diagnostic for language model quality.

Negative results

Holonomy K(c) is weaker than KL divergence. Partial Spearman correlation +0.24 with held-out loss, controlling for depth. KL divergence achieves +0.39; conditional entropy +0.57.

Scaling exponent α = 0.148, not the predicted 1/3. The power-law form fits well; the specific value was wrong.

Fiedler partition detects script, not grammar. The spectral gap separates character sets, not syntactic categories.

FSI difficulty correlates with bpc, not spectral complexity. Script complexity, not structural complexity, predicts perceived difficulty.

The method requires no parser, no grammar, no domain knowledge — only a byte stream and a sparse linear solve: one conjugate gradient solve per (language, depth) pair, 49 languages × 5 depths in ~40 min on CPU.