Every information source has a time arrow. Reading "the cat sat on the" forward, "the" after "on" is highly predictable; reading backward, "on" after "the" is far less constrained. This directional asymmetry is fundamental to how language works, yet it has no standard measurement.
Given a byte stream and a fixed context depth D, we construct the order-D de Bruijn graph whose vertices are the observed D-grams and whose edges are the empirical transitions between them. The Markov transition field A(c → c') = log P_emp(c'|c) is a scalar 1-cochain on this graph. The Helmholtz-Hodge decomposition splits this field uniquely into an exact (reversible) component A_exact = d₀φ and a harmonic (irreversible) component A_harm. The exact component is the gradient of a vertex potential: the transitions that obey detailed balance. The harmonic component is the cycle-current residual: the part of the transition structure that is irreducibly directional.
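The graph construction can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline behind the reported numbers: the function name `transition_field` is hypothetical, and it assumes P_emp is the plain maximum-likelihood estimate (edge count divided by the outgoing count of the source D-gram) with no smoothing.

```python
import math
from collections import Counter

def transition_field(data: bytes, depth: int):
    """Return {(c, c_next): log P_emp(c_next | c)} for the order-`depth` graph."""
    # Vertices are the observed D-grams; edges join consecutive, overlapping D-grams.
    grams = [data[i:i + depth] for i in range(len(data) - depth + 1)]
    edge_counts = Counter(zip(grams, grams[1:]))
    # Outgoing count per source vertex (assumption: unsmoothed ML estimate).
    out_counts = Counter()
    for (c, _), n in edge_counts.items():
        out_counts[c] += n
    return {
        (c, c_next): math.log(n / out_counts[c])
        for (c, c_next), n in edge_counts.items()
    }

field = transition_field(b"the cat sat on the mat", depth=2)
vertices = {v for edge in field for v in edge}
print(len(vertices), "vertices,", len(field), "edges")
```

Every value in `field` is at most zero, and it is exactly zero when a context has a single observed successor.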
The harmonic fraction f(D) = ‖A_harm‖² / ‖A‖² measures how much of the transition energy is irreversible at depth D. The irreversibility depth D* is the depth at which f(D) extrapolates to zero.
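On a graph with no 2-cells, the exact part of a 1-cochain is its least-squares projection onto the image of d₀, and the harmonic part is the residual. The sketch below computes f from an edge list under stated assumptions: each directed transition is treated as its own oriented edge, the edge inner product is unweighted, and NumPy's `lstsq` stands in for whatever solver was actually used. The name `harmonic_fraction` is hypothetical.

```python
import numpy as np

def harmonic_fraction(edges):
    """edges: list of (tail, head, value). Returns ||A_harm||^2 / ||A||^2."""
    nodes = sorted({v for t, h, _ in edges for v in (t, h)})
    index = {v: i for i, v in enumerate(nodes)}
    # Coboundary d0: (d0 phi)(t -> h) = phi[h] - phi[t], one row per edge.
    # A self-loop produces an all-zero row, so its value is purely harmonic.
    d0 = np.zeros((len(edges), len(nodes)))
    for row, (t, h, _) in enumerate(edges):
        d0[row, index[t]] -= 1.0
        d0[row, index[h]] += 1.0
    a = np.array([value for _, _, value in edges], dtype=float)
    # Least-squares potential: d0 @ phi is the projection of A onto im(d0).
    phi, *_ = np.linalg.lstsq(d0, a, rcond=None)
    a_harm = a - d0 @ phi
    # Assumes A is not identically zero.
    return float(a_harm @ a_harm) / float(a @ a)

pure_cycle = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "a", 1.0)]
pure_gradient = [("a", "b", 2.0), ("b", "c", -1.0), ("c", "a", -1.0)]
mixed = [("a", "b", 3.0), ("b", "c", 0.0), ("c", "a", 0.0)]
print(harmonic_fraction(pure_cycle), harmonic_fraction(pure_gradient), harmonic_fraction(mixed))
```

A uniform circulation around a triangle is entirely harmonic (f close to 1), a field whose values sum to zero around the cycle is entirely exact (f close to 0), and the mixed example splits one third harmonic to two thirds exact.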
| D | f(D) | Vertices | Edges |
|---|---|---|---|
| 1 | 0.964 | 201 | 40,000 |
| 2 | 0.808 | 10,082 | 632,594 |
| 3 | 0.589 | 79,473 | 2,083,420 |
| 4 | 0.500 | 333,170 | 4,499,109 |
| 5 | 0.443 | 902,942 | 7,249,489 |
At D=1, nearly all transition energy is directional. At D=5, 44% remains — English retains substantial irreversibility even with 5 bytes of context. Content text is 6× more irreversible than markup (f(3) = 0.45 vs 0.07).
The harmonic component decomposes into cycle currents. At D=3, the strongest are rotations of common English trigrams:
| Cycle | Reads as | \|flow\| |
|---|---|---|
| itl → tli → lit | title, little | 29.4 |
| ntu → tun → unt | until, country | 28.4 |
| irs → rsi → sir | first, sir | 28.2 |
| eft → fte → tef | left, after | 27.8 |
| how → owh → who | who, how | 26.3 |
Every top cycle is a common English word read as a directed trigram rotation. The cycle catalogue is a topological description of micro-grammar: the fundamental irreversible units of the language, ranked by directional preference. Full cycle catalogues for all 49 languages are in the topological atlas.
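Each cycle in the table is the set of rotations of one trigram, and consecutive rotations are valid de Bruijn transitions because the last two symbols of one open the next. A small sketch makes this concrete; the helper names are hypothetical and nothing here computes the flow magnitudes.

```python
def is_de_bruijn_edge(c: str, c_next: str) -> bool:
    """True if c_next can follow c: the last D-1 symbols of c open c_next."""
    return len(c) == len(c_next) and c[1:] == c_next[:-1]

def rotation_cycle(gram: str):
    """All cyclic rotations of `gram`, in the order a de Bruijn walk visits them."""
    return [gram[i:] + gram[:i] for i in range(len(gram))]

cycle = rotation_cycle("itl")  # ['itl', 'tli', 'lit']
# Check that the walk closes: every step, including last -> first, is an edge.
closed = all(
    is_de_bruijn_edge(a, b) for a, b in zip(cycle, cycle[1:] + cycle[:1])
)
print(cycle, closed)
```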
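Define the running coupling g(D) = f(D) / (1 − f(D)), where f(D) is the harmonic fraction defined above. Because the exact and harmonic components are orthogonal, g(D) is the ratio of irreversible to reversible field energy at depth D.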
The running coupling follows g(D) = g₀ exp(−0.7D), where g₀ is the only language-specific parameter. Because exp(−0.7) ≈ 0.50, the coupling halves with roughly every additional byte of context.
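The arithmetic can be checked directly. The sketch below converts the English byte-level f(D) values from the first table into g(D), fits log g against D by ordinary least squares, and computes the halving distance implied by a decay rate of 0.7. It illustrates the computation only: a five-point fit on one corpus is not expected to reproduce the quoted constants, and the helper names are hypothetical.

```python
import math

# f(D) for English at D = 1..5, copied from the byte-level table above.
f = {1: 0.964, 2: 0.808, 3: 0.589, 4: 0.500, 5: 0.443}

def coupling(f_d: float) -> float:
    """Running coupling g = f / (1 - f)."""
    return f_d / (1.0 - f_d)

def log_linear_fit(points):
    """OLS fit of log g = log g0 + slope * D; returns (g0, slope)."""
    xs = [d for d, _ in points]
    ys = [math.log(g) for _, g in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return math.exp(my - slope * mx), slope

g = {d: coupling(v) for d, v in f.items()}
g0, slope = log_linear_fit(sorted(g.items()))
half_life = math.log(2) / 0.7  # bytes of context needed to halve g under exp(-0.7 D)
print(g, g0, slope, half_life)
```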
| Word order | Mean β | Std | n |
|---|---|---|---|
| SVO | −1.13 | 0.80 | 52 |
| SOV | −1.24 | 0.60 | 28 |
| VSO | −1.02 | 0.57 | 8 |
| free | −1.04 | 0.76 | 18 |
All four word orders share β ≈ −1.1: the group means differ by far less than the within-group standard deviations. Typological differences shift where on the g-axis a language sits, not how fast g decreases. The full g(D) table for all 49 languages is in the running coupling data.
Measured on Wikipedia text (300k bytes per language, retrieved via the MediaWiki API). D* in characters is the byte-level D* divided by bpc, the mean number of bytes per Unicode character.
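The byte-to-character conversion is a single division. A minimal sketch, assuming the text is UTF-8 encoded (the encoding is not stated above) and using hypothetical function names:

```python
def bytes_per_char(text: str) -> float:
    """Mean number of UTF-8 bytes per Unicode character (bpc)."""
    return len(text.encode("utf-8")) / len(text)

def d_star_chars(d_star_bytes: float, text: str) -> float:
    """Convert a byte-level D* into characters using the sample's own bpc."""
    return d_star_bytes / bytes_per_char(text)

print(bytes_per_char("abc"))        # ASCII: one byte per character
print(bytes_per_char("한국어"))      # Hangul syllables: three bytes each
# 6.0 is a made-up byte-level D*, used only to show the conversion.
print(d_star_chars(6.0, "한국어"))
```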
| Word order | Mean D* (chars) | n |
|---|---|---|
| SVO | 3.90 ± 1.20 | 23 |
| SOV | 2.63 ± 0.99 | 14 |
| VSO | 4.25 ± 1.11 | 4 |
| free | 3.89 ± 0.58 | 8 |
SOV languages resolve directional asymmetry 1.3 characters faster than SVO (Mann-Whitney U=82, p=0.007). In verb-final languages, subject and object constrain the verb strongly; the irreversibility runs out sooner.
| Type | Mean D* (chars) | n |
|---|---|---|
| Isolating | 2.26 ± 1.01 | 4 |
| Agglutinating | 3.56 ± 1.36 | 15 |
| Fusional | 3.74 ± 1.04 | 30 |
Isolating languages (Chinese D* = 1.3 chars) resolve fastest — each character-morpheme is self-contained. The complete ranked table with all 49 languages is in the atlas.
Character-level validation reveals a qualitative difference between alphabetic and logographic scripts:
| Language | f(1) | f(2) | f(3) | f(4) | f(5) | D* (chars) |
|---|---|---|---|---|---|---|
| English | 0.821 | 0.634 | 0.540 | 0.477 | 0.486 | 3.63 |
| Chinese | 0.840 | 0.822 | 0.799 | 0.806 | 0.835 | >5 |
| Japanese | 0.782 | 0.746 | 0.727 | 0.771 | 0.813 | >5 |
| Korean | 0.771 | 0.661 | 0.715 | 0.759 | 0.803 | >5 |
CJK f(D) is non-monotone: it decreases until D=2–3 and then rises. Each additional character of context reveals new directional asymmetry (compound words, particles) rather than resolving it. English, the alphabetic reference, decays steadily through D=4 and then flattens (0.477 at D=4, 0.486 at D=5). This qualitative difference is invisible to byte-level analysis.
Euclidean distance on the f(D) profile clusters languages by structural similarity — capturing both genetic and contact-induced resemblance.
| Distance | Pair | Relationship |
|---|---|---|
| 0.008 | Norwegian – Finnish | Nordic Sprachbund (Germanic + Uralic) |
| 0.010 | Portuguese – Italian | Romance sisters |
| 0.012 | English – French | Norman contact (Germanic + Romance) |
| 0.013 | Hungarian – Slovak | Kingdom of Hungary (Uralic + Slavic) |
| 0.014 | Russian – Ukrainian | East Slavic sisters |
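The distance itself is the ordinary Euclidean norm of the difference between two f(D) vectors. The sketch below applies it to the four character-level profiles tabulated earlier, since the profiles behind the pairs above are not listed here. The resulting distances are therefore not comparable to the values in this table, and the variable names are hypothetical.

```python
import math
from itertools import combinations

# Character-level f(1)..f(5), copied from the character-level table earlier.
profiles = {
    "English":  [0.821, 0.634, 0.540, 0.477, 0.486],
    "Chinese":  [0.840, 0.822, 0.799, 0.806, 0.835],
    "Japanese": [0.782, 0.746, 0.727, 0.771, 0.813],
    "Korean":   [0.771, 0.661, 0.715, 0.759, 0.803],
}

def profile_distance(a, b):
    """Euclidean distance between two f(D) profiles of equal length."""
    return math.dist(a, b)

# All language pairs, closest first.
pairs = sorted(
    combinations(profiles, 2),
    key=lambda p: profile_distance(profiles[p[0]], profiles[p[1]]),
)
for x, y in pairs:
    print(f"{profile_distance(profiles[x], profiles[y]):.3f}  {x} - {y}")
```

On these four profiles the closest pair is Japanese and Korean, and the most distant is English and Chinese.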
A flow graph connecting distributionally similar contexts (k-nearest neighbours under symmetric KL divergence) captures non-local structure. The difference Δf = f_flow − f_deBruijn measures whether these non-local connections (the "wormholes" in the table below) add or remove irreversibility.
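One way to build such non-local edges is sketched below. Several details are assumptions, because the text does not specify them: the distributions compared are next-symbol distributions per context, a small additive constant `eps` keeps the divergence finite on disjoint supports, and each context is linked to its k nearest neighbours by a directed edge. The toy distributions are invented for illustration.

```python
import math

def symmetric_kl(p, q, eps=1e-9):
    """KL(p||q) + KL(q||p) over the union of supports, with eps-smoothing."""
    keys = set(p) | set(q)
    ps = {k: p.get(k, 0.0) + eps for k in keys}
    qs = {k: q.get(k, 0.0) + eps for k in keys}
    zp, zq = sum(ps.values()), sum(qs.values())
    total = 0.0
    for k in keys:
        a, b = ps[k] / zp, qs[k] / zq
        # KL(p||q) + KL(q||p) collapses to sum of (a - b) * log(a / b).
        total += (a - b) * math.log(a / b)
    return total

def knn_edges(dists, k=1):
    """dists: {context: next-symbol distribution}. Returns directed KNN edges."""
    edges = []
    for c, p in dists.items():
        others = sorted(
            (symmetric_kl(p, q), d) for d, q in dists.items() if d != c
        )
        edges += [(c, d) for _, d in others[:k]]
    return edges

# Hypothetical next-symbol distributions for three contexts.
toy = {
    "th": {"e": 0.9, "a": 0.1},
    "sh": {"e": 0.8, "a": 0.2},
    "qu": {"i": 1.0},
}
print(knn_edges(toy, k=1))
```

In the toy input, "th" and "sh" predict similar successors and so link to each other, even though no de Bruijn transition connects them.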
| Language | Script | Δf | Interpretation |
|---|---|---|---|
| English | Latin | −0.131 | wormholes simplify |
| Danish | Latin | −0.079 | wormholes simplify |
| Turkish | Latin | −0.009 | neutral |
| Arabic | Arabic | +0.006 | neutral |
| Japanese | CJK | +0.072 | wormholes create |
| Chinese | CJK | +0.105 | wormholes create |
| Hindi | Devanagari | +0.170 | wormholes create |
| Korean | Hangul | +0.171 | wormholes create |
The sign splits on encoding type: the Latin-script languages are negative (distributional connections simplify), and the multibyte-script languages are positive (distributional connections reveal structure that the byte sequence alone does not show). Turkish and Arabic sit near zero on either side of the split. Vietnamese (Latin script, +0.101) is the one exception: its tonal diacritics create byte diversity within distributional classes.
| Source | f(5) | Domain |
|---|---|---|
| Bach MIDI | 0.701 | Music (strict counterpoint) |
| Joplin MIDI | 0.659 | Music (ragtime) |
| Debussy MIDI | 0.570 | Music (impressionism) |
| English text | 0.492 | Language |
| Lean proofs | 0.464 | Formal proofs |
| E. coli protein | 0.882 | Protein sequences |
Music retains more directional structure than language at every depth — temporal order IS the content. Protein sequences are near-random (f(3) = 0.98 vs 0.59 for English) — biological structure is spatial, not sequential. Cross-domain data: music, protein.
GPT-2-generated text is slightly less irreversible than real English at D ≤ 3 (1–2% lower harmonic fraction), with a crossover at D ≈ 4. At D = 5, GPT-2 is 7–8% more irreversible: anomalously rigid. This structural difference is invisible to perplexity, so the f(D) profile and D* offer a new diagnostic for language model quality.
Holonomy K(c) is a weaker predictor than KL divergence. Its partial Spearman correlation with held-out loss, controlling for depth, is +0.24; KL divergence achieves +0.39 and conditional entropy +0.57.
Scaling exponent α = 0.148, not the predicted 1/3. The power-law form fits well; the specific value was wrong.
Fiedler partition detects script, not grammar. The spectral gap separates character sets, not syntactic categories.
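For reference, a Fiedler partition splits a graph by the sign of the eigenvector belonging to the second-smallest eigenvalue of its Laplacian, and that eigenvalue is the spectral gap. The sketch below assumes an undirected, unweighted graph with integer node labels, which may differ from the graph used in the experiment; the function name is hypothetical.

```python
import numpy as np

def fiedler_partition(n, edges):
    """Split nodes 0..n-1 by the sign of the Fiedler vector.

    Returns (side_a, side_b, spectral_gap).
    """
    # Combinatorial Laplacian L = D - A of the undirected graph.
    lap = np.zeros((n, n))
    for u, v in edges:
        lap[u, u] += 1.0
        lap[v, v] += 1.0
        lap[u, v] -= 1.0
        lap[v, u] -= 1.0
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(lap)
    fiedler = eigenvectors[:, 1]
    side_a = {i for i in range(n) if fiedler[i] >= 0}
    return side_a, set(range(n)) - side_a, float(eigenvalues[1])

# Two triangles joined by a single bridge edge (2, 3).
barbell = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
side_a, side_b, gap = fiedler_partition(6, barbell)
print(side_a, side_b, gap)
```

The partition cuts the bridge and separates the two triangles. Which triangle ends up as `side_a` depends on the arbitrary sign of the eigenvector.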
FSI difficulty correlates with bpc, not spectral complexity. Script complexity, not structural complexity, predicts perceived difficulty.