Welsh and Tagalog are sister languages in the zeta metric. The dendrogram from 49 languages' cycle catalogues recovers four known families and assigns every orphan to a structural cluster. The entire pipeline is 240 lines of Python.
For a finite graph G = (V, E), the Ihara zeta function counts closed walks with no backtracking:
ζG(s) = Π[p] (1 − s|p|)−1
where [p] ranges over equivalence classes of primitive closed walks. By Bass's theorem:
ζG(s)−1 = (1 − s²)r−1 det(I − sA + s²(D − I))
with A the adjacency matrix, D the degree matrix, and r = |E| − |V| + components the first Betti number.
Applied to a byte stream, we build the D=3 de Bruijn context graph. The top-20 cycles are the 20 primitive closed walks with smallest length among cycles with nonzero Jacobian multiplicity. The zeta distance between languages L and L' is Jaccard distance on their cycle catalogues:
dζ(L, L') = 1 − |C(L) ∩ C(L')| / |C(L) ∪ C(L')|
UPGMA clustering on 49 languages, cut at d ≤ 0.985:
| Cluster | Languages | Distance |
|---|---|---|
| Scandinavian | sv — (no — da) | 0.919 |
| West Slavic | cs — sk | 0.889 |
| Iberian | es — pt | 0.857 |
| Austronesian | id — ms | 0.857 |
| Celtic/Austronesian | cy — (ga — tl) | 0.919 |
| Romance | fr — la | 0.889 |
| Polish-Germanic | pl — (de — (fr — la)) | 0.956 |
| Uralic/Baltic | fi — lv | 0.919 |
| Dravidian/Semitic | te — (ta — ar) | 0.904 |
| Slavic/Altaic | ja — (uk — mn) | 0.904 |
Two specific trigram cycles appear in both:
| Cycle class | Languages containing it |
|---|---|
| aer / era / rae | Dutch, Portuguese, Romanian, Tagalog, Welsh |
| aig / gai / iga | Indonesian, Irish, Tagalog, Welsh |
Both encode a vowel-resonant-vowel (V-R-V) pattern. In Welsh, these arise from soft mutation: consonants between vowels lenite to approximants. In Tagalog, they arise from infix -um- and reduplication. In Irish, VSO order plus initial mutations produce the same surface pattern. The three languages share surface phonotactics, not common ancestry.
This is precisely what the Hodge orthogonality theorem enables: the cycle space of a graph is orthogonal to the gradient space. Languages with the same cycles have the same harmonic content regardless of their lexicon or inflectional history.
Define Lb(L) = mean cycle length at byte level and Lp(L) = mean cycle length at phoneme level (via IPA transcription). For 30 languages with both measurements:
| Language | Lb | Lp | Opacity (Lp − Lb) |
|---|---|---|---|
| Spanish | 3.02 | 3.05 | +0.03 |
| English | 3.47 | 2.91 | −0.56 |
| Welsh | 2.88 | 3.91 | +1.03 |
English has negative opacity: its trigram graph has more structure at byte level than phoneme level, because English orthography encodes etymology (ght, ph, ough) that phonology has erased. Welsh is the opposite: transparent writing system but initial mutations generate phonological cycle structure invisible to the byte graph.
| File | Theorem | Statement |
|---|---|---|
| HodgeGraph.lean | hodge_orthogonal | Exact and harmonic components are orthogonal (0 sorrys) |
| HodgeIharaBridge.lean | b1_unifies_zeta_homology_coding | b1 simultaneously computes: cyclomatic complexity, harmonic rank, Ihara pole order, and code check-space dimension (0 sorrys) |
Four disparate classical invariants coincide because they are all b1. The Jaccard distance on cycle catalogues is a distance on a direct sum of four classical invariants simultaneously.
# empirical
python3 atlas.py # compute cycle catalogues
python3 phylogeny.py # produce dendrogram
# formal
lake build Proof.HodgeGraph
lake build Proof.HodgeIharaBridge
Full pipeline wall clock on a laptop: ~8 minutes.
Full paper (HTML) PDF