Spectral phylogeny of languages via the Ihara zeta function

Research note
Richard Hoekstra · April 2026

The claim

Welsh and Tagalog are sister languages in the zeta metric. The dendrogram from 49 languages' cycle catalogues recovers four known families and assigns every orphan to a structural cluster. The entire pipeline is 240 lines of Python.

The Ihara zeta function

For a finite graph G = (V, E), the Ihara zeta function counts closed walks with no backtracking:

ζG(s) = Π[p] (1 − s|p|)−1

where [p] ranges over equivalence classes of primitive closed walks. By Bass's theorem:

ζG(s)−1 = (1 − s²)r−1 det(I − sA + s²(D − I))

with A the adjacency matrix, D the degree matrix, and r = |E| − |V| + components the first Betti number.

The zeta distance

Applied to a byte stream, we build the D=3 de Bruijn context graph. The top-20 cycles are the 20 primitive closed walks with smallest length among cycles with nonzero Jacobian multiplicity. The zeta distance between languages L and L' is Jaccard distance on their cycle catalogues:

dζ(L, L') = 1 − |C(L) ∩ C(L')| / |C(L) ∪ C(L')|

The dendrogram

UPGMA clustering on 49 languages, cut at d ≤ 0.985:

ClusterLanguagesDistance
Scandinaviansv — (no — da)0.919
West Slaviccs — sk0.889
Iberianes — pt0.857
Austronesianid — ms0.857
Celtic/Austronesiancy — (ga — tl)0.919
Romancefr — la0.889
Polish-Germanicpl — (de — (fr — la))0.956
Uralic/Balticfi — lv0.919
Dravidian/Semiticte — (ta — ar)0.904
Slavic/Altaicja — (uk — mn)0.904
The finding: Welsh (Insular Celtic) and Tagalog (Western Malayo-Polynesian) enter the tree as sister terminals under Irish, at the same distance that Norwegian and Danish join Swedish. By every classical metric they have nothing in common: separated by 10,000 km and 6,000 years.

What Welsh and Tagalog share

Two specific trigram cycles appear in both:

Cycle classLanguages containing it
aer / era / raeDutch, Portuguese, Romanian, Tagalog, Welsh
aig / gai / igaIndonesian, Irish, Tagalog, Welsh

Both encode a vowel-resonant-vowel (V-R-V) pattern. In Welsh, these arise from soft mutation: consonants between vowels lenite to approximants. In Tagalog, they arise from infix -um- and reduplication. In Irish, VSO order plus initial mutations produce the same surface pattern. The three languages share surface phonotactics, not common ancestry.

This is precisely what the Hodge orthogonality theorem enables: the cycle space of a graph is orthogonal to the gradient space. Languages with the same cycles have the same harmonic content regardless of their lexicon or inflectional history.

The orthographic opacity number

Define Lb(L) = mean cycle length at byte level and Lp(L) = mean cycle length at phoneme level (via IPA transcription). For 30 languages with both measurements:

Opacity measurement corr(Lb, Lp) = 0.12, ρSpearman = 0.14. Orthography is almost completely independent of phonology at the cycle-catalogue level.
LanguageLbLpOpacity (Lp − Lb)
Spanish3.023.05+0.03
English3.472.91−0.56
Welsh2.883.91+1.03

English has negative opacity: its trigram graph has more structure at byte level than phoneme level, because English orthography encodes etymology (ght, ph, ough) that phonology has erased. Welsh is the opposite: transparent writing system but initial mutations generate phonological cycle structure invisible to the byte graph.

Formal backing

FileTheoremStatement
HodgeGraph.leanhodge_orthogonalExact and harmonic components are orthogonal (0 sorrys)
HodgeIharaBridge.leanb1_unifies_zeta_homology_codingb1 simultaneously computes: cyclomatic complexity, harmonic rank, Ihara pole order, and code check-space dimension (0 sorrys)

Four disparate classical invariants coincide because they are all b1. The Jaccard distance on cycle catalogues is a distance on a direct sum of four classical invariants simultaneously.

Reproducibility

# empirical
python3 atlas.py       # compute cycle catalogues
python3 phylogeny.py   # produce dendrogram

# formal
lake build Proof.HodgeGraph
lake build Proof.HodgeIharaBridge

Full pipeline wall clock on a laptop: ~8 minutes.

Full paper (HTML) PDF