Papers · Atlas · Data

Spectral phylogeny of languages via the Ihara zeta function

Research note

Richard Hoekstra · April 2026

The claim

Welsh and Tagalog are sister languages in the zeta metric. The dendrogram from 49 languages' cycle catalogues recovers four known families and assigns every orphan to a structural cluster. The entire pipeline is 240 lines of Python.

The Ihara zeta function

For a finite graph G = (V, E), the Ihara zeta function counts closed walks with no backtracking:

ζ_G(s) = Π_[p] (1 − s^|p|)⁻¹

where [p] ranges over equivalence classes of primitive closed walks. By Bass's theorem:

ζ_G(s)⁻¹ = (1 − s²)^r−1 det(I − sA + s²(D − I))

with A the adjacency matrix, D the degree matrix, and r = |E| − |V| + components the first Betti number.

The zeta distance

Applied to a byte stream, we build the D=3 de Bruijn context graph. The top-20 cycles are the 20 primitive closed walks with smallest length among cycles with nonzero Jacobian multiplicity. The zeta distance between languages L and L' is Jaccard distance on their cycle catalogues:

d_ζ(L, L') = 1 − |C(L) ∩ C(L')| / |C(L) ∪ C(L')|

The dendrogram

UPGMA clustering on 49 languages, cut at d ≤ 0.985:

Cluster	Languages	Distance
Scandinavian	sv — (no — da)	0.919
West Slavic	cs — sk	0.889
Iberian	es — pt	0.857
Austronesian	id — ms	0.857
Celtic/Austronesian	cy — (ga — tl)	0.919
Romance	fr — la	0.889
Polish-Germanic	pl — (de — (fr — la))	0.956
Uralic/Baltic	fi — lv	0.919
Dravidian/Semitic	te — (ta — ar)	0.904
Slavic/Altaic	ja — (uk — mn)	0.904

The finding: Welsh (Insular Celtic) and Tagalog (Western Malayo-Polynesian) enter the tree as sister terminals under Irish, at the same distance that Norwegian and Danish join Swedish. By every classical metric they have nothing in common: separated by 10,000 km and 6,000 years.

What Welsh and Tagalog share

Two specific trigram cycles appear in both:

Cycle class	Languages containing it
aer / era / rae	Dutch, Portuguese, Romanian, Tagalog, Welsh
aig / gai / iga	Indonesian, Irish, Tagalog, Welsh

Both encode a vowel-resonant-vowel (V-R-V) pattern. In Welsh, these arise from soft mutation: consonants between vowels lenite to approximants. In Tagalog, they arise from infix -um- and reduplication. In Irish, VSO order plus initial mutations produce the same surface pattern. The three languages share surface phonotactics, not common ancestry.

This is precisely what the Hodge orthogonality theorem enables: the cycle space of a graph is orthogonal to the gradient space. Languages with the same cycles have the same harmonic content regardless of their lexicon or inflectional history.

The orthographic opacity number

Define L_b(L) = mean cycle length at byte level and L_p(L) = mean cycle length at phoneme level (via IPA transcription). For 30 languages with both measurements:

Opacity measurement corr(L_b, L_p) = 0.12, ρ_Spearman = 0.14. Orthography is almost completely independent of phonology at the cycle-catalogue level.

Language	L_b	L_p	Opacity (L_p − L_b)
Spanish	3.02	3.05	+0.03
English	3.47	2.91	−0.56
Welsh	2.88	3.91	+1.03

English has negative opacity: its trigram graph has more structure at byte level than phoneme level, because English orthography encodes etymology (ght, ph, ough) that phonology has erased. Welsh is the opposite: transparent writing system but initial mutations generate phonological cycle structure invisible to the byte graph.

Formal backing

File	Theorem	Statement
HodgeGraph.lean	hodge_orthogonal	Exact and harmonic components are orthogonal (0 sorrys)
HodgeIharaBridge.lean	b1_unifies_zeta_homology_coding	b₁ simultaneously computes: cyclomatic complexity, harmonic rank, Ihara pole order, and code check-space dimension (0 sorrys)

Four disparate classical invariants coincide because they are all b₁. The Jaccard distance on cycle catalogues is a distance on a direct sum of four classical invariants simultaneously.

Reproducibility

# empirical
python3 atlas.py       # compute cycle catalogues
python3 phylogeny.py   # produce dendrogram

# formal
lake build Proof.HodgeGraph
lake build Proof.HodgeIharaBridge

Full pipeline wall clock on a laptop: ~8 minutes.

Full paper (HTML) PDF