E. coli K-12 Proteome

property	value
Organism	Escherichia coli K-12
Source	UniProt (Swiss-Prot reviewed)
Proteins	4531
Total residues	1387915
Alphabet	20 amino acids (ACDEFGHIKLMNPQRSTVWY)
Bytes/char	1.0 (ASCII)
Sample	1387915 bytes

Harmonic fraction f(D) — full proteome

D	f_real	f_shuffled	Δf	Δf%
1	0.9853	0.9852	+0.0001	+0.0%
2	0.9801	0.9806	-0.0005	-0.1%
3	0.9824	0.9827	-0.0003	-0.0%
4	0.9767	0.9788	-0.0021	-0.2%
5	0.8817	0.8877	-0.0060	-0.7%
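
The Δf and Δf% columns follow from the first two. A minimal Python sketch, assuming Δf% is taken relative to f_shuffled (the page does not state the denominator; at this precision dividing by f_real gives the same rounded values). The helper name `delta_f` is hypothetical:

```python
# Rows copied from the full-proteome table above: (D, f_real, f_shuffled).
ROWS = [
    (1, 0.9853, 0.9852),
    (2, 0.9801, 0.9806),
    (3, 0.9824, 0.9827),
    (4, 0.9767, 0.9788),
    (5, 0.8817, 0.8877),
]


def delta_f(f_real, f_shuffled):
    """Return (delta, percent), with percent relative to f_shuffled (assumed)."""
    diff = f_real - f_shuffled
    return diff, 100.0 * diff / f_shuffled


# Print the derived columns in the same tab-separated layout as the table.
for depth, f_real, f_shuffled in ROWS:
    diff, pct = delta_f(f_real, f_shuffled)
    print(f"{depth}\t{diff:+.4f}\t{pct:+.1f}%")
```

A negative Δf means the real sequence has a lower harmonic fraction than its shuffled control.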

Harmonic fraction f(D) — first 100 proteins (48573 residues)

D	f_real	f_shuffled	Δf	Δf%
1	0.9839	0.9836	+0.0003	+0.0%
2	0.9835	0.9845	-0.0010	-0.1%
3	0.9739	0.9762	-0.0022	-0.2%
4	0.8804	0.8831	-0.0027	-0.3%
5	0.8588	0.8638	-0.0050	-0.6%

Interpretation

At D=3, real protein sequences have a lower harmonic fraction than shuffled ones (Δf = -0.0003 for the full proteome, -0.0022 for the subset). This is consistent with the prediction that protein secondary structure (alpha helices, beta sheets) imposes bidirectional sequential patterns that reduce irreversibility relative to a random arrangement of the same amino acids.
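
The shuffled baseline is a random arrangement of the same amino acids, so it preserves composition and destroys order. A minimal sketch of such a control follows. It assumes a uniform permutation of the whole sequence; the page does not say whether shuffling was done per protein or over the concatenated proteome. The function name `shuffled_control` and the toy sequence are hypothetical:

```python
import random
from collections import Counter


def shuffled_control(sequence: str, seed: int = 0) -> str:
    """Return a random permutation of `sequence` with identical residue counts."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)  # seeded so the control is reproducible
    return "".join(residues)


toy = "MKALVEELLKKAGHTWEVSGNG"  # made-up sequence, not taken from the proteome
control = shuffled_control(toy)

# Composition is unchanged; only the order of residues differs.
assert Counter(control) == Counter(toy)
print(control)
```

Because the control keeps residue frequencies fixed, any difference in f(D) between the real and shuffled sequences reflects ordering rather than composition.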

Key observations:

The effect is small but consistently negative beyond D=1. On the full proteome, Δf is effectively zero at D=1 (+0.0001) and negative for D=2 through D=5, ranging from -0.0003 (D=3) to -0.0060 (D=5).
Proteins are far more harmonic than natural language. At D=3, protein f(D) ~ 0.98 vs English f(D) ~ 0.59. The 20-letter amino acid alphabet (vs 256 bytes) creates a much denser de Bruijn graph with fewer structural holes, making nearly all flow harmonic.
The Δf signal strengthens with depth overall. At D=5, Δf = -0.0060 (-0.7%) for the full proteome, although the trend is not strictly monotonic there (-0.0005 at D=2 vs -0.0003 at D=3). This is expected: longer-range structural motifs (helices span ~10-20 residues, sheets ~5-10) only become visible at higher context depths.
The signal is consistent between the full proteome and the subset. The 100-protein subset shows the same sign for D=2 through D=5 and magnitudes of the same order (-0.0010 to -0.0050), which suggests that the effect is not an artifact of concatenation boundaries.
Compared to natural language (English Δf ~ -0.05 at D=3), protein irreversibility reduction at D=3 is roughly 20x to 170x weaker (Δf = -0.0022 for the subset, -0.0003 for the full proteome). This makes sense: protein sequences are closer to random (high entropy per residue), with structural constraints operating at longer range than the bigram/trigram patterns that dominate natural language.
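
The density argument above can be made concrete with a rough count. The sketch below assumes that a depth-D context is a string of D symbols, so an alphabet of size A has at most A^D distinct contexts. It compares that bound with the sample size from the property table. The 256-symbol figures are an upper bound only, since English text uses far fewer distinct byte values, and applying the protein sample size to the byte alphabet is an assumption made purely for comparison. The name `positions_per_context` is hypothetical:

```python
N_SYMBOLS = 1_387_915  # total residues, from the property table


def positions_per_context(alphabet_size: int, depth: int,
                          n_symbols: int = N_SYMBOLS) -> float:
    """Average number of sequence positions per possible depth-`depth` context."""
    return n_symbols / alphabet_size ** depth


for depth in range(1, 6):
    protein = positions_per_context(20, depth)   # 20 amino acids
    byte = positions_per_context(256, depth)     # 256 byte values (upper bound)
    print(f"D={depth}  20^D={20 ** depth:>9}  "
          f"protein={protein:10.2f}  bytes={byte:.2e}")
```

Under these assumptions a sample of this size averages about 173 positions per possible 3-residue context, while the 256-symbol bound leaves fewer than one position per context at the same depth. At D=5 the protein context space (3,200,000) also exceeds the sample size.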