| property | value |
|---|---|
| Organism | Escherichia coli K-12 |
| Source | UniProt (Swiss-Prot reviewed) |
| Proteins | 4531 |
| Total residues | 1387915 |
| Alphabet | 20 amino acids (ACDEFGHIKLMNPQRSTVWY) |
| Bytes/char | 1.0 (ASCII) |
| Sample | 1387915 bytes |
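
The counts in the table above can be cross-checked with a small FASTA summary. This is a minimal sketch, not the pipeline actually used: the toy records, the `summarize` helper, and the decision to merely count (rather than drop or replace) residues outside the 20-letter alphabet are all assumptions made for illustration.

```python
# Standard 20-letter amino acid alphabet, as listed in the dataset table.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def summarize(text):
    """Count proteins, residues, and residues outside the 20-letter alphabet."""
    seqs = [seq for _, seq in parse_fasta(text)]
    residues = "".join(seqs)
    return {
        "proteins": len(seqs),
        "total_residues": len(residues),
        # Assumption: non-standard symbols (e.g. U, X) are only counted here.
        "nonstandard": sum(1 for r in residues if r not in AMINO_ACIDS),
    }

# Hypothetical toy records; a real run would read the downloaded FASTA file.
toy = ">toy1\nMKTAYIAK\nQRQ\n>toy2\nMSLLU\n"
print(summarize(toy))
```

With one byte per residue, `total_residues` should equal the sample size in bytes, which is what the table reports (1387915 in both rows).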

Full proteome (D = context depth, Δf = f_real - f_shuffled):

| D | f_real | f_shuffled | Δf | Δf% |
|---|---|---|---|---|
| 1 | 0.9853 | 0.9852 | +0.0001 | +0.0% |
| 2 | 0.9801 | 0.9806 | -0.0005 | -0.1% |
| 3 | 0.9824 | 0.9827 | -0.0003 | -0.0% |
| 4 | 0.9767 | 0.9788 | -0.0021 | -0.2% |
| 5 | 0.8817 | 0.8877 | -0.0060 | -0.7% |

100-protein subset (D = context depth, Δf = f_real - f_shuffled):

| D | f_real | f_shuffled | Δf | Δf% |
|---|---|---|---|---|
| 1 | 0.9839 | 0.9836 | +0.0003 | +0.0% |
| 2 | 0.9835 | 0.9845 | -0.0010 | -0.1% |
| 3 | 0.9739 | 0.9762 | -0.0022 | -0.2% |
| 4 | 0.8804 | 0.8831 | -0.0027 | -0.3% |
| 5 | 0.8588 | 0.8638 | -0.0050 | -0.6% |
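
The Δf and Δf% columns can be recomputed from the tabulated f values. The sketch below does this for the full-proteome rows. Normalising Δf% by f_shuffled is an assumption (dividing by f_real gives the same values at one decimal place), and the `deltas` helper is hypothetical.

```python
# (D, f_real, f_shuffled) copied from the full-proteome table.
FULL = [
    (1, 0.9853, 0.9852),
    (2, 0.9801, 0.9806),
    (3, 0.9824, 0.9827),
    (4, 0.9767, 0.9788),
    (5, 0.8817, 0.8877),
]

def deltas(rows):
    """Return (D, Δf, Δf%) for each row, rounded like the table."""
    out = []
    for depth, f_real, f_shuf in rows:
        delta = f_real - f_shuf
        # Assumption: the percentage is taken relative to f_shuffled.
        rel = 100.0 * delta / f_shuf
        out.append((depth, round(delta, 4), round(rel, 1)))
    return out

for depth, delta, rel in deltas(FULL):
    print(f"D={depth}  Δf={delta:+.4f}  Δf%={rel:+.1f}%")
```

Applied to the subset rows, the same arithmetic gives -0.0023 at D=3 rather than the tabulated -0.0022, presumably because the table's Δf was computed before the f values were rounded to four decimals.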

At D=3, real protein sequences have a slightly lower harmonic fraction than shuffled ones (Δf = -0.0003 for the full proteome, -0.0022 for the 100-protein subset). This is consistent with the prediction that protein secondary structure (alpha helices, beta sheets) imposes bidirectional sequential patterns that reduce irreversibility relative to a random arrangement of the same amino acids.
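
The shuffled baseline referred to here is a random arrangement of the same amino acids. A minimal sketch of such a control is shown below. The `shuffled_control` name, the fixed seed, and shuffling one whole sequence at once (rather than per protein) are assumptions; the text does not say which variant was used.

```python
import random
from collections import Counter

def shuffled_control(sequence, seed=0):
    """Return a random permutation of `sequence` with identical composition."""
    residues = list(sequence)
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(residues)
    return "".join(residues)

# Hypothetical toy sequence, for illustration only.
real = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
shuf = shuffled_control(real, seed=42)

# Composition is preserved exactly; only the order changes.
print(Counter(real) == Counter(shuf))
```

Because the residue counts are unchanged, any difference between f_real and f_shuffled reflects residue order rather than amino acid composition.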
Key observations:
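The effect is small but consistent in direction. On the full proteome, Δf is negative at every depth from D=2 upward, ranging from -0.0003 (D=3) to -0.0060 (D=5); at D=1 it is effectively zero (+0.0001). The magnitude grows with depth overall, though not monotonically (-0.0005 at D=2 versus -0.0003 at D=3).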
Proteins are far more harmonic than natural language. At D=3, f(D) ≈ 0.98 for proteins versus f(D) ≈ 0.59 for English. The 20-letter amino acid alphabet (versus 256 possible byte values) creates a much denser de Bruijn graph with fewer structural holes, making nearly all flow harmonic.
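
The density argument can be illustrated by comparing how many depth-D contexts are possible with how many a sequence actually visits. The sketch below assumes that graph nodes correspond to length-D substrings. It does not compute f(D), whose definition is not given here, and `context_coverage` is a hypothetical helper.

```python
def context_coverage(sequence, depth, alphabet_size=20):
    """Return (seen, possible, fraction) for length-`depth` contexts."""
    seen = {sequence[i:i + depth] for i in range(len(sequence) - depth + 1)}
    possible = alphabet_size ** depth
    return len(seen), possible, len(seen) / possible

# Possible contexts per depth for a 20-letter versus a 256-symbol alphabet.
for depth in range(1, 6):
    print(depth, 20 ** depth, 256 ** depth)

# Hypothetical toy sequence: only 2 of the 400 possible 2-mers occur.
print(context_coverage("ACAC", 2))
```

At D=3 there are 20^3 = 8000 possible protein contexts, far fewer than the 1387915 residues in the sample, so each can be visited many times. At D=5 there are 20^5 = 3200000, more than the sample contains, which may be related to the drop in f for both real and shuffled sequences at that depth.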
The Δf signal strengthens with depth. At D=5, Δf = -0.0060 (-0.7%) for the full proteome and -0.0050 (-0.6%) for the subset. This is expected: longer-range structural motifs (helices span ~10-20 residues, beta strands ~5-10) only become visible at higher context depths.
The signal is consistent between the full proteome and the subset. The 100-protein subset shows the same sign at every depth and comparable magnitude (-0.0050 versus -0.0060 at D=5), suggesting that the effect is not an artifact of concatenation boundaries.
Compared to natural language (English Δf ≈ -0.05 at D=3), the protein irreversibility reduction at the same depth is roughly 20x to 170x weaker (Δf = -0.0022 on the subset, -0.0003 on the full proteome). This makes sense: protein sequences are closer to random (high entropy per residue), with structural constraints operating at longer range than the bigram/trigram patterns that dominate natural language.
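
The "high entropy per residue" point can be checked with a zeroth-order Shannon entropy estimate, compared against the log2(20) ≈ 4.32 bit maximum for a uniform 20-letter alphabet. This is a sketch under stated assumptions: `entropy_bits` is a hypothetical helper, it uses single-residue frequencies only, and it ignores any dependence on context.

```python
import math
from collections import Counter

def entropy_bits(sequence):
    """Zeroth-order Shannon entropy of `sequence`, in bits per symbol."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Upper bound for a 20-letter alphabet with uniform usage.
max_bits = math.log2(20)

# Hypothetical toy sequence, for illustration only.
toy = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(f"{entropy_bits(toy):.2f} of at most {max_bits:.2f} bits per residue")
```

A value close to the maximum on the real proteome would support the statement that residue usage is near-random at the single-symbol level; context-dependent structure would only appear in higher-order estimates.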