Optimization and historical contingency in protein sequences
Protein sequences are shaped both by functional optimization and by evolutionary history, i.e. phylogeny. A multiple sequence alignment of homologous proteins contains sequences that evolved from the same ancestral sequence and have similar structure and function. In such an alignment, correlations in amino acid usage at different sites can arise from coevolution driven by structural and functional constraints, but also from historical contingency. Correlations arising from phylogeny often confound the coevolution signal from functional or structural optimization, impairing the inference of structural contacts from sequences. However, inferred Potts models are more robust than local statistics to these effects, which may explain their success. Dedicated corrections can further increase this robustness. Moreover, phylogenetic correlations can in fact provide useful information for some inference tasks, especially for inferring interaction partners from sequences among the paralogs of two protein families. In this case, signal from phylogeny and signal from constraints combine constructively, and explicitly exploiting both further improves inference performance. Protein language models have recently been applied to sequence data, greatly advancing the prediction of structure, function and mutational effects. Language models trained on multiple sequence alignments capture coevolution and structural contacts, but also phylogenetic relationships. They are able to disentangle signal from structural constraints and from phylogeny more efficiently than Potts models, and they have promising generative properties. Furthermore, they allow the prediction of interacting partners from protein sequences, outperforming traditional coevolution methods on difficult datasets.
2024-02-08
123
3
44a
REVIEWED
EPFL