Revealing and exploiting coevolution through protein language models
Protein sequences carry rich evolutionary information that reflects structural and functional constraints as well as shared ancestry. In this thesis, I explore how protein language models (pLMs) can uncover and leverage these signals, particularly coevolutionary information, through the use of multiple sequence alignments (MSAs). I also investigate how these models can be made homology-aware in alignment-independent ways, thus expanding their applicability in protein modeling. First, I show that transformer-based pLMs trained on MSAs capture detailed phylogenetic relationships: column attention patterns in MSA Transformer correlate strongly with pairwise sequence similarity, revealing a natural separation of phylogenetic and coevolutionary signals within the model's architecture. Building further on MSA Transformer, I develop an iterative generation method that produces realistic and diverse protein sequences. These synthetic sequences not only match natural sequences on evaluation metrics but also outperform those generated by traditional models, particularly for shallow MSAs. Next, I address the problem of pairing interacting protein sequences, which is crucial for predicting the structures of protein complexes. I introduce DiffPALM, a method that pairs interacting paralogs by minimizing the masked language modeling loss of paired MSAs in a differentiable way. This unsupervised approach improves pairing accuracy and enhances the structure prediction of some protein complexes when used as input to AlphaFold-Multimer. Alignment methods, however, are often imperfect. To overcome the limitations of MSAs, I introduce ProtMamba, a lightweight, alignment-free model capable of processing long concatenations of homologous sequences. ProtMamba matches or surpasses the performance of larger transformer models on tasks such as sequence generation and fitness prediction, while being more computationally efficient. Finally, I present RAG-ESM, a retrieval-augmented framework that adds homology awareness to pretrained single-sequence pLMs by conditioning them on retrieved homologs through cross-attention. RAG-ESM achieves improved prediction and generation performance with minimal computational overhead and reveals emergent sequence alignment capabilities, making it a strong candidate for scalable protein design applications. Together, these contributions demonstrate how protein language models can reveal, disentangle, and harness coevolutionary signals in different ways, and they offer new paths forward for computational protein science.
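The first contribution rests on comparing MSA Transformer's column attentions with a simple measure of sequence similarity. The minimal sketch below illustrates the idea only: it assumes a column-attention tensor of shape (layers, heads, columns, sequences, sequences) has already been extracted for one MSA, averages it over columns to obtain one sequence-by-sequence map per attention head, and correlates each map with pairwise Hamming similarity. The tensor layout, function names, and toy data are illustrative assumptions, not the thesis's actual code.

import numpy as np

def hamming_similarity(msa):
    # Fraction of identical positions between every pair of aligned sequences.
    arr = np.array([list(seq) for seq in msa])
    n = arr.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = np.mean(arr[i] == arr[j])
    return sim

def column_attention_vs_similarity(col_attn, msa):
    # col_attn: assumed array of shape (layers, heads, columns, n_seq, n_seq),
    # e.g. column attentions extracted from MSA Transformer for one MSA.
    # Average attention over columns to get one sequence-by-sequence map per head,
    # then correlate each map with Hamming similarity over off-diagonal pairs.
    sim = hamming_similarity(msa)
    iu = np.triu_indices(len(msa), k=1)
    head_maps = col_attn.mean(axis=2)            # (layers, heads, n_seq, n_seq)
    n_layers, n_heads = head_maps.shape[:2]
    corr = np.zeros((n_layers, n_heads))
    for l in range(n_layers):
        for h in range(n_heads):
            corr[l, h] = np.corrcoef(head_maps[l, h][iu], sim[iu])[0, 1]
    return corr

# Toy usage with random "attention" on a tiny MSA of four aligned sequences:
msa = ["MKTAYIA", "MKTAYLA", "MRTAYIA", "MKSAYIA"]
fake_attn = np.random.rand(2, 3, len(msa[0]), len(msa), len(msa))
print(column_attention_vs_similarity(fake_attn, msa))

A strong per-head correlation under this kind of analysis is what indicates that some column-attention heads track phylogenetic relatedness rather than coevolutionary couplings.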