Impact of phylogeny on inference from protein sequences: from models to natural data
Proteins are macromolecules often described as the building blocks of cells, because they are at the heart of almost every cellular task, from chemical to mechanical processes. For instance, proteins act as enzymes for digestion, transporters of oxygen in the blood, antibodies against viruses, and propellers for bacteria. Understanding proteins allows a systems-level understanding of the cell, thus providing insights into a large number of phenomena in biology and medicine. A protein can be described as a polymer, i.e. a linear chain of monomers, which are amino acids. This polymer folds into a specific three-dimensional structure that is crucial to the protein's function. Conversely, a given function can induce a conformational change in the structure. Importantly, the function of a protein is often mediated through interactions with other proteins or molecules. Proteins with similar function and structure are present in many different species. This means that there is an evolutionary relatedness, termed phylogeny, between proteins that is reflected in their composition.
In this thesis, we study the impact of phylogeny on methods that perform inference from sequences of homologous protein families. We are interested in methods that predict structural contacts in proteins, as well as interaction partners and functional groups of amino acids, termed sectors. We use methods that rely on correlations in multiple sequence alignments of proteins, specifically Potts models, which are global statistical models, and local methods such as covariance or mutual information. A challenge is that it is difficult to disentangle phylogenetic correlations from functional or structural ones in natural sequences. To overcome this, we start by generating synthetic sequences using a minimal model where the amount of phylogeny and the selection for a function can be tuned. These sequences allow us to assess the performance of prediction methods while tuning phylogenetic or functional correlations. We then show that the findings from synthetic sequences are recovered in natural or realistic sequences.
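As an illustration of what a local method computes, the following is a minimal sketch of the mutual information between two columns of a multiple sequence alignment. It is not the implementation used in the thesis: the function name `column_mutual_information` and the toy alignment are hypothetical, and the sketch assumes plain empirical frequencies with no pseudocounts, no phylogenetic reweighting of sequences, and no correction of the resulting scores.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `msa` is a list of equal-length strings, one per aligned sequence.
    Frequencies are plain empirical counts (an assumption of this sketch).
    """
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        ratio = p_ab * n * n / (counts_i[a] * counts_j[b])
        mi += p_ab * math.log2(ratio)
    return mi


# Hypothetical toy alignment: columns 0 and 1 covary perfectly,
# while column 2 varies independently of column 0.
toy_msa = ["AKL", "AKV", "GEL", "GEV"]
mi_01 = column_mutual_information(toy_msa, 0, 1)
mi_02 = column_mutual_information(toy_msa, 0, 2)
print(mi_01, mi_02)
```

In this toy alignment the covarying pair of columns receives a high score and the independent pair a score of zero. In natural alignments, shared ancestry between sequences can also produce nonzero scores between columns that are not functionally or structurally coupled, which is the confounding effect studied here.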
For the inference of structural contacts, we find that phylogenetic correlations are deleterious for both local and global methods, but that global methods are more robust to them. In contrast, for the inference of interaction partners, we find that performance is improved by combining phylogenetic and functional correlations. For the inference of sectors, we find that performance decreases when phylogenetic correlations are included on top of functional ones. Finally, we observe that out-of-equilibrium noise associated with variations of selection pressure can positively impact the inference of structural contacts.
These findings show the interplay of phylogeny and functional or structural constraints in protein sequences and in inference methods from protein sequences. They illustrate the complexity and the rich structure of biological data, and show that it can be dissected and understood.