Protein Domain Prediction and Alternative Splicing Histories: a Modular View

Modularity is at the core of bioinformatics. Here we focused on two building blocks: protein domains and exons. While protein domains enable us to study a protein's functions, exons allow us to understand the evolution of these domains. However, ab initio prediction of protein domains performs poorly on multi-domain proteins, and we addressed this problem in the first part of the dissertation. In the second part, we focused on the evolution of alternative splicing, which, despite its importance in higher eukaryotes, has received little attention.

Based on the observation by Wheelan et al. that domain sizes have a narrow distribution and are independent of the number of domains in a protein, we proposed a theoretical distribution for the size of a domain and, by extension, a probability distribution for the length of a protein given its number of domains. Using these distributions, we designed an ab initio predictor for domain boundaries: given a set of potential boundaries, our method detects the most likely subset. The set of potential boundaries was obtained from three existing features and a novel one based on PSI-BLAST alignments. The resulting software, DomML, outperformed all tested methods on multi-domain proteins, the most demanding category. We then developed a consensus method based on DomML and compared it to a standard majority-vote method. Our consensus performed well with low- and medium-quality inputs but could not outperform the standard consensus method in the presence of a high-quality feature. Nonetheless, the predicted number of domains was correct in most cases, a performance that could not be achieved with a majority vote.

In the second part of the dissertation, we addressed the lack of evolutionary models for alternative splicing. We presented a model of transcript evolution that distinguishes the evolution of the gene structure (exons and introns) from the evolution of the splicing patterns. Our model depicts the transcript phylogeny as a forest of transcript trees, each representing the evolution of an ancestral transcript. Because the reconstruction of transcript phylogenies is a complex problem, we first designed an exact algorithm and validated the concept on two gene families, MAG and PAX6, then created a faster heuristic and bundled it into a tool, TrEvoR. We selected two applications to demonstrate the usefulness of TrEvoR and transcript phylogenies. On the ASPic database, we showed that transcript phylogenies can improve the accuracy of methods that reconstruct transcriptomes from ESTs. On the same set of genes (805 gene families, 7 species), we then demonstrated the feasibility of large-scale functional studies with TrEvoR.
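
The boundary-selection step of the first part (choosing the most likely subset from a set of potential boundaries) can be illustrated with a minimal sketch. This is not DomML itself: the abstract does not give the actual domain-size distribution, so the Gaussian stand-in, its parameters `MEAN_LEN` and `SD_LEN`, and the function names below are all assumptions made for illustration. The sketch scores every subset of candidates by the summed log-density of the resulting domain lengths and keeps the best one.

```python
import math
from itertools import combinations

# Hypothetical parameters: the actual domain-size distribution is not
# specified in the abstract; a Gaussian is used purely as a stand-in.
MEAN_LEN = 100.0
SD_LEN = 40.0


def log_density(length, mean=MEAN_LEN, sd=SD_LEN):
    """Log of a Gaussian density, standing in for the domain-length model."""
    z = (length - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def segment_lengths(protein_len, boundaries):
    """Lengths of the domains obtained by cutting the protein at each boundary."""
    cuts = [0, *sorted(boundaries), protein_len]
    return [b - a for a, b in zip(cuts, cuts[1:])]


def best_boundaries(protein_len, candidates):
    """Return the subset of candidate boundaries with the highest log-likelihood.

    Every subset is enumerated, so this is only practical for a small
    number of candidates.
    """
    best, best_score = (), -math.inf
    for k in range(len(candidates) + 1):
        for subset in combinations(sorted(candidates), k):
            score = sum(log_density(n) for n in segment_lengths(protein_len, subset))
            if score > best_score:
                best, best_score = subset, score
    return list(best)
```

Under these assumed parameters, `best_boundaries(400, [50, 200])` keeps only the boundary at position 200, because two domains of 200 residues are more plausible than any split that produces a 50-residue or 350-residue segment. Exhaustive enumeration grows exponentially with the number of candidates; a dynamic program over sorted candidates would be the natural replacement for longer candidate lists.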

