Improving Comparative Genomic Studies: Definitions and Algorithms for Syntenic Blocks
Comparative genomics aims to understand the structure of genomes and the function of various genomic fragments by transferring knowledge gained from well-studied genomes to the new object of study. Rapid and inexpensive high-throughput sequencing is making more and more complete genome sequences available. Despite this significant scientific advance, we still lack good models for the evolution of genomic architecture, so analyzing these genomes presents formidable challenges. Early approaches used pairwise comparisons, but today researchers are attempting to leverage the larger potential of multiway comparisons. Current approaches are based on the identification of so-called syntenic blocks: blocks of sequence that exhibit conserved features across the genomes under study. Syntenic blocks are in many ways analogous to genes; in many cases, the markers used to construct them are genes. Like genes, they can exist in multiple copies, in which case we could define analogs of orthology and paralogy. However, whereas genes are studied at the sequence level, syntenic blocks are too large for that level of detail; it is their structure and function as a unit that makes them valuable for genome-level comparative studies. Syntenic blocks are required for complex computations to scale to the billions of nucleotides present in many genomes; they enable comparisons across broad ranges of genomes because they filter out much of the individual variability; they highlight candidate regions for in-depth studies; and they facilitate whole-genome comparisons through visualization tools. The identification of such blocks is the first step in comparative studies, yet its effect on final results has not been well studied, nor has any formalization of syntenic blocks been proposed. Tools for the identification of syntenic blocks yield quite different results, thereby preventing a systematic assessment of the next steps in an analysis.
Current tools do not include measurable quality objectives and thus cannot be benchmarked against themselves. Comparisons among tools have also been neglected: the few results that are given use superficial measures unrelated to quality or consistency. In this thesis we address two major challenges and present: (i) a theoretical model as well as an experimental basis for comparing syntenic blocks, and thus also for improving the design of tools for the identification of syntenic blocks; (ii) a prototype model that serves as a basis for implementing effective synteny mining tools. We offer an overview of the milestones in the literature on the development of concepts and tools related to synteny, and we illustrate the model and the measures by applying them to syntenic blocks produced by different contemporary tools on publicly available data sets. We have taken the first step towards a formal approach to the construction of syntenic blocks by developing a simple quality criterion based on sound evolutionary principles. Our experiments demonstrate widely divergent results among these tools, calling into question the robustness of the basic approach in comparative genomics. Our findings highlight the need for a well-founded, systematic approach to the decomposition of genomes into syntenic blocks and motivate the second part of the work: starting from the proposed model, we extend the concept with data-dependent features and constraints in order to test it on cases of interest.