Inference of genealogies with geolocalised genetic data

Rinaldo, Andrea; Dutertre, Charles
2017-11-02
https://infoscience.epfl.ch/handle/20.500.14299/141757

Natural populations present abundant genetic variability. Different processes, such as mutation and natural selection, are at play in generating this variability. Population genetics is a field that emerged in the late 40's, thanks mainly to the biologists Fisher and Wright. Its goal is to analyze and understand the interactions between those evolutionary processes. Among other things, a byproduct of this field has been the development of evolutionary models that try to explain how genetic information is transmitted along a genealogy of individuals sampled from the same population, or from distinct populations of the same species. The simplest models are non-spatial, i.e. they treat individuals as if they were all in the same location; more complicated models suppose the population to be distributed on a lattice; the most interesting ones consider the individuals to be distributed in a spatial continuum, which leads to a leap in the complexity of the model. Nowadays, genetic information has become very cheap and abundant, but the computing power required to process all this data has not kept pace with the growth in data availability, especially in the case of spatial models. The intersection between population genetics and computational biology, in which this project takes place, is actively working on ways to make those computations faster. In this project we focus on one particular model, called the spatial Λ-Fleming-Viot. This model, which appeared in the literature less than ten years ago, alleviates the mathematical problems that hampered classical (spatial) models and that Felsenstein pointed out in 1975. The general goal of this work is to investigate this model and the statistical inference of its parameters, i.e. finding the values of the parameters of the model given a sample, taken at the present time, of a population that evolved under the model.
Inference methods have already been proposed, for instance using Markov Chain Monte Carlo in Guindon et al. [2016], but those methods are computationally intensive, particularly because of the spatial dimension. Finding ways of simplifying the problem and making the computations faster is the object of current research. In section 1 we present the first model that was proposed, in 1943 and 1948 by Wright and Fisher, along with the Kingman coalescent, proposed in 1982, which builds a genealogy by going backwards in time. We also briefly describe more recent models that try to take the spatial dimension into account. In section 2, which is a bibliographic study, we define the spatial Λ-Fleming-Viot, which is a measure-valued process and thus already requires theoretical tools for a full understanding of its definition. We also give the proof of a fundamental property of the model: time reversibility. In section 3, which is a personal study of the spatial Λ-Fleming-Viot, we give a basis for comparison between this model and others by computing, both theoretically and by simulation, two parameters of interest that give a good description of a modelled population. In the last section we focus on inference. This more theoretical part offers the description of an importance sampling inference scheme that uses time reversal as a proposal distribution, that applies generally to the modelling of rare events, and that was explained by Koskela in his PhD thesis in 2016. We then propose a way to simplify this scheme by making use of a particular type of genetic information: Single Nucleotide Polymorphisms (SNPs).

genetic variability; population genetics; models
student work::master thesis