Topic: Demographic and historical inference from genomic data under realistic spatial models: towards accounting for the landscape.
Dates: 1 October 2018 – 30 September 2021
CBGP supervisor: R. Leblois
University: University of Montpellier/ISEM
Analysis of neutral genetic polymorphism makes it possible to estimate demographic and historical population parameters such as population sizes or densities, dispersal parameters, divergence times, or past demographic changes. These analyses rest on the combination of (1) stochastic models of population evolution, such as the Kingman coalescent (Kingman 1982) for independent loci or the ancestral recombination graph (Hudson 1983, Griffiths and Marjoram 1997), which accounts for recombination between sequences; and (2) statistical inference methods, the most powerful of which are based either on likelihood estimation for the simplest evolutionary models (Kuhner 2009, Rousset et al. 2018) or on the comparison between simulations and a real dataset (through a set of summary statistics) for more complex models (Approximate Bayesian Computation, ABC; Beaumont 2010, Marin et al. 2012). These population genetics inference methods have changed profoundly over the last ten years, mainly to keep pace with the drastic change in the type and size of genetic/genomic datasets brought about by the rapid development of new sequencing techniques (Next Generation Sequencing, NGS).
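As an illustration of the first ingredient above, here is a minimal, purely didactic Python sketch of the Kingman (1982) coalescent for a sample at a single non-recombining locus; the function name and the Monte Carlo check are ours and do not come from any package mentioned in this project.

```python
import random

def kingman_tmrca(n, rng=random):
    """Simulate the time to the most recent common ancestor (TMRCA) of a
    sample of n lineages under the Kingman (1982) coalescent.  Time is in
    coalescent units of 2N generations; while k lineages remain, the
    waiting time to the next coalescence is exponential with rate k(k-1)/2."""
    t = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2.0
        t += rng.expovariate(rate)
        k -= 1  # one coalescence event merges two lineages
    return t

# Monte Carlo check against the analytical expectation E[TMRCA] = 2*(1 - 1/n)
rng = random.Random(42)
n = 10
reps = 200_000
mean_tmrca = sum(kingman_tmrca(n, rng) for _ in range(reps)) / reps
expected = 2 * (1 - 1 / n)  # = 1.8 for n = 10
```

Such closed-form expectations are exactly what the project will use to validate simulators (see part 1 below, "comparison with analytical results").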
The objective of this project is to develop and test inferential tools adapted to a very specific class of stochastic population genetics models: spatial demographic models. In many species, dispersal is spatially limited: individuals preferentially mate with geographically close neighbours. In addition, many populations show a continuous spatial distribution of individuals rather than individuals aggregated into panmictic sub-populations. Spatial models of isolation by distance (IBD) in a continuous habitat take these characteristics into account and, in particular, make it possible to estimate some characteristics of dispersal and population density. However, despite the recent explosion of methodological developments mentioned above, the development of new spatial analysis methods remains relatively limited, most likely because of the complexity of implementing inference methods for spatialized demo-genetic data and the scarcity, until recently, of geo-referenced individual genomic data. The main existing inference methods are still based on F-statistics and only allow estimation of the neighbourhood size, the product of population density and a measure of dispersal (Rousset 1997, 2000). A maximum-likelihood inference method, which thus uses all the information in the genetic data, was developed more recently, but it can handle neither continuous populations nor very large numbers of genetic markers in a fully satisfactory way (Rousset and Leblois 2012). However, the recent development of new simulation-based inference methods has brought a 10- to 100-fold gain in speed (Approximate Bayesian Computation using Random Forests, ABC-RF, Pudlo et al. 2016, Marin et al. 2017; or the summary-likelihood method, SL, implemented in the R package Infusion, Rousset 2016), and the cost of obtaining large numbers of individual genomes has dropped significantly.
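To make the F-statistics approach concrete, the sketch below illustrates, on made-up, noise-free toy data, how the neighbourhood size Nb = 4*pi*D*sigma^2 can be recovered from the slope of the regression of pairwise genetic differentiation on log geographic distance, following the isolation-by-distance result of Rousset (1997, 2000). The helper name `neighbourhood_size` and the toy values are ours, and plain ordinary least squares stands in for the full estimation procedure.

```python
import math

def neighbourhood_size(distances, a_stats):
    """Estimate the neighbourhood size Nb = 4*pi*D*sigma^2 from the
    regression of pairwise genetic differentiation between individuals
    on log geographic distance: under two-dimensional isolation by
    distance (Rousset 1997, 2000), the slope of that regression is
    approximately 1/Nb.  Ordinary least squares, for illustration only."""
    x = [math.log(d) for d in distances]
    y = list(a_stats)
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return 1.0 / slope

# Toy, noise-free data generated under the linear model a = c + log(d)/Nb
true_nb = 50.0
dists = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0]
a_vals = [0.01 + math.log(d) / true_nb for d in dists]
nb_hat = neighbourhood_size(dists, a_vals)  # recovers 50 on exact data
```

Note that, as stated above, this classical approach yields only the composite parameter Nb: density and dispersal are confounded, which is one motivation for the likelihood and simulation-based methods discussed next.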
These two major advances now make it possible to consider realistic spatial models, for which simulation is relatively slow, together with very large numbers of markers, in order to infer the demographic functioning of populations in space and time in more detail and with better precision than current methods allow.
The aim of this project is therefore to develop, test and apply new methods for inferring demographic and historical parameters (dispersal, densities, barriers to gene flow, past demographic changes, secondary contacts, etc.) under spatial models, starting from simple models that are homogeneous in time and space and moving towards increasingly realistic models with spatial and temporal heterogeneities. Indeed, the explosion in the quantity of available data, in terms of both the number of markers and the number of individuals, suggests that we can now study the weak and complex genetic signals left by increasingly fine demographic and historical processes. We even anticipate that, ultimately, the influence of the landscape on fine-scale spatial population structure could be taken into account in the inferences and then used to make predictions about the future evolution of neutral biodiversity, particularly in the context of the global changes we are experiencing.
The first part of this PhD project aims at the implementation/enrichment of a new genomic data simulator based on coalescent algorithms that can handle realistic spatial models, in order to use it for demographic and historical inference. Since modern simulation-based inference techniques require efficient algorithms, in terms of both execution speed and memory requirements, an important effort will be devoted to the choice and possible combinations of (1) storage and indexing methods for ancestral recombination graphs, coalescent trees and simulated genomes (e.g. Kelleher et al. 2016); (2) coalescent algorithms (exact "generation by generation", Leblois et al. 2009; with exponential-time approximations, Hudson 1990); (3) recombination algorithms (exact ancestral recombination graph, Griffiths and Marjoram 1997; the SMC approximation of Marjoram and Wall 2006); and (4) algorithms for computing summary statistics (see part 2 below). The developed code will be continuously validated by unit tests and by comparison with analytical results and with simulations from other, less efficient programs such as IBDSim for the spatial aspects (Leblois et al. 2009) and msprime for the genomic and recombination aspects (Kelleher et al. 2016). This part of the project aims to develop standalone, open-source software, developed collaboratively (Git) under continuous integration. It will build on modern C++ (the C++11/14 or even C++17 standards) in order to produce code that is readable, concise and optimized, and thus easily modifiable and reusable by anyone. This part involves algorithmics, software architecture and C++ development, all with a strong emphasis on optimization and parallelization.
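The trade-off behind point (2), exact versus approximate coalescent algorithms, can be sketched for the simplest case of two lineages in a panmictic Wright-Fisher population. This is an illustrative Python toy written by us (the project's simulator itself is to be developed in modern C++): the exact algorithm pays one random draw per generation, the continuous-time approximation a single draw per coalescence.

```python
import random

def pairwise_coal_time_exact(N, rng):
    """Generation-by-generation simulation: two lineages in a Wright-Fisher
    population of size N coalesce with probability 1/(2N) each generation,
    i.e. the waiting time is geometric with mean 2N generations."""
    t = 0
    while True:
        t += 1
        if rng.random() < 1.0 / (2 * N):
            return t

def pairwise_coal_time_approx(N, rng):
    """Continuous-time approximation (Hudson 1990): exponential waiting time
    with rate 1/(2N).  One random draw instead of one per generation, hence
    much faster for large N, at the cost of a small-N approximation."""
    return rng.expovariate(1.0 / (2 * N))

rng = random.Random(1)
N = 50
reps = 20_000
mean_exact = sum(pairwise_coal_time_exact(N, rng) for _ in range(reps)) / reps
mean_approx = sum(pairwise_coal_time_approx(N, rng) for _ in range(reps)) / reps
# Both Monte Carlo means are close to the expectation 2N = 100 generations.
```

Under spatial models, the exact generation-by-generation scheme (as in IBDSim) additionally tracks lineage positions backwards in time, which is precisely where algorithmic choices and efficient data structures become critical.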
The second part focuses on the adaptation, testing and comparison of new simulation-based inference methods in the context of spatial demographic models and genomic data. To date, we intend to test two approaches, each with its own strengths and limitations: (1) the ABC-RF method, which is fast and can handle models with a large number of parameters; this method has already been tested and used since 2015, in particular in our research teams; and (2) the very recently developed SL method, whose limits are therefore still poorly known; we will in particular test a variant that is a priori less limited in its number of parameters than the one described in Rousset et al. 2017. With the ultimate goal of producing and disseminating powerful, robust and easy-to-use data analysis methods, the PhD candidate will explore three main questions: (1) which summary statistics are most relevant to best summarize the information contained in genomic data, what information is provided by linkage disequilibrium over long DNA sequences, and which parameters can be estimated from these statistics; (2) what improvement machine-learning methods such as neural networks may provide, either to reduce the number of summary statistics or to be used directly within simulation-based inference procedures (ABC and SL) without computing summary statistics; and (3) what are the statistical performances of the ABC-RF and SL methods as a function of the number of model parameters and their correlation levels, the number and types of summary statistics used, the use of neural networks, and the type of question asked (i.e. estimation of different demographic parameters or model choice, see below).
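For readers less familiar with simulation-based inference, the toy Python sketch below shows the plain ABC rejection scheme (in the spirit of Beaumont 2010) that ABC-RF and SL refine: draw a parameter from the prior, simulate a summary statistic, keep the parameter when the statistic falls close to the observed one. The toy model and all names are ours; real analyses would use coalescent simulations and genetic summary statistics.

```python
import random

def abc_rejection(observed_stat, simulate, prior_sample, n_sims, tol, rng):
    """Minimal ABC rejection sampler, for illustration only: accepted
    parameter values approximate draws from the posterior given that the
    simulated summary statistic lies within `tol` of the observed one.
    ABC-RF replaces this crude accept/reject step with random forests,
    and SL with a smoothed likelihood surface."""
    accepted = []
    for _ in range(n_sims):
        theta = prior_sample(rng)
        s = simulate(theta, rng)
        if abs(s - observed_stat) <= tol:
            accepted.append(theta)
    return accepted

# Toy model: the summary statistic is a noisy observation of theta itself.
rng = random.Random(7)
prior = lambda r: r.uniform(0.0, 10.0)
model = lambda theta, r: theta + r.gauss(0.0, 0.5)
obs = 4.0
post = abc_rejection(obs, model, prior, n_sims=100_000, tol=0.25, rng=rng)
post_mean = sum(post) / len(post)  # concentrates near the observed value 4.0
```

The sketch also makes the three questions above tangible: the choice of `simulate`'s output is the summary-statistics question, and the cost of the rejection loop is why faster methods (ABC-RF, SL) and fast simulators (part 1) matter.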
To answer these questions, the PhD candidate will develop a simulation-based testing approach similar to the one used in our previous publications (precision and robustness of the estimates, validity of confidence/credibility intervals and of model choice procedures), complemented by the analysis of real datasets to define realistic simulation conditions.
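The interval-validity check mentioned above can be sketched on a deliberately trivial model (all names are ours): draw many pseudo-observed datasets under a known parameter value, compute an interval for each, and compare the empirical coverage with the nominal level; a well-calibrated inference method should be close to the nominal coverage.

```python
import random
import statistics

def interval_coverage(n_datasets, n_obs, rng):
    """Simulation-based calibration check: for each pseudo-observed dataset
    drawn under a known truth, build a normal-theory 95% interval for the
    mean and record whether it contains the truth.  The returned empirical
    coverage should be close to the nominal 0.95 if the procedure is valid."""
    truth = 0.0
    hits = 0
    for _ in range(n_datasets):
        data = [rng.gauss(truth, 1.0) for _ in range(n_obs)]
        m = statistics.fmean(data)
        se = statistics.stdev(data) / n_obs ** 0.5
        if m - 1.96 * se <= truth <= m + 1.96 * se:
            hits += 1
    return hits / n_datasets

rng = random.Random(3)
coverage = interval_coverage(n_datasets=2000, n_obs=100, rng=rng)
# coverage falls near the nominal 0.95 for this well-specified toy model
```

In the actual project, the "dataset" is a simulated genomic sample under a spatial model and the interval comes from ABC-RF or SL, but the calibration logic is the same.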