
Our use of cookies

Cookies are small data files stored on a user's device when the user browses a website. Each file contains an ID number, the name of the server that deposited it and, in some cases, an expiry date. We use cookies to record information about your visit, your preferred language and other settings on the site in order to optimise your next visit and make the site even more useful to you.

To improve your experience, we use cookies to store certain browsing information and provide secure navigation, and to collect statistics with a view to improving the site's features. For a complete list of the cookies we use, download "Ghostery", a free browser plug-in that can detect and, in some cases, block cookies.

Ghostery is available here for free: https://www.ghostery.com/fr/products/

You can also visit the CNIL web site for instructions on how to configure your browser to manage cookie storage on your device.

In the case of third-party advertising cookies, you can also visit the following site: http://www.youronlinechoices.com/fr/controler-ses-cookies/, offered by digital advertising professionals within the European Digital Advertising Alliance (EDAA). From the site, you can deny or accept the cookies used by advertising professionals who are members.

It is also possible to block certain third-party cookies directly via publishers:

Cookie type: Analytical and performance cookies
Means of blocking: Realytics, Google Analytics, Spoteffects, Optimizely

Cookie type: Targeted advertising cookies
Means of blocking: DoubleClick, Mediarithmics

The following types of cookies may be used on our websites:

Mandatory cookies

These cookies are needed to ensure the proper functioning of the site and cannot be disabled. They help ensure a secure connection and the basic availability of our website. Our EZPublish content management system (CMS) uses CAS and PHP session cookies and the New Relic cookie for monitoring purposes (IP, response times). These cookies are deleted at the end of the browsing session (when you log off or close your browser window).

Functional cookies

These cookies allow us to analyse site use in order to measure and optimise performance. They allow us to store your sign-in information and display the different components of our website in a more coherent way. Our EZPublish CMS uses the XiTi cookie to measure traffic. Our service provider is AT Internet. This company stores data (IPs, date and time of access, length of the visit and pages viewed) for six months.

Social media and advertising cookies

These cookies are used by advertising agencies such as Google and by social media sites such as LinkedIn and Facebook. Among other things, they allow pages to be shared on social media, the posting of comments, and the publication (on our site or elsewhere) of ads that reflect your centres of interest. Our EZPublish CMS does not use this type of cookie.

For more information about the cookies we use, contact INRA’s Data Protection Officer by email at cil-dpo@inra.fr or by post at:

INRA
24, chemin de Borde Rouge –Auzeville – CS52627
31326 Castanet Tolosan CEDEX - France

Last updated: May 2018


Thimothée VIRGOULAY
Email thimothee.virgoulay(at)etu.montpellier.fr
Topic: Demographic and historical inference from genomic data under realistic spatial models: towards taking the landscape into account.
Dates: 1 October 2018 to 30 September 2021
CBGP supervisor: R. Leblois
University: University of Montpellier/ISEM

Analysis of neutral genetic polymorphism makes it possible to estimate demographic and historical population parameters such as population sizes or densities, dispersal parameters, divergence times, or past demographic changes. These analyses combine (1) stochastic models of population evolution, such as the Kingman coalescent (1982) for independent loci or the ancestral recombination graph (Hudson 1983, Griffiths and Marjoram 1997), which accounts for recombination between sequences; and (2) statistical inference methods, the most powerful being based on likelihood estimation for the simplest evolutionary models (Kuhner 2009, Rousset et al. 2018), or on the comparison between simulations and a real dataset (through a set of summary statistics) for more complex models (Approximate Bayesian Computation, ABC; Beaumont 2010, Marin et al. 2012). These population genetics inference methods have changed profoundly over the last ten years, mainly to adapt to the drastic change in the type and size of genetic/genomic datasets brought about by the rapid development of new sequencing techniques (Next Generation Sequencing, NGS).
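As a purely illustrative aside (not part of the project description), the simplest of these stochastic models, the Kingman coalescent for a single non-recombining locus, can be simulated in a few lines: with k ancestral lineages, the waiting time to the next coalescence is exponentially distributed with rate k(k-1)/2 in units of 2N generations. The C++ sketch below, with an arbitrary sample size, only records the time to the most recent common ancestor and the total branch length (the quantity that, multiplied by the mutation rate, governs the expected number of mutations) rather than building a full tree.

```cpp
// Minimal sketch (not the project's simulator): Kingman coalescent for one
// non-recombining locus. With k ancestral lineages, the waiting time to the
// next coalescence is exponential with rate k(k-1)/2 (time in units of 2N
// generations); two random lineages then merge, leaving k-1 lineages.
#include <iostream>
#include <random>

int main() {
    const int sample_size = 20;        // assumed sample size (illustrative)
    std::mt19937 rng(12345);

    double t_mrca = 0.0;               // time to the most recent common ancestor
    double total_length = 0.0;         // total branch length of the genealogy

    for (int k = sample_size; k > 1; --k) {
        const double rate = k * (k - 1) / 2.0;      // coalescence rate with k lineages
        std::exponential_distribution<double> waiting(rate);
        const double dt = waiting(rng);             // time during which k lineages coexist
        t_mrca += dt;
        total_length += k * dt;                     // each of the k lineages accumulates dt
    }

    std::cout << "TMRCA (coalescent units): " << t_mrca << "\n"
              << "Total branch length: " << total_length << "\n";
    return 0;
}
```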

The objective of this project is to develop and test inferential tools adapted to a very specific class of stochastic models of population genetics: spatial demographic models. In many species, dispersal is spatially limited: individuals reproduce preferentially with individuals that are geographically close. In addition, many populations show a continuous spatial distribution of individuals rather than individuals aggregated into panmictic sub-populations. Spatial models of isolation by distance (IBD) in continuous habitat take these characteristics into account and, in particular, make it possible to estimate some characteristics of dispersal and population density. However, despite the recent explosion of methodological developments mentioned above, the development of new spatial analysis methods remains relatively limited, certainly because of the complexity of implementing inference methods for spatialised demo-genetic data and the scarcity, until recently, of geo-referenced individual genomic data. The main existing inference methods are still based on F-statistics and only allow the estimation of the neighbourhood size, the product of the density by some characteristic of dispersal (Rousset 1997, 2000). A maximum-likelihood inference method, which therefore uses all the information in the genetic data, was developed more recently, but it cannot handle continuous populations nor very large numbers of genetic markers in a fully satisfactory way (Rousset and Leblois 2012). However, the recent development of new simulation-based inference methods has brought a gain of a factor of 10 to 100 in terms of speed (Approximate Bayesian Computation using Random Forests, ABC-RF, Pudlo et al., Marin et al. 2017; or the summary-likelihood method, SL, implemented in the R package Infusion, Rousset 2016), and the cost of obtaining large numbers of individual genomes has dropped significantly. These two major advances now make it possible to consider realistic spatial models, for which simulation is relatively slow, together with very large numbers of markers, in order to infer the demographic functioning of populations in space and time in more detail and with better precision than current methods allow.
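For readers unfamiliar with the neighbourhood size mentioned above, a short reminder of standard results not spelled out on this page: in a two-dimensional IBD model it is usually written Nb = 4πDσ², where D is the density of individuals and σ² the axial variance of parent-offspring dispersal distance, and Rousset's (1997) regression of FST/(1-FST) on the logarithm of geographic distance has a slope of approximately 1/Nb. The small C++ sketch below, with made-up parameter values, simply evaluates these two quantities.

```cpp
// Illustrative only: the 2D "neighbourhood size" Nb = 4*pi*D*sigma^2
// (density D, axial dispersal variance sigma^2). Under Rousset's (1997)
// regression approach, the slope of F_ST/(1-F_ST) against log(distance)
// between pairs of individuals is approximately 1/Nb, so Nb can be read
// off the inverse of that slope. All numerical values are hypothetical.
#include <iostream>

int main() {
    const double pi      = 3.141592653589793;
    const double density = 2.5;   // individuals per surface unit (hypothetical)
    const double sigma2  = 1.8;   // axial variance of parent-offspring distance (hypothetical)

    const double neighbourhood  = 4.0 * pi * density * sigma2;  // Nb = 4*pi*D*sigma^2
    const double expected_slope = 1.0 / neighbourhood;          // slope of F_ST/(1-F_ST) vs log distance

    std::cout << "Neighbourhood size Nb: " << neighbourhood << "\n"
              << "Expected regression slope 1/Nb: " << expected_slope << "\n";
    return 0;
}
```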

The aim of this project is therefore to develop, test and apply new methods for inferring demographic and historical parameters (dispersal, densities, barriers to gene flow, past demographic changes, secondary contacts, etc.) under spatial models, starting from simple models that are homogeneous in time and space and moving towards increasingly realistic models with spatial and temporal heterogeneities. Indeed, the explosion in the quantity of available data, both in terms of numbers of markers and numbers of individuals, suggests that we can now study the weak and complex genetic signals left by increasingly fine demographic and historical processes. We even think that, in fine, the influence of the landscape on fine-scale spatial population structure may soon be taken into account in the inferences and then used to make predictions about the future evolution of neutral biodiversity, in particular in the context of the global changes we are experiencing.

The first part of this PhD project aims at implementing and enriching a new genomic data simulator based on coalescent algorithms that can handle realistic spatial models, in order to use it for demographic and historical inference. Since modern simulation-based inference techniques require efficient algorithms, both in terms of execution speed and memory requirements, an important effort will be made on the choice and possible combination of (1) storage and indexing methods for ancestral recombination graphs, coalescent trees and simulated genomes (e.g. Kelleher et al. 2016), (2) coalescent algorithms (exact "generation by generation", Leblois et al. 2009; or with exponential time approximations, Hudson 1990), (3) recombination algorithms (exact ancestral recombination graph, Griffiths and Marjoram 1997; SMC approximation of Marjoram and Wall 2006), and (4) algorithms for calculating summary statistics (see part 2 below). The developed code will be continuously validated by unit tests and by comparison with analytical results and with simulations from other, less efficient programs such as IBDSim for the spatial aspects (Leblois et al. 2009) and msprime for the genomic and recombination aspects (Kelleher et al. 2016). This part of the project aims to produce a stand-alone, open-source, collaborative (Git) piece of software developed with continuous integration. It will build on the features of the C++11/14, or even C++17, standards in order to produce code that is readable, concise and optimised, and that can easily be modified and reused by anyone. This part therefore involves algorithmics, software architecture and C++ development, all with a strong emphasis on optimisation and parallelisation.
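To make the "generation by generation" idea concrete, here is a deliberately minimal C++ sketch, which is neither the algorithm of Leblois et al. (2009) nor the project's simulator: two lineages are traced backward in time on a circular one-dimensional lattice of demes, each generation every lineage moves to a neighbouring deme with a small probability, and the two lineages coalesce with probability 1/(2N) whenever they occupy the same deme. All parameter values are arbitrary.

```cpp
// Toy sketch of a backward-in-time, generation-by-generation coalescent for
// TWO lineages on a circular 1D lattice of demes. Each generation, every
// lineage independently picks the deme of its parent with a simple nearest-
// neighbour dispersal kernel; if both lineages sit in the same deme, they
// coalesce with probability 1/(2N), where N is the deme size.
#include <iostream>
#include <random>

int main() {
    const int n_demes    = 50;    // lattice size (arbitrary)
    const int deme_size  = 100;   // N, individuals per deme (arbitrary)
    const double migration = 0.1; // probability of moving to a neighbouring deme

    std::mt19937 rng(2018);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    int pos1 = 0, pos2 = n_demes / 2;   // starting demes of the two sampled lineages
    long generation = 0;

    auto disperse = [&](int pos) -> int {
        const double u = unif(rng);
        if (u < migration / 2.0) return (pos + 1) % n_demes;            // step right
        if (u < migration)       return (pos - 1 + n_demes) % n_demes;  // step left
        return pos;                                                     // stay put
    };

    while (true) {
        ++generation;
        pos1 = disperse(pos1);
        pos2 = disperse(pos2);
        // Coalescence is only possible when both lineages occupy the same deme.
        if (pos1 == pos2 && unif(rng) < 1.0 / (2.0 * deme_size)) break;
    }

    std::cout << "Coalescence after " << generation << " generations\n";
    return 0;
}
```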

The second part focuses on adapting, testing and comparing new simulation-based inference methods in the context of spatial demographic models and genomic data. To date, we intend to test two approaches, each with its own strengths and limits: (1) the ABC-RF method, which is fast and can handle models with a large number of parameters, and which has already been tested and used since 2015, especially in our research teams; and (2) the very recently developed SL method, whose limits are therefore poorly known. We will in particular test a variant that is a priori less limited in the number of parameters than the one described in Rousset et al. 2017. With the ultimate goal of producing and disseminating powerful, robust and easy-to-use data analysis methods, the PhD candidate will explore three main questions: (1) which summary statistics are most relevant to summarise the information contained in the genomic data, what information is provided by linkage disequilibrium over long DNA sequences, and which parameters can be estimated from these statistics; (2) what improvement machine learning methods such as neural networks may bring, either to reduce the number of summary statistics or to be used directly in simulation-based inference procedures (ABC and SL) without computing summary statistics; and (3) what the statistical performance of the ABC-RF and SL methods is as a function of the number of model parameters and their correlation levels, the number and types of summary statistics used, the use of neural networks, and the type of question asked (i.e. estimation of different demographic parameters or model choice, see below). To answer these questions, the PhD candidate will develop a simulation-based testing approach similar to the one used in our previous publications (precision and robustness of the estimates, validity of confidence/credibility intervals and of model choice procedures), complemented by the analysis of real datasets to define realistic simulation conditions.
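As a point of reference for these simulation-based procedures, the sketch below implements plain rejection ABC (not ABC-RF or the summary-likelihood method) on a deliberately simple, non-spatial toy problem: estimating the scaled mutation rate theta = 4Nµ from an "observed" number of segregating sites under a standard coalescent with infinite-sites mutation. Parameter draws whose simulated summary statistic falls within a tolerance of the observed value are kept as an approximate posterior sample; all numerical values are invented.

```cpp
// Minimal rejection-ABC sketch: estimate theta = 4*N*mu from an "observed"
// number of segregating sites S_obs in a sample of n sequences, under a
// standard (non-spatial) Kingman coalescent with infinite-sites mutation.
// Given the total branch length L (in units of 2N generations), the number
// of segregating sites is Poisson with mean theta*L/2.
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int n          = 20;      // sample size (arbitrary)
    const int s_obs      = 35;      // "observed" segregating sites (made up)
    const int n_sims     = 200000;  // number of simulations
    const double tolerance = 2.0;   // acceptance threshold on |S_sim - S_obs|

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> prior(0.1, 30.0);  // flat prior on theta

    std::vector<double> accepted;
    for (int sim = 0; sim < n_sims; ++sim) {
        const double theta = prior(rng);

        // Simulate the total branch length of a Kingman coalescent tree.
        double total_length = 0.0;
        for (int k = n; k > 1; --k) {
            std::exponential_distribution<double> waiting(k * (k - 1) / 2.0);
            total_length += k * waiting(rng);
        }

        // Infinite-sites mutations: S_sim ~ Poisson(theta * L / 2).
        std::poisson_distribution<int> mutations(theta * total_length / 2.0);
        const int s_sim = mutations(rng);

        // Keep theta if the simulated summary statistic is close to S_obs.
        if (std::abs(static_cast<double>(s_sim) - s_obs) <= tolerance)
            accepted.push_back(theta);
    }

    double mean = 0.0;
    for (double t : accepted) mean += t;
    if (!accepted.empty()) mean /= accepted.size();

    std::cout << "Accepted simulations: " << accepted.size()
              << "  approximate posterior mean of theta: " << mean << "\n";
    return 0;
}
```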