Abstract
Simulations of genotype data of families or unrelated individuals across multiple markers is a task that is commonly encountered by researchers in many disciplines, including human statistical genetics and plant based mapping studies. Although a number of high quality genetic simulators already exist, many of them require extensive understanding of population models or high level of technical training to be effectively used.
We developed sim1000G, a new integrated and easy to use R package that generates simulated genomic regions in unrelated individuals or families based on a regression framework. The main benefits of sim1000G are it's simplicity and non-dependence on external tools that enable its more widespread use. Sim1000G can capture allele frequencies and short and long-range LD patterns in human, animal or plant studies, starting for a raw phased variant file. Currently, it is one of the few simulation packages that is completely integrated within R and can simulate unrelated and family data for multiple markers. This integration enable the more efficient development of methods for association tests and other types of analysis, for rare and common variants.
Introduction
With the emergence of sequencing technologies of ever increasing throughput, the amount of genetic data to be analyzed grows exponentially every year. Developing new methods for analyzing this type of data is an area of active research, not only in human populations but also in plant and animal species. Frequently, a need for generating large number of simulated datasets arises for validation and comparison of their association and other methods.
Although many approaches exist for simulations of genetic data \cite{Peng_2014} , like HapGen2 \citep{Su_2011}, simuPOP \citep{Peng_2005} and simuRARE \citep{23161487}, many of those are tailored to population evolution studies and pose additional complexity to researchers developing methods for statistical genetics.
In many cases, researchers develop custom in-use software packages for genetic simulation needs in small projects. As described in
\cite{Chen_2014}, from an analysis of Genetic Epidemiology journal issues, only 8 out of 36 articles which included simulated data, used existing genetic simulators, or simulators already catalogued by the
Genetic Simulation Resources page of NIH (
https://popmodels.cancercontrol.cancer.gov/gsr/). It is clear that there is a lot of duplicated effort in designing and implementing genetic simulators, even though excellent genetic simulation packages exist. A reason for this duplication of efforts might be the complexity and difficulties of non-simulation expert individuals in dealing with the complexities of many simulation packages.
To address this gap we implemented a genetic variant simulator (sim1000G) that is easy to use, is completely integrated within R and can be used for many different scenarios of simulation. In addition to generating unrelated individuals, we focused also on generating family data with arbitrary pedigree structures, a feature that is missing in many of existing simulators.
Methods
Simulating unrelated individuals
Simulating genetic variant data requires capturing the minor allele frequencies and LD patters of a particular population or mixture of populations. We start with a phased VCF file containing both rare and common variants of a particular population. A great source of initial population data is 1000 genomes \citep{26432245} with whole-genome sequencing variant calls available for 2504 samples across 26 populations. For example, the initial population can be the European descent subset of 1000 Genomes individuals. A region of the genome is extracted, which will be the target of the simulation. This regional file can be extracted readily with commands from the bcftools package (Supplementary material - script 1).
There is not limit on the size of the region being simulated, up to a whole chromosome can be simulated, however the number of markers has to be reduced to fit within the computational limitations of the R environment. Within the package there is support for filtering variants according to minimum of maximum minor allele frequency and reduced the number of variants to a specific number, for speeding up computations.
After the file is read the haplotypes of a particular population are extracted and the correlation between each pair of markers is computed. We use the functionality provided by the R hapsim package in R \cite{Montana_2005}. Within hapsim, a haplotype is modeled as a multivariate random variable possessing estimated marginal distributions and pairwise correlation coefficients. Hapsim provides computationally efficient algorithms to generate large sample of such random variables.
As a first step, the phased genotypes that have been read are split into haplotypes and are passed to hapsim for the preprocessing step. This step generally takes time proportinal to the square of the number of markers.
Generating an unrelated individual requires generating two simulated haplotypes of the region, by simulating the multivariate random variables with the functionality provided by hapsim package. Using this type of modelling, the simulated data captures the allele frequencies and short and long-range LD patterns in the genome, as well as recombination hotspots. To enable higher computational efficiency, we added functionality to generate pools of haplotypes in batches which can be used later.
Within our approach, we can simulate short regions or even whole-chromosome data, by taking into account the limitations of R. There is a penalty of simulating large number of markers, as the memory requirement and the time required to compute correlations rise quadratically with the number of markers. Simulations of 1000-2000 markers in a laptop computer is easily done, and 4000 to 10000 thousand is possible in a high memory workstation computer.
Generating variants in an arbitrary pedigree
Although this basic approach is valid for generating unrelated individuals, when simulating data for families the recombination of chromosomes has to be modelled. A goal of the development of sim1000G was to enable simulations of arbitrary pedigree structures and generate realistic multi-marker variant datasets. This requires modelling and simulation of all meiosis and recombination events and locations in a pedigree. We have added a set of methods in the package to enable the generation of genotype variant data in familial data of simple or complex pedigrees.
As a first step the unrelated individuals in a pedigree are simulated. Then we trace through the pedigree structure and find all instances that require recombination of two regions. Recombination positions across the region are generated and the haplotypes are recombined to generate an offspring of two previously simulated individuals (function mate)
When modelling recombination, we introduce two recombination models within sim1000G: a interference chi-squared model and a simple no-interference model. This models are used to generate recombination distances for the region that is under simulation, which are used for the recombination of the parental haplotypes. The model with interference was adapted from a previously described two-pathway model \citep{Housworth_2003}.
In addition, simulations of family data require a detailed genetic map of the region under simulation. For human autosomal data, we provide detailed genetic maps for all chromosomes obtained by lifting the HapMap Phase II genetic map from build 35 to GRCh37. The original map was generated using LDhat as described in the 2007 HapMap paper \citep{17943122}. Because of package size limitations the genetic maps are not included in the package distribution but they are provided on the accompanying website of the package.
It is common for methods in familial studies to require the estimation of identity by descent (IBD) probabilities between members of the same pedigree. Within it's simulation model, sim1000G tracks all ancestral haplotypes and alleles for each recombination event. This enable the compuation of the exact identity by descent state for each marker in the region that is being simulated. We added a simple user interface that computes the exact IBD 1 and 2 proportions of every pair of individuals, with a single call of the function computePairIBD12.
Computational efficiency