Introduction
With the emergence of sequencing technologies of ever increasing throughput, the amount of genetic data to be analyzed grows exponentially every year. Developing new methods for analyzing this type of data is an area of active research, not only in human populations but also in plant and animal species.
Frequently researchers trying to develop methods dealing with genetic variant analysis require simulated genetic data for extensive validation and comparison of their methods.
Although there exist many packages for simulations of genetic data, like HapGen2
(Su 2011) and simuPOP
\cite{Peng_2005}, many of those are tailored to population evolution studies and pose additional complexity to researchers developing methods for statistical genetics. In addition generating family data with specific pedigree structures is a feature that is missing in many of those existing packages.
To address this gap and enable simulation of genetic data in unrelated individuals or families, we developed a package (sim1000G) that is self-contained within R and easy to use.
Approach
sim1000G generates simulated genetic data based on any phased vcf file of a genomic region. For example, the 1000 genomes project, provides phased genotype data for 26 diverse human population \citep{26432245}. The simulation is based on an initial population, from which the allele frequencies and LD structure of the data is preserved.
For example, the initial population can be a subset of 1000 genomes CEU individuals from sequencing data. After selecting a region to be simulated and the population, a regional vcf file can be generated using simple commands from the bcftools package. This the vcf file is read from sim1000G and the regional LD and allele frequencies of the samples are automatically calculated using the hapsim package \cite{Montana_2005} , within R. There are no limitations in the extent of the region (in MBp), however the number of markers selected for simulation generally has to be less that 2000.
After this step, sim1000G can generate new genetic data for simulated individuals, within this region, utilizing the hapsim method of truncating a multivariate normal distribution with specified covariance based on the LD pattern of the region.
Using this type of modelling, the simulated data model long-range LD patterns in the genome, as well as recombination hotspots.
Simulating family genetic data
A goal of the development of sim1000G was simulating arbitrary pedigree structures with multi-marker data. This requires modelling and simulation of meiosis and fertilisation events to allow for chromosomal recombination.
We have added simple methods in the package that enable generation of realistic genotype data for multiple markers within a region (usually a gene) in familial data of simple or complex pedigrees.
When modelling recombination, we introduce two recombination models within sim1000G: a interference chi-squared model and a simple no-interference model. This models are used to generate inter-recombination distances for the region that is under simulation and the recombination of the resulting haplotypes. The model with interference was adapted from a previously described two-pathway model \cite{Housworth_2003}.
In addition, simulations of family data require a detailed genetic map. For the 1000 genomes data, we provide genetic maps for all autosomes. Because of package size limitations we were not able to include genetic maps in the package distribution and we provide all the genetic map files on the accompanying website of the package.
Family studies usually require the estimation of identity by descent (IBD) probabilities between members of the same pedigree. sim1000G tracks all ancestral haplotypes and alleles for each recombination event. This enable the compuation of the exact identity by descent state for each marker in the region that is being simulated. We added a simple user interface that allows users to obtain region IBD 1 and 2 estimates, with a single call of the function computePairIBD12.
Example
Using sim1000G we generated data for 300 individuals in the region of chromosome 4, 60995249-61569446bp and 95 individuals from 1000 genomes population CEU. We examined the LD patterns of the original genotype data, with the LD patterns of the simulated data. In figure 1 we show both of the LD patterns, the lower triangle of the matrix shows the original genotypes of 95 individuals from 1000 genomes and the upper triangle the same pattern for the 300 simulated individuals. Although there are some subtle differences in LD, we see that sim1000G preserved both short range and long-range LD in this region.