The package can be used to simulate whole-chromosome or whole-genome data; however, the number of markers selected for simulation must be limited, because of the limitations of R and the memory required to compute correlations and LD patterns across every pair of markers. Simulating 1000 markers on a laptop computer is easily done, and 4,000 to 10,000 markers is possible on a high-memory workstation.
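To illustrate why the marker count matters, the following back-of-the-envelope sketch estimates the size of a single dense pairwise matrix (for example, a correlation or LD matrix). It assumes one 8-byte double-precision value per marker pair; the helper name `pairwise_matrix_bytes` is hypothetical and not part of sim1000G, and real memory use would be higher because of intermediate copies.

```python
def pairwise_matrix_bytes(n_markers, bytes_per_value=8):
    """Bytes needed for one dense n_markers x n_markers matrix.

    Assumption: every marker pair is stored as one double (8 bytes).
    """
    return n_markers * n_markers * bytes_per_value


if __name__ == "__main__":
    for n in (1000, 4000, 10000):
        megabytes = pairwise_matrix_bytes(n) / 1e6
        print(f"{n} markers: {megabytes:.0f} MB per pairwise matrix")
```

Memory grows quadratically under this assumption: a tenfold increase in markers requires a hundredfold increase in memory for each pairwise matrix.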
Simulating family genetic data
A goal of the development of sim1000G was simulating arbitrary pedigree structures with multi-marker data. This requires modelling and simulation of meiosis and fertilisation events to allow for chromosomal recombination.
We have added simple methods in the package that enable generation of realistic genotype data for multiple markers within a region (usually a gene) in familial data of simple or complex pedigrees.
To model recombination, we implemented two recombination models within sim1000G: a chi-squared model with interference and a simple no-interference model. These models are used to generate inter-recombination distances for the region under simulation, which in turn determine the recombination of the resulting haplotypes. The model with interference was adapted from a previously described two-pathway model \citep{Housworth_2003}.
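The difference between the two kinds of model can be sketched as follows. This is a minimal illustration and not the sim1000G implementation: it assumes a gamma renewal process for chiasma positions on the four-strand bundle (shape `nu`, mean spacing 0.5 Morgan), followed by thinning with probability 1/2 to obtain crossovers on a single gamete. With `nu = 1` the spacings are exponential and the sketch reduces to a no-interference model; larger `nu` gives more evenly spaced crossovers. The function name, the `burn_in` device used to approximate a stationary start, and all parameter values are assumptions, and the two-pathway model cited above is not reproduced here.

```python
import random


def crossover_positions(length_morgans, nu=1.0, rng=None, burn_in=5.0):
    """Sample crossover positions (in Morgans) on one gamete.

    Assumptions (illustrative only):
      * chiasma spacings are Gamma(shape=nu, rate=2*nu), mean 0.5 Morgan;
      * each chiasma reaches a given gamete with probability 1/2;
      * the process starts `burn_in` Morgans before the region so that
        the first crossover is approximately stationary.
    """
    rng = rng or random.Random()
    positions = []
    x = -burn_in
    while True:
        # random.gammavariate takes (shape, scale); scale = 1 / rate.
        x += rng.gammavariate(nu, 1.0 / (2.0 * nu))
        if x > length_morgans:
            break
        # Keep chiasmata inside the region, thinned by 1/2.
        if x >= 0.0 and rng.random() < 0.5:
            positions.append(x)
    return positions


if __name__ == "__main__":
    rng = random.Random(42)
    print("no interference:", crossover_positions(2.0, nu=1.0, rng=rng))
    print("interference   :", crossover_positions(2.0, nu=4.0, rng=rng))
```

Under these assumptions both settings yield on average one crossover per Morgan per gamete; interference changes only how the crossovers are spaced.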
In addition, simulations of family data require a detailed genetic map. For the 1000 Genomes data, we provide genetic maps for all autosomes. Because of package size limitations, the genetic maps could not be included in the package distribution; instead, all genetic map files are available on the accompanying website of the package.
Family studies usually require the estimation of identity by descent (IBD) probabilities between members of the same pedigree. sim1000G tracks all ancestral haplotypes and alleles through each recombination event. This enables the computation of the exact identity by descent state for each marker in the region being simulated. We added a simple user interface that allows users to obtain region-wide IBD1 and IBD2 estimates with a single call to the function computePairIBD12.
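The idea of deriving exact IBD states from tracked ancestry can be sketched as follows. This is an illustrative example and does not show the internals of computePairIBD12: it assumes that every founder haplotype carries a unique integer label and that, at each marker, each individual is described by the pair of founder labels it inherited. The function names `ibd_state` and `region_ibd12` and the toy data are hypothetical.

```python
def ibd_state(pair_a, pair_b):
    """Number of founder haplotypes (0, 1 or 2) shared at one marker."""
    a1, a2 = pair_a
    b1, b2 = pair_b
    direct = (a1 == b1) + (a2 == b2)
    crossed = (a1 == b2) + (a2 == b1)
    return max(direct, crossed)


def region_ibd12(labels_a, labels_b):
    """Fraction of markers in the region with IBD state 1 and state 2."""
    states = [ibd_state(a, b) for a, b in zip(labels_a, labels_b)]
    n = len(states)
    return states.count(1) / n, states.count(2) / n


if __name__ == "__main__":
    # Toy siblings: father carries founder haplotypes 1 and 2,
    # mother carries 3 and 4; four markers in the region.
    sib1 = [(1, 3), (1, 3), (2, 3), (2, 3)]
    sib2 = [(1, 3), (1, 4), (1, 4), (2, 4)]
    print(region_ibd12(sib1, sib2))  # (0.5, 0.25)
```

In the toy data the siblings share two founder haplotypes at the first marker, one at the second and fourth, and none at the third, giving a region IBD1 proportion of 0.5 and an IBD2 proportion of 0.25.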
Example
Using sim1000G, we generated data for 300 individuals in the region 60995249-61569446 bp of chromosome 4, based on 95 individuals from the 1000 Genomes CEU population. We compared the LD patterns of the original genotype data with the LD patterns of the simulated data. Figure 1 shows both LD patterns: the lower triangle of the matrix shows the pattern for the original genotypes of the 95 individuals from 1000 Genomes, and the upper triangle shows the same pattern for the 300 simulated individuals. Although there are some subtle differences in LD, sim1000G preserved both short-range and long-range LD in this region.
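A combined matrix of this kind can be assembled as in the following sketch. It is not the code used to produce the figure: it assumes genotypes coded as 0/1/2 allele counts with individuals in rows and markers in columns, and uses the squared Pearson correlation between genotype vectors as the LD measure. The function names and the toy matrices are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: correlation undefined, report 0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def combined_ld_matrix(original, simulated):
    """Lower triangle: LD in `original`; upper triangle: LD in `simulated`."""
    orig_cols = list(zip(*original))
    sim_cols = list(zip(*simulated))
    m = len(orig_cols)
    out = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i):
            out[i][j] = r_squared(orig_cols[i], orig_cols[j])  # lower
            out[j][i] = r_squared(sim_cols[i], sim_cols[j])    # upper
    return out


if __name__ == "__main__":
    # Toy data: 4 individuals x 3 markers in each data set.
    original = [[0, 0, 1], [1, 1, 0], [2, 2, 1], [1, 1, 2]]
    simulated = [[0, 2, 0], [1, 1, 1], [2, 0, 2], [1, 1, 0]]
    for row in combined_ld_matrix(original, simulated):
        print(["%.2f" % v for v in row])
```

Both data sets must contain the same markers in the same order, while the number of individuals may differ, as it does between the 95 original and 300 simulated individuals.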