The package can be used to simulate short regions or even whole-chromosome data, provided that the limitations of R and the memory required to compute correlations and LD patterns across every pair of markers are taken into account. Simulations of 1000-2000 markers are easily done on a laptop computer, and simulations of 4000 to 10000 markers are possible on a high-memory workstation.
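The marker limits quoted above follow from the quadratic growth of any all-pairs matrix. As a rough back-of-the-envelope sketch (the function name is hypothetical, and the figure covers a single dense matrix of 8-byte doubles, which is how R stores numeric values; temporary copies made during computation would add to it):

```python
def pairwise_matrix_mib(n_markers, bytes_per_value=8):
    """Memory (MiB) of one dense n x n matrix of pairwise values.

    Assumes 8-byte double-precision entries; this is a lower bound,
    not a measurement of what the package actually allocates.
    """
    return n_markers ** 2 * bytes_per_value / 2 ** 20


for n in (2000, 4000, 10000):
    print(n, "markers:", round(pairwise_matrix_mib(n), 1), "MiB per matrix")
```

A single matrix for 2000 markers needs about 30 MiB, whereas 10000 markers need about 763 MiB, so a few intermediate copies are enough to exceed the memory of a typical laptop.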
Family and recombination modelling
Although this approach is valid for generating unrelated individuals, simulating family data additionally requires modelling the recombination of chromosomes. A goal of the development of sim1000G was to enable simulations of arbitrary pedigree structures with multi-marker data. This requires modelling and simulating every meiosis in a pedigree, together with the occurrence and location of each recombination event. We have added a complete set of methods to the package to enable the generation of realistic genotype data for families with simple or complex pedigrees.
To model recombination, we implement two recombination models within sim1000G: a chi-squared interference model and a simple no-interference model. These models are used to generate inter-recombination distances for the region under simulation, which in turn determine the recombination of the resulting haplotypes. The model with interference was adapted from a previously described two-pathway model \citep{Housworth_2003}.
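To illustrate how inter-recombination distances can be generated under the two kinds of model, the following is a minimal sketch in Python. It is not the sim1000G implementation and it omits the two-pathway mixture. It assumes the standard chi-squared formulation, in which distances between chiasmata on the four-strand bundle are gamma distributed with shape `m + 1` and mean 50 cM, and each chiasma is passed to a given gamete with probability one half. Setting `m = 0` gives the no-interference model. The function name, the parameter names and the burn-in length are all hypothetical.

```python
import random


def simulate_crossovers(length_cm, m=0, rng=None, burn_in_cm=1000.0):
    """Return sorted crossover positions (cM) on one gamete.

    m = 0 : no interference (exponential gaps, i.e. a Poisson process).
    m > 0 : chi-squared interference (gamma gaps with shape m + 1).
    """
    rng = rng or random.Random()
    shape = m + 1
    scale = 50.0 / shape      # mean distance between chiasmata is 50 cM
    pos = -burn_in_cm         # start upstream so the region start is not special
    crossovers = []
    while True:
        pos += rng.gammavariate(shape, scale)
        if pos > length_cm:
            break
        # Each chiasma involves the transmitted chromatid with probability 1/2.
        if pos >= 0 and rng.random() < 0.5:
            crossovers.append(pos)
    return crossovers


rng = random.Random(1)
print(simulate_crossovers(200.0, m=0, rng=rng))  # no interference
print(simulate_crossovers(200.0, m=4, rng=rng))  # with interference
```

Under both settings the expected number of crossovers is one per 100 cM. Interference changes only their spacing, making closely adjacent crossovers less likely.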
In addition, simulations of family data require a detailed genetic map. For human autosomal data, we provide genetic maps obtained by lifting the HapMap Phase II genetic map over from build 35 to GRCh37. The original map was generated using LDhat, as described in the 2007 HapMap paper \citep{17943122}. Because of package size limitations, the genetic maps are not included in the package distribution; they are provided on the accompanying website of the package.
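A genetic map is typically used by converting the physical position of each marker (in base pairs) to a genetic position (in cM), commonly by linear interpolation between the map points. The sketch below shows that step under stated assumptions: the map values are invented for illustration, the function name is hypothetical, and nothing is assumed about the file format of the maps distributed with the package.

```python
from bisect import bisect_right


def interpolate_cm(map_bp, map_cm, query_bp):
    """Linearly interpolate a genetic position (cM) for a physical position.

    map_bp must be sorted in increasing order; positions outside the map
    are clamped to the first or last map value.
    """
    if query_bp <= map_bp[0]:
        return map_cm[0]
    if query_bp >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, query_bp)   # first map point to the right of the query
    x0, x1 = map_bp[i - 1], map_bp[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (y1 - y0) * (query_bp - x0) / (x1 - x0)


# Toy map (made-up values): physical positions and cumulative cM.
toy_bp = [1_000_000, 2_000_000, 4_000_000]
toy_cm = [0.0, 1.5, 2.5]
print(interpolate_cm(toy_bp, toy_cm, 1_500_000))
print(interpolate_cm(toy_bp, toy_cm, 3_000_000))
```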
Methods for familial studies commonly require estimates of identity by descent (IBD) probabilities between members of the same pedigree. Within its simulation model, sim1000G tracks all ancestral haplotypes and alleles through each recombination event. This enables the computation of the exact identity by descent state for each marker in the region being simulated. We added a simple user interface that computes the exact IBD 1 and IBD 2 proportions for every pair of individuals with a single call to the function computePairIBD12.
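The idea of deriving exact IBD proportions from tracked ancestry can be sketched as follows. This is an illustrative Python sketch, not the code behind computePairIBD12: it assumes that each individual is stored as two haplotypes, each a list giving the founder haplotype of origin at every marker, and it ignores inbreeding. The function name and the data layout are hypothetical.

```python
def pair_ibd12(ind_a, ind_b):
    """Return (IBD1, IBD2) proportions across markers for two individuals.

    Each individual is a pair of equal-length lists; entry k of a list is
    the label of the founder haplotype that the allele at marker k came from.
    """
    a1, a2 = ind_a
    b1, b2 = ind_b
    n = len(a1)
    ibd1 = ibd2 = 0
    for k in range(n):
        # IBD 2: both alleles share founder origin under one of the two pairings.
        if (a1[k] == b1[k] and a2[k] == b2[k]) or (a1[k] == b2[k] and a2[k] == b1[k]):
            ibd2 += 1
        # IBD 1: exactly one allele shares founder origin.
        elif a1[k] in (b1[k], b2[k]) or a2[k] in (b1[k], b2[k]):
            ibd1 += 1
    return ibd1 / n, ibd2 / n


# Two siblings; founder haplotypes 1, 2 (father) and 3, 4 (mother), four markers.
sib_a = ([1, 1, 2, 2], [3, 3, 3, 3])
sib_b = ([1, 2, 2, 2], [3, 3, 4, 4])
print(pair_ibd12(sib_a, sib_b))  # (0.75, 0.25)
```

In this toy example the siblings share both founder origins at the first marker and exactly one at each of the remaining three.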
Simulating population stratification
Computational efficiency