After selecting the region to be simulated and the population, a regional VCF file can be generated using simple commands from the bcftools package. The VCF file is then read by sim1000G, and the regional LD and allele frequencies of the samples are automatically calculated within R using the hapsim package \cite{Montana_2005}. There is no limit on the extent of the region (in Mbp); however, the number of markers selected for simulation generally has to be fewer than 2000.
After this step, sim1000G can generate new genetic data for simulated individuals within this region, using the hapsim approach of truncating a multivariate normal distribution whose covariance is specified from the LD pattern of the region.
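The truncation idea can be illustrated with a minimal Python sketch. It is not the hapsim or sim1000G code: the function names, the three-marker allele frequencies, and the latent correlation matrix are hypothetical, and the sketch uses the latent correlation directly rather than calibrating it so that the resulting binary alleles reproduce the observed LD. Each haplotype is a draw of correlated standard normals, and a marker carries the alternate allele when its normal value exceeds a threshold set by the allele frequency.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def simulate_haplotypes(alt_freqs, latent_corr, n_haplotypes, rng):
    """Draw 0/1 haplotypes by thresholding correlated standard normals."""
    n_markers = len(alt_freqs)
    lower = cholesky(latent_corr)
    # Threshold chosen so that P(z > threshold) equals the alternate allele frequency.
    thresholds = [NormalDist().inv_cdf(1.0 - f) for f in alt_freqs]
    haplotypes = []
    for _ in range(n_haplotypes):
        independent = [rng.gauss(0.0, 1.0) for _ in range(n_markers)]
        # Multiply by the Cholesky factor to induce the latent correlation.
        z = [sum(lower[i][k] * independent[k] for k in range(i + 1))
             for i in range(n_markers)]
        haplotypes.append([1 if z[i] > thresholds[i] else 0
                           for i in range(n_markers)])
    return haplotypes


# Hypothetical three-marker region (illustrative values, not from real data).
alt_freqs = [0.10, 0.30, 0.50]
latent_corr = [[1.0, 0.8, 0.2],
               [0.8, 1.0, 0.3],
               [0.2, 0.3, 1.0]]
haps = simulate_haplotypes(alt_freqs, latent_corr, 20000, random.Random(1))
observed_freqs = [sum(h[i] for h in haps) / len(haps) for i in range(3)]
```

In this sketch the simulated allele frequencies match the targets by construction, while the correlation between binary alleles is weaker than the latent correlation; an additional calibration step would be needed to match a target LD matrix.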
Simulating family genetic data
A goal in developing sim1000G was the ability to simulate arbitrary pedigree structures. We have added methods that generate realistic genotype data for multiple markers within a region, usually a gene, in familial data from simple or complex pedigrees.
We introduce two recombination models within sim1000G: a chi-squared model and a simple no-interference model. These models are used to simulate the locations of recombination events for a meiosis/fertilization event.
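The two recombination models can be sketched in Python as follows. This is a common textbook formulation and is assumed here for illustration; it is not taken from the sim1000G source. The function names and the interference parameter `m` are hypothetical. Intermediate events occur as a Poisson process on the four-strand bundle, every `(m + 1)`-th event becomes a chiasma, and each chiasma is passed to a given gamete with probability 1/2. Setting `m = 0` gives the no-interference model.

```python
import random


def crossover_positions(length_cm, m, rng):
    """Sample crossover positions (in cM) on one gamete.

    m = 0: no-interference (Poisson) model.
    m > 0: chi-square model, every (m + 1)-th intermediate event is a chiasma.
    """
    length_morgans = length_cm / 100.0
    rate = 2.0 * (m + 1)       # intermediate events per Morgan on the bundle
    skip = rng.randint(0, m)   # random phase so the process starts stationary
    position = 0.0
    crossovers = []
    while True:
        position += rng.expovariate(rate)
        if position > length_morgans:
            break
        if skip > 0:
            skip -= 1          # intermediate event that is not a chiasma
            continue
        skip = m               # reset the counter after each chiasma
        if rng.random() < 0.5:  # the chiasma involves this gamete half the time
            crossovers.append(position * 100.0)
    return crossovers


def recombine(hap_a, hap_b, marker_cm, crossovers, start_with_a=True):
    """Build a gamete by switching parental haplotype at each crossover.

    marker_cm must be sorted in increasing order.
    """
    ordered = sorted(crossovers)
    gamete = []
    use_a = start_with_a
    idx = 0
    for pos, allele_a, allele_b in zip(marker_cm, hap_a, hap_b):
        # Flip the source haplotype for every crossover passed before this marker.
        while idx < len(ordered) and ordered[idx] < pos:
            use_a = not use_a
            idx += 1
        gamete.append(allele_a if use_a else allele_b)
    return gamete


rng = random.Random(7)
events = crossover_positions(length_cm=100.0, m=4, rng=rng)
child_hap = recombine([0, 0, 0, 0], [1, 1, 1, 1], [10.0, 40.0, 60.0, 90.0], events)
```

Under both settings the expected number of crossovers per gamete is one per 100 cM; a larger `m` spaces the crossovers more evenly, which is how the chi-square model represents interference.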
sim1000G tracks all ancestral haplotypes and alleles through each recombination event. This enables the computation of the identity-by-descent (IBD) state for each marker in the region being simulated. To this end, the function computeIBD12
that can be chosen from a subset of 1000 genomes individuals. The initial population is selected to be from 1000
For example methods and simulation statistics