Denote the $n$ outcomes as $Y = (y_1, \ldots, y_n)^T$, and the $m$-vector of genotype scores as $G_i$ for individual $i = 1, \ldots, n$. We assume the linear regression model $y_i = \beta_0 + \mu_i + \varepsilon_i$, where $\varepsilon_i$ is a zero-mean normal random variable with variance $\sigma^2$, $\mu_i = G_i^T \beta$ is the contribution from the variant set, and $\beta$ is an $m$-vector of regression coefficients. Without loss of generality, we assume that the genotype scores have been centered and that the outcome has been standardized to unit variance, $\sigma = 1$.
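
As an illustration of the model above, the following minimal Python sketch simulates outcomes from $y_i = \beta_0 + G_i^T \beta + \varepsilon_i$ after centering the genotype scores. The function name `simulate_outcomes`, the toy genotype matrix, and the effect sizes are hypothetical choices made for this example only; they are not part of sim1000G or of the analysis described here, and the sketch does not rescale the outcome to unit variance.

```python
import random


def simulate_outcomes(genotypes, beta, beta0=0.0, sigma=1.0, seed=0):
    """Simulate y_i = beta0 + G_i^T beta + eps_i, with eps_i ~ N(0, sigma^2).

    genotypes: list of n rows, each a list of m genotype scores.
    beta: list of m regression coefficients.
    Returns the centered genotypes, the contributions mu_i, and the outcomes.
    """
    rng = random.Random(seed)
    n = len(genotypes)
    m = len(beta)

    # Center each variant (column) so its genotype scores have mean zero.
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]

    # Contribution of the variant set for each individual: mu_i = G_i^T beta.
    mu = [sum(g * b for g, b in zip(row, beta)) for row in centered]

    # Add the intercept and independent normal noise.
    y = [beta0 + mu_i + rng.gauss(0.0, sigma) for mu_i in mu]
    return centered, mu, y


# Toy data (assumed for illustration): 200 individuals, 5 variants coded 0/1/2.
toy_rng = random.Random(1)
n, m = 200, 5
genotypes = [[toy_rng.choice([0, 1, 2]) for _ in range(m)] for _ in range(n)]
beta = [0.3, 0.0, -0.2, 0.0, 0.1]  # hypothetical effect sizes

centered, mu, y = simulate_outcomes(genotypes, beta)
```

Because the genotype scores are centered, the contributions $\mu_i$ sum to zero across individuals, so $\beta_0$ is the mean of the outcome in this sketch.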

Discussion

We have developed an R package (sim1000G) that enables researchers to easily generate realistic simulated regional genetic data based on real sequence data from the 1000 Genomes Project or any other phased variant call (VCF) file. The capabilities of the package extend to the simulation of thousands of markers and tens of thousands of individuals across arbitrary pedigree structures.
Compared to most other simulation packages, sim1000G is easier to install and use, and it is based on real phased sequence data from actual human populations. Whether the task is generating realistic data for 50 markers in European families or simulating 1000 markers in populations of mixed ethnicity, sim1000G can do so with minimal overhead and set-up time.
