Bootstrap Distillation: Non-parametric Internal Validation of GWAS Results by Subgroup Resampling


David Andrew Eccles,
Rodney

Abstract

Genome-wide association studies (GWAS) test a large number of genetic variants in a large number of people, allowing the detection of small genetic effects that are associated with a trait. Because genotypes vary naturally within populations, any particular sample may not represent the true genotype frequencies of the population from which it was drawn. This sampling variation may lead to the observation of marker-disease associations when no such association exists.

A bootstrap population sub-sampling technique can reduce the false-positive results that allele frequency variation produces in particular samplings of the population. To evaluate its effectiveness on a serious disease, this sub-sampling method has been applied to the Type 1 Diabetes dataset from the Wellcome Trust Case Control Consortium.
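The idea of sub-sampling to separate stable signals from sampling noise can be illustrated with a minimal sketch. It is not the procedure used in this study: the statistic (absolute allele-frequency difference), the sub-sample fraction, the top-k cutoff, the simulated data, and the function names `simulate` and `subsample_support` are all assumptions chosen for illustration.

```python
import numpy as np

def simulate(n=400, n_markers=200, seed=1):
    """Simulated allele counts (0, 1, 2); only marker 0 is truly associated."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.5, size=n_markers)
    controls = rng.binomial(2, freqs, size=(n, n_markers))
    case_freqs = freqs.copy()
    case_freqs[0] = freqs[0] + 0.25  # planted effect (hypothetical size)
    cases = rng.binomial(2, case_freqs, size=(n, n_markers))
    return cases, controls

def subsample_support(cases, controls, n_rounds=100, frac=0.5, top_k=5, seed=0):
    """Fraction of sub-sampling rounds in which each marker ranks in the top_k."""
    rng = np.random.default_rng(seed)
    hits = np.zeros(cases.shape[1])
    n_case = max(1, int(frac * cases.shape[0]))
    n_ctrl = max(1, int(frac * controls.shape[0]))
    for _ in range(n_rounds):
        # Draw a random subgroup of cases and of controls (without replacement).
        case_idx = rng.choice(cases.shape[0], size=n_case, replace=False)
        ctrl_idx = rng.choice(controls.shape[0], size=n_ctrl, replace=False)
        # Allele frequency is the mean allele count divided by 2.
        delta = np.abs(cases[case_idx].mean(axis=0)
                       - controls[ctrl_idx].mean(axis=0)) / 2
        hits[np.argsort(delta)[-top_k:]] += 1
    return hits / n_rounds

cases, controls = simulate()
support = subsample_support(cases, controls)
print("support for planted marker:", support[0])
print("markers selected in every round:", np.flatnonzero(support == 1.0))
```

In this sketch, a marker that ranks highly only because of one chance sampling tends to fall out of the top set in other rounds, whereas a marker with a real frequency difference is selected in most or all of them.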

While previous literature on Type 1 Diabetes has identified some DNA variants that are associated with the disease, these variants are of limited use for distinguishing between disease cases and controls using genetic information alone (AUC=0.7284). Population sub-sampling filtered out noise from genome-wide association data, and increased the chance of finding useful associative signals. Subsequent filtering based on marker linkage, followed by testing of marker sets of different sizes, produced a 5-SNP signature set of markers for Type 1 Diabetes. The group-specific markers used in this set, primarily from the HLA region on chromosome 6, are considerably more informative than previously known associated variants for predicting Type 1 Diabetes phenotype from genetic data (AUC=0.8395). Given this predictive quality, the signature set may be useful alone as a screening test, and would be particularly useful in combination with other clinical cofactors for Type 1 Diabetes risk.
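For readers unfamiliar with the AUC values quoted above: the area under the ROC curve equals the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition. The function name `auc` and the toy scores are illustrative assumptions and have no connection to the study data.

```python
import numpy as np

def auc(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly (ties = 0.5)."""
    cases = np.asarray(case_scores, dtype=float)[:, None]
    controls = np.asarray(control_scores, dtype=float)[None, :]
    wins = (cases > controls).sum()
    ties = (cases == controls).sum()
    return (wins + 0.5 * ties) / (cases.size * controls.size)

# Toy scores: three of four pairs are ranked correctly and one is tied.
print(auc([3, 2], [1, 2]))  # 0.875
```

An AUC of 0.5 corresponds to a score that ranks cases and controls no better than chance, and 1.0 to perfect separation.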