Genotyping, Data processing and Quality control (QC)
Genotyping was performed in the Illumina iScan system using the high
throughput Infinium Global Screening Array version 1.0 that contains
642,824 genome-wide probes. Experiments were performed as per the
manufacturer’s protocol. This chip has representation of clinically
important markers from ClinVar, GWAS and pharmacogenomics information
from PharmaGKB. (Figure-1 )
To reduce genotype calling errors, we tried to follow best practices for
illumina data processing. Raw data for 3,132 samples were loaded in the
GenomeStudio version 2.0 for clustering and calling genotypes. Out of
3,132, there were 2,976 samples that passed genotype call rate of
>=0.95. Further, we selected 612,322 SNPs with GenTrain
score >=0.7. SNPs with GenTrain >=0.7 are
expected to be clustered correctly. Data filtered through GenomeStudio
criteria was processed by zcall (Version3.4). Zcall is a variant caller
that was suited for calling rare SNPs. We used z-score threshold of 8
and retrieved output in the PLINK format, preferable for QC and analysis
(Goldstein et al., 2012).
PLINK (v1.9) was used (Chang et al.,
2015) to filter out the variants and samples with 10% missing values,
additional 1,373 SNPs were removed while there was no exclusion at
sample level. We removed SNPs that significantly deviate from Hardy
Weinberg equilibrium (HWE). 12,970 variants were excluded with p-value
<10-6. We applied HWE filter only for common
SNPs (–maf 0.05) as it is not appropriate to use this filter for rare
SNPs. In addition, we also checked for relatedness among samples using
–genome function. For this, autosomal SNPs with –maf 0.1 were LD
pruned to estimate proportion of Identity-by-descent (IBD) between two
individuals. We removed 181 individuals with PI_HAT >0.2,
an estimate of IBD. Further, exclusion of samples was based on the rate
of missing genotypes. Individual with high rate of missing genotypes was
excluded among the two. After all QC, final set included 597,979
variants in 2795 individuals for which frequency was computed. Allele
frequencies of the variants were computed using –freq option in
Plink. Frequency was computed with respect to the alternate allele
defined in the 1000 genomes
(http://grch37.ensembl.org/Homo_sapiens/Info/Index). An in-house
developed perl script was used to count the homozygous (ref/alt) and
heterozygous genotypes.