Genotyping, Data processing and Quality control (QC)
Genotyping was performed in the Illumina iScan system using the high throughput Infinium Global Screening Array version 1.0 that contains 642,824 genome-wide probes. Experiments were performed as per the manufacturer’s protocol. This chip has representation of clinically important markers from ClinVar, GWAS and pharmacogenomics information from PharmaGKB. (Figure-1 )
To reduce genotype calling errors, we tried to follow best practices for illumina data processing. Raw data for 3,132 samples were loaded in the GenomeStudio version 2.0 for clustering and calling genotypes. Out of 3,132, there were 2,976 samples that passed genotype call rate of >=0.95. Further, we selected 612,322 SNPs with GenTrain score >=0.7. SNPs with GenTrain >=0.7 are expected to be clustered correctly. Data filtered through GenomeStudio criteria was processed by zcall (Version3.4). Zcall is a variant caller that was suited for calling rare SNPs. We used z-score threshold of 8 and retrieved output in the PLINK format, preferable for QC and analysis (Goldstein et al., 2012).
PLINK (v1.9) was used (Chang et al., 2015) to filter out the variants and samples with 10% missing values, additional 1,373 SNPs were removed while there was no exclusion at sample level. We removed SNPs that significantly deviate from Hardy Weinberg equilibrium (HWE). 12,970 variants were excluded with p-value <10-6. We applied HWE filter only for common SNPs (–maf 0.05) as it is not appropriate to use this filter for rare SNPs. In addition, we also checked for relatedness among samples using –genome function. For this, autosomal SNPs with –maf 0.1 were LD pruned to estimate proportion of Identity-by-descent (IBD) between two individuals. We removed 181 individuals with PI_HAT >0.2, an estimate of IBD. Further, exclusion of samples was based on the rate of missing genotypes. Individual with high rate of missing genotypes was excluded among the two. After all QC, final set included 597,979 variants in 2795 individuals for which frequency was computed. Allele frequencies of the variants were computed using –freq option in Plink. Frequency was computed with respect to the alternate allele defined in the 1000 genomes (http://grch37.ensembl.org/Homo_sapiens/Info/Index). An in-house developed perl script was used to count the homozygous (ref/alt) and heterozygous genotypes.