Mapping of clinically relevant variants in the GSA to the ClinVar database
We mapped the variants genotyped in our subjects to the variants in the ClinVar database (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/, downloaded on April 1, 2019) (Landrum et al., 2015). Both genomic coordinates and dbSNP rsIDs were used for matching. For multi-allelic variants, we retained only those alleles with exact matches in ClinVar. After manual QC, we selected 19,538 variants (SNPs and indels) with an alternate allele frequency ≤0.05. As the focus of our analysis was on rare and clinically relevant variants, we further narrowed our query to pathogenic and likely pathogenic variants. To retrieve these variants, we used a clinical significance value of 1 in the ClinVar database and then applied the keyword filter “Pathogenic or Likely pathogenic”. Pathogenic and likely pathogenic variants are collectively referred to as pathogenic throughout this manuscript.
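The frequency and significance filters described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the record layout and the `alt_af` field are hypothetical, and the column names `ClinSigSimple` and `ClinicalSignificance` are assumed to follow the ClinVar tab-delimited summary file. Significance labels are split into individual terms and compared exactly, because a plain substring search for "pathogenic" would also match labels such as "Conflicting interpretations of pathogenicity".

```python
import re
from typing import Any, Dict, Iterable, List

# Terms treated as "pathogenic" in this sketch (compared case-insensitively).
PATHOGENIC_TERMS = {"pathogenic", "likely pathogenic"}


def significance_terms(label: str) -> List[str]:
    """Split a ClinVar-style significance label into lower-cased terms."""
    return [t.strip().lower() for t in re.split(r"[/,;]", label) if t.strip()]


def is_rare_pathogenic(record: Dict[str, Any], max_af: float = 0.05) -> bool:
    """Apply the three filters in order: frequency, simple flag, keyword.

    Assumed (hypothetical) record keys:
      alt_af               - alternate allele frequency in the study cohort
      ClinSigSimple        - ClinVar simple significance flag (1 is kept)
      ClinicalSignificance - free-text significance label
    """
    if float(record["alt_af"]) > max_af:
        return False
    if int(record["ClinSigSimple"]) != 1:
        return False
    terms = significance_terms(str(record["ClinicalSignificance"]))
    return any(term in PATHOGENIC_TERMS for term in terms)


def select_rare_pathogenic(
    records: Iterable[Dict[str, Any]], max_af: float = 0.05
) -> List[Dict[str, Any]]:
    """Return the records that pass all filters, preserving input order."""
    return [r for r in records if is_rare_pathogenic(r, max_af)]
```

Exact term matching keeps a variant labelled "Likely pathogenic, drug response" while rejecting one labelled only as conflicting, which is the behaviour the keyword filter is meant to have.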
Variants with the keywords “conflicting” and “no or uncertain interpretation” of pathogenicity, and other keywords such as “uncertain significance, association, risk factor, affects”, were selected and analysed using a combination of three tools to ascertain their effect: CADD scores, and Polyphen_DIV and SIFT predictions from ANNOVAR (Wang, Li, & Hakonarson, 2010). A variant of uncertain significance was assigned a score of 3 and classified as VUS-I if all three tools predicted pathogenicity under the following criteria: deleterious in SIFT, Probably Damaging (D) in Polyphen, and a CADD score ≥20. A score of 2.5 was assigned, and the variant classified as VUS-II, if it was deleterious in SIFT, Possibly Damaging (P) in Polyphen, and had a CADD score ≥20.
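The scoring rule can be written as a small decision function. The sketch below assumes that SIFT reports a deleterious call with the code "D" and that the CADD value is the scaled score; the Polyphen codes "D" and "P" are those given in the text. The function name and signature are illustrative only.

```python
from typing import Optional, Tuple

CADD_THRESHOLD = 20.0  # from the text: CADD >= 20


def vus_class(
    sift_pred: str, polyphen_pred: str, cadd: float
) -> Optional[Tuple[float, str]]:
    """Return (score, class) for a variant of uncertain significance.

    (3.0, "VUS-I")  : SIFT deleterious, Polyphen Probably Damaging (D), CADD >= 20
    (2.5, "VUS-II") : SIFT deleterious, Polyphen Possibly Damaging (P), CADD >= 20
    None            : any other combination (no class assigned in this sketch)
    """
    # Assumption: "D" is the SIFT code for a deleterious prediction.
    if sift_pred != "D" or cadd < CADD_THRESHOLD:
        return None
    if polyphen_pred == "D":
        return (3.0, "VUS-I")
    if polyphen_pred == "P":
        return (2.5, "VUS-II")
    return None
```

For example, `vus_class("D", "P", 22.4)` falls into VUS-II, whereas a variant that is tolerated in SIFT receives no class regardless of its other predictions.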
Annotation of genes and variants associated with rare and complex disorders: inborn errors of metabolism (IEM), MODY, cystic fibrosis, hereditary cancers and other hereditary conditions, using different resources.
  1. Genes associated with different IEM classes were retrieved from The Monarch Initiative database (https://monarchinitiative.org/) (Mungall et al., 2016). The 419 unique IEM genes, related to four classes (carbohydrate, amino acid, thyroid and energy metabolism), together with the subclasses defined under each IEM class, are provided in Table S2 and Figure S1.
  2. Maturity onset diabetes of the young (MODY) associated genes: These data were compiled from two sources. Source A: DiabetesGenes (https://www.diabetesgenes.org/tests-for-diabetes-subtypes/a-new-test-for-all-mody-genes/) lists 33 genes implicated in MODY or related forms such as MIDD (maternally inherited diabetes and deafness) or partial lipodystrophy. Source B: Firdous et al. compiled and classified genes into 14 MODY subtypes (Firdous et al., 2018). Table S3 provides the annotation of the 35 genes associated with MODY.
  3. Germline variants in hereditary cancers: A list of 851 genetic variants in 99 cancer-predisposing genes associated with hereditary cancers is provided in the study by Huang et al. (Table S4).
  4. Genetic variants associated with cystic fibrosis (Table S5): The CFTR2 database (https://www.cftr2.org/) reports pathogenic variants in the cystic fibrosis transmembrane conductance regulator (CFTR) gene from 88,664 patients (Sosnay et al., 2013). Data were downloaded from https://www.cftr2.org/sites/default/files/CFTR2_11March2019%20%281%29.xlsx. We prioritized 28 pathogenic variants in the CFTR gene. These included the classical cystic fibrosis (CF)-causing phenylalanine 508 deletion (F508del; rs113993960), which has a frequency of ~70% in the CFTR2 database. To investigate the haplotype origin of F508del, the most common mutation in the CFTR gene, we performed haplotype analysis using genotype data on 4389 variants from the 1000 Genomes Project. The genotype datasets were split by the four major population groups. We first selected the variants (209) with a frequency of ≥0.05 in European populations. Tagger was used to identify tag SNPs, and we included the less frequent F508del variant together with the tag SNPs to identify the segregation of this variant on different haplotype backgrounds. The frequencies of the inferred haplotypes were estimated using the PHASE algorithm (Stephens, Smith, & Donnelly, 2001) (Table S6).
  5. Among other hereditary conditions, variants with high occurrence (≥5) were analyzed for neurological and other neuromuscular disorders, cardiac disorders, Cornelia de Lange syndrome and other syndromic disorders.
  6. We also shortlisted 30 variants relevant from a pharmacogenomics perspective, which are tagged with the keyword “drug response” in ClinVar (Table S7).
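The variant selection that precedes haplotype inference in item 4 can be sketched as follows. This is a simplified illustration under stated assumptions: the mapping of variant IDs to European allele frequencies is a hypothetical input, the tag SNP list stands in for the output of Tagger (which is not reimplemented here), and the resulting list is what would be passed on to PHASE. Only the rsID of F508del and the 0.05 threshold come from the text.

```python
from typing import Dict, Iterable, List

F508DEL_RSID = "rs113993960"  # F508del, as given in the text


def common_variants(eur_af: Dict[str, float], min_af: float = 0.05) -> List[str]:
    """Return the IDs of variants with European frequency >= min_af, sorted by ID."""
    return sorted(vid for vid, af in eur_af.items() if af >= min_af)


def haplotype_input(
    tag_snps: Iterable[str], forced: Iterable[str] = (F508DEL_RSID,)
) -> List[str]:
    """Combine tag SNPs with variants that are included regardless of frequency.

    Order is preserved and duplicates are dropped, so a forced variant that
    is already a tag SNP appears only once.
    """
    seen = set()
    combined: List[str] = []
    for vid in list(tag_snps) + list(forced):
        if vid not in seen:
            seen.add(vid)
            combined.append(vid)
    return combined
```

Adding F508del after tagging, rather than before the frequency filter, reflects the order described in the text: the deletion is too rare to pass the ≥0.05 threshold, so it has to be included explicitly to observe the haplotype backgrounds on which it segregates.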