Mapping of clinically relevant variants in GSA
to the ClinVar database
We mapped the variants genotyped in our subjects to the variants in the
ClinVar database
(ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/, downloaded on
April 1, 2019) (Landrum et al., 2015).
Both genomic coordinates and dbSNP rsIDs were used for matching. For
multi-allelic variants, we retained only those alleles with exact
matches in ClinVar. After manual QC, we selected 19,538 variants (SNPs
and indels) with an alternate allele frequency ≤0.05. As our analysis
focused on rare, clinically relevant variants, we further narrowed our
query to pathogenic and likely pathogenic variants. To retrieve these
variants, we used the clinical significance value of 1 given in the
ClinVar database and then applied the keyword filter “Pathogenic or
Likely pathogenic”. Pathogenic and likely pathogenic variants are
collectively designated as pathogenic throughout this manuscript.
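The matching and filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record fields (`chrom`, `pos`, `ref`, `alt`, `rsid`, `alt_freq`, `clinsig_simple`, `clinsig`), the toy records and the rule that coordinates, alleles and rsID must all agree are hypothetical choices made for the example.

```python
# Sketch only: the field names and toy records below are assumptions made
# for illustration, not the columns of the ClinVar tab-delimited release.
MAX_ALT_FREQ = 0.05
PATHOGENIC_LABELS = {
    "pathogenic",
    "likely pathogenic",
    "pathogenic/likely pathogenic",
}


def allele_key(variant):
    """Coordinate-plus-allele key, so each alternate allele is matched separately."""
    return (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])


def match_to_clinvar(genotyped, clinvar):
    """Keep genotyped alleles with an exact allele and rsID match in ClinVar."""
    index = {allele_key(record): record for record in clinvar}
    matched = []
    for variant in genotyped:
        record = index.get(allele_key(variant))
        if record is not None and record["rsid"] == variant["rsid"]:
            matched.append({**variant,
                            "clinsig_simple": record["clinsig_simple"],
                            "clinsig": record["clinsig"]})
    return matched


def is_rare(variant):
    """Alternate allele frequency at or below the 0.05 cut-off."""
    return variant["alt_freq"] <= MAX_ALT_FREQ


def is_pathogenic(variant):
    """Clinical significance value of 1 plus an exact pathogenic label."""
    # Exact label comparison: a substring test for "pathogenic" would also
    # match "Conflicting interpretations of pathogenicity".
    label = variant["clinsig"].strip().lower()
    return variant["clinsig_simple"] == 1 and label in PATHOGENIC_LABELS


# Toy data with made-up identifiers and frequencies.
genotyped = [
    {"chrom": "7", "pos": 100, "ref": "A", "alt": "G", "rsid": "rs_toy1", "alt_freq": 0.01},
    {"chrom": "7", "pos": 100, "ref": "A", "alt": "T", "rsid": "rs_toy1", "alt_freq": 0.02},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "T", "rsid": "rs_toy2", "alt_freq": 0.30},
    {"chrom": "2", "pos": 300, "ref": "G", "alt": "A", "rsid": "rs_toy3", "alt_freq": 0.001},
]
clinvar = [
    {"chrom": "7", "pos": 100, "ref": "A", "alt": "G", "rsid": "rs_toy1",
     "clinsig_simple": 1, "clinsig": "Pathogenic"},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "T", "rsid": "rs_toy2",
     "clinsig_simple": 1, "clinsig": "Likely pathogenic"},
    {"chrom": "2", "pos": 300, "ref": "G", "alt": "A", "rsid": "rs_toy3",
     "clinsig_simple": 0, "clinsig": "Conflicting interpretations of pathogenicity"},
]

matched = match_to_clinvar(genotyped, clinvar)
rare = [v for v in matched if is_rare(v)]
pathogenic = [v for v in rare if is_pathogenic(v)]
print([v["rsid"] for v in pathogenic])
```

Keying on the alternate allele as well as the position is what drops the second allele of the toy multi-allelic site, which has no exact counterpart in the toy ClinVar records.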
Variants annotated with the keywords “conflicting” or “no or uncertain
interpretation” of pathogenicity, or with other keywords such as
“uncertain significance”, “association”, “risk factor” and “affects”,
were selected and analysed using a combination of three tools to
ascertain their effect. We used CADD scores and Polyphen_DIV and SIFT
predictions from ANNOVAR (Wang, Li, & Hakonarson, 2010). A variant of
uncertain significance (VUS) was assigned a score of 3 and classified as
VUS-I if all three tools predicted pathogenicity by the following
criteria: deleterious in SIFT, Probably Damaging (D) in Polyphen, and a
CADD score ≥20. A score of 2.5 was assigned, and the variant classified
as VUS-II, if it was deleterious in SIFT, Possibly Damaging (P) in
Polyphen, and had a CADD score ≥20.
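The scoring rule above can be written as a small function. This is a sketch under assumptions: the function name `classify_vus` is hypothetical, the single-letter code "D" for a deleterious SIFT call is assumed from the usual ANNOVAR annotation output, and variants that meet neither rule simply return `None`, since the text does not say how they were handled.

```python
CADD_THRESHOLD = 20.0


def classify_vus(sift_pred, polyphen_pred, cadd_score):
    """Return (score, label) for a variant of uncertain significance.

    Assumed prediction codes: SIFT "D" = deleterious; Polyphen "D" =
    Probably Damaging, "P" = Possibly Damaging. Returns None when the
    variant meets neither the VUS-I nor the VUS-II criteria.
    """
    # Both classes require a deleterious SIFT call and a CADD score >= 20.
    if sift_pred != "D" or cadd_score < CADD_THRESHOLD:
        return None
    if polyphen_pred == "D":
        return (3.0, "VUS-I")
    if polyphen_pred == "P":
        return (2.5, "VUS-II")
    return None


# Toy inputs: (SIFT, Polyphen, CADD) triples invented for illustration.
examples = [("D", "D", 27.3), ("D", "P", 20.0), ("T", "D", 31.0), ("D", "B", 24.0)]
for sift, polyphen, cadd in examples:
    print(sift, polyphen, cadd, classify_vus(sift, polyphen, cadd))
```

The two classes differ only in the Polyphen call, so the shared SIFT and CADD conditions are checked once before branching.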
Annotation of genes and variants associated with rare and
complex disorders: Inborn errors of metabolism (IEM), MODY, cystic
fibrosis, hereditary cancers and other hereditary conditions using
different resources.
- Genes associated with different IEM classes were retrieved from The
Monarch Initiative database (https://monarchinitiative.org/)
(Mungall et al., 2016). The 419 unique IEM genes related to four
classes (carbohydrate, amino acid, thyroid and energy metabolism), as
well as the subclasses defined under each IEM class, are provided in
Table S2 and Figure S1.
- Maturity onset diabetes of the young (MODY) associated genes: These
data were compiled from two sources. Source A: DiabetesGenes
(https://www.diabetesgenes.org/tests-for-diabetes-subtypes/a-new-test-for-all-mody-genes/)
lists 33 genes implicated in MODY or related forms such as MIDD
(maternally inherited diabetes and deafness) or partial lipodystrophy.
Source B: Firdous et al. compiled and classified genes into 14 MODY
subtypes (Firdous et al., 2018). Table S3 provides the annotation of
35 genes associated with MODY.
- Germline variants in hereditary cancers: A list of 851 genetic variants
in 99 cancer-predisposing genes associated with hereditary cancers, as
provided in the study by Huang et al., is given in Table S4.
- Genetic variants associated with cystic fibrosis (Table S5): The
CFTR2 database (https://www.cftr2.org/) reports pathogenic variants
in the cystic fibrosis transmembrane conductance regulator (CFTR) gene
from 88,664 patients (Sosnay et al., 2013). Data were downloaded from
https://www.cftr2.org/sites/default/files/CFTR2_11March2019%20%281%29.xlsx.
We prioritized 28 pathogenic variants in the CFTR gene. These included
the classical cystic fibrosis (CF)-causing phenylalanine 508 (F508)
deletion (rs113993960), which has a frequency of ~70% in the CFTR2
database. To investigate the haplotype origin of F508del, the most
common mutation in the CFTR gene, we performed haplotype analysis
using genotype data on 4389 variants from the 1000 Genomes Project.
These genotype datasets were divided separately for the four major
population groups. We first selected the variants (209) that have a
frequency of ≥0.05 in European populations. Tagger was used to
identify tag SNPs, and we included the less frequent F508del variant
along with the tag SNPs to identify the segregation of this variant on
different haplotype backgrounds. The frequencies of the inferred
haplotypes were estimated using the PHASE algorithm
(Stephens, Smith, & Donnelly, 2001) (Table S6).
- Among other hereditary conditions, variants with high occurrence (≥5)
were analyzed for disorders, viz. neurological and other neuromuscular
disorders, cardiac disorders, Cornelia de Lange syndrome and other
syndromic disorders.
- We also shortlisted 30 variants relevant from a pharmacogenomics
perspective, which are tagged with the keyword “drug response” in
ClinVar (Table S7).
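The variant selection that precedes the F508del haplotype analysis, described in the cystic fibrosis item above, can be sketched as follows. This is only an illustration under assumptions: the helper names, the record layout, the population label "EUR" and all toy frequencies are hypothetical, and the Tagger and PHASE steps themselves are not reproduced (a placeholder stands in for the Tagger output).

```python
F508DEL = "rs113993960"  # rsID of the F508 deletion, as given in the text
MIN_FREQ = 0.05


def common_variants(variants, population, min_freq=MIN_FREQ):
    """rsIDs of variants whose frequency in `population` is >= min_freq."""
    return [v["rsid"] for v in variants
            if v["freq"].get(population, 0.0) >= min_freq]


def sites_for_phasing(tag_snps, extra=(F508DEL,)):
    """Tag SNPs plus any forced-in variants, order kept, no duplicates."""
    sites = list(tag_snps)
    for rsid in extra:
        if rsid not in sites:
            sites.append(rsid)
    return sites


# Toy records: identifiers and frequencies are invented for illustration.
toy_variants = [
    {"rsid": "rs_toy_a", "freq": {"EUR": 0.21, "AFR": 0.02}},
    {"rsid": "rs_toy_b", "freq": {"EUR": 0.04, "AFR": 0.30}},
    {"rsid": "rs_toy_c", "freq": {"EUR": 0.05}},
    {"rsid": F508DEL, "freq": {"EUR": 0.01}},  # toy value, not an estimate
]

common = common_variants(toy_variants, "EUR")
tag_snps = common[:1]  # placeholder for the tag SNPs an external tool would return
sites = sites_for_phasing(tag_snps)
print(common, sites)
```

Because F508del falls below the frequency cut-off, it is added explicitly to the tag SNP set rather than relying on the frequency filter to retain it.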