INTRODUCTION
Advancements over the last decade in genetic tools and high-throughput detection methods have accelerated the pace of discovery of novel genes and variants associated with monogenic Mendelian diseases. Currently, 7000 OMIM phenotypes with distinct genetic etiologies have been delineated (Hamosh, Scott, Amberger, Bocchini, & McKusick, 2005). These global efforts have significantly advanced our understanding of rare genetic disorders and monogenic diseases. Although population genomics research has made significant contributions to the study of Indian populations, very few studies have provided a comprehensive genome-level understanding of monogenic disorders in these populations. In the present era of precision medicine, there is an urgent need for nationwide genomics efforts to establish a framework for healthcare delivery guided by genomic medicine and to provide extensive coverage of genomic biomarkers across populations, facilitating rapid diagnosis and affordable genomic healthcare solutions.
India comprises 1.3 billion people from diverse ethnic, cultural and linguistic lineages, with ancestries shared with many global populations. Further, the genetic diversity of these populations has been shaped by socio-cultural factors such as endogamy and consanguinity, by geographical clines, by a long history of migration events during intercontinental exchange of trade and art, and by admixture with local populations (Basu et al., 2003; Basu, Sarkar-Roy, & Majumder, 2016; I. G. V. Consortium, 2008; Reich, Thangaraj, Patterson, Price, & Singh, 2009). This provides a unique gene-variant pool and a reservoir for founder events. In the recent past, extensive nationwide genomic efforts have been undertaken to understand this genetic diversity. For instance, the consortium-level IGVdb effort has provided a catalogue of single nucleotide polymorphisms in 900 genes that map to disease-associated regions across 55 diverse Indian populations (I. G. V. Consortium, 2008; Narang et al., 2010). Genetic analysis revealed that ethnicity and language are stronger determinants of genetic structure than geography. These studies highlighted that Indian populations can be divided broadly into four genetic clusters based on ethno-linguistic classification: Austro-Asiatic (AA), Dravidian (DR), Indo-European (IE) and Tibeto-Burman (TB). Large DR and IE populations are known to exhibit a high degree of admixture and form multiple sub-clusters; however, isolated populations, specifically from the DR and AA groups, are distinct and unique (I. G. V. Consortium, 2008). In addition, mitochondrial and Y-chromosome haplogroup-based studies have helped characterize the gene pool of diverse Indian populations (Bamshad et al., 2001; Borkar, Ahmad, Khan, & Agrawal, 2011; Kivisild et al., 2003; Majumder et al., 1999; Thanseem et al., 2006).
The utility of India-specific baseline variability data was demonstrated in the pre-NGS era in infectious diseases (for example, malaria and HIV), pharmacogenomics studies, disease associations, and the identification of at-risk populations for various neurological, cutaneous and high-altitude adaptation related disorders (Aggarwal et al., 2015; Aggarwal et al., 2010; Bhattacharjee et al., 2008; A Biswas et al., 2007; Arindam Biswas et al., 2010; Chaki et al., 2011; Giri et al., 2014; Grover et al., 2010; Gupta et al., 2007; P. Jha et al., 2012; Kanchan et al., 2015; Kumar et al., 2009; Sinha, Arya, Agarwal, & Habib, 2009; Sinha et al., 2008; Talwar et al., 2017).
Due to the limited availability of high-throughput platforms, systematic efforts to understand the spectrum of Mendelian and monogenic variants have not been carried out across the diverse Indian populations. With the advent of NGS, Indian and other global research groups have made additional efforts to provide variant information at the genome-wide scale: SAGE (South Asian Genome and Exome) (Hariprakash et al., 2018), South Asian genomes from the 1000 Genomes Project (G. P. Consortium, 2015), south Indian individuals (INDEX-db) (Ahmed P et al., 2019) and a few others. The Indian Genetic Disease Database v1.0 provides information on 1000 genetic diseases in over 3500 Indian patients (#IGDD). Other noteworthy contributions have been made in the genetics of hemoglobinopathies (thalassemia and sickle cell anemia), Duchenne muscular dystrophy (DMD), cystic fibrosis (CF), spinocerebellar ataxias, mitochondrial disorders and cardiomyopathies (Pradhan et al., 2010). There is now also a representative knowledgebase of Indian genetic disorders that aggregates information from multiple NGS and single sequencing based case report studies on lysosomal storage disorders, skeletal dysplasias, primary immunodeficiencies, genodermatoses and other neurogenetic ailments (http://guardian.meragenome.com/). Recently published data from the GenomeAsia 100K Project (GAsP) provided comprehensive genome-level coverage of over 1700 individuals from different Asian countries, highlighting the need for adequate representation of Asian genome-level information in public databases (GenomeAsia100K Consortium, 2019).
Multiple countrywide efforts are ongoing in government-funded basic and translational genomic research laboratories, genetics units of tertiary hospitals and commercial enterprises to meet the needs of the clinical genetics segment of the healthcare system in India. Despite these efforts, a few challenges remain unmet for the implementation of genomic medicine in Indian populations. These stem primarily from either the lack of representation of different ethnic populations of India or the low sample sizes of earlier studies conducted in Indian populations. As a result, there is (1) a paucity of knowledge of the mutation spectrum and mutation frequencies, (2) a lack of systematic characterization of known pathogenic mutations linked to various monogenic disorders, (3) a scarcity of knowledge of the genetic spectrum of the 7000 OMIM phenotypes and other prevalent genetic disorders, and (4) limited characterization of novel mutations.
To address these issues, our study provides a comprehensive catalogue of monogenic disease-linked variants in diverse Indian populations (n=2795). Our study utilized a high-throughput and affordable genomics tool, the Global Screening Array (GSA) from Illumina, which provides information on over 19,538 global clinically annotated variants. In brief, our study is novel and unique in that (i) it covers diverse multiethnic Indian cohorts with a large sample size of 2795 healthy subjects, (ii) it provides the frequency distribution of known pathogenic variants for inborn errors of metabolism, hematological disorders and other Mendelian disorders in Indian populations, and (iii) the representation of South Asian (SAS) pathogenic variants is higher in our study than in other global repositories such as the 1000 Genomes populations (G. P. Consortium, 2015), the Genome Aggregation Database (gnomAD) (K. Karczewski & Francioli, 2017), the Exome Aggregation Consortium (ExAC) (K. J. Karczewski et al., 2016) and GenomeAsia100K (GenomeAsia100K Consortium, 2019). We have created a unique database to catalogue and register information on clinically relevant variants for the Indian population. Further, we demonstrate that our cohort is genetically much more diverse than the representative South Asian populations in the 1000 Genomes dataset, which highlights gaps and opportunities for future research.