Introduction
Since 2009, whole exome sequencing (WES) and whole genome sequencing (WGS) have become the main tools for discovering novel disease genes and variants related to rare Mendelian phenotypes (Chong et al. 2015). With these approaches, progress has accelerated: the number of genes with known phenotype-causing variants has expanded from 2,346 in 2009 to 4,532 currently, or ~22% of the total protein-encoding genes in the genome (OMIM). That leaves nearly 80% of the predicted ~20,000 protein-encoding genes yet to be connected to a disease phenotype. Similarly, 50-75% of clinical and research WES analyses do not identify a responsible variant or variants, even in families that present Mendelian segregation of human disease traits (Chong et al. 2015; Posey et al. 2019; Retterer et al. 2016; Yang et al. 2014).
Possible explanations for the modest diagnostic rate include: unappreciated phenotypic and genetic heterogeneity; causative variants in not yet recognized disease genes (Liu et al. 2019); high locus heterogeneity; complex molecular mechanisms underlying incomplete penetrance; technical limitations in the applied sequencing approach; and limitations in variant analysis and classification. One particular limitation, which we focus on here, is the lack of accurate analytical tools to interpret and classify variants in known or novel disease genes.
Variant classification in the research or clinical setting is a complex process that takes into consideration many different features related to the individual, the phenotype, the variant, the gene and the environment. In 2015, Richards and colleagues (Richards et al. 2015) published a guideline for variant interpretation and classification based on criteria that use typical types of variant evidence (e.g. population data, computational data, functional data and segregation data). To apply these criteria, research and clinical laboratories use many different databases with different types of evidence, but very few of these databases give laboratories access to detailed phenotypic information related to the specific variants being investigated. Knowing the phenotypic features of other individuals who carry the variant of interest is a critical step in variant classification. However, detailed phenotypic information linked to putatively causal variants is rarely available in public or even controlled-access databases, because of the difficulty of obtaining detailed phenotype data, the rarity of the candidate variants, and the challenges and uncertainty arising from potential regulatory requirements to maintain the confidentiality and privacy of individuals who carry these rare variants.
Here we describe several databases that have made variant-level information, together with phenotypes or phenotypic features, available to researchers, clinicians, health care providers and patients. We also describe their plan to connect to each other, following in the footsteps of the Matchmaker Exchange project, which connects gene-level databases (Sobreira et al. 2017).