Best place for figure 1
Liftover to enable compatible genetic variant description formats
The MECP2 genetic variant descriptions from the different sources were
made compatible and therefore comparable by application of the HGVS
nomenclature and the same reference sequence. This is the first step to
make the data interoperable. For this, we used the reference sequence
for chromosome 23 (X) NC_000023.11, which is part of the current human
genome reference assembly (GRCh38). Genomic descriptions were used to
ensure that variations in and outside the gene region (exonic, intronic,
up- and downstream) were included. Re-describing all variants in HGVS
nomenclature on the same reference build (liftover) was done with the
Mutalyzer position converter webtool [https://mutalyzer.nl/]
(Wildeman, van Ophuizen, den Dunnen, & Taschner, 2008). Mutalyzer can
perform a conversion between different reference sequences and
categories (e.g. complete genomic regions NC and mRNA NM) but requires
nomenclature-compliant input. Variant descriptions that were incomplete
or incorrectly formatted, but that provided enough information to
reconstruct a valid description, were corrected manually.
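A format check of this kind can be sketched in Python. The sketch below is not Mutalyzer and does not reproduce its validation. It assumes that only simple genomic (g.) descriptions on NC_000023.11 are checked, and the function name `check_description`, the pattern, and the positions used in the usage note are illustrative assumptions, not values from the study.

```python
import re

# Assumption: a coarse pattern for simple genomic descriptions on NC_000023.11.
# It checks the overall shape only, not whether the reference bases are correct.
HGVS_G = re.compile(
    r"^NC_000023\.11:g\."
    r"(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?P<change>[ACGT]>[ACGT]|del|dup|ins[ACGT]+|delins[ACGT]+|inv)$"
)


def check_description(description):
    """Return (start, end, change) for a well-formed description, else None."""
    match = HGVS_G.match(description.strip())
    if match is None:
        return None  # candidate for manual correction
    start = int(match.group("start"))
    end = int(match.group("end")) if match.group("end") else start
    return start, end, match.group("change")
```

With this sketch, a description such as `NC_000023.11:g.1000C>T` (a made-up position) passes, whereas `g.1000C>T` without a reference sequence is flagged for manual correction.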
Creation of phenotype annotated collections
Genetic variants were assigned by their linked phenotype information to
three different categories: 1. RTT causing (verified by identification
as disease causing variant according to the requirements of the
databases), 2. benign (verified by finding them in a healthy control
subject), and 3. unknown evidence (only pathogenicity prediction scores
provided by the database). The resulting lists were used for further
analysis.
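The three-way assignment can be illustrated with a short Python sketch. The record fields `hgvs`, `disease_causing`, and `seen_in_healthy_control` are hypothetical names chosen for this example; the actual source databases encode this information in their own ways.

```python
def assign_category(record):
    """Map one variant record to one of the three categories."""
    # Hypothetical field: the source database marks the variant as disease causing.
    if record.get("disease_causing", False):
        return "RTT causing"
    # Hypothetical field: the variant was found in a healthy control subject.
    if record.get("seen_in_healthy_control", False):
        return "benign"
    # Otherwise only pathogenicity prediction scores are available.
    return "unknown evidence"


def build_collections(records):
    """Group HGVS descriptions into the three phenotype-annotated lists."""
    collections = {"RTT causing": [], "benign": [], "unknown evidence": []}
    for record in records:
        collections[assign_category(record)].append(record["hgvs"])
    return collections
```

Because each record is categorized on its own, the same variant description reported differently by two sources can end up in more than one list.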
Data FAIRification
We made the prepared genetic variant and phenotype data more Findable,
Accessible, Interoperable, and Reusable for humans and computers
following the FAIR guiding principles (Wilkinson et al., 2016). The data
was made machine-readable (in RDF format) using a semantic data model
(see below) and a general-purpose FAIRifier tool (Thompson, Burger,
Kaliyaperumal, Roos, & Bonino da Silva Santos, 2020) based on the
OpenRefine data cleaning and wrangling tool (http://openrefine.org/)
and an RDF plugin (https://github.com/stkenny/grefine-rdf-extension).
Similarly, machine-readable metadata (information about the data) was
generated using the Metadata Editor (Thompson et al., 2020). The
machine-readable metadata was made available on a FAIR Data Point
(Bonino da Silva Santos et al., 2016;
https://github.com/FAIRDataTeam/FAIRDataPoint-Spec), accessible via
http://purl.org/biosemantics-lumc/rettbase/fdp.
The FAIR Data Point metadata provides URIs that resolve to the RDF and
CSV files for each of the nine sources on Figshare
(https://doi.org/10.6084/m9.figshare.c.4769153.v1).
We applied and extended the semantic data model of a genetic variant
described in (Horst et al., 2015) to convert the prepared data to RDF.
The model is available on GitHub
(https://github.com/LUMC-BioSemantics/rett-variant)
and describes the important data elements of the datasets: 1) the
genetic variant: HGVS nomenclature, start/end position of the variation,
and genome build, and 2) the phenotype information that describes
whether a variant is thought to be RTT causing, benign or unknown.
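To make the listed data elements concrete, the following Python sketch serializes one variant as RDF in N-Triples syntax. It does not reproduce the published model: the `http://example.org/rett-sketch/` namespace, the predicate names, and the function names are placeholders invented for this illustration, and the actual conversion was done with the FAIRifier tool rather than with code like this.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
BASE = "http://example.org/rett-sketch/"  # placeholder namespace, not the real model


def literal(value):
    """Format a Python value as an N-Triples literal."""
    if isinstance(value, int):
        return '"%d"^^<%s>' % (value, XSD_INTEGER)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def variant_to_ntriples(variant_id, hgvs, start, end, build, phenotype):
    """Return one N-Triples line per data element of a variant."""
    subject = "<%svariant/%s>" % (BASE, variant_id)
    fields = [
        ("hgvs", hgvs),            # HGVS nomenclature
        ("start", start),          # start position of the variation
        ("end", end),              # end position of the variation
        ("genomeBuild", build),    # genome build
        ("phenotype", phenotype),  # RTT causing, benign, or unknown
    ]
    return [
        "%s <%s%s> %s ." % (subject, BASE, name, literal(value))
        for name, value in fields
    ]
```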
Downstream analysis
Network analysis of data distribution in RTT databases
To analyse the distribution of MECP2 variations across the RTT
databases, a network was created in which the nodes represent databases
and the node size reflects the number of available MECP2 variations. The
thickness of the line connecting two databases indicates how many MECP2
variations they share. The network visualization and analysis software
Cytoscape (Shannon et al., 2003) was used for this purpose.
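The quantities behind such a network can be computed with a few lines of Python. This is a minimal sketch under the assumption that each database is available as a set of normalized variant descriptions; the function name `build_network` and the database names in the usage note are hypothetical, and the sketch only prepares node sizes and edge weights, it does not replace Cytoscape.

```python
from itertools import combinations


def build_network(variants_by_database):
    """Compute node sizes and edge weights for a database-sharing network.

    variants_by_database maps a database name to a set of variant descriptions.
    """
    # Node size: number of variants available in each database.
    nodes = {name: len(variants) for name, variants in variants_by_database.items()}
    # Edge weight: number of variants shared by each pair of databases.
    edges = {}
    for a, b in combinations(sorted(variants_by_database), 2):
        shared = len(variants_by_database[a] & variants_by_database[b])
        if shared > 0:
            edges[(a, b)] = shared
    return nodes, edges
```

For example, `build_network({"db_A": {"v1", "v2"}, "db_B": {"v2", "v3"}})` gives two nodes of size 2 connected by one edge of weight 1.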
Variant annotation and characterization by genomic features
To characterize all the collected MECP2 variants, we developed an
automatic analysis pipeline for variant annotation. Starting from the
HGVS-corrected variants, we combined custom scripts with the HGVS
conversion tool from https://github.com/counsyl/hgvs to generate VCF
files for annotation. The automated pipeline is available at
https://github.com/mbosio85/HGVSparse.
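The idea of the HGVS-to-VCF step can be shown with a simplified Python sketch. It does not use or reproduce the counsyl/hgvs library or the HGVSparse pipeline. It assumes genomic substitutions on NC_000023.11 only and the chromosome name `X`, and all function names are illustrative. Insertions and deletions are deliberately left out, because VCF represents them with a preceding anchor base that has to be looked up in the reference sequence.

```python
import re

# Assumption: only single-nucleotide substitutions on NC_000023.11 are handled.
SUBSTITUTION = re.compile(r"^NC_000023\.11:g\.(\d+)([ACGT])>([ACGT])$")

VCF_HEADER = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]


def hgvs_to_vcf_record(description, chrom="X"):
    """Convert one genomic substitution into a tab-separated VCF data line."""
    match = SUBSTITUTION.match(description)
    if match is None:
        raise ValueError("unsupported description: %s" % description)
    pos, ref, alt = match.groups()
    # ID, QUAL, FILTER and INFO are left as missing values (".").
    return "\t".join([chrom, pos, ".", ref, alt, ".", ".", "."])


def vcf_lines(descriptions):
    """Return header plus one data line per description."""
    return VCF_HEADER + [hgvs_to_vcf_record(d) for d in descriptions]
```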
We then annotated the variants with the Ensembl Variant Effect Predictor
(VEP; McLaren et al., 2016) v94 using the GRCh38 assembly, selecting all
available features plus optional plugins to estimate variant
pathogenicity in both coding and splicing regions (i.e., PolyPhen
(Adzhubei et al., 2010), SIFT (Sim et al., 2012), MetaLR (Dong et al.,
2015), CADD (Kircher et al., 2014), FATHMM-MKL (Shihab et al., 2015)
from dbNSFP and dbscSNV scores (Liu, Wu, Li, & Boerwinkle, 2016)).
The resulting VEP-annotated data was processed with R scripts, available
at https://gitlab.bsc.es/mbosio85/rtt_summary_plot,
to compare RTT causing and benign variants as subsets, and to generate
summary statistics for these. The scripts allow comparison and
visualization of the two classes in terms of any of the available VEP
annotation features (e.g. variant frequency in the population, estimated
variant consequence, and conservation score of the genomic location).
Using these scripts, we compared the two datasets of RTT causing and
benign variants by
pathogenicity scores, impact (i.e. estimation of the consequence of each
variant on the protein sequence), variant frequency, and genomic
location. Because a few variations appear both as RTT causing and
benign, we represented this subset of variants as a third class
(“both”) in all visualizations.
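The class assignment, including the third class for variants reported as both RTT causing and benign, can be sketched as follows. The original analysis was done with R scripts; this Python version is only an illustration of the logic, and the function names are assumptions.

```python
from collections import Counter


def label_classes(rtt_causing, benign):
    """Assign each variant to 'RTT causing', 'benign', or 'both'."""
    rtt_causing, benign = set(rtt_causing), set(benign)
    labels = {}
    for variant in rtt_causing | benign:
        if variant in rtt_causing and variant in benign:
            labels[variant] = "both"  # reported in both collections
        elif variant in rtt_causing:
            labels[variant] = "RTT causing"
        else:
            labels[variant] = "benign"
    return labels


def class_counts(labels):
    """Summary statistic: number of variants per class."""
    return Counter(labels.values())
```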
Finally, we focused on exonic missense variants and used VEP information
about the amino acid change and position within the MECP2-e2 transcript
to visualize the variation distribution across protein domains and
conserved regions (as described in Lombardi, Baker, & Zoghbi, 2015).
This allowed a finer characterization of the differential distribution
of RTT causing and benign variants across MECP2 domains.
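Mapping missense variants to protein domains amounts to an interval lookup on the amino acid position. The Python sketch below illustrates this. The domain names and boundaries in `PLACEHOLDER_DOMAINS` are placeholders and are NOT the MECP2 domain coordinates, which should be taken from the cited literature; the function names are likewise assumptions.

```python
from collections import Counter

# Placeholder intervals (1-based, inclusive). These are NOT real MECP2 domains.
PLACEHOLDER_DOMAINS = [
    ("domain_1", 10, 50),
    ("domain_2", 80, 120),
]


def domain_of(position, domains=PLACEHOLDER_DOMAINS):
    """Return the name of the domain containing an amino acid position."""
    for name, start, end in domains:
        if start <= position <= end:
            return name
    return "outside annotated domains"


def count_by_domain(variants, domains=PLACEHOLDER_DOMAINS):
    """Count variants per (domain, class) pair.

    variants is an iterable of (protein_position, class_label) tuples.
    """
    return Counter((domain_of(pos, domains), label) for pos, label in variants)
```

Comparing the resulting counts between the RTT causing and benign classes shows in which regions each class is concentrated.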