Best place for Table 5
In our integrated dataset, most pathogenic mutations in MECP2 occur in the methyl-DNA binding or transcription repressor domain. This
has been found and confirmed before (Ballestar, Yusufzai, & Wolffe,
2000; Ghosh, Horowitz-Scherer, Nikitina, Gierasch, & Woodcock, 2008;
Heckman, Chahrour, & Zoghbi, 2014; Krishnaraj et al., 2017). The
functionality of the methyl-DNA binding domain is reported to be
extremely sensitive to changes (Ballestar et al., 2000). The importance
of these domains is also evident from the observation that a construct
consisting only of the methyl-DNA binding and transcription repressor domains
could preserve some basic functions of MECP2 (Tillotson et al., 2017).
There is also a clear distinction between conserved and non-conserved
regions. As expected, disease-causing mutations occur much more often in
the conserved regions. However, the data shows clearly that mutations in
all domains, both conserved and non-conserved regions, can cause RTT.
The open question remains how much influence a particular
mutation has and how much is contributed by other genetic aspects or
environmental influences. This question becomes more important
considering the discovery of variants that are benign in one individual
and RTT causing in another.
How can the same variation be benign AND cause RTT in different individuals?
The majority of the MECP2 genetic variations that are described
as RTT causing in one database entry and benign in another are
predicted to be benign (Figure 3). One possible explanation for why a variant
can be disease causing in one individual and benign in another is
the location of the gene on the X chromosome, which may result in
a subclinical phenotype in females but fully fledged RTT in male
patients. The sex of patients is usually not given in these
genotype-phenotype databases. In addition, X inactivation patterns (Weaving et
al., 2003) and the genetic background of other genes participating
in MECP2-related pathways (Pizzo et al., 2018) influence the
severity of a rare monogenic (X-linked) disease and can possibly even
protect individuals with a documented pathogenic variation from
developing the disease (Chen et al., 2016). In principle, patients could also have
an unreported second mutation that causes the effect either alone
or through epistatic interaction.
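Variants with such conflicting annotations can be identified by grouping the integrated records by variant and collecting the distinct classifications assigned to each. The following is only a minimal sketch of that idea: the record layout, the source names and the example entries are hypothetical placeholders and do not reflect the actual schema or content of the integrated dataset.

```python
from collections import defaultdict

# Hypothetical integrated records: (variant identifier, source, classification).
# All identifiers and labels are placeholders for illustration only.
records = [
    ("variant_1", "source_A", "pathogenic"),
    ("variant_1", "source_B", "benign"),
    ("variant_2", "source_A", "pathogenic"),
    ("variant_2", "source_C", "pathogenic"),
    ("variant_3", "source_B", "benign"),
]


def find_conflicts(records):
    """Return variants annotated as both pathogenic and benign."""
    labels = defaultdict(set)
    for variant, _source, classification in records:
        labels[variant].add(classification)
    # Keep only variants whose label set contains both classifications.
    return {
        variant: sorted(found)
        for variant, found in labels.items()
        if {"pathogenic", "benign"} <= found
    }


conflicts = find_conflicts(records)
print(conflicts)
```

In a real analysis the grouping key would need to be a normalised variant description, so that the same variant reported in different notations by different sources is recognised as one entity.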
For several variations, a high pathogenicity score was predicted, but
they were still documented in healthy individuals. This has been
observed before in a girl with RTT who inherited a germline
disease-causing MECP2 c.1160C>T (P387L,
NC_000023.11:g.154030668G>A) variation from a healthy (!)
father (Bhanushali, Mandsaurwala, & Das, 2016). We found exactly this
variant only in our RTT causing dataset (documented in ClinVar and
RettBase); the annotation with the benign outcome has not yet been added to
either of these databases. To unravel the different influences of MECP2 variations in the context of an individual patient, we need
to evaluate how genetic background can affect other process-related
genes. For this, genotype-phenotype databases with detailed phenotype
capture will be highly important, and data integration tools and methods
must be developed to investigate this further.
There is also a significant number of patients (54 in our integrated
dataset) whose MECP2 gene carries more than one variation. In
these cases, we presume that the disease is caused by one (pathogenic) MECP2 variation while the other variation can be benign if
it occurs alone. Other possibilities are positive or negative epistatic
effects if these variants occur on the same allele. All of these
possibilities may lead to wrong classification of variants.
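Patients carrying more than one variation can be flagged before classification so that their variants are interpreted with care. A minimal sketch of such a check is given below; the patient identifiers, variant identifiers and the two-column layout are hypothetical and only serve to illustrate the grouping step.

```python
from collections import defaultdict

# Hypothetical (patient identifier, variant identifier) observations.
observations = [
    ("patient_1", "variant_1"),
    ("patient_1", "variant_2"),
    ("patient_2", "variant_1"),
    ("patient_2", "variant_1"),  # same variant reported by two sources
    ("patient_3", "variant_3"),
]


def patients_with_multiple_variants(observations):
    """Return patients that carry more than one distinct variant."""
    variants_by_patient = defaultdict(set)
    for patient_id, variant in observations:
        # A set ignores duplicate reports of the same variant.
        variants_by_patient[patient_id].add(variant)
    return {
        patient_id: sorted(variants)
        for patient_id, variants in variants_by_patient.items()
        if len(variants) > 1
    }


multi = patients_with_multiple_variants(observations)
print(multi)
```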
Making the MECP2 genetic variant data FAIR
The FAIR guiding principles have emerged from analysing the general, and
often repeated, process that data scientists go through when preparing
data from multiple sources for data integration and analysis. The MECP2 genotype-phenotype data from this study were retrieved from
nine heterogeneous resources, which we prepared for analysis by making
them more FAIR. This was first and foremost done to enable integration
of the data for analysis as correctly as possible, which also
facilitates integration with other interoperable data such as protein
functionality data from for instance UniProt, NextProt or Phyre
databases. Another reason was to ensure reusability of the integrated
data for other research studies. Note, all the FAIRified resources allow
redistribution.
The FAIRified data were described with machine-readable metadata and
distributed at a new location, which prospectively allows other
researchers to reuse these data. Thus, as data users, we made the data
FAIR after retrieving them from their respective distributions. This was
necessary, because the way that the data were provided by the different
sources was not sufficiently uniform for machines to integrate multiple
sources. The disadvantage of leaving the implementation of FAIR
principles to data consumers is that they are more likely to make
mistakes in interpreting the meaning of the data, which may then differ
from the meaning intended by the sources. Ideally, data are made FAIR at the source to
minimize that risk and optimize transparency. This would have allowed us
to directly use the data in automated workflows that can be run
regularly to update our findings.
Next step: automation
The first step in the integration of genetic variation data across
multiple resources was a time-consuming study, which included a lot of
manual data acquisition and curation. Additionally, analysing MECP2 variations as the causative entities in RTT was the leading
example in this study, but the method should be available and applicable
to any other gene, too. The next step therefore would be to automate
this process. This can only be efficient and robust when the data
resources provide an interface by which machines can predict how to
find, access, and use their data. This interface is complementary to the
specific features that each source provides for its users. FAIR
principles provide useful guidance here: they do not prescribe any
specific implementation, but do enforce a higher level of transparency
for machines. In other words, the feasibility and quality of automation
depend on the resources being FAIR. Consequently, a workflow can be
developed and used as a tool to retrieve the known disease-causing and
benign variants, as well as variants of yet unknown significance, for any gene. The
FAIRification of the databases is a process that has already started
and will hopefully continue to support efficient data science.
Of interest for the implementation of FAIR principles are activities
towards new standards for processing variant data, such as those of the
genetic variant workstream of the Global Alliance for Genomics and
Health (GA4GH) as well as its Beacon project, which allows cross-database
search for variants. Generic FAIR services and service specifications,
such as those produced in FAIRtrain, FAIRsFAIR, and EOSC-Life (e.g.
FAIRsharing.org and the FAIR Data Point specification (Bonino da Silva
Santos et al., 2016),
https://github.com/FAIRDataTeam/FAIRDataPoint-Spec), enable the
general task of identifying and visiting interoperable (RDF) data from
restricted access databases.
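To illustrate what such a machine-oriented interface could look like from the consumer side, the sketch below converts a genomic HGVS substitution, such as the NC_000023.11:g.154030668G>A variant discussed above, into a Beacon-style allele query URL. Several points are assumptions and not part of this study: the endpoint is a placeholder, the parameter names and the 0-based start coordinate follow our reading of the Beacon v1 query interface and should be checked against the specification version of the target beacon, and the accession lookup table contains only the single entry needed here.

```python
import re
from urllib.parse import urlencode

# Placeholder endpoint for illustration; not a real service.
BEACON_BASE_URL = "https://beacon.example.org/query"

# Minimal lookup from RefSeq accession to (chromosome, assembly).
# Only the X chromosome entry for GRCh38 is included in this sketch.
REFSEQ_TO_CHROMOSOME = {"NC_000023.11": ("X", "GRCh38")}

# Matches simple genomic substitutions such as NC_000023.11:g.154030668G>A.
HGVS_SUBSTITUTION = re.compile(r"^(NC_\d+\.\d+):g\.(\d+)([ACGT])>([ACGT])$")


def beacon_query_url(hgvs_g, base_url=BEACON_BASE_URL):
    """Build a Beacon-style query URL from a genomic HGVS substitution."""
    match = HGVS_SUBSTITUTION.match(hgvs_g)
    if match is None:
        raise ValueError(f"Unsupported HGVS expression: {hgvs_g}")
    accession, position, ref, alt = match.groups()
    chromosome, assembly = REFSEQ_TO_CHROMOSOME[accession]
    params = {
        "referenceName": chromosome,
        # HGVS positions are 1-based; the start parameter is assumed 0-based.
        "start": int(position) - 1,
        "referenceBases": ref,
        "alternateBases": alt,
        "assemblyId": assembly,
    }
    return f"{base_url}?{urlencode(params)}"


url = beacon_query_url("NC_000023.11:g.154030668G>A")
print(url)
```

The function only constructs the request and does not contact any service, so it can be reused in an automated workflow that iterates over several beacons by passing a different `base_url`.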