Discussion
Added value of integrating data across different
sources
This is, to our knowledge, the first study that integrates genetic
variation data on MECP2 from multiple databases. Despite the best
efforts of individual sources to reach the largest possible coverage,
our results demonstrate that the number of usefully annotated variants
increases when databases are combined. The greatest advantage of the
integrated approach is therefore that more variants become available for
further research and diagnosis. This is especially valuable for rare
diseases, which have relatively small study populations. Mapping to a
common reference sequence makes the information from different sources
comparable and brings us closer to the “true” number of known variants.
In this study, we were able to increase the previously estimated number
of RTT-causing unique sequence variations from a few hundred to 863.
However, databases, at least the active ones, are updated regularly
with new data. Since the beginning of this study, the number of variants
in RettBase, for example, increased from 4738 (March 2018; Townend et
al., 2018) to 4757 (November 2018) and to 4806 (NM_004992.3, April
2020). Consequently, the figure of 863 known RTT-causing variants will
likely be outdated by the time this study is published. We
argue that it is unrealistic to assume that any single database will
ever be completely comprehensive, unless it automatically pulls in
updates from other databases. One contribution towards solving this
problem would be to generate the combined list of pathogenic variants
with automated workflows that find and summarize data across databases,
on demand or continuously. To make that possible, we need to
standardize how databases provide data for machine processing. The role
of the FAIR data principles in achieving this is discussed in more
detail later.
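The kind of automated merge argued for here can be sketched in a few lines of Python. This is a minimal illustration, not the workflow used in this study: the source names, the toy records, and the helpers `normalize` and `merge` are hypothetical, and real mapping to a common reference sequence requires coordinate conversion between transcripts rather than the simple string handling shown.

```python
from collections import defaultdict

REFERENCE = "NP_004983.1"  # protein reference named in the text

# Toy records (source, variant as reported, classification).
# Source names and record contents are illustrative assumptions only.
RECORDS = [
    ("source_a", "NP_004983.1:p.R168*", "pathogenic"),
    ("source_b", " p.R168* ", "pathogenic"),
    ("source_b", "NP_004983.1:p.T158M", "pathogenic"),
    ("source_c", "p.R255*", "pathogenic"),
    ("source_c", "NP_004983.1:p.R255*", "pathogenic"),
]


def normalize(description, reference=REFERENCE):
    """Return a key '<reference>:<change>' for one variant description."""
    text = description.strip()
    if ":" in text:
        accession, change = text.split(":", 1)
        accession = accession.strip()
    else:
        # Assumption: descriptions without an accession refer to REFERENCE.
        accession, change = reference, text
    if accession != reference:
        raise ValueError(f"{accession} must be remapped to {reference} first")
    return f"{reference}:{change.strip()}"


def merge(records, wanted="pathogenic"):
    """Map each unique variant key to the set of sources reporting it."""
    merged = defaultdict(set)
    for source, description, classification in records:
        if classification == wanted:
            merged[normalize(description)].add(source)
    return dict(merged)


combined = merge(RECORDS)

# Unique variants contributed by each source, for comparison with the union.
per_source = defaultdict(set)
for key, sources in combined.items():
    for source in sources:
        per_source[source].add(key)

for key in sorted(combined):
    print(key, sorted(combined[key]))
```

In this toy input no single source lists all three variants, while the merged view does, which is the effect described above on a much larger scale.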
This integrated dataset makes it possible to study the abundance and
prevalence of certain variations in a larger population than any of the
study populations published before. Several studies on relatively small
(Das, Raha, Sanghavi, Maitra, & Udani, 2013; Inui et al., 2001) or
large populations (e.g. Bienvenu et al., 2002; Percy et al., 2010) have
published their data in previous years.
Bienvenu et al. (2002) analysed 301 different MECP2 alleles in a French
population and found 69 different variations, which accounted for 64% of
RTT cases. They identified NP_004983.1:p.R168*, R255*, R270*, T158M, and
R306C (Table 5) as the most abundant variations; 59 variations were
found in only one or two patients. In the US Natural History Study (819
participants; Percy et al., 2010), the variations R106W, R133C, T158M,
R168*, R255*, R270*, R294*, and R306C were responsible for
more than 60% of RTT. The MECP2 variation content of RettBase was
analysed recently by Krishnaraj et al. (2017), who found that the
following eight hotspot variations are responsible for a total of 47% of
RTT cases (the total number of MECP2 entries at that time was 4668,
disease-causing and benign): R106W, R133C, T158M, R168*, R255*, R270*,
R294*, and R306C. Percy et al. (2007) provide information about eleven
more datasets from different countries.
Although our study resulted in a different ranking of the eight
hotspots, we could confirm them as the most abundant variations; they
account for 54.6% of all RTT-causing database entries in our dataset.
All eight hotspot mutations are C>T transitions, which in seven of eight
cases change arginine to a stop codon, cysteine, or tryptophan; such
changes have a high probability of altering the 3D structure of the
protein. The particular vulnerability of certain cytosine positions to
errors in base excision repair has been described before (Wang, Tang,
Lai, & Zhang, 2014).
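The amino acid changes named above follow directly from the standard genetic code, which the short sketch below illustrates. It enumerates every single C>T change on the coding strand for all arginine codons and for the threonine codon ACG. The function name `c_to_t_outcomes` is introduced here for illustration, and the sketch deliberately does not use the MECP2 sequence itself, so it shows which outcomes are possible in general, not which codon is present at each hotspot.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def c_to_t_outcomes(codon):
    """List (position, mutant codon, amino acid) for each single C>T change."""
    results = []
    for i, base in enumerate(codon):
        if base == "C":
            mutant = codon[:i] + "T" + codon[i + 1:]
            results.append((i + 1, mutant, CODON_TABLE[mutant]))
    return results


arginine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "R")

# Non-synonymous results of a C>T transition in any arginine codon.
arginine_changes = {
    aa
    for codon in arginine_codons
    for _, _, aa in c_to_t_outcomes(codon)
    if aa != "R"
}

for codon in arginine_codons:
    print(codon, c_to_t_outcomes(codon))
print("ACG", c_to_t_outcomes("ACG"))
```

Under the standard code, a C>T transition in an arginine codon can only yield a stop codon, cysteine, or tryptophan (or a synonymous change at the third position of CGC), and ACG (threonine) becomes ATG (methionine), consistent with the seven arginine hotspots and T158M.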