Best place for Table 5
In our integrated dataset, most pathogenic mutations in MECP2 occur in the methyl-DNA binding or transcription repressor domain. This has been found and confirmed before (Ballestar, Yusufzai, & Wolffe, 2000; Ghosh, Horowitz-Scherer, Nikitina, Gierasch, & Woodcock, 2008; Heckman, Chahrour, & Zoghbi, 2014; Krishnaraj et al., 2017). The functionality of the methyl-DNA binding domain is reported to be extremely sensitive to changes (Ballestar et al., 2000). The importance of these domains is also evident from the observation that a construct consisting only of the methyl-DNA binding and transcription repressor domains could preserve some basic functions of MECP2 (Tillotson et al., 2017). There is also a clear distinction between conserved and non-conserved regions. As expected, disease-causing mutations occur much more often in the conserved regions. However, the data clearly show that mutations in all domains, in both conserved and non-conserved regions, can cause RTT. The open question remains how much influence a particular mutation has and how much is contributed by other genetic aspects or environmental influences. This question becomes more important considering the discovery of variants that can be benign in one individual and RTT causing in another.

How can the same variation be benign AND cause RTT in different individuals?

The majority of the MECP2 genetic variations that are described as RTT causing in one database entry and benign in another are predicted to be benign (Figure 3). One possible explanation for why a variant can be disease causing in one individual and benign in another is the location of the gene on the X chromosome, which may result in a subclinical phenotype in females but fully fledged RTT in male patients. The sex of patients is usually not given in these genotype-phenotype databases. In addition, X inactivation patterns (Weaving et al., 2003) and the genetic background related to other genes participating in MECP2-related pathways (Pizzo et al., 2018) influence the severity of a rare monogenic (X-linked) disease and can possibly even protect individuals with a documented pathogenic variation from developing the disease (Chen et al., 2016). In principle, patients could also have an unreported second mutation that causes the effect either alone or through epistatic interaction.
For several variations, a high pathogenicity score was predicted although they were documented in healthy individuals. A comparable case has been reported before: a girl with RTT inherited the germline disease-causing MECP2 variation c.1160C>T (P387L, NC_000023.11:g.154030668G>A) from a healthy father (Bhanushali, Mandsaurwala, & Das, 2016). We found this variant only in our RTT-causing dataset (documented in ClinVar and RettBase); the annotation with the benign outcome has not yet been added to either of these databases. To unravel the different influences of MECP2 variations in the context of an individual patient, we need to evaluate how genetic background can affect other process-related genes. For this, genotype-phenotype databases with detailed phenotype capture will be highly important, and data integration tools and methods must be developed to investigate this further.
There is also a considerable number of patients (54 in our integrated dataset) whose MECP2 gene carries more than one variation. In these cases, we presume that the disease is caused by one (pathogenic) MECP2 variation, while the other variation may be benign when it occurs alone. Other possibilities are positive or negative epistatic effects if these variants occur on the same allele. All of these possibilities may lead to incorrect classification of variants.
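The two situations discussed in this section, variants with contradictory annotations and patients carrying more than one variation, can be flagged automatically once the records are integrated. The following is a minimal sketch and not the procedure used in this study. The record layout, the source names, the patient identifiers and the variant names are all hypothetical, and it assumes that patient identifiers are comparable across sources.

```python
from collections import defaultdict

# Hypothetical integrated records (made-up illustration, not study data):
# (source, patient_id, variant, classification)
RECORDS = [
    ("source_1", "p1", "var_A", "pathogenic"),
    ("source_2", "p2", "var_A", "pathogenic"),
    ("source_3", "p3", "var_A", "benign"),
    ("source_3", "p4", "var_B", "pathogenic"),
    ("source_3", "p4", "var_C", "benign"),
]


def conflicting_variants(records):
    """Return variants annotated as both pathogenic and benign."""
    labels = defaultdict(set)
    for _source, _patient, variant, classification in records:
        labels[variant].add(classification)
    # Keep only variants whose label set contains both classes.
    return {v: c for v, c in labels.items() if {"pathogenic", "benign"} <= c}


def patients_with_multiple_variants(records):
    """Return patients for whom more than one distinct variant is recorded."""
    variants = defaultdict(set)
    for _source, patient, variant, _classification in records:
        variants[patient].add(variant)
    return {p: sorted(v) for p, v in variants.items() if len(v) > 1}


if __name__ == "__main__":
    print(conflicting_variants(RECORDS))
    print(patients_with_multiple_variants(RECORDS))
```

Such a check only identifies candidates for closer inspection. Whether a flagged variant is truly benign in one individual, or co-occurs with the actual causative variation, still has to be assessed case by case.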

Making the MECP2 genetic variant data FAIR

The FAIR guiding principles have emerged from analysing the general, and often repeated, process that data scientists go through when preparing data from multiple sources for data integration and analysis. The MECP2 genotype-phenotype data for this study were retrieved from nine heterogeneous resources, which we prepared for analysis by making them more FAIR. This was first and foremost done to enable integration of the data for analysis as correctly as possible, which also facilitates integration with other interoperable data, such as protein functionality data from, for instance, the UniProt, NextProt or Phyre databases. Another reason was to ensure reusability of the integrated data for other research studies. Note that all the FAIRified resources allow redistribution.
The FAIRified data were described with machine-readable metadata and distributed at a new location, which allows other researchers to reuse these data in the future. Thus, as data users, we made the data FAIR after retrieving them from their respective distributions. This was necessary because the way the data were provided by the different sources was not sufficiently uniform for machines to integrate multiple sources. The disadvantage of leaving the implementation of FAIR principles to data consumers is that they are more likely to make mistakes in interpreting the meaning of the data, which may differ from the meaning intended by the sources. Ideally, data are made FAIR at the source to minimize that risk and optimize transparency. This would have allowed us to use the data directly in automated workflows that can be run regularly to update our findings.
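To illustrate what making data uniform on the consumer side involves, the following minimal sketch maps records from two sources with different field names and label vocabularies onto one common record type. It is an illustration under stated assumptions, not the pipeline used in this study: the source names, field names, label vocabulary and the `harmonize` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRecord:
    """Common, source-independent representation of one annotation."""
    source: str
    variant: str
    classification: str


# Hypothetical per-source field names (assumptions for illustration only).
FIELD_MAPS = {
    "source_a": {"variant": "cDNA_change", "label": "Pathogenicity"},
    "source_b": {"variant": "hgvs", "label": "clinical_significance"},
}

# Hypothetical mapping of source-specific labels to a shared vocabulary.
LABEL_MAP = {
    "pathogenic": "pathogenic",
    "disease causing": "pathogenic",
    "benign": "benign",
}


def harmonize(source, raw):
    """Convert one raw source record (a dict) into a VariantRecord."""
    fields = FIELD_MAPS[source]
    label = raw[fields["label"]].strip().lower()
    return VariantRecord(
        source=source,
        variant=raw[fields["variant"]].strip(),
        # Labels not covered by the mapping are kept apart as "uncertain"
        # instead of being guessed by the data consumer.
        classification=LABEL_MAP.get(label, "uncertain"),
    )


if __name__ == "__main__":
    a = harmonize("source_a", {"cDNA_change": " var_A ", "Pathogenicity": "Disease causing"})
    b = harmonize("source_b", {"hgvs": "var_A", "clinical_significance": "Benign"})
    print(a, b, sep="\n")
```

Each entry in the two mapping tables is an interpretation made by the data consumer, which is exactly where misreadings of the source's intended meaning can enter.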

Next step: automation

The first step in the integration of genetic variation data across multiple resources was a time-consuming study, which included a lot of manual data acquisition and curation. Additionally, analysing MECP2 variations as the causative entities in RTT was the leading example in this study, but the method should be available and applicable for any other gene, too. The next step would therefore be to automate this process. This can only be efficient and robust when the data resources provide an interface by which machines can predict how to find, access, and use their data. This interface is complementary to the specific features that each source provides for its users. FAIR principles provide useful guidance here: they do not prescribe any specific implementation, but do enforce a higher level of transparency for machines. In other words, the feasibility and quality of automation depend on the resources being FAIR. Given FAIR resources, a workflow can be developed and used as a tool to retrieve the known disease-causing and benign variants, as well as variants of as yet unknown significance, for any gene. The FAIRification of the databases is a process that has already started and will hopefully continue to support efficient data science. Of interest for the implementation of FAIR principles are activities towards new standards for processing variant data, such as those of the genetic variant workstream of the Global Alliance for Genomics and Health and its GA4GH Beacon project, which allows cross-database search for variants. Generic FAIR services and service specifications, such as those produced in FAIRtrain, FAIRsFAIR, and EOSC-Life (e.g. FAIRsharing.org and the FAIR Data Point specification (Bonino da Silva Santos et al., 2016), https://github.com/FAIRDataTeam/FAIRDataPoint-Spec), enable the general task of identifying and visiting interoperable (RDF) data from restricted access databases.