Results
Data integration challenges identified
We encountered several challenges while integrating data from the
different RTT databases: 1) the use of different descriptions of genetic
variants, 2) the liftover process and the limitations of automated
liftover, and 3) the findability of terms of use/re-use. These are
detailed below.
1. For the description of genetic variants, the most commonly used
nomenclature was HGVS. However, HGVS descriptions come in different,
equally correct flavours, e.g. using genomic or cDNA positions or
different (versions of) reference sequences, so they still need to be
converted from one to the other, for instance with Mutalyzer. The second
most common standard was the RS number (reference SNP identifier, from
dbSNP). RS numbers are usually linked to loci and therefore cannot be
used as unambiguous identifiers for a variant. Databases that give only
RS identifiers were therefore not included in further analysis. A similar
problem occurred with the annotation of diagnoses and/or phenotypes. As
described before (Townend et al., 2018), only a few databases link
original diagnostic information to the genetic information. Where this
information was given, different formats or definitions were used.
2. For the liftover to one common, comparable variant description
(GRCh38 (hg38), genomic position), Mutalyzer was used. It can be used
programmatically via its API (application programming interface) or via
its graphical user interface (GUI). After conversion to HGVS
nomenclature, it
was possible to process the majority of variants (90.7%-100% per
dataset) with Mutalyzer without further curation (Table 1). Nevertheless,
up to 9.3% of the variants in a dataset (the maximum, found in the
Maastricht Rett dataset; the average was 4.3%, Table 1) needed curation
because of typos, incorrect nomenclature (e.g., symbols that are not part
of the official nomenclature), or outdated/historic position descriptions
(e.g., GenBank variation description nomenclature). Mutalyzer itself
cannot deal with
insertions of a number of unknown base pairs (e.g., ins3 instead of
insATT), round brackets ( ) used to indicate uncertainty (they are lost
after translation, whereas square brackets [ ] used to indicate different
alleles or grouped alleles are handled correctly), or the asterisk * used
to indicate a stop codon (protein level) according to the official HGVS
nomenclature. These variants required manual curation, e.g. changing
round brackets to square brackets, using Mutalyzer to do the liftover,
and then changing the square brackets back to round brackets.
Furthermore, it is currently not possible to do a direct
liftover from one genomic reference sequence to another (e.g.,
NC_000023.10:g.153282026G>A to
NC_000023.11:g.154016575G>A) due to the size of the genomic
reference sequences. At the moment, this must be done in two steps via a
transcript reference sequence (NC -> NM -> NC).
3. The permission to reuse and redistribute was difficult to find for
some databases (RettBase, KMD).
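The curation issues described under points 1 and 2 can be illustrated with a minimal Python sketch. It is not part of the pipeline used in this study and does not call Mutalyzer; all function names, patterns, and flag labels are hypothetical, and the regular expressions are deliberately simplified rather than a full HGVS parser. The sketch sorts a description into RS identifier, genomic, cDNA, or protein HGVS, flags the three notations that required manual curation, and applies the bracket swap used as a workaround before and after liftover.

```python
import re

# Simplified, illustrative patterns (assumptions, not a complete HGVS grammar).
RS_ID = re.compile(r"^rs\d+$", re.IGNORECASE)
HGVS = re.compile(
    r"^(?P<ref>[A-Z]{2}_\d+(?:\.\d+)?)(?:\([^)]+\))?:"  # e.g. NM_004992.3(MECP2):
    r"(?P<kind>[gcp])\.(?P<change>.+)$"
)
UNSIZED_INS = re.compile(r"ins\d+$")  # e.g. ins3 instead of insATT


def classify(description):
    """Return 'rs', 'genomic', 'cdna', 'protein' or 'unknown'."""
    text = description.strip()
    if RS_ID.match(text):
        return "rs"
    match = HGVS.match(text)
    if not match:
        return "unknown"
    return {"g": "genomic", "c": "cdna", "p": "protein"}[match.group("kind")]


def curation_flags(description):
    """List the notations in a description that would need manual curation."""
    text = description.strip()
    match = HGVS.match(text)
    kind = match.group("kind") if match else None
    change = match.group("change") if match else text
    flags = []
    if UNSIZED_INS.search(change):
        flags.append("unsized insertion")
    if "(" in change or ")" in change:
        flags.append("round brackets")
    # In cDNA descriptions '*' marks a 3' UTR position, so only flag proteins.
    if kind == "p" and "*" in change:
        flags.append("asterisk stop")
    return flags


def _split(description):
    """Split into reference part (including ':') and change part."""
    ref, sep, change = description.strip().partition(":")
    return (ref + sep, change) if sep else ("", ref)


def mask_round_brackets(description):
    """Swap ( ) for [ ] in the change part before liftover."""
    ref, change = _split(description)
    if "[" in change or "]" in change:
        # Genuine allele brackets would be indistinguishable after unmasking.
        raise ValueError("description already contains square brackets")
    return ref + change.replace("(", "[").replace(")", "]")


def unmask_round_brackets(description):
    """Swap [ ] back to ( ) in the change part after liftover."""
    ref, change = _split(description)
    return ref + change.replace("[", "(").replace("]", ")")


if __name__ == "__main__":
    for item in ["rs28934906", "NC_000023.10:g.153282026G>A",
                 "NM_004992.3:c.1157_1158ins3", "NP_004983.1:p.Arg168*"]:
        print(item, classify(item), curation_flags(item))
```

The swap is restricted to the part after the colon so that a gene symbol in round brackets after the reference sequence is left untouched. Descriptions that already contain square brackets are rejected, because the original allele brackets could not be told apart from the masked ones when swapping back.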