Introduction
An estimated fifty percent of pathogenic variants result in aberrant
splicing (López-Bigas, Audit, Ouzounis, Parra, & Guigó, 2005; Pan,
Shai, Lee, Frey, & Blencowe, 2008). Genetic variants may affect all
sequence elements required for correct splicing including the three core
elements that are recognized by the spliceosome: the canonical 5’ splice
donor site (SDS), the canonical 3’ splice acceptor site (SAS) and the
branchpoint. Both the SDS and SAS contain conserved dinucleotides. At
the SDS, the most commonly encountered dinucleotide is GT, and at the SAS
it is invariably AG. Alternative dinucleotides for the SDS are known, of
which GC with a frequency of 1% is the most common one (Sheth et al.,
2006). In contrast with the SDS and SAS, the branchpoint motif is less
conserved (Will & Lührmann, 2011). It contains the consensus sequence
yUnAy with a conserved uracil (U) and adenine (A) and less conserved
pyrimidines (y) (Gao, Masuda, Matsuura, & Ohno, 2008; Rogan, Caminsky,
& Mucaki, 2014). The branchpoint is located between 9 and 400
nucleotides (nt) upstream of the SAS (Abramowicz & Gos, 2018). The
non-canonical sequences around the canonical splice sites are part of
the splice site consensus and therefore also conserved. The
non-canonical sequences at the SAS are located from 14 to 3 nt upstream
and 2 nt downstream, i.e., in the exon. For the SDS, these are the last
two nt of the exon and positions 3 to 6 downstream. In addition to the
three main core elements, other cis-acting elements such as
intronic and exonic splicing enhancers and silencers are involved in
splicing (Albert et al., 2018; Glisovic, Bachorik, Yong, & Dreyfuss,
2008).
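The positional definitions above can be made concrete with a minimal Python sketch. It is an illustration only and not part of any tool evaluated in this study. The function names, the coordinate convention (no position 0; negative positions lie upstream of the exon-intron or intron-exon boundary) and the way the branchpoint distance is counted are assumptions made for this example. The branchpoint consensus yUnAy is written in its DNA form (yTnAy).

```python
import re

# DNA form of the yUnAy consensus: y = C/T, U -> T, n = any base.
# The lookahead allows overlapping matches to be reported.
BRANCHPOINT = re.compile(r"(?=([CT]T[ACGT]A[CT]))")


def branchpoint_candidates(intron, min_dist=9, max_dist=400):
    """Return (distance, motif) for each yTnAy match in an intron sequence.

    Distance is counted from the conserved adenine to the 3' end of the
    intron (the last intronic base has distance 1). This convention is an
    assumption of the sketch, not a definition taken from the text.
    """
    intron = intron.upper()
    hits = []
    for match in BRANCHPOINT.finditer(intron):
        a_index = match.start() + 3      # index of the conserved A
        dist = len(intron) - a_index
        if min_dist <= dist <= max_dist:
            hits.append((dist, match.group(1)))
    return hits


def donor_region(pos):
    """Classify a position relative to the SDS.

    Exonic positions are negative (-1 = last exon base), intronic
    positions are positive (+1 = first intron base).
    """
    if pos in (1, 2):
        return "canonical"               # GT (or GC) dinucleotide
    if pos in (-2, -1) or 3 <= pos <= 6:
        return "non-canonical"
    return "outside"


def acceptor_region(pos):
    """Classify a position relative to the SAS.

    Intronic positions are negative (-1 = last intron base), exonic
    positions are positive (+1 = first exon base).
    """
    if pos in (-2, -1):
        return "canonical"               # AG dinucleotide
    if -14 <= pos <= -3 or pos in (1, 2):
        return "non-canonical"
    return "outside"
```

For example, `donor_region(5)` and `acceptor_region(-14)` both return "non-canonical", whereas `acceptor_region(-15)` falls outside the splice site consensus as defined here.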
Variants in the SDS, SAS, branchpoint and enhancer and silencer motifs
can alter splicing (Ohno, Takeda, & Masuda, 2018; Wimmer et al., 2007).
Those affecting canonical sequences are considered to have a major
effect, where the relevant exon is skipped and even skipping of
neighboring exons can be observed. In the presence of alternative splice
sites in or outside of the exon, partial exon skipping or exon
elongation have also been observed (Fadaie et al., 2019; Fang et al.,
2001; Khan, Cornelis, Pozo-Valero, et al., 2020; Labonne et al., 2016;
Ramalho et al., 2003; Sangermano et al., 2018; Symoens et al., 2011).
Variants in the non-canonical splicing motifs are referred to as
non-canonical splice site (NCSS) variants. These may affect splicing by
weakening the existing splice site (Bradley et al., 2005; Shaw et al.,
2003). In contrast, deep-intronic (DI) variants can create or
strengthen cryptic splice sites (Fadaie et al., 2019; Khan, Cornelis,
Pozo-Valero, et al., 2020; Sangermano et al., 2018;
Sobczyńska-Tomaszewska et al., 2013; Hanzhen Sun & Chasin, 2000). In
general, DI variants will result in pseudo-exon inclusion into the mRNA,
when an appropriate naturally existing SAS or SDS is present (Dhir &
Buratti, 2010; Romano, Buratti, & Baralle, 2013).
To determine the impact of a putative pathogenic variant or variant of
unknown significance (VUS) on splicing, in silico splice
prediction tools may be employed. The available tools make use of three
different classes of algorithms: motif-based algorithms, machine learning
algorithms and deep learning algorithms. The novel deep learning tools
show promising improvements in the field of in silico splice
prediction (Cheng et al., 2019; Naito, 2019; Zuallaert et al., 2019), as
they do not rely on preselected features. As such, they may capture more
complex information such as the distance between different sequence
motifs, structural motifs, and non-linear relationships. They may also
capture the joint effects of the SDS and SAS, explaining splice site
interdependence (Hefferon, Broackes-Carter, Harris, & Cutting, 2002;
Khan, Cornelis, Sangermano, et al., 2020; Ohno et al., 2018). Most in silico splice prediction tools are trained and evaluated on
RNA-seq data, achieving high scores for accuracy and precision that
often cannot be reproduced in diagnostics. For instance, the reported area
under the precision-recall curve for SpliceAI is 0.98 (Jaganathan et
al., 2019), whereas SpliceAI demonstrated lower performance in small clinical
real-time test sets (Ellingford et al., 2019; Wai et al., 2020).
Currently, there is no study comparing different deep learning splice
prediction tools on a clinically relevant set of variants. In the past,
non-deep learning tools have been compared to each other (Jian,
Boerwinkle, & Liu, 2014; Moles-Fernández et al., 2018). More recently,
one deep learning tool has been compared to non-deep learning tools, and
in these comparisons the deep learning tool was shown to be more accurate in its
predictions (Ellingford et al., 2019; Jaganathan
et al., 2019; Jian et al., 2014; Ohno et al., 2018). In this study, we
compared the motif-based algorithm SpliceSiteFinder-like (Shapiro &
Senapathy, 1987), the interaction-based algorithm MaxEntScan (Yeo &
Burge, 2004), the machine-learning tools CADD (Rentzsch, Witten, Cooper,
Shendure, & Kircher, 2019), GeneSplicer (Pertea, 2001), NNSPLICE
(Reese, Eeckman, Kulp, & Haussler, 1997), S-CAP (Jagadeesh et al.,
2019) and SPIDEX (Xiong et al., 2015) and the deep learning tools DSSP
(Naito, 2019), MMSplice (Cheng et al., 2019), MTSplice (Cheng, Çelik,
Kundaje, & Gagneur, 2020), SpliceAI (Jaganathan et al., 2019) and
SpliceRover (Zuallaert et al., 2018). A motivation for this selection is
given in the Methods section. The comparison was done on two of the
largest, high confidence sets of variants that are rare, potentially
clinically relevant and for which the effect on splicing has been
functionally assessed using minigene or midigene assays.
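To illustrate the kind of metric involved in such comparisons, the sketch below computes the average precision, a common summary of the area under the precision-recall curve, from per-variant prediction scores and functionally determined labels. It is a simplified example under stated assumptions, not the evaluation procedure of this study: the function name is hypothetical, a higher score is assumed to indicate a stronger predicted splice effect, and tied scores are ranked in input order.

```python
def average_precision(scores, labels):
    """Average precision for a ranked list of variants.

    scores: predicted splice-effect scores (higher = more likely to alter
            splicing; this direction is an assumption of the sketch).
    labels: 1 if the assay showed a splice effect, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            total += true_pos / rank     # precision at this recall step
    return total / n_pos
```

A tool that ranks every splice-altering variant above every neutral variant obtains a value of 1.0. For small test sets, a single misranked variant changes the value considerably, which is one reason why scores obtained on large training-like sets may not carry over to small clinical sets.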
The variants are located in genes coding for ATP binding cassette
subfamily A member 4 (ABCA4) and myosin binding protein C
(MYBPC3). ABCA4 is a flippase that effectively transports the
inactive ligand of rhodopsin and the (color) opsins to the photoreceptor
cell cytoplasm. The ligand is then transported to the retinal pigment
epithelium where it is converted back to the active ligand and re-united
with the opsins (Molday, Rabin, & Molday, 2000; H. Sun & Nathans,
1997). Biallelic pathogenic variants in ABCA4 cause Stargardt
disease (STGD1), which displays a spectrum of retinal phenotypes
encompassing early-onset, classical and late-onset STGD1 depending on
the severity of the two alleles (Allikmets et al., 1997; Cremers, Lee,
Collin, & Allikmets, 2020; Cremers et al., 1998; Maugeri et al., 2000). MYBPC3 is involved in muscle contraction in heart muscle cells,
and defects are associated with cardiomyopathy (Marston et al., 2009;
Van Dijk et al., 2009).