Introduction
An estimated fifty percent of pathogenic variants result in aberrant splicing (López-Bigas, Audit, Ouzounis, Parra, & Guigó, 2005; Pan, Shai, Lee, Frey, & Blencowe, 2008). Genetic variants may affect all sequence elements required for correct splicing, including the three core elements that are recognized by the spliceosome: the canonical 5’ splice donor site (SDS), the canonical 3’ splice acceptor site (SAS) and the branchpoint. Both the SDS and SAS contain conserved dinucleotides. At the SDS the most commonly encountered dinucleotide is GT, and at the SAS it is invariably AG. Alternative dinucleotides for the SDS are known, of which GC, with a frequency of 1%, is the most common (Sheth et al., 2006). In contrast to the SDS and SAS, the branchpoint motif is less conserved (Will & Lührmann, 2011). It contains the consensus sequence yUnAy with a conserved uracil (U) and adenine (A) and less conserved pyrimidines (y) (Gao, Masuda, Matsuura, & Ohno, 2008; Rogan, Caminsky, & Mucaki, 2014). The branchpoint is located between 9 and 400 nucleotides (nt) upstream of the SAS (Abramowicz & Gos, 2018). The non-canonical sequences around the canonical splice sites are part of the splice site consensus and are therefore also conserved. The non-canonical sequences at the SAS are located from 14 to 3 nt upstream and 2 nt downstream, i.e., in the exon. For the SDS, these are the last two nt of the exon and positions 3 to 6 downstream. In addition to the three core elements, other cis-acting elements such as intronic and exonic splicing enhancers and silencers are involved in splicing (Albert et al., 2018; Glisovic, Bachorik, Yong, & Dreyfuss, 2008).
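The core elements described above can be illustrated with a minimal Python sketch. It is not part of any of the tools evaluated in this study. It assumes that an intron is supplied as a DNA string on the sense strand (so U is written as T), and the function names and the distance convention (the number of nucleotides from the candidate branchpoint adenine to the end of the intron, counting the adenine itself) are hypothetical choices made for illustration only.

```python
import re

# Branchpoint consensus yUnAy on the DNA alphabet (U written as T):
# y = pyrimidine (C or T), n = any nucleotide.
# The lookahead makes the search report overlapping candidates as well.
BRANCHPOINT_RE = re.compile(r"(?=([CT]T[ACGT]A[CT]))")


def donor_dinucleotide(intron: str) -> str:
    """Return the first two intronic nucleotides (GT in most introns, GC in about 1%)."""
    return intron[:2].upper()


def acceptor_dinucleotide(intron: str) -> str:
    """Return the last two intronic nucleotides (AG at the SAS)."""
    return intron[-2:].upper()


def branchpoint_candidates(intron: str, min_dist: int = 9, max_dist: int = 400):
    """Return (distance, motif) pairs for yUnAy matches in the allowed window.

    Assumption: distance = intron length minus the 0-based index of the
    candidate branchpoint adenine. Other conventions are possible.
    """
    seq = intron.upper()
    hits = []
    for match in BRANCHPOINT_RE.finditer(seq):
        a_index = match.start() + 3  # the A is the fourth base of the motif
        distance = len(seq) - a_index
        if min_dist <= distance <= max_dist:
            hits.append((distance, match.group(1)))
    return hits


# Toy intron (invented sequence, not taken from ABCA4 or MYBPC3).
toy_intron = "GTAAGTCCCCCTGACTTTTTTTTTTCCCCAG"
print(donor_dinucleotide(toy_intron), acceptor_dinucleotide(toy_intron))
print(branchpoint_candidates(toy_intron))
```

A plain consensus match of this kind yields many false candidates in real introns, which is one reason why the prediction tools discussed below use position-specific scores or learned models rather than exact motif matching.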
Variants in the SDS, SAS, branchpoint and enhancer and silencer motifs can alter splicing (Ohno, Takeda, & Masuda, 2018; Wimmer et al., 2007). Those affecting canonical sequences are considered to have a major effect: the relevant exon is skipped, and skipping of neighboring exons can even be observed. In the presence of alternative splice sites inside or outside the exon, partial exon skipping or exon elongation have also been observed (Fadaie et al., 2019; Fang et al., 2001; Khan, Cornelis, Pozo-Valero, et al., 2020; Labonne et al., 2016; Ramalho et al., 2003; Sangermano et al., 2018; Symoens et al., 2011). Variants in the non-canonical splicing motifs are referred to as non-canonical splice site (NCSS) variants. These may affect splicing by weakening the existing splice site (Bradley et al., 2005; Shaw et al., 2003). In contrast, deep-intronic (DI) variants can create or strengthen cryptic splice sites (Fadaie et al., 2019; Khan, Cornelis, Pozo-Valero, et al., 2020; Sangermano et al., 2018; Sobczyńska-Tomaszewska et al., 2013; Hanzhen Sun & Chasin, 2000). In general, DI variants result in pseudo-exon inclusion in the mRNA when an appropriate naturally existing SAS or SDS is present (Dhir & Buratti, 2010; Romano, Buratti, & Baralle, 2013).
To determine the impact of a putative pathogenic variant or variant of unknown significance (VUS) on splicing, in silico splice prediction tools may be employed. The available tools make use of three different types of algorithms: motif-based algorithms, machine learning algorithms and deep learning algorithms. The novel deep learning tools show promising improvements in the field of in silico splice prediction (Cheng et al., 2019; Naito, 2019; Zuallaert et al., 2019), as they do not rely on preselected features. As such, they may capture more complex information such as the distance between different sequence motifs, structural motifs, and non-linear relationships. They may also capture the joint effects of the SDS and SAS, explaining splice site interdependence (Hefferon, Broackes-Carter, Harris, & Cutting, 2002; Khan, Cornelis, Sangermano, et al., 2020; Ohno et al., 2018). Most in silico splice prediction tools are trained and evaluated on RNA-seq data, achieving high scores for accuracy and precision that often cannot be reproduced in diagnostics. For instance, the reported area under the precision-recall curve for SpliceAI is 0.98 (Jaganathan et al., 2019), whereas SpliceAI demonstrated lower performance in small clinical real time test sets (Ellingford et al., 2019; Wai et al., 2020).
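To make the precision-recall metric mentioned above concrete, the following minimal Python sketch computes average precision, one common step-wise estimate of the area under the precision-recall curve. It is an illustration only and is not claimed to be the procedure used in any of the cited evaluations. The function name, the toy labels and the toy scores are hypothetical, and tied scores are simply kept in input order, which is a simplifying assumption.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive.

    labels: iterable of 0/1 values (1 = variant functionally shown to affect splicing)
    scores: iterable of prediction scores (higher = more likely to affect splicing)
    Assumption: tied scores are not grouped; they keep their input order.
    """
    labels = list(labels)
    scores = list(scores)
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    # Rank variants from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    n_positive = sum(1 for _, label in ranked if label)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Precision among the top `rank` predictions.
            precision_sum += true_positives / rank
    return precision_sum / n_positive


# Invented toy data: four variants, two with a confirmed splice effect.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6]
print(average_precision(toy_labels, toy_scores))
```

Because this metric depends on the proportion of positives in the test set, values obtained on large RNA-seq derived sets and on small, curated clinical sets are not directly comparable.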
Currently, there is no study comparing different deep learning splice prediction tools on a clinically relevant set of variants. In the past, non-deep learning tools have been compared to each other (Jian, Boerwinkle, & Liu, 2014; Moles-Fernández et al., 2018). More recently, one deep learning tool has been compared to non-deep learning tools, in which case the deep learning tool has been shown to be more accurate in its predictions and to perform better (Ellingford et al., 2019; Jaganathan et al., 2019; Jian et al., 2014; Ohno et al., 2018). In this study, we compared the motif-based algorithm SpliceSiteFinder-like (Shapiro & Senapathy, 1987), the interaction-based algorithm MaxEntScan (Yeo & Burge, 2004), the machine-learning tools CADD (Rentzsch, Witten, Cooper, Shendure, & Kircher, 2019), GeneSplicer (Pertea, 2001), NNSPLICE (Reese, Eeckman, Kulp, & Haussler, 1997), S-CAP (Jagadeesh et al., 2019) and SPIDEX (Xiong et al., 2015) and the deep learning tools DSSP (Naito, 2019), MMSplice (Cheng et al., 2019), MTSplice (Cheng, Çelik, Kundaje, & Gagneur, 2020), SpliceAI (Jaganathan et al., 2019) and SpliceRover (Zuallaert et al., 2018). A motivation for this selection is given in the Methods section. The comparison was done on two of the largest high-confidence sets of variants that are rare, potentially clinically relevant and for which the effect on splicing has been functionally assessed using minigene or midigene assays.
The variants are located in genes coding for ATP binding cassette subfamily A member 4 (ABCA4) and Myosin binding protein C (MYBPC3). ABCA4 is a flippase that effectively transports the inactive ligand of rhodopsin and the (color) opsins to the photoreceptor cell cytoplasm. The ligand is then transported to the retinal pigment epithelium where it is converted back to the active ligand and reunited with the opsins (Molday, Rabin, & Molday, 2000; H. Sun & Nathans, 1997). Biallelic pathogenic variants in ABCA4 cause Stargardt disease (STGD1), which displays a spectrum of retinal phenotypes encompassing early-onset, classical and late-onset STGD1 depending on the severity of the two alleles (Allikmets et al., 1997; Cremers, Lee, Collin, & Allikmets, 2020; Cremers et al., 1998; Maugeri et al., 2000). MYBPC3 is involved in muscle contraction in heart muscle cells, and defects are associated with cardiomyopathy (Marston et al., 2009; Van Dijk et al., 2009).