Discussion
Increasing use of whole genome and whole exome sequencing in routine diagnostics requires in silico splice prediction tools to select likely pathogenic variants for further testing. To date, there are studies evaluating single splice prediction tools, but none comparing multiple deep learning tools. This study benchmarked a selection of established and deep learning in silico splice prediction tools, using multiple classification metrics, on two of the largest sets of variants for which the effect on splicing has been functionally assessed using mini- or midigene assays. The data showed that SpliceAI, the Alamut 3/4 consensus approach, NNSPLICE and MaxEntScan perform well on all datasets. Additionally, this study demonstrated that the choice of the best splice prediction tool may depend on the gene of interest and the type of splice altering variants.
We included NCSS and DI variants in the ABCA4 gene and NCSS variants in the MYBPC3 gene. There was no single best performing splice prediction tool for these different datasets. This may be explained in several ways. ABCA4 and MYBPC3 are expressed in a tissue specific manner, with high expression in the retina and heart muscle, respectively. The representation of the splice patterns of these tissues in the data used to train the different deep learning algorithms may affect their performance. As far as we can judge, none of the tools included retinal tissue in their training data. Moreover, most splice prediction tools focus on the region around the canonical splice sites and were not trained on DI variants, which explains their lower performance on the DI dataset. Another reason for differences in performance may lie in the selection criteria used to functionally assess the ABCA4 and MYBPC3 variants: MYBPC3 variants were selected for functional validation based on MaxEntScan scores, and ABCA4 variants were selected when they showed a difference in splice score for at least two of the Alamut programs (including Human Splicing Finder) and/or a delta score of at least 2%. This may lead to a positive bias in the performance assessment of the tools that were used to select the variants, but we found the opposite: Alamut 3/4 performs best on the MYBPC3 data, and MaxEntScan performs relatively well on the ABCA4 dataset. Yet another source of differences in performance can be found in the functional assays used for the evaluation; ABCA4 variants were tested in midigenes and MYBPC3 variants in minigenes. In most cases, minigenes and midigenes yield the same transcripts, but when the splice sites of the flanking exons in the minigene vector are stronger than those of the gene of interest, artefacts can arise.
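To make this selection step concrete, the following minimal sketch (Python) shows one plausible implementation of such a pre-selection rule for ABCA4 variants, flagging a variant when at least two programs change their score and/or the relative score change reaches 2%. The function name, score dictionaries and the reading of the "delta score of at least 2%" criterion are our assumptions for illustration, not the pipeline actually used in the original studies.

```python
# A hypothetical sketch of the kind of pre-selection rule described above
# for ABCA4 variants. Tool names, score scales and the interpretation of
# the "delta score of at least 2%" criterion are illustrative assumptions.

def preselect_variant(scores_ref: dict, scores_alt: dict,
                      min_tools: int = 2, min_delta: float = 0.02) -> bool:
    """scores_ref / scores_alt map tool name -> splice-site score for the
    reference and the variant allele, respectively."""
    # Criterion 1: at least `min_tools` programs report a changed score.
    n_changed = sum(scores_ref[tool] != scores_alt[tool] for tool in scores_ref)
    # Criterion 2: the largest relative score change reaches `min_delta` (2%).
    max_rel_delta = max(
        (abs(scores_alt[t] - scores_ref[t]) / abs(scores_ref[t])
         for t in scores_ref if scores_ref[t] != 0),
        default=0.0,
    )
    return n_changed >= min_tools or max_rel_delta >= min_delta

# Example: two of three (hypothetical) tool scores change -> select for assay.
ref = {"MaxEntScan": 8.1, "NNSPLICE": 0.93, "SpliceSiteFinder": 85.0}
alt = {"MaxEntScan": 5.4, "NNSPLICE": 0.60, "SpliceSiteFinder": 85.0}
print(preselect_variant(ref, alt))  # True
```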
The performance measures of splice prediction tools need to be carefully chosen, in particular when there is an imbalance between the numbers of splice altering and non-splice altering variants. In the ABCA4 NCSS dataset, most variants affected splicing, while most ABCA4 DI variants had no effect on splicing. The MYBPC3 dataset contained approximately equal numbers of splice altering and non-splice altering variants. Class imbalance influences most classification metrics: if the positive (splice altering) and negative (non-splice altering) classes are interchanged, the value of the metric changes. The only metric not influenced by class imbalance is the Matthews correlation coefficient (MCC), and we regard this as the preferred measure in the current setting. The metrics calculated for Spidex on the ABCA4 NCSS dataset, which consists mainly of splice altering variants, demonstrate this: Spidex showed a specificity of 71%, similar to the other tools, but an MCC of only 0.02.
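To illustrate the effect of class imbalance, the following minimal sketch (Python, using scikit-learn) recomputes specificity and MCC on a hypothetical dataset of 90 splice altering and 10 non-splice altering variants; the confusion-matrix counts are chosen to roughly mimic the Spidex example above and are not the actual study data.

```python
# Hypothetical confusion-matrix counts on an imbalanced dataset
# (90 splice altering vs 10 non-splice altering variants), chosen to
# roughly mimic the Spidex example above; not the actual study data.
from sklearn.metrics import matthews_corrcoef

y_true = [1] * 90 + [0] * 10                        # 90 positives, 10 negatives
y_pred = [1] * 30 + [0] * 60 + [0] * 7 + [1] * 3    # TP=30, FN=60, TN=7, FP=3

tn, fp = 7, 3
specificity = tn / (tn + fp)                  # 0.70: looks acceptable in isolation
mcc = matthews_corrcoef(y_true, y_pred)       # ~0.02: barely better than chance
print(f"specificity = {specificity:.2f}, MCC = {mcc:.2f}")
```

Despite a specificity comparable to the other tools, the near-zero MCC reveals that such predictions are essentially uncorrelated with the functional outcome.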
Our results are consistent with previous studies that included fewer splice prediction tools. Wai et al. compared Alamut, Human Splicing Finder and SpliceAI on 257 VUSs (NCSS and DI) from blood RNA samples, showing that SpliceAI outperformed the other tools with an AUC of 0.951 (Wai et al., 2020). A second study, by Ellingford et al., compared SpliceAI, Spidex, S-CAP, CADD and MaxEntScan, first in a real-time assessment of 21 variants and then in variant prioritization of nearly 3000 variants (Ellingford et al., 2019). The real-time assessment showed that SpliceAI and MaxEntScan achieved a good performance. In the variant prioritization of the large cohort, only SpliceAI, Spidex and CADD were compared; here, SpliceAI showed the highest AUC (0.96). Our AUC values for SpliceAI were 0.80 (ABCA4 NCSS), 0.95 (ABCA4 DI) and 0.72 (MYBPC3 NCSS). In particular, the AUCs for the NCSS datasets are lower than those found in the two other studies. There are several possible explanations for this. First, our datasets are smaller, so each individual prediction weighs more heavily. Second, we used variants located in a single gene, whereas the above-mentioned studies used variants in a variety of genes. This could indicate that the available splice prediction tools are not specialized enough for genes with tissue-specific expression, for the reasons explained above. Third, we evaluated the tools based on functional assessment with midi- or minigene assays, which currently represent the best medium-throughput option. Still, this experimental set-up also has limitations, since the splice assays were performed in human kidney cells, meaning that tissue-specific splicing events may be missed. For ABCA4, it is known that variants can lead to tissue-specific pseudo-exon inclusion (Albert et al., 2018). Another limitation is that the percentage of mutant RNA for the ABCA4 variants was determined from RT-PCR products visualized on agarose gels; RT-PCR is biased towards smaller fragments, which can lead to incorrect classification of variants. A better alternative would be to use RNA sequencing.
A general observation from our benchmark study is that the performance of the in silico tools on a set of clinically relevant variants differs considerably from the performance described in the original publications. SpliceAI, for instance, achieves an area under the precision recall curve (PR-AUC) of 0.98 on RNA-seq data (Jaganathan et al., 2019). For our datasets, the PR-AUC is 0.94 for ABCA4 NCSS variants, 0.91 for ABCA4 DI variants and 0.75 for MYBPC3 NCSS variants. The higher performance observed by the authors can be explained by the use of an RNA-seq dataset. Using large RNA datasets to evaluate the performance of a novel algorithm will inflate its apparent performance, because naturally occurring high-frequency variants have a different effect on splice sites than rare variants affecting splicing. Moreover, circularity, i.e. incomplete independence of the variants used for training and testing, may result in overestimation of the performance of the model when variants with very similar properties were already seen during training (Grimm et al., 2015). This is why it is important to use a truly independent set of clinically relevant variants to evaluate the performance of splice prediction tools. Additionally, it is important to use the right evaluation metrics to compare different algorithms. As shown for the ABCA4 variants, imbalance in the dataset influences the classification metrics and therefore also the comparison. The precision recall curve plots the PPV (precision) against sensitivity (recall), and the PR-AUC summarizes this curve. Because the PPV depends on the prevalence of splice altering variants, PR-AUC values are difficult to compare across highly imbalanced datasets.
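The following sketch (Python, using scikit-learn and synthetic scores) illustrates this point: with identical score distributions for the two classes, the ROC-AUC is essentially unchanged when the class ratio shifts from balanced to 1:9, while the PR-AUC drops markedly. All numbers are illustrative and unrelated to the study data.

```python
# Synthetic demonstration that PR-AUC, unlike ROC-AUC, depends on class
# prevalence. The score distributions are arbitrary illustrative choices.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

def simulate(n_pos, n_neg):
    # Identical score distributions for both classes in each scenario;
    # only the positive:negative ratio changes between the two runs.
    pos = rng.normal(1.0, 1.0, n_pos)   # scores for splice altering variants
    neg = rng.normal(0.0, 1.0, n_neg)   # scores for non-splice altering variants
    y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]
    scores = np.r_[pos, neg]
    return roc_auc_score(y_true, scores), average_precision_score(y_true, scores)

for n_pos, n_neg in [(500, 500), (500, 4500)]:
    roc_auc, pr_auc = simulate(n_pos, n_neg)
    print(f"pos:neg = {n_pos}:{n_neg}  ROC-AUC = {roc_auc:.2f}  PR-AUC = {pr_auc:.2f}")
# Expected behaviour: ROC-AUC stays roughly the same in both runs, while
# PR-AUC drops markedly once negatives outnumber positives 9:1.
```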
To conclude, a variety of splice prediction tools is available, and choosing which tool to use is not straightforward, because different tools may perform better in different contexts. The best performing tools are based on different algorithmic approaches: deep learning (SpliceAI), neural networks (NNSPLICE) and maximum entropy modelling of dependencies between splice-site positions (MaxEntScan). Deep learning has the potential to improve splice prediction, but is no guarantee of success: out of the five deep learning tools evaluated, only SpliceAI performed better than the more established tools.