Discussion
The increasing use of whole genome and whole exome sequencing in routine
diagnostics requires in silico splice prediction tools to select
likely pathogenic variants for further testing. To date, there are
studies evaluating single splice prediction tools, but none comparing
multiple deep learning tools. This study benchmarked selected
established and deep learning in silico splice prediction tools
based on multiple classification metrics on two of the largest sets of
variants for which the effect on splicing was functionally assessed using
mini- or midigene assays. The data showed that SpliceAI, the Alamut 3/4
consensus approach, NNSPLICE and MaxEntScan perform well on all
datasets. Additionally, this study demonstrated that the choice of the
best splice prediction tool may depend on the gene of interest and the
type of splice altering variants.
We included NCSS and DI variants in the ABCA4 gene and NCSS
variants in the MYBPC3 gene. There was no single best performing
splice prediction tool for these different datasets. This may be
explained in several ways. ABCA4 and MYBPC3 are expressed
in a tissue specific manner, with high expression in the retina and
heart muscle, respectively. The representation of splice patterns in
these tissues in the data used for training of the different deep
learning algorithms may affect their performance. As far as we can judge, none of the tools included retinal tissue in their training data.
Moreover, most splice prediction tools focus on the area around the
canonical splice sites and were not trained on DI variants, which
explains their lower performance on the DI dataset. Another reason for
differences in performance may lie in the selection criteria used to
functionally assess the ABCA4 and MYBPC3 variants. MYBPC3 variants were selected for functional validation based on
MaxEntScan scores, and ABCA4 variants were selected when they
showed a difference in splice score for at least two of the Alamut
programs (including Human Splicing Finder) and/or a delta score of at
least 2%. This may lead to a positive bias in the performance
assessment for the tools that were used to select the variants, but we
find the opposite: Alamut 3/4 performs best on the MYBPC3 data, and MaxEntScan performs relatively well on the ABCA4 dataset. Yet another source of difference in performance can be found in
the functional assays used for their evaluation; ABCA4 variants
were tested in midigenes and MYBPC3 variants in minigenes. In
most cases, minigenes and midigenes result in the same transcripts, but when the flanking exons of the minigene vector are stronger than those in the gene of interest, artefacts can arise.
The performance measures of splice prediction tools need to be carefully
chosen, in particular when there is an imbalance in the number of splice
altering and non-splice altering variants. In the ABCA4 NCSS
dataset, most variants affected splicing, while most ABCA4 DI variants had no effect on splicing. The MYBPC3 dataset contained approximately equal numbers of splice altering and non-splice altering variants.
Imbalance in the dataset influences most classification metrics: if the positive (splice altering) and negative (non-splice altering) classes are interchanged during calculation, the value of the metric changes. The
only metric not influenced by class imbalance is MCC, and we regard this
as the preferred measure in the current setting. One example demonstrating this is Spidex on the ABCA4 NCSS dataset, which consists mainly of splice altering variants: Spidex showed a specificity of 71%, similar to the other tools, but an MCC of only 0.02.
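To make this concrete, the sketch below reproduces the phenomenon with hypothetical prediction counts (illustrative numbers, not the actual study data): a tool can retain a reasonable specificity on the few negative variants while its MCC, which accounts for all four cells of the confusion matrix, stays near zero.

```python
# Illustrative sketch with made-up counts (not the study data): on an
# imbalanced dataset, specificity can look acceptable while MCC reveals
# near-chance performance, mirroring the Spidex example above.
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical dataset: 93 splice altering (1) and 7 non-splice altering (0)
y_true = [1] * 93 + [0] * 7
# Hypothetical predictions: 27 true positives, 66 false negatives,
# 5 true negatives, 2 false positives
y_pred = [1] * 27 + [0] * 66 + [0] * 5 + [1] * 2

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"specificity = {tn / (tn + fp):.2f}")  # 0.71: looks fine in isolation
print(f"sensitivity = {tp / (tp + fn):.2f}")  # 0.29: most positives missed
print(f"MCC         = {matthews_corrcoef(y_true, y_pred):.2f}")  # ~0.00
```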
Our results are consistent with previous studies that included a smaller number of splice prediction tools. Wai et al. compared Alamut, Human Splicing Finder and SpliceAI on 257 VUSs (NCSS and DI) from blood RNA samples, showing that SpliceAI outperformed the other tools with an AUC
of 0.951 (Wai et al., 2020). A second study by Ellingford et al.
compared SpliceAI, Spidex, S-CAP, CADD and MaxEntScan first in a real-time assessment of 21 variants and then in variant prioritization of nearly 3000 variants (Ellingford et al., 2019). The real-time assessment showed that SpliceAI and MaxEntScan achieved good performance. In the variant prioritization of the large cohort, only SpliceAI, Spidex and CADD were compared; here, SpliceAI showed the highest AUC (0.96). Our AUC
values for SpliceAI were 0.80 (ABCA4 NCSS), 0.95 (ABCA4 DI) and 0.72 (MYBPC3 NCSS). The AUCs of the NCSS datasets, in particular, are lower than the AUCs found in the two other studies. There
can be multiple explanations for this. First, our datasets are smaller, so the prediction for each individual variant carries more weight (illustrated in the sketch at the end of this paragraph).
Second, we used variants located in only one gene, whereas the
above-mentioned studies used variants in a variety of genes. This could
indicate that for genes with tissue-specific expression the available
splice prediction tools are not specialized enough, for reasons
explained above. Third, we evaluated tools based on functional assessment with midi- or minigene assays, which currently represent the best medium-throughput option. Still, this experimental set-up also has
limitations, since the splice assays were performed in human kidney
cells. This means that tissue specific splicing events may be missed.
For ABCA4 it is known that variants can lead to tissue specific
pseudo-exon inclusion (Albert et al., 2018). Another limitation is that
the percentage of mutant RNA for the ABCA4 variants was determined based on RT-PCR products visualized on agarose gels. RT-PCR is biased towards smaller fragments, which can lead to incorrect classification of variants. A better alternative would be to use RNA sequencing.
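As an illustration of the first point above, the following sketch uses synthetic scores (not the study's variants) to show how much a single mis-scored variant can shift the ROC-AUC on a dataset of only 30 variants; in the much larger cohorts of the cited studies, the same error would be barely visible.

```python
# Sketch with synthetic scores (not the study's variants): on a small
# dataset every variant carries substantial weight, so one badly scored
# variant noticeably shifts the ROC-AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# 20 splice altering (1) and 10 non-splice altering (0) variants,
# scored by a hypothetical tool that separates the classes reasonably well
y = np.array([1] * 20 + [0] * 10)
scores = np.concatenate([rng.uniform(0.4, 1.0, 20),   # positives score high
                         rng.uniform(0.0, 0.6, 10)])  # negatives score low

auc = roc_auc_score(y, scores)

# Mis-score a single splice altering variant as clearly benign
scores[0] = 0.01
auc_one_error = roc_auc_score(y, scores)

print(f"AUC = {auc:.3f}; after one mis-scored variant: {auc_one_error:.3f}")
# With 20 x 10 = 200 ranked pairs, one demoted positive can cost up to
# 10/200 = 0.05 in AUC; on thousands of variants the effect is negligible.
```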
A general observation made in our benchmark study is that the performance of the in silico tools on a set of clinically relevant variants differs considerably from the performance described in the original papers. SpliceAI, for instance, achieves an area under the precision
recall curve (PR-AUC) of 0.98 on RNA-seq data (Jaganathan et al., 2019).
For our datasets the PR-AUC is 0.94 for ABCA4 NCSS variants, 0.91
for ABCA4 DI variants and 0.75 for MYBPC3 NCSS variants.
The higher performance observed by the authors can be explained by the
use of an RNA-seq dataset. Using large RNA-seq datasets to evaluate the performance of a novel algorithm can artificially inflate its performance, because naturally occurring high frequency variants have a different effect on splice sites than rare variants affecting splicing.
Moreover, circularity, i.e. incomplete independence of the variants used for training and testing, may result in overestimation of the performance of the model when variants with very similar properties were already seen during training (Grimm et al., 2015). This is why it is important to use a truly independent set
of clinically relevant variants to evaluate the performance of the
splice prediction tools. Additionally, it is important to use the right
evaluation metrics to compare different algorithms. As shown for the ABCA4 variants, imbalance in the dataset influences the
classification metrics and therefore also the comparison. The precision-recall curve plots the PPV (precision) against the sensitivity (recall), and the PR-AUC is the area under this curve. Since the PPV depends on the class balance, it is difficult to compare highly imbalanced datasets based on the PR-AUC.
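A short sketch (synthetic scores, not the study's variants) illustrates this: a classifier with identical per-class score distributions yields a stable ROC-AUC across class balances, while its PR-AUC drops as the fraction of positives shrinks.

```python
# Sketch with synthetic scores (not the study's variants): the same
# score distributions give a stable ROC-AUC but a prevalence-dependent
# PR-AUC, which complicates PR-AUC comparisons across imbalanced datasets.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)

def evaluate(n_pos, n_neg):
    # Identical per-class score distributions in both scenarios
    scores = np.concatenate([rng.normal(0.7, 0.15, n_pos),   # positives
                             rng.normal(0.3, 0.15, n_neg)])  # negatives
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    # average_precision_score approximates the area under the PR curve
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)

for n_pos, n_neg in [(500, 500), (100, 900)]:  # balanced vs imbalanced
    roc, pr = evaluate(n_pos, n_neg)
    print(f"{n_pos} pos / {n_neg} neg: ROC-AUC = {roc:.2f}, PR-AUC = {pr:.2f}")
```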
To conclude, a variety of splice prediction tools is available. Choosing which tool to use is not easy, because different tools may perform better in different contexts. The best performing tools make use of different algorithms: deep learning (SpliceAI), neural networks (NNSPLICE) and maximum entropy modelling (MaxEntScan). Deep learning has the potential to improve splice prediction, but is no guarantee of success: out of the five deep learning tools, only SpliceAI performed better than the more established tools.