In silico Splice Prediction Tools
In silico splice prediction tools were selected based on the
following criteria. First, the tool is freely available. Second, the
tool can be applied to a variant in either variant or sequence format.
Third, the tool either uses deep learning or is widely applied in
routine diagnostics. An overview of all in silico prediction
tools and their characteristics is provided in Table 1. For tools that
provided separate scores for the wild-type and variant sequences, delta
scores were calculated according to formula (1). For tools that
returned negative values, the absolute value of the score was used so
that only the magnitude of the splicing change was compared.
\begin{equation}
\text{Delta score}=\left|\frac{\text{WT score}-\text{variant score}}{\text{Maximum score of the tool}}\right| \qquad (1) \nonumber
\end{equation}
The commonly applied tools GeneSplicer, MaxEntScan, NNSPLICE and
SpliceSiteFinder-like were accessed from Alamut Visual Software version
2.13 (SOPHiA GENETICS, Lausanne, Switzerland). A missing value most
likely means that the tool predicted no change relative to the wild
type, so that the variant is unlikely to affect splicing. Missing values
were therefore replaced with zero. When multiple splice sites close to
the variant were scored, the score for the canonical splice site was
chosen for NCSS variants and the score for the newly created or
strengthened splice site was chosen for DI variants.
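
The delta score of formula (1), combined with the zero replacement of missing values, can be sketched in a few lines of Python. This is a minimal illustration and not the published analysis code: the function name `delta_score`, the handling of missing inputs as `None` or NaN, and the maximum score of 10.0 in the usage lines are assumptions made for the example.

```python
import math
from typing import Optional


def delta_score(wt: Optional[float], variant: Optional[float],
                max_score: float) -> float:
    """Normalised absolute difference between wild-type and variant score."""
    if max_score == 0:
        raise ValueError("max_score must be non-zero")
    # Assumption: if either score is missing, the delta score is set to 0,
    # i.e. the variant is treated as causing no change in splicing.
    for value in (wt, variant):
        if value is None or math.isnan(value):
            return 0.0
    return abs((wt - variant) / max_score)


# Hypothetical scores for a tool whose maximum score is assumed to be 10.0.
weakened_site = delta_score(8.0, 6.0, max_score=10.0)
missing_score = delta_score(None, 6.0, max_score=10.0)
```

Dividing by the maximum score of the tool puts tools with different score ranges on a comparable scale, and taking the absolute value discards the direction of the change.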
The other tools were accessed separately via a website, available
scripts or files with precomputed scores. Tools accessed via
their website were CADD v1.6 and SpliceRover. For CADD, a VCF file with
the variants was uploaded to the website, and raw scores were obtained.
SpliceRover required a FASTA sequence with a minimal length of 400 nt.
We therefore used 410 nt sequences around the variant of interest as
input. For 11 variants that returned an error message, we used a
different input length to obtain a score (ABCA4: 1000 nt for
c.769-605T>C, c.769-1778T>C and c.302+628C>T, and 750 nt for
c.769-788A>T; MYBPC3: 1000 nt for c.3815-10T>G, c.2906-12C>T,
c.1928-11G>A, c.1625-8C>G, c.1227-9C>A and c.1091-8G>A, and 750 nt for
c.906-8T>C). Python scripts were available for DSSP, SpliceAI, MMSplice
and MTSplice. DSSP required input sequences of 140 nt with the SAS
dinucleotide at positions 69 and 70 or the SDS dinucleotide at positions
71 and 72. Donor and acceptor sequences were processed with separate
Python scripts available on the DSSP GitHub
(https://github.com/DSSP-github/DSSP). SpliceAI was applied to a variant
call format (VCF) file. MMSplice v2.7 and MTSplice were also applied to
VCF files but returned multiple scores for most variants. The absolute
delta logit PSI score for the longest transcript and the exon closest
to the variant was chosen as the primary score. Both tools were run from
the same script, in which the parameter tissue_specificity determined
which tool was applied: if tissue_specificity was set to true, MTSplice
was run; otherwise, MMSplice was run on the VCF file. A file with
precomputed scores was available for both SPIDEX v1.0 and S-CAP v1.0.
The data and all analysis scripts can be found at https://github.com/cmbi/Benchmarking_splice_prediction_tools.
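
The DSSP input requirement described above (140 nt with the acceptor dinucleotide at positions 69 and 70, or the donor dinucleotide at positions 71 and 72) can be illustrated with a short slicing sketch. It is not taken from the DSSP repository: the function name `dssp_window`, the 0-based `dinuc_start` coordinate and the toy sequences are assumptions introduced for this example.

```python
WINDOW_LENGTH = 140
# 1-based position of the first dinucleotide base within the 140 nt window.
DINUC_POSITION = {"acceptor": 69, "donor": 71}


def dssp_window(sequence: str, dinuc_start: int, site: str) -> str:
    """Cut a 140 nt window that places the splice site dinucleotide correctly.

    `dinuc_start` is the 0-based index in `sequence` of the first base of
    the dinucleotide (hypothetical convention chosen for this sketch).
    """
    if site not in DINUC_POSITION:
        raise ValueError("site must be 'acceptor' or 'donor'")
    offset = DINUC_POSITION[site] - 1  # convert to a 0-based offset
    start = dinuc_start - offset
    end = start + WINDOW_LENGTH
    if start < 0 or end > len(sequence):
        raise ValueError("not enough flanking sequence for a 140 nt window")
    return sequence[start:end]


# Toy sequences with the dinucleotide starting at 0-based index 100.
acceptor_seq = "A" * 100 + "AG" + "C" * 100
donor_seq = "A" * 100 + "GT" + "C" * 100
acceptor_window = dssp_window(acceptor_seq, 100, "acceptor")
donor_window = dssp_window(donor_seq, 100, "donor")
```

Raising an error when the flanking sequence is too short, rather than padding, keeps the dinucleotide at the expected position in every window that is returned.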
MMSplice, MTSplice, SPIDEX and S-CAP could not calculate a score for
more than half of the ABCA4 DI variants, and we therefore
excluded these tools completely from the analysis of DI variants.
MMSplice and MTSplice only consider variants located within 300 nt of
the SDS or SAS. SPIDEX and S-CAP scores were retrieved from files with
precomputed values, which did not include DI variants.
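
The exclusion rule above, dropping a tool when it lacks a score for more than half of the variants in a set, can be expressed as a simple coverage check. The sketch below is illustrative only: the function name `tools_to_exclude`, the use of `None` for a missing score and the tool names and values in the usage lines are hypothetical.

```python
from typing import Dict, List, Optional


def tools_to_exclude(scores: Dict[str, List[Optional[float]]],
                     max_missing_fraction: float = 0.5) -> List[str]:
    """Return the tools with no score for more than the allowed fraction."""
    excluded = []
    for tool, values in scores.items():
        if not values:
            continue  # no variants submitted to this tool, nothing to judge
        missing = sum(value is None for value in values)
        # Strictly greater than: exactly half missing is still accepted.
        if missing / len(values) > max_missing_fraction:
            excluded.append(tool)
    return sorted(excluded)


# Hypothetical per-variant scores; None marks a variant without a score.
example_scores = {
    "tool_a": [0.1, None, None, None],
    "tool_b": [0.2, 0.3, None, 0.5],
}
excluded_tools = tools_to_exclude(example_scores)
```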