In silico Splice Prediction Tools
In silico splice prediction tools were selected based on three criteria. First, the tool is freely available. Second, the tool accepts input in either variant or sequence format. Third, the tool either uses deep learning or is widely applied in routine diagnostics. An overview of all in silico prediction tools and their characteristics is provided in Table 1. For tools that provided separate scores for the wild-type and variant sequences, delta scores were calculated according to formula (1). For tools that returned negative values, the absolute value of the score was used so that only the magnitude of the predicted splice change was compared.
\begin{equation}
\text{Delta score} = \left| \frac{\text{WT score} - \text{variant score}}{\text{maximum score of the tool}} \right| \tag{1}
\end{equation}
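As a minimal sketch, formula (1) can be written as a small Python function. The function name, the guard on the maximum score, and the example values are illustrative assumptions and are not taken from the analysis scripts.

```python
def delta_score(wt_score: float, variant_score: float, max_score: float) -> float:
    """Normalized absolute score change between wild type and variant, as in formula (1)."""
    if max_score <= 0:
        # Assumption: the maximum score of a tool is a positive constant.
        raise ValueError("max_score must be positive")
    return abs((wt_score - variant_score) / max_score)


# Hypothetical example: a tool with a maximum score of 12 that scores a
# splice site 8.0 in the wild-type and 2.0 in the variant sequence.
example = delta_score(8.0, 2.0, 12.0)
```

Dividing by the maximum score of each tool puts the delta scores of different tools on a common 0 to 1 scale, and taking the absolute value keeps only the magnitude of the change.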
The commonly applied tools GeneSplicer, MaxEntScan, NNSPLICE and SpliceSiteFinder-like were accessed through Alamut Visual Software version 2.13 (SOPHiA GENETICS, Lausanne, Switzerland). A missing value most likely indicates that the tool detected no change compared to the wild type, and such variants are therefore unlikely to affect splicing. Missing values were accordingly replaced with zero. When multiple splice sites close to the variant were scored, the score for the canonical splice site was chosen for NCSS variants, and the score for the newly created or strengthened splice site was chosen for DI variants.
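The handling of missing values and the choice among several scored splice sites can be sketched as follows. This is an illustration only: the record layout (`position`, `canonical`, `wt`, `variant`) is hypothetical, and identifying the created or strengthened site of a DI variant as the non-canonical site with the largest score gain is an assumption, not a rule stated in the text.

```python
from typing import Optional


def fill_missing(score: Optional[float]) -> float:
    """Replace a missing score with zero."""
    return 0.0 if score is None else score


def select_site(sites: list, variant_class: str) -> dict:
    """Pick the splice site whose scores represent the variant.

    sites: dicts with keys 'position', 'canonical' (bool), 'wt', 'variant'.
    variant_class: 'NCSS' or 'DI'.
    """
    cleaned = [
        {**s, "wt": fill_missing(s["wt"]), "variant": fill_missing(s["variant"])}
        for s in sites
    ]
    if variant_class == "NCSS":
        candidates = [s for s in cleaned if s["canonical"]]
        if not candidates:
            raise ValueError("no canonical splice site was scored")
        return candidates[0]
    if variant_class == "DI":
        candidates = [s for s in cleaned if not s["canonical"]]
        if not candidates:
            raise ValueError("no non-canonical splice site was scored")
        # Assumption: the created/strengthened site shows the largest score gain.
        return max(candidates, key=lambda s: s["variant"] - s["wt"])
    raise ValueError(f"unknown variant class: {variant_class}")


# Hypothetical scores for three sites near one variant.
sites = [
    {"position": 100, "canonical": True, "wt": 9.1, "variant": 9.1},
    {"position": 140, "canonical": False, "wt": None, "variant": 6.4},
    {"position": 155, "canonical": False, "wt": 3.0, "variant": 3.5},
]
di_site = select_site(sites, "DI")
ncss_site = select_site(sites, "NCSS")
```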
The other tools were accessed separately through a website, available scripts or files with precomputed scores. Tools accessed via their website were CADD v1.6 and SpliceRover. For CADD, a VCF file with the variants was uploaded to the website, and raw scores were obtained. SpliceRover required a FASTA sequence with a minimal length of 400 nt. We therefore used 410-nt sequences surrounding the variant of interest as input. For 11 variants that produced an error message, we used a different input length to obtain a score (ABCA4: 1000 nt for c.769-605T>C, c.769-1778T>C and c.302+628C>T, and 750 nt for c.769-788A>T; MYBPC3: 1000 nt for c.3815-10T>G, c.2906-12C>T, c.1928-11G>A, c.1625-8C>G, c.1227-9C>A and c.1091-8G>A, and 750 nt for c.906-8T>C). Python scripts were available for DSSP, SpliceAI, MMSplice and MTSplice. DSSP required input sequences of 140 nt with the SAS dinucleotide at positions 69 and 70 or the SDS dinucleotide at positions 71 and 72. Donor and acceptor sequences were processed with separate Python scripts available on the DSSP GitHub (https://github.com/DSSP-github/DSSP). SpliceAI was applied to a variant call format (VCF) file. MMSplice v2.7 and MTSplice were also applied to VCF files but returned multiple scores for most variants. The absolute delta logit PSI score for the longest transcript and the exon closest to the variant was chosen as the primary score. Both tools were included in the same script, and the parameter tissue_specificity determined which tool was applied: if tissue_specificity was set to true, MTSplice was run on the VCF file; otherwise, MMSplice was run. A file with precomputed scores was available for both SPIDEX v1.0 and S-CAP v1.0. The data and all analysis scripts can be found at https://github.com/cmbi/Benchmarking_splice_prediction_tools.
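The DSSP input layout described above can be illustrated with a short Python sketch that cuts a 140-nt window around a splice-site dinucleotide. The function and constant names are hypothetical, the positions are assumed to be 1-based, and this is not the code used in the DSSP repository or in the analysis scripts.

```python
WINDOW_LENGTH = 140
# Assumed 1-based position of the first dinucleotide base within the window:
# acceptor (SAS) at positions 69-70, donor (SDS) at positions 71-72.
DINUCLEOTIDE_START = {"acceptor": 69, "donor": 71}


def dssp_window(sequence: str, dinuc_index: int, site_type: str) -> str:
    """Return the 140-nt window for one splice site.

    sequence: genomic or transcript sequence containing the site.
    dinuc_index: 0-based index in `sequence` of the first dinucleotide base.
    site_type: 'acceptor' or 'donor'.
    """
    offset = DINUCLEOTIDE_START[site_type] - 1  # convert to 0-based offset
    start = dinuc_index - offset
    end = start + WINDOW_LENGTH
    if start < 0 or end > len(sequence):
        raise ValueError("not enough flanking sequence for a 140-nt window")
    return sequence[start:end]


# Hypothetical toy sequences with an AG acceptor and a GT donor at index 100.
acceptor_window = dssp_window("C" * 100 + "AG" + "T" * 100, 100, "acceptor")
donor_window = dssp_window("C" * 100 + "GT" + "A" * 100, 100, "donor")
```

In this sketch the window is rejected rather than padded when the flanking sequence is too short, which is a design choice of the example and not a documented behaviour of DSSP.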
MMSplice, MTSplice, SPIDEX and S-CAP could not calculate a score for more than half of the ABCA4 DI variants, and we therefore excluded these tools from the analysis of DI variants. MMSplice and MTSplice only consider variants located within 300 nt of the SDS or SAS. SPIDEX and S-CAP scores were retrieved from files with precomputed values, which did not include DI variants.
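The exclusion rule can be expressed as a simple coverage check. The following Python sketch assumes that missing scores are stored as `None`. The function name, the tool names and the scores are hypothetical placeholders, not data from this study.

```python
def tools_to_exclude(scores_by_tool: dict, max_missing_fraction: float = 0.5) -> list:
    """Return the tools that lack a score for more than the given fraction of variants."""
    excluded = []
    for tool, scores in scores_by_tool.items():
        if not scores:
            continue  # no variants submitted to this tool; nothing to judge
        missing = sum(score is None for score in scores)
        if missing / len(scores) > max_missing_fraction:
            excluded.append(tool)
    return excluded


# Hypothetical scores for four variants per tool (None = no score returned).
scores_by_tool = {
    "ToolA": [0.1, None, None, None],
    "ToolB": [0.2, 0.3, None, 0.9],
}
excluded_tools = tools_to_exclude(scores_by_tool)
```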