Bioinformatic prediction of pseudoexon activation
Currently, there is no bioinformatic tool dedicated to predicting pseudoexon-activating variants together with the corresponding size and/or sequence of the inserted cryptic exon. The current prediction strategy is to determine whether a deep intronic variant leads to a de novo splice site gain, and then separately check for a nearby pre-existing cryptic splice site of opposite polarity that could define the other boundary of the new exon (Caminsky et al., 2016; Lee et al., 2017).
In the variant prioritization method of Caminsky et al. (2016), an Information Theory model was used to measure changes in splicing-relevant protein binding sites and predict whether a variant would lead to a gain or loss of a splicing motif. A total of 623 variants in hereditary breast and ovarian cancer genes were predicted to create or strengthen an intronic cryptic splice site. Of these, only 17 variants were prioritized as likely to create a pseudoexon, based on their location within 250 nucleotides of another existing intronic site of opposite polarity and the existence of an hnRNPA1 site within five nucleotides of the acceptor of the predicted pseudoexon (Caminsky et al., 2016). These prioritized variants have yet to undergo splicing analysis, so the performance of the Information Theory model cannot yet be assessed.
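The pairing criteria described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the implementation of Caminsky et al. (2016): the `SpliceSite` class, the function name, and the coordinate convention (positions increasing in the direction of transcription, so that the acceptor of a pseudoexon lies upstream of its donor) are assumptions introduced here, and the scoring of the sites themselves is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpliceSite:
    """Hypothetical container for a predicted intronic splice site."""
    position: int  # coordinate increasing in the direction of transcription (assumed)
    kind: str      # "donor" or "acceptor"


def candidate_pseudoexons(new_site, existing_sites, hnrnpa1_positions,
                          max_span=250, hnrnpa1_window=5):
    """Return (acceptor, donor) coordinate pairs that satisfy the criteria.

    A pair is kept when the two sites have opposite polarity, the acceptor
    lies upstream of the donor by at most `max_span` nucleotides, and an
    hnRNPA1 site lies within `hnrnpa1_window` nucleotides of the acceptor.
    """
    pairs = []
    for site in existing_sites:
        if site.kind == new_site.kind:
            continue  # same polarity cannot define both exon boundaries
        if new_site.kind == "acceptor":
            acceptor, donor = new_site, site
        else:
            acceptor, donor = site, new_site
        span = donor.position - acceptor.position
        if not 0 < span <= max_span:
            continue  # wrong orientation or too far apart
        if any(abs(p - acceptor.position) <= hnrnpa1_window
               for p in hnrnpa1_positions):
            pairs.append((acceptor.position, donor.position))
    return pairs
```

Under these assumptions, a variant-created donor is paired only with pre-existing acceptors located upstream of it, and a variant-created acceptor only with pre-existing donors located downstream.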
Another workflow incorporates CryptSplice, a tool that extends the splice site definition of Burge et al. (1999) to capture more sequence component information (Lee et al., 2017). The donor sequences extend from seven nucleotides upstream of GT (−7) to six nucleotides downstream of GT (+6), and acceptor sequences extend from 68 nucleotides upstream of AG (−68) to 20 nucleotides downstream of AG (+20). This extended definition was previously reported to improve splice site prediction by combining the feature information of splicing signals and SREs around splice sites (J. L. Li, Wang, Wang, Bai, & Yuan, 2012). In an analysis of CFTR variants in cystic fibrosis patients with a partly explained genetic cause for their recessively inherited disease, intronic variants underwent prioritization to detect variants that may lead to pseudoexon activation (Lee et al., 2017). Of 41 candidate intronic variants predicted by CryptSplice to create either donor or acceptor sequences, only three, all creating donor sequences, were additionally predicted to activate pseudoexons by manual evaluation of the surrounding sequence for a splice site of opposite polarity (Lee et al., 2017). Two of these three variants were shown to lead to pseudoexon insertion resulting in transcript loss due to nonsense-mediated decay; the third, with a weakly predicted upstream acceptor, did not lead to aberrant splicing. In the same study, CryptSplice analysis of 4,685 unique DKC1 variants present in six individuals identified five candidate donor sequences and 12 candidate acceptor sequences (Lee et al., 2017). Only one of the five candidate donors was predicted to activate a pseudoexon. Although mRNA analysis provided evidence for pseudoexonization, the donor activated by this DKC1 variant did not pair with the CryptSplice-predicted acceptor, but rather with another acceptor 14 nucleotides upstream (Lee et al., 2017).
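To make the extended splice site definition concrete, the following minimal sketch extracts the sequence windows around every GT and AG dinucleotide in a sequence. It is not CryptSplice code: the function names are hypothetical, and it assumes that the upstream and downstream offsets are counted from the edges of the dinucleotide, giving 15-nucleotide donor windows (7 + GT + 6) and 90-nucleotide acceptor windows (68 + AG + 20). The classification of the windows is not shown.

```python
def extended_windows(seq, dinucleotide, upstream, downstream):
    """Return (index, window) for each dinucleotide with a complete window.

    `index` is the 0-based position of the first base of the dinucleotide;
    the window spans `upstream` bases before it and `downstream` bases
    after it (assumed convention).
    """
    seq = seq.upper()
    windows = []
    start = seq.find(dinucleotide)
    while start != -1:
        lo = start - upstream
        hi = start + 2 + downstream
        if lo >= 0 and hi <= len(seq):  # skip sites too close to either end
            windows.append((start, seq[lo:hi]))
        start = seq.find(dinucleotide, start + 1)
    return windows


def donor_windows(seq):
    """Candidate donor windows: -7 to +6 around GT."""
    return extended_windows(seq, "GT", upstream=7, downstream=6)


def acceptor_windows(seq):
    """Candidate acceptor windows: -68 to +20 around AG."""
    return extended_windows(seq, "AG", upstream=68, downstream=20)
```

Comparing the windows obtained from the reference and variant sequences would show which candidate sites a variant creates or alters, which is the set that a classifier would then need to score.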
The Information Theory and CryptSplice prioritization methods for pseudoexon-activating variants did not comprehensively take into account the role of SREs, which can influence the expression of pseudoexons. To illustrate, the Information Theory model predicted that MLH1 LRG_216t1:c.1559-1732A>T creates a new acceptor and activates a 239-bp pseudoexon due to the presence of a downstream pre-existing cryptic donor (Caminsky et al., 2016). However, our analysis of the pseudoexon sequence using HSF revealed a cluster of putative ESS octamers (X. H.-F. Zhang & Chasin, 2004) with high relative activity, located within 30 nucleotides upstream of the cryptic donor, that potentially inactivates this donor (Supplementary Figure 2). Therefore, a prediction model that incorporates both splice site motifs and the distribution of SREs within candidate pseudoexons and their flanking regions is likely to improve the accuracy of pseudoexon activation predictions.
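As a rough illustration of the kind of SRE check described above, the sketch below reports where silencer octamers occur in the region immediately upstream of a candidate donor. It does not reproduce the HSF analysis: the function name is hypothetical, the octamer set must be supplied by the user (for example, from a published ESS list), and relative activity scores are not considered. A hit is reported when the whole octamer lies within the upstream window.

```python
def motifs_upstream_of_donor(seq, donor_index, motifs, window=30, k=8):
    """Return 0-based start positions of motif hits upstream of a donor.

    `donor_index` is the position of the G of the donor GT; the scanned
    region is the `window` nucleotides immediately before it. `motifs` is
    a user-supplied set of k-mers (octamers by default).
    """
    region_start = max(0, donor_index - window)
    region = seq[region_start:donor_index].upper()
    hits = []
    for i in range(len(region) - k + 1):
        if region[i:i + k] in motifs:
            hits.append(region_start + i)
    return hits
```

A model that combines splice site strength with SRE distribution could use the number or density of such hits as one input feature, although how the features should be weighted is not specified here.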