DISCUSSION
The goal of this study was to develop a more accurate method to characterize the DNA species contained in rAAV batches. Several technological platforms exist for manufacturing rAAV vectors for gene therapy, using either mammalian or insect cells (Penaud-Budloo et al., 2018b). Both upstream and downstream processes may affect the purity of the final product, including the amount and type of residual DNA. To assess the risk of co-transferring undesired DNA sequences to the patient with AAV vectors, exhaustive identification and quantification of these DNA species is essential and can be achieved with methods based on high-throughput sequencing technologies. We have previously described a protocol based on Illumina sequencing, called Single-Stranded DNA Virus Sequencing (SSV-Seq), to control rAAV purity in terms of DNA contaminants (Lecomte et al., 2019, 2015). This protocol includes a PCR step during library preparation, which can be affected by biases inherent to the PCR technique, such as those caused by AT-rich (Oyola et al., 2012) and GC-rich regions (Aird et al., 2011). Several solutions have been proposed to reduce these artifacts, either by optimizing PCR conditions (Quail et al., 2011) or by developing alternative methods for library amplification (van Dijk et al., 2014). Here, we took a more radical approach and improved our SSV-Seq protocol by switching to a PCR-free library preparation kit. Our study clearly shows a correlation between high GC and homopolymer content and poor sequencing coverage. To avoid misinterpreting the data, for example by mistaking a coverage drop for a large deletion or for a biological under-representation of a particular sequence in the rAAV particle population, it is important to screen the rAAV genome for GC-rich regions and homopolymers before any sequencing-based analysis.
For this purpose, the MISA software and the new bioinformatics tool NTContent developed here (available at https://github.com/emlec/NTContent) can be very useful as prediction tools. To monitor any potential bias in SSV-Seq, an internal normalizer is also processed in parallel with the rAAV samples. Composed of a mix of the plasmid vector and other potential residual DNA species (producer cell DNA, helper plasmids), this control makes it possible to visualize and compare the sequencing coverage obtained from the rAAV sample and from the plasmid vector, as shown in Figure 2.
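As an illustration of the kind of pre-sequencing screen recommended here, the following minimal Python sketch flags GC-rich windows and mononucleotide stretches in a vector sequence. It is not the NTContent or MISA implementation: the function names, the window size, the step, and the thresholds are assumptions chosen for the example and should be adapted to the genome under study.

```python
import re


def gc_windows(seq, window=50, step=10, threshold=0.75):
    """Return (start, end, gc_fraction) for sliding windows whose GC
    fraction is at least `threshold`. Coordinates are 0-based, end-exclusive.
    Window, step and threshold defaults are illustrative assumptions."""
    seq = seq.upper()
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        if not chunk:
            continue  # empty input sequence
        gc = (chunk.count("G") + chunk.count("C")) / len(chunk)
        if gc >= threshold:
            hits.append((start, start + len(chunk), gc))
    return hits


def homopolymers(seq, min_len=6):
    """Return (start, end, base) for each run of a single nucleotide that
    is at least `min_len` bases long (min_len is an illustrative assumption)."""
    # One base followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]
```

Regions reported by either function are those where a coverage drop is most likely to be technical rather than biological, so they are natural candidates to inspect in the internal normalizer first.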
A coverage drop in the CAG promoter, which has a local GC percentage higher than 90%, was previously observed using SSV-Seq (Kondratov et al., 2017). The same observation has been reported by another group using Fast-Seq, a technique based on Tn5 tagmentation (Maynard et al., 2019). Kondratov et al. showed that a PCR-free protocol could outperform a PCR-enriched method (8 amplification cycles) in terms of sequencing coverage along GC-rich regions of the AAV vector genome (Kondratov et al., 2017). The authors used the Accel-NGS 2S PCR-Free DNA Library Kit from Swift Biosciences to prepare libraries. An initial amount of 4 × 10^11 vg of a rAAV5-CAG-GFP vector and an input of 220 ng of dsDNA were used in their protocol. The Accel-NGS workflow includes two DNA repair steps and two adapter ligation steps and requires specific adapters that are not compatible with low-throughput applications. As in our data, the authors were able to reduce the sequencing bias due to high GC content by omitting PCR amplification, although coverage drops were still detected in the CAG promoter. Therefore, biases due to the sequencing technology itself need to be further assessed. Indeed, all sequencing technologies exhibit error-rate biases in low-GC (≤10%) and high-GC (≥75%) regions, and at long homopolymers (Ross et al., 2013). G-rich sequences can also give rise to sequence-specific errors (SSE) with Illumina technology (Dohm et al., 2008) and may cause false SNV calls (Shin and Park, 2016). However, in our study, no SNV was observed in the CAG promoter.
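Comparing the rAAV sample with the internal normalizer is what allows a technical coverage drop, such as the one in the CAG promoter, to be separated from a genuine under-representation of a sequence. The sketch below shows one hypothetical way to make that comparison on per-position depth values. The function name, the labels, and the 0.5 cutoff relative to the mean depth are assumptions for illustration, not the procedure used in SSV-Seq.

```python
def classify_drops(sample_depth, control_depth, low=0.5):
    """Label positions where depth falls below `low` times the mean depth.

    sample_depth  : per-position depth for the rAAV sample
    control_depth : per-position depth for the internal normalizer
    Returns {position: label}, where the label is
      "shared"      -> low in both tracks (suggests a technical bias)
      "sample_only" -> low in the sample only (candidate genuine event)
    The 0.5 cutoff and both labels are illustrative assumptions.
    """
    if len(sample_depth) != len(control_depth):
        raise ValueError("depth tracks must cover the same positions")
    if not sample_depth:
        return {}
    sample_cut = low * sum(sample_depth) / len(sample_depth)
    control_cut = low * sum(control_depth) / len(control_depth)
    labels = {}
    for pos, (s, c) in enumerate(zip(sample_depth, control_depth)):
        if s < sample_cut:
            labels[pos] = "shared" if c < control_cut else "sample_only"
    return labels
```

A drop labeled "shared" in a region that is also GC-rich or homopolymer-rich is consistent with the sequencing biases discussed above, whereas a "sample_only" drop would deserve closer inspection.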
Using our SSV-Seq 2.0 PCR-free method, we still detected minor coverage drops, for example within the eGFP transgene (Figure 4). Being independent of PCR amplification bias, these drops could be related to the sequencing technology itself. Indeed, MiSeq sequencing, which uses the same four-channel sequencing chemistry as HiSeq, has been shown to disfavor the “CCNGCC” motif in the GFP coding sequence (Van den Hoecke et al., 2016). On the other hand, technologies such as single molecule real-time (SMRT) sequencing (Pacific Biosciences) are described as giving less biased coverage across GC-rich regions (Ross et al., 2013). Because they offer long read lengths, single-molecule sequencing technologies also make it possible to study rAAV vector genome integrity (Radukic et al., 2019; Tai et al., 2018; Xie et al., 2017). Interestingly, rAAV genome truncations have been detected at hairpin-like structures using the AAV-GPseq SMRT-based assay, creating self-complementary viral genomes (Xie et al., 2017). Improving rAAV genome sequencing, particularly through ITRs and ITR-plasmid junctions, would also be of great interest to the field. Recently, an HTS-based assay was developed to identify off-target nuclease activity after AAV-mediated genome editing in vivo (Breton et al., 2020). In this protocol, named ITR-Seq, the PCR and adapters were optimized to specifically amplify ITR-genomic DNA junctions. Combining multiple sequencing technologies could provide complementary information and reduce the risks associated with the technical errors inherent to each platform. For instance, SSV-Seq, which is based on Illumina technology and gives a high sequencing depth, is likely the preferred method to identify and characterize residual DNA in rAAV stocks and to perform SNV analysis, while AAV-GPseq, based on SMRT sequencing, is better suited to assessing AAV vector genome integrity (truncated rAAV genomes).
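Sequence motifs reported as disfavored by a given platform, such as the “CCNGCC” motif mentioned above, can be located in a transgene before sequencing so that residual coverage drops can be checked against them. The following Python sketch is a hypothetical helper, not part of SSV-Seq or NTContent. Scanning both strands and reporting overlapping hits are assumptions made for the example.

```python
import re

# Minimal IUPAC table: only the codes needed for this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}


def revcomp(seq):
    """Reverse complement of a sequence made of A, C, G, T and N."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_motif(seq, motif="CCNGCC"):
    """Return sorted (start, strand) hits for `motif` in `seq`.

    Positions are 0-based on the given (plus) strand. A "-" hit means the
    motif occurs on the reverse strand at that plus-strand position.
    Overlapping hits are reported thanks to the lookahead."""
    seq = seq.upper()
    hits = set()
    for strand, m in (("+", motif.upper()), ("-", revcomp(motif))):
        regex = re.compile("(?=(" + "".join(IUPAC[b] for b in m) + "))")
        for match in regex.finditer(seq):
            hits.add((match.start(), strand))
    return sorted(hits)
```

Called on a coding sequence, `find_motif` returns the start coordinates that can then be overlaid on the coverage plot.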
The novel SSV-Seq 2.0 protocol circumvents PCR biases and improves the HTS analysis of rAAV genomes harboring regions with a high GC content and long mononucleotide stretches, such as those often found in promoters.