DISCUSSION
The goal of this study was to develop a more accurate method to
characterize the DNA species contained in rAAV batches. Several
technological platforms exist for manufacturing rAAV vectors for
gene therapy, using either mammalian or insect cells
(Penaud-Budloo et al., 2018b). It is known that both upstream and
downstream processes may impact the purity of the final product,
including the amount and type of residual DNA. To assess the risk of
co-transferring undesired DNA sequences to the patient along with AAV
vectors, exhaustive identification and quantification of these DNA
species is of utmost importance and can be achieved with methods based
on high-throughput sequencing technologies. We have previously described a
protocol based on Illumina sequencing, called Single-Stranded DNA Virus
Sequencing (SSV-Seq), to control rAAV purity in terms of DNA contaminants
(Lecomte et al., 2019, 2015). This protocol includes a PCR step during
library preparation, which can be affected by biases inherent to the
PCR technique, such as those caused by AT-rich (Oyola et al., 2012)
and GC-rich regions (Aird et al., 2011). Several solutions
have been proposed to reduce these artifacts, either by optimizing PCR
conditions (Quail et al., 2011) or by developing alternative methods for
library amplification (van Dijk et al., 2014). Here, we took a more
drastic approach and improved our SSV-Seq protocol by switching to a
PCR-free library preparation kit. Our study clearly shows a correlation
between high GC and homopolymer content and poor sequencing
coverage. To avoid misinterpreting such a coverage drop, for example as a
large deletion or a biological under-representation of a particular
sequence in the rAAV particle population, it is of great importance to
screen the rAAV genome for GC-rich regions and homopolymers prior to
sequencing-based analysis. For this purpose, the software MISA and the
new bioinformatics tool NTContent developed here (available at
https://github.com/emlec/NTContent) can be extremely useful
as prediction tools. In order to monitor any potential bias in SSV-Seq,
an internal normalizer is also processed in parallel to the rAAV
samples. Composed of a mix of the plasmid vector and other potential
residual DNA species (producer cell DNA, helper plasmids), this control
makes it possible to visualize and compare the sequencing coverage
obtained from the rAAV sample and from the plasmid vector, as shown in
Figure 2.
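The screening step described above can be illustrated with a minimal Python sketch. It is not the NTContent or MISA implementation; the function names, the window size, the step, the 75% GC threshold and the minimum run length of 6 nt are all assumptions chosen for illustration only.

```python
import re


def gc_windows(seq, window=100, step=10, threshold=0.75):
    """Return (start, end, gc_fraction) for each sliding window whose
    GC fraction is >= threshold. Coordinates are 0-based, end-exclusive.
    window, step and threshold are illustrative defaults (assumptions)."""
    seq = seq.upper()
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        if not chunk:
            continue  # empty input sequence
        gc = (chunk.count("G") + chunk.count("C")) / len(chunk)
        if gc >= threshold:
            hits.append((start, start + len(chunk), gc))
    return hits


def homopolymers(seq, min_len=6):
    """Return (start, end, base) for each mononucleotide run of at least
    min_len bases. min_len is an illustrative default (assumption)."""
    # One base followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1))
            for m in pattern.finditer(seq.upper())]
```

Applied to a vector genome sequence before sequencing, the two lists point to regions where a coverage drop would more plausibly reflect a technical bias than a deletion.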
A coverage drop in the CAG promoter, which has a local GC percentage
higher than 90%, was previously observed using SSV-Seq (Kondratov et
al., 2017). The same observation has been reported by another group
using Fast-Seq, a technique based on Tn5 tagmentation (Maynard et al.,
2019). Kondratov et al. have shown that a PCR-free protocol could
outperform a PCR-enriched method (8 amplification cycles) in terms of
sequencing coverage along GC-rich regions of the AAV vector genome
(Kondratov et al., 2017). The authors used the Accel-NGS 2S PCR-Free DNA
Library Kit from Swift Biosciences to prepare libraries. An initial
amount of 4 × 10^11 vg of a rAAV5-CAG-GFP vector and an
input of 220 ng of dsDNA were used in their protocol. The Accel-NGS
workflow includes two DNA repair steps and two adapter ligation steps
and requires the use of specific adapters that are not compatible with
low-throughput applications. Similar to our data, the authors were able
to reduce the sequencing bias due to a high GC content by omitting PCR
amplification, although coverage drops were still detected in the CAG
promoter. Therefore, biases due to the sequencing technology need to be
further assessed. Indeed, all sequencing technologies exhibit error-rate
biases in low- (≤10%) and high-GC (≥75%) regions, and at long
homopolymers (Ross et al., 2013). G-rich sequences can also give rise
to sequence-specific errors (SSE) with Illumina technology (Dohm
et al., 2008) and may cause false SNVs (Shin and Park, 2016). However, in
our study, no SNV was observed in the CAG promoter.
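Coverage drops such as those discussed for the CAG promoter can be located by comparing the per-base depth of the rAAV sample with that of a plasmid control sequenced in parallel. The sketch below is one hypothetical way to do this, not the analysis used in this study: the mean-based normalization, the function names and the 0.5 ratio cutoff are assumptions. It expects two depth lists of equal length covering the same reference positions.

```python
def normalize(depth):
    """Divide per-base depth by its mean so that libraries with
    different sequencing yields can be compared on the same scale."""
    mean = sum(depth) / len(depth)
    if mean == 0:
        raise ValueError("mean depth is zero; nothing to normalize")
    return [d / mean for d in depth]


def coverage_drops(sample_depth, control_depth, min_ratio=0.5):
    """Return (start, end) intervals (0-based, end-exclusive) where the
    normalized sample depth is below min_ratio times the normalized
    control depth. min_ratio is an arbitrary cutoff (assumption)."""
    if len(sample_depth) != len(control_depth):
        raise ValueError("depth lists must cover the same positions")
    sample = normalize(sample_depth)
    control = normalize(control_depth)
    drops = []
    start = None
    for i, (s, c) in enumerate(zip(sample, control)):
        low = s < min_ratio * c
        if low and start is None:
            start = i  # a low-coverage interval opens here
        elif not low and start is not None:
            drops.append((start, i))  # interval closes before position i
            start = None
    if start is not None:
        drops.append((start, len(sample)))
    return drops
```

A drop present in both the sample and the control is not reported by this comparison, which is consistent with reading it as a technical bias rather than a feature of the rAAV preparation.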
Using our SSV-Seq 2.0 PCR-free method, we still detected minor coverage
drops, for example within the eGFP transgene (Figure 4).
Independent of a PCR amplification bias, this could be related to the
sequencing technology itself. Indeed, MiSeq sequencing, which uses the
same four-channel sequencing chemistry as HiSeq, has been shown to
disfavor the “CCNGCC” motif in the GFP coding sequence (Van den Hoecke
et al., 2016). On the other hand, sequencing technologies such as single
molecule real-time (SMRT) sequencing (Pacific Biosciences) are described
as giving less biased coverage across GC-rich regions (Ross et al.,
2013). Offering long read lengths, single-molecule sequencing
technologies also make it possible to study rAAV vector genome integrity
(Radukic et al., 2019; Tai et al., 2018; Xie et al., 2017). Interestingly, rAAV
genome truncations have been detected at hairpin-like structures using
the AAV-GPseq SMRT-based assay, creating self-complementary viral
genomes (Xie et al., 2017). Improving rAAV genome sequencing,
particularly through the ITRs and ITR-plasmid junctions, would also be of
great interest to the field. Recently, an HTS-based assay has been
developed to identify off-target nuclease activity after AAV-mediated
genome editing in vivo (Breton et al., 2020). PCR and adapter
optimizations were implemented in this protocol, named ITR-Seq, to
specifically amplify ITR-genomic DNA junctions. Combining multiple
sequencing technologies could provide complementary information and
reduce the risks associated with the inherent technical errors of each
platform. For instance, SSV-Seq, based on Illumina technology that gives
a high sequencing depth, is likely the preferred method to identify and
characterize residual DNA in rAAV stocks and perform SNV analysis, while
AAV-GPseq, based on SMRT sequencing, is better suited to assessing AAV
vector genome integrity (truncated rAAV genomes). The novel SSV-Seq 2.0
protocol makes it possible to circumvent PCR biases and improves the HTS
analysis of rAAV genomes harboring regions with a high GC content and long
mononucleotide stretches, such as those often found in promoters.