The SSV-Seq 2.0 protocol improves the sequencing coverage along
regions of the rAAV genome that are rich in both GC and homopolymers

After sequencing of the DNA library prepared from the AAV8-CAG-GFP
vector following the PCR-free novel method, the sequencing reads were
passed through our dedicated bioinformatics pipeline, named SSV-Conta
(Lecomte et al., 2019). SSV-Conta is intended for the determination of
the proportion of residual DNA species in a rAAV batch and for the
analysis of the coverage along the vector genome. The percentage of
reads that passed the quality and adapter trimming steps was higher
than 94% for both protocols, although a slightly lower percentage was
observed for the PCR-free libraries (Supplementary Table S4).

The filtered reads were then aligned to the vector plasmid to visualize
the sequencing coverage along the two GC-rich regions in the CAG
promoter described above. Using the PCR-free protocol, the coverage
along these regions was significantly restored compared to the
PCR-enriched protocol (Figure 4), indicating that the PCR
amplification step is one of the major causes of the artefactual drop in
sequencing coverage. In addition to the rAAV vector genome, reads
were aligned to the other DNA species, i.e. the vector plasmid
backbone, the helper plasmid and the HEK293 cell genome. The number of
reads aligned to each reference is shown in Supplementary Table
S5. Overall, a minimum of 15.2 M and 23.8 M reads per sample were mapped
to the known references for the PCR-free and PCR protocols, respectively.
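
The coverage comparison described above can be illustrated with a minimal sketch. It is not the SSV-Conta implementation: the function names, the toy reference sequence, the alignment intervals and the window size are all hypothetical, and alignments are assumed to be available as 0-based, half-open (start, end) intervals on a single reference. The sketch computes per-base depth and then reports GC fraction next to mean depth for consecutive windows, which is one simple way to see whether depth drops where GC content rises.

```python
from typing import List, Tuple


def per_base_depth(ref_len: int, alignments: List[Tuple[int, int]]) -> List[int]:
    """Depth at each reference position from 0-based, half-open intervals."""
    diff = [0] * (ref_len + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(ref_len, end)
        if start < end:
            diff[start] += 1   # read begins covering here
            diff[end] -= 1     # read stops covering here
    depth, running = [], 0
    for delta in diff[:ref_len]:
        running += delta
        depth.append(running)
    return depth


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def window_summary(ref: str, depth: List[int], window: int) -> List[Tuple[int, float, float]]:
    """(window start, GC fraction, mean depth) for non-overlapping windows."""
    rows = []
    for start in range(0, len(ref), window):
        chunk = ref[start:start + window]
        d = depth[start:start + window]
        rows.append((start, gc_fraction(chunk), sum(d) / len(d)))
    return rows


# Toy data, hypothetical: a 20-base reference and four aligned reads.
ref = "ATATATATGCGCGCGCATAT"
alignments = [(0, 10), (0, 8), (12, 20), (5, 20)]
depth = per_base_depth(len(ref), alignments)
summary = window_summary(ref, depth, window=10)
for start, gc, mean_depth in summary:
    print(f"window {start:>3}: GC={gc:.2f} mean depth={mean_depth:.2f}")
```

In practice the depth profile would come from an alignment file rather than hand-written intervals, and the same window summary could be computed for each protocol and replicate to compare the two GC-rich regions of the CAG promoter.
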
Finally, the percentage of each DNA species was calculated as described
previously (Lecomte et al., 2019) (Table 1). Similar to
the SSV-Seq method (Lecomte et al., 2015), the novel SSV-Seq 2.0
protocol is highly reproducible, as indicated by the coverage graph of
each replicate (Figure 4). Consistent with the better coverage
along the rAAV genome and less-biased sequencing of the GC-rich regions,
the optimized PCR-free method leads to a higher percentage of reads
aligned to the rAAV-CAG-GFP genome (93.9 ± 0.4% and 91.9 ± 0.3% of the
total mapped reads for PCR-free and PCR protocols, respectively). As
described in our previous study (Lecomte et al., 2015), the predominant
DNA contaminant originates from the vector plasmid backbone. The
relative percentage of this contaminant was reduced using the PCR-free
protocol, since more reads were attributed to the rAAV genome (5.7 ± 0.4%
of the total mapped reads for SSV-Seq 2.0 compared to 7.6 ± 0.3% for
SSV-Seq). Consequently, the SSV-Seq protocol slightly overestimates the
percentage of DNA contaminants when the rAAV vector genome is composed
of sequences that are difficult to amplify by PCR. In conclusion, the
SSV-Seq 2.0 method is the most accurate approach for the high-throughput
sequencing analysis of AAV vector genomes that contain regions with
high GC content and homopolymers.
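
The species percentages reported in this section are expressed relative to the total mapped reads. The sketch below shows that arithmetic under stated assumptions: each percentage is taken as the plain ratio of reads mapped to one reference over the reads mapped to all references, and the spread across replicates is taken as the sample standard deviation. The exact calculation in Lecomte et al. (2019) may include steps not shown here, and the read counts and function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def species_percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage of total mapped reads assigned to each reference."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {name: 100.0 * n / total for name, n in counts.items()}


def summarize(replicates: List[Dict[str, int]]) -> Dict[str, Tuple[float, float]]:
    """Mean and sample standard deviation of each percentage across replicates."""
    per_rep = [species_percentages(c) for c in replicates]
    result = {}
    for name in per_rep[0]:
        values = [p[name] for p in per_rep]
        result[name] = (mean(values), stdev(values))
    return result


# Hypothetical mapped-read counts for three replicates (not data from the study).
replicates = [
    {"rAAV genome": 940, "plasmid backbone": 50, "helper plasmid": 5, "HEK293 genome": 5},
    {"rAAV genome": 935, "plasmid backbone": 55, "helper plasmid": 5, "HEK293 genome": 5},
    {"rAAV genome": 945, "plasmid backbone": 45, "helper plasmid": 5, "HEK293 genome": 5},
]
first = species_percentages(replicates[0])
stats = summarize(replicates)
for name, (avg, sd) in stats.items():
    print(f"{name}: {avg:.1f} ± {sd:.1f}%")
```

Because the percentages are relative, any reads lost from the rAAV genome through coverage drops raise the apparent share of every other species, which is the overestimation effect described above for the PCR-based protocol.
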