The SSV-Seq 2.0 protocol improves the sequencing coverage along regions of the rAAV genome that are rich in both GC and homopolymers
After sequencing of the DNA library prepared from the AAV8-CAG-GFP vector with the novel PCR-free method, the sequencing reads were passed through our dedicated bioinformatics pipeline, named SSV-Conta (Lecomte et al., 2019). SSV-Conta is intended to determine the proportion of residual DNA species in a rAAV batch and to analyze the coverage along the vector genome. The percentage of reads that passed the quality and adapter trimming steps was higher than 94% for both protocols, although a slightly lower percentage was observed for the PCR-free libraries (Supplementary Table S4). The filtered reads were then aligned to the vector plasmid to visualize the sequencing coverage along the two GC-rich regions in the CAG promoter described above. Using the PCR-free protocol, the coverage along these regions was significantly restored compared to the PCR-enriched protocol (Figure 4), indicating that the PCR amplification step is one of the major causes of the artefactual drop in sequencing coverage. In addition to the rAAV vector genome, reads were aligned to the other DNA species, i.e. the vector plasmid backbone, the helper plasmid and the HEK293 cell genome. The number of reads aligned to each reference is shown in Supplementary Table S5. Overall, at least 15.2 M and 23.8 M reads per sample were mapped to the known references for the PCR-free and PCR protocols, respectively. Finally, the percentage of each DNA species was calculated as described in Lecomte et al. (2019) (Table 1). Like the SSV-Seq method (Lecomte et al., 2015), the novel SSV-Seq 2.0 protocol is highly reproducible, as indicated by the coverage graph of each replicate (Figure 4).
Consistent with the better coverage along the rAAV genome and the less-biased sequencing of the GC-rich regions, the optimized PCR-free method leads to a higher percentage of reads aligned to the rAAV-CAG-GFP genome (93.9 ± 0.4% and 91.9 ± 0.3% of the total mapped reads for the PCR-free and PCR protocols, respectively). As described in our previous study (Lecomte et al., 2015), the predominant DNA contaminant originates from the vector plasmid backbone. The relative percentage of this contaminant was lower with the PCR-free protocol (5.7 ± 0.4% of the total mapped reads for SSV-Seq 2.0 compared to 7.6 ± 0.3% for SSV-Seq), since more reads were attributed to the rAAV genome. Consequently, the SSV-Seq protocol slightly overestimates the percentage of DNA contaminants when the rAAV vector genome is composed of sequences that are difficult to amplify by PCR. In conclusion, the SSV-Seq 2.0 method is the most accurate approach for the high-throughput sequencing analysis of AAV vector genomes that contain regions with a high level of GC and homopolymers.
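As a rough illustration of how relative proportions such as those reported above can be derived, the sketch below converts per-reference mapped-read counts into percentages of the total mapped reads and summarizes them across replicates as mean and standard deviation. It is not the SSV-Conta implementation: the exact calculation is described in Lecomte et al. (2019) and may include steps not shown here. The function names (`species_percentages`, `summarize_replicates`) and all read counts are hypothetical placeholders, not data from this study.

```python
from statistics import mean, stdev


def species_percentages(read_counts):
    """Return each reference's share of the total mapped reads, in percent.

    read_counts maps a reference name to its number of mapped reads.
    Assumption: a simple ratio of mapped reads, with no further normalization.
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {name: 100.0 * count / total for name, count in read_counts.items()}


def summarize_replicates(replicates):
    """Return {reference: (mean %, sample SD %)} across replicate libraries."""
    per_replicate = [species_percentages(counts) for counts in replicates]
    summary = {}
    for name in per_replicate[0]:
        values = [percentages[name] for percentages in per_replicate]
        summary[name] = (mean(values), stdev(values))
    return summary


# Placeholder counts for three hypothetical replicates (NOT study data).
example_replicates = [
    {"rAAV genome": 900, "plasmid backbone": 80, "helper plasmid": 15, "HEK293 genome": 5},
    {"rAAV genome": 910, "plasmid backbone": 70, "helper plasmid": 15, "HEK293 genome": 5},
    {"rAAV genome": 890, "plasmid backbone": 90, "helper plasmid": 15, "HEK293 genome": 5},
]

if __name__ == "__main__":
    for name, (avg, sd) in summarize_replicates(example_replicates).items():
        print(f"{name}: {avg:.1f} ± {sd:.1f}%")
```

Because the percentages are expressed relative to the total mapped reads, any gain in reads assigned to the rAAV genome necessarily lowers the relative share of the other DNA species, which is the effect discussed above for the plasmid backbone.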