Analysis of the sequencing coverage of rAAV vector genomes containing GC-rich regions and mononucleotide stretches
SSV-Seq is a powerful technique intended, primarily, for the analysis of residual DNA in AAV vector lots. However, based on high-throughput sequencing, this method also provides information about the vector genome identity, including the identification of single nucleotide variants (SNV) and indels. The SSV-Seq protocol consists of the following successive experimental steps (Figure 1 ) (Lecomte et al., 2019): (1) a facultative DNase pretreatment, (2) the DNA extraction from rAAV stocks, (3) the second-strand DNA synthesis using random hexamers, (4) the library preparation and (5) the Illumina sequencing. Briefly, before particle disruption, residual DNA that is not protected by the AAV capsid can be removed by a facultative nuclease treatment. After DNA extraction, the single-stranded rAAV genome is converted into double-stranded DNA (dsDNA) using random hexamers to generate a template compatible with Illumina sequencing library preparation workflow. Illumina-compatible sequencing libraries are then prepared using a custom protocol (Figure 1, protocol ). DNA is sheared by sonication, end-repaired and A-tailed and adapters are ligatedvia a 3-prime T-overhang. DNA fragments that are flanked by adapters are amplified by 15 cycles of PCR. Finally, the library is paired-end sequenced using the Illumina HiSeq platform and the data are processed using our dedicated bioinformatics pipeline (https://github.com/emlec/SSV-Conta) (Lecomte et al., 2019).
The PCR amplification step has already been described as being the principal source of bias during sequencing library preparation (Aird et al., 2011). Indeed, AT- (Oyola et al., 2012) and GC-rich (Benjamini and Speed, 2012) fragments are less efficiently amplified than other regions in the genome which can lead to a bias resulting of a lower sequencing coverage. Consequently, sequencing library preparation protocols including a PCR step can cause an uneven distribution of reads coverage across the DNA. In the context of AAV vector analysis by HTS, this bias could cause an underestimation of AT- and/or GC-rich sequences and the impossibility to identify SNV and indels in these regions.
To determine more precisely the impact of the base composition on the Illumina sequencing coverage, we first developed a new program, named NTContent (https://github.com/emlec/NTContent). This program is based on a sliding window analysis and requires a DNA sequence in FASTA format as input. It generates a tab-delimited text file composed of two columns indicating for each given position the percentage of the requested nucleotide combination (Supplementary Figure S1 ). The NTContent program was applied to a rAAV genome sequence containing the CAG promoter followed by the GFP reporter gene and the SV40 polyadenylation signal sequence. The CAG promoter was chosen because it has high GC content and it is known to be difficult template for PCR amplification. Figure 2a shows the percentage of GC along the rAAV genome, the sequencing coverage obtained by SSV-Seq from a rAAV8 vector batch and from its corresponding plasmid vector. Both for the plasmid and the rAAV sample, two major drops in the sequencing coverage appeared in the CAG promoter around position 666 (asterisk, region 1) and position 1421 (asterisk, region 2) of the rAAV genome. The superimposed graphs revealed that the two sequencing drops in the CAG promoter are clearly related to a high GC content. A more precise analysis of the percentage of GC was performed using the same program NTContent but taking into account the nature (A, T, G, or C) of 50 successive bases with a step of 25 bases. This analysis showed that the two steep drops coincide with regions composed of more than 90% of GC (Supplementary Table S1).
In order to further investigate the origin of the sequencing drops, the presence of A, T, C, and G nucleotide stretches along the rAAV genome was analyzed (Figure 2b-c ). Of note, long G/C homopolymers have already been reported as a source of bias during sequencing on Illumina systems such as HiSeq instrument (Shin and Park, 2016). A succession of C, T and G homopolymers was observed in region 1 of the CAG promoter between position 588 and 676 (Figure 2b ). C and G stretches were also detected in region 2, upstream the steep drop, between position 1290 and 1391 (Figure 2c ). A previous study suggested that the presence of repetitive mononucleotides at the active site of a polymerase can lead to its dissociation from the DNA (Fazekas et al., 2010). For SSV-Seq, the PCR amplification is realized using the PfuUltra II Fusion HotStart DNA Polymerase. The number of nucleotides that fills the active site of this polymerase is 6. Thus, the software MISA was used to seek nucleotide repeats equal or greater than 6. Regions 1 and 2 of the CAG promoter contain 6 out of 14 homopolymers of ≥ 6 bases detected in the AAV vector genome (Supplementary Table S2 ). In particular, the first region showing a drastic coverage drop includes two stretches of 8 and 16 mononucleotides (C and G, respectively).
Overall, we can conclude that drastic drops in the sequencing coverage correlate with the presence of long stretches of G and C nucleotides in the rAAV vector genome which is consistent with the very high GC percentage illustrated on Figure 2a . It remains to be determined if this bias occurs at the PCR step and/or during the HiSeq Illumina sequencing. To address this question, a PCR-free protocol has been developed and compared to the PCR-enriched SSV-Seq method.