Analysis of the sequencing coverage of rAAV vector genomes
containing GC-rich regions and mononucleotide stretches
SSV-Seq is a powerful technique intended, primarily, for the analysis of
residual DNA in AAV vector lots. However, based on high-throughput
sequencing, this method also provides information about the vector
genome identity, including the identification of single nucleotide
variants (SNV) and indels. The SSV-Seq protocol consists of the
following successive experimental steps (Figure 1 ) (Lecomte et
al., 2019): (1) a facultative DNase pretreatment, (2) the DNA extraction
from rAAV stocks, (3) the second-strand DNA synthesis using random
hexamers, (4) the library preparation and (5) the Illumina sequencing.
Briefly, before particle disruption, residual DNA that is not protected
by the AAV capsid can be removed by a facultative nuclease treatment.
After DNA extraction, the single-stranded rAAV genome is converted into
double-stranded DNA (dsDNA) using random hexamers to generate a template
compatible with Illumina sequencing library preparation workflow.
Illumina-compatible sequencing libraries are then prepared using a
custom protocol (Figure 1, protocol ). DNA is sheared by
sonication, end-repaired and A-tailed and adapters are ligatedvia a 3-prime T-overhang. DNA fragments that are flanked by
adapters are amplified by 15 cycles of PCR. Finally, the library is
paired-end sequenced using the Illumina HiSeq platform and the data are
processed using our dedicated bioinformatics pipeline
(https://github.com/emlec/SSV-Conta)
(Lecomte et al., 2019).
The PCR amplification step has already been described as being the
principal source of bias during sequencing library preparation (Aird et
al., 2011). Indeed, AT- (Oyola et al., 2012) and GC-rich (Benjamini and
Speed, 2012) fragments are less efficiently amplified than other regions
in the genome which can lead to a bias resulting of a lower sequencing
coverage. Consequently, sequencing library preparation protocols
including a PCR step can cause an uneven distribution of reads coverage
across the DNA. In the context of AAV vector analysis by HTS, this bias
could cause an underestimation of AT- and/or GC-rich sequences and the
impossibility to identify SNV and indels in these regions.
To determine more precisely the impact of the base composition on the
Illumina sequencing coverage, we first developed a new program, named
NTContent
(https://github.com/emlec/NTContent).
This program is based on a sliding window analysis and requires a DNA
sequence in FASTA format as input. It generates a tab-delimited text
file composed of two columns indicating for each given position the
percentage of the requested nucleotide combination
(Supplementary Figure S1 ). The NTContent program was applied to
a rAAV genome sequence containing the CAG promoter followed by the GFP
reporter gene and the SV40 polyadenylation signal sequence. The CAG
promoter was chosen because it has high GC content and it is known to be
difficult template for PCR amplification. Figure 2a shows the
percentage of GC along the rAAV genome, the sequencing coverage obtained
by SSV-Seq from a rAAV8 vector batch and from its corresponding plasmid
vector. Both for the plasmid and the rAAV sample, two major drops in the
sequencing coverage appeared in the CAG promoter around position 666
(asterisk, region 1) and position 1421 (asterisk, region 2) of the rAAV
genome. The superimposed graphs revealed that the two sequencing drops
in the CAG promoter are clearly related to a high GC content. A more
precise analysis of the percentage of GC was performed using the same
program NTContent but taking into account the nature (A, T, G, or C) of
50 successive bases with a step of 25 bases. This analysis showed that
the two steep drops coincide with regions composed of more than 90% of
GC (Supplementary Table S1).
In order to further investigate the origin of the sequencing drops, the
presence of A, T, C, and G nucleotide stretches along the rAAV genome
was analyzed (Figure 2b-c ). Of note, long G/C homopolymers have
already been reported as a source of bias during sequencing on Illumina
systems such as HiSeq instrument (Shin and Park, 2016). A succession of
C, T and G homopolymers was observed in region 1 of the CAG promoter
between position 588 and 676 (Figure 2b ). C and G stretches
were also detected in region 2, upstream the steep drop, between
position 1290 and 1391 (Figure 2c ). A previous study suggested
that the presence of repetitive mononucleotides at the active site of a
polymerase can lead to its dissociation from the DNA (Fazekas et al.,
2010). For SSV-Seq, the PCR amplification is realized using the PfuUltra
II Fusion HotStart DNA Polymerase. The number of nucleotides that fills
the active site of this polymerase is 6. Thus, the software MISA was
used to seek nucleotide repeats equal or greater than 6. Regions 1 and 2
of the CAG promoter contain 6 out of 14 homopolymers of ≥ 6 bases
detected in the AAV vector genome (Supplementary Table S2 ). In
particular, the first region showing a drastic coverage drop includes
two stretches of 8 and 16 mononucleotides (C and G, respectively).
Overall, we can conclude that drastic drops in the sequencing coverage
correlate with the presence of long stretches of G and C nucleotides in
the rAAV vector genome which is consistent with the very high GC
percentage illustrated on Figure 2a . It remains to be
determined if this bias occurs at the PCR step and/or during the HiSeq
Illumina sequencing. To address this question, a PCR-free protocol has
been developed and compared to the PCR-enriched SSV-Seq method.