Bioinformatics Analysis
Base call (BCL) files were converted into FASTQ files with the Illumina bcl2fastq2 Conversion Software (Illumina, San Diego, CA). Programs included in the SSV-Conta package (https://github.com/emlec/SSV-Conta) were then used to quantify and characterize all DNA species present in a rAAV vector lot: Quade, a FASTQ file demultiplexer; Sekator, an adapter trimmer; RefMasker, which masks sequence homologies; and ContaVect, which analyzes residual DNA (Lecomte et al., 2019). Briefly, FASTQ files were demultiplexed with Quade according to their barcodes. Paired-end reads were assigned to a sample when the combination of the two barcodes (index read 1 and index read 2) was correct and each base of the barcodes had a PHRED quality score of at least 25. Paired-end reads that passed this filter were trimmed with Sekator according to sequence quality, and adapters were removed, as described previously (Lecomte et al., 2019). The distribution of residual DNA was determined using the RefMasker and ContaVect programs. The reference sequences were listed in the ContaVect configuration files in the following order: the phage φX174 genome (GenBank accession number J02482.1), the phage λ genome (J02459.1), the rAAV genome, the plasmid backbone sequence, the helper plasmid sequence, the adenovirus 5 (Ad5) sequence (nucleotides 1 to 4344 of the Human adenovirus 5 complete genome, AC_000008) and the human genome (GRCh38 primary assembly). Using RefMasker, homologies between two reference sequences were masked on the second reference sequence in the list order, replacing homologous nucleotides with an N base symbol. ContaVect was run with the following main parameters: minimum mean read quality, 30; minimum mapping quality for read validation, 20; minimum mapping size, 25 bases. Unmapped reads, and mapped reads that did not fulfill these criteria, were excluded.
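The filtering and masking rules described above can be illustrated with a minimal Python sketch. This is not the code of Quade, RefMasker or ContaVect; the function names, the PHRED+33 quality encoding and the use of aligned length as the "mapping size" are assumptions made for illustration only.

```python
from statistics import mean

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ ASCII quality encoding


def phred_scores(quality_string):
    """Decode a FASTQ quality string into a list of PHRED scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def barcodes_pass(index1_qual, index2_qual, min_phred=25):
    """Return True if every base of both index reads has PHRED >= min_phred."""
    return all(q >= min_phred for q in phred_scores(index1_qual + index2_qual))


def read_is_valid(read_qual, mapq, aligned_length,
                  min_mean_qual=30, min_mapq=20, min_size=25):
    """Apply the three main thresholds quoted in the text to one mapped read.

    aligned_length stands in for the 'mapping size' (an assumption).
    """
    return (mean(phred_scores(read_qual)) >= min_mean_qual
            and mapq >= min_mapq
            and aligned_length >= min_size)


def mask_interval(sequence, start, end):
    """Replace sequence[start:end] (0-based, end-exclusive) with N symbols,
    as would be done on the second reference of a homologous pair."""
    return sequence[:start] + "N" * (end - start) + sequence[end:]
```

For example, `barcodes_pass("IIIIIIII", "::::::::")` accepts a read pair whose index bases all have a score of at least 25, and `mask_interval("ACGTACGT", 2, 5)` returns `"ACNNNCGT"`.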
Sequencing coverage at each base of the vector plasmid was computed with SSV-Coverage, a program included in the SSV-Conta package. Sequencing data have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under the accession number PRJEB38306 (https://www.ebi.ac.uk/ena/data/view/PRJEB38306).
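The per-base coverage computation mentioned above for the vector plasmid can be sketched as follows. This is an illustrative example only, not the SSV-Coverage implementation; the function name and the representation of alignments as 0-based, end-exclusive intervals are assumptions.

```python
def per_base_coverage(reference_length, alignments):
    """Return the read depth at each reference position.

    alignments: iterable of (start, end) pairs, 0-based and end-exclusive
    (hypothetical input format chosen for this sketch).
    """
    # Difference array: +1 where an alignment starts, -1 just past its end.
    diff = [0] * (reference_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, reference_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A running sum of the difference array gives the depth at each base.
    coverage = []
    depth = 0
    for delta in diff[:reference_length]:
        depth += delta
        coverage.append(depth)
    return coverage
```

With a 6-base reference and two reads covering positions 0 to 2 and 2 to 5, `per_base_coverage(6, [(0, 3), (2, 6)])` returns `[1, 1, 2, 1, 1, 1]`.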