Bioinformatics pipeline
We obtained DNA barcodes from raw sequence data with the following process, using bcl2fastq version 2.20 (Illumina), Claident version 2018.05.08 [28], VSEARCH 2.14 [30], and EMBOSS merger V.6.6.0 [31]. Steps 1-8 were executed in Bash scripts, and steps 9-10 in a Python script:
  1. Raw sequence data in bcl format was converted into fastq format, without demultiplexing, using bcl2fastq.
  2. The forward and reverse fastq sequences were each demultiplexed into separate FC and BR fastq files for each of the 450 specimens, using clsplitseq script from Claident. This resulted in a pair of R1 and R2 fastq files for the FC amplicon, and a pair of R1 and R2 fastq files for the BR amplicon, for each of the 450 specimens.
  3. For each amplicon (FC and BR) and specimen, the R1 and R2 sequences contained in each pair of fastq files were merged using -fastq_mergepairs in VSEARCH with minimum overlap (-fastq_minovlen), minimum merged length (-fastq_minmergelen), and maximum merged length (-fastq_maxmergelen) options of 200, 250, and 400 respectively for FC, and 100, 350, and 500 respectively for BR, with simultaneous filtering of merged sequences for errors (-fastq_maxee 1) and ambiguous bases (-fastq_maxn 0).
  4. The filtered sequences from each amplicon and specimen were then dereplicated using -derep_fulllength, denoised using -cluster_unoise with -minsize 1, and filtered for chimeras using -uchime3_denovo, all in VSEARCH. Most of the filtered and denoised sequence files still contained multiple sequences, necessitating additional filtering to recover the correct barcode sequences, based on taxonomy, sequence abundance, and attributes of pairwise FC-BR sequence alignments, using subsequent steps.
  5. A taxonomic identification was obtained for each filtered and denoised FC and BR sequence using BLAST against GenBank nr, accepting the top match in each case.
  6. Up to n = 20 most abundant denoised FC and BR sequences (if any) were output as fasta files for each specimen, using the VSEARCH command -derep_fulllength with option -topn n . The FC and BR sequence files were concatenated together into a single fasta file per specimen.
  7. All possible pairwise sequence alignments with the expected overlap between FC and BR amplicons, namely an overlap of 85 bp with 100 % pairwise sequence identity, were identified between the sequences in the concatenated FC and BR fasta file for each specimen using the VSEARCH command -allpairs_global with options -id 1 and -mincols 85.
  8. If no alignments were detected among the 20 most abundant sequences from a given specimen, steps 6 and 7 were repeated with all sequences (if > 20) from that specimen.
  9. Putatively correct full-length or partial barcodes were then identified per specimen based on comparing FC and BR sequence identifications and FC-BR alignment characteristics. An optimal FC sequence, BR sequence, and aligned FC-BR sequence pair (if any), was selected for each specimen by matching their sequence identities to the a-priori specimen taxonomy, prioritizing the lowest taxonomic rank level. The sequences were also required to have lengths between 324 and 326 b.p. for FC and between 417 and 419 b.p. for BR, and BLAST identification bitscores ≥ 200 for FC and ≥ 250 for BR, to avoid spurious matches. If more than one FC sequence, BR sequence, or aligned FC-BR sequence pair per specimen met these criteria, the most abundant sequence or sequence pair was selected.
  10. If an FC-BR sequence pair with expected taxonomy was detected, and the lowest taxonomic identification rank of this sequence pair was equal to or lower than that of the selected FC and/or BR sequences for that specimen, that aligned FC-BR sequence pair was accepted as the likely correct barcode components and combined by EMBOSS merger. This carries out a pairwise alignment according to the Needleman-Wunsch algorithm, outputting a merged barcode sequence, along with an alignment summary that was checked to ensure the alignment attributes were as expected (overlap of 85 bp and score of 425).
  11. To avoid accepting a merged sequence from a non-target organism, if either of the selected FC or BR sequences had a lower taxonomic identification rank than that of the selected aligned FC-BR sequence pair, that FC or BR sequence was accepted as a likely correct partial barcode, instead of the aligned FC-BR sequence pair.
  12. If an FC or BR sequence (but not an aligned FC-BR sequence pair) with expected taxonomy was detected, that sequence was accepted as a likely correct partial barcode. If both an FC and BR sequence (but no aligned FC-BR sequence pair) with expected taxonomy were detected, the FC or BR sequence with the lowest taxonomic rank was accepted as a likely correct partial barcode. If these FR and BR sequences had the same lowest taxonomic rank, the most abundant sequence was accepted as the likely correct partial barcode.
The merged FC and BR barcodes, plus any partial FC- or BR-only barcodes, were concatenated into a single fasta file, and aligned using MAFFT [32]. An approximately maximum-likelihood phylogeny was generated from the alignment using FastTree 2 [33], and visualised using the R package ggtree [34].
The resulting barcode dataset was examined for any effects of DNA extraction methodology, PCR primer, or taxonomy on successful barcode recovery. For this purpose, any orders and families represented by fewer than five specimens were pooled together as “Others”.
To assess the accuracy of recovered Illumina barcodes, 96 of the 450 specimen DNA extracts (47 beetles, 21 flies, 14 wasps, 11 other insects, and individual mite, millipede, and spider specimens) were also subjected to PCR using primers LCO1490 and HCO2198 (Folmer et al., 1994), followed by Sanger sequencing of the amplicons using standard methods. The resulting Sanger barcode sequences were compared with Illumina-derived barcodes for the same specimens via generation of pairwise sequence alignments using MAFFT (Katoh & Standley, 2013) and determination of sequence identity between each sequence pair.