Bioinformatics pipeline
We obtained DNA barcodes from raw sequence data with the following
process, using bcl2fastq version 2.20 (Illumina), Claident version
2018.05.08 [28], VSEARCH 2.14 [30], and EMBOSS merger V.6.6.0
[31]. Steps 1-8 were executed in Bash scripts, and steps 9-10 in a
Python script:
- Raw sequence data in bcl format was converted into fastq format,
without demultiplexing, using bcl2fastq.
- The forward and reverse fastq sequences were each demultiplexed into
separate FC and BR fastq files for each of the 450 specimens, using
clsplitseq script from Claident. This resulted in a pair of R1 and R2
fastq files for the FC amplicon, and a pair of R1 and R2 fastq files
for the BR amplicon, for each of the 450 specimens.
- For each amplicon (FC and BR) and specimen, the R1 and R2 sequences
contained in each pair of fastq files were merged using
-fastq_mergepairs in VSEARCH with minimum overlap (-fastq_minovlen),
minimum merged length (-fastq_minmergelen), and maximum merged length
(-fastq_maxmergelen) options of 200, 250, and 400 respectively for
FC, and 100, 350, and 500 respectively for BR, with simultaneous
filtering of merged sequences for errors (-fastq_maxee 1) and
ambiguous bases (-fastq_maxn 0).
- The filtered sequences from each amplicon and specimen were then
dereplicated using -derep_fulllength, denoised using -cluster_unoise
with -minsize 1, and filtered for chimeras using -uchime3_denovo, all
in VSEARCH. Most of the filtered and denoised sequence files still
contained multiple sequences, necessitating additional filtering to
recover the correct barcode sequences, based on taxonomy, sequence
abundance, and attributes of pairwise FC-BR sequence alignments, using
subsequent steps.
- A taxonomic identification was obtained for each filtered and denoised
FC and BR sequence using BLAST against GenBank nr, accepting the top
match in each case.
- Up to n = 20 most abundant denoised FC and BR sequences (if
any) were output as fasta files for each specimen, using the VSEARCH
command -derep_fulllength with option -topn n . The FC and BR
sequence files were concatenated together into a single fasta file per
specimen.
- All possible pairwise sequence alignments with the expected overlap
between FC and BR amplicons, namely an overlap of 85 bp with 100 %
pairwise sequence identity, were identified between the sequences in
the concatenated FC and BR fasta file for each specimen using the
VSEARCH command -allpairs_global with options -id 1 and -mincols 85.
- If no alignments were detected among the 20 most abundant sequences
from a given specimen, steps 6 and 7 were repeated with all sequences
(if > 20) from that specimen.
- Putatively correct full-length or partial barcodes were then
identified per specimen based on comparing FC and BR sequence
identifications and FC-BR alignment characteristics. An optimal FC
sequence, BR sequence, and aligned FC-BR sequence pair (if any), was
selected for each specimen by matching their sequence identities to
the a-priori specimen taxonomy, prioritizing the lowest
taxonomic rank level. The sequences were also required to have lengths
between 324 and 326 b.p. for FC and between 417 and 419 b.p. for BR,
and BLAST identification bitscores ≥ 200 for FC and ≥ 250 for BR, to
avoid spurious matches. If more than one FC sequence, BR sequence, or
aligned FC-BR sequence pair per specimen met these criteria, the most
abundant sequence or sequence pair was selected.
- If an FC-BR sequence pair with expected taxonomy was detected, and the
lowest taxonomic identification rank of this sequence pair was equal
to or lower than that of the selected FC and/or BR sequences for that
specimen, that aligned FC-BR sequence pair was accepted as the likely
correct barcode components and combined by EMBOSS merger. This carries
out a pairwise alignment according to the Needleman-Wunsch algorithm,
outputting a merged barcode sequence, along with an alignment summary
that was checked to ensure the alignment attributes were as expected
(overlap of 85 bp and score of 425).
- To avoid accepting a merged sequence from a non-target organism, if
either of the selected FC or BR sequences had a lower taxonomic
identification rank than that of the selected aligned FC-BR sequence
pair, that FC or BR sequence was accepted as a likely correct partial
barcode, instead of the aligned FC-BR sequence pair.
- If an FC or BR sequence (but not an aligned FC-BR sequence pair) with
expected taxonomy was detected, that sequence was accepted as a likely
correct partial barcode. If both an FC and BR sequence (but no aligned
FC-BR sequence pair) with expected taxonomy were detected, the FC or
BR sequence with the lowest taxonomic rank was accepted as a likely
correct partial barcode. If these FR and BR sequences had the same
lowest taxonomic rank, the most abundant sequence was accepted as the
likely correct partial barcode.
The merged FC and BR barcodes, plus any partial FC- or BR-only barcodes,
were concatenated into a single fasta file, and aligned using MAFFT
[32]. An approximately maximum-likelihood phylogeny was generated
from the alignment using FastTree 2 [33], and visualised using the R
package ggtree [34].
The resulting barcode dataset was examined for any effects of DNA
extraction methodology, PCR primer, or taxonomy on successful barcode
recovery. For this purpose, any orders and families represented by fewer
than five specimens were pooled together as “Others”.
To assess the accuracy of recovered Illumina barcodes, 96 of the 450
specimen DNA extracts (47 beetles, 21 flies, 14 wasps, 11 other insects,
and individual mite, millipede, and spider specimens) were also
subjected to PCR using primers LCO1490 and HCO2198 (Folmer et al.,
1994), followed by Sanger sequencing of the amplicons using standard
methods. The resulting Sanger barcode sequences were compared with
Illumina-derived barcodes for the same specimens via generation of
pairwise sequence alignments using MAFFT (Katoh & Standley, 2013) and
determination of sequence identity between each sequence pair.