Sensitivity, specificity and accuracy of Illumina barcode
generation
The balance between sensitivity and specificity of high-throughput
sequencing is often difficult to maintain [37], and is in our case
tilted in favour of sensitivity to increase the rate of barcode
recovery. This high sensitivity of Illumina sequencing enabled the
recovery of numerous complete DNA barcodes from PCRs that amplified too
weakly to yield a product visible by gel electrophoresis.
Such specimens, accounting for almost half of our samples for the FC
fragment, would likely fail to produce Sanger sequences, and be
relegated to the ‘difficult to sequence’ set of preserved specimens
[38]. Furthermore, full-length barcodes were recovered from various
specimens with clearly suboptimal sequencing outcomes, including eight
specimens with a single FC sequence (after error filtering), another
with a single BR sequence, and a further 14 specimens with between two
and five FC or BR sequences. On the other hand, this sensitivity also
allowed the amplification of non-target sequences from well-known
sources of contamination, such as extraction buffers (‘kitome’)
[39], human handling of specimens or bacterial symbionts [40], as
well as apparent cross-contaminant sequences from other samples. Indeed,
many of the most abundant sequences that were not selected as part
of correct barcodes were similar or identical to those from other
specimens included in the analysis, suggesting that these may variously
result from PCR errors [41], co-amplification of numts [42],
cross-contamination during library preparation [43], or index
switching during sequencing [44]. Similar issues were observed in a
study using the same sequencing approach to barcode a saproxylic beetle
collection [36], indicating that such contaminants may be an inevitable
consequence of applying a highly sensitive method to specimens that may
not have been collected with DNA analyses in mind. These issues might be
mitigated in future analyses by stringent laboratory protocols to limit
contamination [45], alternative library generation workflows
[46], and fewer PCR cycles [47], although the latter might be at
the expense of sensitivity. Additionally, bioinformatic tools can be
used to identify correct barcodes among sequencing output. In this case,
for example, the presence of contaminants necessitated the development
of a semi-automated barcode assembly pipeline to accurately resolve
barcode sequences.
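
As an illustration of the kind of bioinformatic screening described above,
the following is a minimal sketch, not the pipeline used in this study. It
assumes that per-specimen read counts of unique sequences are already
available, and it applies one simple heuristic: a sequence is treated as a
possible cross-contaminant if the identical sequence is more abundant in
another specimen. The function name pick_candidate, the input layout and
the min_reads parameter are hypothetical and chosen only for this example.

```python
def pick_candidate(counts_by_specimen, specimen, min_reads=1):
    """Return the most abundant sequence of `specimen` that is not more
    abundant in any other specimen, or None if no sequence qualifies.

    counts_by_specimen: dict mapping specimen name -> {sequence: read count}
    (hypothetical layout, assumed for this sketch).
    """
    own = counts_by_specimen[specimen]
    candidates = []
    for seq, n in own.items():
        if n < min_reads:
            continue
        # Highest read count of this exact sequence in any other specimen.
        max_elsewhere = max(
            (other.get(seq, 0)
             for name, other in counts_by_specimen.items()
             if name != specimen),
            default=0,
        )
        # Keep the sequence only if this specimen is its most likely source.
        if n >= max_elsewhere:
            candidates.append((n, seq))
    if not candidates:
        return None
    # Most abundant first; ties broken alphabetically for reproducibility.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return candidates[0][1]


# Toy data: "ACGT" is abundant in s1 and present at low copy in s2 and s3.
counts = {
    "s1": {"ACGT": 50, "TTTT": 5},
    "s2": {"ACGT": 3, "GGGG": 40},
    "s3": {"ACGT": 2},
}
print(pick_candidate(counts, "s1"))  # ACGT
print(pick_candidate(counts, "s2"))  # GGGG
print(pick_candidate(counts, "s3"))  # None
```

A heuristic of this kind cannot distinguish closely related specimens that
genuinely share a haplotype from true cross-contamination, so in practice
it would only flag candidates for further checks such as taxonomic
assignment.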
To assess recovered barcode sequence accuracy, 96 of the specimens
included in this analysis were also subjected to DNA barcoding via
Sanger sequencing. Full-length barcodes were recovered from most of
these specimens (68) by both methods. However, only Illumina barcodes
(seven full-length and nine partial) were obtained from 16 specimens
that failed to produce Sanger barcodes. This included 15 beetles, which
can be challenging DNA barcoding subjects due to their tough
exoskeletons, further illustrating the sensitivity of the Illumina
approach. On the other hand, only partial barcode sequences were
obtained via Illumina sequencing from ten specimens from which full
Sanger barcodes were obtained. However, some of these Sanger barcodes
were obtained only after multiple rounds of PCR optimization, together
with examination and manual editing of sequence chromatograms. High levels of
pairwise sequence identity were observed between Sanger and Illumina
barcodes from most specimens (99 to 100 % for 59 of 68 specimens),
indicating generally high levels of accuracy for both sequencing
approaches. Three specimens (all beetles) had pairwise sequence
identities between 75.5 % and 85.7 %, suggesting that either the
Illumina approach or the Sanger approach recovered a sequence from a
non-target organism in these cases. Each of these FC-BR sequence pairs
was selected because its constituent FC and/or BR fragments were the most
abundant sequences assigned to the correct taxonomic family, with no
lower-rank taxonomic information available among the BLAST results to
further guide correct barcode selection.
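
The pairwise sequence identities reported above can be illustrated with a
short sketch. The text does not state how identity was calculated, so the
definition used here is an assumption: the two barcodes are taken to be
already aligned to equal length, and identity is the percentage of
matching bases among columns where neither sequence has a gap. The
function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two pre-aligned sequences.

    Assumption for this sketch: columns with a gap in either sequence
    are excluded from both the numerator and the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# One mismatch in eight compared positions.
print(percent_identity("ACGTACGT", "ACGTACGA"))  # 87.5
# The gapped column is ignored; the remaining three positions match.
print(percent_identity("AC-T", "ACGT"))  # 100.0
```

Other conventions, such as counting gapped columns as mismatches or
dividing by the length of the shorter sequence, give different values for
the same alignment, so the convention should be stated alongside any
reported identity.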