Sensitivity, specificity and accuracy of Illumina barcode generation
The balance between sensitivity and specificity in high-throughput sequencing is often difficult to maintain [37], and in our case is tilted in favour of sensitivity to increase the rate of barcode recovery. This high sensitivity of Illumina sequencing enabled the recovery of numerous complete DNA barcodes from PCRs that worked too poorly to produce a product visible by gel electrophoresis. Such specimens, accounting for almost half of our samples for the FC fragment, would likely fail to produce Sanger sequences and be relegated to the ‘difficult to sequence’ set of preserved specimens [38]. Furthermore, full-length barcodes were recovered from various specimens with clearly suboptimal sequencing outcomes, including eight specimens with a single FC sequence (after error filtering), another with a single BR sequence, and a further 14 specimens with between two and five FC or BR sequences. On the other hand, this sensitivity also allowed the amplification of non-target sequences from well-known sources of contamination, such as extraction buffers (‘kitome’) [39], human specimen handling or bacterial symbionts [40], as well as apparent cross-contaminant sequences from other samples. Indeed, many of the maximally abundant sequences that were not selected as part of correct barcodes were similar or identical to those from other specimens included in the analysis, suggesting that these may variously result from PCR errors [41], co-amplification of numts [42], cross-contamination during library preparation [43], or index switching during sequencing [44]. Similar issues were observed in a study using the same sequencing approach to barcode a saproxylic beetle collection [36], indicating that such contaminants may be an inevitable consequence of applying a highly sensitive method to specimens that may not have been collected with DNA analyses in mind.
These issues might be mitigated in future analyses by stringent laboratory protocols to limit contamination [45], alternative library generation workflows [46], and fewer PCR cycles [47], although the latter might be at the expense of sensitivity. Additionally, bioinformatic tools can be used to identify correct barcodes among sequencing output. In this case, for example, the presence of contaminants necessitated the development of a semi-automated barcode assembly pipeline to accurately resolve barcode sequences.
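One simple bioinformatic check of this kind is to flag cases where the most abundant sequence from a specimen also occurs in other specimens, which may point to cross-contamination or index switching. The following is only a minimal sketch of that idea, not the pipeline used in this study; the function name `flag_shared_top_sequences` and the input layout (a dictionary of per-specimen read counts) are assumptions made for illustration. Note that closely related specimens can legitimately share a barcode, so a flag is a prompt for inspection rather than evidence of contamination.

```python
from collections import defaultdict


def flag_shared_top_sequences(reads_by_specimen):
    """Return {specimen: (top_sequence, shared)} for each specimen with reads.

    reads_by_specimen: hypothetical input, {specimen_id: {sequence: read_count}}.
    shared is True when the most abundant sequence of a specimen is also
    present (at any abundance) in at least one other specimen.
    """
    # Record which specimens each distinct sequence was observed in.
    owners = defaultdict(set)
    for specimen, counts in reads_by_specimen.items():
        for sequence in counts:
            owners[sequence].add(specimen)

    flagged = {}
    for specimen, counts in reads_by_specimen.items():
        if not counts:
            continue  # no reads survived filtering for this specimen
        top = max(counts, key=counts.get)  # most abundant sequence
        flagged[specimen] = (top, len(owners[top]) > 1)
    return flagged


example = {
    "specimen_1": {"AAAA": 10, "CCCC": 2},
    "specimen_2": {"CCCC": 8},
    "specimen_3": {"GGGG": 5},
}
result = flag_shared_top_sequences(example)
```

In this toy input, only the top sequence of `specimen_2` is flagged, because it also appears at low abundance in `specimen_1`.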
To assess the accuracy of the recovered barcode sequences, 96 of the specimens included in this analysis were also subjected to DNA barcoding via Sanger sequencing. Full-length barcodes were recovered from most of these specimens (68) by both methods. However, only Illumina barcodes (seven full-length and nine partial) were obtained from 16 specimens that failed to produce Sanger barcodes. This included 15 beetles, which can be challenging DNA barcoding subjects due to their tough exoskeletons, further illustrating the sensitivity of the Illumina approach. On the other hand, only partial barcode sequences were obtained via Illumina sequencing from ten specimens from which full Sanger barcodes were obtained. However, some of these specimens required multiple PCR optimization attempts, along with examination and manual editing of sequence chromatograms, to obtain Sanger barcodes. High levels of pairwise sequence identity were observed between Sanger and Illumina barcodes from most specimens (99 to 100 % for 59 of 68 specimens), indicating generally high levels of accuracy for both sequencing approaches. Three specimens (all beetles) had pairwise sequence identities between 75.5 % and 85.7 %, suggesting that either the Illumina approach or the Sanger approach recovered a sequence from a non-target organism in these cases. These FC-BR sequence pairs were each selected because their constituent FC and/or BR fragments were the most abundant sequences identified to the correct taxonomic families, with no lower-rank taxonomic information available among the BLAST results to further guide correct barcode selection.
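The pairwise sequence identity used in the comparison above can be illustrated with a short calculation. This is a minimal sketch only: it assumes the two barcodes have already been aligned to equal length with `-` marking gaps, and it excludes gapped columns from the denominator, which is one of several common conventions and not necessarily the one applied here. The function name `pairwise_identity` is hypothetical.

```python
def pairwise_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Columns in which either sequence has a gap ('-') are ignored, so the
    value is matches / compared columns * 100.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # skip gapped columns
        compared += 1
        if base_a == base_b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy example: one mismatch in four compared columns.
identity = pairwise_identity("ACGT", "ACGA")
```

Under this convention, a Sanger and an Illumina barcode that differ at a single position across a several-hundred-base alignment would still score above 99 %, whereas the values of 75.5 % to 85.7 % reported for three specimens correspond to mismatches at roughly one in every four to seven compared positions.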