Introduction
Recent advances in high-throughput sequencing technology are enabling a shift towards environmental DNA (eDNA)-based methods for biodiversity assessment and biosecurity monitoring [1]. While still in their infancy, these tools offer great promise for rapid and accessible biodiversity monitoring applications in terrestrial, aquatic, and marine ecosystems [2, 3]. However, a scarcity of accurately identified reference DNA sequence data from local biota [4] remains a significant obstacle to the application of these eDNA tools to biomonitoring, preventing the confident identification and interpretation of detected organisms.
DNA-based identification methods, regardless of application, rely on the determination of similarity between newly detected sequences and existing reference sequence data [4, 5]. Depending on target taxa, this requires representative, taxonomically validated data for established marker genes such as the ~650 bp region of cytochrome c oxidase subunit I (COI) for metazoans [6], or combinations of plastid regions (rbcL, matK, trnH–psbA) and the ribosomal internal transcribed spacer region (ITS) for plants [7]. Large open data sources, such as the GenBank nr database or BOLD [8], are typically employed as reference databases, but rely on data submitters for fidelity of sequence to organism and suffer from geographic sampling biases [9]. Further, this reliance on large pre-existing databases limits the emergence of new or taxon-specific markers, with individual studies utilising previously established markers even if they may be suboptimal for certain taxa [10]. Curated databases containing only sequences from a targeted ecosystem may result in improved accuracy of sequence identifications compared to a global database [11]. However, the current sparse database coverage of biodiversity from most ecosystems means that targeted reference databases typically must be populated with newly generated and locally relevant reference sequences.
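As a minimal illustration of the similarity-based identification described above, the following Python sketch assigns a query sequence to its closest reference by percent identity. It assumes sequences that are already aligned to equal length; the function names, the toy reference dictionary, and the 97% threshold are hypothetical choices made for illustration and are not part of the method presented here. Real workflows would typically rely on a dedicated alignment or search tool.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def best_match(query: str, references: dict, min_identity: float = 97.0):
    """Return (taxon, identity) for the most similar reference.

    The taxon is None when even the best hit falls below min_identity
    (an assumed, illustrative threshold).
    """
    best_taxon, best_identity = None, 0.0
    for taxon, seq in references.items():
        identity = percent_identity(query, seq)
        if identity > best_identity:
            best_taxon, best_identity = taxon, identity
    if best_identity < min_identity:
        return None, best_identity
    return best_taxon, best_identity


# Toy reference database (hypothetical sequences, far shorter than a real barcode).
toy_references = {
    "Taxon A": "ACGTACGTAC",
    "Taxon B": "ACGTTTGTAC",
}
```

The sketch also shows why sparse reference coverage matters: a query whose best hit falls below the threshold is left unidentified rather than being assigned to the nearest, possibly incorrect, taxon.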
Taxonomically validated reference sequences are difficult to generate. Not only do they require high levels of sequence accuracy (traditionally achieved via Sanger sequencing, more recently possible via PacBio HiFi technology [12]), but also accurate taxonomic identification of specimens. The former may be time-consuming, contingent on sample quality, and expensive, especially when applied to large numbers of specimens, while the latter requires specialist taxonomic expertise across taxa. For example, generating a reliable reference database for a previously uncharacterized insect fauna may require taxonomic skills spanning 24 distinct insect orders. Natural history museums and national biological collections, however, are unparalleled repositories of both invaluable taxonomic knowledge [13] and authoritatively identified genetic source material [14], with the potential to allow the efficient generation of taxonomically comprehensive and locally relevant reference DNA sequence databases [15]. Nevertheless, generating full-length DNA barcodes via Sanger sequencing from dried or historical specimens stored over long periods may be difficult due to DNA degradation and the low sensitivity of the sequencing approach, often resulting in only partial barcodes [15-18]. Furthermore, museum samples are often indispensable permanent records, and therefore unavailable for destructive DNA extraction. Non-destructive extraction [19, 20] and PCR [21] approaches can be effective, however, depending on the taxa being analysed, and multiplex PCR coupled with high-throughput DNA sequencing technologies has allowed the efficient recovery of barcodes from 50- to 100-year-old museum samples [12, 22], as well as recently collected specimens [23].
There is a pressing need to leverage museum collections for rapid and cost-effective generation of reference databases, in order to aid eDNA-based biodiversity monitoring [24]. Here, we present a fast, cost-effective, and efficient method for developing a reference COI database from a diverse selection of terrestrial invertebrates sourced from the New Zealand Arthropod Collection (NZAC). These taxonomically validated specimens represent a variety of field collection methods, specimen treatments, and storage conditions, as well as variable accessibility for destructive sampling. We demonstrate the use of a dual-indexing approach, in combination with a pair of overlapping short PCR amplicons suitable for sequencing on the Illumina MiSeq platform, for generating full-length barcodes from hundreds of invertebrate specimens simultaneously. We provide a taxonomy-informed bioinformatics pipeline for processing and filtering the sequence data and for rapidly assembling successful barcodes. Together, our approach represents a highly sensitive, accurate, and efficient method for targeted reference database generation, providing a foundation for DNA-based assessments and monitoring of biodiversity.
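The principle of assembling a full-length barcode from a pair of overlapping short amplicons can be sketched as follows. This is a simplified illustration only: it assumes error-free consensus sequences in the same orientation and an exact match across the overlap, and the function name and the minimum overlap length are hypothetical. It is not the bioinformatics pipeline described in this study.

```python
from typing import Optional


def merge_overlapping(left: str, right: str, min_overlap: int = 20) -> Optional[str]:
    """Join two amplicons where a suffix of `left` exactly matches a prefix of `right`.

    Tries the longest candidate overlap first and returns the merged sequence,
    or None when no exact overlap of at least `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            # Keep the shared region once, then append the rest of `right`.
            return left + right[k:]
    return None


# Toy amplicons (hypothetical, far shorter than real COI fragments).
five_prime = "AAACCCGGGTTT"
three_prime = "GGGTTTACGT"
barcode = merge_overlapping(five_prime, three_prime, min_overlap=4)
```

Requiring a minimum overlap guards against joining two fragments on a short, chance match; with real sequencing data, mismatches within the overlap would also need to be tolerated or resolved using base qualities.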