Introduction
We are in the middle of a biodiversity crisis, in which anthropogenic
change is driving many species to extinction, often faster than they can
be characterized (see e.g. Ceballos et al., 2020). The identification
of species in our environments is paramount to informing conservation
policy and practice. The development of DNA barcoding (Hebert et al.,
2003) was a major step towards large-scale characterizations of
biodiversity. This technique utilizes amplification of standardized
genetic regions to characterize species present within biological
samples. Besides the documentation of biodiversity, this method and
other amplicon-sequencing technologies have been widely used for
monitoring of invasive species, detection of pathogens in environmental
samples, and many other applications in taxonomy, medicine or
evolutionary biology (reviewed in e.g. Kress et al., 2015).
Third-generation sequencing can read millions of single
molecules up to several Mb in length (Jain et al., 2018). Currently,
two platforms are readily available for DNA barcoding efforts: the
Sequel II from Pacific Biosciences (PacBio) and the MinION from Oxford
Nanopore Technologies (ONT). These platforms offer the advantage of
longer reads, at the cost of higher sequencing error rates. While ONT’s
MinION still shows higher error rates (>5%; Wick et al., 2018), the new
PacBio HiFi mode allows for the generation of reads with <1%
error (Wenger et al., 2019), which will greatly improve the generation
of accurate DNA barcodes. Early on, researchers identified the potential
of third-generation sequencing platforms for sequencing much longer DNA
barcodes than previously possible (see e.g. Krehenwinkel et al.,
2019a; Tedersoo et al., 2018; Wurzbacher et al., 2019). Besides the
longer amplicon length, ONT’s MinION also offers the advantage that
sequencing can be carried out almost anywhere in the world, due to its
small size and affordability (reviewed in Krehenwinkel et al., 2019b).
While there has been a considerable software development effort to
assemble high-quality amplicon consensus sequences from error-prone ONT
MinION reads (see e.g. Maestri et al., 2019; Seah et al., 2020;
Srivathsan et al., 2019; reviewed in Krehenwinkel et al., 2019b), only a
few software solutions are available for PacBio-based DNA barcodes (see
e.g. Wurzbacher et al., 2019). To our knowledge, of these, only the
pipeline presented in Wurzbacher et al. (2019) is able to handle both
PacBio and ONT sequencing reads.
Here, we present NGSpeciesID, a one-software solution for reconstructing
high-quality amplicon consensus sequences for both PacBio and ONT
sequencing reads. We also investigate the performance of ONT’s Medaka
polishing software compared to Racon (Vaser et al., 2017) for
MinION-based DNA barcoding. Compared to other programs, NGSpeciesID can be
easily installed with conda, does not require any specific file name
structures, can handle data from both third-generation sequencing types,
includes different consensus polishing options, and only needs fastq
files as input. We show that our tool produces consensus sequences of
similar quality to those of other software solutions, while reducing the burden
to users by requiring little to no additional tools or data
reformatting.