Comparison to ONTrack
Next, we compared the performance of NGSpeciesID to the pipeline ONTrack
from Maestri et al. (2019). This pipeline first clusters all reads using
VSEARCH (Rognes et al., 2016), then randomly selects 200 reads, aligns
those with Mafft (Katoh and Standley, 2013), calls the consensus with
EMBOSS cons
(http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html), and
lastly carries out polishing with 200 randomly selected reads using
Nanopolish (https://github.com/jts/nanopolish). We generated consensus
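As an illustration of the consensus-calling step in such a pipeline, a majority-rule consensus over a multiple sequence alignment can be sketched as follows. This is a simplified stand-in for EMBOSS cons, not the ONTrack code; the function name and the gap handling are our own assumptions:

```python
from collections import Counter

def majority_consensus(aligned_reads, gap="-"):
    """Majority-rule consensus over equal-length aligned sequences.

    Simplified stand-in for EMBOSS cons: at each alignment column the
    most frequent character wins; columns where the gap character wins
    are dropped, so the returned consensus is ungapped.
    """
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)

# Toy alignment of four error-containing reads
reads = ["ACG-TACGT",
         "ACGATACGT",
         "ACG-TACCT",
         "ACG-AACGT"]
print(majority_consensus(reads))  # ACGTACGT
```

The per-column majority vote is what lets random sequencing errors cancel out across reads, provided enough reads support each position.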
sequences for all seven DNA barcodes from Maestri et al. (2019), which
comprise Cytochrome C Oxidase Subunit 1 (COI) sequences of two
snails and five beetles (Supplementary Table 1). We provide the
respective alignments in the Supplementary (Supplementary files 7-13).
Previously, Krehenwinkel et al. (2019a) showed that consensus accuracy
can decrease when too many reads (on the order of a few hundred reads,
depending on the error rate of the individual reads) are selected for
the consensus generation, likely due to a decrease in the
signal-to-noise ratio. We thus randomly subsampled 300 reads using seqtk
(https://github.com/lh3/seqtk), a number which has been shown to work
well with Nanopore data (Krehenwinkel et al., 2019a). The consensus
quality was comparable between the two tools (Table 2), with accuracies
ranging from 99.8% to 100%. In five of the seven DNA barcode sets both
tools performed equally well, while each tool outperformed the other in
one of the remaining two sets, differing by only a single base pair
(Table 2).
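The random subsampling step can be mimicked in plain Python. The following reservoir-sampling sketch draws a fixed number of reads in a single pass, analogous to running `seqtk sample` with a fixed seed; the function and variable names are illustrative and are not part of either pipeline:

```python
import random

def subsample_reads(records, n, seed=11):
    """Reservoir-sample n records in one pass (seqtk sample analogue).

    `records` is any iterable of reads (e.g. FASTQ records); a fixed
    seed makes the subsample reproducible across runs.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(rec)
        else:
            # Replace an existing entry with decreasing probability,
            # so every record ends up sampled uniformly at random.
            j = rng.randrange(i + 1)
            if j < n:
                reservoir[j] = rec
    return reservoir

# Toy example: subsample 300 "reads" out of 1000
reads = [f"read_{i}" for i in range(1000)]
sample = subsample_reads(reads, 300)
print(len(sample))  # 300
```

Because the reservoir algorithm streams over the input, it works even when the read file is too large to hold in memory, which is also why single-pass subsampling tools are practical for Nanopore runs.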