Ease of use
NGSpeciesID was designed to be straightforward to use. It works on individual read files, output either directly by the basecaller or after demultiplexing (e.g. using Minibar (Krehenwinkel et al. 2019a) or qcat (https://github.com/nanoporetech/qcat)), and can easily be run in a loop over multiple fastq files using a bash script (see Supplementary File 14). It requires only fastq files as input. In contrast, ONTrack requires the input reads in three formats (fast5, fasta and fastq), which entails additional preprocessing of the sequencing data. Furthermore, NGSpeciesID places no constraints on fastq file names, which makes it easy for the user to run the tool and to identify samples and replicates. This saves preprocessing time compared to other software solutions.
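The per-file loop described above can be sketched as follows. This is a minimal illustration, not the script from Supplementary File 14: the helper names build_commands and run_all are hypothetical, the *.fastq glob pattern is an assumption about how the files are named, and the flags (--ont, --consensus, --medaka, --fastq, --outfolder) are assumed to match the installed NGSpeciesID version and should be checked against NGSpeciesID --help. By default the sketch only prints the commands.

```python
from pathlib import Path
import subprocess


def build_commands(fastq_dir, out_root,
                   extra_args=("--ont", "--consensus", "--medaka")):
    """Return one NGSpeciesID command (as an argument list) per fastq file.

    The flags in extra_args are assumptions; verify them with
    `NGSpeciesID --help` for the installed version.
    """
    commands = []
    # Sort for a reproducible order; any file naming scheme is accepted.
    for fastq in sorted(Path(fastq_dir).glob("*.fastq")):
        # One output folder per input file, named after the file stem.
        outfolder = Path(out_root) / fastq.stem
        commands.append(
            ["NGSpeciesID", *extra_args,
             "--fastq", str(fastq),
             "--outfolder", str(outfolder)]
        )
    return commands


def run_all(fastq_dir, out_root, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in build_commands(fastq_dir, out_root):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if one sample fails.
            subprocess.run(cmd, check=True)
```

Calling run_all("reads/", "results/") lists the commands that would be run; passing dry_run=False executes them one sample at a time. Because the output folder is derived from the file stem, sample and replicate names carried in the file names are preserved in the results.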
NGSpeciesID filters reads by quality based on their Phred scores. However, we also recommend using NanoFilt (De Coster et al., 2018) to remove reads much shorter or longer than the intended target, which often represent chimeras or contamination, before running NGSpeciesID. While our tool can handle unfiltered data, this might result in the generation of multiple consensus sequences. NGSpeciesID also offers the option to remove priming sites from the amplicon sequences. As many universal primers include ambiguity codes, primer regions can include incorrect bases and should thus be removed. We further found that primer regions can cause issues for the reverse-complement matching. We thus included an additional reverse-complement matching step after primer removal, which is applied when NGSpeciesID outputs multiple consensus sequences. Our tool outputs multiple consensus sequences when the clustering yields more than one cluster containing over a certain percentage of the total reads (by default 10%). Each consensus sequence is polished only with the reads from its own cluster. This feature is useful because it allows the user to explore potential contaminant reads or mixed samples through the generation of multiple consensus sequences.
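Two of the steps above, length filtering and the abundance cut-off for reporting clusters, can be illustrated with a short sketch. It is neither NanoFilt's nor NGSpeciesID's implementation. The function names, the (name, sequence) record format and the inclusive handling of the boundary values are assumptions made for illustration; only the 10% default is taken from the text.

```python
def filter_by_length(records, min_len, max_len):
    """Keep (name, sequence) records with min_len <= length <= max_len.

    Reads far outside the expected amplicon length are dropped, since
    they often represent chimeras or contamination.
    """
    return [(name, seq) for name, seq in records
            if min_len <= len(seq) <= max_len]


def clusters_to_report(cluster_sizes, abundance_ratio=0.1):
    """Return the ids of clusters holding enough of the total reads.

    cluster_sizes maps a cluster id to its read count. Treating a
    cluster at exactly the threshold as passing is an assumption.
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        return []
    return [cid for cid, n in cluster_sizes.items()
            if n >= abundance_ratio * total]
```

For example, with cluster sizes of 80, 15 and 5 reads and the default ratio of 0.1, the first two clusters would each receive a consensus sequence, which is how a contaminant or a mixed sample would show up in the output.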
Both NGSpeciesID and the Mothur + Consension software solution can handle ONT and PacBio long-read data. While both tools produce consensus sequences of similar accuracy, Mothur + Consension requires in-depth knowledge of the pipeline: (i) the input data must be preprocessed, (ii) individual components of the pipeline must be run separately and (iii) the parameter settings are difficult to interpret. NGSpeciesID, in contrast, is designed to be user friendly and is packaged as a one-command solution.