Materials and Methods

Data collection

We examined the incidence of COI NUMTs in the genomes of marine animals on the NCBI genome browser (Clark et al. 2016). To identify candidate genomes, we compared taxonomic names in the World Register of Marine Species (WoRMS; Horton et al. 2020) with the NCBI genome browser (https://www.ncbi.nlm.nih.gov/genome/browse). All genomes for marine invertebrates were downloaded together with those for at least one species per order of marine vertebrates. When more than one genome was available for a species, the reference genome (if available) or the most recent assembly was selected. In addition, we downloaded the COI sequence from the mitochondrial genome of each species and used Aliview (Larson 2014) to extract the 658 bp recovered by primers targeting the barcode region (Hebert et al. 2003). When available, the reference sequence for the full COI gene was also retained. When a COI sequence was unavailable on GenBank, the Barcode of Life Database (BOLD; Ratnasingham and Hebert 2007) was searched for a sequence.

NUMT search and identification

We conducted BLAST searches for mitochondrial COI against the genome sequence available for each species using the 658 bp barcode region as the query. Using Geneious Prime (version 2020.2.1), we conducted a BLASTn search with a maximum of 1000 hits and a maximum expectation value of e = 0.0001 to generate a list of hits. We excluded BLASTn hits <100 bp in length, or those with both 100% coverage and ≥ 99.8% ID as these likely represented a mitochondrial sequence inadvertently included in the nuclear assembly.
The remaining hits were considered putative COI NUMTs and summary information (hit length, GC content, query coverage, percent similarity, e-value) were exported to Excel (Supp. Table 1). Using MUSCLE in Geneious Prime, each hit was aligned with the mitochondrial COI sequence for that species to visually search for any insertions and deletions. Each sequence was then translated using the appropriate mitochondrial code to determine if premature stop codons were present. The presence of indels or premature stop codons (IPSCs) at any position along the sequence was recorded for six hit length categories: 100–200 bp, 200–300 bp, 300–400 bp, 400–500 bp, 500–600 bp and 600–700 bp.
Since using a longer COI query length could reveal additional NUMTs beyond the 658 bp barcode region, we conducted a second BLAST search among the invertebrates in our dataset using the full-length (≥1500 bp) COI sequence when available. We used the same BLASTn parameters and strategy for identifying IPSCs as for the 658 bp query, and retained all hits >150 bp. This analysis made it possible to ascertain if certain regions of COI were more prone to incorporation into NUMTs by mapping hits to the reference COI sequence and then quantifying the coverage at each nucleotide position. We then plotted the coverage for all species in a particular phylum to determine the frequency with which each nucleotide position of COI appeared in a NUMT.

NUMT diagnosis

To quantify the incidence of NUMTs that presented an interpretational threat to metabarcoding, we ascertained the proportion of hits lacking diagnostic features at four sequence lengths (150, 300, 450, 600 bp) commonly recovered with HTS (Hebert et al., 2018; Reuter et al., 2016; Rhoads & Au, 2015) using the 658 bp COI query. HTS platforms generating short reads will recover all NUMTs, but only those with IPSCs in the target region will be diagnosed. For instance, platforms that generate 150 bp reads will capture all NUMTs ≥150 bp, but they will only be recognizable as NUMTs if they possess IPSCs within the first 150 bp. Accordingly, we considered hits diagnosable if they contained IPSCs in the sequence region recovered by the platform (i.e., 150, 300, 450, 600 bp). The mean number of both diagnosable and non-diagnosable hits per species was compared among the four sequence length categories using a Kruskal-Wallis rank sum test.
In addition, we examined the impact of NUMT divergence values on metabarcoding results, employing a standard sequence divergence threshold (2%) for delineating OTUs. Divergence values are only important for non-diagnosable NUMTs that are not excluded from downstream analysis. We therefore ascertained the proportion of undiagnosable NUMTs that would either inflate the OTU count (>2% divergence) or intraspecific barcode variation (<2% divergence).

Patterns of NUMT abundance among species

We examined the relationship between the number of hits (≥100 bp) and both genome size and the quality of the assembly (contig N50) reported on NCBI using Spearman’s rank correlations. In addition, we examined if species in certain phyla or those with differing ecological or life history traits were more likely to possess NUMTs. For these comparisons, we used results from the COI barcode query length (658 bp) and included all hits ≥100 bp. We compared the average number of hits per species among phyla and trophic groups using a Kruskal-Wallis rank sum tests, and between the presence and absence of other life history traits using Wilcoxon rank sum tests..
Ecological and life history information were primarily compiled from the Encyclopedia of Life (http://eol.org; accessed 12 Sept 2020), but additional information was obtained from the primary literature to fill gaps (see Supp. Data Table 2 for specific references). The traits examined included trophic category, mode of reproduction (asexual/sexual), hermaphroditism and colonialism. We recognized six trophic categories (predator/carnivore, grazer/herbivore, parasite, suspension feeder, omnivore, other) on the basis of adult feeding habits. The ‘suspension feeder’ category included passive suspension feeders, active filter feeders, and mucous net feeders. The ‘omnivore’ category included species with more than one equally prevalent trophic level or guild. ‘Other’ included a chemosymbiotroph, a surface deposit feeder, and two detritivores.