Materials and Methods
Data collection
We examined the incidence of COI NUMTs in the genomes of marine animals
on the NCBI genome browser (Clark et al. 2016). To identify candidate
genomes, we compared taxonomic names in the World Register of Marine
Species (WoRMS; Horton et al. 2020) with the NCBI genome browser
(https://www.ncbi.nlm.nih.gov/genome/browse). All genomes for marine
invertebrates were downloaded together with those for at least one
species per order of marine vertebrates. When more than one genome was
available for a species, the reference genome (if available) or the most
recent assembly was selected. In addition, we downloaded the COI
sequence from the mitochondrial genome of each species and used Aliview
(Larson 2014) to extract the 658 bp recovered by primers targeting the
barcode region (Hebert et al. 2003). When available, the reference
sequence for the full COI gene was also retained. When a COI sequence
was unavailable on GenBank, the Barcode of Life Database (BOLD;
Ratnasingham and Hebert 2007) was searched for a sequence.
NUMT search and
identification
We conducted BLAST searches for mitochondrial COI against the genome
sequence available for each species using the 658 bp barcode region as
the query. Using Geneious Prime (version 2020.2.1), we conducted a
BLASTn search with a maximum of 1000 hits and a maximum expectation
value of e = 0.0001 to generate a list of hits. We excluded BLASTn hits
<100 bp in length, or those with both 100% coverage and ≥
99.8% ID as these likely represented a mitochondrial sequence
inadvertently included in the nuclear assembly.
The remaining hits were considered putative COI NUMTs and summary
information (hit length, GC content, query coverage, percent similarity,
e-value) were exported to Excel (Supp. Table 1). Using MUSCLE in
Geneious Prime, each hit was aligned with the mitochondrial COI sequence
for that species to visually search for any insertions and deletions.
Each sequence was then translated using the appropriate mitochondrial
code to determine if premature stop codons were present. The presence of
indels or premature stop codons (IPSCs) at any position along the
sequence was recorded for six hit length categories: 100–200 bp,
200–300 bp, 300–400 bp, 400–500 bp, 500–600 bp and 600–700 bp.
Since using a longer COI query length could reveal additional NUMTs
beyond the 658 bp barcode region, we conducted a second BLAST search
among the invertebrates in our dataset using the full-length (≥1500 bp)
COI sequence when available. We used the same BLASTn parameters and
strategy for identifying IPSCs as for the 658 bp query, and retained all
hits >150 bp. This analysis made it possible to ascertain
if certain regions of COI were more prone to incorporation into NUMTs by
mapping hits to the reference COI sequence and then quantifying the
coverage at each nucleotide position. We then plotted the coverage for
all species in a particular phylum to determine the frequency with which
each nucleotide position of COI appeared in a NUMT.
NUMT diagnosis
To quantify the incidence of NUMTs that presented an interpretational
threat to metabarcoding, we ascertained the proportion of hits lacking
diagnostic features at four sequence lengths (150, 300, 450, 600 bp)
commonly recovered with HTS (Hebert et al., 2018; Reuter et al., 2016;
Rhoads & Au, 2015) using the 658 bp COI query. HTS platforms generating
short reads will recover all NUMTs, but only those with IPSCs in the
target region will be diagnosed. For instance, platforms that generate
150 bp reads will capture all NUMTs ≥150 bp, but they will only be
recognizable as NUMTs if they possess IPSCs within the first 150 bp.
Accordingly, we considered hits diagnosable if they contained IPSCs in
the sequence region recovered by the platform (i.e., 150, 300, 450, 600
bp). The mean number of both diagnosable and non-diagnosable hits per
species was compared among the four sequence length categories using a
Kruskal-Wallis rank sum test.
In addition, we examined the impact of NUMT divergence values on
metabarcoding results, employing a standard sequence divergence
threshold (2%) for delineating OTUs. Divergence values are only
important for non-diagnosable NUMTs that are not excluded from
downstream analysis. We therefore ascertained the proportion of
undiagnosable NUMTs that would either inflate the OTU count
(>2% divergence) or intraspecific barcode variation
(<2% divergence).
Patterns of NUMT abundance among
species
We examined the relationship between the number of hits (≥100 bp) and
both genome size and the quality of the assembly (contig N50) reported
on NCBI using Spearman’s rank correlations. In addition, we examined if
species in certain phyla or those with differing ecological or life
history traits were more likely to possess NUMTs. For these comparisons,
we used results from the COI barcode query length (658 bp) and included
all hits ≥100 bp. We compared the average number of hits per species
among phyla and trophic groups using a Kruskal-Wallis rank sum tests,
and between the presence and absence of other life history traits using
Wilcoxon rank sum tests..
Ecological and life history information were primarily compiled from the
Encyclopedia of Life (http://eol.org; accessed 12 Sept 2020), but
additional information was obtained from the primary literature to fill
gaps (see Supp. Data Table 2 for specific references). The traits
examined included trophic category, mode of reproduction
(asexual/sexual), hermaphroditism and colonialism. We recognized six
trophic categories (predator/carnivore, grazer/herbivore, parasite,
suspension feeder, omnivore, other) on the basis of adult feeding
habits. The ‘suspension feeder’ category included passive suspension
feeders, active filter feeders, and mucous net feeders. The ‘omnivore’
category included species with more than one equally prevalent trophic
level or guild. ‘Other’ included a chemosymbiotroph, a surface deposit
feeder, and two detritivores.