Discussion

This study evaluated the incidence and attributes of NUMTs in the genomes of 156 species of marine animals and considered the interpretational complexities introduced for biodiversity assessments using metabarcoding. Considering the 85 invertebrate species with a full-length COI sequence, most (58.4%) NUMTs were < 300 bp. This was also true for NUMTs derived from the 658 bp barcode region, where 51.2% were <300 bp. When hits with IPSCs were excluded, 72.5% of NUMTs were removed. On the other hand, when reads <300 bp were retained, NUMTs often lacked these diagnostic features. When reads were clustered into OTUs based on a 2% divergence threshold, these undiagnosable NUMTs inflated apparent species richness by up to 1.57x relative to the true number of species and barcode variation in these species by up to 1.19x. No significant difference in NUMT incidence was detected among species in different phyla, trophic categories or breeding systems, but the strength of these tests was limited by the low number of genomes available for analysis. Based on current information, NUMTs pose a substantial challenge for metabarcoding in marine environments, particularly when short amplicons are targeted as is often the case in dietary studies (Berry et al., 2017; da Silva et al., 2019).

Hit length and distribution

NUMTs did not appear to be evenly distributed along the COI gene, but it seems unlikely that certain regions are more prone to incorporation. NUMTs arise from two processes: novel insertions into the nuclear genome and post-insertion duplication and translocation (Bensasson et al., 2001; Hazkani-Covo et al., 2010; Richly & Leister, 2004). The uneven distribution we observed may be driven by NUMT duplications following translocation from the mitochondrial genome rather than by hotspots prone to NUMT integration (Calabrese et al., 2017). Confirming which mechanism is responsible for a particular NUMT requires detailed examination of mutation patterns and NUMT attributes in allied species (Behura, 2007; Bensasson et al., 2001).
The apparent biomodality of sequence lengths for NUMTs identified by both the 658 bp and 1500 bp queries is likely artefactual. While short NUMTs certainly predominate, the secondary peak of long NUMTs reflects the fact that any NUMT extending beyond the query length is included in the longest size category. If NUMTs were evaluated for entire mitochondrial genomes rather just the barcode region, the bimodal pattern would undoubtedly disappear.

NUMT diagnosis

Our results indicate that the impacts of NUMTs can be greatly reduced by targeted longer amplicons because short NUMTs are excluded and more IPSCs are exposed. As a result, the exclusion of sequences with IPSCs is a key bioinformatics step (Buhay, 2009; Kunz et al., 2019). In our analysis, screening for IPSCs failed to significantly reduce the NUMT count for the shortest read length (150 bp), but many NUMTs were found to possess an IPSCwith longer reads. Porter and Hajibabai (2021) further showed that short (~300 bp) NUMTs are also less likely to contain other diagnostic features such as inappropriate amino acid substitutions that can identified via bioinformatic pipelines. Accordingly, NUMTs pose the highest risk to eDNA studies where amplicons range from 50-400 bp (Langlois, Allison, Bergman, To, & Helbing, 2021), dietary analyses where amplicons are 70-230 bp (Berry et al., 2017; da Silva et al., 2019) and marine metabarcoding studies which commonly target a 313 bp region of COI (Leray et al., 2013), while studies targeting the full 658 bp barcode region face a lower risk of NUMT misinterpretation. However, the analysis of longer amplicons does not eliminate the problem as 29.9% of >600 bp NUMTs lacked an IPSC, reinforcing a pattern seen in other taxa (Antunes & Ramos, 2005; Kunz et al., 2019; Richly & Leister, 2004). Furthermore, NUMTs of any length can be problematic, even when their incidence in the template pool is low, if primers have higher affinity for them than for the mitochondrial target (Kim, Lee, & Ju, 2013; Kunz et al., 2019).
Unrecognized NUMTs misclassified as distinct taxa (species, OTUs, ESVs, haplotypes) inflate diversity estimates generated by metabarcoding. If all NUMTs with >2% divergence in our dataset were mistaken for a distinct OTU, alpha diversity would have been inflated by1.57x. Conversely, if all NUMTs with <2% divergence were assumed to reflect haplotype diversity within a taxon, barcode variation would have been inflated by 1.19x. If intraspecific variation is not under investigation, grouping NUMTs within an OTU will lessen overestimation of taxon richness by eliminating those NUMTs with the lowest divergence which are also least likely to posess an IPSC. However, the latter approach may not outweigh the benefits of using ESVs such as improved resolution and their value as intrinsic units of biological diversity (Callahan, McMurdie, & Holmes, 2017). Whether OTUs or ESVs are used, sequence arrays containing NUMTs can still indicate beta diversity if NUMTs are either uncommon or consistently recovered (Porter & Hajibabaei, 2021).

Patterns of NUMT abundance among species

The incidence of NUMTs is influenced by diverse factors. Some are associated with genome size (e.g., frequency of double-stranded breaks, duplication via repetitive elements), but others are not (e.g., mitochondrial damage, cellular stress) so the linkage between the incidence of NUMTs and genome size is unpredictable (Antunes & Ramos, 2005; Hazkani-Covo et al., 2010; Ricchetti et al., 2004; Richly & Leister, 2004; Gerstein and Zheng, 2006;Wang et al., 2020). We detected a significant positive correlation between genome size and NUMT count, but the low coefficient of determination (ρ = 0.33) helps to explain the variable associations noted in earlier studies. As NUMT frequency varies by marker (Baeza & Fuentes, 2013), a search for their incidence across the entire mitochondrial genome would better evaluate the linkage between the prevalence of NUMTs and genome size.
The quality of genome assemblies, as indicated by contig N50, varied 117,000-fold among the 156 species examined in our study. High quality assemblies possessed significantly fewer NUMTs than lower quality assemblies, a result contrary to previous findings (Hazkani-Covo et al., 2010; Richly & Leister, 2004). Typically, NUMT counts are elevated in high quality assemblies because of their inadvertent exclusion from draft assemblies. For example, Hazkani-Covo et al. (2010) reported that NUMT counts doubled between low and high quality assemblies of 18 eukaryotes, including a 20-fold increase in Drosophila melanogaster and a 3-fold rise in Takifugu rubripes . Since NUMT detection is sensitive to search strategy, genome curation, and level of completion, estimates of NUMT prevalence will need to be updated as the quality of genome assemblies improves. Because the quality of current genome assemblies for most marine animals are low (mean contig N50: 340,000; median: 16,000), our NUMT counts may be substantial underestimates.
Other potential sources of errors in NUMT enumeration include the recovery of COI from bacterial symbionts or other contaminants, heteroplasmy, or PCR/sequencing errors (Baeza & Fuentes, 2013; Song et al., 2008). For example, a recent genome assembly for a nematode inadvertently included many sequences of bacterial origin (Schiffer et al., 2019; Thorne, Kagoshima, Clark, Marshall, & Wharton, 2014). Our study revealed many NUMTs in several molluscs, including bivalves and cephalopods which employ doubly uniparental inheritance of mitochondrial DNA (Doucet-Beaupré et al., 2010; Shizas, 2012; Zouros, Ball, Saavedra, & Freeman, 1994) and other mechanisms that add similar diversity (Strugnell & Lindgren, 2007). Because of such factors, causes aside from NUMTs can contribute to perceived COI diversity (Shizas, 2012).
While our analyses did not reveal significant differences in NUMT counts among phyla, the highest numbers were found in arthropods, reinforcing evidence that their genomes often possess many NUMTs (Song et al., 2008). This fact is well-documented for decapods (Baeza & Fuentes, 2013; Gíslason, Svavarsson, Halldórsson, & Pálsson, 2013; Kim et al., 2013; Williams & Knowlton, 2001; Yuan et al., 2017), a group with high NUMT counts in our study. The capacity to rigorously test for variation in the incidence of NUMTs among phyla was constrained because most (12/17) had only a few genomes available for analysis.
Although no association was evident between NUMT counts and trophic category or breeding system, a parasitic copepod, Lepeophtheirus salmonis , had the highest NUMT count (50). In bacteria, pseudogenes are more common among parasites (Lawrence & Hendrix, 2001), and some parasitoid wasps contain many NUMTs, possibly linked to a high incidence of structural rerarrangements in their mitochondria (Yan et al., 2019). However, many marine organisms have complex life history patterns that include multiple trophic levels and reproductive strategies, complicating the detection of linkages with genome dynamics. Patterns may be apparent once more genomes have been sequenced.

Mitigating the impacts of NUMTs

Because NUMTs are often difficult to recognize, it is worth considering approaches that would minimize their impacts on metabarcoding. Methodological shifts offer one solution. For example, density gradient centrifugation can provide highly purified mitochondrial DNA for PCR (Cristescu, 2019; Deiner et al., 2017; Tsuri et al., 2021), but it requires freshly collected samples (Zhou et al., 2013). Reverse transcription PCR (RT-PCR) offers another powerful option as sequences are only recovered from mitochondrial DNA because NUMTs are not transcribed, and this method can be employed on properly preserved samples. (Bensasson et al., 2001; Porter & Hajibabaei, 2021; Song et al., 2008; D. Wang et al., 2019).
Aside from adjusting wet lab protocols, NUMTs can be excluded by more rigorous bioinfomatic screening (Porter & Hajibabaei, 2021). Tree-based methods that identify NUMTs based on their divergence from the mitochondrial copy can aid studies on a particular taxonomic group (Creedy et al., 2019), but are ineffective for work on communities because taxonomic diversity is high. Read count thresholds can remove 90-95% of NUMTs while retaining most legitimate mtDNA variants (Andújar et al., 2020). Pipelines that incorporate amino acid analysis into their filtering parameters can further improve the recognition of NUMTs (Kunz et al., 2019; Nugent et al., 2020; Porter & Hajibabaei, 2020; Porter & Hajibabaei, 2021). For example, Coil (Nugent et al., 2020) and METAWORKS (Porter & Hajibabaei, 2020) employ profile hidden Markov model (PHMM) analysis to compare the features of an unknown sequence with a translated profile built from a multiple sequence alignment for each taxonomic group to highlight frameshift errors. Because the COI protein has highly conserved amino acids at certain positions, sequences that lead to an amino acid substitution at these sites are likely to be NUMTs (Buhay, 2009; Hebert et al., 2003; Kunz et al., 2019).
Despite the potential of these defensive measures to minimize the impact of NUMTs, comprehensive, well-curated DNA barcode reference libraries are needed to support metabarcoding studies in marine environments (Bucklin, Steinke, & Blanco-Bercial, 2011; Leray & Knowlton, 2016). At present, NUMTs are typically discarded, but their retention in the reference library would aid data interpretation (Matzen da Silva et al., 2011), especially those derived from common, large-bodied species. There are two paths to develop a well-parameterized reference library that includes both COI sequences and their NUMT derivatives. The first involves the assembly of high quality nuclear and mitochondrial genome sequences for all marine species. While this is an important goal (Lewin et al. 2018), its completion for all marine species will be a multi-decadal task. A much faster path involves the collection of fresh specimens and their analysis via RT-PCR to obtain mtCOI sequences, followed by PCR to amplify both mtCOI and NUMTs from the same specimen. This approach has the potential to characterize thousands of species in a single HTS run. Although the analysis of multiple markers can help to resolve ambiguities when NUMTs complicate results (Antunes & Ramos, 2005; Kunz et al., 2019; Richly & Leister, 2004; Song et al., 2008; van der Loos & Nijland, 2020), the challenge in developing a well-parameterized reference library for COI reinforces the need to focus on this task rather than on broadening analysis to additional genes.
DNA metabarcoding has gained rapid adoption due to its capacity to document biodiversity patterns in unprecedented detail at ever-decreasing costs (Taberlet, Coissac, Pompanon, Brochmann, & Willerslev, 2012; van der Loos & Nijland, 2020). At a time when marine biodiversity is in rapid decline (Dulvy, Sadovy, & Reynolds, 2003; Halpern, Selkoe, Micheli, & Kappel, 2007; WWF, 2020), it is critical that the metabarcoding results used to inform conservation and management decisions be reliable. Hence, it is essential to develop the informatics tools and analytical approaches to minimize the impacts of NUMTs on biodiversity assessments. Since marine organisms exhibit tremendous variation in the structure and organization of their genomes (Burger, Jackson, & Waller, 2012; Lavrov & Pett, 2016), further exploration and acknowledgment of NUMT prevalence is not only necessary for reliable metabarcoding data, but will likely inform evolutionary insights regarding genome dynamics across the tree of life (Zhang & Hewitt, 1996).