Introduction

The interplay between the nuclear and mitochondrial genomes is a defining feature of eukaryotes. One aspect of these interactions is the transfer of segments of mitochondrial DNA (mtDNA) through the nuclear membrane and their subsequent incorporation into the nuclear genome during the repair of double-stranded breaks (Blanchard & Schmidt, 1996; du Buy & Riley, 1967; Lopez, Yuhki, Masuda, Modi, & O’Brien, 1994). The resultant pseudogenes, termed NUMTs, are prevalent throughout the animal kingdom (Bensasson, Zhang, Hartl, & Hewitt, 2001; Hazkani-Covo, Zeller, & Martin, 2010; Richly & Leister, 2004), but the factors responsible for their considerable variation in abundance among species is uncertain (Bensasson et al., 2001; Hazkani-Covo et al., 2010; Richly & Leister, 2004; Song, Buhay, Whiting, & Crandall, 2008). Some studies have reported a correlation between genome size and NUMT counts (Bensasson et al., 2001; Hazkani-Covo et al., 2010), but others have not (Gerstein & Zheng, 2006; Richly & Leister, 2004; Wang, Liu, Miao, Huang, & Xiao, 2020). NUMT frequency has also been linked to environmental factors (Gerstein & Zheng, 2006; Ricchetti, Tekaia, & Dujon, 2004; Song, Jiang, Yuan, Guo, & Miao, 2013), cellular stress (Bensasson et al., 2001; Hazkani-Covo et al., 2010), and population dynamics (Antunes & Ramos, 2005; Deceliere, Charles, & Biémont, 2005)
The evolutionary implications of NUMTs have attracted considerable attention (Balakirev & Ayala, 2003). Because rates of nucleotide substitution are typically much slower in nuclear than mitochondrial genomes (Bensasson et al., 2001; Lopez, Culver, Stephens, Johnson, & O’Brien, 1997), each NUMT is a relict of its mitochondrial ancestor. As a result, they provide useful genetic markers (Harrison et al., 2002; Ricchetti et al., 2004) which can help to resolve phylogenies (Ko et al., 2015), identify cases of hybridization (Machida & Lin, 2017), and clarify genome dynamics (Matzen da Silva et al., 2011; Zhang & Hewitt, 1996). Although their evolution is slowed, NUMT sequences are not frozen, and, because they are not transcribed (Bensasson et al., 2001; Boore, 1999), there is no selection for a functional gene product. As a result, NUMTs accumulate indels leading to frameshifts, premature stop codons, and sequence changes coding for amino acid substitutions that would compromise functionality of the gene product (Bensasson et al., 2001; Morgan et al., 2013; Zhang & Hewitt, 1996). NUMTs with such changes are readily diagnosed (Buhay, 2009), but those lacking these features (Bensasson et al., 2001) create complexities if they are mistaken for mtDNA (Hazkani-Covo et al., 2010; Zhang & Hewitt, 1996).
PCR-based approaches run a particular risk of encountering this interpretational complexity because all gene regions with sequence congruence to a particular primer set are recovered whether they derive from the target mitochondrial gene or its NUMTs. Past studies have shown that undiagnosed NUMTs can lead to erroneous phylogenetic trees, distorting evolutionary relationships (Calabrese et al., 2017; Haran, Koutroumpa, Magnoux, Roques, & Roux, 2015). In addition, NUMTs can create complexities for DNA barcoding (Hebert, Cywinska, Ball, & DeWaard, 2003) as it employs sequence diversity in mtCOI as a basis for specimen identification and species discovery in the animal kingdom. In this case, unrecognized NUMTs can inflate both apparent species or haplotype richness and measures of intraspecific variation at COI (Buhay, 2009; Creedy et al., 2019; Song et al., 2008). The extent of these risks depends upon the analytical protocol. When PCR targets DNA extracts from single specimens, most amplicons derive from mtCOI because its copy number is far higher than those of any NUMT(s) (Bogenhagen, 2012; Quiros, Goyal, Jha, & Auwerx, 2017). Because they represent a small fraction of the amplicon pool in this situation, NUMTs rarely impede recovery of the mt COI sequence with Sanger analysis (Hebert, Penton, Burns, Janzen, & Hallwachs, 2004). However, because HTS platforms characterize single molecules, they do recover NUMTs, even when they comprise a small proportion of the amplicon pool. When analysis targets DNA extracts from single individuals, NUMT recovery can be minimized by excluding sequences with low copy numbers. However, NUMTs pose a greater threat for metabarcoding studies because this protocol involves the amplification of DNA extracts from bulk collections (Andújar, Arribas, Yu, Vogler, & Emerson, 2018; Andújar et al., 2020; Elbrecht, Vamos, Steinke, & Leese, 2018; Liu, Clarke, Baker, Jordan, & Burridge, 2019). In this case, NUMTs from species representing a large proportion of the sample biomass can be abundant, creating the prospect of interpretational errors when unrecognized (Buhay, 2009; Kunz, Tay, Elfekih, Gordon, & De Barro, 2019).
Because of this impact of species biomass, sequence count is a weak criterion for excluding COI NUMTs recovered through metabarcoding (Andújar et al., 2018, 2020; Elbrecht et al., 2018; Liu et al., 2019). Stringent quality filters are also of no value because NUMT and mtCOI templates should be equally susceptible to PCR and sequencing errors (Andújar et al., 2020; Elbrecht et al., 2018). Divergence values also cannot be used as a diagnostic feature (Baeza & Fuentes, 2013; Matzen da Silva et al., 2011) because NUMTs and mtDNA show highly variable divergences (Andújar et al., 2020). Two other diagnostic approaches are more useful: 1) many NUMTs are a truncated version of the COI gene (Richly & Leister, 2004), and 2) they often possess diagnostic sequence changes (i.e., indels that lead to frameshifts or premature stop codons). Recognition of these diagnostic features is optimized when the full 658 bp barcode region is recovered, but most HTS platforms generate considerably shorter reads (Reuter, Spacek, & Snyder, 2016; Rhoads & Au, 2015). NUMTs create interpretational complexity when they meet two criteria:1) their length exceeds that of the target amplicon, and 2) they lack frameshift indels or premature stop codons. In such cases, NUMTs with divergences exceeding the threshold (e.g., 2%) used to define operational taxonomic units (OTUs; a species proxy) or recognize species (Alberdi, Aizpurua, Gilbert, & Bohmann, 2018; Ratnasingham & Hebert, 2013) will lead to overestimation of the OTU or species count while NUMTs with lower divergences will inflate estimates of intraspecific COI variation.
NUMTs are likely to pose a particular challenge for metabarcoding studies in marine environments because their resident species are both phylogenetically diverse and poorly represented in DNA barcode reference libraries (Leray & Knowlton, 2016; Radulovici, Archambault, & Dufresne, 2010). For example, a single marine habitat can include representatives from two thirds of all animal phyla (Zeppilli et al., 2015), and most of these species will be undescribed (Appeltans et al., 2012). As a result, marine metabarcoding studies often encounter many sequences that cannot be connected to a species represented in the barcode reference library. Do these sequences represent newly encountered species or NUMTs from known taxa? The risk of encountering NUMTs is certain as they occur in diverse marine lineages including cnidarians (Song et al., 2013), crustaceans (Bensasson et al., 2001; Nguyen, Murphy, & Austin, 2002; Williams & Knowlton, 2001), echinoderms (Jacobs et al., 1983), fishes (Antunes & Ramos, 2005; Morgan et al., 2013), nematodes (Derycke, Vanaverbeke, Rigaux, Backeljau, & Moens, 2010), and tunicates (Ahmed & Ali, 2016). Despite this fact, there has not been a comprehensive effort to ascertain the incidence and attributes of NUMTs in marine taxa, information that would be useful in evaluating the interpretational complexity introduced by them and how best to defend against such impacts.
The present study begins to provide this information by examining the prevalence of COI NUMTs in marine animals and considers their implications for both estimates of richness and measures of intraspecific variation at COI. Analysis began with the quantification of COI NUMTs derived from any segment of the full (circa 1500 bp) COI gene in 85 marine species to ascertain if certain regions are more prone to incorporation in the nuclear genome. Analysis then focused on determining the incidence and attributes of NUMTs derived from the 658 bp barcode region of COI in the genomes of 156 marine species. Particular effort was directed toward ascertaining the impact of target sequence length on the NUMT count, and on the incidence of those with diagnostic traits (indels, stop codons). Because read lengths vary with analytical protocol and HTS platform, we considered four commonly recovered amplicon lengths (150 bp, 300 bp, 450 bp, 600 bp) (Hebert et al., 2018; Reuter et al., 2016; Rhoads & Au, 2015). Finally, we examined the correlation between the incidence of NUMTs and genome size, and investigated if NUMT counts varied among phyla, trophic groups, or life history traits.