Introduction
The interplay between the nuclear and mitochondrial genomes is a
defining feature of eukaryotes. One aspect of these interactions is the
transfer of segments of mitochondrial DNA (mtDNA) through the nuclear
membrane and their subsequent incorporation into the nuclear genome
during the repair of double-stranded breaks (Blanchard & Schmidt, 1996;
du Buy & Riley, 1967; Lopez, Yuhki, Masuda, Modi, & O’Brien, 1994).
The resultant pseudogenes, termed NUMTs, are prevalent throughout the
animal kingdom (Bensasson, Zhang, Hartl, & Hewitt, 2001; Hazkani-Covo,
Zeller, & Martin, 2010; Richly & Leister, 2004), but the factors
responsible for their considerable variation in abundance among species
is uncertain (Bensasson et al., 2001; Hazkani-Covo et al., 2010; Richly
& Leister, 2004; Song, Buhay, Whiting, & Crandall, 2008). Some studies
have reported a correlation between genome size and NUMT counts
(Bensasson et al., 2001; Hazkani-Covo et al., 2010), but others have not
(Gerstein & Zheng, 2006; Richly & Leister, 2004; Wang, Liu, Miao,
Huang, & Xiao, 2020). NUMT frequency has also been linked to
environmental factors (Gerstein & Zheng, 2006; Ricchetti, Tekaia, &
Dujon, 2004; Song, Jiang, Yuan, Guo, & Miao, 2013), cellular stress
(Bensasson et al., 2001; Hazkani-Covo et al., 2010), and population
dynamics (Antunes & Ramos, 2005; Deceliere, Charles, & Biémont, 2005)
The evolutionary implications of NUMTs have attracted considerable
attention (Balakirev & Ayala, 2003). Because rates of nucleotide
substitution are typically much slower in nuclear than mitochondrial
genomes (Bensasson et al., 2001; Lopez, Culver, Stephens, Johnson, &
O’Brien, 1997), each NUMT is a relict of its mitochondrial ancestor. As
a result, they provide useful genetic markers (Harrison et al., 2002;
Ricchetti et al., 2004) which can help to resolve phylogenies (Ko et
al., 2015), identify cases of hybridization (Machida & Lin, 2017), and
clarify genome dynamics (Matzen da Silva et al., 2011; Zhang & Hewitt,
1996). Although their evolution is slowed, NUMT sequences are not
frozen, and, because they are not transcribed (Bensasson et al., 2001;
Boore, 1999), there is no selection for a functional gene product. As a
result, NUMTs accumulate indels leading to frameshifts, premature stop
codons, and sequence changes coding for amino acid substitutions that
would compromise functionality of the gene product (Bensasson et al.,
2001; Morgan et al., 2013; Zhang & Hewitt, 1996). NUMTs with such
changes are readily diagnosed (Buhay, 2009), but those lacking these
features (Bensasson et al., 2001) create complexities if they are
mistaken for mtDNA (Hazkani-Covo et al., 2010; Zhang & Hewitt, 1996).
PCR-based approaches run a particular risk of encountering this
interpretational complexity because all gene regions with sequence
congruence to a particular primer set are recovered whether they derive
from the target mitochondrial gene or its NUMTs. Past studies have shown
that undiagnosed NUMTs can lead to erroneous phylogenetic trees,
distorting evolutionary relationships (Calabrese et al., 2017; Haran,
Koutroumpa, Magnoux, Roques, & Roux, 2015). In addition, NUMTs can
create complexities for DNA barcoding (Hebert, Cywinska, Ball, &
DeWaard, 2003) as it employs sequence diversity in mtCOI as a basis for
specimen identification and species discovery in the animal kingdom. In
this case, unrecognized NUMTs can inflate both apparent species or
haplotype richness and measures of intraspecific variation at COI
(Buhay, 2009; Creedy et al., 2019; Song et al., 2008). The extent of
these risks depends upon the analytical protocol. When PCR targets DNA
extracts from single specimens, most amplicons derive from mtCOI because
its copy number is far higher than those of any NUMT(s) (Bogenhagen,
2012; Quiros, Goyal, Jha, & Auwerx, 2017). Because they represent a
small fraction of the amplicon pool in this situation, NUMTs rarely
impede recovery of the mt COI sequence with Sanger analysis (Hebert,
Penton, Burns, Janzen, & Hallwachs, 2004). However, because HTS
platforms characterize single molecules, they do recover NUMTs, even
when they comprise a small proportion of the amplicon
pool. When analysis targets DNA
extracts from single individuals, NUMT recovery can be minimized by
excluding sequences with low copy numbers. However, NUMTs pose a greater
threat for metabarcoding studies because this protocol involves the
amplification of DNA extracts from bulk collections (Andújar, Arribas,
Yu, Vogler, & Emerson, 2018; Andújar et al., 2020; Elbrecht, Vamos,
Steinke, & Leese, 2018; Liu, Clarke, Baker, Jordan, & Burridge, 2019).
In this case, NUMTs from species representing a large proportion of the
sample biomass can be abundant, creating the prospect of
interpretational errors when unrecognized (Buhay, 2009; Kunz, Tay,
Elfekih, Gordon, & De Barro, 2019).
Because of this impact of species biomass, sequence count is a weak
criterion for excluding COI NUMTs recovered through metabarcoding
(Andújar et al., 2018, 2020; Elbrecht et al., 2018; Liu et al., 2019).
Stringent quality filters are also of no value because NUMT and mtCOI
templates should be equally susceptible to PCR and sequencing errors
(Andújar et al., 2020; Elbrecht et al., 2018).
Divergence values also cannot be
used as a diagnostic feature (Baeza & Fuentes, 2013; Matzen da Silva et
al., 2011) because NUMTs and mtDNA show highly variable divergences
(Andújar et al., 2020). Two other diagnostic approaches are more useful:
1) many NUMTs are a truncated version of the COI gene (Richly &
Leister, 2004), and 2) they often possess diagnostic sequence changes
(i.e., indels that lead to frameshifts or premature stop codons).
Recognition of these diagnostic features is optimized when the full 658
bp barcode region is recovered, but most HTS platforms generate
considerably shorter reads (Reuter, Spacek, & Snyder, 2016; Rhoads &
Au, 2015). NUMTs create interpretational complexity when they meet two
criteria:1) their length exceeds that of the target amplicon, and 2)
they lack frameshift indels or premature stop codons. In such cases,
NUMTs with divergences exceeding the threshold (e.g., 2%) used to
define operational taxonomic units (OTUs; a species proxy) or recognize
species (Alberdi, Aizpurua, Gilbert, & Bohmann, 2018; Ratnasingham &
Hebert, 2013) will lead to overestimation of the OTU or species count
while NUMTs with lower divergences will inflate estimates of
intraspecific COI variation.
NUMTs are likely to pose a particular challenge for metabarcoding
studies in marine environments because their resident species are both
phylogenetically diverse and poorly represented in DNA barcode reference
libraries (Leray & Knowlton, 2016; Radulovici, Archambault, &
Dufresne, 2010). For example, a single marine habitat can include
representatives from two thirds of all animal phyla (Zeppilli et al.,
2015), and most of these species will be undescribed (Appeltans et al.,
2012). As a result, marine metabarcoding studies often encounter many
sequences that cannot be connected to a species represented in the
barcode reference library. Do these sequences represent newly
encountered species or NUMTs from known taxa? The risk of encountering
NUMTs is certain as they occur in diverse marine lineages including
cnidarians (Song et al., 2013), crustaceans (Bensasson et al., 2001;
Nguyen, Murphy, & Austin, 2002; Williams & Knowlton, 2001),
echinoderms (Jacobs et al., 1983), fishes (Antunes & Ramos, 2005;
Morgan et al., 2013), nematodes (Derycke, Vanaverbeke, Rigaux,
Backeljau, & Moens, 2010), and tunicates (Ahmed & Ali, 2016). Despite
this fact, there has not been a comprehensive effort to ascertain the
incidence and attributes of NUMTs in marine taxa, information that would
be useful in evaluating the interpretational complexity introduced by
them and how best to defend against such impacts.
The present study begins to provide this information by examining the
prevalence of COI NUMTs in marine animals and considers their
implications for both estimates of richness and measures of
intraspecific variation at COI. Analysis began with the quantification
of COI NUMTs derived from any segment of the full (circa 1500 bp) COI
gene in 85 marine species to ascertain if certain regions are more prone
to incorporation in the nuclear genome. Analysis then focused on
determining the incidence and attributes of NUMTs derived from the 658
bp barcode region of COI in the genomes of 156 marine species.
Particular effort was directed toward ascertaining the impact of target
sequence length on the NUMT count, and on the incidence of those with
diagnostic traits (indels, stop codons). Because read lengths vary with
analytical protocol and HTS platform, we considered four commonly
recovered amplicon lengths (150 bp, 300 bp, 450 bp, 600 bp) (Hebert et
al., 2018; Reuter et al., 2016; Rhoads & Au, 2015). Finally, we
examined the correlation between the incidence of NUMTs and genome size,
and investigated if NUMT counts varied among phyla, trophic groups, or
life history traits.