Discussion
This study evaluated the incidence and attributes of NUMTs in the
genomes of 156 species of marine animals and considered the
interpretational complexities introduced for biodiversity assessments
using metabarcoding. Considering the 85 invertebrate species with a
full-length COI sequence, most (58.4%) NUMTs were < 300 bp.
This was also true for NUMTs derived from the 658 bp barcode region,
where 51.2% were <300 bp. When hits with IPSCs were excluded,
72.5% of NUMTs were removed. On the other hand, when reads
<300 bp were retained, NUMTs often lacked these diagnostic
features. When reads were clustered into OTUs based on a 2% divergence
threshold, these undiagnosable NUMTs inflated apparent species richness
by up to 1.57x relative to the true number of species and barcode
variation in these species by up to 1.19x. No significant difference in
NUMT incidence was detected among species in different phyla, trophic
categories or breeding systems, but the strength of these tests was
limited by the low number of genomes available for analysis. Based on
current information, NUMTs pose a substantial challenge for
metabarcoding in marine environments, particularly when short amplicons
are targeted as is often the case in dietary studies (Berry et al.,
2017; da Silva et al., 2019).
Hit length and distribution
NUMTs did not appear to be evenly distributed along the COI gene, but it
seems unlikely that certain regions are more prone to incorporation.
NUMTs arise from two processes: novel insertions into the nuclear genome
and post-insertion duplication and translocation (Bensasson et al.,
2001; Hazkani-Covo et al., 2010; Richly & Leister, 2004). The uneven
distribution we observed may be driven by NUMT duplications following
translocation from the mitochondrial genome rather than by hotspots
prone to NUMT integration (Calabrese et al., 2017). Confirming which
mechanism is responsible for a particular NUMT requires detailed
examination of mutation patterns and NUMT attributes in allied species
(Behura, 2007; Bensasson et al., 2001).
The apparent biomodality of sequence lengths for NUMTs identified by
both the 658 bp and 1500 bp queries is likely artefactual. While short
NUMTs certainly predominate, the secondary peak of long NUMTs reflects
the fact that any NUMT extending beyond the query length is included in
the longest size category. If NUMTs were evaluated for entire
mitochondrial genomes rather just the barcode region, the bimodal
pattern would undoubtedly disappear.
NUMT diagnosis
Our results indicate that the impacts of NUMTs can be greatly reduced by
targeted longer amplicons because short NUMTs are excluded and more
IPSCs are exposed. As a result, the exclusion of sequences with IPSCs is
a key bioinformatics step (Buhay, 2009; Kunz et al., 2019). In our
analysis, screening for IPSCs failed to significantly reduce the NUMT
count for the shortest read length (150 bp), but many NUMTs were found
to possess an IPSCwith longer reads. Porter and Hajibabai (2021) further
showed that short (~300 bp) NUMTs are also less likely
to contain other diagnostic features such as inappropriate amino acid
substitutions that can identified via bioinformatic pipelines.
Accordingly, NUMTs pose the highest risk to eDNA studies where amplicons
range from 50-400 bp (Langlois, Allison, Bergman, To, & Helbing, 2021),
dietary analyses where amplicons are 70-230 bp (Berry et al., 2017; da
Silva et al., 2019) and marine metabarcoding studies which commonly
target a 313 bp region of COI (Leray et al., 2013), while studies
targeting the full 658 bp barcode region face a lower risk of NUMT
misinterpretation. However, the analysis of longer amplicons does not
eliminate the problem as 29.9% of >600 bp NUMTs lacked an
IPSC, reinforcing a pattern seen in other taxa (Antunes & Ramos, 2005;
Kunz et al., 2019; Richly & Leister, 2004). Furthermore, NUMTs of any
length can be problematic, even when their incidence in the template
pool is low, if primers have higher affinity for them than for the
mitochondrial target (Kim, Lee, & Ju, 2013; Kunz et al., 2019).
Unrecognized NUMTs misclassified as distinct taxa (species, OTUs, ESVs,
haplotypes) inflate diversity estimates generated by metabarcoding. If
all NUMTs with >2% divergence in our dataset were mistaken
for a distinct OTU, alpha diversity would have been inflated by1.57x.
Conversely, if all NUMTs with <2% divergence were assumed to
reflect haplotype diversity within a taxon, barcode variation would have
been inflated by 1.19x. If intraspecific variation is not under
investigation, grouping NUMTs within an OTU will lessen overestimation
of taxon richness by eliminating those NUMTs with the lowest divergence
which are also least likely to posess an IPSC. However, the latter
approach may not outweigh the benefits of using ESVs such as improved
resolution and their value as intrinsic units of biological diversity
(Callahan, McMurdie, & Holmes, 2017). Whether OTUs or ESVs are used,
sequence arrays containing NUMTs can still indicate beta diversity if
NUMTs are either uncommon or consistently recovered (Porter &
Hajibabaei, 2021).
Patterns of NUMT abundance among
species
The incidence of NUMTs is influenced by diverse factors. Some are
associated with genome size (e.g., frequency of double-stranded breaks,
duplication via repetitive elements), but others are not (e.g.,
mitochondrial damage, cellular stress) so the linkage between the
incidence of NUMTs and genome size is unpredictable (Antunes & Ramos,
2005; Hazkani-Covo et al., 2010; Ricchetti et al., 2004; Richly &
Leister, 2004; Gerstein and Zheng, 2006;Wang et al., 2020). We detected
a significant positive correlation between genome size and NUMT count,
but the low coefficient of determination (ρ = 0.33) helps to explain the
variable associations noted in earlier studies. As NUMT frequency varies
by marker (Baeza & Fuentes, 2013), a search for their incidence across
the entire mitochondrial genome would better evaluate the linkage
between the prevalence of NUMTs and genome size.
The quality of genome assemblies, as indicated by contig N50, varied
117,000-fold among the 156 species examined in our study. High quality
assemblies possessed significantly fewer NUMTs than lower quality
assemblies, a result contrary to previous findings (Hazkani-Covo et al.,
2010; Richly & Leister, 2004). Typically, NUMT counts are elevated in
high quality assemblies because of their inadvertent exclusion from
draft assemblies. For example, Hazkani-Covo et al. (2010) reported that
NUMT counts doubled between low and high quality assemblies of 18
eukaryotes, including a 20-fold increase in Drosophila
melanogaster and a 3-fold rise in Takifugu rubripes . Since NUMT
detection is sensitive to search strategy, genome curation, and level of
completion, estimates of NUMT prevalence will need to be updated as the
quality of genome assemblies improves. Because the quality of current
genome assemblies for most marine animals are low (mean contig N50:
340,000; median: 16,000), our NUMT counts may be substantial
underestimates.
Other potential sources of errors in NUMT enumeration include the
recovery of COI from bacterial symbionts or other contaminants,
heteroplasmy, or PCR/sequencing errors (Baeza & Fuentes, 2013; Song et
al., 2008). For example, a recent genome assembly for a nematode
inadvertently included many sequences of bacterial origin (Schiffer et
al., 2019; Thorne, Kagoshima, Clark, Marshall, & Wharton, 2014). Our
study revealed many NUMTs in several molluscs, including bivalves and
cephalopods which employ doubly uniparental inheritance of mitochondrial
DNA (Doucet-Beaupré et al., 2010; Shizas, 2012; Zouros, Ball, Saavedra,
& Freeman, 1994) and other mechanisms that add similar diversity
(Strugnell & Lindgren, 2007). Because of such factors, causes aside
from NUMTs can contribute to perceived COI diversity (Shizas, 2012).
While our analyses did not reveal significant differences in NUMT counts
among phyla, the highest numbers were found in arthropods, reinforcing
evidence that their genomes often possess many NUMTs (Song et al.,
2008). This fact is well-documented for decapods (Baeza & Fuentes,
2013; Gíslason, Svavarsson, Halldórsson, & Pálsson, 2013; Kim et al.,
2013; Williams & Knowlton, 2001; Yuan et al., 2017), a group with high
NUMT counts in our study. The capacity to rigorously test for variation
in the incidence of NUMTs among phyla was constrained because most
(12/17) had only a few genomes available for analysis.
Although no association was evident between NUMT counts and trophic
category or breeding system, a parasitic copepod, Lepeophtheirus
salmonis , had the highest NUMT count (50). In bacteria, pseudogenes are
more common among parasites (Lawrence & Hendrix, 2001), and some
parasitoid wasps contain many NUMTs, possibly linked to a high incidence
of structural rerarrangements in their mitochondria (Yan et al., 2019).
However, many marine organisms have complex life history patterns that
include multiple trophic levels and reproductive strategies,
complicating the detection of linkages with genome dynamics. Patterns
may be apparent once more genomes have been sequenced.
Mitigating the impacts of NUMTs
Because NUMTs are often difficult to recognize, it is worth considering
approaches that would minimize their impacts on metabarcoding.
Methodological shifts offer one solution. For example, density gradient
centrifugation can provide highly purified mitochondrial DNA for PCR
(Cristescu, 2019; Deiner et al., 2017; Tsuri et al., 2021), but it
requires freshly collected samples (Zhou et al., 2013). Reverse
transcription PCR (RT-PCR) offers another powerful option as sequences
are only recovered from mitochondrial DNA because NUMTs are not
transcribed, and this method can be employed on properly preserved
samples. (Bensasson et al., 2001; Porter & Hajibabaei, 2021; Song et
al., 2008; D. Wang et al., 2019).
Aside from adjusting wet lab protocols, NUMTs can be excluded by more
rigorous bioinfomatic screening (Porter & Hajibabaei, 2021). Tree-based
methods that identify NUMTs based on their divergence from the
mitochondrial copy can aid studies on a particular taxonomic group
(Creedy et al., 2019), but are ineffective for work on communities
because taxonomic diversity is high. Read count thresholds can remove
90-95% of NUMTs while retaining most legitimate mtDNA variants (Andújar
et al., 2020). Pipelines that incorporate amino acid analysis into their
filtering parameters can further improve the recognition of NUMTs (Kunz
et al., 2019; Nugent et al., 2020; Porter & Hajibabaei, 2020; Porter &
Hajibabaei, 2021). For example, Coil (Nugent et al., 2020) and METAWORKS
(Porter & Hajibabaei, 2020) employ profile hidden Markov model (PHMM)
analysis to compare the features of an unknown sequence with a
translated profile built from a multiple sequence alignment for each
taxonomic group to highlight frameshift errors. Because the COI protein
has highly conserved amino acids at certain positions, sequences that
lead to an amino acid substitution at these sites are likely to be NUMTs
(Buhay, 2009; Hebert et al., 2003; Kunz et al., 2019).
Despite the potential of these defensive measures to minimize the impact
of NUMTs, comprehensive, well-curated DNA barcode reference libraries
are needed to support metabarcoding studies in marine environments
(Bucklin, Steinke, & Blanco-Bercial, 2011; Leray & Knowlton, 2016). At
present, NUMTs are typically discarded, but their retention in the
reference library would aid data interpretation (Matzen da Silva et al.,
2011), especially those derived from common, large-bodied species. There
are two paths to develop a well-parameterized reference library that
includes both COI sequences and their NUMT derivatives. The first
involves the assembly of high quality nuclear and mitochondrial genome
sequences for all marine species. While this is an important goal (Lewin
et al. 2018), its completion for all marine species will be a
multi-decadal task. A much faster path involves the collection of fresh
specimens and their analysis via RT-PCR to obtain mtCOI sequences,
followed by PCR to amplify both mtCOI and NUMTs from the same specimen.
This approach has the potential to characterize thousands of species in
a single HTS run. Although the analysis of multiple markers can help to
resolve ambiguities when NUMTs complicate results (Antunes & Ramos,
2005; Kunz et al., 2019; Richly & Leister, 2004; Song et al., 2008; van
der Loos & Nijland, 2020), the challenge in developing a
well-parameterized reference library for COI reinforces the need to
focus on this task rather than on broadening analysis to additional
genes.
DNA metabarcoding has gained rapid adoption due to its capacity to
document biodiversity patterns in unprecedented detail at
ever-decreasing costs (Taberlet, Coissac, Pompanon, Brochmann, &
Willerslev, 2012; van der Loos & Nijland, 2020). At a time when marine
biodiversity is in rapid decline (Dulvy, Sadovy, & Reynolds, 2003;
Halpern, Selkoe, Micheli, & Kappel, 2007; WWF, 2020), it is critical
that the metabarcoding results used to inform conservation and
management decisions be reliable. Hence, it is essential to develop the
informatics tools and analytical approaches to minimize the impacts of
NUMTs on biodiversity assessments. Since marine organisms exhibit
tremendous variation in the structure and organization of their genomes
(Burger, Jackson, & Waller, 2012; Lavrov & Pett, 2016), further
exploration and acknowledgment of NUMT prevalence is not only necessary
for reliable metabarcoding data, but will likely inform evolutionary
insights regarding genome dynamics across the tree of life (Zhang &
Hewitt, 1996).