Results
Marine animal genomes
The 348 marine species on the NCBI Genome Browser included 124
invertebrates and 224 vertebrates (157 fishes, 33 mammals, 26 birds, 8
reptiles). We downloaded genomes for 188 of these species including all
124 invertebrates and 64 vertebrates. Thirty-two of these genomes (19
invertebrates, 13 vertebrates) were subsequently excluded from study,
most because no COI sequence was available on GenBank or BOLD, but one
species (Hofstenia miamia) was excluded because no FASTA file was
available. Among the remaining 156 species, 85 invertebrates had a full
(≥1500 bp) mitochondrial COI sequence available. At least one species
from each available vertebrate order was represented. Only partial
(487–657 bp) COI sequences were available for 13 species and these were
included in the 658 bp category.
Hit length and distribution
The 658 bp COI query sequence revealed 389 putative NUMTs ≥100 bp in the
156 genomes with 72 (46.2%) of the species possessing at least one
(Supp. Table 1). The NUMT count averaged 2.49 ± 7.06 (SD) per genome,
and ranged from 0 to 50. Hit lengths varied from 100 to 729 bp and averaged
336 ± 208 bp (Figure 1). The largest share of hits (37.3%) was short (100–200
bp), but almost a quarter (24.4%) were 600–700 bp. Among the 389 hits,
282 (72.5%) contained IPSCs while 107 lacked them (Figure 1).
Forty of the 85 invertebrate genomes with a full-length COI sequence
contained one or more NUMTs ≥150 bp. In total, 449 NUMTs were revealed
with the full-length COI query (Supp. Table 1) with their lengths
averaging 409 bp ± 284 bp (mean ± SD; Figure 2), but many (58.6%) were
less than 300 bp (Figure 2). Most of these NUMTs (358, 79.7%) contained
IPSCs, but 91 did not (Figure 2). Hits were not evenly distributed along
COI: the number of NUMTs that included a given nucleotide position varied
more than 2-fold among positions (coverage of 57–126 per position; Figure 3).
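Per-position coverage of this kind can be tallied from the start and end coordinates of each hit on the query. The following is a minimal sketch, not the authors' pipeline: the function name `position_coverage`, the 1-based inclusive coordinates, and the toy hit list are all assumptions made for illustration.

```python
def position_coverage(hits, query_length):
    """Count, for each 1-based query position, how many hits include it.

    `hits` is a list of (start, end) pairs, 1-based and inclusive
    (an assumed convention; BLAST-style tabular output uses the same).
    """
    # Difference array: +1 where a hit begins, -1 just past where it ends.
    diff = [0] * (query_length + 2)
    for start, end in hits:
        if not (1 <= start <= end <= query_length):
            raise ValueError(f"hit ({start}, {end}) lies outside the query")
        diff[start] += 1
        diff[end + 1] -= 1

    # A running sum over the difference array gives coverage per position.
    coverage = []
    running = 0
    for pos in range(1, query_length + 1):
        running += diff[pos]
        coverage.append(running)
    return coverage


# Toy example (hypothetical coordinates on a 10 bp query).
toy_hits = [(1, 6), (3, 10), (5, 8)]
toy_coverage = position_coverage(toy_hits, 10)
fold_variation = max(toy_coverage) / min(toy_coverage)
print(toy_coverage, fold_variation)
```

Applied to the reported extremes, a range of 57–126 corresponds to 126 / 57 ≈ 2.2-fold variation, consistent with the "more than 2-fold" statement.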
NUMT diagnosis
Longer read lengths reduced the number of NUMTs that were recovered
(Figure 4; Kruskal-Wallis: χ² = 13.05, df = 3, p =
0.005, n = 156) and the number without an IPSC (Figure 4;
Kruskal-Wallis: χ² = 19.23, df = 3, p <
0.001, n = 156). Removing these diagnosable hits significantly reduced
the hit count for three length categories (Figure 4; Wilcoxon rank sum
tests: 300 bp: W = 10,405, p = 0.01; 450 bp: W = 10,706, p = 0.02;
600 bp: W = 10,550, p = 0.005; n = 156), but not for 150 bp (Figure 4;
Wilcoxon rank sum test: W = 10,787, p = 0.047, n = 156). Among those
NUMTs lacking an IPSC, 52.5% were excluded with a
read length of 300 bp, 63.9% with a read length of 450 bp, and 76.2%
with a read length of 600 bp (Table 1).
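For readers unfamiliar with the Kruskal-Wallis statistic reported above, the sketch below computes the tie-corrected H value from pooled ranks. It is an illustration under stated assumptions, not the analysis code used in this study: the function name `kruskal_wallis_h` and the toy hit counts are hypothetical, and the p-value (obtained by comparing H against a chi-squared distribution with the returned degrees of freedom) is left to a statistics package.

```python
def kruskal_wallis_h(groups):
    """Return (H, df) for a list of non-empty groups of numeric observations.

    H is corrected for ties; df is the number of groups minus one.
    """
    # Pool all observations, remembering which group each came from.
    pooled = sorted(
        (value, g) for g, group in enumerate(groups) for value in group
    )
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0

    i = 0
    while i < n_total:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n_total and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1      # average of 1-based ranks i+1..j+1
        run_length = j - i + 1
        tie_term += run_length**3 - run_length
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += mean_rank
        i = j + 1

    h = (12 / (n_total * (n_total + 1))
         * sum(rs**2 / len(g) for rs, g in zip(rank_sums, groups))
         - 3 * (n_total + 1))
    correction = 1 - tie_term / (n_total**3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction, len(groups) - 1


# Toy hit counts per genome for four read-length categories (hypothetical).
toy_counts = [[5, 7, 9, 12], [3, 4, 6, 8], [1, 2, 2, 5], [0, 1, 1, 2]]
h_stat, df = kruskal_wallis_h(toy_counts)
print(round(h_stat, 2), df)
```

With four read-length categories the test has 3 degrees of freedom, matching the df = 3 reported for Figure 4.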
NUMTs with IPSCs possessed an average sequence divergence of 21.9% ±
8.8% from the mtCOI sequence in their parent species with divergences
ranging from 0.3–36.0% (Figures 5 & 6). By comparison, NUMTs lacking
IPSCs possessed an average divergence of 10.8% ± 9.3% (range =
0–30.5%) (Figures 5 & 6). Among the ≥150 bp hits that lacked an
IPSC, 73.0% (89/122) had divergence values >2%, so they
could inflate the OTU count, while another 30 with divergence values
<2% could inflate the amount of barcode variation within
their source species. The other three hits showed 0% divergence from
mtCOI so would have no impact. Accordingly, studies targeting short
amplicons could increase OTU counts by 1.57× and intraspecific barcode
variation by 1.19×.
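The text does not spell out how the two inflation factors were obtained. One reading that reproduces both reported values is to add the misleading NUMTs to the 156 species analysed and divide by that baseline; the sketch below checks this arithmetic. The formula and the helper name `inflation_factor` are assumptions, not a statement of the authors' method.

```python
SPECIES = 156             # genomes retained for analysis
NUMTS_WITHOUT_IPSC = 122  # hits of at least 150 bp lacking an IPSC
OTU_INFLATING = 89        # divergence > 2%
VARIATION_INFLATING = 30  # divergence > 0% but < 2%
NO_IMPACT = NUMTS_WITHOUT_IPSC - OTU_INFLATING - VARIATION_INFLATING


def inflation_factor(extra_sequences, baseline=SPECIES):
    """Assumed formula: (baseline + extra sequences) / baseline."""
    return (baseline + extra_sequences) / baseline


otu_factor = inflation_factor(OTU_INFLATING)
variation_factor = inflation_factor(VARIATION_INFLATING)
otu_share = 100 * OTU_INFLATING / NUMTS_WITHOUT_IPSC

print(f"hits with no impact: {NO_IMPACT}")
print(f"share inflating OTU count: {otu_share:.1f}%")
print(f"OTU inflation: {otu_factor:.2f}x")
print(f"intraspecific variation inflation: {variation_factor:.2f}x")
```

Under this assumption, (156 + 89) / 156 ≈ 1.57 and (156 + 30) / 156 ≈ 1.19, in agreement with the figures given above.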
Patterns of NUMT abundance among species
Genome sizes varied more than 2000-fold, from 3.03 Mb in the
demosponge Aplysina aerophoba to 6,700 Mb in the ridgetail prawn
Palaemon carinicauda (Figure 7; Supp. Table 2). There was a weak
positive correlation between the hit count and genome size (Figure 7;
Spearman’s rank correlation: ρ = 0.33, p < 0.0001, n = 156).
Contig N50s ranged 117,000-fold, from 198 for Ophionereis fasciata to
23 × 10⁶ for Chanos chanos (Supp. Table 2). There was a weak negative
correlation between hit frequency and contig N50 (Figure 7; Spearman’s rank
correlation: ρ = -0.19, p = 0.02, n = 156).
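As an illustration of the rank correlation used here, the sketch below computes Spearman's ρ as the Pearson correlation of average ranks, and checks the reported fold range in genome size. The function names and the toy genome sizes and hit counts are hypothetical; they are not values from Supp. Table 2.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): genome size in Mb and NUMT hit count per genome.
toy_genome_mb = [10, 150, 800, 2500, 6000]
toy_hit_counts = [0, 2, 1, 5, 12]
toy_rho = spearman_rho(toy_genome_mb, toy_hit_counts)

# Fold range in genome size from the two reported extremes.
genome_fold_range = 6700 / 3.03

print(round(toy_rho, 2), round(genome_fold_range))
```

Because the statistic depends only on ranks, the very wide spread in genome size and contig N50 does not require log transformation before testing.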
Arthropods and molluscs had the most COI hits ≥100 bp (Figure 8A), but
mean counts did not differ significantly among phyla (Figure 8A;
Kruskal-Wallis: χ² = 26.05, df = 16, p = 0.053, n =
156). When hits with IPSCs were removed, there was also no difference in
mean hits among phyla (Figure 8B; Kruskal-Wallis: χ² =
19.37, df = 16, p = 0.25, n = 156). Most phyla were represented by four
or fewer species (Figure 8) so the power of this test was very
limited. Among phyla with more representatives, echinoderms and
cnidarians contained the highest percentages of hits ≥100 bp without IPSCs
at 66.7% and 33.3%, respectively (Supp. Table 2).
The number of hits did not differ among taxa in different trophic
categories (Figure 9; Kruskal-Wallis: χ² = 9.03, df =
5, p = 0.11, n = 156), but the parasitic salmon louse,
Lepeophtheirus salmonis, had the highest NUMT count (Table 2).
NUMT incidence was also unrelated to any life history characteristic
examined, including asexual reproduction (Figure 10A; Wilcoxon rank sum
test: W = 1885.5, p = 0.052, n = 156), sexual reproduction (Figure 10B;
Wilcoxon rank sum test: W = 233.5, p = 0.96, n = 156), hermaphroditism
(Figure 10C; Wilcoxon rank sum test: W = 2448, p = 0.35, n = 156) or
coloniality (Figure 10D; Wilcoxon rank sum test: W = 786, p = 0.66,
n = 156).