Results

Marine animal genomes

The 348 marine species on the NCBI Genome Browser included 124 invertebrates and 224 vertebrates (157 fishes, 33 mammals, 26 birds, 8 reptiles). We downloaded genomes for 188 of these species, including all 124 invertebrates and 64 vertebrates. Thirty-two of these genomes (19 invertebrates, 13 vertebrates) were subsequently excluded from the study, most because no COI sequence was available on GenBank or BOLD, but one species (Hofstenia miamia) was excluded because no FASTA file was available. Among the remaining 156 species, 85 invertebrates had a full (≥1500 bp) mitochondrial COI sequence available. At least one species from each available vertebrate order was represented. Only partial (487–657 bp) COI sequences were available for 13 species, and these were included in the 658 bp category.

Hit length and distribution

The 658 bp COI query sequence revealed 389 putative NUMTs ≥100 bp in the 156 genomes, with 72 (46.2%) of the species possessing at least one (Supp. Table 1). The NUMT count averaged 2.49 ± 7.06 (SD) per genome and ranged from 0 to 50. Hit lengths varied from 100 to 729 bp and averaged 336 ± 208 bp (Figure 1). The largest share of hits (37.3%) was short (100–200 bp), but almost a quarter (24.4%) were 600–700 bp. Among the 389 hits, 282 (72.5%) contained IPSCs while 107 lacked them (Figure 1).
Forty of the 85 invertebrate genomes with a full-length COI sequence contained one or more NUMTs ≥150 bp. In total, 449 NUMTs were revealed with the full-length COI query (Supp. Table 1), with lengths averaging 409 ± 284 bp (mean ± SD; Figure 2), although many (58.6%) were shorter than 300 bp (Figure 2). Most of these NUMTs (358, 79.7%) contained IPSCs, but 91 did not (Figure 2). Hits were not evenly distributed along COI: the number of NUMTs covering a given nucleotide position varied more than 2-fold, from 57 to 126 (Figure 3).
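
The percentages and the mean hit count reported above follow directly from the stated counts. As a quick arithmetic check, they can be recomputed with the minimal Python sketch below. It is not part of the original analysis, and the helper name `pct` is an assumption introduced only for illustration.

```python
def pct(part, whole, digits=1):
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)

# Counts as reported in the text
species = 156            # genomes retained for analysis
species_with_numt = 72   # species with at least one NUMT (658 bp query)
hits_658 = 389           # NUMTs found with the 658 bp query
hits_658_ipsc = 282      # of those, NUMTs containing an IPSC
hits_full = 449          # NUMTs found with the full-length query
hits_full_ipsc = 358     # of those, NUMTs containing an IPSC

share_species = pct(species_with_numt, species)    # reported as 46.2%
mean_hits = round(hits_658 / species, 2)           # reported as 2.49
share_ipsc_658 = pct(hits_658_ipsc, hits_658)      # reported as 72.5%
share_ipsc_full = pct(hits_full_ipsc, hits_full)   # reported as 79.7%

print(share_species, mean_hits, share_ipsc_658, share_ipsc_full)
```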

NUMT diagnosis

Longer read lengths reduced the number of NUMTs that were recovered (Figure 4; Kruskal-Wallis: χ² = 13.05, df = 3, p = 0.005, n = 156) and the number without an IPSC (Figure 4; Kruskal-Wallis: χ² = 19.23, df = 3, p < 0.001, n = 156). Removing these diagnosable hits significantly reduced the hit count for three length categories (Figure 4; Wilcoxon rank sum tests: 300 bp: W = 10,405, p = 0.01; 450 bp: W = 10,706, p = 0.02; 600 bp: W = 10,550, p = 0.005, n = 156), but not for 150 bp (Figure 4; Wilcoxon rank sum test: W = 10,787, p = 0.047, n = 156). Among those NUMTs lacking an IPSC, 52.5% were excluded with a read length of 300 bp, 63.9% with a read length of 450 bp, and 76.2% with a read length of 600 bp (Table 1).
NUMTs with IPSCs possessed an average sequence divergence of 21.9% ± 8.8% from the mtCOI sequence in their parent species, with divergences ranging from 0.3–36.0% (Figures 5 & 6). By comparison, NUMTs lacking IPSCs possessed an average divergence of 10.8% ± 9.3% (range = 0–30.5%) (Figures 5 & 6). Among the ≥150 bp hits which lacked an IPSC, 73.0% (89/122) had divergence values >2%, so they could inflate the OTU count, while another 30 with divergence values <2% could inflate the amount of barcode variation within their source species. The other three hits showed 0% divergence from mtCOI, so they would have no impact. Accordingly, studies targeting short amplicons could increase OTU counts by 1.57x and intraspecific barcode variation by 1.19x.
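
One reading that reproduces both inflation factors takes a baseline of one OTU per species across the 156 genomes and adds the relevant hits to it. The text does not state this baseline explicitly, so the Python sketch below is a check under that assumption rather than the authors' stated calculation. The function name `inflation_factor` is hypothetical.

```python
def inflation_factor(baseline, extra):
    """Ratio of (baseline + extra) to baseline."""
    return (baseline + extra) / baseline

species = 156       # assumed baseline: one OTU per species (not stated in the text)
hits_no_ipsc = 122  # hits >=150 bp lacking an IPSC
low_div = 30        # hits with divergence <2% (added intraspecific variation)
zero_div = 3        # hits identical to mtCOI (no impact)
high_div = hits_no_ipsc - low_div - zero_div  # hits with divergence >2%

otu_inflation = round(inflation_factor(species, high_div), 2)
intraspecific_inflation = round(inflation_factor(species, low_div), 2)

print(high_div, otu_inflation, intraspecific_inflation)
```

Under this assumption, the 89 divergent hits give (156 + 89) / 156 ≈ 1.57 and the 30 low-divergence hits give (156 + 30) / 156 ≈ 1.19, matching the factors reported.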

Patterns of NUMT abundance among species

Genome sizes varied more than 2000-fold, from 3.03 Mb in the demosponge Aplysina aerophoba to 6,700 Mb in the ridgetail prawn Palaemon carinicauda (Figure 7; Supp. Table 2). There was a weak positive correlation between the hit count and genome size (Figure 7; Spearman’s rank correlation: ρ = 0.33, p < 0.0001, n = 156). Contig N50s ranged 117,000-fold, from 198 for Ophionereis fasciata to 23 × 10⁶ for Chanos chanos (Supp. Table 2), and there was a weak negative correlation between hit frequency and contig N50 (Figure 7; Spearman’s rank correlation: ρ = -0.19, p = 0.02, n = 156).
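
For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation computed on ranks, with tied values receiving the mean of their ranks. The following is a minimal pure-Python sketch of that definition. It is not the software used in the study, the function names are hypothetical, and the toy values are invented solely to show the mechanics.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy data, NOT values from the study
genome_size_mb = [5.0, 120.0, 450.0, 980.0, 3200.0]
hit_count = [0, 2, 1, 5, 9]

rho = spearman_rho(genome_size_mb, hit_count)
print(round(rho, 2))
```

Because only ranks enter the calculation, the statistic is insensitive to the very wide spread in genome size and contig N50.
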
Arthropods and molluscs had the most COI hits ≥100 bp (Figure 8A), but mean counts did not differ significantly among phyla (Figure 8A; Kruskal-Wallis: χ² = 26.05, df = 16, p = 0.053, n = 156). When hits with IPSCs were removed, there was also no difference in mean hits among phyla (Figure 8B; Kruskal-Wallis: χ² = 19.37, df = 16, p = 0.25, n = 156). Most phyla were represented by four or fewer species (Figure 8), so the power of this test was very limited. Among phyla with more representatives, echinoderms and cnidarians contained the highest percentages of hits ≥100 bp without IPSCs, at 66.7% and 33.3%, respectively (Supp. Table 2).
The number of hits did not differ among taxa in different trophic categories (Figure 9; Kruskal-Wallis: χ² = 9.03, df = 5, p = 0.11, n = 156), but the parasitic salmon louse, Lepeophtheirus salmonis, had the highest NUMT count (Table 2). NUMT incidence was also unrelated to any life history characteristic examined, including asexual reproduction (Figure 10A; Wilcoxon rank sum test: W = 1885.5, p = 0.052, n = 156), sexual reproduction (Figure 10B; Wilcoxon rank sum test: W = 233.5, p = 0.96, n = 156), hermaphroditism (Figure 10C; Wilcoxon rank sum test: W = 2448, p = 0.35, n = 156) or coloniality (Figure 10D; Wilcoxon rank sum test: W = 786, p = 0.66, n = 156).