Are COI sequences the most appropriate data?
Millette et al. (2020) used 175,247 mitochondrial cytochrome c
oxidase subunit 1 (COI) sequences from 17,082 vertebrate species
deposited in BOLD and GenBank. COI became a popular marker for species
molecular barcoding due to its low within-species and high
between-species variation. However, these same characteristics make COI
inappropriate for measuring IGD, as Millette et al. acknowledge, and
mitochondrial variation may also be discordant with nuclear variation.
Despite these well-known issues, the wide availability of COI sequences
has resulted in their continued use to represent IGD in
macro-genetic studies (e.g. Miraldo et al. 2016; Millette et al. 2020;
Theodoridis et al. 2020; Manel et al. 2020).
Even if COI could provide a useful IGD measure, we have identified a
subtle yet serious constraint on repurposing publicly available data:
inconsistent archiving practices. Specifically, it is common for
only unique or newly discovered haplotypes to be deposited in
repositories, rather than the study’s full dataset. As an example, we
screened 18 Molecular Ecology issues (Table S1): of 40 papers that
deposited mitochondrial sequences in GenBank, 22 deposited all sequences
generated, while 18 deposited only novel haplotypes (sequences detected
for the first time) or exemplars of each haplotype. Therefore, deposited
data may more accurately represent haplotype accumulation curves across
space and time; databases consequently do not allow comparable snapshots
of genetic diversity at different times. This bias compromises attempts
to quantify temporal trends in IGD using GenBank, as done in Millette et
al. (2020), and is a potential issue in many spatial macro-genetic
studies. Macro-genetic studies should extract metadata regarding sample
sizes and complete haplotype (or allele) frequencies from the original
manuscripts (as done by Lawrence et al. 2019) to avoid bias from
inconsistently archived data.
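
The effect of depositing only exemplar haplotypes can be illustrated with a minimal sketch. It assumes a hypothetical study that sampled 20 individuals carrying three haplotypes (labels and counts are invented for illustration), and it uses Nei's unbiased haplotype diversity as a simple stand-in for an IGD measure; this is not necessarily the metric used in the studies cited above. The function name `haplotype_diversity` is likewise our own.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype diversity: n/(n-1) * (1 - sum(p_i^2))."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two sequences are required")
    sum_sq_freq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq_freq)


# Hypothetical full dataset: one common haplotype and two singletons.
full_sample = ["A"] * 18 + ["B", "C"]

# What reaches the repository if only one exemplar per haplotype is deposited.
exemplars_only = sorted(set(full_sample))

h_full = haplotype_diversity(full_sample)
h_exemplars = haplotype_diversity(exemplars_only)

print(f"full dataset:   n={len(full_sample)}, H={h_full:.3f}")
print(f"exemplars only: n={len(exemplars_only)}, H={h_exemplars:.3f}")
```

Under these assumptions, the full dataset gives a diversity of about 0.19, whereas the exemplar-only deposit gives 1.0, because every deposited sequence is by construction different from every other. The deposited records therefore carry information about which haplotypes exist, but not about their frequencies.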