Are COI sequences the most appropriate data?
Millette et al. (2020) used 175,247 mitochondrial cytochrome c oxidase subunit 1 (COI) sequences from 17,082 vertebrate species deposited in BOLD and GenBank. COI became a popular marker for molecular species barcoding because of its low within-species and high between-species variation. However, these same characteristics make COI inappropriate for measuring IGD, as Millette et al. acknowledge: a marker favoured for its low within-species variation carries little signal about diversity within species. Mitochondrial variation may also be discordant with nuclear variation. Despite these well-known issues, the wide availability of COI sequences has resulted in its continued use to represent IGD in macro-genetic studies (e.g. Miraldo et al. 2016; Millette et al. 2020; Theodoridis et al. 2020; Manel et al. 2020).
Even if COI could provide a useful IGD measure, we have identified a subtle yet serious constraint on repurposing publicly available data: inconsistent archiving practices. Specifically, it is common for only unique or newly discovered haplotypes to be deposited in repositories, rather than the study’s full dataset. As an example, we screened 18 Molecular Ecology issues (Table S1): of 40 papers that deposited mitochondrial sequences in GenBank, 22 deposited all sequences generated, while 18 deposited only novel haplotypes (sequences detected for the first time) or exemplars of each haplotype. Deposited data may therefore more accurately represent haplotype accumulation curves across space and time, and databases consequently do not allow comparable snapshots of genetic diversity at different times. This bias compromises attempts to quantify temporal trends in IGD using GenBank, as done in Millette et al. (2020), and is a potential issue in many spatial macro-genetic studies. To avoid bias from inconsistently archived data, macro-genetic studies should extract metadata on sample sizes and complete haplotype (or allele) frequencies from the original manuscripts (as done by Lawrence et al. 2019).
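The effect of exemplar-only archiving can be illustrated with a minimal sketch. It assumes a hypothetical sample of 20 individuals carrying three haplotypes (labelled A, B and C purely for illustration) and uses Nei's haplotype diversity as a simple stand-in for an IGD metric; it is not the estimator or the data used in any of the studies cited above. The same population is summarised twice: once from the full dataset, and once from what would remain in a repository if only one exemplar per haplotype were deposited.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's haplotype diversity: n / (n - 1) * (1 - sum(p_i ** 2))."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical sample: 20 individuals, one common and two rare haplotypes.
full_sample = ["A"] * 18 + ["B"] + ["C"]

# What is archived if only one exemplar of each haplotype is deposited.
exemplars_only = sorted(set(full_sample))

h_full = haplotype_diversity(full_sample)
h_exemplars = haplotype_diversity(exemplars_only)

print(f"full dataset:   n = {len(full_sample)}, H = {h_full:.3f}")
print(f"exemplars only: n = {len(exemplars_only)}, H = {h_exemplars:.3f}")
```

In this toy case the full dataset gives a low diversity (about 0.19), whereas the exemplar-only record gives the maximum value of 1, because all frequency information has been lost. Any estimate computed from such records reflects the number of haplotypes archived rather than their frequencies in the sampled population.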