Phylogenetic supermatrix assembly
We compiled a species-level supermatrix of genetic data for the Columbidae plus selected outgroups using sequence data downloaded from NCBI (https://www.ncbi.nlm.nih.gov/). Sequences were retrieved by BLASTn searches of the NCBI database combined with an organism filter, using exemplar reference gene sequences drawn from the key Columbiformes phylogenetic references as queries. We focused on loci used in published phylogenetic studies that are sampled for a reasonable proportion of taxa and major lineages. The key references are provided in the Supplementary File, and the genes and GenBank accessions are listed in Table S1. Species-level taxonomy followed the International Ornithology Council (IOC) World Bird List v. 9.2 (https://www.worldbirdnames.org/updates/). Recent phylogenomic analyses of the Neoaves (Jarvis et al., 2014; Prum et al., 2015; Reddy et al., 2017) indicated the Pteroclidae and Mesitornithidae as outgroups to the Columbidae. Because sequence data for these taxa are patchy, we pooled data within each lineage to create composite family representatives, along with a composite Cuculiformes taxon as a further outgroup. Sequence data were aligned with MAFFT (v. 7.245) (Katoh & Standley, 2013) using the local-pair (L-INS-i) algorithm, the alignments were assembled into a custom Microsoft Excel database, and nomenclature was rationalized to IOC9.2 (with the help of cross-referencing via Wikipedia using common names). Gene trees were inferred as IQ-TREE (Nguyen et al., 2015) ultrafast bootstrap consensus trees (Hoang et al., 2018), using models of sequence evolution identified by ModelFinder as implemented in IQ-TREE (Kalyaanamoorthy et al., 2017). These trees were then scanned for non-monophyletic genera and species (using the custom script GTREER5), and the database was updated by excluding aberrant accessions or, in some cases, revising nomenclature. Where necessary, sequence sets were then realigned (as above).
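The genus-level monophyly scan described above can be illustrated with a minimal sketch. This is not the GTREER5 script, whose internals are not described here. It assumes a rooted gene tree already parsed into nested tuples, leaf labels of the hypothetical form Genus_species, and a toy topology invented purely for illustration. A genus is flagged when no clade in the tree contains exactly its sampled members.

```python
from collections import defaultdict


def clade_leaf_sets(tree):
    """Return (leaves under this node, leaf sets of every clade at or below it).

    `tree` is either a leaf label (str) or a tuple of child subtrees.
    """
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = clade_leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def non_monophyletic_genera(tree):
    """List genera whose sampled members do not form a clade in a rooted tree."""
    leaves, clades = clade_leaf_sets(tree)
    clade_set = set(clades)
    members_by_genus = defaultdict(set)
    for leaf in leaves:
        # Assumption: the genus is the part of the label before the first "_".
        members_by_genus[leaf.split("_")[0]].add(leaf)
    return sorted(
        genus
        for genus, members in members_by_genus.items()
        if frozenset(members) not in clade_set
    )


# Toy topology for illustration only; it is not a result from this study.
toy_tree = (
    (("Columba_livia", "Columba_oenas"), "Streptopelia_turtur"),
    (("Ducula_aenea", "Ptilinopus_magnificus"), "Ducula_bicolor"),
)
print(non_monophyletic_genera(toy_tree))  # ['Ducula']
```

Accessions flagged in this way would still need manual inspection, since a flag can reflect a misidentified sequence, a wrong locus, outdated nomenclature, or genuine non-monophyly. The check also depends on the rooting, so a genus that spans the root of an arbitrarily rooted tree would be flagged spuriously.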
Some long genes are routinely sequenced in fragments (e.g. RAG-1, COI). To maximize data for the COI gene, we therefore also used a consensus method in which the alignment was reduced to a single consensus sequence per species, based on the most common base at each site (with ties scored as ambiguous). This in effect selects the most commonly sequenced sub-lineages, and is a simple way to combine data and discount aberrant sequences (wrong loci, wrong taxa, etc.). These consensus alignments were then subjected to the same gene tree inference and genus monophyly scans as above. We also used mitogenome data, as follows. Because gene order is preserved across the relevant taxa, we first aligned the entire mitogenomes, then excised the set of commonly used genes and added these sequences to their respective alignments. The remainder (referred to as the mtg-block) was kept as a separate alignment after deleting the non-coding D-loop region. The concatenated supermatrix then used the best (longest accepted) single exemplar sequence per gene per species. The gene alignments were compacted by removing regions with little or no data (<10% of taxa per gene) and ambiguously aligned regions (via GTREER5 and ALISCORE v2.2; Misof & Misof, 2009).
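The per-species consensus and the low-occupancy trimming can be sketched as follows. This is an illustrative reading of the description above, not the code used in the study. It assumes that gaps and missing-data characters are ignored when counting bases, that ties are written as N (IUPAC ambiguity codes would be another option), and that the 10% threshold is applied column by column. The function names and example sequences are hypothetical.

```python
from collections import Counter

# Assumption: these characters all mean "no data" at a site.
MISSING = set("-?N")


def consensus_sequence(aligned_seqs):
    """Majority-rule consensus of aligned sequences from a single species."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        counts = Counter(base.upper() for base in column if base.upper() not in MISSING)
        if not counts:
            consensus.append("-")  # no fragment covers this site
            continue
        top = max(counts.values())
        winners = [base for base, n in counts.items() if n == top]
        # A unique most common base wins; ties are scored as ambiguous.
        consensus.append(winners[0] if len(winners) == 1 else "N")
    return "".join(consensus)


def drop_sparse_columns(alignment, min_fraction=0.10):
    """Remove columns with data in fewer than `min_fraction` of the taxa."""
    names = list(alignment)
    columns = list(zip(*(alignment[name] for name in names)))
    keep = [
        i
        for i, column in enumerate(columns)
        if sum(base.upper() not in MISSING for base in column) / len(names) >= min_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Three overlapping fragments from one hypothetical species.
fragments = ["ACGT----", "ACGA-TTG", "--GATTCG"]
print(consensus_sequence(fragments))  # ACGATTNG
```

In the example, the fourth site is resolved as A (two fragments against one), while the seventh site is covered by one T and one C and is therefore scored as N.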
Two versions of the supermatrix, with and without the mtg-block, were analysed; the version without it was intended to avoid distortion from biased mitogenome sampling (mitogenomes are missing from several key groups) and from nucleotide saturation effects on relative divergence (especially for deeper outgroup lineages). Final analyses used the supermatrix without the mtg-block (as the six mitochondrial genes add enough well-sampled mtDNA, and empirically the results were very similar). This final supermatrix comprised 247 of 344 recognised pigeon species (72% complete) and included sections of four nuclear and six mitochondrial gene loci, amounting to 11,100 sites at 39% data completeness: 1,125,420 defined bases in 1,262 sequences (including 63 COI consensus sequences) from 1,527 accessions, out of a total database of 3,639 accessions (with 37 rejected). Of the 49 IOC9.2 Columbidae genera, only three (all monotypic) were missing (Starnoenas, Microgoura and Cryptophaps).
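For readers wishing to compute comparable summary statistics for their own matrix, a minimal sketch is given below. It assumes that data completeness is the number of defined bases divided by the number of taxa times the number of sites, and that gaps, ? and N all count as undefined; neither assumption is confirmed by the text above. The toy matrix and function name are hypothetical.

```python
# Assumption: these characters are treated as undefined bases.
UNDEFINED = set("-?N")


def data_completeness(supermatrix):
    """Return (defined bases, total cells, fraction defined) for an alignment.

    `supermatrix` maps taxon name -> aligned sequence; all sequences must
    have the same length.
    """
    lengths = {len(seq) for seq in supermatrix.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    defined = sum(
        1
        for seq in supermatrix.values()
        for base in seq
        if base.upper() not in UNDEFINED
    )
    total = n_sites * len(supermatrix)
    return defined, total, defined / total


# Toy matrix: 3 taxa x 8 sites, with 8 + 4 + 2 = 14 defined bases.
toy = {"sp1": "ACGTACGT", "sp2": "ACGT----", "sp3": "----ACNN"}
defined, total, fraction = data_completeness(toy)
print(defined, total, f"{fraction:.0%}")  # 14 24 58%

# Species coverage reported in the text: 247 of 344 recognised species.
print(f"{247 / 344:.0%}")  # 72%
```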