Clustering Analysis
Population structure in the overall dataset was evaluated using fastSTRUCTURE v.1.04, with Structure_threader utilized to parallelize distinct runs of K (Pina-Martins, Silva, Fino, & Paulo, 2017; Raj, Stephens, & Pritchard, 2013). Initially, samples were grouped by location and no species data were pre-assigned to the individuals. To estimate the most likely number of genetic clusters in the dataset (K), the analysis was run for values of K ranging from 1 to 17 (i.e. , number of sites sampled). The best value was selected using thechooseK.py function from the fastSTRUCTURE package and plots were created by Distruct v.2.3 (http://distruct2.popgen.org). The clustering of individuals into the distinct genetic groups were also visualized using a principal component analysis (PCA) and a discriminant analysis of principal components (DAPC). The most likely number of genetic groups was inferred by the find.clusters algorithm for the PCA and the optimal number of principal components to inform the DAPC was defined using the function optim.a.score . Both were performed in R (Team, 2013) through the adegenet package (Jombart & Collins, 2015).
Any individual with more than 25% of their loci grouping with a second cluster was marked as a hybrid and removed from the phylogenetic analysis. Maximum likelihood phylogeny among individuals was run using RAxML v.8.2.12 (Stamatakis, 2014). An acquisition bias correction was applied to the likelihood calculations as alignments were solely composed of SNPs, with each invariant site removed through Phrynomics (https://github.com/bbanbury/phrynomics) (Leaché, Banbury, Felsenstein, De Oca, & Stamatakis, 2015). The GTR+G nucleotide substitution model was used for each search. A rapid bootstrap analysis and search for the best-scoring maximum likelihood tree was executed using the extended majority rule-based bootstopping criterion to achieve a sufficient number of bootstrap replicates (Pattengale, Alipour, Bininda-Emonds, Moret, & Stamatakis, 2010). Additionally, to cross-validate our results, a second phylogeny was inferred in W-IQ-Tree version 1.6.12 (Trifinopoulos, Nguyen, von Haeseler, & Minh, 2016), using the TVM+F+G4 substitution model determined by ModelFinder (Kalyaanamoorthy, Minh, Wong, Von Haeseler, & Jermiin, 2017; Nguyen, Schmidt, Von Haeseler, & Minh, 2015). Branch support was calculated using 1000 ultrafast bootstraps (Hoang, Chernomor, Von Haeseler, Minh, & Vinh, 2018) and Shimodaira–Hasegawa like approximate likelihood-ratio test (SH-aRLT) (Guindon et al., 2010; Hoang et al., 2018).
As each of these clustering methods consistently supported five genetically distinct clusters, we generated a SNP dataset with individuals assigned to both a sampling location and a cluster (“all-species” dataset) as well as four species-specific datasets. SNPs were generated from the raw reads following the processing methods above except the filtering parameters were increased to only include SNP that occurred in at least 75% of the populations and at least half of the individuals within those populations. Genetic diversity estimates (FIS , HE , andHO ), population differentiation (pairwiseFST ), and isolation-by-distance (IBD) were calculated for each SNP dataset using Genepop v.4.7.0 (Rousset, 2017). Geographic distances were calculated as Euclidean distances among localities.