Clustering Analysis
Population structure in the overall dataset was evaluated using
fastSTRUCTURE v.1.04, with Structure_threader utilized to parallelize
distinct runs of K (Pina-Martins, Silva, Fino, & Paulo, 2017; Raj,
Stephens, & Pritchard, 2013). Initially, samples were grouped by
location and no species data were pre-assigned to the individuals. To
estimate the most likely number of genetic clusters in the dataset (K),
the analysis was run for values of K ranging from 1 to 17 (i.e. ,
number of sites sampled). The best value was selected using thechooseK.py function from the fastSTRUCTURE package and plots were
created by Distruct v.2.3 (http://distruct2.popgen.org). The
clustering of individuals into the distinct genetic groups were also
visualized using a principal component analysis (PCA) and a discriminant
analysis of principal components (DAPC). The most likely number of
genetic groups was inferred by the find.clusters algorithm for
the PCA and the optimal number of principal components to inform the
DAPC was defined using the function optim.a.score . Both were
performed in R (Team, 2013) through the adegenet package (Jombart
& Collins, 2015).
Any individual with more than 25% of their loci grouping with a second
cluster was marked as a hybrid and removed from the phylogenetic
analysis. Maximum likelihood phylogeny among individuals was run using
RAxML v.8.2.12 (Stamatakis, 2014). An acquisition bias correction was
applied to the likelihood calculations as alignments were solely
composed of SNPs, with each invariant site removed through Phrynomics
(https://github.com/bbanbury/phrynomics) (Leaché, Banbury,
Felsenstein, De Oca, & Stamatakis, 2015). The GTR+G nucleotide
substitution model was used for each search. A rapid bootstrap analysis
and search for the best-scoring maximum likelihood tree was executed
using the extended majority rule-based bootstopping criterion to achieve
a sufficient number of bootstrap replicates (Pattengale, Alipour,
Bininda-Emonds, Moret, & Stamatakis, 2010). Additionally, to
cross-validate our results, a second phylogeny was inferred in W-IQ-Tree
version 1.6.12 (Trifinopoulos, Nguyen, von Haeseler, & Minh, 2016),
using the TVM+F+G4 substitution model determined by ModelFinder
(Kalyaanamoorthy, Minh, Wong, Von Haeseler, & Jermiin, 2017; Nguyen,
Schmidt, Von Haeseler, & Minh, 2015). Branch support was calculated
using 1000 ultrafast bootstraps (Hoang, Chernomor, Von Haeseler, Minh,
& Vinh, 2018) and Shimodaira–Hasegawa like approximate
likelihood-ratio test (SH-aRLT) (Guindon et al., 2010; Hoang et al.,
2018).
As each of these clustering methods consistently supported five
genetically distinct clusters, we generated a SNP dataset with
individuals assigned to both a sampling location and a cluster
(“all-species” dataset) as well as four species-specific datasets.
SNPs were generated from the raw reads following the processing methods
above except the filtering parameters were increased to only include SNP
that occurred in at least 75% of the populations and at least half of
the individuals within those populations. Genetic diversity estimates
(FIS , HE , andHO ), population differentiation (pairwiseFST ), and isolation-by-distance (IBD) were
calculated for each SNP dataset using
Genepop v.4.7.0 (Rousset, 2017).
Geographic distances were calculated as Euclidean distances among
localities.