2.4.2 Genomic species delimitating under the multispecies
coalescent model using BPP/GDI
Incorporating genome-scale data with the multispecies coalescent model
has provided unprecedented insight into species boundaries (Flouri et
al., 2018). The Bayesian program BPP includes a full-likelihood
implementation of the multispecies coalescent model (Yang, 2015),
providing powerful approaches for delimiting species and populations
(Degnan and Rosenberg, 2009; Knowles and Carstens, 2007; Yang and
Rannala, 2010). Here we use BPP to test the hypothesis of micro-endemism
vs. phenotypically polymorphic species in Niebla. Due to
computational limitations, we were unable to analyze the complete RADseq
dataset – 25,086 RADseq loci – using BPP. Therefore, we chose to
explore species delimitation results using various subsets of the RADseq
data. We identified the number of informative sites per RADseq locus by
running the complete, partitioned dataset in IQ-TREE v. 2.1.1 (Minh et
al., 2020). Based on these results, we selected data subsets comprised
of the 10 most variable loci, all loci with at least 20 informative
sites (163 of the 25,086 original loci), and the 500 most variable loci.
After excluding loci with no variation, we also generated data subsets
comprised of 100, 500, 1000 and 5000 randomly sampled loci, in
replicates of three. Finally, we included a subset of 5649 loci that
included between 10 and 20 variable sites across the entire locus. To
prepare this data for implementation in BPP, we used a “.fasta” file
of the RAD-Seq data, a “.txt” file with the names of the partitions of
interest, and the “.txt” partition file itself. We created a custom
python script (Dryad file S51) which uses those three files to output a
.phy file of the RAD-Seq data partitions of interest to make them
compatible with BPP. Further, it creates a .txt file with the names of
candidate species from the “.phy”. From this, a designation was made
for the candidate species, e.g. A1, A2, B, C, etc., to create the imap
file for BPP. Afterwards, these files were used to run “A00” and
“A10” analyses in BPP (described below).
BPP analysis ‘A00’ estimates tau (τ) and theta (θ) parameters under the
multispecies coalescent model when the species phylogeny is given (Yang,
2015). Priors for the Niebla species population metrics, θand τ , were estimated from each subset of RADseq loci using the
BPP software’s built-in function for parameter estimation to avoid any
existing taxonomic bias. Following Flouri et al. (2018), an initial
naïve analysis was run using the θ and τ estimates of the
software-provided dataset of Asiatic brown frogs where θ = 0.004 and τ =
0.002. For each subsequent BPP analysis, the model for each data subset
was specified to use the corresponding θ and τ priors
(θ =0.0001 and τ= 0.00003), as well as using BPP’s built-in
function to re-estimate theta depending on the dataset chosen. Note,
these priors differ widely from the parameters of the BPP analyses of
Ramalinaceae reported in Spjut et al. (2020), although specific effects
of different θ and τ priors on species delimitation models
were not reported therein. Each analysis of the 10-, 163- and 500-most
variable loci datasets was replicated six times total to account for
variation in the GDI estimates, while we independently took a random
sample three times and analyzed those in four replications, for a total
sample of 12 per randomly sampled subset. Subsequently, we performed the
‘A10’ analysis in BPP for species delimitation using a fixed guide tree,
which provides posterior probabilities for different estimated species
delimitation models using a Bayesian modelling approach (Yang and
Rannala, 2010). All subsets were run for a total of 500,000 MCMC
simulations, with the exception of the 5000 and 5649 subsets which were
run for 100,000 MCMC simulations to avoid computational constraints.
MCMC simulations were guided by a ‘burn-in’ of an extra 10% of the
total run length, e.g., 50,000 iterations for most runs. RADseq data can
recover populations and recently diverged species as reciprocally
monophyletic (Hou et al., 2015; Wagner et al., 2013). Recent studies
have also shown that multispecies coalescent approaches to species
delimitation can overestimate the number of species by delimiting
population structure, rather than species-level divergence when using
genome-scale data (Leaché et al., 2018; Sukumaran and Knowles, 2017).
Therefore, 17 candidate species were initially circumscribed as
reciprocally monophyletic clades consistent between the IQtree and
SVDQuartets+PAUP* inferences and nested within the candidate species
inferred from the single locus ASAP partitions, e.g., phylogenetic
substructure within the ASAP partitions – see results. The IQtree
inference was used as the initial guide tree, collapsing nodes
comprising multiple specimens within each candidate species. To assess
the effect of increased coverage of RADseq loci in BPP species
delimitation analyses, we used the different subsets of loci as defined
above for our ‘A10’ BPP analyses. Parameters were consistent with the
“frog.ctl” found in BPP v. 4.3.0 tutorial (Flouri et al., 2018), with
the exception of θ and τ priors, as mentioned above. The
same 17 species phylogenetic tree prior was included in all analyses.
For each dataset, MCMC chains were run for 100,000 generations,
including 20,000 burn-in iterations (Dryad file S52), and the same
sample size as used in the ‘A00’ analysis was used to control for
variation between runs (six independent runs for the most informative
loci datasets, and three independent subsets repeated four times each
for the randomly selected loci subsets). The posterior probability of
each species’ delimitation model was then pooled for an aggregate
possibility between the six replicates of highly informative loci
datasets, or 12 replicates of the randomly sampled loci datasets.
Resulting log files were analyzed in Tracer v.1.7.1 (Rambaut and
Drummond, 2005) to assess ESS values and convergence of runs.
Because genome-scale data sets may delimit population boundaries and not
species limits, we coupled our BPP analyses with the recently proposed
heuristic empirical, genealogical divergence index (gdi; Jackson
et al., 2017; Leaché et al., 2018). With the proposed pattern of
micro-endemism and with putatively sympatric species common in our
sampled candidate species, we used BPP’s estimates of τ andθ to calculate the gdi for pairwise species comparisons
(Poelstra et al., 2020) for the equation in Leaché et al. (2018),gdi =1−e-2τ/θ, to calculate the probability
that two sequences coalesce before reaching species divergence
(τ ) when the genealogy is traced backwards in time. We used a
custom python script to extract τ and θ values from the
BPP analyses (Dryad file S53), treating well-supported phylogenetic
substructure in the ML topology as candidate species – see Results. The
script takes in an ‘A00’ BPP output file, calculates the gdi for
each candidate species comparison, then writes a matrix to a .csv file.
The ‘A00’ analyses were executed for the same subsets of RADseq loci in
six or 12 replicates, and gdi scores were calculated individually
for each run and averaged across replicates. Resulting gdi scores
for each comparison between either single species’ group or composite
species’ groups were displayed in a matrix. To assess potential biases
in gdi scores resulting from differences in the number and types
of sampled RADseq loci, we visualized the distribution of gdiscore results from the different sample subsets using a normalization
function in Excel. As a rule of thumb, gdi values <0.2
suggest a single species and gdi values >0.7 suggest
two distinct species, while gdi values within the range indicate
ambiguous delimitation (Jackson et al., 2017). The resulting BPP
inferred gdi delimitations were compared to the initial ‘A10’
analysis performed as described above.
We also used an iterative, ‘tip-down’ approach using BPP+gdiscores following Jackson et al. (2017) and Leaché et al. (2018). After
collapsing shallow clades comprised of morphologically identical species
collected from the same location (N. cornea [sl16863BF &
sl16857BF]; N. dilatata [sl16940 & sl16946]; N.aff. isidiosa [sl16775 & sl16853BF]; N. juncosa var.juncosa [sl16705 & sl16707]; and N. juncosa var.spinulifera [sl16706 & sl16719] – see Results), the
topology inferred using IQtree was used as the initial guide tree,
treating each terminal node as a candidate species – 35 candidate
species total; and specimens representing phenotype-based species were
split into multiple candidate species when not recovered as
monophyletic. In BPP, the ‘A00’ analysis was performed, and we extracted
the θ , τ and taxa labels as described above. Candidate
species were collapsed into a single species group if the average of the
pairwise comparisons of gdi scores between sister groups was ≤
0.2. The process was performed iteratively until the gdi scores
no longer suggested the joining of any candidate species in the
topology. A total of ten analyses per guide tree were completed to
ensure consistent results by the comparing means and standard deviations
of each gdi score. The analyses were run for 2,000,000
iterations, with ten independent runs, calculating the gdi score
from the mean θ and τ values.