2.4.2 Genomic species delimitating under the multispecies coalescent model using BPP/GDI
Incorporating genome-scale data with the multispecies coalescent model has provided unprecedented insight into species boundaries (Flouri et al., 2018). The Bayesian program BPP includes a full-likelihood implementation of the multispecies coalescent model (Yang, 2015), providing powerful approaches for delimiting species and populations (Degnan and Rosenberg, 2009; Knowles and Carstens, 2007; Yang and Rannala, 2010). Here we use BPP to test the hypothesis of micro-endemism vs. phenotypically polymorphic species in Niebla. Due to computational limitations, we were unable to analyze the complete RADseq dataset – 25,086 RADseq loci – using BPP. Therefore, we chose to explore species delimitation results using various subsets of the RADseq data. We identified the number of informative sites per RADseq locus by running the complete, partitioned dataset in IQ-TREE v. 2.1.1 (Minh et al., 2020). Based on these results, we selected data subsets comprised of the 10 most variable loci, all loci with at least 20 informative sites (163 of the 25,086 original loci), and the 500 most variable loci. After excluding loci with no variation, we also generated data subsets comprised of 100, 500, 1000 and 5000 randomly sampled loci, in replicates of three. Finally, we included a subset of 5649 loci that included between 10 and 20 variable sites across the entire locus. To prepare this data for implementation in BPP, we used a “.fasta” file of the RAD-Seq data, a “.txt” file with the names of the partitions of interest, and the “.txt” partition file itself. We created a custom python script (Dryad file S51) which uses those three files to output a .phy file of the RAD-Seq data partitions of interest to make them compatible with BPP. Further, it creates a .txt file with the names of candidate species from the “.phy”. From this, a designation was made for the candidate species, e.g. A1, A2, B, C, etc., to create the imap file for BPP. Afterwards, these files were used to run “A00” and “A10” analyses in BPP (described below).
BPP analysis ‘A00’ estimates tau (τ) and theta (θ) parameters under the multispecies coalescent model when the species phylogeny is given (Yang, 2015). Priors for the Niebla species population metrics, θand τ , were estimated from each subset of RADseq loci using the BPP software’s built-in function for parameter estimation to avoid any existing taxonomic bias. Following Flouri et al. (2018), an initial naïve analysis was run using the θ and τ estimates of the software-provided dataset of Asiatic brown frogs where θ = 0.004 and τ = 0.002. For each subsequent BPP analysis, the model for each data subset was specified to use the corresponding θ and τ priors (θ =0.0001 and τ= 0.00003), as well as using BPP’s built-in function to re-estimate theta depending on the dataset chosen. Note, these priors differ widely from the parameters of the BPP analyses of Ramalinaceae reported in Spjut et al. (2020), although specific effects of different θ and τ priors on species delimitation models were not reported therein. Each analysis of the 10-, 163- and 500-most variable loci datasets was replicated six times total to account for variation in the GDI estimates, while we independently took a random sample three times and analyzed those in four replications, for a total sample of 12 per randomly sampled subset. Subsequently, we performed the ‘A10’ analysis in BPP for species delimitation using a fixed guide tree, which provides posterior probabilities for different estimated species delimitation models using a Bayesian modelling approach (Yang and Rannala, 2010). All subsets were run for a total of 500,000 MCMC simulations, with the exception of the 5000 and 5649 subsets which were run for 100,000 MCMC simulations to avoid computational constraints. MCMC simulations were guided by a ‘burn-in’ of an extra 10% of the total run length, e.g., 50,000 iterations for most runs. RADseq data can recover populations and recently diverged species as reciprocally monophyletic (Hou et al., 2015; Wagner et al., 2013). Recent studies have also shown that multispecies coalescent approaches to species delimitation can overestimate the number of species by delimiting population structure, rather than species-level divergence when using genome-scale data (Leaché et al., 2018; Sukumaran and Knowles, 2017). Therefore, 17 candidate species were initially circumscribed as reciprocally monophyletic clades consistent between the IQtree and SVDQuartets+PAUP* inferences and nested within the candidate species inferred from the single locus ASAP partitions, e.g., phylogenetic substructure within the ASAP partitions – see results. The IQtree inference was used as the initial guide tree, collapsing nodes comprising multiple specimens within each candidate species. To assess the effect of increased coverage of RADseq loci in BPP species delimitation analyses, we used the different subsets of loci as defined above for our ‘A10’ BPP analyses. Parameters were consistent with the “frog.ctl” found in BPP v. 4.3.0 tutorial (Flouri et al., 2018), with the exception of θ and τ priors, as mentioned above. The same 17 species phylogenetic tree prior was included in all analyses. For each dataset, MCMC chains were run for 100,000 generations, including 20,000 burn-in iterations (Dryad file S52), and the same sample size as used in the ‘A00’ analysis was used to control for variation between runs (six independent runs for the most informative loci datasets, and three independent subsets repeated four times each for the randomly selected loci subsets). The posterior probability of each species’ delimitation model was then pooled for an aggregate possibility between the six replicates of highly informative loci datasets, or 12 replicates of the randomly sampled loci datasets. Resulting log files were analyzed in Tracer v.1.7.1 (Rambaut and Drummond, 2005) to assess ESS values and convergence of runs.
Because genome-scale data sets may delimit population boundaries and not species limits, we coupled our BPP analyses with the recently proposed heuristic empirical, genealogical divergence index (gdi; Jackson et al., 2017; Leaché et al., 2018). With the proposed pattern of micro-endemism and with putatively sympatric species common in our sampled candidate species, we used BPP’s estimates of τ andθ to calculate the gdi for pairwise species comparisons (Poelstra et al., 2020) for the equation in Leaché et al. (2018),gdi =1−e-2τ/θ, to calculate the probability that two sequences coalesce before reaching species divergence (τ ) when the genealogy is traced backwards in time. We used a custom python script to extract τ and θ values from the BPP analyses (Dryad file S53), treating well-supported phylogenetic substructure in the ML topology as candidate species – see Results. The script takes in an ‘A00’ BPP output file, calculates the gdi for each candidate species comparison, then writes a matrix to a .csv file. The ‘A00’ analyses were executed for the same subsets of RADseq loci in six or 12 replicates, and gdi scores were calculated individually for each run and averaged across replicates. Resulting gdi scores for each comparison between either single species’ group or composite species’ groups were displayed in a matrix. To assess potential biases in gdi scores resulting from differences in the number and types of sampled RADseq loci, we visualized the distribution of gdiscore results from the different sample subsets using a normalization function in Excel. As a rule of thumb, gdi values <0.2 suggest a single species and gdi values >0.7 suggest two distinct species, while gdi values within the range indicate ambiguous delimitation (Jackson et al., 2017). The resulting BPP inferred gdi delimitations were compared to the initial ‘A10’ analysis performed as described above.
We also used an iterative, ‘tip-down’ approach using BPP+gdiscores following Jackson et al. (2017) and Leaché et al. (2018). After collapsing shallow clades comprised of morphologically identical species collected from the same location (N. cornea [sl16863BF & sl16857BF]; N. dilatata [sl16940 & sl16946]; N.aff. isidiosa [sl16775 & sl16853BF]; N. juncosa var.juncosa [sl16705 & sl16707]; and N. juncosa var.spinulifera [sl16706 & sl16719] – see Results), the topology inferred using IQtree was used as the initial guide tree, treating each terminal node as a candidate species – 35 candidate species total; and specimens representing phenotype-based species were split into multiple candidate species when not recovered as monophyletic. In BPP, the ‘A00’ analysis was performed, and we extracted the θ , τ and taxa labels as described above. Candidate species were collapsed into a single species group if the average of the pairwise comparisons of gdi scores between sister groups was ≤ 0.2. The process was performed iteratively until the gdi scores no longer suggested the joining of any candidate species in the topology. A total of ten analyses per guide tree were completed to ensure consistent results by the comparing means and standard deviations of each gdi score. The analyses were run for 2,000,000 iterations, with ten independent runs, calculating the gdi score from the mean θ and τ values.