3.2.2 Species delimitation using RADseq loci under the multi-species coalescent using BPP
Estimates of θ values under the multispecies coalescent model – analysis ‘A00’ in BPP – were on the same order of magnitude regardless of different RADseq loci subsamples, although standard deviation was much greater in the 163- and 500-most variable loci data subsets. Greater variation in τ values was observed depending on the subsampled RADseq loci (Table 1). Results using the same subset of RADseq loci generally converged on similar gdi estimations. Species delimitation using a fixed guide tree – analysis ‘A10’ in BPP – and the candidate species shown in Fig. 3 inferred up to 17 species, the maximum subdivision specified in the guide tree. Aggregate probabilities across all runs resulted in the highest probabilities across the 13–17 species range, with a 15 species model being most likely across all analyses (Fig. 4). However, the results of the ‘A10’species delimitation analyses in BPP varied across both subsets of data and independent runs, with the randomly sampled subsets showing less proclivity of an unrealistic one-species model. (Table 1; Dryad file S56).
While the candidate species collapsed under each analysis varied widely, the most frequent candidate species groups that were collapsed included: ‘O’, ‘P’ and ‘Q’ collapsed into 1 or 2 species; ‘J’, ‘K’, ‘L’, ‘M’ and ‘N’ were frequently collapsed into 1, 2 or 3 different species groups; and ‘A’, ‘B’, ‘C’ and ‘D’ which were frequently collapsed into 1 or 2 species groups, sometimes including ‘E’ (Fig. 3).
3.2.3 Species delimitation validation using gdi scores
The gdi scores inferred from parameters under a multispecies coalescent model – analysis ‘A00’ in BPP – generally did not provide justification for collapsing any nodes in the 17-species model – based on 17 well-supported clades identified in the ML topology, the maximum subdivision specified in the guide tree (Fig. 3; Dryad file S56), suggesting that our sampling includes at least 17 species-level lineages. However, in the analyses of the 10 and 163 most informative RADseq loci, a number of candidate species were collapsed based ongdi . In the 10-loci subset, candidate species ‘A’, ‘B’, ‘C’, and ‘D’ were combined (gdi scores <0.2), as were candidate species ‘F’ and ‘G’, candidate species ‘J’ and ‘K’, candidate species ‘M’ and ‘N’, and finally candidate species ‘P’ and ‘Q’ (Fig. 3). In the 163-loci subset analysis, candidate species ‘A’, ‘B’, and ‘C’ were collapsed together, as well as ‘F’ and ‘G’. The gdi scores calculated from all other analyses suggested maintaining the 17-species model. In the “tip-down”, iterative approach, gdi scores supported collapsing a number of tips resulting in a 29-species model from the original 35 tips after combining morphologically similar specimens from the same location (Fig. 3; Dryad file S57). Collapsed specimens included two of the three specimens identified as N. cornea from the Channel Islands with two specimens from Baja California (in clade ‘O’), specimens identified as N. fimbriata , N. palmeri , and N. suffnessii (in clade ‘M’), N. flagelliforma with N. juncosa var. juncosa (in clade ‘H’), and N. flagelliforma with N. rugosa (in clade ‘B’) (Fig. 3).
To further characterize the impact of data subsampling on gdi scores, we report average gdi scores across all nodes for each of the RADseq data subsets. The 10- and 163-locus datasets comprising the most variable sites had lower gdi scores when compared with the remaining subsets (Fig. 5). Average gdi values generally increased with an increased coverage of genetic markers, although using the most informative markers rather than a random subset seems to limit this trend. This trend is primarily caused because τ is much higher when estimated using only the most variable loci, with the exception of a very small dataset of 10 loci. Estimates for θ of each individual candidate species were also higher when using the most informative subsets, but so is the standard deviation of the estimated and θ values.