Results
All of the raw sequence reads, alignment files, and genotype datasets will be deposited in the Open Science Framework database upon acceptance, https://osf.io (DOI XXX). In total, 271 individuals were subjected to the ddRADseq procedure and yielded an average of 2.08 million reads per individual. During the initial filtering, 36 individuals had a phred score of less than 25 and were removed from the dataset. Additionally, 29 individuals were found to have more than 75% missing data and were therefore removed. The final dataset included 206 individuals from 17 sites and contained 3612 SNPs. The population structure inferred by fastSTRUCTURE that best explains the data is K = 5. Structure plots showing K = 3-6 can be found in figure S1. At K = 5, most individuals (86%) were unambiguously assigned to one group (98-100% assignment score; Fig. 1). Consistent with these results, the PCA and DAPC grouped these individuals into five main clusters (Figs. 2a & S2). The main difference being that the PCA further segregated one cluster (blue, Fig. 2a) into two separate groups; east and west of the Sierra Nevada mountain range. Further support for the same five clusters was found in the maximum likelihood trees, with a high level of support from each approximation method (Figs. 2b & S3).
The geographic distributions of four of these clusters closely align with the distributions of four of the five subspecies described in Wirth & Jones (1957) (Fig. 1), suggesting these morphological descriptions accurately denoted species-level taxa within the C. variipenniscomplex. Further phylogenetic and morphological study is needed to confirm the validity of these species groupings; however, for the remainder of the manuscript we will refer to each cluster by a species name. Culicoides occidentalis located in Western North America,C. sonorensis in the Western and Southern U.S., C. albertensis in the Midwest U.S. and Ontario, C. variipennis in the Eastern U.S. and Ontario, and a fifth genetic group suggesting the occurrence of an additional, undescribed cryptic species in San Diego, CA. Notably, eight of the 17 sites had more than one species in sympatry, and one site had three species. At four sites, seven individuals were assigned to two genetic groups with an assignment score of ~50% for three individuals (scores = 45, 47 and 41%) and of ~25% for four individuals (scores = 34, 31, 25 and 24%), which suggests the occurrence of putative F1 or other types of hybrids (e.g. F2 or backcrosses), respectively. Interestingly, these hybrids were from three different species parings (C. sonorensis X C. occidentalis ; C. sonorensis X C. variipennis ; andC. albertensis X C. variipennis ). These hybrid individuals also stood out using the PCA analysis, as they segregated between their parental clusters (Fig. 2a), as well as at the base of each parental branch in the phylogenetic tree (Fig. S3). In addition to these hybrids, 20 individuals had a secondary assignment score between 3% to 21%, signifying potential introgression between those pairings.
The samples were then rearranged by species, rather than collection site, and stricter filtering parameters were applied. This dataset contained 566 SNPs from 199 individuals (hybrids were excluded) and was used to calculate the species-level summary statistics as well as determine the loci under selection. The mean FSTbetween the five inferred clusters was 0.7147 (0.6541-0.7470), roughly 9 times higher than the mean FST between the populations (i.e., localities) within each cluster (see below; Tables 2 & S1). The overall dataset was further split into five datasets for species-level population statistics. These datasets contained 22 individuals of C. albertensis from four populations (3423 SNPs), 36 individuals of C. occidentalis from four populations (2714 SNPs), 97 individuals of C. sonorensis from seven populations (2357 SNPs), and 29 individuals of C. variipennis from four populations (2960 SNPs). The expected and observed heterozygosity,FIS , and number of private alleles for each species are reported in Table S2. No species level dataset was created for the San Diego species as only one locality was examined.
When examining each species individually, C. albertensis, had no evidence of population structure (K = 1), and had low genetic differentiation among populations (mean FST = 0.054) (Fig. 3a; Table 2). Although there does seem to be a pattern of isolation by distance, this was found to not be significant in this species (P = 0.238). The low number of populations sampled potentially limits our statistical power for this correlation. The results obtained for C . occidentalis showed much more divergence compared to the other species, with populations being strongly differentiated from each other (mean FST = 0.411) (Table 2). Additionally, fastSTRUCTURE suggests that each population of C. occidentalis is a distinct genetic entity (K = 4) clustering by location (Fig. 3b). While no IBD was found (P = 0.489), there seems to be a considerable amount of geographic isolation among populations of this species, with pairwise FST values ranging from 0.14 to 0.70 (Table S3). Low genetic differentiation among populations was found for C. sonorensis (meanFST = 0.029), despite a slight, but significant IBD in this species (P = 0.039) (Fig. 3c; Table 2). For this reason, the individuals from Colorado were combined into a single population, as were the individuals from Kansas. A fastSTRUCTURE analysis suggested the occurrence of population structure in C. sonorensis (K = 2), with some individuals from Kansas belonging to a distinct group. The combined Kansas populations were not divergent from any other C. sonorensis population (Table S3). Populations of C .variipennis exhibited no evidence of population structure (K = 1) or of isolation by distance (P = 0.587) (Fig. 3d). Consistently, almost no genetic differentiation was found among populations of this species (mean FST = 0.026) (Table 2).
We identified three outlier loci within the C. variipenniscomplex and an additional 23 species-specific loci: two in C. albertensis , seven in C. occidentalis , 11 in C. sonorensis , and two in C. variipennis . Each of these loci had a log10 Bayes factor value over 1 and six had values above 2, corresponding to a 95% and 99% confidence interval, respectively (Fig. 4). Searches of InsectBase were used to assign putative functional annotations (most of which were provided by Nayduch et al. (2014), with orthologous dipteran genes subsequently found using Flybase (Table S4). Roughly 75% of the loci were matched to transcription data, and all but one associated with a dipteran orthologous gene. None of the loci found to have significant evidence of selection were shared across the different species, suggesting that each is under its own set of selective pressures.
Based on the COI gene, four distinct groupings were identified with strong genetic divergence between groups (p- distance = 2.99-3.30%) and little divergence within groups (p -distance = 0.25-0.86%; Fig. 5; Table 3). Consistent with the SNP datasets, the California population of C. occidentalis was separated from the rest of its range. The mean percent divergence between the twoC. occidentalis groups (2.99%) was similar to its divergence from the other species (3.01-3.30%). The San Diego population clusters as a distinct group, with a similar level of divergence from the other species (3.01-3.03%). Interestingly, C. albertensis , C. sonorensis , and C. variipennis were not separated from each other, and in some cases, C. albertensis andC. variipennis shared identical haplotypes (Fig. 5). Furthermore, these three species exhibit a mean percent divergence between individuals (0.80%) similar to the divergence observed among individuals within the other clusters (Table 3). Other than the grouping of C. occidentalis in California, there was no other geographic clustering observed.