Identification of adaptive variation
To detect putative loci under selection, we used four methods that vary
in whether they account for population structure and whether they allow
for correlations among environmental variables. Specifically, we used a
population structure outlier detection method (pcadapt ), which
does not control for population structure and does not correlate with
environmental variables, and three different methods to estimate
genotype-environment associations (GEAs): a Latent Factor Mixed Model
(LFMM), which is a univariate regression model that summarizes
environmental variation in latent factors that correct the model for
population structure; a redundancy analysis (RDA), which is a
multivariate approach that looks for associations of SNPs with each
environmental variable separately, and a partial redundancy analysis
(pRDA), which is an RDA that accounts for population structure
(Forester, Landguth, et al. 2018; Capblancq et al. 2018).
For the population structure outlier detection analysis we used the
function pcadapt from the R package pcadapt version 4.1.0
(Luu et al. 2017) to detect candidate genes under selection
(‘adaptive’ loci) based on principal components analysis. To determine
the number of KPC to retain, we used a threshold for the
singular values of the loading of each principal component (PC) above
5% of variance explained for the axes to be retained and used these
KPC to control for population structure in the outlier
detection analysis. We selected outlier loci based on an alpha threshold
of 0.01 calculated using the qvalue function from theqvalue R package (Dabney et al. 2010), after correcting
the p -values for multiple comparisons with a Benjamini-Hochberg
Procedure using the function p.adjust .
GEA analyses require that the dataset does not have any missing data, so
we split samples by species, and then imputed missing genotypes based on
the most frequent allele across all individuals of that species.
Imputation did not consider population of origin to avoid inflating
population structure, particularly given the low number of samples in
some populations. We acknowledge that there will likely be a bias for
increasing representation of alleles from populations with more
individuals and higher minor allele frequency (MAF) (Mitt et al.2017). However, this approach is more conservative in terms of the
detection of local adaptation, and thus less likely to lead to false
positives in terms of local adaptation. To test for associations between
genetic and environmental data, we obtained bioclimatic data from across
the study area using WorldClim (Fick & Hijmans 2017), as well as
biomass (ESRI 2016), elevation (USGS 2012), land cover (CCRS et
al. 2017), slope degree (ESRI 2013), soil particle size (SoilPartSize)
as a proxy for burrow stability and heat retention (CONSBIO 2015a), soil
temperature (SoilTemp) (Marchenko 2011) because it is likely important
for hibernation (Goldberg 2018), soil orders (SoilOrder) that reflect
exposure to climatic factors and biological processes (CONSBIO 2015b),
land formation (LandForm) (ESRI 2015), and ecological facets (EcoFacets)
which combine elevation, land formation, slope and heat load index (ESRI
2019). For categorical variables (land cover, soil order, land formation
and ecological facets), we coded each category as present or absent and
analyzed them as independent variables. We used the Shapiro-Wilk
normality test in the shapiro.test function from the statsR package (R Core Team 2013) to determine if each variable was normally
distributed in order to determine which correlation coefficient to use.
For our data, none of the variables were normally distributed, so we
then used the chart.Correlation function from thePerformanceAnalytics R package (Peterson et al. 2014) and
the Kendall correlation estimator to examine correlations among
variables (see Results), and the corr and corrplotfunctions from the corrplot R package (Wei et al. 2017) to
plot the correlation matrix. We selected a subset of variables that were
not highly correlated <|0.7| for subsequent
analyses. In cases where correlations were above
|0.7|, we excluded the variable that was correlated
with a higher number of other variables or with the highest average
correlation values, even if below |0.7|.
To ensure that associations between loci and environmental variables
were not confounded by population structure, for the LFMM we performed a
Principal Components Analysis (PCA) on our imputed dataset using therda function from the R package vegan (Oksanen et
al. 2011). To determine which principal components (PCs) to retain, we
then used the broken stick criterion, based on whether the observed
eigenvalues were higher than the corresponding random broken stick
components (Peres-Neto et al. 2003). We used the retained PCs as
conditions when finding associations between the genetic data with the
univariate uncorrelated environmental PC axes through the functionlfmm_ridge from the lfmm R package (Caye et al.2019). We then evaluated the genomic inflation factor (GIF) and selected
a p -value following François et al. (2016), and then
identified candidate loci using a false discovery rate (FDR) of 0.1.
For the RDA, we used the function rda from the vegan R package.
We then used the R function RsquareAdj to determine the
percentage of genetic variance explained by the RDA axes, after
adjusting for the number of predictors. We then used the functionanova.cca from the vegan R package to check the significance of
our model, as well as of each RDA axis individually, in explaining the
variance in the data with an ANOVA. As a first step, we verified the
variance inflation factors (VIF) of each variable using the functionvif.cca from the vegan R package, and excluded the
variable with the highest VIF until all variables showed values below
10, i.e. no longer representing redundant constraints (Oksanen et
al. 2011). Then, we determined which SNPs were candidates for local
adaptation by identifying those present in the tails of the RDA axes
loadings that were >2.5 standard deviations (two-tailedp -value = 0.012) from the mean.
For the pRDA, we performed the same analysis as for the RDA, except that
before using the rda function, we performed a PCA on the genomic
data to summarize neutral population structure by determining which PCs
had eigenvalues greater than random, using the broken stick criterion.
This analysis identified the number of PCs to condition for in therda function, and then used these PCs to exclude the variance
explained by neutral population structure (Rellstab et al. 2015;
Forester, Lasky, et al. 2018).
For the interspecific analysis (IDGS dataset), we conducted all four
analyses for detecting adaptive loci (pcadapt, LFMM, RDA, and pRDA), and
we compared the SNPs identified as candidates in the four analyses using
the venn.diagram function from the VennDiagram R package
(Chen & Boutros 2011). For the intraspecific analyses (NIDGS and SIDGS
datasets), we only conducted the pRDA analysis.