Environmental Association Analysis (EAA)
Environmental values (dependent variable for EAA models) were assigned
to the 1x1 km grid cells where individuals had been sampled, using QGIS
v2.18.16 (QGIS Delopment Team 2015) and a layer of PCA-scores within
polygons defined by sampling locations. The genotypes for each loci
(independent variables for EAA models) were recoded as 0, 1 or 2
depending on whether they were homozygous for the reference allele,
heterozygous or homozygous for the alternative allele, respectively. As
we pruned the SNP dataset by linkage disequilibrium, we avoided
including any collinear predictors in the models.
Our EAA was constrained by the fact that the 95 individuals genotyped
belonged to only five different populations (18-20 individuals per
population). While this ensures a sufficient characterization of genetic
variation within populations, it leads to unavoidable pseudoreplication
of environmental data (and, depending on the extent of genetic
differentiation and aggregation, also of genetic data). To make sure
that this problem did not affect our conclusions, we used four
randomization approaches, with the same model selection steps applied to
each of them, so that they could be compared.
Firstly, we based our EAA on 1,000 random assignments of genotypes
within populations to 1x1 km sampled grid cells (hereafter
intra-population randomization). Given that environmental variation is
several orders of magnitude larger among than within populations (97.6
% of the variance in PCA scores explained by population adscription),
we chose to highlight the among-populations component of the models by
assuming that all genotypes could occupy every grid cell within the
geographical boundaries of their own population (which were determined
by the discontinuous distribution of suitable habitat). This is more
realistic than assuming a large component of genotype-environment
covariation at a local, within-population scale. Moreover, the genetic
component of such covariation could not be detected by our methods of
outlier detection, which were specifically designed to search for
genetic divergence among populations. Our set of 1,000 intra-population
randomizations should therefore capture the actual pattern of genotypic
and environmental covariation at the scale of the sampled gradient.
Secondly, we randomized 1,000 times the geographical grid cell assigned
to each genotype without taking into account its population (hereafter
inter-population randomization). This allowed us to produce a null
hypothesis of no association between genotypic and environmental
variation, but which takes into account the fact that environmental
values are geographically structured by population of origin (and thus
pseudoreplicated).
Thirdly, we used a randomized set of genetic data to control for the
potential effects of genetic structure among populations and genetic
aggregation within them (hereafter randomization by neutral loci). For
that purpose, we constructed 1,000 new sets by randomly selecting 21
loci from each genotype (i.e. the number of detected outliers out of
6,421 SNP) but without taking into account whether they were
characterized as outliers or not. This was done to account for the fact
that neutral genetic variation (randomly selected SNPs) is expected to
have the same degree of aggregation than variation under selection
(outliers). However, while we should expect that at least a fraction of
the genetic variation subject to selection should be correlated with
environmental variation, the opposite is true for the neutral
differentiation of populations.
Finally, we were aware that outlier analyses should effectively sort
through the randomized SNP databases to identify those that explain the
greatest variance among ‘populations’ (or subgroups). The projection of
that variance into environmental PC-space could in turn draw some shape
around the five sampled populations’ environments that would be
considered as suitable habitat, perhaps leading to significant
genotype-environment associations. Because of that reason, the outlier
selection step needs to be incorporated into our attempts to identify
null expectations. To this end, our fourth randomization approach
permuted the 6,421-SNP dataset and reran all the analysis from the
outlier detection step onwards (hereafter complete randomization). We
randomly assigned genotypes for all the 6,421 SNPs to individuals and
ran Bayescan to detect outliers putatively under selection with the same
methods we used with real data. Then we took the top 21 outlier SNPs to
conduct genotype-environment association models and predict the species’
range. If the environmental signal of the SNPs generated by these
simulations (and, as a consequence, expected by pure chance) is still
smaller than that of real data, this must be interpreted as evidence
that the SNPs selected by our EAA are not only correlated with the
environmental gradient portrayed by our sampling populations, but also
that they provide reasonably good proxies of genetic differentiation
involved in local adaptation. We did not repeat the FLK extension test
with this dataset because no significant phylogenetic (i.e.
among-population) patterning is expected in a genetic dataset where
genotypes are randomly assigned to individuals (and hence to
populations). We performed 500 complete randomization tests instead of
1,000 due to computational limitations.
We performed a backward stepwise multiple regression analysis with each
randomized data set (N = 3,500 EAAs in total), with SNPs as predictors
and environmental scores as the dependent variable using lmfunction in R core. By doing this, we obtained a distribution of
adjusted R2 estimates and p-values for each set of
randomizations. Final model building was achieved by considering the
mean p-values of partial correlations calculated for all the datasets
obtained with the intra-population randomization strategy. In each step,
we removed all SNPs with a mean p-value > 0.5 (calculated
from the p-values of all randomized datasets), and we recalculated all
partial correlations with the remaining SNPs. In the last step, when all
remaining markers had mean p-values < 0.5 (the mean number of
SNPs removed in this step was 14 in intra-population randomizations, 18
in inter-population randomizations, 10.83 in randomizations by neutral
loci, and 11.30 in complete randomizations), we removed all SNPs with
mean p-values > 0.05. Our final model (genotype-environment
association model, or GEAM) was built with the mean intercept and mean
beta values of the remaining SNPs.