Genome-wide-association analyses
To estimate SNP-based heritability for each trait, we ran Bayesian mixture models for each trait independently using the BAYESR software . We selected this method as it has been found to be more accurate in estimating additive genetic variance explained by SNPs than alternative methods (e.g., BSLMM or LMM ). All models included TL and sex as covariates (except when modelling TL explicitly, which included only sex as a factor), and all traits were standardised to a mean of 0 and standard deviation of 1. We also aimed to estimate pairwise genetic correlations between traits using bivariate models in BAYESR. However, none of these models converged, likely owing to the large sample sizes usually required to estimate genetic covariances, and hence are not reported further.
To identify SNPs underlying phenotypic variation, we performed a GWA using a linear mixed effects model approach with the program GEMMA . This program fits models that control for relatedness among samples and/or population stratification, which reduced false positives even with small sample sizes. GEMMA was selected because it implements generalized linear mixed models for traits with non-normal distributions, and therefore robustly handles data on different scales. These models estimate the linear coefficient for the relationship between each SNP in turn and a given trait. P -values for each SNP were calculated using Wald tests. We used a false discovery rate of 5% to correct for multiple testing and a P -value cut off of 10-5 when identifying whether a SNP was putatively associated with trait variance. All models included sex as a fixed factor and length as a covariate, except for the model with total length as a response variable which just fix sex. The number of SNPs used in each model was dependent on the number of individuals available for each trait, because gemma only models SNPs with less than 5% missingness. We then identified whether there was overlap between regions that were associated with trait variation in our data, and quantitative trait loci (QTLs) previously mapped on the stickleback genome , using LIFTOVER and custom R scripts.
Genome-environment association analyses
We investigated gene-environment associations using latent-factor mixed models (LFMM) in the R package LEA , which models loci against environmental variables while controlling for unobserved latent variables (i.e. spatial structure or autocorrelation). Models were run for 100 000 iterations, with 10 000 burn‐in cycles and 5 replicates. z‐scores were combined from five runs, and we used adjustedP ‐values using the genomic control method (a recalibration procedure which decreases the false discovery rate ). Following guidance of , the resulting P ‐values for the outlier tests were adjusted for multiple testing to Q‐values using the false‐discovery rate method as implemented in the “qvalue” 2.6.0 package. A Q-value for SNPs of < 0.05 was considered significant (i.e. a “candidate SNP”). We compared DIC of models that included one ecological variable at the time to try to identify which ecological variable best predicted the genomic data. To identify whether gene-environment associations were linked to phenotypic variation, we then investigated whether any of the candidate SNPs identified in any of the LFMM analyses were linked (within 5kb) to SNPs identified in GWA analyses.
To explore the molecular function of genomic regions that showed signatures of selection, we first analysed candidate genes for enrichment of molecular functions. To do this, we identified genes that the candidate SNPs were within 5kb of and compared these candidate genes with the reference set of 20 805 genes across the stickleback genome (‘gene universe’). Gene ontology (GO) information was obtained from the stickleback reference genome on ENSEMBL using the R package BIOMART , and functional enrichment was investigated using the package TOPGO 2.42 and the Fisher’s exact test (at P < 0.01). To reduce false positives, we pruned the GO hierarchy by requiring that each GO term had at least 10 annotated genes in our reference list (“nodeSize = 10”). Secondly, physical overlap on the genome between candidate SNPs identified in environment-association analyses and GWA SNPs indicate regions of the genome under selection. For regions of the genome containing both environmentally associated candidate SNPs and GWAs SNPs, we identified (1) the genes in this region (within 5kb), (2) whether haplotypes on that gene in our dataset are predicted to cause variation in protein translation, and (3) the function of the gene. We used the program “snpEff” to detect whether SNPs fall on coding regions changing amino acid sequence.