Signatures of recurrent selection
We filtered the total VCF, annotated with SnpEff, and retained only non-synonymous (replacement) or synonymous (silent) SNPs. We then compared these polymorphisms to the differences identified relative to D. falleni and D. phalerata to polarize changes to specific branches. Specifically, we sought to identify sites that are polymorphic in our D. innubila populations or that are substitutions fixed along the D. innubila branch of the phylogeny. We used the counts of fixed and polymorphic silent and replacement sites per gene to estimate McDonald-Kreitman-based statistics, specifically direction of selection (DoS) (McDonald and Kreitman 1991; Smith and Eyre-Walker 2002; Stoletzki and Eyre-Walker 2011). We also used these counts in SnIPRE (Eilertson et al. 2012), which reframes McDonald-Kreitman-based statistics as a linear model. SnIPRE takes into account the total number of non-synonymous and synonymous mutations occurring in user-defined categories to predict the expected number of these substitutions, and calculates a selection effect from the observed and expected numbers of mutations. We calculated the SnIPRE selection effect for each gene using the total number of mutations on the chromosome of the focal gene. Using FlyBase gene ontologies (Gramates et al. 2017), we sorted each gene into an immune gene category or classed it as a background gene, allowing a gene to be classed in multiple immune categories. We fit a GLM to identify functional categories with excessively high estimates of adaptation, considering multiple covariates:
\begin{equation} \mathrm{Statistic} \sim \mathrm{Population} + \mathrm{Gene\ group} + \left(\mathrm{Gene\ group} * \mathrm{Population}\right) + \mathrm{Chromosome} + \mathrm{Chromosome{:}Position} \nonumber \end{equation}
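For reference, the direction of selection statistic used as one of the response variables above can be computed directly from the four per-gene counts, as DoS = Dn/(Dn+Ds) - Pn/(Pn+Ps) (Stoletzki and Eyre-Walker 2011). The following is a minimal sketch, not the pipeline used here; the function name and the choice to return None when either denominator is zero are assumptions made for illustration.

```python
from typing import Optional


def direction_of_selection(dn: int, ds: int, pn: int, ps: int) -> Optional[float]:
    """Return DoS = Dn/(Dn+Ds) - Pn/(Pn+Ps) for one gene.

    dn, ds: fixed replacement and silent differences on the focal branch.
    pn, ps: polymorphic replacement and silent sites in the population.
    Returns None when the gene has no fixed differences or no polymorphisms,
    since the statistic is undefined there (an assumption of this sketch).
    """
    if dn + ds == 0 or pn + ps == 0:
        return None
    return dn / (dn + ds) - pn / (pn + ps)


# Hypothetical counts: an excess of fixed replacement changes gives DoS > 0.
example = direction_of_selection(dn=6, ds=4, pn=2, ps=8)
```

Positive values are consistent with adaptive substitution, and negative values with segregating weakly deleterious variants.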
We then calculated the difference in each statistic between each focal immune gene and a randomly sampled nearby (within 100 kbp) background gene, and averaged these differences for each immune category over 10,000 replicates, following Chapman et al. (2019).
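The paired resampling described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the code used for the analysis: genes are represented as hypothetical (chromosome, position, statistic) tuples, distance is measured between single gene coordinates, and focal genes with no background gene inside the window are skipped.

```python
import random
from statistics import mean


def nearby_background_difference(focal, background, window=100_000,
                                 replicates=10_000, seed=0):
    """Mean difference between focal genes and random nearby background genes.

    focal, background: lists of (chromosome, position, statistic) tuples.
    For each replicate, every focal gene is paired with one background gene
    drawn at random from those on the same chromosome within `window` bp,
    and the focal-minus-background differences are averaged.
    Returns the mean over replicates, or None if no focal gene has a neighbour.
    """
    rng = random.Random(seed)
    # Precompute the pool of eligible background statistics per focal gene.
    pools = []
    for chrom, pos, stat in focal:
        near = [b_stat for b_chrom, b_pos, b_stat in background
                if b_chrom == chrom and abs(b_pos - pos) <= window]
        if near:
            pools.append((stat, near))
    if not pools:
        return None
    replicate_means = []
    for _ in range(replicates):
        diffs = [stat - rng.choice(near) for stat, near in pools]
        replicate_means.append(mean(diffs))
    return mean(replicate_means)
```

Pairing each immune gene with a neighbouring background gene is intended to control for local variation along the chromosome, so a positive mean difference indicates a higher statistic in the immune category than in its genomic surroundings.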
To confirm these results, we also used AsymptoticMK (Haller and Messer 2017) to calculate asymptotic α for each gene category. We generated the non-synonymous and synonymous site frequency spectra for each gene category and used them in AsymptoticMK to calculate asymptotic α and its 95% confidence interval. We then used a permutation test to assess whether functional categories of interest differed significantly in asymptotic α from the remaining categories.
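The logic of a label-permutation test of this kind can be sketched generically. The example below is an assumption-laden illustration, not the procedure used here: it permutes group labels over per-gene values and compares a summary statistic between groups, with the mean standing in for asymptotic α, which in practice would be re-estimated with AsymptoticMK from the pooled site frequency spectra of each permuted group. The function name and its parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(focal, rest, statistic=mean,
                        permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference between two groups.

    focal, rest: sequences of per-gene values for the category of interest
    and for all remaining categories.
    statistic: function summarising one group (placeholder for asymptotic
    alpha in this sketch).
    """
    rng = random.Random(seed)
    observed = abs(statistic(focal) - statistic(rest))
    pooled = list(focal) + list(rest)
    n_focal = len(focal)
    extreme = 0
    for _ in range(permutations):
        # Shuffle labels by shuffling values, then split at the focal size.
        rng.shuffle(pooled)
        diff = abs(statistic(pooled[:n_focal]) - statistic(pooled[n_focal:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (permutations + 1)
```

The add-one correction counts the observed labelling as one of the permutations, which keeps the estimate conservative when few permutations are run.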