2.3 Test of selection
Multiple consensus sequences of coding sequences for all samples were extracted using bam2consensus function from BamBam v1.4 (Borowiec, 2016), allowing a minimum read coverage per site of 4X. BamBam uses the individual bam files that were mapped to the G. calmariensisdraft de-novo genome, and extracts consensus for each CDS region based upon the genome annotation. We then assessed summary statistics of the consensus sequences using AMAS v1.0 (Borowiec, 2016). Only CDS regions with the length >300 and low proportions of missing values (<10%) were kept for downstream analysis (N =11,368).
To limit out analysis to orthologous loci among our three species, we assessed orthology among G. calmariensis protein sets and three other Coleoptera species (Asian long-horned beetle [Anoplophora glabripennis ], red flour beetle [Tribolium castaneum ] and mountain pine beetle [Dendroctonus ponderosae ]) using OrthoVenn2 (Xu et al., 2019) with default settings. A total of 4,591 single-copy-orthologs (SCO) were identified in G. calmariensis . Then, we identified these SCOs from the previously identified high quality consensus sequences, which resulted in 4,154 SCOs for downstream analysis.
To detect adaptive evolution, we used HDMKPRF (Zhao et al., 2019), which is an extension of MKPRF to analyse selection across multiple species (Bustamante, Wakeley, Sawyer, & Hartl, 2001; Sawyer & Hartl, 1992). To improve detection of selection, HDMKPRF pools information over many loci and thereby gains power compared with single-gene based methods. The method then uses a Poisson random field on synonymous and non-synonymous polymorphisms to estimate posterior distributions for included genes and simultaneously corrects for differences in population demography, while accounting for the phylogenetic structure (for details see table 1 in Zhao et al., 2019). The script for calculating population parameters and for performing the HDMKPRF to derive selection intensities was kindly provided by Zhao et al. Here we follow terminology from the MKPRF test (Bustamante et al., 2001) in referring to the estimated relative strength of selection per gene as a ‘selection intensity’, akin to a neutrality index measure (Hahn, 2019).
Estimates of positive selection using population resequencing data are usually biased downward by the segregation of slightly deleterious mutations (Bierne & Eyre-Walker, 2004; Fay, Wyckoff, & Wu, 2002). To minimize the impact of such bias, we removed singleton polymorphisms from all gene sets using a custom script (Sattath, Elyashiv, Kolodny, Rinott, & Sella, 2011). In order to test the effect of singleton removal on the power of detecting adaptive evolution, we applied HDMKPRF to gene sets before and after removing singletons with 200,000 burn-in steps, 400,000 total MCMC steps and set a thinning interval of 5. We performed the same analysis three times and retained the average value from triplicates, with these settings demonstrating high levels of concordance. A gene was considered to be under positive selection only when the 95% posterior confidence interval of selection intensity was > 0, and similarly under negative selection when the interval was < 0.
We performed a gene ontology (GO) enrichment test of biological process terms of genes under positive selection using the BIOCONDUCTOR package topGO (Alexa & Rahnefuhrer, 2010). The background genes included the 4,154 genes used in HDMKPRF and those genes that were significantly (positively or negatively) selected were tested against this background for enrichment of biological process terms (FDR<10%) using the parent–child algorithm (Grossmann, Bauer, Robinson, & Vingron, 2007). To ensure a robust result, we only analysed GO terms with at least 5 members (node size=5), and we used EggNOG v5.0 (Huerta-Cepas et al., 2019) before the analysis to assign Gene Ontology (GO) terms to the predicted protein sets using Insecta as taxonomic scope to restrict the functional inferences to an insect-related scale. Afterwards, enriched GO terms were clustered to representative functional subsets using the REVIGO Drosophila database (Supek, Bošnjak, Škunca, & Smuc, 2011).