2.3 Test of selection
Multiple consensus sequences of coding sequences for all samples were
extracted using bam2consensus function from BamBam v1.4 (Borowiec,
2016), allowing a minimum read coverage per site of 4X. BamBam uses the
individual bam files that were mapped to the G. calmariensisdraft de-novo genome, and extracts consensus for each CDS region
based upon the genome annotation. We then assessed summary statistics of
the consensus sequences using AMAS v1.0 (Borowiec, 2016). Only CDS
regions with the length >300 and low proportions of missing
values (<10%) were kept for downstream analysis
(N =11,368).
To limit out analysis to orthologous loci among our three species, we
assessed orthology among G. calmariensis protein sets and three
other Coleoptera species (Asian long-horned beetle [Anoplophora
glabripennis ], red flour beetle [Tribolium castaneum ] and
mountain pine beetle [Dendroctonus ponderosae ]) using
OrthoVenn2 (Xu et al., 2019) with default settings. A total of 4,591
single-copy-orthologs (SCO) were identified in G. calmariensis .
Then, we identified these SCOs from the previously identified high
quality consensus sequences, which resulted in 4,154 SCOs for downstream
analysis.
To detect adaptive evolution, we used HDMKPRF (Zhao et al., 2019), which
is an extension of MKPRF to analyse selection across multiple species
(Bustamante, Wakeley, Sawyer, & Hartl, 2001; Sawyer & Hartl, 1992). To
improve detection of selection, HDMKPRF pools information over many loci
and thereby gains power compared with single-gene based methods. The
method then uses a Poisson random field on synonymous and non-synonymous
polymorphisms to estimate posterior distributions for included genes and
simultaneously corrects for differences in population demography, while
accounting for the phylogenetic structure (for details see table 1 in
Zhao et al., 2019). The script for calculating population parameters and
for performing the HDMKPRF to derive selection intensities was kindly
provided by Zhao et al. Here we follow terminology from the MKPRF test
(Bustamante et al., 2001) in referring to the estimated relative
strength of selection per gene as a ‘selection intensity’, akin to a
neutrality index measure (Hahn, 2019).
Estimates of positive selection using population resequencing data are
usually biased downward by the segregation of slightly deleterious
mutations (Bierne & Eyre-Walker, 2004; Fay, Wyckoff, & Wu, 2002). To
minimize the impact of such bias, we removed singleton polymorphisms
from all gene sets using a custom script (Sattath, Elyashiv, Kolodny,
Rinott, & Sella, 2011). In order to test the effect of singleton
removal on the power of detecting adaptive evolution, we applied HDMKPRF
to gene sets before and after removing singletons with 200,000 burn-in
steps, 400,000 total MCMC steps and set a thinning interval of 5. We
performed the same analysis three times and retained the average value
from triplicates, with these settings demonstrating high levels of
concordance. A gene was considered to be under positive selection only
when the 95% posterior confidence interval of selection intensity was
> 0, and similarly under negative selection when the
interval was < 0.
We performed a gene ontology (GO) enrichment test of biological process
terms of genes under positive selection using the BIOCONDUCTOR package
topGO (Alexa & Rahnefuhrer, 2010). The background genes included the
4,154 genes used in HDMKPRF and those genes that were significantly
(positively or negatively) selected were tested against this background
for enrichment of biological process terms (FDR<10%) using
the parent–child algorithm (Grossmann, Bauer, Robinson, & Vingron,
2007). To ensure a robust result, we only analysed GO terms with at
least 5 members (node size=5), and we used EggNOG v5.0 (Huerta-Cepas et
al., 2019) before the analysis to assign Gene Ontology (GO) terms to the
predicted protein sets using Insecta as taxonomic scope to restrict the
functional inferences to an insect-related scale. Afterwards, enriched
GO terms were clustered to representative functional subsets using the
REVIGO Drosophila database (Supek, Bošnjak, Škunca, & Smuc,
2011).