2.3 Population mapping and statistics
We assessed the quality of the resequencing data using FastQC v0.11.5
(Andrews, 2017) before and after filtering, and retained only reads ≥50
bp with a quality score >30 at both the read start and end. All
sequence reads were mapped against the Galerucella calmariensis
reference genome, which was the least fragmented genome (Yang, Slotte,
Dainat, & Hambäck, 2021), using NextGenMap version 0.4.12 (Sedlazeck,
Rescheneder, & von Haeseler, 2013). The reference genome had an
assembly size of 588 Mbp, containing 39,255 scaffolds and 40,031
predicted proteins, with 91.3% and 85.1% complete orthologs in the genome
and proteome, respectively, compared with the endopterygota_odb10
database (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015; for
further information on the reference genome assembly, see Yang et al.,
2021). Mapping rates were
similar between samples (85% to 95%). We filtered the resulting BAM
files with Samtools v1.3.1 (Li et al., 2009) to retain alignments with
mapping quality ≥20 (-q 20).
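The mapping-quality threshold can be illustrated with a minimal sketch in plain Python that operates on SAM-formatted text records. This is not the pipeline used here, which relied on Samtools; the function names `keep_alignment` and `filter_sam` are hypothetical. The sketch assumes the documented behaviour of the Samtools -q option, which skips alignments with a MAPQ smaller than the given value, so records with MAPQ equal to 20 are kept.

```python
MIN_MAPQ = 20  # same threshold as the -q 20 option described in the text


def keep_alignment(sam_line: str, min_mapq: int = MIN_MAPQ) -> bool:
    """Return True if the MAPQ field (column 5) of a SAM record is >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4]) >= min_mapq


def filter_sam(lines, min_mapq: int = MIN_MAPQ):
    """Yield header lines unchanged, plus alignment records passing the filter."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line


# Hypothetical records for illustration only (not real data).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tscaffold_1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tscaffold_1\t200\t20\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tscaffold_1\t300\t7\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_sam(example))
```

In this example the header and the records for read1 and read2 are retained, while read3 (MAPQ 7) is dropped.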
We next called SNPs across all samples using FREEBAYES v0.9.21 (Garrison
& Marth, 2012). For SNP filtering, we kept only bi-allelic sites with a
minimum read depth of 5X, a quality score >30, and a maximum proportion
of missing data of 20%. To verify that there was no population genetic
structure among populations within each species, we first conducted
LD-based pruning (--indep-pairwise 50 10 0.2) and then a principal
component analysis (PCA) using Plink v1.9 (Purcell et al., 2007) across
all samples and for each species separately (Supporting information
Figures S1 and S2). Genetic diversity (nucleotide polymorphism, π) was
estimated for each species using pixy (Korunes & Samuk, 2021).
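The site-level SNP filters described above (bi-allelic sites, minimum depth, quality, and missingness) can be sketched in plain Python as follows. This is an illustrative sketch only, not the filtering code used in this study. The `Site` record and the function names are hypothetical, and the sketch assumes that a genotype with a read depth below 5 is counted as missing when computing the per-site proportion of missing data, an interpretation the text does not specify.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MIN_DEPTH = 5        # minimum read depth (5X)
MIN_QUAL = 30.0      # sites must have a quality score > 30
MAX_MISSING = 0.20   # maximum proportion of missing data per site


@dataclass
class Site:
    """Hypothetical, simplified view of one variant site."""
    n_alt_alleles: int                       # number of alternate alleles
    qual: float                              # site quality score
    sample_depths: Sequence[Optional[int]]   # per-sample depth; None = no call


def missing_fraction(site: Site, min_depth: int = MIN_DEPTH) -> float:
    """Proportion of samples that are uncalled or below min_depth (assumption)."""
    n = len(site.sample_depths)
    if n == 0:
        return 1.0
    missing = sum(1 for d in site.sample_depths if d is None or d < min_depth)
    return missing / n


def keep_site(site: Site) -> bool:
    """Apply the bi-allelic, quality, and missingness filters to one site."""
    return (
        site.n_alt_alleles == 1
        and site.qual > MIN_QUAL
        and missing_fraction(site) <= MAX_MISSING
    )


# Hypothetical sites for illustration only (not real data).
sites = [
    Site(1, 50.0, [10, 8, 6, 7, None]),   # passes: 1 of 5 samples missing
    Site(2, 50.0, [10, 8, 6, 7, 9]),      # fails: multi-allelic
    Site(1, 30.0, [10, 8, 6, 7, 9]),      # fails: quality not > 30
    Site(1, 50.0, [10, 3, None, 7, 9]),   # fails: 2 of 5 samples missing
]
passed = [keep_site(s) for s in sites]
```

Treating low-depth genotypes as missing before the missingness filter is one common convention; applying the depth threshold to the site-wide mean depth instead would be an equally plausible reading and would change which sites are kept.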