2.3 Population mapping and statistics
We assessed quality on the resequencing data using FastQC v0.11.5
(Andrews, 2017) before and after filtering, and retained only reads ≥50
bp with a quality score >30 at both the read start and the read end. All
sequence reads were mapped against the Galerucella calmariensis reference genome (Yang, Slotte, Dainat, & Hambäck, 2021) using
NextGenMap version 0.4.12 (Sedlazeck, Rescheneder, & von Haeseler,
2013). The reference genome had an assembly size of 588 Mbp, containing
39,255 scaffolds and 40,031 predicted proteins
with 91.3% and 85.1% complete
orthologs in the genome and proteome, respectively, compared with the
endopterygota_odb10 database (Simão, Waterhouse, Ioannidis,
Kriventseva, & Zdobnov, 2015) (for further details on the reference genome
assembly, see Yang et al., 2021). Mapping rates were similar among
samples (85% to 95%). We filtered the resulting BAM files with
Samtools v1.3.1 (Li et al., 2009) to retain alignments with mapping
quality ≥20 (-q 20).
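
The two retention criteria described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline actually used: the function names are hypothetical, Phred+33 quality encoding is assumed, and "read start and end" is interpreted here as the first and last base of each read, which the text does not specify.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def keep_read(sequence, quality_string, min_length=50, min_quality=30):
    """Keep a read that is >= min_length bp and whose first and last
    base qualities both exceed min_quality (an assumed interpretation)."""
    if len(sequence) < min_length:
        return False
    scores = phred_scores(quality_string)
    return scores[0] > min_quality and scores[-1] > min_quality


def keep_alignment(mapq, min_mapq=20):
    """Mirror `samtools view -q 20`, which skips alignments whose
    mapping quality is smaller than the threshold."""
    return mapq >= min_mapq
```

For example, a 60 bp read whose bases all have Phred quality 40 passes `keep_read`, whereas the same read with a first-base quality of 20 does not.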
We next called SNPs across all samples using FREEBAYES v0.9.21 (Garrison
& Marth, 2012). For SNP filtering, we only kept bi-allelic sites with a
minimum read depth of 5X, a quality score >30 and a maximum
proportion of missing data of 20%. To verify that there was no population
genetic structure among populations within each species, we first
conducted LD-based pruning
(--indep-pairwise 50 10 0.2) and then ran a principal component
analysis (PCA) across all samples using Plink v1.9 (Purcell et al.,
2007) (Supporting information Figure S1). Genetic diversity (nucleotide
polymorphism, π) was estimated for each species using pixy (Korunes &
Samuk, 2021).
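
The SNP filtering criteria and the diversity estimate can be illustrated with a minimal Python sketch. It is not the FreeBayes, Plink or pixy code: the function names and input layout are hypothetical, and genotypes with a depth below 5 are assumed to count as missing, which the text does not state. The π estimate sums pairwise differences and pairwise comparisons across all sites, invariant sites included, in the spirit of the approach pixy describes.

```python
def passes_site_filters(n_alleles, qual, sample_depths,
                        min_depth=5, min_qual=30, max_missing=0.20):
    """Keep bi-allelic sites with QUAL > min_qual and at most
    max_missing of samples missing (depth None or < min_depth)."""
    if not sample_depths or n_alleles != 2:
        return False
    if qual <= min_qual:
        return False
    n_missing = sum(1 for d in sample_depths if d is None or d < min_depth)
    return n_missing / len(sample_depths) <= max_missing


def site_differences(allele_counts):
    """Return (pairwise differences, pairwise comparisons) at one site,
    given the count of each allele among the genotyped chromosomes."""
    n = sum(allele_counts)
    comparisons = n * (n - 1) // 2
    identical = sum(c * (c - 1) // 2 for c in allele_counts)
    return comparisons - identical, comparisons


def nucleotide_diversity(sites):
    """Estimate pi as total differences over total comparisons,
    summed across variant and invariant sites."""
    total_diffs = 0
    total_comps = 0
    for allele_counts in sites:
        diffs, comps = site_differences(allele_counts)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps if total_comps else float("nan")
```

As a usage example, `nucleotide_diversity([[4, 0], [2, 2], [3, 1]])` treats the first site as invariant and divides the 7 pairwise differences found at the other two sites by the 18 pairwise comparisons made across all three.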