Post-process Filtering
The resulting SNP dataset consisted of 19,256,706 SNPs. To ensure the quality of our dataset, we applied additional filters to the VCF (Variant Call Format) file. First, we removed all singleton SNPs, as these carry no useful information for addressing our main questions. Singletons were identified in vcftools using --singletons and the corresponding variants were removed from the dataset. We then summarized the dataset in vcftools (--freq2, --depth, --site-mean-depth, --site-quality, --missing-indv, --missing-site, --het) to choose additional quality filters. Based on the results, we applied the following filters in vcftools: --minQ 20 (minimum quality score of 20), --min-meanDP 10 (minimum mean read depth per site), --max-meanDP 50 (maximum mean read depth per site), and --maf 0.05 (minor allele frequency cutoff of 0.05). We then used PLINK v2.0 (Chang et al., 2015) to apply missing-data filters: --mind 0.2 (remove individuals with >20% missing data) and --geno 0.2 (remove variants with >20% missing data). The resulting dataset consisted of 575,096 biallelic SNPs. This SNP set was then pruned for linkage disequilibrium (LD) to remove SNPs in complete or very high LD. We used PLINK --indep-pairwise with a window of 50 kbp, sliding by 10 kbp, and an r² threshold of 0.7. This resulted in a final set of 366,078 SNPs for all downstream analyses.
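
For readers who want to see how the site-level thresholds combine, the following is a minimal Python sketch, not the pipeline used in this study. It restates the quality, mean-depth, minor allele frequency, and per-variant missingness cutoffs as a single predicate and applies it to a handful of invented toy records. The `Site` class, the `passes_site_filters` function, and all toy values are hypothetical. Treating every boundary as inclusive is an assumption; the exact boundary behaviour of each vcftools and PLINK option should be checked against the tool documentation. The per-individual filter (--mind) and the LD pruning step are not modelled.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
MIN_QUAL = 20.0      # --minQ 20
MIN_MEAN_DP = 10.0   # --min-meanDP 10
MAX_MEAN_DP = 50.0   # --max-meanDP 50
MIN_MAF = 0.05       # --maf 0.05
MAX_MISSING = 0.2    # --geno 0.2 (per-variant missingness)


@dataclass
class Site:
    """Hypothetical per-site summary; not a real VCF record."""
    name: str
    qual: float          # site quality score
    mean_depth: float    # mean read depth across individuals
    maf: float           # minor allele frequency
    missing_rate: float  # fraction of individuals with a missing genotype


def passes_site_filters(site: Site) -> bool:
    """Return True if the site satisfies all four thresholds.

    Boundaries are treated as inclusive here, which is an assumption.
    """
    return (
        site.qual >= MIN_QUAL
        and MIN_MEAN_DP <= site.mean_depth <= MAX_MEAN_DP
        and site.maf >= MIN_MAF
        and site.missing_rate <= MAX_MISSING
    )


# Invented toy records, one per failure mode plus two passing sites.
toy_sites = [
    Site("A", qual=35, mean_depth=22, maf=0.12, missing_rate=0.05),  # passes
    Site("B", qual=12, mean_depth=30, maf=0.30, missing_rate=0.00),  # low quality
    Site("C", qual=60, mean_depth=75, maf=0.20, missing_rate=0.10),  # depth too high
    Site("D", qual=45, mean_depth=18, maf=0.01, missing_rate=0.02),  # rare allele
    Site("E", qual=50, mean_depth=10, maf=0.05, missing_rate=0.20),  # on every boundary
]

kept = [site for site in toy_sites if passes_site_filters(site)]
kept_names = [site.name for site in kept]
print(kept_names)
```

In this toy set only sites A and E are retained; each of the other three fails exactly one threshold, which makes it easy to see which cutoff removes which kind of site.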