Post-process Filtering
The resulting SNP dataset consisted of 19,256,706 SNPs. To ensure the
quality of our dataset, we applied additional filters to the VCF
(Variant Call Format) file. First, we removed all singleton SNPs
from our dataset (as these carry no useful information for addressing
our main questions). Singletons were detected in vcftools with
--singletons and then excluded from the dataset. We then
analyzed the dataset in vcftools (--freq2, --depth,
--site-mean-depth, --site-quality, --missing-indv,
--missing-site, --het) to identify additional quality filters. Based
on the results, we applied the following filters using vcftools:
--minQ 20 (minimum quality score of 20), --min-meanDP 10 (minimum
mean read depth of 10), --max-meanDP 50 (maximum mean read depth of
50), and --maf 0.05 (minor allele frequency cutoff of 0.05). We then used
PLINK v2.0 (Chang et al., 2015) to apply missing-data filters:
--mind 0.2 (remove individuals with >20% missing data) and
--geno 0.2 (remove variants with >20% missing data).
The resulting dataset consisted of
575,096 biallelic SNPs. This SNP set was then pruned for linkage
disequilibrium (LD) to remove SNPs that were in complete or very high
LD. We used PLINK --indep-pairwise with a window of 50 kbp, sliding by
10 kbp, retaining SNPs with pairwise r² < 0.7. This resulted in a
final set of 366,078 SNPs for all downstream analyses.
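
The logic of the filters described above can be illustrated with a minimal Python sketch on a toy genotype matrix. This is not the pipeline used here (which relied on vcftools and PLINK) and it does not reproduce their exact algorithms. All function names and the toy data are hypothetical. Genotypes are coded as alternate-allele counts (0, 1, 2) with None for missing calls, a singleton is taken to be a site with a minor allele count of 1, and LD pruning is a simple greedy pass rather than PLINK's windowed procedure. Quality and depth filters are omitted because they require per-site VCF fields.

```python
# Toy illustration only: hypothetical helpers, not vcftools/PLINK internals.

def minor_allele_count(genos):
    called = [g for g in genos if g is not None]
    alt = sum(called)
    return min(alt, 2 * len(called) - alt)

def maf(genos):
    called = [g for g in genos if g is not None]
    return minor_allele_count(genos) / (2 * len(called)) if called else 0.0

def missing_rate(values):
    return sum(v is None for v in values) / len(values)

def r_squared(a, b):
    """Squared Pearson correlation of genotype counts over jointly called samples."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def filter_sites(sites, min_maf=0.05, max_ind_missing=0.2, max_site_missing=0.2):
    """Apply singleton, MAF, per-individual and per-site missingness filters in order."""
    sites = [(p, g) for p, g in sites if minor_allele_count(g) != 1]  # drop singletons
    sites = [(p, g) for p, g in sites if maf(g) >= min_maf]           # MAF cutoff
    if not sites:
        return [], []
    n_ind = len(sites[0][1])
    keep = [i for i in range(n_ind)
            if missing_rate([g[i] for _, g in sites]) <= max_ind_missing]
    sites = [(p, [g[i] for i in keep]) for p, g in sites]             # drop individuals
    sites = [(p, g) for p, g in sites if missing_rate(g) <= max_site_missing]
    return sites, keep

def ld_prune(sites, window_bp=50_000, max_r2=0.7):
    """Greedy pruning: keep a site unless it exceeds max_r2 with a kept site in the window."""
    kept = []
    for pos, genos in sorted(sites, key=lambda s: s[0]):
        if all(pos - kpos > window_bp or r_squared(genos, kgenos) <= max_r2
               for kpos, kgenos in kept):
            kept.append((pos, genos))
    return kept

# Ten hypothetical individuals; the last one is poorly genotyped.
toy_sites = [
    (1000,  [0, 1, 1, 2, 0, 1, 0, 2, 1, None]),
    (2000,  [0, 1, 1, 2, 0, 1, 0, 2, 1, None]),           # perfect LD with site 1000
    (3000,  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]),              # singleton
    (4000,  [None, None, None, 1, 1, 0, 2, 0, 1, None]),  # high site missingness
    (5000,  [2, 0, 1, 0, 1, 2, 0, 1, 0, None]),
    (6000,  [1, 1, 0, 0, 2, 0, 1, 0, 2, 0]),
    (7000,  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),              # monomorphic, fails MAF
    (90000, [0, 1, 1, 2, 0, 1, 0, 2, 1, 1]),              # correlated with 1000, outside window
]

filtered, kept_individuals = filter_sites(toy_sites)
pruned = ld_prune(filtered)
print("individuals kept:", kept_individuals)
print("sites after filters:", [pos for pos, _ in filtered])
print("sites after LD pruning:", [pos for pos, _ in pruned])
```

In this toy example the site at position 90000 has the same genotypes as the site at position 1000 among retained individuals, but it survives pruning because it lies outside the 50 kbp window, whereas the site at position 2000 is removed.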