2.2 | Filtering raw reads and variant calling
We used Trim Galore 0.4.2 (available here: https://github.com/FelixKrueger/TrimGalore), a wrapper script for Cutadapt (Martin, 2011), to remove sequencing adaptors and to trim low-quality ends (Phred quality score below 30) from reads. Reads were aligned to the caribou reference genome (Taylor et al., 2019) using Bowtie2 2.3.0 (Langmead & Salzberg, 2012), and the SAM files were converted to BAM files using Samtools 1.5 (Li et al., 2009). We removed duplicate reads and added correct read group information to each BAM file using Picard 2.17.3 (available here: http://broadinstitute.github.io/picard/). We then sorted each BAM file using Samtools 1.5 and built an index using Picard. All BAM files were checked using FastQC 0.11.8 (Andrews, 2010).
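The trimming, alignment, and BAM-processing steps above could be run roughly as follows; file names, read-group values, and paths are illustrative placeholders, not the exact commands used in this study.

```shell
# Adaptor removal and quality trimming (Phred < 30) with Trim Galore
trim_galore --quality 30 --paired sample_R1.fastq.gz sample_R2.fastq.gz

# Align to the caribou reference with Bowtie2, then convert SAM to BAM
bowtie2-build caribou_ref.fasta caribou_ref
bowtie2 -x caribou_ref \
  -1 sample_R1_val_1.fq.gz -2 sample_R2_val_2.fq.gz \
  -S sample.sam
samtools view -bS sample.sam > sample.bam

# Remove duplicates and add read group information with Picard
java -jar picard.jar MarkDuplicates \
  I=sample.bam O=sample.dedup.bam M=dup_metrics.txt REMOVE_DUPLICATES=true
java -jar picard.jar AddOrReplaceReadGroups \
  I=sample.dedup.bam O=sample.rg.bam \
  RGID=sample1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1

# Sort with Samtools, index with Picard, and check with FastQC
samtools sort -o sample.sorted.bam sample.rg.bam
java -jar picard.jar BuildBamIndex I=sample.sorted.bam
fastqc sample.sorted.bam
```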
In addition to the 30 caribou genomes we sequenced, we also used the genome of a domesticated reindeer from Inner Mongolia, sequenced by Li et al. (2017). The reads were downloaded from NCBI (SRR5763125–SRR5763133) and mapped to the caribou reference genome using the same methods as above. FastQC detected adaptor contamination as well as duplicate reads, so we used ClipReads in GATK to remove the Nextera adaptor contamination and removed duplicates using Picard. Re-checking the file with FastQC confirmed that the contamination had been removed. Some sequence duplication was still detected, however, and so results from the reindeer sequence may need to be treated with caution.
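A sketch of the reindeer reprocessing described above. The sra-tools download route and the specific adaptor sequence passed to ClipReads are our assumptions (CTGTCTCTTATACACATCT is the standard Nextera transposase sequence), not the authors' exact invocation.

```shell
# Download the reindeer runs from NCBI SRA (sra-tools; one run shown)
prefetch SRR5763125
fasterq-dump SRR5763125

# Remove residual Nextera adaptor sequence with ClipReads (GATK 3)
java -jar GenomeAnalysisTK.jar -T ClipReads \
  -R caribou_ref.fasta -I reindeer.bam -o reindeer.clipped.bam \
  -X CTGTCTCTTATACACATCT -CR HARDCLIP_BASES

# Remove duplicates with Picard, then re-check with FastQC
java -jar picard.jar MarkDuplicates \
  I=reindeer.clipped.bam O=reindeer.dedup.bam \
  M=reindeer_dup_metrics.txt REMOVE_DUPLICATES=true
fastqc reindeer.dedup.bam
```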
We ran each BAM file through BUSCO (Benchmarking Universal Single-Copy Orthologs 3.0.2; Waterhouse et al., 2018) to reconstruct 4,104 conserved mammalian genes and assess the completeness of each genome. As our reference genome reconstructed 3,820 (93.1%) complete mammalian BUSCO genes (Taylor et al., 2019), this represents an upper limit for our re-sequenced individuals. We used HaplotypeCaller in GATK 3.8 (McKenna et al., 2010) to call variants and produce a variant call format (VCF) file for each caribou. Individual VCF files were combined using the CombineGVCFs function, and we then performed joint genotyping using GenotypeGVCFs, both in GATK, to produce a VCF file with all caribou and the reindeer. For some PCAs (see below), we also made VCF files containing subsets of individuals.
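The BUSCO assessment and GVCF-based calling workflow might look like the following. Note that BUSCO 3 takes a FASTA input, so we assume a consensus sequence derived from each BAM; all file names and the two-sample CombineGVCFs call are illustrative.

```shell
# Assess completeness against the 4,104-gene mammalian BUSCO set
# (input is a consensus FASTA derived from the BAM; assumed here)
run_BUSCO.py -i sample_consensus.fasta -l mammalia_odb9 -m genome -o sample_busco

# Call variants per individual with HaplotypeCaller in GVCF mode (GATK 3.8)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
  -R caribou_ref.fasta -I sample.sorted.bam \
  --emitRefConfidence GVCF -o sample.g.vcf.gz

# Combine the individual GVCFs, then joint-genotype across all samples
java -jar GenomeAnalysisTK.jar -T CombineGVCFs \
  -R caribou_ref.fasta -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
  -o combined.g.vcf.gz
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
  -R caribou_ref.fasta -V combined.g.vcf.gz -o all_individuals.vcf.gz
```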
We downloaded the raw reads for a Sitka deer (Odocoileus hemionus sitkensis) genome from the NCBI database (BioProject PRJNA476345, run SRR7407804), sequenced as part of the CanSeq150 Initiative, to use as an outgroup. We aligned and filtered the reads in the same way as for the caribou genomes to produce an individual VCF file. We then used the CombineGVCFs function and performed joint genotyping using GenotypeGVCFs, both in GATK, to produce a VCF file with all caribou, the reindeer, and the Sitka deer for analyses requiring an outgroup.
We performed additional filtering on the combined VCF files to ensure quality, using VCFtools 0.1.14 (Danecek et al., 2011) in two rounds. First, we removed indels and any site with a depth of less than 10 or more than 80 (roughly double the average depth across the genome), and removed any low-quality genotype calls (quality score below 20), which in VCFtools are changed to missing data. In the second round, we removed sites with more than 10% missing data. We did not filter to remove SNPs with a minor allele frequency (MAF) of less than 0.05, as we have only one or two individuals from each location and such a filter would remove the private sites; instead, we relied on very high depth and stringent filtering to ensure a high-quality dataset. However, we did conduct the PCAs using an MAF filter, and these looked identical to the data without the MAF filter (Figures S2–S5). The combined VCF file used for analyses with all individuals apart from the Sitka deer contained 34,573,476 SNPs, and the VCF including the Sitka deer contained 65,412,957 SNPs.
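The two filtering rounds could be expressed in VCFtools as below; the choice of site-level mean-depth flags (--min-meanDP/--max-meanDP) rather than per-genotype depth flags is our interpretation of "site depth", and output names are placeholders.

```shell
# Round 1: drop indels, enforce mean site depth 10-80, and set genotype
# calls with quality below 20 to missing
vcftools --gzvcf all_individuals.vcf.gz \
  --remove-indels --min-meanDP 10 --max-meanDP 80 --minGQ 20 \
  --recode --recode-INFO-all --out round1

# Round 2: keep only sites where at least 90% of genotypes are present
vcftools --vcf round1.recode.vcf \
  --max-missing 0.9 \
  --recode --recode-INFO-all --out round2
```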