2.5 Bioinformatics analysis of RAD-seq data
We used the bioinformatics software pipeline, STACKS v.1.44 (Catchen et al., 2013; Catchen et al., 2011) to process the restriction-site-associated DNA markers (RAD-tags) and generate single nucleotide polymorphism (SNP) datasets. First, we executed the “process_radtags” program in STACKS to demultiplex and trim sequence reads by the P1 barcodes and remove low quality reads (Phred quality score less than 20). After removing PCR duplicates with the “clone_filter” script, the processed reads were used to generate RAD loci without a reference genome using “denovo_map.pl” (parameter settings: m = 3 M = 5 n = 4). We empirically determined these parameters to limit the impact of over-splitting loci (see Harvey et al., 2015; Ilut et al., 2014). This involved running thede novo assembly over a wide range of values of M (1–8) with “ustacks”. From these runs, we selected a value of M = 5 since we observed that the percentage of homozygous and heterozygous loci reached a plateau at this value and thus minimized over-splitting of alleles for the final SNP calling.
Stacks calls SNPs (“sstacks”) within RAD loci using a multinomial-based likelihood model that estimates the likelihood of two most frequently observed genotypes at each site and performs a standard likelihood ratio test using a chi-square distribution (Catchen et al., 2011; Hohenlohe et al., 2010). For SNP inference, we used the default alpha significance level of 0.05. Paralogous loci that stacked together were identified and removed by subsequent quality control steps built into STACKS (max number of stacks per loci (m ) = 3; Harvey et al., 2015; Ilut et al., 2014). After the preliminary assembly of catalog loci using “denovo_map.pl”, we ran the STACKS correction mode (rxstacks-cstacks-sstacks) using the bounded SNP model with a 0.05 upper bound for the error rate. The “rxstacks” program made corrections to genotype and haplotype calls based on population information, rebuilt the catalog loci and filtered out loci with average log likelihood ratio of < 8.0.
We used three additional filtering steps to generate a set of high-quality RAD loci for down- stream population genetic analysis. First, we retained only RAD loci that were present in 80% of all samples. Second, we removed RAD loci that contained more than 40 SNPs, as these likely represented sequencing errors or over-clustering of paralogous loci. Lastly, we used the BLAT alignment algorithm (Kent, 2002) to de novo align the RAD loci and removed those that aligned to multiple positions. The final consensus set of RAD loci comprised SNP data from a total of 139 individuals. Genotypes were called, filtered, and bi-allelic SNPs were exported in VCF format using the STACKS “populations” program. SNPs from the last seven bp of the RAD loci were removed as this part of the locus is likely to contain sequence errors at the 3’ end of the reads. The SNP dataset was further filtered with VCFtools v.0.1.14 (Danecek et al., 2011) to remove SNPs below a minor allele frequency (MAF) of 0.05 cutoff to reduce artifacts of sequence and assembly error. The dataset was also filtered to include only one random SNP per RAD locus for use in FastStructure (Raj et al., 2014) in order to avoid linkage disequilibrium between SNPs within RAD loci.