2.3- Data Processing
We used the Anacapa Toolkit (Curd et al, 2019) to process the resulting data. Anacapa is a metabarcoding data processing pipeline that enables simultaneous data processing of multiplexed barcodes and libraries. Within Anacapa , we trimmed the TruSeq adapters using cutadapt (Martin, 2011), removed bases with Q-scores below 35 with the FastX-Toolkit (Gordon and Hannon, 2010), and then trimmed, again using cutadapt. We trimmed the first 5 bases on the 5’ end of the forward read and the first 10 bases of the 5’ end of the reverse read for the PITS data set. We trimmed 40 bases off the 5’ end of the forward read and the first 50 bases of the 5’ end of the reverse read for the FITS data set. We used dada2 (Callahan et al. 2016) to merge the forward and reverse reads, remove chimeric sequences, and identify amplicon sequence variants (ASVs). We performed ASV assignment to taxa via global and local alignment usingBowtie2 (Langmead and Salzberg, 2012) to PITS and FITSCRUX reference databases released with Curd et al. 2019. The top 100 hits of the Bowtie2 alignment to reference were bootstrapped with BLCA (Gao et al, 2017) to assign each ASV to a taxon and provide uncertainty estimates. We used a 60% bootstrap confidence threshold of taxonomic assignment, as suggested in the Anacapadocumentation (Curd et al, 2019). Anacapa output two taxonomy tables (one per amplicon) formatted as matrices of the number of reads from each PCR replicate assigned to a given taxa.
We used the negative PCR and DNA extraction controls to detect bioinformatically and remove contaminants and the positive ginger control to infer the rate of barcode swapping. We converted taxonomy tables and the PCR replicate-associated metadata to phyloseq (v. 1.22.3; McMurdie and Holmes 2013) objects using Ranacapa(Kandlikar et al., 2018). We then used the R package decontam(v1.1.0; Davis et al., 2018) with default settings to compare extraction and PCR negative controls to PCR replicates amplified from sediment, and looked bioinformatically for any index hopping between spiral ginger and soil-derived libraries.
To generate data sets for hypothesis testing, we next rarefied our two decontaminated taxonomy tables to generate PCR replicate sets of equal read sampling depth, and then applied a minimum read cutoff to account for low level contamination, PCR error, sequencing error, and undetected index hopping. We used the rarefy_even_depth() function ofphyloseq to rarefy our data at depths of every thousand between 1,000 to 20,000 (ex. 1k, 2k, 3k…). As we increased rarefaction depth, some libraries that were sequenced less deeply dropped out of the analysis. Following rarefaction, we generated three data sets for each rarified library in which we applied minimum read cutoffs of 2, 5, and 10 reads using a custom R script that turns any value below these values to 0. Increasing the minimum read cutoff is more conservative, though may remove the ability to track low abundance taxa. Rarefaction results at different read depths were plotted as taxon accumulation curves at minimum read cutoff of 5 reads, and to account for stochasticity in rarefaction, the average of 25 rarefactions per PCR replicate were plotted. All raw and rarefied taxonomy tables (at rarefaction depths 1,000 to 20,000, with minimum read cutoffs of 2, 5, and 10) are available athttps://github.com/sashirazi/eDNA-PCR-Project/.
We rarefied datasets prior to estimating various Alpha diversity metrics. We used the R package iNEXT for richness extrapolation. We implemented iNEXT with q=0, datatype=”abundance”, knots=40, se=TRUE, conf=0.95, nboot=50. With iNext, we extrapolated replicates to twice their true read sampling depth and recorded the extrapolated diversity statistic. We performed outlier tests on extrapolated observed richness by identifying points that fall outside values of 1.5 times the interquantile range. We calculated observed richness, the Shannon diversity index (Shannon 1948), and Simpson index (Simpson 1949) with the vegan package in R. While observed α diversity considers only taxon presence, the Shannon and Simpson’s estimators consider both the relative abundance (RA) of taxa within a sample in addition to taxon presence. We then performed two-sided t-tests and chi-square tests in Rstat .
We performed statistical tests for beta diversity usingMicrobiomeSeq in R that draws on the vegan andphyloseq packages. We plotted taxon relative abundance barplots and calculated local contribution to beta diversity (LCBD) using the Canberra β diversity measure (a measure within the Bray-Curtis β diversity family of estimators). The LCBD test calculates the contribution to total β diversity estimated for a particular site from each PCR replicate. We considered replicates as outliers when p<0.05.