2.3- Data Processing
We used the Anacapa Toolkit (Curd et al, 2019) to process the
resulting data. Anacapa is a metabarcoding data processing
pipeline that enables simultaneous data processing of multiplexed
barcodes and libraries. Within Anacapa , we trimmed the TruSeq
adapters using cutadapt (Martin, 2011), removed bases with
Q-scores below 35 with the FastX-Toolkit (Gordon and Hannon,
2010), and then trimmed, again using cutadapt. We trimmed the first 5
bases on the 5’ end of the forward read and the first 10 bases of the 5’
end of the reverse read for the PITS data set. We trimmed 40 bases off
the 5’ end of the forward read and the first 50 bases of the 5’ end of
the reverse read for the FITS data set. We used dada2 (Callahan
et al. 2016) to merge the forward and reverse reads, remove chimeric
sequences, and identify amplicon sequence variants (ASVs). We performed
ASV assignment to taxa via global and local alignment usingBowtie2 (Langmead and Salzberg, 2012) to PITS and FITSCRUX reference databases released with Curd et al. 2019. The top
100 hits of the Bowtie2 alignment to reference were bootstrapped
with BLCA (Gao et al, 2017) to assign each ASV to a taxon and
provide uncertainty estimates. We used a 60% bootstrap confidence
threshold of taxonomic assignment, as suggested in the Anacapadocumentation (Curd et al, 2019). Anacapa output two taxonomy
tables (one per amplicon) formatted as matrices of the number of reads
from each PCR replicate assigned to a given taxa.
We used the negative PCR and DNA extraction controls to detect
bioinformatically and remove contaminants and the positive ginger
control to infer the rate of barcode swapping. We converted taxonomy
tables and the PCR replicate-associated metadata to phyloseq (v.
1.22.3; McMurdie and Holmes 2013) objects using Ranacapa(Kandlikar et al., 2018). We then used the R package decontam(v1.1.0; Davis et al., 2018) with default settings to compare extraction
and PCR negative controls to PCR replicates amplified from sediment, and
looked bioinformatically for any index hopping between spiral ginger and
soil-derived libraries.
To generate data sets for hypothesis testing, we next rarefied our two
decontaminated taxonomy tables to generate PCR replicate sets of equal
read sampling depth, and then applied a minimum read cutoff to account
for low level contamination, PCR error, sequencing error, and undetected
index hopping. We used the rarefy_even_depth() function ofphyloseq to rarefy our data at depths of every thousand between
1,000 to 20,000 (ex. 1k, 2k, 3k…). As we increased rarefaction
depth, some libraries that were sequenced less deeply dropped out of the
analysis. Following rarefaction, we generated three data sets for each
rarified library in which we applied minimum read cutoffs of 2, 5, and
10 reads using a custom R script that turns any value below these values
to 0. Increasing the minimum read cutoff is more conservative, though
may remove the ability to track low abundance taxa. Rarefaction results
at different read depths were plotted as taxon accumulation curves at
minimum read cutoff of 5 reads, and to account for stochasticity in
rarefaction, the average of 25 rarefactions per PCR replicate were
plotted. All raw and rarefied taxonomy tables (at rarefaction depths
1,000 to 20,000, with minimum read cutoffs of 2, 5, and 10) are
available athttps://github.com/sashirazi/eDNA-PCR-Project/.
We rarefied datasets prior to estimating various Alpha diversity
metrics. We used the R package iNEXT for richness extrapolation.
We implemented iNEXT with q=0, datatype=”abundance”, knots=40, se=TRUE,
conf=0.95, nboot=50. With iNext, we extrapolated replicates to twice
their true read sampling depth and recorded the extrapolated diversity
statistic. We performed outlier tests on extrapolated observed richness
by identifying points that fall outside values of 1.5 times the
interquantile range. We calculated observed richness, the Shannon
diversity index (Shannon 1948), and Simpson index (Simpson 1949) with
the vegan package in R. While observed α diversity considers only
taxon presence, the Shannon and Simpson’s estimators consider both the
relative abundance (RA) of taxa within a sample in addition to taxon
presence. We then performed two-sided t-tests and chi-square tests in Rstat .
We performed statistical tests for beta diversity usingMicrobiomeSeq in R that draws on the vegan andphyloseq packages. We plotted taxon relative abundance barplots
and calculated local contribution to beta diversity (LCBD) using the
Canberra β diversity measure (a measure within the Bray-Curtis β
diversity family of estimators). The LCBD test calculates the
contribution to total β diversity estimated for a particular site from
each PCR replicate. We considered replicates as outliers when
p<0.05.