Sequence processing and data analysis
Initial sequence quality checks were performed with the Personal Genome
Machine (PGM) software (Torrent Suite™ v5.6) using default parameters.
These checks comprised: 1) removal of Ion Sphere Particles (ISPs)
carrying mixed clonal libraries, known as polyclonals, 2) removal of
low-quality sequences, and 3) removal of sequences with low-quality data
at the 3’ end of the read.
Unless otherwise stated, all sequence processing was performed using the
Quantitative Insights into Microbial Ecology (QIIME) pipeline, v1.9.1
with default parameters (Caporaso et al., 2010). Briefly, raw
sequences were processed with a Phred quality score cut-off of 25 and
then demultiplexed. Any raw sequence with one or more mismatches in the
primer sequence was detected and excluded. In addition, forward and
reverse primer sequences, as well as barcode and adapter sequences, were
removed. Chimeric sequences were detected using USEARCH v6.1 (Edgar,
2010) and excluded from analysis. To perform operational taxonomic unit
(OTU) clustering, the open reference approach in QIIME was used with
default parameters and a 97% sequence identity threshold. In this
approach, clustering is performed with the UCLUST algorithm (Edgar,
2010): sequences are first clustered against a reference database, and
sequences that remain unassigned are then allowed to cluster de
novo (Caporaso et al., 2010). To assign taxonomy to OTUs, candidate OTU
sequences were aligned in PyNAST (Caporaso et al., 2009) against the
GreenGenes database (v13.8) at 90% sequence identity using UCLUST
(Edgar, 2010).
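
As an illustration of the quality and primer criteria described above, the following minimal Python sketch shows how a Phred cut-off of 25 relates to a per-base error probability and how a read with any primer mismatch could be flagged. It is not QIIME's implementation: the function names and the mean-quality rule are assumptions made for illustration, and degenerate (IUPAC) primer bases are not handled.

```python
PHRED_CUTOFF = 25  # quality cut-off stated in the text


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_quality(quals, cutoff=PHRED_CUTOFF):
    """Assumed rule: keep a read if its mean Phred score meets the cut-off."""
    return bool(quals) and sum(quals) / len(quals) >= cutoff


def primer_mismatches(read, primer):
    """Count positions where the start of the read differs from the primer."""
    prefix = read[:len(primer)]
    mismatches = sum(a != b for a, b in zip(prefix, primer))
    # A read shorter than the primer counts its missing bases as mismatches.
    return mismatches + (len(primer) - len(prefix))


def keep_read(read, quals, primer):
    """Keep reads that pass the quality rule and match the primer exactly."""
    return passes_quality(quals) and primer_mismatches(read, primer) == 0
```

Under the Phred definition, a score of 25 corresponds to an expected error rate of roughly 0.3% per base.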
After sequence processing in QIIME, 6,411,635 high-quality sequences
(out of 6,602,610 usable sequences) remained for analysis. After
removing unassigned OTUs and OTUs assigned to Archaea, mitochondria, and
chloroplasts, 8,038 unique OTUs comprising 6,317,692 sequences (out of
6,411,635) remained and were used for all subsequent statistical
analyses. After sequence filtering and alignment, any samples with 3,000
reads or fewer were dropped from all statistical analyses. The mean
sample depth across all samples was 22,871 reads, with a range of
3,061 to 175,783 reads per gut sample. In the final analysis,
a total of 278 gut samples were used (Supplementary Table 1).
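
The depth filter and the retention figures reported above can be sketched in a few lines of Python. The sample names and per-sample counts in the example are hypothetical; only the 3,000-read threshold and the sequence totals come from the text.

```python
MIN_READS = 3000  # samples with this many reads or fewer are dropped


def filter_by_depth(sample_counts, min_reads=MIN_READS):
    """Return only the samples with strictly more than min_reads reads."""
    return {s: n for s, n in sample_counts.items() if n > min_reads}


def percent_retained(kept, total):
    """Percentage of sequences kept at a filtering step."""
    return 100.0 * kept / total


# Sequence totals reported in the text.
quality_retained = percent_retained(6_411_635, 6_602_610)
taxon_retained = percent_retained(6_317_692, 6_411_635)

# Hypothetical per-sample read counts, for illustration only.
example_counts = {"gut_A": 3000, "gut_B": 3061, "gut_C": 175_783}
kept_samples = filter_by_depth(example_counts)
```

With the reported totals, roughly 97.1% of usable sequences passed quality processing, and roughly 98.5% of those remained after taxon filtering. In the hypothetical example, the sample with exactly 3,000 reads is dropped, consistent with the "3,000 reads or fewer" rule.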