Sequence processing and data analysis
Sequence quality checks were initially conducted in the personal genome machine (PGM) software (Torrent Suite™ v5.6) with default parameters, which performed the following tasks: 1) removal of polyclonals, i.e. mixed clonal libraries on Ion Sphere Particles (ISPs), 2) removal of low-quality sequences, and 3) removal of sequences with low-quality data at the 3′ end of the read.
Unless otherwise stated, all sequence processing was performed using the Quantitative Insights into Microbial Ecology (QIIME) pipeline, v1.9.1, with default parameters (Caporaso et al., 2010). Briefly, raw sequences were filtered with a Phred quality score cut-off of 25 and then demultiplexed. Any raw sequence with one or more mismatches in the primer sequence was detected and excluded. In addition, forward and reverse primer sequences, as well as barcode and adapter sequences, were removed. Chimeric sequences were detected using USEARCH v6.1 (Edgar, 2010) and excluded from analysis. Operational taxonomic unit (OTU) clustering was performed with the open-reference approach in QIIME, using default parameters and a 97% sequence identity threshold. In this approach, clustering is completed with the UCLUST algorithm (Edgar, 2010): sequences are first clustered against a reference database, and sequences that do not match the reference are then clustered de novo (Caporaso et al., 2010). To assign taxonomy to OTUs, candidate OTU sequences were aligned in PyNAST (Caporaso et al., 2009) against the GreenGenes database (v13.8) at 90% sequence identity using UCLUST (Edgar, 2010).
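For illustration only, the read-level filters described above (quality cut-off and primer matching) can be sketched in Python. This is not the QIIME implementation. It assumes that the cut-off of 25 applies to the mean Phred score of a read, that quality strings use the Phred+33 encoding, and that the primer sits at the start of the read and contains no degenerate bases. All function names are hypothetical.

```python
# Minimal sketch of the read-level filters; NOT the QIIME code.
# Assumptions (not stated in the text): the cut-off applies to the mean
# Phred score, qualities are Phred+33 encoded, and the primer is found at
# the 5' end of the read without degenerate (IUPAC) bases.

def mean_phred(quality_string, offset=33):
    """Mean Phred score of an ASCII-encoded quality string."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def primer_mismatches(read_prefix, primer):
    """Number of positions at which the read prefix differs from the primer."""
    return sum(1 for a, b in zip(read_prefix, primer) if a != b)


def keep_read(seq, qual, primer, min_mean_q=25, max_primer_mismatches=0):
    """Return True if the read passes both the quality and the primer filter."""
    if not qual or len(seq) < len(primer):
        return False
    if mean_phred(qual) < min_mean_q:
        return False
    return primer_mismatches(seq[:len(primer)], primer) <= max_primer_mismatches


def trim_primer(seq, primer):
    """Remove the leading primer from a read that has already passed keep_read."""
    return seq[len(primer):]
```

With `max_primer_mismatches=0`, any read with one or more primer mismatches is rejected, which corresponds to the exclusion rule described in the text.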
After sequence processing in QIIME, 6,411,635 high-quality sequences (out of 6,602,610 usable sequences) remained for analysis. After removal of OTUs with unassigned taxonomy and OTUs assigned to Archaea, mitochondria, or chloroplasts, a total of 8,038 unique OTUs comprising 6,317,692 working sequences (out of 6,411,635) remained and were used for all subsequent statistical analyses. After sequence filtering and alignment, any sample with 3,000 or fewer reads was dropped from all statistical analyses. The mean sample depth across all samples was 22,871 reads, and the range was 3,061 to 175,783 reads per gut sample. In total, 278 gut samples were used in the final analysis (Supplementary Table 1).
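The OTU and sample filters described above can be sketched as follows. This is a simplified illustration, not the code used in the study. It assumes an OTU table held as nested dictionaries (OTU to sample to read count) and GreenGenes-style taxonomy strings; the function names and data layout are hypothetical.

```python
# Minimal sketch of the OTU-level and sample-level filters; NOT the study's code.
# Assumptions: the OTU table is a dict {otu_id: {sample_id: read_count}} and
# taxonomy is a dict {otu_id: taxonomy_string}. Both layouts are hypothetical.

EXCLUDED_TERMS = ("unassigned", "archaea", "mitochondria", "chloroplast")
MIN_READS = 3000  # samples with this many reads or fewer are dropped


def is_excluded(taxonomy):
    """True if the taxonomy string matches any excluded group."""
    lowered = taxonomy.lower()
    return any(term in lowered for term in EXCLUDED_TERMS)


def filter_otus(otu_counts, otu_taxonomy):
    """Keep only OTUs whose taxonomy is not in an excluded group."""
    return {
        otu: counts
        for otu, counts in otu_counts.items()
        if not is_excluded(otu_taxonomy[otu])
    }


def sample_depths(otu_counts):
    """Total read count per sample across all OTUs."""
    depths = {}
    for counts in otu_counts.values():
        for sample, n_reads in counts.items():
            depths[sample] = depths.get(sample, 0) + n_reads
    return depths


def retained_samples(otu_counts, min_reads=MIN_READS):
    """Sorted IDs of samples with strictly more than min_reads reads."""
    depths = sample_depths(otu_counts)
    return sorted(s for s, depth in depths.items() if depth > min_reads)
```

The comparison is strict (`depth > min_reads`), so a sample with exactly 3,000 reads is dropped, matching the threshold stated in the text.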