Different OTU generation methods strongly influence species richness estimates
Overall community composition is captured well by all three OTU generation methods, both in ecological patterns (Fig. 3) and in relative abundance at the phylum level (Fig. 4). However, the methods differ in how they capture and represent the members of these communities, so that different sequences are selected to represent the raw reads in the different datasets. Because denoising depends on abundant seed sequences, it produced fewer OTU_As than the number of OTUs obtained with the two other methods, and entire lineages of rare taxa remained undetected, while a large number of OTU_As were recovered from abundant taxa such as Mortierellomycota (Fig. 4b, S9). The detection limits of the OTU generation methods were compared by clustering the ITS2 region extracted from all OTU representative sequences into approximately genus-level clusters at a 90% sequence similarity threshold (GH_90) and into species-level clusters at either 99% or 97% (SH_99 and SH_97). Only 36% of all genus-level clusters (GH_90) in the dataset were represented by an OTU_A sequence, compared to 94% and 96% for OTU_C and OTU_S, respectively (Table 2). The level of detection for SHs represented by up to 50 reads was lower for OTU_A than for the other methods. In some cases, even close to 300 reads were not enough to detect a SH_99 with OTU_A (Fig. S10). Even the more inclusive methods did not capture exactly the same genus-level diversity, with just over 7% of all GH_90 represented by a sequence recovered by a single method (Table 2). However, no GH_90 was represented only by an OTU_A sequence.
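The per-method detection percentages and the share of clusters recovered by a single method can be tallied from a list of OTU-to-cluster assignments. The following Python sketch is illustrative only and is not the pipeline used in this study: the function name `cluster_coverage`, the input layout (one `(method, cluster_id)` pair per OTU representative sequence) and the toy cluster labels are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_coverage(assignments):
    """Summarize which OTU generation methods represent each cluster.

    assignments: iterable of (method, cluster_id) pairs, one pair per OTU
    representative sequence (hypothetical input layout).
    Returns (coverage, single_method), where coverage maps each method to the
    percentage of all clusters it represents, and single_method maps each
    cluster detected by exactly one method to that method.
    """
    methods_by_cluster = defaultdict(set)
    for method, cluster_id in assignments:
        methods_by_cluster[cluster_id].add(method)

    total = len(methods_by_cluster)
    if total == 0:
        return {}, {}

    methods = sorted({m for ms in methods_by_cluster.values() for m in ms})
    coverage = {
        m: 100.0 * sum(m in ms for ms in methods_by_cluster.values()) / total
        for m in methods
    }
    single_method = {
        cluster: next(iter(ms))
        for cluster, ms in methods_by_cluster.items()
        if len(ms) == 1
    }
    return coverage, single_method


# Toy data, not taken from the study: four hypothetical GH_90 clusters.
toy = [
    ("OTU_A", "GH1"), ("OTU_C", "GH1"), ("OTU_S", "GH1"),
    ("OTU_C", "GH2"), ("OTU_S", "GH2"),
    ("OTU_C", "GH3"),
    ("OTU_S", "GH4"), ("OTU_S", "GH4"),
]
coverage, single_method = cluster_coverage(toy)
```

In this toy input, a cluster counts as detected by a method if at least one OTU from that method falls in it, which mirrors how "represented by" is used in the text above.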
Species richness estimates are heavily influenced by the OTU generation method used, with the lowest numbers estimated with OTU_A at all three ITS2 sequence similarity levels: GH_90, SH_99 and SH_97 (Fig. 5). While OTU_A richness was estimated to saturate close to 1000 in both wet and mesic-dry soil conditions (Fig. S4), these OTUs may represent only about half as many species, since collapsing intraspecific variation leaves around 600 SH_99 and just over 500 SH_97 (Fig. 5). OTU richness estimates are highest for OTU_C at almost 1700, followed by OTU_S at almost 1400 (Fig. S4), and the numbers are only slightly lower when species richness is estimated as SH_99 (Fig. 5). Accepting ITS2 sequence similarity at either 99% or 97% as a proxy for species suggests that clustering into OTU_C or OTU_S detects close to three times as many species as denoising into OTU_A. Of the three methods, OTU_S also has the largest number of SH_99 and SH_97 represented by only one OTU (Fig. S11), suggesting that, in the current dataset, this method provides the best estimate of species richness.
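The relationship between OTU counts and SH counts described above can be illustrated by collapsing OTUs into their assigned species-level clusters and counting how many clusters are represented by exactly one OTU. This is a minimal Python sketch under stated assumptions, not the analysis behind Fig. 5 or Fig. S11: the function `summarize_collapse`, the dictionary input and the toy identifiers are hypothetical.

```python
from collections import Counter


def summarize_collapse(otu_to_cluster):
    """Collapse OTUs into clusters and report simple counts.

    otu_to_cluster: dict mapping OTU id -> cluster id (e.g. an SH_99 label);
    this layout is an assumption made for illustration.
    """
    otus_per_cluster = Counter(otu_to_cluster.values())
    n_otus = len(otu_to_cluster)
    n_clusters = len(otus_per_cluster)
    # Clusters represented by exactly one OTU.
    single_otu_clusters = sum(1 for n in otus_per_cluster.values() if n == 1)
    # Mean number of OTUs per cluster; values near 1 mean that OTU counts
    # track cluster counts closely.
    mean_otus = n_otus / n_clusters if n_clusters else 0.0
    return {
        "n_otus": n_otus,
        "n_clusters": n_clusters,
        "single_otu_clusters": single_otu_clusters,
        "mean_otus_per_cluster": mean_otus,
    }


# Toy data, not taken from the study.
toy = {"otu1": "SH_a", "otu2": "SH_a", "otu3": "SH_b", "otu4": "SH_c"}
summary = summarize_collapse(toy)
```

A method for which most clusters contain a single OTU, and for which the mean number of OTUs per cluster is close to 1, yields OTU counts that can be read more directly as species counts.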