Introduction
Microbial community composition in soil can be assessed in metabarcoding studies of environmental DNA (eDNA) extracts by amplification and sequencing of barcoding regions, often targeting the ribosomal operon. Richness estimates based on eDNA metabarcoding studies indicate that global fungal species richness is at least ten times higher than the number of formally described species (Spatafora et al., 2017), including several class level lineages of currently undescribed fungi (Tedersoo et al., 2017). Non-fungal microeukaryotes, collectively referred to as protists throughout the text, are far less studied in soil compared to fungi, but are increasingly recognized for their diverse ecosystem functions (Geisen, 2016). Recent molecular studies using eDNA have dramatically increased our knowledge of protist diversity in different environments, even indicating that diversity may be higher in soil than in water (Burki et al., 2021; Geisen et al., 2018; Mahé et al., 2017).
Challenges in characterizing soil microeukaryotic communities from metabarcoding data include biases associated with primer choice, tradeoffs between number of samples and sequencing depth, method for estimating species richness as well as accuracy of taxonomic identification of community members. Some of these aspects are discussed below and further explored in this paper. The two internal transcribed spacer (ITS1 and ITS2) are non-coding, hyper-variable portions of the rDNA operon, widely accepted as marker regions for characterization of fungal communities (Schoch et al., 2012). However, due to intraspecific variation and sequencing error, community composition of known and novel species cannot be directly identified from the massive numbers of unique reads generated by high-throughput eDNA sequencing (Ryberg, 2015). Instead, sequence reads are clustered into operational taxonomic units (OTUs) and/or denoised to determine amplicon sequence variants (ASVs), these units are then used as species proxies for estimating alpha and beta diversity. While different bioinformatic tools capture different representations of a sequenced soil microeukaryotic community, large scale ecological patterns are similarly represented with set threshold OTUs compared to denoised ASVs for short read amplicons (Glassman & Martiny, 2018). Further, spatio-temporal turnover patterns are consistently captured using both different sequencing technologies and different amplicon lengths (Furneaux et al., 2021). However, suitable species proxies and taxonomic assignments are important to go beyond large-scale patterns.
The most common approach for OTU generation has been abundance-based greedy clustering of reads using fixed similarity thresholds relative to a centroid sequence, as implemented in USEARCH (Edgar, 2013) and VSEARCH (Rognes et al., 2016). Clustering thresholds are often chosen based on estimates of the level of variation present within species (Tedersoo et al., 2014). However, no universal threshold accurately separates all species (Nilsson et al., 2008; Vu et al., 2019), and a more stringent threshold may cause two sequences which belong to the same species to separate into different OTUs, i.e., splitting of species, while a less stringent threshold may artificially lump multiple species together into a single OTU (Ryberg, 2015). In this study, we use the term OTU_C to refer to the output of such centroid based method. In single-linkage clustering on the other hand, a read is joined to a cluster if it is within the set similarity threshold to any other read in the cluster, i.e. not just compared to a centroid sequence. This approach has been used with similarity thresholds much smaller than the expected sequencing error (e.g., 1 bp) to delimit more “natural” OTUs, as applied in swarm clustering (Mahé et al., 2014). Very small similarity thresholds are only appropriate in a densely populated error space, and the presence of intermediate sequences can cause single-linkage clustering to group fairly distant sequences into an OTU (Mahé et al., 2017; Mahé et al., 2014). In this paper, we use the term OTU_S for the output of such single-linkage based clustering method. Clustering based on similarity thresholds, whether centroid-based or single-linkage, does not differentiate sequencing errors from biological variation. Denoising algorithms, such as DADA2, have been developed to identify ASVs present in a sample, by removing sequencing errors using a model which incorporates the base quality scores and read abundances (Callahan et al., 2016). This approach captures both within and between species variation, even as little as one base pair difference, and so ASVs may be further clustered in order to serve as proxies for species (Frøslev et al., 2017). However, DADA2 does rely on the presence of at least two identical sequences as seeds for generating ASVs, so the method can perform poorly when the majority of reads are singletons (Furneaux et al., 2021). For consistency, instead of ASV we use OTU_A to refer to the output of this denoising method.
Assigning taxonomy to OTUs may allow for functional analysis of community composition, but is highly dependent on curated reference datasets such as the PR2 for protists (Del Campo et al., 2018; Guillou et al., 2012) and UNITE for fungi (Kõljalg et al., 2013). In the well-established fungal sequence database UNITE, OTUs are derived using a range of thresholds from 97% to 99.5% similarity across the ITS2 region and referred to as species hypotheses (SH) with unique numbers and known species names when available (Kõljalg et al., 2013). The development of PacBio sequencing technology (Pacific Biosciences, Menlo Park, CA, SA) has allowed longer eDNA amplicons, including both variable spacers and more conserved functional rDNA regions, to be sequenced from complex samples. In the absence of matching reference sequences, taxonomic assignment of novel lineages is possible based on phylogenetic inference using the more conserved rDNA small subunit (SSU) (Jamy et al., 2020) and/or large subunit (LSU) sequences (Furneaux et al., 2021; Leho Tedersoo et al., 2017). The benefit of phylogenetically supported taxonomic assignment of OTUs is particularly relevant in communities consisting mostly of poorly characterized lineages (Kalsoom Khan et al., 2020).
As a case study to test the impact of different OUT generation methods on apha and beta diversity estaimtes, we chose Kungsängen Nature Reserve, a semi-natural grassland located in Uppsala, Sweden. The site has drawn frequent visitors since Linnaeus’s time because of its vegetation and bird-life. It is also home to the largest population of the plant F. meleagris (Liliaceae) in Sweden, where it was naturalized in the 18th century after being used as a popular garden flower since the 17th century (Linnaeus, 1921 [1753]; Zhang, 1983). It is from this location that the flower draws its Swedish common name, “Kungsängslilja”. Plant diversity has been monitored at the site since 1940 with permanent east – west transect across the field (Sernander, 1948; Zhang, 1983). In this study, we revisited two of the permanent transects at the Kungsängen Nature Reserve and collected plant community data to identify the abrupt change in meadow plant community from the wetter part towards the river, visually distinct from the mesic-dry part further inland. Soil samples were collected on both sides of this transtition zone to test if belowground community compositional shift across the transition from wet to mesic-dry parts of the meadow was consistently captured with different OTU generation methods. Further, the effect of OTU generation method on the characterized community of soil microeukaryotes was explored for richness estimates, taxonomic and phylogentic resolution and detection limits of rare taxa. Apart from providing the first belowground community observations from this study site, we outline an analysis of detection limits and phylogenetic resolution of three different OTU generation methods applied to long read metabarcoding data. This approach is potentially important for the field of eDNA community analysis as long read metabarcoding is becoming an increasingly applied methodology (Furneaux et al., 2021; Jamy et al., 2020; Tedersoo et al., 2017). Burki et al 2021.