Introduction
Microbial community composition in soil can be assessed in metabarcoding
studies of environmental DNA (eDNA) extracts by amplification and
sequencing of barcoding regions, often targeting the ribosomal operon.
Richness estimates based on eDNA metabarcoding studies indicate that
global fungal species richness is at least ten times higher than the
number of formally described species (Spatafora et al., 2017), including
several class level lineages of currently undescribed fungi (Tedersoo et
al., 2017). Non-fungal microeukaryotes, collectively referred to as
protists throughout the text, are far less studied in soil compared to
fungi, but are increasingly recognized for their diverse ecosystem
functions (Geisen, 2016). Recent molecular studies using eDNA have
dramatically increased our knowledge of protist diversity in different
environments, even indicating that diversity may be higher in soil than
in water (Burki et al., 2021; Geisen et al., 2018; Mahé et al., 2017).
Challenges in characterizing soil microeukaryotic communities from
metabarcoding data include biases associated with primer choice,
tradeoffs between number of samples and sequencing depth, method for
estimating species richness as well as accuracy of taxonomic
identification of community members. Some of these aspects are discussed
below and further explored in this paper. The two internal transcribed
spacer (ITS1 and ITS2) are non-coding, hyper-variable portions of the
rDNA operon, widely accepted as marker regions for characterization of
fungal communities (Schoch et al., 2012). However, due to intraspecific
variation and sequencing error, community composition of known and novel
species cannot be directly identified from the massive numbers of unique
reads generated by high-throughput eDNA sequencing (Ryberg, 2015).
Instead, sequence reads are clustered into operational taxonomic units
(OTUs) and/or denoised to determine amplicon sequence variants (ASVs),
these units are then used as species proxies for estimating alpha and
beta diversity. While different bioinformatic tools capture different
representations of a sequenced soil microeukaryotic community, large
scale ecological patterns are similarly represented with set threshold
OTUs compared to denoised ASVs for short read amplicons (Glassman &
Martiny, 2018). Further, spatio-temporal turnover patterns are
consistently captured using both different sequencing technologies and
different amplicon lengths (Furneaux et al., 2021). However, suitable
species proxies and taxonomic assignments are important to go beyond
large-scale patterns.
The most common approach for OTU generation has been abundance-based
greedy clustering of reads using fixed similarity thresholds relative to
a centroid sequence, as implemented in USEARCH (Edgar, 2013) and VSEARCH
(Rognes et al., 2016). Clustering thresholds are often chosen based on
estimates of the level of variation present within species (Tedersoo et
al., 2014). However, no universal threshold accurately separates all
species (Nilsson et al., 2008; Vu et al., 2019), and a more stringent
threshold may cause two sequences which belong to the same species to
separate into different OTUs, i.e., splitting of species, while a less
stringent threshold may artificially lump multiple species together into
a single OTU (Ryberg, 2015). In this study, we use the term OTU_C to
refer to the output of such centroid based method. In single-linkage
clustering on the other hand, a read is joined to a cluster if it is
within the set similarity threshold to any other read in the cluster,
i.e. not just compared to a centroid sequence. This approach has been
used with similarity thresholds much smaller than the expected
sequencing error (e.g., 1 bp) to delimit more “natural” OTUs, as
applied in swarm clustering (Mahé et al., 2014). Very small similarity
thresholds are only appropriate in a densely populated error space, and
the presence of intermediate sequences can cause single-linkage
clustering to group fairly distant sequences into an OTU (Mahé et al.,
2017; Mahé et al., 2014). In this paper, we use the term OTU_S for the
output of such single-linkage based clustering method. Clustering based
on similarity thresholds, whether centroid-based or single-linkage, does
not differentiate sequencing errors from biological variation. Denoising
algorithms, such as DADA2, have been developed to identify ASVs present
in a sample, by removing sequencing errors using a model which
incorporates the base quality scores and read abundances (Callahan et
al., 2016). This approach captures both within and between species
variation, even as little as one base pair difference, and so ASVs may
be further clustered in order to serve as proxies for species (Frøslev
et al., 2017). However, DADA2 does rely on the
presence of at least two identical
sequences as seeds for generating ASVs, so the method can perform poorly
when the majority of reads are singletons (Furneaux et al., 2021). For
consistency, instead of ASV we use OTU_A to refer to the output of this
denoising method.
Assigning taxonomy to OTUs may allow for functional analysis of
community composition, but is highly dependent on curated reference
datasets such as the PR2 for protists (Del Campo et al., 2018; Guillou
et al., 2012) and UNITE for fungi (Kõljalg et al., 2013). In the
well-established fungal sequence database UNITE, OTUs are derived using
a range of thresholds from 97% to 99.5% similarity across the ITS2
region and referred to as species hypotheses (SH) with unique numbers
and known species names when available (Kõljalg et al., 2013). The
development of PacBio sequencing technology (Pacific Biosciences, Menlo
Park, CA, SA) has allowed longer eDNA amplicons, including both variable
spacers and more conserved functional rDNA regions, to be sequenced from
complex samples. In the absence of matching reference sequences,
taxonomic assignment of novel lineages is possible based on phylogenetic
inference using the more conserved rDNA small subunit (SSU) (Jamy et
al., 2020) and/or large subunit (LSU) sequences (Furneaux et al., 2021;
Leho Tedersoo et al., 2017). The benefit of phylogenetically supported
taxonomic assignment of OTUs is particularly relevant in communities
consisting mostly of poorly characterized lineages (Kalsoom Khan et al.,
2020).
As a case study to test the impact of different OUT generation methods
on apha and beta diversity estaimtes, we chose Kungsängen Nature
Reserve, a semi-natural grassland located in Uppsala, Sweden. The site
has drawn frequent visitors since Linnaeus’s time because of its
vegetation and bird-life. It is also home to the largest population of
the plant F. meleagris (Liliaceae) in Sweden, where it was
naturalized in the 18th century after being used as a
popular garden flower since the 17th century (Linnaeus, 1921 [1753];
Zhang, 1983). It is from this location that the flower draws its Swedish
common name, “Kungsängslilja”. Plant diversity has been monitored at
the site since 1940 with permanent east – west transect across the
field (Sernander, 1948; Zhang, 1983). In this study, we revisited two of
the permanent transects at the Kungsängen Nature Reserve and collected
plant community data to identify the abrupt change in meadow plant
community from the wetter part towards the river, visually distinct from
the mesic-dry part further inland. Soil samples were collected on both
sides of this transtition zone to test if belowground community
compositional shift across the transition from wet to mesic-dry parts of
the meadow was consistently captured with different OTU generation
methods. Further, the effect of OTU generation method on the
characterized community of soil microeukaryotes was explored for
richness estimates, taxonomic and phylogentic resolution and detection
limits of rare taxa. Apart from providing the first belowground
community observations from this study site, we outline an analysis of
detection limits and phylogenetic resolution of three different OTU
generation methods applied to long read metabarcoding data. This
approach is potentially important for the field of eDNA community
analysis as long read metabarcoding is becoming an increasingly applied
methodology (Furneaux et al., 2021; Jamy et al., 2020; Tedersoo et al.,
2017). Burki et al 2021.