OTU generation
Soil eukaryote community composition was estimated by generating OTUs
from the raw reads using three different algorithms. The selected
algorithms have different principles for OTU generation and are all
commonly used in metabarcoding studies. We aimed to assess how the
choice of OTU generation method affects alpha and beta diversity
estimates and the representative OTU sequences. The OTU_A dataset
consisted of ASVs inferred using DADA2 in the AmpliSeq pipeline (Straub
et al., 2020). This method is designed to identify true sequence
variants in the amplicon library by collapsing variations derived from
sequencing errors. The OTU_C dataset was generated by abundance-based
greedy clustering in VSEARCH (Rognes et al., 2016) with a similarity
threshold of 99%. Finally, the OTU_S dataset was generated using
single-linkage “swarm” clustering with a distance threshold of 30 bp
(approximately 2%) in GeFaST (Müller & Nebel, 2018). This threshold
was selected so that two copies of the same biological sequence, each
containing up to 15 errors (i.e., 1% error in 1500 bp) and therefore
differing from each other at no more than 30 positions, would still be
clustered together even if the error-free seed sequence was absent.
For OTU_C (VSEARCH) and OTU_S (GeFaST), the CCS
reads corresponding to each cluster were extracted using a custom BASH
script, and a consensus sequence for each cluster was calculated using
PacBio’s c3s (consensus of circular consensus sequences;
https://github.com/PacificBiosciences/c3s), which calculates a consensus
sequence using SPOA (Vaser et al., 2017) with base quality scores used
as weights. In this way, the representative sequences of all three OTU
types were inferred with quality-aware methods. Chimeric sequences were
removed from all datasets using the removeBimeraDenovo function of
DADA2. Global singletons, i.e., OTUs represented by a single read across
the entire dataset (which DADA2 does not generate), were also removed
from the OTU_C and OTU_S datasets before further analysis.
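
The reasoning behind the 30 bp swarm threshold can be illustrated with a
minimal Python sketch. It is not part of the original pipeline and does not
call GeFaST. As simplifying assumptions, errors are modelled as substitutions
only and distance is measured as Hamming distance, whereas swarm-type
clustering works on edit distance. The names mutate, hamming, READ_LENGTH,
MAX_ERRORS and THRESHOLD are hypothetical and chosen for illustration.

```python
import random


def mutate(seq, n_errors, rng):
    """Return a copy of seq with n_errors substitutions at distinct positions."""
    positions = rng.sample(range(len(seq)), n_errors)
    bases = list(seq)
    for pos in positions:
        # Pick a base that differs from the current one, so every
        # chosen position is a real error.
        alternatives = [b for b in "ACGT" if b != bases[pos]]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)   # fixed seed so the sketch is reproducible
READ_LENGTH = 1500       # sequence length used in the text
MAX_ERRORS = 15          # 1% of 1500 bp
THRESHOLD = 2 * MAX_ERRORS  # 30 bp, i.e. 2% of 1500 bp

# Simulated error-free sequence (assumption: uniform random bases).
true_seq = "".join(rng.choice("ACGT") for _ in range(READ_LENGTH))

# Draw pairs of erroneous copies and record the largest pairwise distance.
worst = 0
for _ in range(200):
    copy_a = mutate(true_seq, MAX_ERRORS, rng)
    copy_b = mutate(true_seq, MAX_ERRORS, rng)
    worst = max(worst, hamming(copy_a, copy_b))

print(f"threshold: {THRESHOLD} bp ({THRESHOLD / READ_LENGTH:.0%} of read length)")
print(f"largest distance between two erroneous copies: {worst}")
```

Because each copy differs from the error-free sequence at no more than 15
positions, two copies can differ from each other at no more than 30
positions, so the largest observed distance never exceeds the threshold even
though the error-free sequence itself is never compared.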