OTU generation
Soil eukaryote community composition was estimated by generating OTUs from the raw reads using three different algorithms. The selected algorithms are based on different principles for OTU generation, and all are commonly used in metabarcoding studies. We aimed to investigate the effect of the three OTU generation methods on alpha and beta diversity estimates and on the representative OTU sequences. The OTU_A dataset consisted of ASVs inferred using DADA2 in the AmpliSeq pipeline (Straub et al., 2020). This method is designed to identify true sequence variants in the amplicon library by collapsing variation derived from sequencing errors. The OTU_C dataset was generated by abundance-based greedy clustering in VSEARCH (Rognes et al., 2016) with a similarity threshold of 99%. Finally, the OTU_S dataset was generated using single-linkage “swarm” clustering with a distance threshold of 30 bp (approximately 2%) in GeFaST (Müller & Nebel, 2018). This threshold was selected to ensure that two copies of the same biological sequence, each containing a maximum of 15 different errors (i.e., 1% error in 1500 bp), would still be clustered together even if the error-free seed sequence was absent. For OTU_C (VSEARCH) and OTU_S (GeFaST), the CCS reads corresponding to each cluster were extracted using a custom BASH script, and a consensus sequence for each cluster was calculated using PacBio’s c3s (consensus of circular consensus sequences; https://github.com/PacificBiosciences/c3s), which calculates a consensus sequence using SPOA (Vaser et al., 2017) with base quality scores used as weights. In this way, the sequences representing all three types of OTUs were inferred with a quality-aware method. Chimeric sequences were removed from all datasets using the removeBimeraDenovo function of DADA2. Global singletons (which are not generated by DADA2) were also removed from the OTU_C and OTU_S datasets before further analysis.
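The reasoning behind the 30 bp threshold can be illustrated with a short worked example. The sketch below is not part of the analysis pipeline. It assumes substitution errors only and uses Hamming distance as a stand-in for the distance computed by GeFaST. The helper names (`mutate`, `hamming`) and the artificial 1500 bp sequence are hypothetical and serve only to show that two copies of one sequence, each carrying 15 errors at different positions, differ from each other by 30 bp (2% of 1500 bp).

```python
# Illustrative sketch only: substitution errors and Hamming distance are
# simplifying assumptions, not the distance model used by GeFaST.

SEQ_LEN = 1500          # approximate amplicon length (bp), from the text
MAX_ERRORS = 15         # 1% error in 1500 bp, from the text


def mutate(seq, positions):
    """Return a copy of seq with a substitution at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}  # always changes the base
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Artificial error-free sequence of 1500 bp (hypothetical, for illustration).
true_seq = "ACGT" * (SEQ_LEN // 4)

# Two erroneous copies whose errors fall at non-overlapping positions.
copy_a = mutate(true_seq, range(0, MAX_ERRORS))
copy_b = mutate(true_seq, range(MAX_ERRORS, 2 * MAX_ERRORS))

# Each copy is 15 bp away from the true sequence, so the worst-case
# distance between the two copies is 15 + 15 = 30 bp.
worst_case = hamming(copy_a, copy_b)
print(worst_case, worst_case / SEQ_LEN)
```

Under these assumptions, a threshold of 30 bp is the smallest value that links the two copies directly, which is the situation that arises when the error-free seed sequence is absent from the dataset.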