2.4.2. Capture
Raw reads were trimmed of Illumina adapters using Trimmomatic (v0.38; Bolger, Lohse, & Usadel, 2014) and then quality-filtered with the PRINSEQ-lite Perl script (min_qual_mean=25, trim_qual_window=3, trim_qual_step=1, min_len=60; Schmieder & Edwards, 2011). Trimmed reads corresponding to rDNA were extracted using SortMeRNA (v2.1; Kopylova, Noé, & Touzet, 2012) with default parameters. Near-full-length 16S and 18S rDNA sequences were reconstructed using the emirge_amplicon.py script of EMIRGE (v0.60; Miller, Baker, Thomas, Singer, & Banfield, 2011). This tool performs reference-based assembly of reads while still allowing the reconstruction of variants that are distant from the reference sequences. The reference database was SILVA 132 SSURef NR99, restricted to sequences from 1200 to 2000 bp in length. EMIRGE was run for 120 iterations with join_threshold set to 1. Only reconstructed sequences longer than 800 bp were kept. Taxonomic affiliation was performed using the “classify-sklearn” method of the QIIME2 feature-classifier plugin (v2019.1; Bokulich et al., 2018; Bolyen et al., 2019) and the full-length SILVA 132 database, with the confidence threshold (p-confidence) set to 0.7. This analysis is hereafter referred to as CBH-long.
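The length filter applied to the reconstructed sequences can be illustrated with a minimal sketch. It assumes the EMIRGE output is available as FASTA-formatted text. The function names (parse_fasta, filter_by_length) are hypothetical helpers introduced here for illustration and are not part of EMIRGE or of the original pipeline.

```python
from typing import Iterable, Iterator, List, Tuple

# Threshold taken from the text: only sequences longer than 800 bp were kept.
MIN_LENGTH_BP = 800


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int = MIN_LENGTH_BP
) -> List[Tuple[str, str]]:
    """Keep records whose sequence is strictly longer than min_length."""
    return [(h, s) for h, s in records if len(s) > min_length]
```

In practice the same filter could be applied by passing an open file handle to parse_fasta, since it only requires an iterable of lines.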
Additionally, a Kraken2-based analysis (Wood, Lu, & Langmead, 2019; Wood & Salzberg, 2014) was performed starting from paired reads to evaluate all captured diversity without gene reconstruction. Low coverage of some taxa could prevent the reconstruction of longer sequences, in which case these taxa would be missing from the final dataset. The database used was the prepackaged SILVA database provided with Kraken2. We tested confidence scores from 0.0 to 1.0 in steps of 0.1. For the final analyses, a score of 0.7 was retained to ensure good specificity of taxonomic affiliation. This is consistent with a previous report in which values from 0.6 to 0.7 yielded the best results for sensitivity and precision (Wood & Salzberg, 2014). Data from these analyses are hereafter referred to as CBH-short.
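The threshold sweep described above can be sketched as follows. This is an illustration only: the exact invocation used in the study is not reported, so the database path, read file names, and output prefix are hypothetical placeholders. The Kraken2 options shown (--db, --paired, --confidence, --report, --output) are standard command-line options. The sketch only builds the command lines and does not run Kraken2.

```python
from typing import List


def confidence_grid() -> List[float]:
    """Thresholds from 0.0 to 1.0 in steps of 0.1 (11 values)."""
    # Integer division by 10 avoids accumulating floating-point error.
    return [i / 10 for i in range(11)]


def kraken2_command(
    db: str, reads_1: str, reads_2: str, confidence: float, prefix: str
) -> List[str]:
    """Build one Kraken2 command line for a given confidence threshold.

    All paths and the prefix are placeholders (assumptions), not values
    taken from the study.
    """
    tag = f"{confidence:.1f}"
    return [
        "kraken2",
        "--db", db,
        "--paired",
        "--confidence", tag,
        "--report", f"{prefix}_conf{tag}.report",
        "--output", f"{prefix}_conf{tag}.kraken",
        reads_1,
        reads_2,
    ]


# One command per tested threshold; file names here are hypothetical.
commands = [
    kraken2_command("silva_db", "sample_R1.fastq", "sample_R2.fastq", c, "sample")
    for c in confidence_grid()
]
```

Each entry of commands could then be passed to subprocess.run, and the resulting reports compared across thresholds before selecting a final value such as 0.7.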