Bioinformatic processing
Raw sequencing data was processed using a pipeline based on OBITools
(Boyer et al., 2016) in combination with other software (described
below). A FastQC analysis (Andrews, 2010) was conducted to check
sequencing quality, and sequences were trimmed at the position where the
average Phred score was lower than 28. Then, they were paired-end merged
and alignments with a score lower than 30 were discarded. Sequences that
passed the filtering were demultiplexed based on the 8 bp sample tag
combinations and the primer sequences were trimmed away. Demultiplexed
sequences were then filtered based on their length, selecting only those
in the range of 310-330 bp long for COI and 260-375 bp long for 16S
(this range was selected based on the abundance distribution of the read
lengths, which for rRNA genes is much more variable across taxonomic
groups than protein-coding genes, in this case showing two peaks around
270 and 350 bp). Dereplicated reads were searched for chimeras using
VSEARCH v2.7.1 (Rognes et al., 2016). Subsequently, the sequences were
clustered into MOTUs (i.e. putative species) using SWARM v2.1.13 (Mahé
et al., 2015), with maximum distance d = 9 and d = 5, for COI and 16S
respectively, and combined with abundance information to generate MOTU
occurrence tables. These were curated with LULU v0.1.0 (Frøslev et al.,
2017) to reassign false MOTUs to their parent MOTU. The representative
or centroid sequences of each MOTU were assigned taxonomic identity with
ecotag (OBITools) using a local reference database built with a
combination of i) all the barcodes from all orders of Arthropoda with
terrestrial representatives from BOLD (Ratnasingham & Hebert, 2007)
accessed in June 2018, and ii) the invertebrate and fungi sequences from
EMBL’s release r137 (Kulikova et al., 2004) (COI, ~4.1 M
sequences), or just the invertebrate sequences from EMBL’s release r137
(16S, ~60 k sequences). The final dataset was refined by
collapsing all MOTUs with the same species identification, as well as
removing those occurrences that represented less than 0.4% of the total
reads of the sample, and those MOTUs with less than 10 reads in total
across all samples. The minimum abundance per sample threshold was
defined after running an additional library with controlled empty sample
tag combinations and calculating the proportion of reads recovered from
those combinations from the total generated in that library. COI and 16S
datasets are available in Tables S5-6.