Bioinformatic processing

Raw sequencing data was processed using a pipeline based on OBITools (Boyer et al., 2016) in combination with other software (described below). A FastQC analysis (Andrews, 2010) was conducted to check sequencing quality, and sequences were trimmed at the position where the average Phred score was lower than 28. Then, they were paired-end merged and alignments with a score lower than 30 were discarded. Sequences that passed the filtering were demultiplexed based on the 8 bp sample tag combinations and the primer sequences were trimmed away. Demultiplexed sequences were then filtered based on their length, selecting only those in the range of 310-330 bp long for COI and 260-375 bp long for 16S (this range was selected based on the abundance distribution of the read lengths, which for rRNA genes is much more variable across taxonomic groups than protein-coding genes, in this case showing two peaks around 270 and 350 bp). Dereplicated reads were searched for chimeras using VSEARCH v2.7.1 (Rognes et al., 2016). Subsequently, the sequences were clustered into MOTUs (i.e. putative species) using SWARM v2.1.13 (Mahé et al., 2015), with maximum distance d = 9 and d = 5, for COI and 16S respectively, and combined with abundance information to generate MOTU occurrence tables. These were curated with LULU v0.1.0 (Frøslev et al., 2017) to reassign false MOTUs to their parent MOTU. The representative or centroid sequences of each MOTU were assigned taxonomic identity with ecotag (OBITools) using a local reference database built with a combination of i) all the barcodes from all orders of Arthropoda with terrestrial representatives from BOLD (Ratnasingham & Hebert, 2007) accessed in June 2018, and ii) the invertebrate and fungi sequences from EMBL’s release r137 (Kulikova et al., 2004) (COI, ~4.1 M sequences), or just the invertebrate sequences from EMBL’s release r137 (16S, ~60 k sequences). The final dataset was refined by collapsing all MOTUs with the same species identification, as well as removing those occurrences that represented less than 0.4% of the total reads of the sample, and those MOTUs with less than 10 reads in total across all samples. The minimum abundance per sample threshold was defined after running an additional library with controlled empty sample tag combinations and calculating the proportion of reads recovered from those combinations from the total generated in that library. COI and 16S datasets are available in Tables S5-6.