Validation using environmental samples
To test the method on more complex samples, we compared Rlt populations in root nodules from two locations in Denmark: a clover
trial station in Store Heddinge on Zealand and a lawn at Aarhus
University in Jutland (the Field-Samples-1 dataset; Supplementary Figure S5). One hundred nodules were
pooled for each sample and each plot was sampled in four replicates.
Platinum Taq polymerase enzyme was used for amplification. Each clover
root nodule is usually colonised by a single Rhizobium strain, so
a maximum of 100 unique sequences per gene is expected per sample.
For Field-Samples-1, the total numbers of distinct sequences for MAUI-seq
and DADA2 were in the same range as the number of distinct alleles
observed in a population of 196 natural European Rlt isolates
(Table 2). In contrast, UNOISE3 produced a substantially higher
number of distinct sequences, suggesting that its default filtering
might be too lenient for our data (Table 2). The sequences
accepted as true by MAUI-seq were nearly all also included in the DADA2
and UNOISE3 outputs (Figure 3). On the other hand, DADA2 and
UNOISE3 both accepted a number of sequences that were filtered out by
MAUI-seq, and many of these were eliminated by MAUI-seq because a high
ratio of secondary to primary occurrences strongly suggested that they
represent errors and not real sequences (Figure 3 and Additional file 2). To provide independent evidence as to
whether sequences were likely to be genuine, we checked whether they
matched (or differed by a single nucleotide from) known sequences in
either a reference database of 196 natural European Rlt isolates
or the NCBI whole-genome shotgun database (Figure 3). The
great majority of sequences rejected by MAUI-seq did not have exact
matches to these known sequences. A few sequences that exactly matched
known alleles were included by DADA2 and UNOISE3, but not by MAUI-seq.
These sequences were not reported by MAUI-seq because their UMI counts
were below the abundance threshold, not because the secondary/primary
occurrence filter identified them as erroneous (Figure 3). The
count threshold could be lowered to include rarer sequences, if the
study required it.
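
The database check described above can be sketched as follows. This is a minimal illustration, not the script used in the study: it assumes that amplicons of a given gene are of equal length, so that a single-nucleotide difference can be scored as a Hamming distance of one, and the function names `hamming` and `classify_match` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def classify_match(seq: str, references: list[str]) -> str:
    """Label seq as 'exact', 'one_off' or 'none' relative to known alleles.

    Assumption: only references of the same length are compared, so
    insertions and deletions are not considered.
    """
    distances = [hamming(seq, ref) for ref in references if len(ref) == len(seq)]
    if not distances:
        return "none"
    best = min(distances)
    if best == 0:
        return "exact"
    if best == 1:
        return "one_off"
    return "none"


# Toy reference set (hypothetical sequences, for illustration only).
known = ["ACGTACGT", "ACGTTCGT"]
labels = {s: classify_match(s, known) for s in ["ACGTACGT", "ACGTACGA", "TTTTACGA"]}
```

A sequence labelled `none` is not necessarily an error, since it may be a genuine allele that is absent from the databases; the label only withholds independent support.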
The allele frequency distributions were different at Aarhus and Store
Heddinge (Figure 3), and the two sites were clearly separated
by the first principal component in a principal component analysis (PCA)
of the MAUI-seq, DADA2 and UNOISE3 sequences (Figure 4 and Supplementary Figures S6-S8). The amplicon sequencing has
sufficient resolution to characterize geospatial variation in allele
frequencies. For example, MAUI-seq, DADA2 and UNOISE3 can all clearly
identify several highly abundant sequences from one location that are
either absent or present in very low frequency in samples from the other
location (Figure 3). To quantify the genetic differentiation
between the Aarhus and Store Heddinge sites, we calculated fixation
indices (F_ST). Considering all four target genes
combined, the MAUI-seq output resulted in the highest F_ST value, followed by DADA2 and UNOISE3
(Table 2, Figure 4 and Supplementary Figures S9-S11).
For all individual genes, MAUI-seq also produced the highest F_ST estimates, and the differences were
especially pronounced for nodA, which also showed the highest
overall level of differentiation (Table 2 and Supplementary Figures S9-S11). The lower genetic differentiation
estimated based on DADA2 and UNOISE3 results, compared to those of
MAUI-seq, reflects the inclusion of an increased number of erroneous
sequences, which are less differentiated between the two sampled sites
than the real sequences (Figure 3).
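
The effect of shared erroneous sequences on the fixation index can be illustrated with a small sketch. The estimator used in the study is not specified in this section, so the code below assumes a simple heterozygosity-based form, F_ST = (H_T - H_S) / H_T, with the two sites weighted equally; the allele frequencies are hypothetical and chosen only to show the direction of the effect.

```python
def heterozygosity(freqs: dict[str, float]) -> float:
    """Expected heterozygosity, 1 - sum(p^2), for allele frequencies summing to 1."""
    return 1.0 - sum(p * p for p in freqs.values())


def fst_two_sites(site_a: dict[str, float], site_b: dict[str, float]) -> float:
    """F_ST = (H_T - H_S) / H_T for two equally weighted sites (assumed estimator)."""
    alleles = set(site_a) | set(site_b)
    pooled = {a: (site_a.get(a, 0.0) + site_b.get(a, 0.0)) / 2 for a in alleles}
    h_t = heterozygosity(pooled)
    h_s = (heterozygosity(site_a) + heterozygosity(site_b)) / 2
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# Hypothetical frequencies: two real alleles that differ strongly between sites.
clean_a = {"allele1": 0.9, "allele2": 0.1}
clean_b = {"allele1": 0.1, "allele2": 0.9}

# The same sites after 20% of the counts are assigned to an erroneous
# sequence that is equally frequent at both sites.
noisy_a = {"allele1": 0.72, "allele2": 0.08, "error": 0.2}
noisy_b = {"allele1": 0.08, "allele2": 0.72, "error": 0.2}

fst_clean = fst_two_sites(clean_a, clean_b)
fst_noisy = fst_two_sites(noisy_a, noisy_b)
```

In this toy case the estimate falls from 0.64 to 0.32 once the shared erroneous sequence is included, consistent with the interpretation that undifferentiated errors dilute the signal of differentiation.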
Since it was clear from the DNA mixture experiment that the choice of
DNA polymerase could significantly affect error rates, we sampled root
nodules from 13 additional clover field plots (the Field-Samples-2
dataset) and amplified each sample (a pool of one hundred root nodules)
using Platinum and Phusion polymerases in parallel. For samples
amplified using Platinum, MAUI-seq detected fewer sequences than DADA2
and UNOISE3 for the two core genes, but the same number of reference
sequences was detected (Table 3). DADA2 included two chimeric
sequences that were filtered out by MAUI-seq due to a high ratio of
secondary to primary occurrences (Additional file 2). UNOISE3
detected twice as many sequences as DADA2 and MAUI-seq for the accessory
genes, but most of the additional sequences had no associated UMIs and
were classified as “other” (Table 3, Additional file 2). For samples amplified using Phusion, MAUI-seq and DADA2 detected a
similar number of sequences (Table 3). All nine UNOISE3 rpoB sequences that were not accepted by either MAUI-seq or DADA2
(Additional file 2) are putative chimeric sequences with two
parental sequences of higher abundance. For nodA, MAUI-seq
includes three sequences that have a single nucleotide difference from a
reference sequence, but all have a low ratio of secondary to primary
reads, so we hypothesise that these are true sequences. Some reference
or exact BLAST hit sequences were included by DADA2 but not by MAUI-seq
because their abundance was estimated by DADA2 to be above the 0.001
threshold, but MAUI-seq estimated that they were rarer.
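
The way a single abundance threshold can give different outcomes for the two methods can be sketched as below. The counts are hypothetical, the helper names are illustrative, and the code assumes that the 0.001 threshold is applied to relative abundance (a sequence's share of the total count) and is inclusive; neither detail is stated in this section.

```python
def relative_abundance(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts to fractions of the total count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: n / total for seq, n in counts.items()}


def above_threshold(counts: dict[str, int], threshold: float = 0.001) -> set[str]:
    """Sequences whose relative abundance reaches the threshold (assumed inclusive)."""
    return {seq for seq, f in relative_abundance(counts).items() if f >= threshold}


# Hypothetical counts for the same sample: read-based (DADA2-style) and
# UMI-based (MAUI-seq-style) estimates of the same three sequences.
read_counts = {"seq1": 9000, "seq2": 985, "rare": 15}
umi_counts = {"seq1": 1800, "seq2": 199, "rare": 1}

kept_by_reads = above_threshold(read_counts)
kept_by_umis = above_threshold(umi_counts)
```

Here the rare sequence has a relative abundance of 0.0015 by reads and 0.0005 by UMIs, so it is retained in the first case and dropped in the second even though the threshold is identical.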
Both MAUI-seq and DADA2 identify and remove sequences that appear to be
errors (base substitutions or chimeras), but they use completely
different evidence. As a result, they do not always make the same
decision, as illustrated for a small set of representative data in Table 4 (the rpoB sequences amplified by Phusion). While
DADA2 examines the sequences and rejects those that are likely to be
generated from more abundant sequences in the sample, MAUI-seq does not
use the actual sequence but bases decisions on how frequently a sequence
occurs as a secondary sequence with the same UMI as another (primary)
sequence. Sequences ranked 5 and 6 (Table 4) are both potential
chimeras of the more abundant sequences 1-4. Both DADA2 and MAUI-seq
reject sequence 6 and accept sequence 5. Sequence 6 has a
secondary/primary ratio of 103/118 (0.87), which is above the default threshold
of 0.7, so MAUI-seq rejects it as a likely error. On the other hand, the
ratio for sequence 5 is 71/229 (0.31). This is well below the threshold, but it
is higher than for other sequences with a similar primary count, e.g.
sequence 9 (15/270, or 0.06). A possible explanation is that some of the reads
for sequence 5 are generated as chimeras but others are genuine, since
it is entirely plausible that new alleles are generated by recombination
between existing alleles. To some extent, MAUI-seq compensates for this
because it allocates sequence 5 a relatively low count and hence lower
ranking (8) than it has in the raw reads or the DADA2 analysis. There
are two further sequences, 10 and 29, that are rejected by DADA2 as
potential chimeras but accepted by MAUI-seq (Additional file 2, Field-Samples-2-phusion-rpoB); in both cases they have secondary
sequence counts well below the threshold, so MAUI-seq accepts them as
genuine. DADA2 included an rpoB sequence that does not have any
associated UMIs (sequence 41) and appears to be a chimera of two more
abundant sequences (sequence 3/4/5 and sequence 11) (Table 4).
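
The secondary/primary decisions discussed above can be reproduced with a few lines of arithmetic. The counts are those quoted in the text for sequences 5, 6 and 9; the function names are illustrative, and the sketch assumes that a sequence is rejected only when its ratio is strictly greater than the 0.7 default.

```python
import math


def secondary_primary_ratio(secondary: int, primary: int) -> float:
    """Ratio of secondary to primary occurrences; infinite if there are no primaries."""
    if primary == 0:
        return math.inf
    return secondary / primary


def is_rejected(secondary: int, primary: int, threshold: float = 0.7) -> bool:
    """Flag a sequence as a likely error when its ratio exceeds the threshold."""
    return secondary_primary_ratio(secondary, primary) > threshold


# (secondary, primary) counts quoted in the text for the rpoB Phusion data.
table4_counts = {"sequence 5": (71, 229), "sequence 6": (103, 118), "sequence 9": (15, 270)}

for name, (sec, pri) in table4_counts.items():
    ratio = secondary_primary_ratio(sec, pri)
    verdict = "rejected" if is_rejected(sec, pri) else "accepted"
    print(f"{name}: {sec}/{pri} = {ratio:.2f} -> {verdict}")
```

Only sequence 6 exceeds the threshold. Sequence 5 is accepted, but its ratio is roughly five times that of sequence 9, which is the basis for suggesting that its reads are a mixture of chimeric and genuine molecules.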
MAUI-seq counts UMIs, not individual reads, and the default setting is
to require that the primary sequence has at least two more reads than
the next most frequent sequence (if any) that has the same UMI. This
enriches for genuine sequences, which are generally more abundant than
errors, but it means, of course, that the number of counts is much lower
than the number of reads. In fact, for this particular set of data, the
number of UMIs is orders of magnitude smaller than either the raw reads
or the DADA2 count, although still sufficient to provide good estimates
of the relative abundance of the sequences that make up the bulk of the
population. The main reason for the low UMI count is that the number of
reads per UMI was suboptimal in these data for the rpoB gene:
only 18% of the UMIs had more than one read, and MAUI-seq discards
single-read UMIs by default. By contrast, in the equivalent data for the recA gene in the same study (Additional file 2, Field-Samples-2-phusion-recA), 37.5% of UMIs had more than one read,
making more effective use of the available sequence reads.
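
The UMI counting rule described in this paragraph can be sketched as follows. This is a simplified illustration rather than the MAUI-seq implementation: the function name `count_umis` and the parameter `min_margin` are hypothetical, a missing runner-up is assumed to count as zero reads (so a single-read UMI fails the margin of two and is discarded), and the runner-up is assumed to be recorded as a secondary occurrence only for UMIs whose primary sequence is accepted.

```python
from collections import Counter, defaultdict


def count_umis(reads, min_margin=2):
    """Count primary and secondary occurrences per sequence.

    reads: iterable of (umi, sequence) pairs, one per sequencing read.
    A UMI contributes one primary count to its most frequent sequence only
    if that sequence has at least min_margin more reads than the runner-up.
    """
    per_umi = defaultdict(Counter)
    for umi, seq in reads:
        per_umi[umi][seq] += 1

    primary = Counter()
    secondary = Counter()
    for seq_counts in per_umi.values():
        ranked = seq_counts.most_common(2)
        top_seq, top_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        if top_n - second_n < min_margin:
            continue  # ambiguous or single-read UMI: discarded
        primary[top_seq] += 1
        if len(ranked) > 1:
            secondary[ranked[1][0]] += 1
    return primary, secondary


# Toy reads (hypothetical): four UMIs, two of which are usable.
toy_reads = [
    ("umi1", "A"), ("umi1", "A"), ("umi1", "A"), ("umi1", "B"),  # A primary, B secondary
    ("umi2", "A"),                                               # single read: discarded
    ("umi3", "B"), ("umi3", "B"),                                # B primary
    ("umi4", "A"), ("umi4", "B"),                                # tie: discarded
]
primary_counts, secondary_counts = count_umis(toy_reads)
```

In this toy example nine reads yield only two counted UMIs, which illustrates why the UMI count falls so far below the read count when most UMIs are represented by a single read.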