Using UMIs to filter out chimeras and other errors
In the MAUI-seq approach, UMIs are used to reduce errors in two distinct
ways. Since all reads with the same UMI should, in principle, be derived
from the same initial template copy, any variation among them reflects
errors. In some implementations, a consensus sequence is calculated,
but we adopt the simpler approach of accepting the most abundant
sequence, which will usually give the same result. Requiring more than
one identical read before accepting a UMI creates an important quality
filter that greatly reduces the number of rare (and usually erroneous)
sequences, but as more reads are required, an increasing number of the
original reads are discarded and the number of accepted counts declines.
To strike a balance between quantity and quality, we chose to count a
sequence provided it had at least two more reads than the next most
frequent sequence with the same UMI, but this threshold could be
adjusted if, for example, a markedly larger number of reads were
available.
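The per-UMI acceptance rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the MAUI-seq implementation: the function name `count_accepted_sequences`, the `(umi, sequence)` input format, and the `min_margin` parameter are assumptions made for the example, and a UMI with only one distinct sequence is treated as having a runner-up count of zero.

```python
from collections import Counter, defaultdict


def count_accepted_sequences(reads, min_margin=2):
    """Count sequences accepted under a per-UMI read-margin rule.

    reads: iterable of (umi, sequence) pairs (hypothetical input format).
    min_margin: the top sequence for a UMI must have at least this many
        more reads than the runner-up (assumed to be 0 if there is none).
    Returns a Counter mapping sequence -> number of UMIs that support it.
    """
    # Tally how often each sequence was read for each UMI.
    reads_per_umi = defaultdict(Counter)
    for umi, seq in reads:
        reads_per_umi[umi][seq] += 1

    accepted = Counter()
    for seq_counts in reads_per_umi.values():
        ranked = seq_counts.most_common(2)
        top_seq, top_n = ranked[0]
        runner_up_n = ranked[1][1] if len(ranked) > 1 else 0
        # Each accepted UMI contributes one count, however many reads it has.
        if top_n - runner_up_n >= min_margin:
            accepted[top_seq] += 1
    return accepted


example_reads = (
    [("UMI1", "A")] * 3 + [("UMI1", "B")]    # margin 2: A accepted
    + [("UMI2", "A")]                        # single read: rejected
    + [("UMI3", "C")] * 2 + [("UMI3", "A")]  # margin 1: rejected
    + [("UMI4", "C")] * 2                    # margin 2: C accepted
)
print(count_accepted_sequences(example_reads))
```

Raising `min_margin` makes the filter stricter at the cost of discarding more UMIs, which is the trade-off between quantity and quality discussed above.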
While the most abundant sequence associated with a UMI will usually be
the correct one, an erroneous sequence will sometimes predominate among
the small number of reads actually sequenced, and will therefore be
included among the recorded counts.
These errors can be detected, though, by aggregating information across
the whole set of samples. When a UMI is associated with more than one
sequence, the secondary sequences are most often erroneous, so sequences
that are relatively more abundant as secondary sequences than as the
primary sequences associated with UMIs are likely to be erroneous. We
recorded the number of times each sequence was found as the second
sequence associated with a UMI, and found empirically that a suitable
threshold for accepting a sequence as genuine was that its count as a
secondary sequence was less than 0.7 times its count as a primary
sequence. This threshold can, however, be adjusted to reflect
the error distribution observed in a particular study. We found that
this approach was very effective in identifying known errors,
particularly chimeras, which were generally the most abundant errors.
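The secondary-to-primary filter described above amounts to a simple ratio test. The following Python sketch assumes the primary and secondary counts have already been aggregated across samples into two dictionaries. The function name `filter_by_secondary_ratio`, the parameter `max_ratio`, and the example counts are all hypothetical and chosen only to illustrate the rule.

```python
def filter_by_secondary_ratio(primary, secondary, max_ratio=0.7):
    """Keep sequences that are rarely seen as secondary sequences.

    primary: dict mapping sequence -> times it was the primary
        (most abundant) sequence for a UMI.
    secondary: dict mapping sequence -> times it was the second
        sequence for a UMI.
    max_ratio: a sequence is kept only if its secondary count is
        strictly less than max_ratio times its primary count.
    Returns a dict of the kept sequences and their primary counts.
    """
    kept = {}
    for seq, n_primary in primary.items():
        n_secondary = secondary.get(seq, 0)
        if n_secondary < max_ratio * n_primary:
            kept[seq] = n_primary
    return kept


# Invented counts for illustration only.
primary_counts = {"allele_1": 100, "allele_2": 40, "chimera_x": 10}
secondary_counts = {"allele_1": 5, "allele_2": 27, "chimera_x": 9}
print(filter_by_secondary_ratio(primary_counts, secondary_counts))
```

Because the test depends only on the observed counts, a sequence that resembles a chimera is still retained if it appears as a primary sequence much more often than as a secondary one.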
Chimeras were rejected more effectively by MAUI-seq than by the two
established ASV clustering methods, DADA2 and UNOISE3. Both of these
rely on de novo rejection of sequences that could be constructed
as recombinants of other sequences that are more abundant in the sample.
This method risks rejecting sequences that appear to be recombinant
but are genuine alleles, which may not be uncommon, particularly in
intraspecific samples. Our approach, by contrast, uses information on
the observed error rates in the data (detected using UMIs) to decide
whether a sequence is likely to be genuine, regardless of its actual
sequence and relationship to other sequences. Sequences that could be
generated as chimeras, or that differ by a single nucleotide from a more
abundant sequence, may be accepted as genuine if they are more abundant
than expected from their rate of occurrence as minor sequences
associated with UMIs. In our study, this approach eliminated many known
errors and substantially improved our confidence in the remaining data,
providing a powerful additional reason for using UMIs in metabarcoding
studies of all kinds. While we found that a simple empirical threshold
was effective, we noticed that the proportion of secondary sequences
varied markedly across studies and genes, suggesting that an adjustable
threshold might give further improvement. A useful future development
might be to use the abundance of minor sequences associated with UMIs to
generate a statistical model of error processes that would provide a
firmer theoretical basis for the classification of sequences.