Conclusions
Some potential advantages of incorporating UMIs in amplicon diversity
studies have been explored previously, but here we propose a new way to
use the extra information that they provide. Error processes lead to
more than one sequence being associated with the same UMI, and this
discrepancy can be used to identify erroneous sequences regardless of
their relative abundance or their relationship to other sequences in the
sample. The
method is experimentally and computationally straightforward, and we
demonstrate its effectiveness using known strain mixtures and real
environmental samples. It allows decontamination of amplicon sequence
data by flagging chimeras and other errors, and can readily be adapted
to any target gene of interest in microbiome studies.