Introduction
The evaluation of DNA diversity in environmental samples has become a
pivotal approach in microbial ecology and is increasingly also used to
assess the distribution of larger organisms. If a core gene can be
amplified from environmental DNA with universal primers, the relative
abundance of species in the community can be estimated from the
proportions of species-specific variants among the amplicons. High
throughput amplicon sequencing (HTAS), often termed metabarcoding, is a
cost-effective way to detect multiple species simultaneously within a
range of environmental samples. While shotgun sequencing of the whole
community (metagenomics) can provide a richer description of the
functions in a community, HTAS remains a more efficient tool for
comparing the species diversity of a large number of community samples.
Despite the extensive use of HTAS for interspecies ecological diversity
studies, few investigations have utilised HTAS for intraspecies analysis.
As 16S rRNA amplicons are too highly conserved to estimate microbial
within-species diversity, other target genes need to be considered in
order to resolve intraspecies sequence variation.
Many studies have evaluated the extent of PCR-based amplification errors
and bias in HTAS diversity studies. Numerous known PCR biases reduce
the accuracy of diversity and abundance estimations, with the major
concern being the inability to confidently distinguish PCR error from
natural sequence variation in environmental samples, which is an
especially limiting factor for intraspecific studies.
Polymerase errors, production of chimeric sequences by template
switching, and the stochasticity of PCR amplification are major
sources of error in PCR. Polymerase errors introduce new sequences into
the template population during amplification. These sequence errors
include not only substitutions but also insertions and deletions. The
use of proofreading polymerases, optimised DNA template concentration,
and reduced PCR cycle number have been suggested to reduce these errors.
To account for sequence variants introduced during PCR amplification,
several sequence-classification approaches have been established for
estimating diversity. The most common approach in microbial diversity
studies is the use of operational taxonomic units (OTUs), which cluster
target gene sequences at an arbitrary fixed similarity threshold
(e.g. QIIME; UPARSE). Within species boundaries, this technique could
dramatically reduce the resolution of naturally occurring sequence
variation.
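The loss of within-species resolution under fixed-threshold clustering can be illustrated with a minimal sketch. It is not the QIIME or UPARSE algorithm. It assumes equal-length, pre-aligned reads, and the function names, the 97% threshold, and the two 100 bp alleles are hypothetical choices made for illustration only.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length, pre-aligned reads)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Greedy centroid clustering: the most abundant sequences seed OTUs first."""
    counts = Counter(reads)
    otus = {}  # centroid sequence -> total number of reads assigned to it
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += n  # absorbed into an existing OTU
                break
        else:
            otus[seq] = n  # no centroid is close enough: start a new OTU
    return otus


# Two hypothetical alleles of one species that differ at a single position.
allele_a = "ACGT" * 25          # 100 bp
allele_b = "T" + allele_a[1:]   # 99% identical to allele_a
reads = [allele_a] * 60 + [allele_b] * 40

otus_97 = greedy_otus(reads, threshold=0.97)    # the two alleles collapse
otus_exact = greedy_otus(reads, threshold=1.0)  # the two alleles stay distinct
```

At the 97% threshold both alleles fall into a single OTU, so the 60:40 allele ratio is no longer visible. It survives only when every distinct sequence is kept separate.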
More recent methods rely on the formation of sequence groups called
amplicon sequence variants (ASVs) (e.g. DADA2; UNOISE3). This approach
allows sequence resolution down to one nucleotide, which is advantageous
for determining intraspecies allelic variation, but noise from PCR
errors is also more evident. Variation induced by PCR errors often
cannot be differentiated from rare natural allelic variation without the
use of sequence denoising methods. DADA2 relies on a quality-aware
parametric error model, which is developed separately for each
sequencing run. This increases the run time compared to UNOISE3, which
uses a one-pass technique.
An approach that can reduce sequencing noise is to assign a unique
molecular identifier (UMI) to every initial DNA template within an HTAS
sample, which also enables evaluation of PCR amplification bias.
Additionally, the UMI provides a potential route to address polymerase
errors in metabarcoding studies. The UMI is provided by a set of random
bases in the gene-specific forward inner primer, which introduces a
unique DNA sequence into every initial DNA template upstream of the
amplicon region during the first round of amplification. Once all
original DNA template strands are assigned a unique UMI, an outer
forward primer and the gene-specific reverse primer can be used for
further amplification. Consequently, all subsequent DNA amplified from
the original template will have the same UMI, so the number of reads
amplified from the initial template can be calculated. Grouping
sequences by shared UMI allows identification of a consensus, which is
assumed to be the correct sequence. To our knowledge, UMIs have
previously only been used for single-amplicon interspecies
investigations.
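The grouping step described above can be sketched as follows. This is a minimal illustration that assumes the UMI has already been separated from each read and that the consensus is simply the most frequent sequence per UMI. The function name and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def umi_consensus(tagged_reads):
    """Map each UMI to (most common sequence, number of reads carrying that UMI)."""
    by_umi = defaultdict(Counter)
    for umi, seq in tagged_reads:
        by_umi[umi][seq] += 1

    result = {}
    for umi, seqs in by_umi.items():
        # Ties are broken alphabetically so the outcome is deterministic.
        consensus, _ = max(seqs.items(), key=lambda kv: (kv[1], kv[0]))
        result[umi] = (consensus, sum(seqs.values()))
    return result


# Toy reads as (UMI, amplicon sequence) pairs; one read under UMI "AAGT"
# carries a hypothetical polymerase error.
tagged_reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTTC"),
    ("CCTA", "ACGAAC"),
    ("CCTA", "ACGAAC"),
]
consensus_by_umi = umi_consensus(tagged_reads)
```

Counting UMIs rather than reads gives one observation per initial template, while the read count per UMI indicates how unevenly the templates were amplified.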
Here, we present MAUI-seq, a method for metabarcoding using amplicons
with unique molecular identifiers to improve error correction. The
innovation is that we use variation among sequences associated with
with a single UMI to identify erroneous sequences, and we show that this
improves error correction compared to non-UMI based analysis using the
state-of-the-art software packages DADA2 and UNOISE3.
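One plausible reading of this idea is sketched below. It is not the MAUI-seq implementation itself. The sketch assumes that, within each UMI, the most frequent sequence is treated as primary and the second most frequent as secondary, and that a sequence is flagged as erroneous when it occurs as a secondary sequence too often relative to its occurrences as a primary one. The function name, the `max_ratio` value, and the toy reads are illustrative assumptions.

```python
from collections import Counter, defaultdict


def filter_by_within_umi_variation(tagged_reads, max_ratio=0.5):
    """Keep sequences whose secondary/primary occurrence ratio is at most max_ratio."""
    by_umi = defaultdict(Counter)
    for umi, seq in tagged_reads:
        by_umi[umi][seq] += 1

    primary, secondary = Counter(), Counter()
    for seqs in by_umi.values():
        # Rank sequences within the UMI by read count (ties broken alphabetically).
        ranked = sorted(seqs.items(), key=lambda kv: (-kv[1], kv[0]))
        primary[ranked[0][0]] += 1
        if len(ranked) > 1:
            secondary[ranked[1][0]] += 1

    # A sequence that mostly shows up as the minority within UMIs is suspect.
    return {
        seq: n_primary
        for seq, n_primary in primary.items()
        if secondary[seq] / n_primary <= max_ratio
    }


# Toy data: "ACGT" is a genuine allele, "ACGA" a hypothetical PCR error.
reads = (
    [("U1", "ACGT")] * 3 + [("U1", "ACGA")]
    + [("U2", "ACGT")] * 2 + [("U2", "ACGA")]
    + [("U3", "ACGT")] * 2
    + [("U4", "ACGA")] * 2 + [("U4", "ACGT")]
)
accepted = filter_by_within_umi_variation(reads)
```

In this toy data the erroneous sequence is primary in one UMI but secondary in two, so it is rejected, whereas the genuine allele is primary in three UMIs and secondary in only one.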