Introduction
The evaluation of DNA diversity in environmental samples has become a pivotal approach in microbial ecology and is increasingly also used to assess the distribution of larger organisms. If a core gene can be amplified from environmental DNA with universal primers, the relative abundance of species in the community can be estimated from the proportions of species-specific variants among the amplicons. High throughput amplicon sequencing (HTAS), often termed metabarcoding, is a cost-effective way to detect multiple species simultaneously within a range of environmental samples. While shotgun sequencing of the whole community (metagenomics) can provide a richer description of the functions in a community, HTAS remains a more efficient tool for comparing the species diversity of a large number of community samples. Despite the extensive use of HTAS for interspecies ecological diversity studies, few investigations have utilised HTAS for intraspecies analysis. As 16S rRNA amplicons are too highly conserved to estimate microbial within-species diversity, other target genes need to be considered in order to discern intraspecies sequence variation.
Many studies have evaluated the extent of PCR-based amplification errors and bias in HTAS diversity studies. Numerous known PCR biases reduce the accuracy of diversity and abundance estimates. The major concern is the inability to confidently distinguish PCR error from natural sequence variation in environmental samples, which is especially limiting for intraspecific studies.
Polymerase errors, production of chimeric sequences by template switching, and the stochasticity of PCR amplification can be major causes of PCR errors. Polymerase errors introduce new sequences into the template population during amplification. These sequence errors include not only substitutions but also insertions and deletions. The use of proofreading polymerases, optimised DNA template concentration, and reduced PCR cycle number have been suggested to reduce these errors.
To account for sequence variants introduced during PCR amplification, several sequence-classification approaches have been established to manage diversity estimates. The most common approach in microbial diversity studies is to cluster target gene sequences into operational taxonomic units (OTUs) based on an arbitrary fixed similarity threshold (QIIME; UPARSE). Within species boundaries, this technique could dramatically reduce the resolution of naturally occurring sequence variation.
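The loss of within-species resolution under a fixed threshold can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length, a simple positional identity measure, and a 97% threshold (a commonly used convention, not a value stated above). The greedy centroid scheme and the names `identity` and `greedy_otus` are illustrative only; this is not the algorithm implemented in QIIME or UPARSE.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Greedy clustering: each sequence joins the first centroid it matches
    at >= threshold identity, otherwise it founds a new cluster.
    The 0.97 default is an assumed, conventional value."""
    centroids = []
    clusters = []
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                clusters[i].append(seq)
                break
        else:
            # No existing centroid is similar enough: start a new OTU.
            centroids.append(seq)
            clusters.append([seq])
    return clusters


# Two alleles differing at one position in 100 bp fall into the same OTU.
allele_a = "ACGT" * 25
allele_b = allele_a[:50] + "T" + allele_a[51:]
otus = greedy_otus([allele_a, allele_b])
```

In this toy case the two alleles are 99% identical, so they are merged into a single cluster and the allelic difference is no longer visible in the OTU table.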
More recent methods rely on the formation of sequence groups called amplicon sequence variants (ASVs) (DADA2; UNOISE3). This approach allows sequence resolution down to one nucleotide, which is advantageous for determining intraspecies allelic variation, but noise from PCR errors is also more evident. Variation induced by PCR errors often cannot be differentiated from rare natural allelic variation without the use of sequence denoising methods. DADA2 relies on a quality-aware parametric error model, which is developed separately for each sequencing run. This increases the run time compared to UNOISE3, which uses a one-pass technique.
An approach that can reduce sequencing noise is to assign a unique molecular identifier (UMI) to every initial DNA template within an HTAS sample, which also enables evaluation of PCR amplification bias. Additionally, the UMI provides a potential route to address polymerase errors in metabarcoding studies. The UMI is provided by a set of random bases in the gene-specific forward inner primer, which introduces a unique DNA sequence into every initial DNA template upstream of the amplicon region during the first round of amplification. Once every original DNA template strand has been assigned a UMI, an outer forward primer and the gene-specific reverse primer can be used for further amplification. Consequently, all subsequent DNA amplified from the original template will carry the same UMI, so the number of reads amplified from each initial template can be calculated. Grouping sequences by shared UMI allows identification of a consensus, which is assumed to be the correct sequence. To our knowledge, UMIs have previously only been used for single-amplicon interspecies investigations.
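The grouping and consensus step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the UMI is taken to be the first `UMI_LENGTH` bases of each read (the value 12 is hypothetical, and primer sequence between the UMI and the amplicon is ignored), and the consensus is simply the most frequent sequence within each UMI group. The function names are illustrative and do not correspond to any published tool.

```python
from collections import Counter, defaultdict

UMI_LENGTH = 12  # hypothetical; the number of random bases is not given in the text


def split_umi(read, umi_length=UMI_LENGTH):
    """Split a read into (UMI, amplicon), assuming the UMI is the read's first bases."""
    return read[:umi_length], read[umi_length:]


def umi_consensus(reads, umi_length=UMI_LENGTH):
    """Return {umi: (consensus_sequence, read_count)} using a majority vote per UMI."""
    groups = defaultdict(Counter)
    for read in reads:
        umi, amplicon = split_umi(read, umi_length)
        groups[umi][amplicon] += 1

    result = {}
    for umi, counts in groups.items():
        # The most frequent sequence is assumed to be the original template.
        consensus, _ = counts.most_common(1)[0]
        # The total read count per UMI reflects how strongly that template amplified.
        result[umi] = (consensus, sum(counts.values()))
    return result
```

Counting distinct UMIs per consensus sequence, rather than raw reads, then gives an abundance estimate based on initial templates rather than on amplified copies.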
Here, we present MAUI-seq, a method for metabarcoding using amplicons with unique molecular identifiers to improve error correction. The innovation is that we use variation among sequences associated with a single UMI to identify erroneous sequences, and we show that this improves error correction compared to non-UMI based analysis using the state-of-the-art software packages DADA2 and UNOISE3.