Future directions for MAUI-seq
HTAS is a valuable and widely-used approach for the study of microbial community diversity, but handling erroneous sequences introduced by the amplification and sequencing procedures has always been challenging. The use of UMIs allows MAUI-seq to greatly reduce the incidence of errors through two mechanisms. Firstly, the requirement that a UMI is associated with at least two identical reads eliminates many rare sequences that are predominantly erroneous. Secondly, sequences that are frequently generated as errors can be identified and removed because they occur unexpectedly often as minor components associated with UMIs that are assigned to more abundant sequences. These mechanisms are independent of any reference database and can recognise and retain genuine alleles that differ by a single nucleotide or match a potential chimera. This makes MAUI-seq particularly suited to studies of intraspecific variation, where the range of sequence divergence may be limited and not fully known in advance. However, the efficient elimination of erroneous sequences is also important in community studies such as those based on widely-used 16S primers, and MAUI-seq should be readily adaptable to this field. The analysis pipeline is very fast because no sequence alignment or database searching is involved; only the accepted final sequences would need to be characterised by comparison to a reference database.
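The two error-removal mechanisms described above can be illustrated with a minimal sketch. This is not the MAUI-seq implementation: the function name `maui_style_filter`, the parameter names, and the default values are hypothetical, chosen only to mirror the user-specified parameters discussed in this section. The sketch also assumes that secondary sequences are tallied only within accepted UMIs and that relative abundance is measured against the total number of accepted UMIs.

```python
from collections import Counter, defaultdict


def maui_style_filter(reads, min_diff=2, max_secondary_ratio=0.7,
                      min_rel_abundance=0.001):
    """Illustrative UMI-based error filter (hypothetical names and defaults).

    reads: iterable of (umi, sequence) pairs.
    Returns a dict mapping each accepted sequence to its UMI count.
    """
    # Group reads by UMI and count each distinct sequence within a UMI.
    per_umi = defaultdict(Counter)
    for umi, seq in reads:
        per_umi[umi][seq] += 1

    primary = Counter()    # UMIs in which a sequence is the most abundant
    secondary = Counter()  # accepted UMIs in which it is the second most abundant
    for counts in per_umi.values():
        ranked = counts.most_common(2)
        top_seq, top_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        # Mechanism 1: with min_diff=2, a UMI seen in a single read, or one
        # whose two leading sequences are nearly tied, is not counted.
        if top_n - second_n < min_diff:
            continue
        primary[top_seq] += 1
        if len(ranked) > 1:
            secondary[ranked[1][0]] += 1

    total = sum(primary.values())
    accepted = {}
    for seq, n in primary.items():
        # Mechanism 2: a sequence that occurs unexpectedly often as the minor
        # component of UMIs assigned to other sequences is treated as an error.
        if secondary[seq] > max_secondary_ratio * n:
            continue
        # Assumed abundance threshold, relative to all accepted UMIs.
        if n / total < min_rel_abundance:
            continue
        accepted[seq] = n
    return accepted
```

Note that no alignment or database lookup appears anywhere in the sketch; only counting is required, which is consistent with the speed of the approach.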
Most HTAS studies report the relative proportions of the taxa in a community, but it would sometimes be valuable to estimate the absolute abundance of the microbes in the environmental sample. UMIs can potentially provide such information, if the initial template copying is carefully controlled so that the total number of distinct UMIs reflects the number of templates. While this would necessitate some additional steps at the start of the experimental protocol, it should still be possible to analyse the resulting sequences using the error-removal approaches provided by MAUI-seq. Alternatively, absolute abundance can be estimated by adding a spike of a known quantity of a recognisable target sequence to the sample before processing.
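The spike-in alternative amounts to a simple rescaling, sketched below. The function name `absolute_abundance` and its arguments are hypothetical, and the calculation assumes that the spike and the native templates are copied and amplified with equal efficiency, which would need to be verified experimentally.

```python
def absolute_abundance(umi_counts, spike_id, spike_copies_added):
    """Rescale UMI counts to estimated template copies using a spike-in.

    umi_counts: dict mapping sequence identifier -> accepted UMI count.
    spike_id: key of the spike sequence in umi_counts (hypothetical label).
    spike_copies_added: known number of spike copies added to the sample.
    """
    spike_count = umi_counts.get(spike_id, 0)
    if spike_count <= 0:
        raise ValueError("spike sequence was not recovered; cannot rescale")
    # Estimated template copies represented by each counted UMI.
    copies_per_umi = spike_copies_added / spike_count
    return {
        seq: n * copies_per_umi
        for seq, n in umi_counts.items()
        if seq != spike_id
    }
```

For example, if 1000 spike copies were added and the spike is recovered in 50 UMIs, each counted UMI is taken to represent 20 template copies under the stated assumption.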
The addition of a UMI shortens the maximum length of target sequence that can be read, and the counting of UMIs rather than reads requires a higher depth of sequencing, but these limitations are increasingly unimportant as improvements in sequencing technology increase both read length, enabling long-read amplicon sequencing, and the number of reads. As implemented in MAUI-seq, UMIs are very effective in reducing the errors inherent in HTAS, and have the potential to improve the quality of any amplicon-based study of diversity. There are several parameters (minimum difference between primary and secondary reads of a UMI, ratio of secondary to primary reads of a sequence, minimum relative abundance) that are user-specified and can be adjusted to suit each study. In principle, it should be possible to optimize these using a statistical model of mutational errors, like that implemented in DADA2, and of chimera formation, which is not modelled in detail by DADA2. The UMIs provide an additional source of information to parameterize the model, linking sequences that have a common origin. Such a model would be complex, however, and parameterizing and testing it would need a dataset that was optimized for the purpose. At the same time, it would also be interesting to explore the use of UMIs at both ends of the amplicon, which would provide an additional means to identify and eliminate chimeras.