Analysis protocol: filtering using UMI-based error rates
The resulting paired-end reads were merged and then separated by gene
prior to downstream analysis, where UMIs are critical in two ways.
Firstly, reads are clustered by UMI and the most abundant sequence
associated with each UMI is selected; the number of unique UMIs is then
counted for each distinct sequence (Figure 1C). UMIs are
discarded as ambiguous if the most abundant sequence does not have at
least two reads more than the next in abundance. The most abundant
sequence will usually be the correct one (Figure 2A, Case 1)
but, because most UMIs are represented by only a small number of reads,
an erroneous sequence is occasionally sampled more often than the true
sequence and so becomes the primary sequence of the UMI
(Figure 2A, Case 2). Secondly, we reasoned
that it may be possible to eliminate these errors by using the UMIs to
provide information on global error rates across all samples. We
implemented this in MAUI-seq by noting both the most abundant (primary)
and the second most abundant (secondary) sequence if two or more
sequences were associated with the same UMI. MAUI-seq then distinguishes
between true and erroneous sequences based on the ratio of primary and
secondary occurrences of each sequence, eliminating sequences that show
a high ratio (default threshold 0.7) of secondary to primary occurrences
(Figure 1C and Figure 2B). The 0.7 threshold was
chosen empirically, based on the ratios observed for known true and
erroneous sequences, but it is a compromise because the incidence of
secondary sequences varies across genes and studies. An examination of
the results may suggest choosing different thresholds in other studies.
Finally, globally rare sequences are discarded (default threshold is
0.1% averaged across samples - a lower threshold could be used if
samples were sequenced to a greater depth). Python scripts for
separating the genes and for the UMI analysis are available at https://github.com/jpwyoung/MAUI.
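
The filtering logic described above can be illustrated with a minimal Python sketch. This is not the published MAUI-seq code. The function names (`call_umis`, `filter_sequences`) and parameter names (`min_margin`, `max_ratio`, `min_abundance`) are hypothetical. The sketch makes several assumptions: reads for one gene have already been reduced to (UMI, sequence) pairs and pooled; a UMI with only one sequence is compared against a runner-up count of zero; a ratio equal to the threshold is rejected; and the abundance filter is applied to pooled primary counts rather than to a per-sample average.

```python
from collections import Counter, defaultdict


def call_umis(reads, min_margin=2):
    """Count primary and secondary occurrences of each sequence.

    reads: iterable of (umi, sequence) pairs (assumed input format).
    Returns two Counters keyed by sequence: the number of UMIs for which
    the sequence is the most abundant (primary) and the second most
    abundant (secondary).
    """
    by_umi = defaultdict(Counter)
    for umi, seq in reads:
        by_umi[umi][seq] += 1

    primary, secondary = Counter(), Counter()
    for counts in by_umi.values():
        ranked = counts.most_common(2)
        top_seq, top_n = ranked[0]
        # Assumption: with no second sequence, the runner-up count is 0.
        second_seq, second_n = ranked[1] if len(ranked) > 1 else (None, 0)
        if top_n - second_n < min_margin:
            continue  # ambiguous UMI: discard it entirely
        primary[top_seq] += 1
        if second_seq is not None:
            secondary[second_seq] += 1
    return primary, secondary


def filter_sequences(primary, secondary, max_ratio=0.7, min_abundance=0.001):
    """Keep sequences with a low secondary/primary ratio that are not rare."""
    total = sum(primary.values())
    kept = {}
    for seq, n_primary in primary.items():
        if secondary[seq] / n_primary >= max_ratio:
            continue  # seen too often as a secondary: likely erroneous
        if n_primary / total < min_abundance:
            continue  # globally rare sequence
        kept[seq] = n_primary
    return kept


# Toy data: "ACGT" plays the true sequence, "ACGA" an error derived from it.
reads = (
    [("umi1", "ACGT")] * 3
    + [("umi2", "ACGT")] * 4 + [("umi2", "ACGA")]
    + [("umi3", "ACGT")] * 3 + [("umi3", "ACGA")]
    + [("umi4", "ACGA")] * 3 + [("umi4", "ACGT")]   # error wins (Case 2)
    + [("umi5", "ACGT")] * 2 + [("umi5", "ACGA")]   # margin of 1: ambiguous
)
primary, secondary = call_umis(reads)
kept = filter_sequences(primary, secondary)
print(kept)
```

In this toy data set the erroneous sequence is primary for one UMI but secondary for two, so its secondary-to-primary ratio exceeds the threshold and it is removed. The true sequence is secondary only once against three primary occurrences and is retained.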