Analysis protocol: filtering using UMI-based error rates
The resulting paired-end reads were merged and then separated by gene prior to downstream analysis, where UMIs are critical in two ways. Firstly, sequences are clustered by UMI and the most abundant sequence associated with each UMI is selected as its primary sequence; the number of unique UMIs is then counted for each distinct sequence (Figure 1C). A UMI is discarded as ambiguous if its most abundant sequence does not exceed the next most abundant sequence by at least two reads. The most abundant sequence will usually be the correct one (Figure 2A, Case 1) but, because most UMIs are represented by only a small number of reads, an erroneous sequence is sometimes sampled more often than the true sequence, so that the erroneous sequence becomes the primary sequence of the UMI (Figure 2A, Case 2). Secondly, we reasoned that it may be possible to eliminate these errors by using the UMIs to provide information on global error rates across all samples. We implemented this in MAUI-seq by recording both the most abundant (primary) and the second most abundant (secondary) sequence whenever two or more sequences were associated with the same UMI. MAUI-seq then distinguishes between true and erroneous sequences based on the ratio of secondary to primary occurrences of each sequence, eliminating sequences that show a high ratio (default threshold 0.7; Figure 1C and Figure 2B). The 0.7 threshold was chosen empirically, based on the ratios observed for known true and erroneous sequences, but it is a compromise because the incidence of secondary sequences varies across genes and studies. Examination of the results may suggest that a different threshold is appropriate in other studies. Finally, globally rare sequences are discarded (the default threshold is 0.1% averaged across samples; a lower threshold could be used if samples were sequenced to a greater depth). Python scripts for separating the genes and for the UMI analysis are available at https://github.com/jpwyoung/MAUI.
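The filtering steps described above can be illustrated with a minimal Python sketch. This is not the code from the MAUI repository: the function names (`count_umis`, `filter_sequences`), the parameter names, and the input layout (one list of `(umi, sequence)` pairs per sample, for a single gene) are hypothetical. Several details are assumptions made for illustration only: a UMI with a single distinct sequence is treated as having a runner-up count of zero, ambiguous UMIs contribute to neither the primary nor the secondary counts, a sequence is eliminated when its pooled secondary/primary ratio is strictly greater than the threshold, and "averaged across samples" is taken to mean the mean of the per-sample proportions of accepted UMIs.

```python
from collections import Counter, defaultdict


def count_umis(reads, min_lead=2):
    """Count primary and secondary UMI occurrences per sequence.

    reads: iterable of (umi, sequence) pairs from one sample and one gene.
    Returns two Counters mapping sequence -> number of UMIs.
    """
    by_umi = defaultdict(Counter)
    for umi, seq in reads:
        by_umi[umi][seq] += 1

    primary, secondary = Counter(), Counter()
    for seq_counts in by_umi.values():
        ranked = seq_counts.most_common(2)
        top_seq, top_n = ranked[0]
        # Assumption: with no second sequence, the runner-up count is 0.
        runner_up_n = ranked[1][1] if len(ranked) > 1 else 0
        if top_n - runner_up_n < min_lead:
            continue  # ambiguous UMI: lead of fewer than min_lead reads
        primary[top_seq] += 1
        if len(ranked) > 1:
            # Ties for second place are broken arbitrarily in this sketch.
            secondary[ranked[1][0]] += 1
    return primary, secondary


def filter_sequences(samples, max_ratio=0.7, min_abundance=0.001):
    """Apply the ratio and rarity filters across all samples.

    samples: dict mapping sample name -> iterable of (umi, sequence) pairs.
    Returns a dict mapping sample name -> {sequence: primary UMI count}
    restricted to the sequences that pass both filters.
    """
    per_sample = {}
    total_primary, total_secondary = Counter(), Counter()
    for name, reads in samples.items():
        primary, secondary = count_umis(reads)
        per_sample[name] = primary
        total_primary.update(primary)      # pool counts over all samples
        total_secondary.update(secondary)

    kept = set()
    for seq, n_primary in total_primary.items():
        # Global secondary/primary ratio; a high ratio suggests an error.
        if total_secondary[seq] / n_primary > max_ratio:
            continue
        # Assumption: mean of per-sample proportions of accepted UMIs.
        shares = [p[seq] / sum(p.values()) for p in per_sample.values() if p]
        if sum(shares) / len(shares) < min_abundance:
            continue
        kept.add(seq)

    return {
        name: {seq: n for seq, n in primary.items() if seq in kept}
        for name, primary in per_sample.items()
    }
```

Calling `filter_sequences({"sample1": reads1, "sample2": reads2})` returns, for each sample, the accepted sequences with their primary UMI counts. Because the primary and secondary counts are pooled before the ratio is computed, a sequence that happens to become the primary sequence of a few UMIs in one sample can still be rejected if, over all samples, it appears as a secondary sequence too often relative to its primary count.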