Features of GeneMiner
The results of assessing the impact of sequencing depth on assembly performance are presented in Table S1 (Weighted-node) and Figure 3-B. As the sequencing depth increased from 1x to 100x, the number of high-quality results (level 1) increased from 24 to 297, while the sum of high-quality and medium-quality results (levels 1 & 2) increased from 270 to 330. The overall accuracy rate also improved with increasing sequencing depth, rising from 82.84% to 99.47%. Assembly performance was adequate when the sequencing depth exceeded 20x, and the statistics gradually converged once the depth surpassed 50x (Figure 3-B). Notably, increasing the sequencing depth beyond 50x had only a marginal effect on the assembly. Compared with using k-mer counts, the Weighted-node model increased the number of high-quality results. Additionally, when the depth was above 10x, both the indel rate and the substitution rate were notably reduced (Figure 3-B).
The results of the grid test, with the variance in reference sequences ranging from 0 to 30% and the sequencing depth ranging from 1x to 100x, are presented in Table S2. The heatmap of the high-quality genes obtained (level 1) is shown in Figure 3-D, and the heatmaps for levels 2 through 4, indel rate, substitution rate, and recovered genes are shown in Figure S1. Figure 3-D shows that when the sequencing depth is below 10x, or the variance in reference sequences is above 18%, the number of high-quality genes obtained is less than 50% of the total number of genes, and the number of successfully recovered genes is also greatly reduced. These data demonstrate that the sequencing depth and the variance in the reference sequences are the most important factors that affect gene recovery.
The results of the evaluation of the bootstrap parameter are shown in Table S3. For each set of results, we broke the genes down by bootstrap score into five intervals: [0, 50), [50, 90), [90, 99), [99, 100), and 100. We then compared the changes in the number of genes of each quality level, the substitution rate, and other parameters after removing genes with different scores. The results indicate that genes with a high bootstrap score (above 90) include the majority of high-quality (level 1) and medium-quality (levels 2 & 3) genes and exclude some low-quality (level 4) genes. This trend is particularly evident when the variance in reference sequences is greater than 11%. However, if users select a bootstrap threshold that is too strict, for example, removing all sequences with a bootstrap score not equal to 100, many high-quality sequences will be lost. Of note, when the variance in reference sequences is less than or equal to 15%, removing genes with a bootstrap score lower than 90 hardly sacrifices high-quality results and can effectively eliminate low-quality results caused by highly variable reference sequences. We visually compared the results obtained after removing genes with a bootstrap score lower than 90 with the unfiltered results (Figure 3-C). These results reveal that when the variance in reference sequences is above 11%, filtering by bootstrap score can effectively reduce base variations caused by incorrect assembly.
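The breakdown by bootstrap score and the filtering step described above can be illustrated with a short Python sketch. This is not GeneMiner's own code: the function names, the record layout (gene name, quality level, bootstrap score), and the example values are all hypothetical and chosen only to show the five intervals and the effect of a score threshold of 90.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Interval boundaries taken from the text: [0, 50), [50, 90), [90, 99), [99, 100), and 100.
BIN_EDGES = [50, 90, 99, 100]
BIN_LABELS = ["[0, 50)", "[50, 90)", "[90, 99)", "[99, 100)", "100"]


def bootstrap_bin(score):
    """Return the label of the interval that contains a bootstrap score."""
    if not 0 <= score <= 100:
        raise ValueError(f"bootstrap score out of range: {score}")
    # bisect_right places a score equal to an edge into the next interval,
    # which matches the half-open intervals used in the text.
    return BIN_LABELS[bisect_right(BIN_EDGES, score)]


def levels_per_bin(genes):
    """Count quality levels (1 to 4) within each bootstrap interval."""
    counts = defaultdict(Counter)
    for _name, level, score in genes:
        counts[bootstrap_bin(score)][level] += 1
    return counts


def filter_by_bootstrap(genes, min_score=90):
    """Keep genes whose bootstrap score is at least min_score."""
    return [gene for gene in genes if gene[2] >= min_score]


# Hypothetical records, invented for illustration only:
# (gene name, quality level, bootstrap score).
genes = [
    ("geneA", 1, 100),
    ("geneB", 1, 99.5),
    ("geneC", 2, 93),
    ("geneD", 4, 42),
    ("geneE", 3, 90),
    ("geneF", 4, 75),
]

kept = filter_by_bootstrap(genes, min_score=90)
print([name for name, _level, _score in kept])
print({label: dict(c) for label, c in levels_per_bin(genes).items()})
```

In this toy data set, a threshold of 90 removes only the two level 4 genes, whereas a threshold of 100 would also discard geneB, a level 1 gene, which mirrors the trade-off described for overly strict thresholds.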