Features of GeneMiner
The results of assessing the impact of sequencing depth on assembly
performance are reported in Table S1 (Weighted-node) and Figure 3-B. As
the sequencing depth increased from 1x to 100x, the number of
high-quality results (level 1) increased from 24 to 297, while the
combined number of high-quality and medium-quality results (levels 1 and 2)
increased from 270 to 330. The overall accuracy rate also improved
significantly with increasing sequencing depth, rising from 82.84% to
99.47%. Assembly performance was adequate once the
sequencing depth exceeded 20x, and the statistics gradually converged
as the depth surpassed 50x (Figure 3-B). Notably, increasing the
sequencing depth above 50x improved the assembly only marginally.
Compared with using k-mer counts, the
Weighted-node model increases the number of
high-quality results. Additionally, when the depth is
above 10x, both the indel rate and the substitution rate are notably reduced
(Figure 3-B).
The results of the grid test, with the variance in reference
sequences ranging from 0 to 30% and the sequencing depth from 1x to 100x,
are reported in Table S2. The heatmap of the high-quality genes obtained
(level 1) is shown in Figure 3-D, and the heatmaps for levels 2 through 4,
indel rate, substitution rate, and
recovered genes are shown in Figure S1. The results in Figure 3-D show that
when the sequencing depth is below 10x, or the variance in reference
sequences is above 18%, the number of high-quality genes obtained is
less than 50% of the total number of genes, and the number of
successfully recovered genes is also greatly reduced. These data
demonstrate that the sequencing depth and the variance in the reference
sequences are the most important factors that affect gene recovery.
The results of the evaluation of the bootstrap parameter are shown in
Table S3. For each set of results, we conducted a detailed breakdown
based on bootstrap scores, dividing them into five bins: [0, 50),
[50, 90), [90, 99), [99, 100), and 100. We then compared the changes
in the number of genes of each quality level, the substitution rate, and
other parameters after removing genes with different scores. The results
indicate that the genes with higher bootstrap scores (above 90) include the
majority of high-quality (level 1) and medium-quality (levels 2 and 3)
genes and exclude some low-quality (level 4) genes. This
trend is particularly evident when the variance in reference sequences
is greater than 11%. However, if users select a bootstrap threshold that is
too strict, for example, removing all sequences with a bootstrap score
below 100, many high-quality sequences will be lost. Of
note, when the variance in reference sequences is less than or equal to
15%, removing genes with a bootstrap score lower than 90 hardly
sacrifices high-quality results and can effectively eliminate
low-quality results that are caused by highly variable reference
sequences. We visually compared the results obtained after removing genes with a
bootstrap score lower than 90 with the unfiltered results (Figure 3-C).
These results reveal that when the variance in reference sequences is
above 11%, filtering through the bootstrap score can effectively reduce
base variations caused by incorrect assembly.
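The bootstrap breakdown and the 90-score filter described above can be illustrated with a minimal Python sketch. This is not GeneMiner's own code: the function names (bootstrap_bin, count_by_bin, filter_by_bootstrap), the dictionary layout, and the example gene names and scores are all hypothetical and serve only to show the binning into [0, 50), [50, 90), [90, 99), [99, 100), and 100, followed by removal of genes scoring below a chosen threshold.

```python
# Illustrative sketch only; names and example scores are assumptions,
# not part of GeneMiner.

# The five bootstrap-score bins described in the text.
BIN_LABELS = ["[0, 50)", "[50, 90)", "[90, 99)", "[99, 100)", "100"]


def bootstrap_bin(score):
    """Return the label of the bin that a bootstrap score falls into."""
    if not 0 <= score <= 100:
        raise ValueError(f"bootstrap score out of range: {score}")
    if score == 100:
        return "100"
    if score >= 99:
        return "[99, 100)"
    if score >= 90:
        return "[90, 99)"
    if score >= 50:
        return "[50, 90)"
    return "[0, 50)"


def count_by_bin(scores):
    """Count genes per bin; `scores` maps gene name -> bootstrap score."""
    counts = {label: 0 for label in BIN_LABELS}
    for score in scores.values():
        counts[bootstrap_bin(score)] += 1
    return counts


def filter_by_bootstrap(scores, threshold=90):
    """Keep genes whose score is at least `threshold` (drop lower ones)."""
    return {gene: s for gene, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    # Made-up scores for demonstration purposes.
    example = {"geneA": 100, "geneB": 95.2, "geneC": 42.0,
               "geneD": 99.5, "geneE": 89.9}
    print(count_by_bin(example))
    print(sorted(filter_by_bootstrap(example, threshold=90)))
```

Setting threshold=100 in this sketch corresponds to the overly strict filter discussed above, which would keep only genes with a perfect bootstrap score.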