Pre-assembly filtering
One key factor that makes GeneMiner faster than other software is its k-mer-based read filtering (see Methods section); other tools filter reads with sequence alignment-based methods such as BLAST. Although k-mer-based filtering is fast, it has two main drawbacks. First, when the value of k is small, different sequences may share the same k-mer, which raises the false-positive rate. Second, the k-mer method may struggle with genes that contain extensive complex structural variation or repetitive regions. To address the first issue, we designed a re-filtering method that automatically and iteratively filters the labeled filtered datasets with a gradually increasing kf (the k-mer length used in the read filtering step). Re-filtering prevents excessively large filtered datasets from slowing down the subsequent assembly. A larger kf is also recommended for repeat regions. Regarding the second issue, we suggest that GeneMiner be used only for genes that lack complex structural variation and repetitive regions. Genes used as molecular phylogenetic markers typically have simpler structures and do not contain repetitive regions.
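The re-filtering idea can be illustrated with a minimal Python sketch. It is not GeneMiner's implementation: the function names, the rule that a read is kept when it shares at least one k-mer with the reference, the `max_reads` threshold that triggers another round, and the fixed `step` by which kf grows are all assumptions made for illustration. Reverse complements and sequencing errors are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all k-mers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_reads(reads, reference, k):
    """Keep the reads that share at least one k-mer with the reference."""
    ref_kmers = kmers(reference, k)
    return [r for r in reads if not ref_kmers.isdisjoint(kmers(r, k))]


def refilter(reads, reference, k_start, k_max, step, max_reads):
    """Filter at k_start, then re-filter the kept reads with a larger k
    while the filtered set is still larger than max_reads (hypothetical
    stopping rule) and k can still grow without exceeding k_max."""
    k = k_start
    kept = filter_reads(reads, reference, k)
    while len(kept) > max_reads and k + step <= k_max:
        k += step
        kept = filter_reads(kept, reference, k)
    return kept, k


# Toy data (made up for this sketch, not from the study).
reference = "ACGTTGCAAGGCTTAGCCATGA"
reads = [
    "GCAAGGCTTAGC",  # true fragment of the reference
    "TTTTAAGGTTTT",  # shares only the 4-mer AAGG with the reference
    "CCCCCCCCCCCC",  # shares no 4-mer with the reference
]

first_pass = filter_reads(reads, reference, 4)
final_reads, final_k = refilter(reads, reference, k_start=4, k_max=8,
                                step=2, max_reads=1)
print(len(first_pass), len(final_reads), final_k)
```

In this toy case the first pass at k = 4 keeps a false positive that matches the reference only through a single short k-mer; re-filtering at k = 6 removes it while retaining the true fragment, which mirrors the trade-off between small and large k described above.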