Pre-assembly filtering
One key reason GeneMiner is faster than other software is its use of
k-mer-based read filtering (see Methods section), whereas other tools
rely on sequence alignment-based methods, such as BLAST, to filter
reads. Although k-mer-based filtering is fast, it has two drawbacks.
First, when the value of k is
small, different sequences may generate the same k-mer, leading to a
higher false-positive rate. Second, the k-mer method may struggle with
genes that contain extensive complex structural variation and repetitive
regions. To address the consequences of a small k, we designed a
re-filtering method that automatically and iteratively re-filters the
labeled filtered datasets using a gradually increasing kf (the k-mer
length used in the read filtering step). This re-filtering method
prevents the
slowdown of the subsequent assembly caused by excessively large
filtered datasets. Additionally, a larger kf is
recommended for repeat regions. Regarding the second issue, we suggest
that GeneMiner be used only for genes without complex structural
variation or repetitive regions. Typically, genes used as molecular
phylogenetic markers have simpler structures and do not contain
repetitive regions.
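
The effect of k on false positives, and the idea of re-filtering with an increasing kf, can be illustrated with a minimal Python sketch. This is not GeneMiner's implementation: the function names, the rule that a read is kept if it shares at least one k-mer with a reference, the stopping rule based on a maximum number of reads, and the toy sequences are all assumptions made for illustration, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_reads(reads, references, k):
    """Keep reads that share at least one k-mer with any reference.

    The "one shared k-mer" rule is an assumption of this sketch.
    """
    ref_kmers = set()
    for ref in references:
        ref_kmers |= kmers(ref, k)
    return [read for read in reads if kmers(read, k) & ref_kmers]


def refilter(reads, references, k_start, k_step, k_max, max_reads):
    """Re-filter with a gradually increasing k until the dataset is small.

    max_reads and k_max are hypothetical parameters: filtering stops once
    the filtered dataset has at most max_reads reads or k cannot grow further.
    """
    k = k_start
    kept = filter_reads(reads, references, k)
    while len(kept) > max_reads and k + k_step <= k_max:
        k += k_step
        kept = filter_reads(kept, references, k)
    return kept, k


# Toy data: the first read is a true fragment of the reference, the second
# only shares short k-mers with it by chance, the third is unrelated.
references = ["ACGTACGTTAGC"]
reads = ["CGTACGTT", "TTACGGGA", "GGGGCCCC"]

small_k = filter_reads(reads, references, 3)   # spurious read survives
large_k = filter_reads(reads, references, 5)   # spurious read is removed
final_reads, final_k = refilter(reads, references, 3, 2, 7, max_reads=1)
```

With k = 3 the second read is retained because it shares short k-mers with the reference by chance; raising k to 5 removes it while the true fragment is kept, which is the behavior the re-filtering step relies on.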