The impact of reference sequences
In Test II, when the variance in the reference sequences reaches 18%, the number of high-quality assembly results decreases significantly. In our test datasets, each gene included 50 reference sequences with the same mutation rate, which mildly increased the software's tolerance to variance. In practice, the actual mutation rate between the reference sequences and the target sequence is not known in advance, although it can be estimated from pairwise comparisons among the reference sequences. For the best performance of GeneMiner, we suggest that the average mutation rate in pairwise comparisons of the reference sequences should not exceed 15%.
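As a rough illustration of how the average pairwise mutation rate among reference sequences could be checked before running GeneMiner, the following minimal sketch computes the mean proportion of differing sites (p-distance) over all pairs of sequences. It assumes the reference sequences have already been aligned to equal length; the function names and the toy sequences are hypothetical and are not part of GeneMiner.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N", "?"}
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in skip or y in skip:
            continue
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else 0.0


def mean_pairwise_distance(seqs):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy aligned reference sequences (hypothetical data).
refs = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTAC"]
rate = mean_pairwise_distance(refs)
print(f"mean pairwise difference: {rate:.1%}")
if rate > 0.15:
    print("above the suggested 15% threshold")
```

The uncorrected p-distance is only one possible estimator; a model-corrected distance would give somewhat larger values for divergent sequences.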
Often, users have only a limited number of reference sequences available. In this case, we suggest using all the genes of species from the same family as reference sequences, without giving special consideration to the mutation rate of those sequences. In particular, if only a single reference sequence or a few very closely related reference sequences are available, GeneMiner's tolerance to variance in the reference sequences will decrease significantly. This is because the k-mer based pre-filtering method is less effective at retrieving the correct reads from limited reference sequences. Additionally, the assembly process might fail to acquire the correct seed and to allocate enough weight to the correct results. In such cases, if providing more reference sequences is not possible, users should try reducing the values of kf and ka in the program (corresponding to "k1" and "k2", respectively, in GeneMiner). This adjustment should yield more output and longer contigs, but it might also increase assembly errors in the results.
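The effect of the filtering k-mer size can be illustrated with a toy example. The sketch below is not GeneMiner's implementation; it is a simplified, hypothetical filter that keeps a read if it shares at least one k-mer with any reference sequence, and it ignores reverse complements. The parameter `k` plays a role analogous to kf, and all sequences are invented for illustration.

```python
def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_reads(reads, references, k):
    """Keep reads that share at least one k-mer with any reference."""
    ref_kmers = set()
    for ref in references:
        ref_kmers |= kmers(ref, k)
    return [r for r in reads if kmers(r, k) & ref_kmers]


# A single reference sequence and three toy reads (hypothetical data).
references = ["ACGTTGCAAGGCTTAC"]
reads = [
    "ACGTTGCAAG",  # identical to the start of the reference
    "ACGTAGCATG",  # same region with two substitutions
    "TTTTTTTTTT",  # unrelated read
]

kept_large_k = filter_reads(reads, references, k=8)
kept_small_k = filter_reads(reads, references, k=4)
print(len(kept_large_k), len(kept_small_k))
```

With the larger k, only the identical read is retained; with the smaller k, the diverged read is recovered as well. Shorter k-mers also match unrelated sequence more easily, which is consistent with the higher risk of assembly errors noted above.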
In addition, paralogous genes can also impact the results, because paralogs can lead to significant changes in the branch lengths and even the topology of the phylogenetic tree. The most prevalent methods for identifying orthologs rely predominantly on sequence similarity and sequence length. One example is reciprocal best hits (RBH), in which two genes from different genomes are assumed to be orthologs if each finds the other as its best hit in the other genome. In GeneMiner, we use reference-guided assembly and a constraint on sequence length to obtain putative orthologs. If there are multiple very similar paralogous genes in both the reference sequences and the target species, the target gene may be incorrectly assembled from different paralogous genes. Therefore, we recommend that researchers remove paralogous genes from the reference sequences before providing them, to obtain more accurate results. Similarly, GeneMiner is not designed to handle polyploids and is not robust enough to do so. Researchers should aim to use sequencing data from diploid species whenever possible. Additionally, GeneMiner can export contigs with degenerate bases, which can be further improved using our PPD script (Zhou et al., 2022). This feature allows us to accurately identify putative paralogs that may be overlooked by other tools. PPD identifies paralogs by detecting shared heterozygosity at a locus between individuals and heterozygosity at a locus within individuals.
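The reciprocal best hits criterion mentioned above can be sketched as follows. This is a minimal illustration of the general RBH idea, not a component of GeneMiner or PPD. It assumes that similarity scores between genes of two genomes have already been computed, for example by a sequence search tool; the gene names and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    return {
        query: max(subjects, key=subjects.get)
        for query, subjects in scores.items()
        if subjects
    }


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )


# Hypothetical similarity scores; A2 stands in for a paralog of A1.
a_vs_b = {
    "A1": {"B1": 950.0, "B2": 610.0},
    "A2": {"B1": 700.0, "B2": 640.0},
}
b_vs_a = {
    "B1": {"A1": 948.0, "A2": 702.0},
    "B2": {"A1": 605.0, "A2": 655.0},
}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case only A1 and B1 are reported as putative orthologs: A2 also hits B1 best, but B1's best hit is A1, so the pair is not reciprocal. When paralogs are nearly identical, such score differences become small and the assignment becomes unreliable, which is why removing paralogs from the reference sequences in advance is recommended.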