Reads filtering
GeneMiner retains reads that are highly similar to reference sequences from closely related taxa. The more closely related the input sequences are to the reference sequences, the better the results of the program. The process includes two stages: building the reference hash table and then querying it (Figure 2-A). To create the reference hash table, the reference sequences of multiple genes are split into a k-mer dictionary, which is simply a dictionary of length-k nucleotide strings. For every k-mer, a record of its position information, its occurrence number, and the name of the gene it comes from is added to the hash table. The k-mer size, kf, is set by the user. To use the reference hash table, raw reads from NGS data (from Illumina, Roche-454, ABI, or other sequencing platforms) in FASTQ format are used as input. Like the reference sequences, each raw read is split into k-mers using the same kf previously designated by the user. If any k-mer of a read matches a k-mer in the reference hash table, the whole read is kept and assigned to the target gene’s filtered dataset. In this process, GeneMiner can process multiple genes and samples concurrently, which speeds up the analysis.
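
The two stages described above can be illustrated with a minimal Python sketch. This is not GeneMiner's actual implementation: the function names build_reference_table and filter_reads are hypothetical, reads are passed as plain sequence strings rather than parsed from FASTQ, and reverse-complement matching and concurrency are left out for brevity.

```python
from collections import defaultdict


def build_reference_table(references, kf):
    """Build a k-mer table: k-mer -> {gene name: [positions in that gene]}.

    `references` maps a gene name to its reference sequence (hypothetical
    input layout). The occurrence number of a k-mer within a gene is the
    length of its position list.
    """
    table = defaultdict(lambda: defaultdict(list))
    for gene, seq in references.items():
        seq = seq.upper()
        for pos in range(len(seq) - kf + 1):
            table[seq[pos:pos + kf]][gene].append(pos)
    return table


def filter_reads(reads, table, kf):
    """Assign each read to every gene with which it shares at least one k-mer."""
    kept = defaultdict(list)
    for read in reads:
        seq = read.upper()
        matched_genes = set()
        for pos in range(len(seq) - kf + 1):
            kmer = seq[pos:pos + kf]
            if kmer in table:  # membership test does not create new entries
                matched_genes.update(table[kmer])
        for gene in matched_genes:
            kept[gene].append(read)  # the whole read is kept, not just the k-mer
    return kept


if __name__ == "__main__":
    # Toy data, assumed purely for illustration.
    references = {"geneA": "ACGTACGTGA", "geneB": "TTTTGGGGCC"}
    reads = ["CCACGTAA", "GGGGCCAT", "AAAAAAAA"]
    table = build_reference_table(references, kf=4)
    for gene, gene_reads in filter_reads(reads, table, kf=4).items():
        print(gene, gene_reads)
```

In this toy run, the first read shares the k-mer ACGT with geneA and the second shares GGGG with geneB, so each is assigned to the corresponding gene, while the third read matches nothing and is discarded. A smaller kf makes the filter more permissive, and a larger kf makes it stricter.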