The Assembly module
For assembly, NOVOWrap accepts both gzip-compressed and uncompressed NGS data in FASTQ format as input. Because memory consumption is positively correlated with data size, oversized files are difficult to process on ordinary personal computers. The Assembly module therefore offers options for extracting a subset of the original data files to reduce memory usage.
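The subsetting step described above can be illustrated with a minimal Python sketch. It is not NOVOWrap's actual implementation: the function names `open_fastq` and `take_reads` are hypothetical, and the sketch assumes that every FASTQ record occupies exactly four lines (no wrapped sequences).

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    # Detect gzip by its two-byte magic number rather than by file extension.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def take_reads(src, dst, max_reads):
    """Copy at most `max_reads` records from `src` to an uncompressed `dst`.

    Assumes four lines per record. Returns the number of records written.
    """
    n_lines = 0
    with open_fastq(src) as fin, open(dst, "w") as fout:
        # Stream line by line so that memory use stays independent of file size.
        for line in islice(fin, max_reads * 4):
            fout.write(line)
            n_lines += 1
    return n_lines // 4
```

Because the input is streamed rather than loaded as a whole, such a step can itself run on a machine with little memory. For paired-end data, the same number of records would have to be taken from both files to keep the read pairs in step.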
Using user-provided taxon information (a scientific name or other taxon identifier) and the NCBI Taxonomy database, the module downloads the plastid genome of the most closely related species deposited in the NCBI RefSeq database. Alternatively, the program can run offline if the user provides a local reference. The seed sequences are then extracted from the reference according to its annotations.
With the inputs described above (NGS data files and taxonomy) and all parameters that the program can detect automatically (for instance, read length, whether the data are paired-end, assembly type, and the file path of each input and output), the module generates several configuration files for NOVOPlasty, each using a different seed.
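As an illustration of the automatic parameter detection mentioned above, the following Python sketch infers the read length and the sequencing layout from the input files. It is an assumption-laden example rather than the routine used by NOVOWrap: the names `detect_read_length` and `detect_layout` and the `sample_size` parameter are hypothetical, compression is guessed from the `.gz` suffix, and four-line FASTQ records are assumed.

```python
import gzip
from itertools import islice


def detect_read_length(fastq_path, sample_size=1000):
    """Return the longest read among the first `sample_size` records."""
    # Assumption: compressed inputs carry a ".gz" suffix.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    longest = 0
    with opener(fastq_path, "rt") as handle:
        # In a four-line record the sequence is the second line,
        # i.e. line indices 1, 5, 9, ...
        for seq in islice(handle, 1, sample_size * 4, 4):
            longest = max(longest, len(seq.strip()))
    return longest


def detect_layout(reads_1, reads_2=None):
    """Collect parameters that can be inferred from the inputs alone."""
    return {
        "read_length": detect_read_length(reads_1),
        # Data are treated as paired-end only if a second file is given.
        "paired_end": bool(reads_2),
    }
```

Sampling only the first records keeps the detection fast on large files; the longest sampled read is used because trimmed data sets may contain reads shorter than the nominal length.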
Next, the module calls NOVOPlasty. The outputs of NOVOPlasty are first checked to verify the completeness of the assembly and are then organized to avoid conflicts between different assembly processes.
To assemble multiple samples, the module can read input information from a comma-separated values (CSV) file and handle all assemblies automatically. The three-column CSV file should contain the file names of the “reads 1” data, the “reads 2” data (optional, only for paired-end sequencing), and the taxon information of each sample. Owing to the limited memory resources of personal computers, the program completes batch assembly sequentially, although users can achieve parallelism by running multiple instances of the program if hardware resources are sufficient.
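The batch workflow described above can be sketched in Python as follows. This is a hedged illustration, not NOVOWrap's own code: `read_batch_table` and `run_batch` are hypothetical names, the CSV file is assumed to have no header row, and `assemble` stands for any user-supplied callable that performs a single assembly.

```python
import csv


def read_batch_table(csv_path):
    """Parse rows of the form: reads 1, reads 2 (may be empty), taxon."""
    jobs = []
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            # Skip blank lines.
            if not any(cell.strip() for cell in row):
                continue
            if len(row) != 3:
                raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
            reads_1, reads_2, taxon = (cell.strip() for cell in row)
            jobs.append({
                "reads_1": reads_1,
                # An empty second column marks single-end data.
                "reads_2": reads_2 or None,
                "taxon": taxon,
            })
    return jobs


def run_batch(csv_path, assemble):
    """Run the assemblies one after another to limit peak memory use.

    `assemble` is a hypothetical callable accepting the keyword
    arguments reads_1, reads_2 and taxon.
    """
    return [assemble(**job) for job in read_batch_table(csv_path)]
```

Running the jobs in a plain loop mirrors the sequential design described in the text: only one assembly occupies memory at any time.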