The Assembly module
For assembly, NOVOWrap accepts NGS data in FASTQ format as input, either
gzip-compressed or uncompressed. Because memory consumption is positively
correlated with data size, very large files are difficult to process on
ordinary personal computers. The Assembly module therefore offers options
for extracting a subset of the original data files to reduce memory usage.
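The subsampling step can be illustrated with a minimal Python sketch. It is
not NOVOWrap's own code: the function names `open_fastq` and `take_reads`
and the `max_reads` parameter are hypothetical, and the sketch assumes
standard four-line FASTQ records. It detects gzip compression from the file
header and copies only the first records, one record at a time.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def take_reads(src, dst, max_reads):
    """Copy at most `max_reads` four-line records from src to dst.

    Hypothetical helper: records are streamed one at a time, so memory
    use does not grow with the size of the input file.
    Returns the number of complete records written.
    """
    written = 0
    with open_fastq(src) as fin, open(dst, "w") as fout:
        while written < max_reads:
            record = list(islice(fin, 4))
            if len(record) < 4:  # end of file or truncated record
                break
            fout.writelines(record)
            written += 1
    return written
```

For paired-end data, the same number of records would have to be taken from
both files so that read pairs stay in sync.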
With user-provided taxon information (a scientific name or other
taxonomic information) and the assistance of the NCBI Taxonomy database,
the module downloads the plastid genome of the most closely related
species deposited in the NCBI RefSeq database. Alternatively, the program
can run offline if the user provides a local reference. The seed
sequences are then extracted from the reference according to its
annotations.
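As an illustration of annotation-based seed extraction, the following
Python sketch cuts gene sequences out of a reference genome. It rests on
assumptions not stated in the text: annotations are given as
`(gene, start, end, strand)` tuples with 0-based, half-open coordinates, and
the gene names in `wanted` are placeholders rather than the module's actual
seed list. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def extract_seeds(genome, features, wanted=("rbcL", "matK")):
    """Cut candidate seed sequences out of a reference genome.

    genome   : reference sequence as a string
    features : iterable of (gene, start, end, strand) tuples, assumed
               0-based and half-open; strand is 1 or -1
    wanted   : placeholder gene names, not the module's real seed list
    Returns a dict mapping gene name to its sequence; the first
    occurrence of each gene wins.
    """
    seeds = {}
    for gene, start, end, strand in features:
        if gene not in wanted or gene in seeds:
            continue
        seq = genome[start:end]
        # Genes on the reverse strand are returned in coding orientation.
        seeds[gene] = reverse_complement(seq) if strand == -1 else seq
    return seeds
```

In practice the features would be parsed from the annotated reference
record rather than written by hand.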
With the inputs described above (NGS data files and taxon information)
and all parameters that the program can detect automatically (for
instance, read length, paired-end or single-end layout, assembly type,
and the file path of each input and output), the module generates
several configuration files for NOVOPlasty, each using a different seed.
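The automatic detection and per-seed configuration described above could
look roughly like the Python sketch below. It is a simplified assumption,
not the module's implementation: `detect_read_length` and `build_settings`
are hypothetical names, and the dictionary keys are descriptive labels, not
the actual fields of a NOVOPlasty configuration file.

```python
import os
from collections import Counter
from itertools import islice


def detect_read_length(lines, sample=1000):
    """Return the most common read length among the first `sample` reads.

    `lines` is any iterable of FASTQ text lines; the sequence is the
    second line of each four-line record.
    """
    lengths = Counter(
        len(line.strip()) for line in islice(lines, 1, sample * 4, 4)
    )
    if not lengths:
        raise ValueError("no reads found")
    return lengths.most_common(1)[0][0]


def build_settings(reads1, reads2, read_length, seeds, out_dir):
    """Build one settings dict per seed (keys are illustrative only)."""
    settings = []
    for gene, seed_path in seeds.items():
        settings.append({
            "project": gene,
            "seed": seed_path,
            "reads1": reads1,
            "reads2": reads2,
            "paired": reads2 is not None,  # single-end if no second file
            "read_length": read_length,
            "output": os.path.join(out_dir, gene),  # one folder per seed
        })
    return settings
```

Each dictionary would then be written out in whatever format the assembler
expects, giving one configuration file per seed.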
Next, the module calls NOVOPlasty. The outputs of NOVOPlasty are first
checked to verify the completeness of the assembly and are then
organized to avoid conflicts between different assembly processes.
To assemble multiple samples, the module can read input information from
a comma-separated values (CSV) file and handle all assemblies
automatically. The three-column CSV file should contain the file name of
the “reads 1” data, the file name of the “reads 2” data (optional, only
for paired-end sequencing) and the taxon information of each sample.
Owing to the limited memory resources of personal computers, the program
runs the assemblies sequentially to complete the batch, although users
can achieve parallelism by launching multiple instances of the program
if hardware resources are sufficient.
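The batch mode can be sketched in Python as follows. The sketch assumes a
CSV file without a header row and an empty second column for single-end
samples; `read_batch`, `run_batch` and the `assemble` callback are
hypothetical names standing in for the module's own routines.

```python
import csv


def read_batch(csv_path):
    """Read (reads1, reads2, taxon) jobs from a three-column CSV file.

    Assumes no header row; an empty second column means single-end data.
    """
    jobs = []
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            row = [cell.strip() for cell in row]
            if not any(row):  # skip blank lines
                continue
            row += [""] * (3 - len(row))  # pad short rows
            reads1, reads2, taxon = row[:3]
            jobs.append((reads1, reads2 or None, taxon))
    return jobs


def run_batch(jobs, assemble):
    """Run the assemblies one after another to keep peak memory low."""
    return [assemble(reads1, reads2, taxon) for reads1, reads2, taxon in jobs]
```

Running the jobs in a plain loop keeps only one assembly in memory at a
time, which matches the sequential design described above.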