Affiliations:
1State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
2University of Chinese Academy of Sciences, Beijing 100049, China
3Shaanxi University of Science and Technology, Xi’an 710021, China
*To whom correspondence should be addressed.
Abstract
Plastid genomes are unique to plants and play an important role in genomics and evolutionary biology. Next-generation sequencing has revolutionized the acquisition of plastid genome data, to the point that genome assembly and annotation have become the bottlenecks for large-scale use of these data. Here we present NOVOWrap, a novel open-source, cross-platform tool with both command-line and graphical user interfaces for automatic plastid genome assembly on personal computers. With minimal input and user intervention, NOVOWrap automatically assembles plastid genomes, validates the results, and standardizes the genome structure using affordable computing resources. The software was benchmarked successfully on eleven plastid genomes from species of lycopods, gymnosperms, and angiosperms. The program is expected to free researchers from laborious manual computer work and to produce reliable, standardized genomic data.
KEYWORDS: plastid genome, assembly, quadripartite structure
Availability
The source code and portable packages for various operating systems are available at https://github.com/wpwupingwp/novowrap. The software is released under the AGPL-3.0 license.
Introduction
Plastids (chloroplasts in green plants) are thought to have originated from a single endosymbiotic event in which a eukaryotic host engulfed a cyanobacterium (Keeling 2010). Over the course of evolution, the genome of the ancestral cyanobacterium shrank to become the plastid genome, which is approximately 120-160 kb in size (Green 2011).
A typical plastid genome is a circular, double-stranded DNA molecule with a quadripartite structure consisting of two inverted repeat regions (IRs), a small single-copy region (SSC), and a large single-copy region (LSC). The plastid genome usually encodes 110-130 genes, which show high homology and collinearity among plant species (Wicke et al. 2011).
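The quadripartite layout can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the region order (LSC, first IR copy, SSC, second IR copy) is one common convention, the sequences are toy strings, and the function names are hypothetical rather than part of NOVOWrap.

```python
# Toy model of a quadripartite plastid genome (illustrative assumption only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_quadripartite(lsc: str, ir: str, ssc: str) -> str:
    """Linearize a circular genome as LSC + IR + SSC + reverse-complemented IR."""
    return lsc + ir + ssc + reverse_complement(ir)


def inverted_repeats_match(genome: str, lsc_len: int, ir_len: int, ssc_len: int) -> bool:
    """Check that the second IR copy is the reverse complement of the first."""
    first = genome[lsc_len:lsc_len + ir_len]
    second_start = lsc_len + ir_len + ssc_len
    second = genome[second_start:second_start + ir_len]
    return second == reverse_complement(first)


if __name__ == "__main__":
    genome = build_quadripartite("AAACCC", "GATT", "TTG")
    print(genome, inverted_repeats_match(genome, 6, 4, 3))
```

Because the two IR copies are reverse complements of each other, knowing the region boundaries is enough to check whether a linearized assembly respects the expected structure.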
Owing to their conserved structure, moderate sequence variability (Smith 2015), and high copy number per cell, partial and full-length plastid genome sequences have been widely used in plant phylogenetics, comparative genomics, and biotechnology (Tonti-Filippini et al. 2017).
The first two plastid genomes were determined in 1986 from Nicotiana tabacum (Sugiura et al. 1986) and Marchantia polymorpha (Brassell et al. 1986). To date, ~5,000 complete plastid genomes have been deposited in GenBank (Sayers et al. 2019), and the number has soared since the advent of next-generation sequencing (NGS) technologies (Twyford & Ness 2017). Furthermore, several software tools for annotating plastid genomes have been developed (Huang & Cronk 2015; Qu et al. 2019; Shi et al. 2019; Tillich et al. 2017). Thus, the biggest remaining obstacle to acquiring plastid genome data appears to be the assembly step.
The mainstream strategy for determining a plastid genome is to sequence total DNA, which includes plastid, mitochondrial, and nuclear components, using NGS technology. Handling the resulting mixed reads is therefore a challenge in plastid genome assembly. One solution is to map all reads to a closely related reference genome so that plastid reads can be filtered and collected for assembly. Another solution is to construct contigs by de novo assembly and then screen the plastid-related contigs and assemble them into a complete genome. Because all reads are used for assembly, this second method requires more computing power than personal computers can usually provide. Both methods require manual adjustment of the IR regions to form a complete plastid genome, and a highly similar reference genome is sometimes indispensable (Twyford & Ness 2017).
A novel and increasingly popular strategy is to use a universal seed sequence to bait plastid reads and to extend the assembly iteratively until the full circle is formed. This approach not only avoids the computational burden of processing all the reads but also removes the need for a complete reference genome (Freudenthal et al. 2019). Two widely used implementations of this strategy are NOVOPlasty (Dierckxsens et al. 2016) and GetOrganelle (Jin et al. 2019). The former hashes all the reads before the extension step and uses its own assembly algorithm instead of calling SPAdes (Bankevich et al. 2012), and therefore requires less running time.
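The seed-and-extend idea can be sketched in a few lines of Python. This is a deliberately simplified toy, not the algorithm used by NOVOPlasty or GetOrganelle: it assumes error-free reads on a single strand, unique k-mers, and greedy rightward extension, and the function name and parameters are hypothetical.

```python
def extend_from_seed(seed: str, reads: list[str], k: int = 4, max_steps: int = 1000):
    """Greedily extend a seed with reads until the seed reappears (circle closed).

    Returns (sequence, is_circular). Reads that never overlap the growing
    contig, such as nuclear or mitochondrial reads, are simply never used.
    """
    contig = seed
    for _ in range(max_steps):
        tail = contig[-k:]              # last k bases act as the bait
        extension = ""
        for read in reads:
            pos = read.find(tail)
            # Use the read only if it carries bases beyond the overlap.
            if pos != -1 and pos + k < len(read):
                extension = read[pos + k:]
                break
        if not extension:
            break                       # no read extends the contig further
        contig += extension
        # If the seed shows up again downstream, the circle has closed.
        idx = contig.find(seed, 1)
        if idx != -1:
            return contig[:idx], True
    return contig, False


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTA"           # toy circular genome
    doubled = genome + genome
    reads = [doubled[i:i + 8] for i in range(0, len(genome), 3)]
    reads.append("GGGGGGGG")            # unrelated read, ignored by the baiting
    print(extend_from_seed("ACGTT", reads))
```

In this toy, only reads that overlap the growing contig are ever touched, which is the property that keeps memory and running time low compared with assembling every read.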
Unfortunately, all available tools offer only command-line interfaces and involve complex inputs or settings, which is a challenge for users with limited computer skills or knowledge of operating systems (Attwood et al. 2019). Manual intervention to handle questionable outputs is also unavoidable. Even NOVOPlasty and GetOrganelle, which usually generate full-length genome sequences, can produce multiple outputs that differ in direction, starting position, or LSC/SSC orientation, and sometimes contain misassemblies. Moreover, although the commonly used rbcL seed usually works well for baiting reads, it may fail in some cases, such as gene transfer events or poor sequence quality. Thus, providing additional seeds could be helpful (Lim et al. 2018).
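To illustrate why outputs with different directions or starting positions can still represent the same circular molecule, the following Python sketch tests equivalence under rotation and reverse complementation. It is a minimal example with a hypothetical function name; it does not account for alternate LSC/SSC orientation or sequencing errors, and it is not the validation procedure of any of the tools named above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def same_circular_molecule(a: str, b: str) -> bool:
    """True if b is a rotation of a, on either strand.

    A string b is a rotation of a exactly when both have the same length
    and b occurs as a substring of a + a.
    """
    if len(a) != len(b):
        return False
    doubled = a + a
    return b in doubled or reverse_complement(b) in doubled


if __name__ == "__main__":
    print(same_circular_molecule("AACGT", "CGTAA"))  # different start site
    print(same_circular_molecule("AACGT", "GTTAC"))  # opposite strand, rotated
    print(same_circular_molecule("AACGT", "AACGA"))  # different sequence
```

A check of this kind shows that several apparently different outputs may be redundant, which is why a standardized direction and starting position simplify downstream comparison.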
Here, we present NOVOWrap, a user-friendly, cross-platform Python package for plastid genome assembly. The program runs effectively on a personal computer and generates reliable assemblies with a standardized structure while requiring minimal user intervention. By providing a highly automated solution, the program can help researchers with limited bioinformatics skills or computing resources determine plastid genomes more easily for phylogenetics, genomics, and biotechnology.
Methods