Affiliations:
1State Key Laboratory of Systematic and Evolutionary
Botany, Institute of Botany, Chinese Academy of Sciences, Beijing
100093, China
2University of Chinese Academy of Sciences, Beijing
100049, China
3Shaanxi University of Science and Technology, Xi’an
710021, China
*To whom correspondence should be addressed.
Abstract
Plastid genomes are unique to plants and play an important role in
genomics and evolutionary biology. Next-generation sequencing has
revolutionized plastid genome data acquisition to the point that genome
assembly and annotation have become the bottlenecks in using large
plastid genome datasets. Here we present a novel open-source,
cross-platform tool, NOVOWrap, with both command-line and graphical user
interfaces, for automatic plastid genome assembly on personal computers.
With minimal input and user intervention, NOVOWrap automatically
assembles plastid genomes, validates the results and standardizes the
structure using affordable computing resources. The performance of the
software has been successfully benchmarked against eleven plastid
genomes of species belonging to lycopods, gymnosperms, and angiosperms.
The program is expected to free researchers from laborious manual
computation and to produce reliable, standardized genomic data.
KEYWORDS: plastid genome, assembly, quadripartite structure
Availability
The source code and portable packages for various operating systems are
available at https://github.com/wpwupingwp/novowrap. The software is
released under the AGPL-3.0 license.
Introduction
Plastids (or chloroplasts in green plants) are thought to have
originated from a single endosymbiotic event involving an
alpha-proteobacterium and a cyanobacterium (Keeling 2010). Over the
course of evolution, the genome of the ancestral cyanobacterium shrank
and became the plastid genome, which is approximately 120-160 kb in size
(Green 2011).
A typical plastid genome is a circular, double-stranded DNA molecule
with a quadripartite structure, which consists of two inverted repeat
regions (IRs), a small single-copy region (SSC) and a large single-copy
region (LSC). The plastid genome usually encodes 110-130 genes with high
homology and collinearity among plant species (Wicke et al. 2011).
Owing to their conserved structure, moderate sequence variability (Smith
2015), and high copy number per cell, partial or full-length plastid
genomes have been widely used in plant phylogeny, comparative genomics
and biotechnology (Tonti-Filippini et al. 2017).
The first two plastid genomes were determined in 1986 from Nicotiana
tabacum (Sugiura et al. 1986) and Marchantia polymorpha (Brassell et
al. 1986). To date, there are ~5,000 complete plastid genomes deposited
in GenBank (Sayers et al. 2019), and the number has soared since the
advent of next-generation sequencing (NGS) technologies (Twyford & Ness
2017). Furthermore, several software tools for annotating plastid
genomes have been developed (Huang & Cronk 2015; Qu et al. 2019; Shi et
al. 2019; Tillich et al. 2017). Thus, the biggest remaining obstacle to
plastid genome data acquisition appears to be the assembly step.
The mainstream strategy for determining a plastid genome is to sequence
total DNA, which includes plastid, mitochondrial and nuclear genome
components, using NGS technology. Handling the resulting mixed NGS reads
is therefore a challenge in plastid genome assembly. One solution is to
map all the reads to a closely related reference genome, so that plastid
reads can be filtered and collected for assembly. Another solution is to
construct contigs by de novo assembly and then screen for
plastid-related contigs, which are assembled into a complete genome.
Because all reads are used for assembly, the latter method requires
relatively high computing power that personal computers can hardly
provide. Both methods require manual adjustment of the IR regions to
form a complete plastid genome, and a highly similar reference genome is
sometimes indispensable (Twyford & Ness 2017).
A novel and increasingly popular strategy is to use a universal seed
sequence to bait plastid reads and extend the assembly iteratively
until the full circle is formed. Such a method not only overcomes the
computing burden of processing all the reads, but also obviates the
requirement for a complete genome as a reference (Freudenthal et
al. 2019). Two widely used implementations of this strategy are
NOVOPlasty (Dierckxsens et al. 2016) and GetOrganelle (Jin et
al. 2019). The former hashes all the reads before the extension step and
uses its own assembly algorithm instead of calling SPAdes (Bankevich et
al. 2012), and therefore requires less running time.
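The seed-and-extend idea described above can be illustrated with a minimal Python sketch. It is a toy under strong assumptions (error-free, single-strand reads, every k-mer unique in the genome) and is not the algorithm of NOVOPlasty or GetOrganelle; the function name extend_from_seed and its parameters are hypothetical.

```python
from collections import defaultdict


def extend_from_seed(seed, reads, k=4, max_steps=100000):
    """Greedily extend `seed` to the right, one base at a time.

    Returns (contig, closed); `closed` is True when the extension
    runs back into the first k bases of the seed, i.e. the circle
    is complete. Toy assumptions: error-free reads, one strand only.
    """
    # Hash the reads once: map every k-mer to the bases that follow it.
    followers = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            followers[read[i:i + k]].add(read[i + k])

    start = seed[:k]
    contig = seed
    for _ in range(max_steps):
        candidates = followers.get(contig[-k:], set())
        if len(candidates) != 1:
            # Dead end (no read continues) or ambiguous branch: stop.
            break
        contig += next(iter(candidates))
        if contig[-k:] == start:
            # The tail repeats the seed start; trim the duplicated k-mer.
            return contig[:-k], True
    return contig, False
```

Because only k-mers reachable from the seed are ever visited, reads that share no k-mer with the growing contig (for example, nuclear reads) are never used, which is the intuition behind the reduced computing burden. Real tools additionally handle both strands, sequencing errors, and repeats such as the IR regions.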
Unfortunately, all available tools have only command-line interfaces
and involve complex inputs or settings, which is a challenge for those
who have limited computer skills or knowledge of operating systems
(Attwood et al. 2019). Manual intervention to handle questionable
outputs is also unavoidable. Even NOVOPlasty and GetOrganelle, which
usually generate full-length genome sequences, produce multiple outputs
with opposite directions, different starting sites, alternate
orientations of the LSC/SSC, or occasional misassemblies. Moreover,
although the commonly used rbcL seed for baiting reads usually works
well, it may fail in some cases, such as gene transfer events or poor
data quality. Thus, developing additional seeds could be helpful
(Lim et al. 2018).
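To make concrete why outputs with opposite directions or different starting sites can represent the same circular molecule, the following Python sketch compares two sequences up to rotation and strand. It is an illustration only and does not describe how any of the cited tools work; the helper names are hypothetical, and the alternate LSC/SSC orientations mentioned above are not handled.

```python
# Complement table for unambiguous DNA bases (assumption: no IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def same_circular_molecule(a, b):
    """True if b equals a up to rotation and/or strand."""
    if len(a) != len(b):
        return False
    doubled = a + a  # every rotation of a is a substring of a + a
    return b in doubled or reverse_complement(b) in doubled


def rotate_to(seq, anchor):
    """Rotate circular `seq` so that it starts at `anchor`."""
    doubled = seq + seq
    i = doubled.find(anchor)
    if i == -1:
        raise ValueError("anchor not found in sequence")
    i %= len(seq)
    return doubled[i:i + len(seq)]
```

For example, same_circular_molecule("ACGTTGCAAGGCTATC", "TGCAAGGCTATCACGT") is true because the second string is the first one rotated by four bases. Choosing a fixed anchor and strand for every assembly is one simple way such equivalent outputs could be reduced to a single representation.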
Here, we present NOVOWrap, a user-friendly, cross-platform Python
package for plastid genome assembly. The program works effectively
on a personal computer and generates reliable assembly results with a
standardized structure, with minimal user intervention during the
process. By providing a highly automated solution, the program could
help researchers with limited bioinformatics skills or computing
resources to determine plastid genomes more easily for phylogeny,
genomics, and biotechnology.
Methods