Benchmark with sequencing data
Ten species were used to benchmark the usability of the software and the
applicability of the seeds. Species representing gymnosperms, basal
angiosperms, monocots, and eudicots were sequenced. Pycnostachys
reticulate and Nepeta stewartiana are from newly sequenced genus
and another four are newly sequenced species. Additionally, in order to
test the ability of the program to recognize DR structure and to expand
its application to wider taxonomy, Selaginella sanguinolenta , a
species of lycopods, was also used.
Purified DNA was acquired from the Plant DNA Bank of China. Sequencing
was carried out at the Huada Genomics Institute (BGI, Shenzhen, China).
Approximately 50 GB of PE-100 sequencing data per sample were collected.
In order to reduce the memory requirement, only 10,000,000 reads of each
sample were used for assembly. NOVOWrap v0.95 was run on a PC with
Microsoft Windows 10 operating system and 32 GB memory to assemble all
data using default options. The reference genomes of each sample were
automatically chosen by the program. The output genomes were
crossverified based on the results of the collinearity analysis. For
further verification, the assemblies of four species, Cycas
debaoensis , Brasenia schreberi , Oryza alta andFirmiana simplex, were aligned to existing plastid genomes
deposited in GenBank using MAFFT v7.409 (Katoh & Standley 2013).
The assembled sequences were then annotated using Plann (Huang & Cronk
2015) and GeSeq (Tillich et al. 2017). The start and stop codon of genes
were manually corrected with Unipro UGENE (Okonechnikov et al.2012) if necessary. The annotation results were finally visualized with
OGDRAW (Greiner et al. 2019).
The assembled results are listed in Table 2. All samples were
successfully assembled using NOVOWrap v0.95. The memory usage of the
program stabilized at ~12 GB and the time cost ranged
between 6 and 20 minutes. For the candidate seeds, two samples failed
with the frequently used rbcL but succeeded with other seeds.
While rbcL , psaB and psaC have similar success
rates, rrn23 only succeeded on Nepeta stewartiana , one of
the eleven samples. Each seed generated one or two identical results for
each sample, with one’s structure adjusted if necessary. The vast
majority of the samples have identical assembly results with different
seeds with the exception of Pycnostachys reticulate . In this
case, 140 candidate results were generated, with lengths ranging between
152,466 and 152,520 bp (Supplementary material 2). The genome structures
of all these candidates are relatively conserved compared to the
reference, Ocimum tenuiflorum (species in same family) and have
the same size of SSC (17,671 bp). Their LSC ranged between 83,397 and
83,423 bp and IRs ranged between 25,695 and 25,713 bp. The variations in
size are due to the existence of simple tandem repeats, a difficulty
that NGS alone could hardly solve sometimes.
The Validation module successfully recognized the quadripartite regions
of the sequences, including the DR structure of the lycopod species. The
starting site of each sequence was adjusted according to the respective
references. Inverted LSC and SSC regions were detected and adjusted in
the subsequent collinearity analysis. Most newly determined genomes
displayed high collinearity to the references, except for a few gaps in
the alignment. In addition, the IRs of each genome were similar in size
to the references with no obvious expansion or contraction (Figure 2).
The annotation results further verified the reliability of the assembly
results. No gene loss or pseudogenization occurred in most of the
samples. The only exception is Salvia campanulata , which has a
premature termination in one side of the SSC flanked by IRa causing
truncated ycf1 . Moreover, all the boundaries of the quadripartite
regions determined by the Validation module are consistent with the
results of OGDRAW (Supplementary material 3).
According to the alignment results, the four resequenced genomes are
identical to their references, except for a few one-base insertions or
deletions located in mono nucleotide repeats. This is likely due to
sequencing errors of NGS platforms or intraspecific differences given
that the re-sequenced samples and samples existing in GenBank are from
the same species but different individuals. Annotations demonstrate that
such mutations have no effect on coding sequences.
The Merge module is designed to merge contigs originating from each seed
into one complete plastid genome. For the cases in this study, the Merge
module is skipped because all samples are successfully assembled with
one of the seeds. To test the Merge module, several artificially
generated datasets were used (Supplementary material 4). The result
showed that the Merge module generates a complete plastid genome in most
of the datasets.
Usability
NOVOWrap v0.95 provides both a command-line interface and a graphical
user interface. It is a cross-platform and easy to install. In addition,
it offers portable packages for Linux, Microsoft Windows, and MacOS
operating systems. For the installation, all dependent third-party tools
could be automatically installed if necessary. To further improve
user-friendliness, extensive tests have been conducted to improve the
robustness of the software. Moreover, preventive measures have been set
up for factors that may interrupt the running of the program and cause
inconvenience for users, such as Internet accessibility, the validity of
input data, and the permission to access the work path. The program
requires very few inputs to run and uses universal file formats for both
input and output. Finally, NOVOWrap supports remote access of input
files on the same local area network for convenience.
NOVOWrap is designed to be run on a normal personal computer. The
Validation and Merge modules could finish running in minutes, with
negligible consumption of memory and running time. Although the Assembly
module requires much more resources, based on the feedback from
voluntary testers and our own tests, 16 GB of memory is sufficient to
handle approximately 10,000,000 PE100 reads, 8,000,000 PE150 reads, or
3,000,000 PE200 reads, which is sufficient for obtaining complete
plastid genomes in general cases.
Discussion
The rbcL gene, the most frequently used seed for plastid genome
assembly, usually works efficiently, but may fail occasionally. Hence,
three novel seeds were introduced in this study. The psaB andpsaC functioned as well as rbcL . Although rrn23only succeeded in the assembly of Nepeta stewartiana , it could be
valuable for some extreme cases. Compared with other newly developed
seeds, rrn23 is very conservative in 1,331 evaluated genera. The
possible cause of the lower success rate may be the location difference,
that rrn23 is located in the repeated region, while others are
distributed in single-copy regions. For assembly, starting with IRs
might be harder than LSC/SSC, since the former has extra possible
directions to extend the sequences.
In the benchmark, each sample was assembled with at least one seed.
Although identical sequences derived from different seeds can be used as
supporting evidence for validating the assembly, assembly with one seed
is sufficient for practical use, given that more attempts require much
more time. According to the above, the order of using seeds is optimized
to rbcL first, then psaB , psaC, and rrn23 .
If none of the seeds work, the whole reference plastid genome is used as
a backup seed.
In early time, the plastid genome is considered to have two natural
isomers with SSC having the opposite polarity (Palmer 1983). The author
considered that the inversion is mediated by the IR structure. However,
some recent studies have used inversion as a phylogenetic character
which is thought to be divergent in different lineages. (Walker et
al. 2015). Such misinterpretation has deepened, given the fact that
GenBank does not have a mandatory standard for the direction of the SSC.
In this study, both isoforms were found in the assembly results of
multiple samples. In addition, such structural variations are more
likely to be misassembled owing to the existence of IRs (Jin et al.
2019; Wick et al. 2015). NOVOWrap automatically adjusts the
orientation of SSC and LSC as the default output. In the case of
potential needs, unchanged sequences are also generated.
Recently, several studies on the structural variation of plastid
genomes, including the abnormal structure of the genome, the expansion
and shrinking of IRs, and the recombination of orders of genes and other
variations have been conducted (Weng et al. 2017; Zhu et
al. 2016). However, the inconsistent structure of existing data may
hamper further studies. Not only the orientation of SSC/LSC, but also
other features such as different starting sites, inconsistent order of
quadripartite regions and opposite strands of the whole genome, could
cause problems in downstream analysis, such as annotation, alignment,
collinearity analysis and phylogenetic reconstruction. For instance, the
unadjusted sequence of Salvia campanulata has been reported to
have drastically shortened IRs, with a length of 3,845 bp. However, the
adjusted sequence showed a normal length of IRs, with a length of 25,535
bp (Supplementary material 3).
Although manual adjustment may solve the problem, the lack of automatic
tools and standards makes it laborious. The Validation module of
NOVOWrap ensures the consistency of assembly results automatically,
liberating users from tedious error-prone manual manipulation. It is
expected to accelerate plastid genome assembly and ease data upload to
public databases with all consistent genome structures.
Acknowledgements
Thanks Yixuan Huang for testing and packaging program in MacOS. We are
grateful to numerous volunteers who have tested the program with
different data and various operating systems on their own personal
computers or workstations.
This work is funded by the Strategic Priority Research Program of
Chinese Academy of Sciences (XDA23080204 and XDA19050303).
Author Contributions
Ping Wu wrote the program and the manuscript. Hao Chen tested and
optimized the code and wrote the user manual. Chao Xu collected the
data. Jie Yang analyzed data of lycopods and Xianchun Zhang guided the
analysis of DR structure. Shiliang Zhou designed the study and edited
the final manuscript.
Data Accessibility
The assembly results have been uploaded to GenBank. The source code of
the program and the user manual are available athttps://github.com/wpwupingwp/novowrap.
References
Attwood TK, Blackford S, Brazas MD, Davies A, et al. (2019). A
global perspective on evolving bioinformatics and data science training
needs. Briefings in Bioinformatics , 20, 398-404.
Bankevich A, Nurk S, Antipov D, Gurevich AA, et al. (2012)
SPAdes: a new genome assembly algorithm and its applications to
single-cell sequencing. Journal of Computational Biology, 19 (5),
455-477. doi:10.1089/cmb.2012.0021
Bellot S, Renner SS. (2016). The plastomes of two species in the
endoparasite genus pilostyles (apodanthaceae) each retain just five or
six possibly functional genes. Genome Biol Evol , 8, 189-201.
Blazier CC, Guisinger MM, Jansen RK. (2011). Recent loss of
plastid-encoded ndh genes within Erodium (Geraniaceae). Plant
Molecular Biology , 76, 263-272.
Brassell SC, Eglinton G, Marlowe IT, Pflaumann U, et al. (1986).
Chloroplast gene organization deduced from complete sequence of
liverwort Marchantia polymorpha chloroplast DNA. Nature , 320,
129-133.
Dierckxsens N, Mardulyn P, Smits G. (2016). NOVOPlasty: De novo assembly
of organelle genomes from whole genome data. Nucleic Acids
Research , 45.
Freudenthal JA, Pfaff S, Terhoeven N, Korte A, et al. (2019) The
landscape of chloroplast genome assembly tools. bioRxiv .
doi:https://doi.org/10.1101/665869
Green BR (2011) Chloroplast genomes of photosynthetic eukaryotes.Plant Journal, 66 , 34-44. doi:10.1111/j.1365-313X.2011.04541.x
Greiner S, Lehwark P, Bock R (2019) OrganellarGenomeDRAW (OGDRAW)
version 1.3.1: expanded toolkit for the graphical visualization of
organellar genomes. Nucleic Acids Research, 47 (W1), W59-W64.
doi:10.1093/nar/gkz238
Huang DI, Cronk QC (2015) Plann: A command-line application for
annotating plastome sequences. Applications in Plant Sciences,
3 (8). doi:10.3732/apps.1500026
Jin J, Yu W, Yang J, Song Y, et al. (2019) GetOrganelle: a fast
and versatile toolkit for accurate de novo assembly of organelle
genomes. bioRxiv . doi:https://doi.org/10.1101/256479
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software
version 7: improvements in performance and usability. Molecular
Biology and Evolution, 30 (4), 772-780. doi:10.1093/molbev/mst010
Keeling PJ (2010) The endosymbiotic origin, diversification and fate of
plastids. Philosophical transactions of the Royal Society of
London. Series B, Biological sciences, 365 , 729-748.
doi:10.1098/rstb.2009.0103
Li Z, Chen Y, Mu D, Yuan J, et al. (2012). Comparison of the two
major classes of assembly algorithms: Overlap-layout-consensus and
de-bruijn-graph. Briefings in Functional Genomics , 11, 25-37.
Lim CE, Kim GB, Ryu SA, Yu HJ, et al. (2018). The complete
chloroplast genome of Artemisia hallaisanensis Nakai (Asteraceae), an
endemic medicinal herb in Korea. Mitochondrial DNA Part B:
Resources , 3, 359-360.
Madden TL, Busby B, Ye J. (2019). Reply to the paper : Misunderstood
parameters of NCBI BLAST impacts the correctness of bioinformatics
workflows.1-2.
Nei M. (1987). Molecular Evolutionary Genetics : Columbia
University Press.
Okonechnikov K, Golosova O, Fursov M, team U (2012) Unipro UGENE: a
unified bioinformatics toolkit. Bioinformatics, 28 (8), 1166-1167.
doi:10.1093/bioinformatics/bts091
Palmer JD. (1983). Chloroplast DNA exists in two orientations.Nature , 301, 92-93.
Pruitt KD, Tatusova T, Brown GR, Maglott DR. (2012). NCBI Reference
Sequences (RefSeq): Current status, new features and genome annotation
policy. Nucleic Acids Research , 40, 130-135.
Qu XJ, Moore MJ, Li DZ, Yi TS. (2019). PGA: A software package for
rapid, accurate, and flexible batch annotation of plastomes. Plant
Methods , 15, 1-12.
Sayers EW, Cavanaugh M, Clark K, Ostell J, et al. (2019) GenBank.Nucleic Acids Research, 47 (D1), D94-D99. doi:10.1093/nar/gky989
Shi L, Chen H, Jiang M, Wang L, et al. (2019). CPGAVAS2, an
integrated plastome sequence annotator and analyzer. Nucleic Acids
Research , 47, W65-W73.
Smith DR (2015) Mutation rates in plastid genomes: They are lower than
you might think. Genome Biol Evol, 7 , 1227-1234.
doi:10.1093/gbe/evv069
Spellerberg IF, Fedor PJ. (2003). A tribute to Claude-Shannon
(1916-2001) and a plea for more rigorous use of species richness,
species diversity and the ’Shannon-Wiener’ Index. Global Ecology
and Biogeography , 12, 177-179.
Sugiura M, Shinozaki K, Zaita N, Kusuda M, et al. (1986). Clone
bank of the tobacco (Nicotiana tabacum) chloroplast genome as a set of
overlapping restriction endonuclease fragments: mapping of eleven
ribosomal protein genes. Plant Science , 44, 211-217.
Tillich M, Lehwark P, Pellizzer T, Ulbricht-Jones ES, et al.(2017). GeSeq - Versatile and accurate annotation of organelle genomes.Nucleic Acids Research , 45, W6-W11.
Tonti-Filippini J, Nevill PG, Dixon K, Small I (2017) What can we do
with 1000 plastid genomes? Plant Journal, 90 , 808-818.
doi:10.1111/tpj.13491
Twyford AD, Ness RW. (2017). Strategies for complete plastid genome
sequencing. Molecular Ecology Resources , 17, 858-868.
Walker JF, Jansen RK, Zanis MJ, Emery NC. (2015). Sources of inversion
variation in the small single copy (SSC) region of chloroplast genomes.American Journal of Botany , 102, 1751-1752.
Weng ML, Ruhlman TA, Jansen RK. (2017). Expansion of inverted repeat
does not decrease substitution rates in Pelargonium plastid genomes.New Phytologist , 214, 842-851.
Wick RR, Schultz MB, Zobel J, Holt KE. (2015). Bandage: Interactive
visualization of de novo genome assemblies. Bioinformatics , 31,
3350-3352.
Wicke S, Schneeweiss GM, dePamphilis CW, Müller KF, et al. (2011)
The evolution of the plastid chromosome in land plants: Gene content,
gene order, gene function. Plant Molecular Biology, 76 , 273-297.
doi:10.1007/s11103-011-9762-4
Zhu A, Guo W, Gupta S, Fan W, et al. (2016). Evolutionary
dynamics of the plastid inverted repeat: The effects of expansion,
contraction, and loss on substitution rates. New Phytologist ,
209, 1747-1756.
Figure legends
Figure 1. Schematic diagram of the Rotate algorithm. In order
to avoid possible interference of truncated region at the starting site,
the original sequence was firstly duplicated to form a full-length
plastid genome in the middle of the sequence. With the self-to-self
comparison with BLAST, the longest repeated region with the highest
number should be the IRs. Subsequently, according to the location of
IRs, the boundaries of LSC and SSC were determined. Finally, full-length
plastid genomes were extracted from the extended sequences and the
collinearity analysis was performed.
Figure 2. Collinearity analysis of the assembly results. A) plastid
genomes with DR structure; B) assembly has inverted LSC compared with
the reference; C) normal comparison result; D) inverted SSC. Red strips
represent the collinearity between the reference and the plus strand of
the assembly, and green strips represent the minus strand. Blank strips
indicate the mismatch of sequences. Crossed strips could be the matches
between IRs in both strands, or indicate the presence of inverted
region, such as B.
Figure