Benchmark with sequencing data
Ten species were used to benchmark the usability of the software and the applicability of the seeds. Species representing gymnosperms, basal angiosperms, monocots, and eudicots were sequenced. Pycnostachys reticulate and Nepeta stewartiana are from newly sequenced genus and another four are newly sequenced species. Additionally, in order to test the ability of the program to recognize DR structure and to expand its application to wider taxonomy, Selaginella sanguinolenta , a species of lycopods, was also used.
Purified DNA was acquired from the Plant DNA Bank of China. Sequencing was carried out at the Huada Genomics Institute (BGI, Shenzhen, China). Approximately 50 GB of PE-100 sequencing data per sample were collected. In order to reduce the memory requirement, only 10,000,000 reads of each sample were used for assembly. NOVOWrap v0.95 was run on a PC with Microsoft Windows 10 operating system and 32 GB memory to assemble all data using default options. The reference genomes of each sample were automatically chosen by the program. The output genomes were crossverified based on the results of the collinearity analysis. For further verification, the assemblies of four species, Cycas debaoensis , Brasenia schreberi , Oryza alta andFirmiana simplex, were aligned to existing plastid genomes deposited in GenBank using MAFFT v7.409 (Katoh & Standley 2013).
The assembled sequences were then annotated using Plann (Huang & Cronk 2015) and GeSeq (Tillich et al. 2017). The start and stop codon of genes were manually corrected with Unipro UGENE (Okonechnikov et al.2012) if necessary. The annotation results were finally visualized with OGDRAW (Greiner et al. 2019).
The assembled results are listed in Table 2. All samples were successfully assembled using NOVOWrap v0.95. The memory usage of the program stabilized at ~12 GB and the time cost ranged between 6 and 20 minutes. For the candidate seeds, two samples failed with the frequently used rbcL but succeeded with other seeds. While rbcL , psaB and psaC have similar success rates, rrn23 only succeeded on Nepeta stewartiana , one of the eleven samples. Each seed generated one or two identical results for each sample, with one’s structure adjusted if necessary. The vast majority of the samples have identical assembly results with different seeds with the exception of Pycnostachys reticulate . In this case, 140 candidate results were generated, with lengths ranging between 152,466 and 152,520 bp (Supplementary material 2). The genome structures of all these candidates are relatively conserved compared to the reference, Ocimum tenuiflorum (species in same family) and have the same size of SSC (17,671 bp). Their LSC ranged between 83,397 and 83,423 bp and IRs ranged between 25,695 and 25,713 bp. The variations in size are due to the existence of simple tandem repeats, a difficulty that NGS alone could hardly solve sometimes.
The Validation module successfully recognized the quadripartite regions of the sequences, including the DR structure of the lycopod species. The starting site of each sequence was adjusted according to the respective references. Inverted LSC and SSC regions were detected and adjusted in the subsequent collinearity analysis. Most newly determined genomes displayed high collinearity to the references, except for a few gaps in the alignment. In addition, the IRs of each genome were similar in size to the references with no obvious expansion or contraction (Figure 2).
The annotation results further verified the reliability of the assembly results. No gene loss or pseudogenization occurred in most of the samples. The only exception is Salvia campanulata , which has a premature termination in one side of the SSC flanked by IRa causing truncated ycf1 . Moreover, all the boundaries of the quadripartite regions determined by the Validation module are consistent with the results of OGDRAW (Supplementary material 3).
According to the alignment results, the four resequenced genomes are identical to their references, except for a few one-base insertions or deletions located in mono nucleotide repeats. This is likely due to sequencing errors of NGS platforms or intraspecific differences given that the re-sequenced samples and samples existing in GenBank are from the same species but different individuals. Annotations demonstrate that such mutations have no effect on coding sequences.
The Merge module is designed to merge contigs originating from each seed into one complete plastid genome. For the cases in this study, the Merge module is skipped because all samples are successfully assembled with one of the seeds. To test the Merge module, several artificially generated datasets were used (Supplementary material 4). The result showed that the Merge module generates a complete plastid genome in most of the datasets.
Usability
NOVOWrap v0.95 provides both a command-line interface and a graphical user interface. It is a cross-platform and easy to install. In addition, it offers portable packages for Linux, Microsoft Windows, and MacOS operating systems. For the installation, all dependent third-party tools could be automatically installed if necessary. To further improve user-friendliness, extensive tests have been conducted to improve the robustness of the software. Moreover, preventive measures have been set up for factors that may interrupt the running of the program and cause inconvenience for users, such as Internet accessibility, the validity of input data, and the permission to access the work path. The program requires very few inputs to run and uses universal file formats for both input and output. Finally, NOVOWrap supports remote access of input files on the same local area network for convenience.
NOVOWrap is designed to be run on a normal personal computer. The Validation and Merge modules could finish running in minutes, with negligible consumption of memory and running time. Although the Assembly module requires much more resources, based on the feedback from voluntary testers and our own tests, 16 GB of memory is sufficient to handle approximately 10,000,000 PE100 reads, 8,000,000 PE150 reads, or 3,000,000 PE200 reads, which is sufficient for obtaining complete plastid genomes in general cases.
Discussion
The rbcL gene, the most frequently used seed for plastid genome assembly, usually works efficiently, but may fail occasionally. Hence, three novel seeds were introduced in this study. The psaB andpsaC functioned as well as rbcL . Although rrn23only succeeded in the assembly of Nepeta stewartiana , it could be valuable for some extreme cases. Compared with other newly developed seeds, rrn23 is very conservative in 1,331 evaluated genera. The possible cause of the lower success rate may be the location difference, that rrn23 is located in the repeated region, while others are distributed in single-copy regions. For assembly, starting with IRs might be harder than LSC/SSC, since the former has extra possible directions to extend the sequences.
In the benchmark, each sample was assembled with at least one seed. Although identical sequences derived from different seeds can be used as supporting evidence for validating the assembly, assembly with one seed is sufficient for practical use, given that more attempts require much more time. According to the above, the order of using seeds is optimized to rbcL first, then psaB , psaC, and rrn23 . If none of the seeds work, the whole reference plastid genome is used as a backup seed.
In early time, the plastid genome is considered to have two natural isomers with SSC having the opposite polarity (Palmer 1983). The author considered that the inversion is mediated by the IR structure. However, some recent studies have used inversion as a phylogenetic character which is thought to be divergent in different lineages. (Walker et al. 2015). Such misinterpretation has deepened, given the fact that GenBank does not have a mandatory standard for the direction of the SSC. In this study, both isoforms were found in the assembly results of multiple samples. In addition, such structural variations are more likely to be misassembled owing to the existence of IRs (Jin et al. 2019; Wick et al. 2015). NOVOWrap automatically adjusts the orientation of SSC and LSC as the default output. In the case of potential needs, unchanged sequences are also generated.
Recently, several studies on the structural variation of plastid genomes, including the abnormal structure of the genome, the expansion and shrinking of IRs, and the recombination of orders of genes and other variations have been conducted (Weng et al. 2017; Zhu et al. 2016). However, the inconsistent structure of existing data may hamper further studies. Not only the orientation of SSC/LSC, but also other features such as different starting sites, inconsistent order of quadripartite regions and opposite strands of the whole genome, could cause problems in downstream analysis, such as annotation, alignment, collinearity analysis and phylogenetic reconstruction. For instance, the unadjusted sequence of Salvia campanulata has been reported to have drastically shortened IRs, with a length of 3,845 bp. However, the adjusted sequence showed a normal length of IRs, with a length of 25,535 bp (Supplementary material 3).
Although manual adjustment may solve the problem, the lack of automatic tools and standards makes it laborious. The Validation module of NOVOWrap ensures the consistency of assembly results automatically, liberating users from tedious error-prone manual manipulation. It is expected to accelerate plastid genome assembly and ease data upload to public databases with all consistent genome structures.
Acknowledgements
Thanks Yixuan Huang for testing and packaging program in MacOS. We are grateful to numerous volunteers who have tested the program with different data and various operating systems on their own personal computers or workstations.
This work is funded by the Strategic Priority Research Program of Chinese Academy of Sciences (XDA23080204 and XDA19050303).
Author Contributions
Ping Wu wrote the program and the manuscript. Hao Chen tested and optimized the code and wrote the user manual. Chao Xu collected the data. Jie Yang analyzed data of lycopods and Xianchun Zhang guided the analysis of DR structure. Shiliang Zhou designed the study and edited the final manuscript.
Data Accessibility
The assembly results have been uploaded to GenBank. The source code of the program and the user manual are available athttps://github.com/wpwupingwp/novowrap.
References
Attwood TK, Blackford S, Brazas MD, Davies A, et al. (2019). A global perspective on evolving bioinformatics and data science training needs. Briefings in Bioinformatics , 20, 398-404.
Bankevich A, Nurk S, Antipov D, Gurevich AA, et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19 (5), 455-477. doi:10.1089/cmb.2012.0021
Bellot S, Renner SS. (2016). The plastomes of two species in the endoparasite genus pilostyles (apodanthaceae) each retain just five or six possibly functional genes. Genome Biol Evol , 8, 189-201.
Blazier CC, Guisinger MM, Jansen RK. (2011). Recent loss of plastid-encoded ndh genes within Erodium (Geraniaceae). Plant Molecular Biology , 76, 263-272.
Brassell SC, Eglinton G, Marlowe IT, Pflaumann U, et al. (1986). Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature , 320, 129-133.
Dierckxsens N, Mardulyn P, Smits G. (2016). NOVOPlasty: De novo assembly of organelle genomes from whole genome data. Nucleic Acids Research , 45.
Freudenthal JA, Pfaff S, Terhoeven N, Korte A, et al. (2019) The landscape of chloroplast genome assembly tools. bioRxiv . doi:https://doi.org/10.1101/665869
Green BR (2011) Chloroplast genomes of photosynthetic eukaryotes.Plant Journal, 66 , 34-44. doi:10.1111/j.1365-313X.2011.04541.x
Greiner S, Lehwark P, Bock R (2019) OrganellarGenomeDRAW (OGDRAW) version 1.3.1: expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids Research, 47 (W1), W59-W64. doi:10.1093/nar/gkz238
Huang DI, Cronk QC (2015) Plann: A command-line application for annotating plastome sequences. Applications in Plant Sciences, 3 (8). doi:10.3732/apps.1500026
Jin J, Yu W, Yang J, Song Y, et al. (2019) GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. bioRxiv . doi:https://doi.org/10.1101/256479
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30 (4), 772-780. doi:10.1093/molbev/mst010
Keeling PJ (2010) The endosymbiotic origin, diversification and fate of plastids. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 365 , 729-748. doi:10.1098/rstb.2009.0103
Li Z, Chen Y, Mu D, Yuan J, et al. (2012). Comparison of the two major classes of assembly algorithms: Overlap-layout-consensus and de-bruijn-graph. Briefings in Functional Genomics , 11, 25-37.
Lim CE, Kim GB, Ryu SA, Yu HJ, et al. (2018). The complete chloroplast genome of Artemisia hallaisanensis Nakai (Asteraceae), an endemic medicinal herb in Korea. Mitochondrial DNA Part B: Resources , 3, 359-360.
Madden TL, Busby B, Ye J. (2019). Reply to the paper : Misunderstood parameters of NCBI BLAST impacts the correctness of bioinformatics workflows.1-2.
Nei M. (1987). Molecular Evolutionary Genetics : Columbia University Press.
Okonechnikov K, Golosova O, Fursov M, team U (2012) Unipro UGENE: a unified bioinformatics toolkit. Bioinformatics, 28 (8), 1166-1167. doi:10.1093/bioinformatics/bts091
Palmer JD. (1983). Chloroplast DNA exists in two orientations.Nature , 301, 92-93.
Pruitt KD, Tatusova T, Brown GR, Maglott DR. (2012). NCBI Reference Sequences (RefSeq): Current status, new features and genome annotation policy. Nucleic Acids Research , 40, 130-135.
Qu XJ, Moore MJ, Li DZ, Yi TS. (2019). PGA: A software package for rapid, accurate, and flexible batch annotation of plastomes. Plant Methods , 15, 1-12.
Sayers EW, Cavanaugh M, Clark K, Ostell J, et al. (2019) GenBank.Nucleic Acids Research, 47 (D1), D94-D99. doi:10.1093/nar/gky989
Shi L, Chen H, Jiang M, Wang L, et al. (2019). CPGAVAS2, an integrated plastome sequence annotator and analyzer. Nucleic Acids Research , 47, W65-W73.
Smith DR (2015) Mutation rates in plastid genomes: They are lower than you might think. Genome Biol Evol, 7 , 1227-1234. doi:10.1093/gbe/evv069
Spellerberg IF, Fedor PJ. (2003). A tribute to Claude-Shannon (1916-2001) and a plea for more rigorous use of species richness, species diversity and the ’Shannon-Wiener’ Index. Global Ecology and Biogeography , 12, 177-179.
Sugiura M, Shinozaki K, Zaita N, Kusuda M, et al. (1986). Clone bank of the tobacco (Nicotiana tabacum) chloroplast genome as a set of overlapping restriction endonuclease fragments: mapping of eleven ribosomal protein genes. Plant Science , 44, 211-217.
Tillich M, Lehwark P, Pellizzer T, Ulbricht-Jones ES, et al.(2017). GeSeq - Versatile and accurate annotation of organelle genomes.Nucleic Acids Research , 45, W6-W11.
Tonti-Filippini J, Nevill PG, Dixon K, Small I (2017) What can we do with 1000 plastid genomes? Plant Journal, 90 , 808-818. doi:10.1111/tpj.13491
Twyford AD, Ness RW. (2017). Strategies for complete plastid genome sequencing. Molecular Ecology Resources , 17, 858-868.
Walker JF, Jansen RK, Zanis MJ, Emery NC. (2015). Sources of inversion variation in the small single copy (SSC) region of chloroplast genomes.American Journal of Botany , 102, 1751-1752.
Weng ML, Ruhlman TA, Jansen RK. (2017). Expansion of inverted repeat does not decrease substitution rates in Pelargonium plastid genomes.New Phytologist , 214, 842-851.
Wick RR, Schultz MB, Zobel J, Holt KE. (2015). Bandage: Interactive visualization of de novo genome assemblies. Bioinformatics , 31, 3350-3352.
Wicke S, Schneeweiss GM, dePamphilis CW, Müller KF, et al. (2011) The evolution of the plastid chromosome in land plants: Gene content, gene order, gene function. Plant Molecular Biology, 76 , 273-297. doi:10.1007/s11103-011-9762-4
Zhu A, Guo W, Gupta S, Fan W, et al. (2016). Evolutionary dynamics of the plastid inverted repeat: The effects of expansion, contraction, and loss on substitution rates. New Phytologist , 209, 1747-1756.
Figure legends
Figure 1. Schematic diagram of the Rotate algorithm. In order to avoid possible interference of truncated region at the starting site, the original sequence was firstly duplicated to form a full-length plastid genome in the middle of the sequence. With the self-to-self comparison with BLAST, the longest repeated region with the highest number should be the IRs. Subsequently, according to the location of IRs, the boundaries of LSC and SSC were determined. Finally, full-length plastid genomes were extracted from the extended sequences and the collinearity analysis was performed.
Figure 2. Collinearity analysis of the assembly results. A) plastid genomes with DR structure; B) assembly has inverted LSC compared with the reference; C) normal comparison result; D) inverted SSC. Red strips represent the collinearity between the reference and the plus strand of the assembly, and green strips represent the minus strand. Blank strips indicate the mismatch of sequences. Crossed strips could be the matches between IRs in both strands, or indicate the presence of inverted region, such as B.
Figure