Development of new seeds for assembly
To determine the seeds used for assembly, BarcodeFinder
(https://github.com/wpwupingwp/BarcodeFinder) was used to screen plastid
genome sequences of diverse taxa. Briefly, 5,254 plastid genomes of
green plants were downloaded from the NCBI RefSeq database (Pruitt et
al. 2012). To remove redundant data and reduce computation time, only
one genome per genus was retained, leaving 1,906 genomes. The sequences
of each locus were then extracted into separate files according to the
genome annotations. After the sequences of each locus were aligned
using MAFFT (Katoh & Standley 2013), the sequence polymorphism of each
locus was assessed (Nei 1987; Spellerberg & Fedor 2003). Finally,
the seeds were selected according to the following five principles:
(1) the seed sequence is highly conserved; (2) the seed locus should be
present and not pseudogenized in most taxa; (3) the seed locus should
not have been commonly transferred to the mitochondrial or nuclear
genome; (4) the seed candidates should be located far enough from each
other in the plastid genome (for instance, in different regions of the
quadripartite structure) that, if insufficient data coverage or
notorious repeats hinder the start of assembly from one seed,
alternative seeds can be used instead; and (5) since rbcL has been
recommended and widely used, it is retained regardless of the other
principles.
Ultimately, rbcL, psaB, psaC and rrn23 were selected as candidate
seeds (Table 1). Information on the other loci is provided in
Supplementary material 1.
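
The screening steps described above (one genome per genus, then a per-locus polymorphism estimate on the alignment) can be illustrated with a minimal Python sketch. This is not BarcodeFinder's code: the function names are hypothetical, the genus is assumed to be the first word of the organism name, nucleotide diversity is computed as the mean pairwise proportion of differing sites, and the Shannon index is assumed to treat each distinct aligned sequence as one type.

```python
import math
from collections import Counter
from itertools import combinations


def keep_one_per_genus(records):
    """Keep the first record seen for each genus.

    `records` is an iterable of (organism_name, sequence) pairs.
    Assumption: the genus is the first word of the organism name.
    """
    kept = {}
    for organism, seq in records:
        genus = organism.split()[0]
        kept.setdefault(genus, (organism, seq))
    return list(kept.values())


def nucleotide_diversity(aligned):
    """Mean pairwise proportion of differing sites across aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped
    for that pair (an assumption of this sketch).
    """
    total = 0.0
    n_pairs = 0
    for a, b in combinations(aligned, 2):
        compared = 0
        diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in "ACGT" and y in "ACGT":
                compared += 1
                diffs += int(x != y)
        if compared:
            total += diffs / compared
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def shannon_index(aligned):
    """Shannon index H = -sum(p_i * ln p_i) over distinct aligned sequences."""
    counts = Counter(seq.upper() for seq in aligned)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Toy data, for illustration only.
genomes = [("Oryza sativa", "ACGT"),
           ("Oryza rufipogon", "ACGA"),
           ("Zea mays", "ACGT")]
representatives = keep_one_per_genus(genomes)
alignment = ["ACGT", "ACGA", "ACGT"]
pi = nucleotide_diversity(alignment)
h = shannon_index(alignment)
```

Under these assumptions, a locus with low values of both statistics across the retained genomes would be a candidate for principle (1); the remaining principles require annotation and positional information that this sketch does not model.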