Development of new seeds for assembly
To determine the seeds used for assembly, BarcodeFinder (https://github.com/wpwupingwp/BarcodeFinder) was used to screen plastid genome sequences of diverse taxa. Briefly, 5,254 plastid genomes of green plants were downloaded from the NCBI RefSeq database (Pruitt et al. 2012). To remove redundant data and reduce computation time, only one genome per genus was used, leaving 1,906 genomes. The sequences of each locus were then extracted into separate files according to the genome annotations. After the sequences of each locus were aligned with MAFFT (Katoh & Standley 2013), the sequence polymorphism of each locus was assessed (Nei 1987; Spellerberg & Fedor 2003). Finally, the seeds were chosen according to the following five principles:
(1) the seed sequence is highly conserved; (2) the seed locus is present and not pseudogenized in most taxa; (3) the seed locus is not commonly transferred to the mitochondrial or nuclear genome; (4) the seed candidates are located far enough apart in the plastid genome (for instance, in different regions of the quadripartite structure) that, if insufficient coverage or problematic repeats hinder the start of assembly from one seed, an alternative seed can be used; and (5) since rbcL has been recommended and widely used, it is retained regardless of the other principles.
Ultimately, rbcL, psaB, psaC and rrn23 were selected as candidate seeds (Table 1). Information on the other loci is provided in Supplementary material 1.
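The screening steps described above can be illustrated with a minimal Python sketch. It is not BarcodeFinder's own code. It assumes that the two cited polymorphism measures are nucleotide diversity (Nei 1987) and the Shannon index (Spellerberg & Fedor 2003), that the genus is the first word of the organism name, and that the first genome encountered for a genus is the one kept. The function names (one_per_genus, nucleotide_diversity, mean_shannon_index) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations

VALID = set("ACGT")  # gaps and ambiguity codes are ignored


def one_per_genus(records):
    """Keep one (organism, sequence) record per genus.

    Assumption: the genus is the first word of the organism name and the
    first record seen for a genus is retained.
    """
    kept = {}
    for organism, sequence in records:
        words = organism.split()
        if words:
            kept.setdefault(words[0], (organism, sequence))
    return list(kept.values())


def nucleotide_diversity(aligned):
    """Mean pairwise proportion of differing sites among aligned sequences."""
    total, pairs = 0.0, 0
    for a, b in combinations(aligned, 2):
        compared = diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in VALID and y in VALID:
                compared += 1
                diffs += x != y
        if compared:
            total += diffs / compared
            pairs += 1
    return total / pairs if pairs else 0.0


def mean_shannon_index(aligned):
    """Shannon index (natural log) averaged over alignment columns."""
    entropies = []
    for column in zip(*aligned):
        counts = Counter(c.upper() for c in column if c.upper() in VALID)
        n = sum(counts.values())
        if n == 0:
            continue  # column holds only gaps or ambiguous bases
        entropies.append(-sum((k / n) * math.log(k / n) for k in counts.values()))
    return sum(entropies) / len(entropies) if entropies else 0.0


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACGT"]  # toy data, not from the study
    print(nucleotide_diversity(toy_alignment))
    print(mean_shannon_index(toy_alignment))
```

Under these assumptions, lower values of both measures indicate a more conserved locus, which corresponds to principle (1). The pairwise loop scales quadratically with the number of sequences, so a vectorized implementation would be preferable for alignments of about 1,906 sequences.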