Dataset preparation
The newly determined chloroplast genomes were combined with 37 chloroplast genomes (together with chloroplast fragments of three species) downloaded from GenBank (Table S3), aligned using mafft-win (Katoh & Standley, 2013), and adjusted manually using Se-Al. Species delimitation, resolution comparison, and seed identification were performed with corresponding datasets using phylogenetic methods.
Dataset 1 contained 58 chloroplast genomes, representing all rice species (1–3 per species), together with three Leersia species as outgroups. Maximum parsimony analyses were carried out to identify and exclude mislabeled genomes (wrong systematic positions) or genomes of relatively low quality (longer branch lengths). This dataset was used to delimit the circumscription of species together with dataset 6 and a super barcode of Oryza.
Dataset 2 (matK), dataset 3 (rbcL), dataset 4 (psbA-trnH), and dataset 5 (ITS) represented conventional DNA barcodes. The psbA-trnH sequence is interrupted by rps19 in Poaceae. Dataset 6 represented the concatenation of two single-copy nuclear genes (N78 and R22) selected from 142 genes (Zou et al., 2008). The datasets were analyzed using phylogenetic methods to test the resolution of these candidate DNA barcodes. Dataset 7 was formed by the concatenation of six rice-specific chloroplast DNA barcodes identified in this study. This dataset was analyzed using phylogenetic methods for reliable species identification of rice seeds.
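Concatenation of aligned loci, as used for datasets 6 and 7, can be illustrated with a minimal Python sketch. The function name, the taxon labels, the toy sequences, and the use of "?" to pad taxa missing from a locus are assumptions made for illustration; they are not taken from the study.

```python
from typing import Dict, List


def concatenate_alignments(loci: List[Dict[str, str]], missing: str = "?") -> Dict[str, str]:
    """Join per-locus alignments end to end, padding absent taxa with `missing`."""
    # Collect every taxon that appears in at least one locus.
    taxa = sorted({name for locus in loci for name in locus})
    parts: Dict[str, List[str]] = {name: [] for name in taxa}
    for locus in loci:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in a locus must have the same aligned length")
        width = lengths.pop()
        for name in taxa:
            # A taxon without this locus receives a block of missing-data symbols.
            parts[name].append(locus.get(name, missing * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy alignments standing in for two loci (hypothetical data).
locus_a = {"O_sativa": "ACGT", "O_rufipogon": "ACGA", "L_perrieri": "AC-T"}
locus_b = {"O_sativa": "GGC", "L_perrieri": "GAC"}
combined = concatenate_alignments([locus_a, locus_b])
```

Padding with missing data keeps every row the same length, which phylogenetic programs require of a concatenated matrix.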
Phylogenetic analyses
Maximum parsimony.
Maximum parsimony analysis was executed using PAUP version 4.0a150 (Swofford, 2003). The tree search used a heuristic strategy with 100 random stepwise-addition replicates and tree bisection and reconnection (TBR) branch swapping, saving multiple trees but retaining no more than two trees with a score ≥5 from each replicate. Branch support for the maximum parsimony trees was assessed with 1000 bootstrap replicates. The trees were rooted using Leersia species as outgroups.
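The optimality criterion behind such a search is the tree length, that is, the minimum number of character-state changes a tree requires. The sketch below scores a fixed rooted binary tree with Fitch's algorithm for unordered characters. It is a simplified illustration, not PAUP's implementation: the nested-tuple tree encoding, the function names, and the toy data are assumptions, and gaps or ambiguity codes would be treated here as ordinary states.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return the Fitch state set and the number of changes for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the Fitch cost over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Toy data (hypothetical): taxa A and B share states, as do C and D.
toy = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
length_good = parsimony_length((("A", "B"), ("C", "D")), toy)
length_poor = parsimony_length((("A", "C"), ("B", "D")), toy)
```

In this toy case the tree grouping A with B needs two changes and the alternative needs four, so a parsimony search would prefer the former.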
Maximum likelihood.
Maximum likelihood analyses were performed using RAxML (Stamatakis, 2014) with the GTR+I+G model. Branch support for the ML trees was assessed with 1000 bootstrap replicates. The trees were rooted using Leersia species as outgroups.
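Bootstrap support is based on re-analysing pseudo-replicate alignments. The following sketch shows standard nonparametric resampling of alignment columns with replacement; whether the programs used here apply exactly this scheme or a faster variant is not stated in the text, so this is an assumption. The function name, the fixed seed, and the toy alignment are illustrative only.

```python
import random
from typing import Dict


def bootstrap_alignment(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Draw as many columns as the alignment has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every taxon so that sites stay intact.
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Toy alignment (hypothetical data).
toy_alignment = {"A": "ACGTAC", "B": "ACGAAC", "C": "TCGAAT"}
rng = random.Random(42)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(1000)]
```

Each replicate would then be analysed like the original alignment, and the support for a branch is the percentage of replicate trees that contain it.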
Bayesian inference.
ModelFinder (Kalyaanamoorthy, Minh, Wong, Von Haeseler, & Jermiin, 2017) selected GTR+I+G and Blosum+I+G as the best-fit substitution models for dataset 1 and dataset 6, respectively. Bayesian inference was performed with MrBayes 3.2 (Fredrik et al., 2012) as integrated in PhyloSuite (Zhang et al., 2020). The Markov chain Monte Carlo process was run for 2,000,000 generations with 2 × 4 chains, and trees were sampled every 100 generations. Stationarity was considered to have been reached when the average standard deviation of split frequencies remained <0.01. The first 25% of samples were discarded as burn-in. The MrBayes output was summarized in PhyloSuite, and the consensus trees were rooted using Leersia species as outgroups.
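With sampling every 100 generations, 2,000,000 generations yield about 20,000 samples per run, of which roughly the first 5,000 fall within a 25% burn-in. The convergence diagnostic can be sketched as follows, assuming that "2 × 4 chains" denotes two independent runs. Each sampled tree is represented as a set of splits (taxon bipartitions). The function names and toy samples are hypothetical, and the 0.1 minimum split frequency and the use of the sample standard deviation are assumptions about how the statistic is computed, not details given in the text.

```python
import math
from typing import Dict, FrozenSet, List, Set

Split = FrozenSet[str]


def discard_burnin(samples: List[Set[Split]], fraction: float = 0.25) -> List[Set[Split]]:
    """Drop the first `fraction` of the sampled trees."""
    return samples[int(len(samples) * fraction):]


def split_frequencies(trees: List[Set[Split]]) -> Dict[Split, float]:
    """Fraction of sampled trees in which each split occurs."""
    counts: Dict[Split, int] = {}
    for splits in trees:
        for split in splits:
            counts[split] = counts.get(split, 0) + 1
    return {split: n / len(trees) for split, n in counts.items()}


def asdsf(run_freqs: List[Dict[Split, float]], min_freq: float = 0.1) -> float:
    """Average, over splits, of the standard deviation of their frequencies across runs."""
    if len(run_freqs) < 2:
        raise ValueError("at least two runs are needed")
    all_splits = set().union(*run_freqs)
    deviations = []
    for split in all_splits:
        freqs = [run.get(split, 0.0) for run in run_freqs]
        if max(freqs) < min_freq:
            continue  # ignore splits that are rare in every run (assumed cutoff)
        mean = sum(freqs) / len(freqs)
        variance = sum((f - mean) ** 2 for f in freqs) / (len(freqs) - 1)
        deviations.append(math.sqrt(variance))
    return sum(deviations) / len(deviations) if deviations else 0.0


# Toy samples from two runs (hypothetical): each tree is reduced to one split.
ab, ac = frozenset({"A", "B"}), frozenset({"A", "C"})
run1 = discard_burnin([{ac}, {ab}, {ab}, {ab}])
run2 = discard_burnin([{ac}, {ab}, {ab}, {ac}])
value = asdsf([split_frequencies(run1), split_frequencies(run2)])
```

In the toy case the two runs disagree strongly, so the statistic is far above 0.01; values below that threshold indicate that independent runs are sampling very similar tree distributions.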