Dataset preparation
The newly determined chloroplast genomes were combined with 37
chloroplast genomes (together with chloroplast fragments of three
species) downloaded from GenBank (Table S3), aligned using mafft-win
(Katoh & Standley, 2013), and adjusted manually using Se-Al. Species
delimitation, resolution comparison, and seed identification were
performed with corresponding datasets using phylogenetic methods.
Dataset 1 contained 58 chloroplast genomes, representing all rice
species (1-3 per species), together with three Leersia species as outgroups. Maximum parsimony analyses were
carried out to identify and exclude mislabeled genomes (wrong systematic
positions) or genomes of relatively low quality (longer branch lengths).
This dataset was used to delimit the circumscription of species together
with dataset 6 and a super barcode of Oryza.
Dataset 2 (matK), dataset 3 (rbcL), dataset 4
(psbA-trnH), and dataset 5 (ITS) represented conventional DNA
barcodes. The psbA-trnH sequence is interrupted by rps19 in Poaceae. Dataset 6 represented the concatenation of two single-copy
nuclear genes (N78 and R22) selected from 142 genes (Zou et al., 2008).
The datasets were analyzed using phylogenetic methods to test the
resolution of these candidate DNA barcodes. Dataset 7 was formed by the
concatenation of six rice-specific chloroplast DNA barcodes identified
in this study. This dataset was analyzed using phylogenetic methods for
reliable species identification of rice seeds.
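The concatenation step used for datasets 6 and 7 can be illustrated with a minimal Python sketch. It assumes each locus is already aligned and stored as a mapping from taxon name to sequence. The function name, the toy taxa, and the toy sequences are hypothetical, and padding a missing taxon with gaps is an assumption, since the text does not state how missing loci were handled.

```python
from typing import Dict, List


def concatenate_alignments(loci: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Join per-locus alignments end to end into one supermatrix.

    Taxa absent from a locus are padded with gap characters
    (an assumption, not a choice stated in the source).
    """
    taxa = sorted({name for locus in loci for name in locus})
    matrix = {name: "" for name in taxa}
    for locus in loci:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must share one aligned length")
        width = lengths.pop()
        for name in taxa:
            matrix[name] += locus.get(name, gap * width)
    return matrix


# Hypothetical toy alignments standing in for two loci such as N78 and R22.
locus_a = {"Oryza_sativa": "ACGT", "Leersia_sp": "AC-T"}
locus_b = {"Oryza_sativa": "GGA", "Oryza_rufipogon": "GGC"}
supermatrix = concatenate_alignments([locus_a, locus_b])
```

Every row of the resulting supermatrix has the same length, which is what downstream phylogenetic programs expect of a concatenated alignment.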
Phylogenetic analyses
Maximum parsimony.
Maximum parsimony analysis was executed using PAUP version 4.0a150
(Swofford, 2003). The tree search used a heuristic strategy with 100
random stepwise-addition replicates, tree bisection and reconnection
(TBR) branch swapping, and multiple trees saved, retaining no more than
two trees with scores ≥5 from each replicate. Branch support for the maximum parsimony
trees was assessed with 1000 bootstrap replicates. The trees were rooted
using Leersia species as outgroups.
Maximum likelihood.
Maximum likelihood analyses were performed using RAxML (Stamatakis,
2014) with the GTR+I+G model. Branch support for the ML trees was
assessed with 1000 bootstrap replicates. The trees were rooted using Leersia species as outgroups.
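The bootstrap support reported for the maximum parsimony and maximum likelihood trees rests on resampling alignment columns with replacement. The Python sketch below shows only that resampling step on a hypothetical toy alignment. In the actual analyses PAUP and RAxML perform the resampling, the tree search on each replicate, and the tallying of support internally, so the function name and data here are illustrative assumptions.

```python
import random
from typing import Dict


def bootstrap_alignment(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Return one pseudoreplicate: columns drawn with replacement,
    with the same number of sites as the original alignment."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Hypothetical toy alignment; real input would be one of datasets 1-7.
toy = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGTAC"}
rng = random.Random(42)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(1000)]
```

Each pseudoreplicate would then be analyzed with the same search settings as the original data, and the support for a branch is the percentage of replicate trees that contain it.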
Bayesian inference.
The best-fit substitution models, selected with ModelFinder
(Kalyaanamoorthy, Minh, Wong, Von Haeseler, & Jermiin, 2017), were
GTR+I+G and Blosum+I+G for dataset 1 and dataset 6, respectively.
Bayesian inference was performed with MrBayes 3.2 (Fredrik et al., 2012)
as integrated in PhyloSuite (Zhang et al., 2020). The Markov chain Monte
Carlo process was run for 2,000,000 generations with 2 × 4 chains, and
trees were sampled every 100 generations. Stationarity was considered
achieved when the average standard deviation of split frequencies
remained <0.01. The first 25% of sampled trees were discarded as
burn-in. The MrBayes output was summarized in PhyloSuite, and the
consensus trees were rooted using Leersia species as outgroups.
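The sampling and convergence settings above imply some simple bookkeeping, sketched here in Python. The sketch assumes that "2 × 4 chains" means two independent runs of four chains each and ignores the starting state when counting samples. The `asdsf` function is a simplified illustration and not MrBayes's own code: it uses the population standard deviation and a hypothetical minimum-frequency filter of 0.1, and the split frequencies are made-up toy values.

```python
import math
from typing import Dict, FrozenSet, List

generations = 2_000_000
sample_every = 100
burnin_fraction = 0.25

samples_per_run = generations // sample_every      # sampled trees per run
burnin = int(samples_per_run * burnin_fraction)    # trees discarded per run
kept = samples_per_run - burnin                    # trees retained per run

Split = FrozenSet[str]


def asdsf(runs: List[Dict[Split, float]], min_freq: float = 0.1) -> float:
    """Average, over splits, of the standard deviation of each split's
    frequency across runs (simplified; conventions are assumptions)."""
    splits = {s for run in runs for s, f in run.items() if f >= min_freq}
    if not splits:
        return 0.0
    total = 0.0
    for s in splits:
        freqs = [run.get(s, 0.0) for run in runs]
        mean = sum(freqs) / len(freqs)
        total += math.sqrt(sum((f - mean) ** 2 for f in freqs) / len(freqs))
    return total / len(splits)


# Hypothetical post-burn-in split frequencies from two runs.
run1 = {frozenset({"A", "B"}): 0.98, frozenset({"C", "D"}): 0.60}
run2 = {frozenset({"A", "B"}): 0.97, frozenset({"C", "D"}): 0.61}
value = asdsf([run1, run2])
converged = value < 0.01  # threshold used in the text
```

Under these assumptions each run contributes 20,000 sampled trees, of which 5,000 are discarded and 15,000 are summarized in the consensus.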