3.1 Genome assembly and annotation

A total of 22 Gb of Illumina short reads (100×) and 73 Gb of Oxford Nanopore Technologies (ONT) long reads (100×) were generated for S. tetraptera (Table S2). The contig-level assembly of S. tetraptera was 943 Mb in length (covering 96.03% of the estimated genome size), with 199 contigs and a contig N50 of 4.9 Mb (Table S3). Using 121 Gb (100×) of Hi-C data, we further anchored 95.76% of the assembly (903 Mb) onto six pseudochromosomes (Figure 1b, Table S4, Figure S4). The accuracy and completeness of the genome assembly were assessed in three ways: (1) 98.90% of NGS reads could be mapped to the assembly (Table S5); (2) 94%-98% of assembled transcripts could be mapped over more than 50% of their length (Table S6); and (3) 96.50% (1326 out of 1375) of the Benchmarking Universal Single-Copy Orthologs (BUSCO) were fully present in the assembly (Table S7). These results indicated that the S. tetraptera assembly was reliable, with high completeness, continuity, and accuracy.
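For readers unfamiliar with the contiguity metric reported above, the following is a minimal sketch of how a contig N50 and an anchoring rate can be computed. The function names `n50` and `anchored_percent` and the toy contig lengths are illustrative assumptions; they are not the software or data used for the S. tetraptera assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L
    together contain at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def anchored_percent(anchored_length, assembly_length):
    """Percentage of the assembly placed onto pseudochromosomes."""
    return 100 * anchored_length / assembly_length


# Toy contig lengths in Mb (hypothetical, for illustration only).
toy_contigs = [8, 6, 5, 3, 2, 1]
print(n50(toy_contigs))  # 6: the contigs of 8 and 6 Mb already cover half of 25 Mb

# Anchoring rate from the rounded lengths reported in the text (903 Mb of 943 Mb).
print(round(anchored_percent(903, 943), 2))
```

Because the lengths in the text are rounded to the nearest Mb, the second calculation is only a consistency check against the reported 95.76%.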
In total, 70.88% of the S. tetraptera genome was identified as repetitive sequence, consisting of 69.55% interspersed repeats and 1.33% tandem repeats (Table S8). Long terminal repeat (LTR) retrotransposons occupied the greatest proportion (47.51%), including 35.33% Gypsy elements and 11.70% Copia elements (Table S8). In addition, a total of 31,359 protein-coding genes were predicted in the genome, with an average gene length of 3,297 bp, an average exon length of 224 bp, an average of 5.5 exons per gene, an average intron length of 458 bp, and a GC content similar to that of previously reported Gentianales genomes (Figures S5 and S6, Table S9). Among all predicted protein-coding genes, 96.29% were functionally annotated by at least one of the following databases or tools: Swiss-Prot, TrEMBL, InterPro, GO, KEGG, eggNOG-mapper, or NR (Table S10).
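The gene-structure averages above (gene length, exon length, exons per gene, intron length) can be derived from gene-model coordinates. The sketch below shows one way to do this, assuming 1-based inclusive coordinates as in GFF3. The `toy_genes` dictionary and the function `gene_structure_stats` are hypothetical and are not taken from the annotation pipeline used in this study.

```python
from statistics import mean

# Hypothetical toy gene models:
# gene ID -> (gene_start, gene_end, [(exon_start, exon_end), ...])
toy_genes = {
    "geneA": (1, 1000, [(1, 200), (401, 600), (801, 1000)]),
    "geneB": (5000, 5500, [(5000, 5500)]),
}


def span(start, end):
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


def gene_structure_stats(models):
    """Summarise gene, exon, and intron lengths across gene models."""
    gene_lengths, exon_lengths, exon_counts, intron_lengths = [], [], [], []
    for gene_start, gene_end, exons in models.values():
        gene_lengths.append(span(gene_start, gene_end))
        exon_counts.append(len(exons))
        ordered = sorted(exons)
        exon_lengths.extend(span(s, e) for s, e in ordered)
        # An intron is the gap between two consecutive exons.
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
            intron_lengths.append(next_start - prev_end - 1)
    return {
        "mean_gene_length": mean(gene_lengths),
        "mean_exon_length": mean(exon_lengths),
        "mean_exons_per_gene": mean(exon_counts),
        # Single-exon genes contribute no introns; guard against an empty list.
        "mean_intron_length": mean(intron_lengths) if intron_lengths else None,
    }


stats = gene_structure_stats(toy_genes)
print(stats)
```

In this sketch the exon and intron averages are taken over all exons and introns rather than per gene; other conventions exist, and the text does not state which was used.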