3.1 Genome assembly and annotation
A total of 22 Gb of Illumina short reads (100×) and 73 Gb of Oxford
Nanopore Technologies (ONT) long reads (100×) were generated for S. tetraptera (Table S2). The contig-level assembly of S. tetraptera was 943 Mb in
length (covering 96.03% of the estimated size), with 199 contigs and a
contig N50 length of 4.9 Mb (Table S3). Using 121 Gb (100×) of Hi-C data,
we further anchored 95.76% of the assembly (903 Mb) onto six
pseudochromosomes (Figure 1b, Table S4, Figure S4). The accuracy and
completeness of the genome assembly were assessed according to the
following: (1) 98.90% of NGS reads could be mapped to the assembly
(Table S5); (2) 94%-98% of assembled transcripts could be mapped over
more than 50% of their length (Table S6); and (3) 96.50% (1326 out of 1375)
of Benchmarking Universal Single-Copy Orthologs (BUSCO) were fully present
in the assembly (Table S7). These results indicated that the assembly of S. tetraptera was reliable, with high completeness, continuity,
and accuracy.
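The contig N50 and the percentages quoted above can be reproduced with a few lines of Python. The sketch below is illustrative only: the function name `n50` and the toy contig lengths are assumptions made for this example, not data from the assembly, and the percentage checks use the rounded Mb values quoted in the text, so they can only agree to within rounding.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (arbitrary units), not real assembly data.
toy_contigs = [8, 6, 4, 3, 2, 1]
toy_n50 = n50(toy_contigs)

# Consistency checks using the rounded figures quoted in the text.
anchored_pct = round(100 * 903 / 943, 2)   # share of the assembly anchored by Hi-C
estimated_size_mb = round(943 / 0.9603)    # genome size implied by 96.03% coverage

print(toy_n50, anchored_pct, estimated_size_mb)
```

For the real assembly, the same function would be applied to the lengths of the 199 contigs, for example as read from a FASTA index.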
In total, 70.88% of the S. tetraptera genome was identified as
repetitive sequences, consisting of 69.55% interspersed repeats and
1.33% tandem repeats (Table S8).
Long terminal repeats (LTRs) occupied the greatest proportion (47.51%),
including 35.33% Gypsy and 11.70% Copia elements (Table S8). In addition, a total of
31,359 protein-coding genes were
predicted in the genome, with an average gene length of 3,297 bp, an
average exon length of 224 bp, an average of 5.5 exons per
gene, an average intron length of 458 bp, and a GC content similar to that of
other previously reported Gentianales genomes (Figures S5-6, Table S9).
Among all predicted protein-coding genes, 96.29% were functionally
annotated by at least one database: Swiss-Prot, TrEMBL, InterPro, GO,
KEGG, eggNOG-mapper, or NR (Table S10).
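As a rough plausibility check, the repeat and gene-structure statistics reported above can be compared against one another. The short Python sketch below assumes that a gene with n exons has n - 1 introns and that the average gene length is approximately the sum of its exonic and intronic parts. Products of averages are not averages of products, so only approximate agreement should be expected. All variable names are illustrative.

```python
# Figures as reported in the text.
exons_per_gene = 5.5
mean_exon_bp = 224
mean_intron_bp = 458
reported_gene_bp = 3297

# Assumption: n exons imply n - 1 introns, and gene length ~ exons + introns.
implied_exonic_bp = exons_per_gene * mean_exon_bp
implied_intronic_bp = (exons_per_gene - 1) * mean_intron_bp
implied_gene_bp = implied_exonic_bp + implied_intronic_bp
relative_gap = abs(implied_gene_bp - reported_gene_bp) / reported_gene_bp

# Repeat classes: interspersed + tandem should give the total repeat content,
# and Gypsy + Copia should not exceed the LTR total (47.51%).
repeat_total_pct = round(69.55 + 1.33, 2)
ltr_named_pct = round(35.33 + 11.70, 2)

print(implied_gene_bp, f"{relative_gap:.2%}", repeat_total_pct, ltr_named_pct)
```

Under these assumptions the implied gene length differs from the reported average by well under 1%, and the repeat percentages are internally consistent.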