Results
A highly continuous genome assembly of Chinese flowering cabbage (B. rapa var. parachinensis)
A highly inbred line of Chinese flowering cabbage (B. rapa var. parachinensis,
Fig. 1) was used for genome sequencing and assembly with deep-coverage long
reads and Hi-C data. The assembly pipeline for the Brassica rapa var.
parachinensis genome is shown in Fig. 1. DNA samples from a single plant were
prepared for PacBio, Illumina and Hi-C sequencing to avoid potential genome
variability between plants. Overall, we obtained 113 Gb of PacBio and 47.5 Gb
of Illumina raw reads (Table S1), corresponding to 219× and 86× depth of the
estimated genome size (515 Mb), respectively. A preliminary survey of the
genome size, heterozygosity, GC content and transposable element (TE) content
of this inbred line was carried out with 32 Gb of clean Illumina reads
(Table 1; ~83× coverage) using a k-mer-based method (Liu et al. 2013). The
genome size was estimated to be about 515 Mb, with an overall GC content of
38.9% and a TE content of 64.1% (Table S1). Notably, the heterozygosity was
very low (0.16%), which facilitates assembly.
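
The depth figures above follow from dividing the total sequenced bases by the estimated genome size. A minimal sketch of that arithmetic, using the PacBio yield reported above; the function and variable names are illustrative assumptions and not part of the pipeline used in this study:

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Return mean sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Values taken from the text: 515 Mb estimated genome size, 113 Gb PacBio reads.
GENOME_SIZE_BP = 515e6
PACBIO_BASES = 113e9

pacbio_depth = sequencing_depth(PACBIO_BASES, GENOME_SIZE_BP)
print(f"PacBio depth: {pacbio_depth:.0f}x")
```
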
We applied an integrated strategy to assemble the genome. First, the MECAT2
package (C.-L. Xiao et al., 2017) was used to assemble the Chinese flowering
cabbage genome. Second, long reads with a length cutoff of 10 kb were polished
using NGS short reads with Pilon (Walker et al., 2014). Finally, we obtained a
contig assembly of 384 Mb with a contig N50 of 7.2 Mb. The assembly contained
450 contigs, the longest of which was 19.9 Mb (Table 1). The GC content of the
contigs was 37.6% (Table 1). Coverage statistics computed with SAMtools
suggested that the assembly is credible (Table S2). Furthermore, of the 1,440
BUSCO genes, 97.8% were detected as complete and 0.8% as partial, which
supports the completeness of the assembly (Table S3).
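
For readers unfamiliar with the contiguity and composition metrics quoted above, the sketch below shows the standard definitions of N50 and GC content. It is an illustration on toy data only; the function names are assumptions, and it is not the software used to produce Table 1.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc / acgt


# Toy contig lengths (hypothetical, not from this assembly).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
print(gc_content("ATGCGC"))
```

Applied to the real contig set, the same definition would be evaluated over all 450 contig lengths rather than the toy list.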
High-throughput chromatin conformation capture (Hi-C) data were then used to
scaffold the contigs into a chromosome-level assembly. We obtained a total of
66 Gb of cleaned Hi-C paired-end (PE) reads, about 128× depth of the genome.
Of these, 98.27% (434M/442M) were mappable to the current assembly and ~33.18%
(147M/442M) were mapped to different contigs. Using contact frequencies
calculated from the PE reads, 180 contigs were scaffolded into 10
pseudo-chromosomes (Fig. 1A). These 180 contigs represent 87.93%
(338 Mb/384 Mb) of the total assembled sequence and 40% (180/450) of the total
contigs. The final assembly contains 69 scaffolds with a scaffold N50 of
32 Mb, and the longest scaffold is 47.5 Mb (Table 1). The Circos map of the
genome shows that each position is collinear with the other two, indicating
that the annotation is complete (Fig. 1B). A large number of corrected repeat
regions were identified on chromosomes A05 and A06 (Fig. 1C), suggesting that
these chromosomes may contain large regions of DNA transposons and LTR
retrotransposons.
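
The anchoring rates above are reported both per contig and per base. A minimal sketch of how the two fractions can be derived from a table of contig lengths and the set of contigs placed on pseudo-chromosomes; the contig names and lengths below are hypothetical toy values, and the function is an assumption for illustration rather than part of the scaffolding software:

```python
def anchoring_summary(contig_lengths, anchored_ids):
    """Return the fraction of contigs and of assembled bases that are anchored.

    contig_lengths: dict mapping contig name -> length in bp
    anchored_ids:   set of contig names placed on pseudo-chromosomes
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total_bp = sum(contig_lengths.values())
    anchored_bp = sum(contig_lengths[name] for name in anchored_ids)
    return {
        "contig_fraction": len(anchored_ids) / len(contig_lengths),
        "sequence_fraction": anchored_bp / total_bp,
    }


# Hypothetical toy assembly: two long anchored contigs, two short unplaced ones.
toy_contigs = {"ctg1": 500, "ctg2": 300, "ctg3": 150, "ctg4": 50}
summary = anchoring_summary(toy_contigs, {"ctg1", "ctg2"})
print(summary)
```

As in the toy case, a minority of contigs can account for most of the assembled sequence when the anchored contigs are the longest ones.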
We also performed de novo gene prediction, guided by homologs from related
species, short-read transcriptome data and full-length transcripts from
Iso-Seq sequencing from the present study, using the MAKER pipeline (Cantarel
et al., 2008). We annotated 47,598 protein-coding genes in the Chinese
flowering cabbage genome, with an average gene length of 2,060 bp (Table 1).
The average number of exons per gene is 6.13, with a mean exon length of
199 bp (Table 1). Approximately 53.2% of the genome is annotated as repetitive
sequence, which is consistent with the k-mer-based estimate. LTR
retrotransposons (22.26%) and DNA transposons (17.62%) are the most abundant
families (Table S4).
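
The gene-model summary statistics above can be computed from exon coordinates. The sketch below assumes that gene length means the genomic span from the first exon start to the last exon end (introns included) and that coordinates are 1-based and inclusive; both are assumptions for illustration, and the gene models shown are hypothetical.

```python
from statistics import mean


def gene_model_stats(genes):
    """Summarize gene models given as {gene_id: [(exon_start, exon_end), ...]}.

    Coordinates are assumed to be 1-based and inclusive.
    """
    gene_lengths = [
        max(end for _, end in exons) - min(start for start, _ in exons) + 1
        for exons in genes.values()
    ]
    exon_counts = [len(exons) for exons in genes.values()]
    exon_lengths = [
        end - start + 1 for exons in genes.values() for start, end in exons
    ]
    return {
        "mean_gene_length": mean(gene_lengths),
        "mean_exons_per_gene": mean(exon_counts),
        "mean_exon_length": mean(exon_lengths),
    }


# Hypothetical toy gene models: one two-exon gene and one single-exon gene.
toy_genes = {
    "g1": [(1, 100), (201, 300)],
    "g2": [(1000, 1199)],
}
stats = gene_model_stats(toy_genes)
print(stats)
```
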
In conclusion, we provide what is, to our knowledge, the most contiguous
assembly to date and the first chromosome-level genome assembly of this
species.