3.5 Gene annotation
A combination of ab initio prediction, homology search, and transcript mapping was used to predict the protein-coding genes in the Chinese walnut genome. RNA from eighteen tissues was used to predict gene models (Table S1). The 27,901 predicted protein-coding genes had an average gene length of 5,735 bp, an average coding sequence (CDS) length of 1,226 bp, and an average of 6 exons per gene (Table 1). When we compared Chinese walnut to Arabidopsis based on genome structural features, we found that the distributions of CDS lengths and exon lengths of J. cathayensis were similar to those of A. thaliana; however, the distributions of mRNA lengths and intron lengths of J. cathayensis differed from those of A. thaliana (Table 1; Figure S3). Among the 27,901 predicted genes, 96.1% could be functionally annotated in at least one of seven databases (Table S8). There were 2,014 genes annotated in the Nr database only, 23 genes annotated in InterPro only, 6 genes annotated in KEGG only, and no genes annotated in SwissProt only or COG only (Figure S4). For GC density, the average length was 900 bp and the average GC content was 51.21% (Figure 3b). Gene density throughout the genome was about 11 genes per 100 kb, with 56,553 genes (94.96%) present on chromosomally anchored contigs (Figure 3c); this was equivalent to 307 transcripts per 1 Mb of chromosome (Figure 3d). There were 82 syntenic blocks in the Chinese walnut genome (Figure 3e). The portion of the Chinese walnut genome composed of non-coding RNA was small; it included miRNA, tRNA, rRNA, and snRNA (Table S9). A total of 581 tRNAs, 792 small nuclear RNAs (snRNAs), and 132 microRNAs (miRNAs) were identified (Table S9).
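Summary statistics of the kind reported above (average gene length, average CDS length, and exons per gene) are typically derived from a GFF3 annotation file. The following is a minimal sketch of that calculation, not the pipeline used in this study. It assumes one transcript per gene and that exon and CDS rows point directly to their gene through a `Parent` attribute; the function name `summarize_gff3` and the toy records are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def summarize_gff3(lines):
    """Compute mean gene length, mean CDS length, and mean exons per gene.

    Simplifying assumptions (not from the study): one transcript per gene,
    and exon/CDS rows carry Parent=<gene ID> directly.
    """
    gene_len = {}
    cds_len = defaultdict(int)
    exon_count = defaultdict(int)

    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 comments/directives
        cols = line.rstrip("\n").split("\t")
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        length = end - start + 1  # GFF3 coordinates are 1-based and inclusive

        if ftype == "gene":
            gene_len[attrs["ID"]] = length
        elif ftype == "CDS":
            cds_len[attrs["Parent"]] += length  # CDS length = sum of CDS segments
        elif ftype == "exon":
            exon_count[attrs["Parent"]] += 1

    n = len(gene_len)
    return {
        "genes": n,
        "mean_gene_length": sum(gene_len.values()) / n,
        "mean_cds_length": sum(cds_len.get(g, 0) for g in gene_len) / n,
        "mean_exons_per_gene": sum(exon_count.get(g, 0) for g in gene_len) / n,
    }


# Toy records (hypothetical, not data from the Chinese walnut genome).
toy = [
    "chr1\tsketch\tgene\t1\t1000\t.\t+\t.\tID=g1",
    "chr1\tsketch\texon\t1\t300\t.\t+\t.\tParent=g1",
    "chr1\tsketch\tCDS\t1\t300\t.\t+\t0\tParent=g1",
    "chr1\tsketch\texon\t701\t1000\t.\t+\t.\tParent=g1",
    "chr1\tsketch\tCDS\t701\t1000\t.\t+\t0\tParent=g1",
    "chr1\tsketch\tgene\t2001\t2500\t.\t-\t.\tID=g2",
    "chr1\tsketch\texon\t2001\t2500\t.\t-\t.\tParent=g2",
    "chr1\tsketch\tCDS\t2001\t2500\t.\t-\t0\tParent=g2",
]

stats = summarize_gff3(toy)
print(stats)
```

In real annotations, exons and CDS segments usually have an mRNA parent rather than a gene parent, and genes may have several isoforms, so a production version would need to resolve the gene, mRNA, and exon hierarchy and choose a representative transcript per gene.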