2.6 Evaluation of assembly quality
Assembly quality was evaluated using the mapping rates of the paired-end
and long reads to the assembly (Figure S1). We also evaluated the
completeness and accuracy of the genome assembly using Benchmarking
Universal Single-Copy Orthologs (BUSCO) version 3.0.2 (Simão et al.,
2015). Genome completeness was further evaluated by mapping transcripts
from 18 tissues and organs (Table S1) to the assembly using GMAP (Wu and
Watanabe, 2005).
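The mapping rate used above is simply the share of reads that align to the assembly. The following minimal Python sketch illustrates the arithmetic; the function name and the read counts are hypothetical and are not taken from this study.

```python
def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Return the percentage of reads that aligned to the assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must be between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


# Hypothetical counts, for illustration only (not results from this study).
example_rate = mapping_rate(mapped_reads=980, total_reads=1000)
print(f"{example_rate:.1f}%")
```

In practice the two counts would come from the alignment summary produced by the read mapper; the sketch assumes they have already been extracted.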
2.7 Genome annotation
We annotated repeat sequences, gene structure, and noncoding RNA in the
Chinese walnut genome (workflow, Figure S2). We used both homology-based
prediction and de novo prediction to identify transposable elements
(TEs). For de novo prediction, we constructed a repeat sequence database
using RepeatModeler (http://www.repeatmasker.org), and predicted repeat
sequences using RepeatMasker (Maja et al., 2009)
(http://www.repeatmasker.org), LTR-FINDER (Zhao and Hao, 2007), and
PILER (Edgar and Myers, 2005) with default parameters.
For homology-based prediction, we identified transposable elements at
the DNA and protein levels by comparing the genomic sequence with the
Repbase v21.12 database (Jurka, 2000) using RepeatMasker (Maja et al.,
2009) (http://www.repeatmasker.org) and RepeatProteinMask v4.0.7 (Maja
et al., 2009), respectively. Finally, all transposable elements
identified by
either method were merged into the final transposon annotations.
Tandem repeats in the assembled Chinese walnut genome were also
annotated using Tandem Repeats Finder (TRF) v4.09 (Benson et al.,
1999).
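The merging step described above, in which elements identified by either method are combined into one annotation, can be pictured as an interval union per sequence. The Python sketch below is an illustration under stated assumptions, not the procedure used in this study: the function name, the tuple layout (sequence name, start, end; 1-based inclusive), and the example coordinates are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed layout: (sequence name, start, end), 1-based inclusive coordinates.
Interval = Tuple[str, int, int]


def merge_te_intervals(
    *call_sets: Iterable[Interval],
) -> Dict[str, List[Tuple[int, int]]]:
    """Union the intervals from any number of call sets, per sequence."""
    by_seq = defaultdict(list)
    for call_set in call_sets:
        for seq, start, end in call_set:
            by_seq[seq].append((start, end))

    merged = {}
    for seq, spans in by_seq.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = out[-1]
            if start <= last_end + 1:  # overlapping or directly adjacent
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[seq] = out
    return merged


# Hypothetical calls, for illustration only.
de_novo_calls = [("chr1", 100, 200), ("chr1", 500, 600)]
homology_calls = [("chr1", 150, 250), ("chr2", 10, 50)]
combined = merge_te_intervals(de_novo_calls, homology_calls)
print(combined)
```

Merging on coordinates alone discards the repeat class of each element; a real annotation would need a rule for resolving conflicting classifications, which this sketch does not attempt.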
To ensure accurate gene structure annotations, we combined
homology-based and de novo prediction methods. RNA sequences from 18
tissues (Table S1) were used to train AUGUSTUS (Stanke et al., 2006)
with default parameters. We predicted gene structure de novo based on
the statistical characteristics of the genomic sequence data (such as
codon frequency and the distribution of exons and introns) using SNAP
(Johnson et al., 2008). We further predicted the structure of
protein-coding genes by homology with genes identified in Arabidopsis
thaliana, Citrus sinensis, Juglans regia, Malus domestica, Olea
europaea, Oryza sativa, Populus euphratica, Quercus robur, and Chinese
walnut using Exonerate v2.2.0 (Slater et al., 2005). The
final structural annotation of protein-coding genes was performed using
a MAKER (Holt et al., 2011) pipeline that integrates AUGUSTUS (Stanke et
al., 2006) and results from homologous protein mapping, RNA-seq mapping,
and Nanopore mapping.
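As an illustration of one of the sequence statistics mentioned above (codon frequency), the following Python sketch counts in-frame codons in a coding sequence and reports their relative frequencies. It is not how SNAP or AUGUSTUS build their models; the function name and the example sequence are hypothetical.

```python
from collections import Counter
from typing import Dict


def codon_frequencies(cds: str) -> Dict[str, float]:
    """Return the relative frequency of each in-frame codon in a CDS."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical coding sequence, for illustration only.
freqs = codon_frequencies("ATGGCCGCCTAA")
for codon, freq in sorted(freqs.items()):
    print(codon, freq)
```

Gene predictors combine many such statistics over a training set of genes; a single sequence, as here, only shows the counting step.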