2.7 Gene functional annotation and genome assembly validation
To build consensus gene models for
the C. undulatus genome assembly, gene predictions were performed
using Augustus (Stanke et al. 2006), GeMoMa (Jens et al. 2016), and PASA (Haas et al. 2008), based on de novo prediction,
homology to known protein sequences, and transcript evidence, respectively. For homology-based
prediction, protein sequences of Danio rerio, Gasterosteus
aculeatus, Labrus bergylta, Lateolabrax maculatus, Symphodus melops, and Takifugu rubripes were downloaded
from the NCBI database (https://www.ncbi.nlm.nih.gov). All gene models
were merged into a single gene set, and genes containing transposon sequences were removed
using Transposon PSI (http://transposon-psi.sourceforge.net).
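
The filtering step can be illustrated with a minimal sketch. It assumes that the merged gene set and the genes flagged as transposon-related are both available as lists of gene identifiers; the function name and the identifiers are hypothetical and are not taken from the pipeline described here.

```python
def remove_transposon_genes(gene_ids, transposon_gene_ids):
    """Return the genes from gene_ids that were not flagged as transposon-related.

    Both arguments are iterables of gene identifiers (hypothetical format).
    The input order of gene_ids is preserved.
    """
    flagged = set(transposon_gene_ids)
    return [gene for gene in gene_ids if gene not in flagged]


# Hypothetical identifiers, for illustration only.
merged_gene_set = ["gene1", "gene2", "gene3", "gene4"]
flagged_by_transposon_search = ["gene2", "gene4"]

kept = remove_transposon_genes(merged_gene_set, flagged_by_transposon_search)
print(kept)
```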
The predicted protein-coding genes
were functionally annotated using two complementary methods. First, the
SWISS-PROT database (Bairoch et al. 2010), the NCBI non-redundant
protein (NR) database, and the Gene Ontology (GO) and Kyoto Encyclopedia of
Genes and Genomes (KEGG) databases (Kanehisa et al. 2000) were
used to annotate the protein-coding genes using BLAST with an E-value cutoff of
1 × 10⁻⁵. GO terms were assigned to genes based on NR
annotation information using Blast2GO. Second, we performed functional
annotation using InterProScan (Zdobnov & Rolf 2001) to examine motifs,
domains, and other signatures in the secondary structure of the
protein-coding genes by searching the ProDom, PRINTS, Pfam, SMART,
PANTHER, and PROSITE databases in InterPro (Sarah et al. 2009).
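
As an illustration of the E-value cutoff, the sketch below filters BLAST hits in the standard 12-column tabular format (`-outfmt 6`), in which the E-value is column 11 and the bit score is column 12. The output format, the choice to keep the highest-scoring hit per query, and all identifiers are assumptions made for this example; the text does not state how hits were selected.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # cutoff stated in the text


def best_hits(blast_tsv, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for hits passing the cutoff.

    blast_tsv is an open file-like object in BLAST tabular format (assumed).
    Keeping only the highest bit score per query is an illustrative choice.
    """
    best = {}
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit is weaker than the cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: gene1 has two passing hits, gene2 has none.
example = io.StringIO(
    "gene1\tprotA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "gene1\tprotB\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-20\t120\n"
    "gene2\tprotC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30\n"
)
hits = best_hits(example)
print(hits)
```

Genes without any hit passing the cutoff, such as `gene2` in this example, are simply absent from the result and would remain unannotated by this step.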
The quality of the gene annotation was
evaluated by counting the expressed genes (FPKM > 0) with Cufflinks
(Trapnell et al. 2012), based on
Illumina short reads from tissue transcriptomes.
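
The expression check can be sketched as follows. The example assumes a tab-delimited table with a header row that includes `tracking_id` and `FPKM` columns, as in the `genes.fpkm_tracking` file written by Cufflinks; the function name and the toy values are hypothetical.

```python
import csv
import io


def count_expressed(tracking_file, threshold=0.0):
    """Return (expressed, total) gene counts from an FPKM table.

    A gene counts as expressed when its FPKM is strictly above threshold.
    The table is assumed to be tab-delimited with an 'FPKM' header column.
    """
    expressed = 0
    total = 0
    for row in csv.DictReader(tracking_file, delimiter="\t"):
        total += 1
        if float(row["FPKM"]) > threshold:
            expressed += 1
    return expressed, total


# Toy table with only the columns used here; real files carry more columns.
example = io.StringIO(
    "tracking_id\tFPKM\n"
    "gene1\t12.5\n"
    "gene2\t0\n"
    "gene3\t0.8\n"
)
expressed, total = count_expressed(example)
print(f"{expressed} of {total} genes have FPKM > 0")
```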