2.7 Gene functional annotation and genome assembly validation
To build consensus gene models for the C. undulatus genome assembly, genes were predicted with Augustus (Stanke et al. 2006), GeMoMa (Jens et al. 2016), and PASA (Haas et al. 2008), using de novo, homology-based, and transcript-based evidence, respectively. For homology-based prediction, protein sequences of Danio rerio, Gasterosteus aculeatus, Labrus bergylta, Lateolabrax maculatus, Symphodus melops, and Takifugu rubripes were downloaded from the NCBI database (https://www.ncbi.nlm.nih.gov). All gene models were merged into a single gene set, and genes containing transposon sequences were removed using Transposon PSI (http://transposon-psi.sourceforge.net).
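The final filtering step described above can be illustrated with a minimal Python sketch. It assumes the merged gene set is available as a mapping from gene ID to gene model and that the transposon screen yields a list of flagged gene IDs. The function name, the data layout, and the example IDs are hypothetical and do not reproduce the actual pipeline or any Transposon PSI output format.

```python
def remove_transposon_genes(gene_models, flagged_ids):
    """Return the gene models whose IDs were not flagged as transposon-related."""
    flagged = set(flagged_ids)
    return {
        gene_id: model
        for gene_id, model in gene_models.items()
        if gene_id not in flagged
    }


# Hypothetical merged gene set; "source" records which predictor produced the model.
merged = {
    "gene1": {"source": "Augustus"},
    "gene2": {"source": "GeMoMa"},
    "gene3": {"source": "PASA"},
}

# Hypothetical IDs reported by the transposon screen.
flagged_ids = ["gene2"]

filtered = remove_transposon_genes(merged, flagged_ids)
print(sorted(filtered))
```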
The predicted protein-coding genes were functionally annotated using two complementary methods. First, the genes were searched against the SWISS-PROT database (Bairoch et al. 2010), the NCBI non-redundant protein (NR) database, and the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases (Kanehisa et al. 2000) using BLAST with an E-value threshold of 1 × 10⁻⁵. GO terms were assigned to genes from the NR annotation information using Blast2GO. Second, functional annotation was performed with InterProScan (Zdobnov & Rolf 2001) to examine motifs, domains, and other signatures in the secondary structure of the encoded proteins by searching the ProDom, PRINTS, Pfam, SMART, PANTHER, and PROSITE databases in InterPro (Sarah et al. 2009). Annotation quality was evaluated by counting the genes with evidence of expression (FPKM > 0), as estimated with Cufflinks (Trapnell et al. 2012) from the Illumina short reads of the tissue transcriptomes.
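The two numeric criteria above (the BLAST E-value threshold and FPKM > 0) can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline that was actually run. It assumes BLAST hits in the standard 12-column tabular format, where the E-value is the 11th column, and a Cufflinks-style tracking table reduced to the `gene_id` and `FPKM` columns. Treating the threshold as inclusive and keeping only the lowest-E-value hit per gene are assumptions made here, and all sample records are invented.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text; inclusive use is an assumption


def best_hits(blast_tabular, cutoff=E_VALUE_CUTOFF):
    """Keep the lowest-E-value hit per query from BLAST tabular text."""
    best = {}
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > cutoff:
            continue  # hit fails the E-value threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def count_expressed(fpkm_table):
    """Count distinct genes with FPKM > 0 in a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(fpkm_table), delimiter="\t")
    return len({row["gene_id"] for row in reader if float(row["FPKM"]) > 0})


# Invented example records.
blast_text = (
    "g1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "g1\tP2\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-8\t60\n"
    "g2\tP3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
)
fpkm_text = "gene_id\tFPKM\ng1\t12.5\ng2\t0\ng3\t0.3\n"

hits = best_hits(blast_text)
n_expressed = count_expressed(fpkm_text)
print(hits, n_expressed)
```

In this toy input, g2 receives no annotation because its only hit has an E-value above the threshold, and it is also not counted as expressed because its FPKM is 0.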