Genome annotation
Repeat Elements Annotation
To identify repeat element sequences, we constructed a de novo repeat library using RepeatModeler2 (Flynn et al. 2020), which includes RECON v1.08 (Bao & Eddy, 2002), RepeatScout v1.0.6 (Price et al. 2005), LTRharvest (Ellinghaus et al., 2008), as incorporated in GenomeTools v1.5.9, and LTR_retriever v2.7 (Ou and Jiang 2018). RepeatModeler2 was run with default parameters plus the additional LTRStruct pipeline, which includes MAFFT v7.453 (Katoh and Standley 2013), CD-HIT v4.8.1 (Li & Godzik, 2006) and Ninja v0.95 (Wheeler 2009). The sequences obtained from RepeatModeler2 were then combined with Repbase v17.01 and a custom database built from the FishTEDB (Shao et al. 2018) entries for Takifugu rubripes, Takifugu flavidus and Tetraodon nigroviridis. Finally, RepeatMasker v4.1.0 (Tarailo-Graovac and Chen 2009) was used to annotate repeat elements based on this combined database.
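The library-combination step can be illustrated with a minimal Python sketch. It assumes that each library is available as FASTA text and that records with an identical header should be kept only once; the text does not state whether any de-duplication was applied, and the function names and example headers here are hypothetical.

```python
from typing import Dict, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(libraries: Dict[str, str]) -> str:
    """Concatenate FASTA libraries in order, skipping repeated headers."""
    seen = set()
    records = []
    for _source, text in libraries.items():
        for header, sequence in parse_fasta(text):
            if header in seen:
                continue  # assumption: keep the first occurrence only
            seen.add(header)
            records.append(f">{header}\n{sequence}")
    return "\n".join(records) + "\n"


# Toy input standing in for the de novo, Repbase and FishTEDB libraries.
example = {
    "denovo": ">rnd-1_family-1#LTR/Gypsy\nACGT\nACGT\n",
    "custom": ">rnd-1_family-1#LTR/Gypsy\nAAAA\n>fam2#DNA/hAT\nTTTT\n",
}
merged = merge_libraries(example)
```

The merged text could then be written to a single file and supplied to RepeatMasker as a custom library.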
Gene prediction & Functional annotation
After repeat masking, gene prediction was conducted using the MAKER2 pipeline v2.31.10 (Holt and Yandell 2011) in two iterative rounds, combining ab initio, homology-based and transcriptome-based methods. In the first round, for homology-based annotation, MAKER2 was run in protein2genome mode using protein sequences of three closely related species, Mola mola, Tetraodon nigroviridis and Takifugu rubripes, extracted from SWISS-PROT (www.uniprot.org). For annotation based on the RNA-Seq data, the est2genome mode was enabled, which infers gene models from transcriptome evidence. Transcriptomic reads from all sequenced tissues were mapped and assembled through a genome-guided approach, using HISAT2 v2.2.0 (Kim et al. 2015) and StringTie v2.1.1 (Pertea et al. 2015). Ab initio prediction was performed with SNAP (Korf 2004) (http://korflab.ucdavis.edu), which was independently trained on the L. sceleratus genome with default parameters, and AUGUSTUS v3.3.3 (Stanke et al. 2006), which was trained through BUSCO v3.1.0 (Simão et al. 2015) with the extra parameter “-long”. The second round of MAKER2 was run using the trained models with the same settings as round one, except for the est2genome and protein2genome modes. The custom repeat library described above and the MAKER2 repeat library used for genome masking were retained for both rounds. The completeness of the putative genes was assessed using BUSCO v4.0.5 (Simão et al. 2015) against the Actinopterygii odb10 database.
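The difference between the two MAKER2 rounds can be sketched as the control-file options that change from one round to the next. This is an illustrative sketch only: it assumes that the est2genome and protein2genome modes were switched off in round two, which is common MAKER practice but is not stated explicitly above, and the model file name and species name are hypothetical placeholders.

```python
from typing import Dict, Optional


def maker_round_options(round_number: int,
                        snap_hmm: Optional[str] = None,
                        augustus_species: Optional[str] = None) -> Dict[str, str]:
    """Return the maker_opts.ctl settings that differ between rounds."""
    if round_number == 1:
        # Round one: gene models are inferred directly from the
        # transcript and protein alignments.
        return {"est2genome": "1", "protein2genome": "1",
                "snaphmm": "", "augustus_species": ""}
    if round_number == 2:
        if not snap_hmm or not augustus_species:
            raise ValueError("round two needs trained SNAP and AUGUSTUS models")
        # Round two: the trained ab initio models are supplied.
        # Assumption: the evidence-only modes are disabled here.
        return {"est2genome": "0", "protein2genome": "0",
                "snaphmm": snap_hmm, "augustus_species": augustus_species}
    raise ValueError("only rounds 1 and 2 are described")


def render_ctl(options: Dict[str, str]) -> str:
    """Format options as key=value lines, as used in maker_opts.ctl."""
    return "\n".join(f"{key}={value}" for key, value in options.items()) + "\n"


round_one = render_ctl(maker_round_options(1))
# "round1.hmm" and "my_species" are hypothetical names.
round_two = render_ctl(maker_round_options(2, "round1.hmm", "my_species"))
```

All other options, including the repeat libraries, would stay identical across the two rounds.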
The functional annotation of the predicted L. sceleratus genes was performed by similarity search against the UniProtKB/Swiss-Prot database (release-2020_03) with BLASTP v2.9.0+ (e-value 1e-6, -max_target_seqs=10) (Altschul et al. 1990). InterProScan v5 (Jones et al. 2014) was used to search for motifs and domains against all default databases, with the additional SignalP_EUK and TMHMM analyses. Functional annotations were also retrieved using eggNOG-mapper (Huerta-Cepas et al. 2017), which performs fast orthology assignment using precomputed eggNOG v5.0 (Huerta-Cepas et al. 2019) clusters and phylogenies.
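Because up to ten targets are reported per query, a single best hit per gene has to be chosen before downstream use. The following sketch shows one way to do this, assuming the BLASTP results were written in the standard 12-column tabular format and that the best hit is the one with the lowest e-value, with ties broken by the higher bit score. Neither the output format nor the tie-breaking rule is specified above, and the function names are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def best_hits(lines: Iterable[str], max_evalue: float = 1e-6) -> Dict[str, Hit]:
    """Keep, for each query, the hit with the lowest e-value."""
    best: Dict[str, Hit] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Columns 11 and 12 of the tabular format: e-value and bit score.
        hit = Hit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or (hit.evalue, -hit.bitscore) < (current.evalue, -current.bitscore):
            best[hit.query] = hit
    return best


def swissprot_accession(subject_id: str) -> str:
    """Extract the accession from an ID such as 'sp|P12345|NAME_SPECIES'."""
    parts = subject_id.split("|")
    return parts[1] if len(parts) >= 3 else subject_id


# Toy rows with made-up identifiers and scores.
rows = [
    "gene1\tsp|P11111|A_HUMAN\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene1\tsp|P22222|B_MOUSE\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t230",
    "gene2\tsp|P33333|C_DANRE\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30",
]
selected = best_hits(rows)
```

In this toy input, gene2 is dropped because its only hit exceeds the e-value threshold.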
Gene Ontology mapping
Gene ontology analysis was carried out using a custom Python script (gene_ontology_mapping.py). As queries we used the best BLAST hits against UniProtKB/Swiss-Prot extracted in the functional annotation step, and the corresponding gene ontology terms were retrieved through the UniProt API service (https://www.uniprot.org/help/programmatic_access).
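The mapping logic can be sketched as follows. This is not the gene_ontology_mapping.py script itself, only an illustration of the idea: the API request is omitted, and the sketch assumes the retrieved table is tab-separated with a header row, the accession in the first column and semicolon-separated GO identifiers in the second. All function names and identifiers below are hypothetical.

```python
from typing import Dict, List


def parse_go_table(tsv_text: str) -> Dict[str, List[str]]:
    """Map accession -> list of GO IDs from a two-column TSV table."""
    table: Dict[str, List[str]] = {}
    lines = tsv_text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split("\t")
        accession = fields[0]
        go_field = fields[1] if len(fields) > 1 else ""
        table[accession] = [term.strip() for term in go_field.split(";") if term.strip()]
    return table


def map_genes_to_go(best_hit_accession: Dict[str, str],
                    go_table: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assign each gene the GO terms of its best Swiss-Prot hit."""
    return {gene: go_table.get(accession, [])
            for gene, accession in best_hit_accession.items()}


# Toy table in the assumed layout; accessions and terms are placeholders.
response = (
    "Entry\tGene Ontology IDs\n"
    "P22222\tGO:0005515; GO:0005634\n"
    "P44444\t\n"
)
go_table = parse_go_table(response)
gene_to_go = map_genes_to_go(
    {"gene1": "P22222", "gene3": "P44444", "gene4": "P55555"}, go_table
)
```

Genes whose best hit has no GO annotation, or whose accession is missing from the table, receive an empty list rather than raising an error.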