Genome Annotation
To annotate both structural and functional properties of the genome, we used GenSAS v6 (Humann et al. 2019). We first identified and masked interspersed repeats and low-complexity DNA sequences in the assembly. RepeatModeler v1.0.11 was run to identify repeat regions de novo and produce a structural annotation of them (Smit & Hubley 2008), and RepeatMasker v4.1.0 was then used to generate a version of the genome with these regions masked (Smit et al. 2015). We then employed Braker v2.1.0 (Hoff et al. 2019; Stanke et al. 2008; Stanke et al. 2006) to automatically predict gene models for protein-coding genes in the masked genome. The Braker pipeline uses the tools GeneMark-ES/ET (Lomsadze et al. 2005; Ter-Hovhannisyan et al. 2008) and Augustus (Camacho et al. 2009), together with evidence from RNAseq data, to predict gene models in novel eukaryotic genomes. Paired-end RNAseq reads were aligned to the genome using Hisat2 (Kim et al. 2015) with default settings, and the resulting alignment file was provided to Braker. Finally, PASA v2.3.3 (Haas et al. 2008) was used to refine the gene models, using the assembled transcriptome as input.
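As an illustration of the masking step described above, the following minimal Python sketch masks a set of repeat intervals in a sequence. It is not the RepeatMasker implementation and was not part of the analysis. The function name `mask_repeats`, the 0-based half-open interval convention, and the choice between soft masking (lowercase) and hard masking (replacement with N) are assumptions made for illustration only; the text does not state which masking mode was used.

```python
def mask_repeats(sequence, intervals, soft=True):
    """Mask repeat intervals in a nucleotide sequence.

    Hypothetical helper for illustration only.
    intervals: iterable of 0-based, half-open (start, end) pairs (assumed convention).
    soft=True lowercases masked bases; soft=False replaces them with 'N'.
    """
    chars = list(sequence)
    for start, end in intervals:
        # Clamp each interval to the bounds of the sequence.
        start = max(start, 0)
        end = min(end, len(chars))
        for i in range(start, end):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


if __name__ == "__main__":
    toy = "ACGTACGTAC"
    print(mask_repeats(toy, [(2, 5)]))              # soft-masked copy
    print(mask_repeats(toy, [(2, 5)], soft=False))  # hard-masked copy
```

Soft masking keeps the underlying bases available to downstream tools while still flagging them as repeats, whereas hard masking removes them from consideration entirely.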
For the resulting consensus gene models, we assigned functional annotations using a combination of six tools. Amino acid similarity to proteins in the NCBI RefSeq invertebrate database was used for functional annotation, based on searches with both Diamond v0.9.22 (Buchfink et al. 2015) and Blastp v2.7.1 (with the settings: matrix = BLOSUM62, expect = 1e-8, word size = 3, gap open penalty = 11, gap extend penalty = 1, maximum HSP distance = 30000). The gene set was further annotated based on the presence of peptide domains, using InterProScan v5.29-68.0, Pfam v1.6 (with the settings: e-value sequence = 1 and e-value domain = 10), and SignalP v4.1 (with the settings: organism group = eukaryotes, method = best, D-cutoff for noTM networks = 0.45, D-cutoff for TM networks = 0.50, minimal predicted peptide length = 10, and truncate sequence length = 70). Finally, Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology terms were assigned to each gene using the KEGG Automatic Annotation Server (Moriya et al. 2007), based on bi-directional best hit searches of the nucleotide sequences using BLAST.
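The bi-directional best hit criterion mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the KEGG Automatic Annotation Server's procedure, which applies its own scoring and thresholds that are not reproduced here. The function names, the `(query, subject, score)` tuple layout, and the toy identifiers are hypothetical.

```python
def best_hits(hits):
    """Return a dict mapping each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples (assumed layout).
    Ties are resolved in favour of the first hit encountered.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(forward, reverse):
    """Return sorted (gene, reference) pairs that are each other's best hit.

    forward: hits from searching genes against the reference set.
    reverse: hits from searching the reference set against the genes.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((g, r) for g, r in fwd.items() if rev.get(r) == g)


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    forward = [("g1", "K1", 200), ("g1", "K2", 150), ("g2", "K2", 90)]
    reverse = [("K1", "g1", 198), ("K2", "g1", 140), ("K2", "g2", 95)]
    print(bidirectional_best_hits(forward, reverse))
```

In the toy data, `g2` has `K2` as its best hit, but the best hit of `K2` is `g1`, so that pair is not reciprocal and is excluded; only the pair `g1` and `K1` is retained.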