2.5 Genome Annotation
Tandem repeats and interspersed repeats were identified using Tandem Repeats Finder (TRF) v4.09 (http://tandem.bu.edu/trf/trf407b.linux64.download.html) (Benson, 1999) and RepeatModeler v2.0 (http://www.repeatmasker.org/RepeatModeler/) (Flynn et al., 2020), respectively. RepeatMasker v4.1.0 (http://www.repeatmasker.org/RMDownload) (Tarailo-Graovac and Chen, 2009) was used to mask the predicted and known repeat sequences. tRNAscan-SE v1.4alpha (Chan et al., 2019) was used to predict tRNAs, and Infernal v1.1.3 (http://eddylab.org/) was used to search the Rfam database v11.0 with an E-value cutoff of 1e-5 to identify other types of noncoding RNAs (ncRNAs) (Nawrocki et al., 2013).
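As a quick sanity check on the masking step, the proportion of masked bases in the output can be computed directly from the masked FASTA. The following is a minimal sketch, not part of the original pipeline. It assumes that masked bases appear either as lowercase letters (soft masking) or as N/X characters (hard masking). The function name `masked_fraction` is hypothetical, and any N characters already present in the assembly as gaps would also be counted.

```python
def masked_fraction(fasta_text):
    """Return the fraction of sequence characters that are masked.

    Assumption: a base counts as masked if it is lowercase (soft masking)
    or is 'N' or 'X' (hard masking). Header lines starting with '>' are skipped.
    """
    total = 0
    masked = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower() or base in "NX")
    return masked / total if total else 0.0


if __name__ == "__main__":
    # Toy two-line record: 8 of 16 bases are masked.
    example = ">scaffold_1\nACGTacgt\nNNNNACGT\n"
    print(f"masked fraction: {masked_fraction(example):.2f}")
```

For a real assembly, the file contents would be read from disk and passed to the function. Streaming line by line would be preferable for large genomes.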
Protein-coding genes were predicted by combining homology-based, RNA sequencing-based, and ab initio predictions. For the homology-based approach, the protein sequences of several related species, including A. pisum (International Aphid Genomics Consortium, 2010), R. maidis (Chen et al., 2019), Diuraphis noxia (Nicholson et al., 2015), Aphis gossypii (Quan et al., 2019), Aphis glycines (Wenger et al., 2020) and Myzus persicae (Mathers et al., 2017), were downloaded from NCBI and aligned against the assembled genome using Gene Model Mapper (GeMoMa) v1.6.1 (http://www.jstacs.de/index.php/GeMoMa) (Keilwagen et al., 2016) to refine the BLAST hits and define exact intron/exon positions. For the RNA sequencing-based method, the PacBio full-length transcriptome, which was obtained from the pooled sample of M. dirhodum, was used to predict the open reading frames (ORFs) with PASA (https://sourceforge.net/projects/pasa/files/stats/timeline) (Campbell et al., 2006) using default settings. For the ab initio method, two de novo programs, Augustus v3.2.2 (http://augustus.gobics.de/binaries/) (Stanke and Waack, 2003) and SNAP (http://snap.stanford.edu/snap/download.html) (Korf, 2004), were employed with default parameters to predict genes in the repeat-masked genome sequences. All predicted genes from the three approaches were integrated with EVidenceModeler (EVM) (https://sourceforge.net/projects/evidencemodeler/) (Haas et al., 2008) to generate a high-confidence gene set, and the untranslated regions and alternative splicing were predicted using PASA.
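EVM combines evidence tracks according to a tab-delimited weights file with three columns: evidence class, source label, and weight. The sketch below shows how such a file might be generated. The weights actually used are not reported here, so the numeric values, the source labels (which must match column 2 of the corresponding GFF3 files), and the choice of class for each tool are illustrative assumptions only. The helper name `format_evm_weights` is hypothetical.

```python
# Evidence classes recognised in an EVM weights file.
ALLOWED_CLASSES = {
    "ABINITIO_PREDICTION",
    "OTHER_PREDICTION",
    "PROTEIN",
    "TRANSCRIPT",
}


def format_evm_weights(entries):
    """Render (evidence_class, source, weight) tuples as weights-file text."""
    lines = []
    for evidence_class, source, weight in entries:
        if evidence_class not in ALLOWED_CLASSES:
            raise ValueError(f"unknown evidence class: {evidence_class}")
        lines.append(f"{evidence_class}\t{source}\t{weight}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical labels and weights, chosen only to illustrate the format.
    example_entries = [
        ("ABINITIO_PREDICTION", "augustus", 1),
        ("ABINITIO_PREDICTION", "snap", 1),
        ("PROTEIN", "gemoma", 5),
        ("TRANSCRIPT", "pasa", 10),
    ]
    print(format_evm_weights(example_entries), end="")
```

Transcript-derived evidence is commonly given more weight than ab initio predictions, but the appropriate values depend on the quality of each evidence track.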
The gene set was annotated by aligning protein sequences to functional databases, including NR (nonredundant sequence database) (Deng et al., 2006), Swiss-Prot (Bairoch and Boeckmann, 1991), eggNOG (evolutionary genealogy of genes: Nonsupervised Orthologous Groups) (Huerta-Cepas et al., 2019), GO (Gene Ontology) (Dimmer et al., 2012) and KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa and Goto, 2000), using BLAST with an E-value threshold of ≤ 1e-5.
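The E-value threshold can be applied as a post-filter on BLAST tabular output. The sketch below assumes the standard 12-column tabular format (`-outfmt 6`), in which the E-value is column 11 and the bit score is column 12. Keeping only the highest-scoring hit per query is an additional assumption made for illustration, and the function name `best_hits` is hypothetical.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold stated in the text


def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for hits passing the cutoff.

    For each query, only the hit with the highest bit score is kept.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Toy rows with made-up identifiers; the middle columns are placeholders.
    rows = [
        ["gene1", "protA", "90.0", "100", "10", "0", "1", "100", "1", "100", "1e-50", "200"],
        ["gene2", "protB", "30.0", "50", "35", "0", "1", "50", "1", "50", "1e-3", "30"],
    ]
    text = "\n".join("\t".join(r) for r in rows)
    for query, hit in best_hits(text).items():
        print(query, *hit, sep="\t")
```

The same filter could be applied separately to the results from each database before the annotations are merged.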