2.5 Genome Annotation
Tandem repeats and interspersed repeats were identified using Tandem
Repeats Finder (TRF) v4.09
(http://tandem.bu.edu/trf/trf407b.linux64.download.html)
(Benson, 1999) and RepeatModeler v2.0
(http://www.repeatmasker.org/RepeatModeler/)
(Flynn et al., 2020), respectively. RepeatMasker v4.1.0
(http://www.repeatmasker.org/RMDownload)
was used to mask the predicted and known repeat sequences
(Tarailo-Graovac and Chen, 2009). tRNAscan-SE v1.4alpha (Chan et al.,
2019) was used to predict tRNAs, and Infernal v1.1.3
(http://eddylab.org/) was used to
search the Rfam database v11.0 with an E-value cutoff of
1e-5 to identify other types of noncoding RNAs
(ncRNAs) (Nawrocki et al., 2013).
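To make the masking step concrete, the following minimal Python sketch shows what hard and soft masking do to a sequence once repeat coordinates are known. It is an illustration only and not the RepeatMasker implementation; the function name `mask_repeats` and the 0-based, half-open interval convention are assumptions made for this example.

```python
def mask_repeats(seq, intervals, soft=False):
    """Mask repeat intervals in a nucleotide sequence.

    intervals: iterable of (start, end) pairs, 0-based and half-open
               (a convention assumed for this sketch).
    soft=False replaces repeat bases with 'N' (hard masking);
    soft=True converts them to lowercase (soft masking).
    """
    chars = list(seq.upper())
    for start, end in intervals:
        # Clamp the interval to the sequence bounds.
        start = max(0, start)
        end = min(len(chars), end)
        for i in range(start, end):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


if __name__ == "__main__":
    toy = "ACGTACGTACGTACGT"        # toy sequence, not real data
    repeats = [(2, 6), (10, 12)]    # hypothetical repeat coordinates
    print(mask_repeats(toy, repeats))             # hard-masked
    print(mask_repeats(toy, repeats, soft=True))  # soft-masked
```

Hard masking hides repeats from downstream gene predictors entirely, whereas soft masking keeps the bases available to tools that can read lowercase sequence.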
Protein-coding genes were predicted through the combination of
homology-based, RNA sequencing-based, and ab initio predictions. For the
homology-based approach, the protein sequences of several related
species, including A. pisum (International Aphid Genomics
Consortium, 2010), R. maidis (Chen et al., 2019), Diuraphis noxia (Nicholson et al., 2015), Aphis gossypii (Quan et al., 2019), Aphis glycines (Wenger et al., 2020) and Myzus persicae (Mathers et al., 2017), were downloaded from NCBI
and aligned against the assembled genome using Gene Model Mapper
(GeMoMa) v1.6.1
(http://www.jstacs.de/index.php/GeMoMa)
(Keilwagen et al., 2016) to refine the BLAST hits and define exact
intron/exon positions. For the RNA sequencing-based method, the PacBio
full-length transcriptome, which was obtained from the pooled sample of M. dirhodum, was used to predict the open reading frames (ORFs)
with PASA
(https://sourceforge.net/projects/pasa/files/stats/timeline)
(Campbell et al., 2006) using default settings. For the ab initio
method, two de novo programs, Augustus v3.2.2
(http://augustus.gobics.de/binaries/)
(Stanke and Waack, 2003) and SNAP
(http://snap.stanford.edu/snap/download.html)
(Korf, 2004), were employed with default parameters to predict genes in
the repeat-masked genome sequences. All predicted genes from the three
approaches were integrated with EVidenceModeler (EVM)
(https://sourceforge.net/projects/evidencemodeler/)
(Haas et al., 2008) to generate high-confidence gene sets, and the
untranslated regions and alternative splicing were predicted using PASA.
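As an illustration of what ORF prediction on a full-length transcript involves, the sketch below scans the three forward reading frames for the longest ATG-to-stop stretch. It is a simplified stand-in and does not reproduce PASA's actual procedure; the function name `longest_orf`, the forward-strand-only search, and the coordinate convention are assumptions for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, orf) for the longest ATG..stop ORF.

    Only the three forward frames are scanned (an assumption suited to
    sense-strand full-length transcripts). Coordinates are 0-based and
    half-open, and the ORF includes its stop codon. Returns (0, 0, "")
    when no complete ORF is found.
    """
    seq = seq.upper()
    best = (0, 0, "")
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # include the stop codon
                if end - start > best[1] - best[0]:
                    best = (start, end, seq[start:end])
                start = None                   # look for the next ORF
    return best


if __name__ == "__main__":
    transcript = "CCATGAAATTTGGGTAACC"         # toy sequence, not real data
    print(longest_orf(transcript))
```

Real pipelines additionally consider partial ORFs, minimum length cutoffs, and coding potential, none of which are modeled here.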
The gene set was annotated by aligning protein sequences to functional
databases, including NR (nonredundant sequence database) (Deng et al.,
2006), Swiss-Prot (Bairoch and Boeckmann, 1991), eggNOG (evolutionary
genealogy of genes: Nonsupervised Orthologous Groups) (Huerta-Cepas et
al., 2019), GO (Gene Ontology) (Dimmer et al., 2012) and KEGG (Kyoto
Encyclopedia of Genes and Genomes) (Kanehisa and Goto, 2000), using
BLAST with an E-value threshold of 1e-5.
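The E-value filtering described above can be sketched in Python as follows, assuming the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`), where column 11 is the E-value and column 12 the bit score. The function name `best_hits` and the choice to keep one top-scoring hit per query are assumptions of this example and are not stated in the text.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Select the best hit per query from BLAST tabular (outfmt 6) text.

    Hits with an E-value above max_evalue are discarded; among the
    remaining hits, the one with the highest bit score is kept.
    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                           # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue                           # fails the E-value threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Toy records with hypothetical identifiers, not real results.
    rows = [
        "g1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
        "g1\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t250",
        "g2\tP3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30",
    ]
    print(best_hits("\n".join(rows)))
```

In this toy input, the only hit for `g2` has an E-value of 0.01 and is therefore dropped, so that gene would remain unannotated by this database.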