Repeats and gene annotation

Repeats were identified using a combination of RepeatMasker and "One code to find them all", the latter of which assembles multiple adjacent RepeatMasker hits into complete transposable element (TE) copies. RepeatMasker was run on each genome with a custom library that combines Repbase 23.04 repeats with cotton-specific repeats. Default parameters were used, except that the search was run in "sensitive" mode and set to mask only TEs (no low-complexity sequence). Parameters are available at https://github.com/Wendellab/D5D10. "One code to find them all" was then run with default parameters to aggregate multiple hits into TE models. Its output was aggregated and summarized in R/3.4.4 (citation) using dplyr/0.7.4 (citation). All code can be found at https://github.com/Wendellab/D5D10.
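The idea of aggregating adjacent hits into TE copies and then summarizing them per family can be illustrated with a minimal Python sketch. This is not the algorithm implemented by "One code to find them all", nor the R/dplyr code in the repository: the `Hit` record, the `max_gap` threshold, and the family names are hypothetical, and the merge rule (same sequence, same family, gap no larger than `max_gap`) is a deliberate simplification that ignores strand, nesting, and position within the reference element.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    """One repeat hit; a hypothetical, reduced view of a RepeatMasker record."""
    seqid: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    family: str


def merge_adjacent_hits(hits, max_gap=100):
    """Merge same-family hits on the same sequence separated by <= max_gap bp.

    max_gap is an assumed, illustrative threshold.
    """
    merged = []
    # Sorting by sequence, family, then start puts merge candidates side by side.
    for hit in sorted(hits, key=lambda h: (h.seqid, h.family, h.start)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.seqid == hit.seqid
            and last.family == hit.family
            and hit.start - last.end <= max_gap
        ):
            # Extend the current copy instead of starting a new one.
            last.end = max(last.end, hit.end)
        else:
            merged.append(Hit(hit.seqid, hit.start, hit.end, hit.family))
    return merged


def summarize_by_family(copies):
    """Count copies and total masked base pairs for each TE family."""
    summary = defaultdict(lambda: {"copies": 0, "bp": 0})
    for copy in copies:
        summary[copy.family]["copies"] += 1
        summary[copy.family]["bp"] += copy.end - copy.start + 1
    return dict(summary)


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    example_hits = [
        Hit("Chr01", 100, 500, "Gypsy"),
        Hit("Chr01", 550, 900, "Gypsy"),
        Hit("Chr01", 5000, 5600, "Copia"),
    ]
    print(summarize_by_family(merge_adjacent_hits(example_hits)))
```

In this sketch the per-family table plays the role of the dplyr group-and-summarize step; the actual analysis scripts are those in the repository cited above.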
The MAKER-P pipeline (Cantarel et al., 2008) was used to annotate the G. raimondii and G. turneri genomes after masking repetitive elements with RepeatMasker (A.F.A. Smit, R. Hubley & P. Green, RepeatMasker at http://repeatmasker.org) using a custom database enriched for cotton-specific repeat sequences.
G. raimondii was annotated using the iterative MAKER-P method described in Grover et al. (2017) with the following modifications: (1) RNA-seq data were assembled using Mikado (Venturini et al., 2018); (2) the RNA-seq assembly was provided as an additional prediction source instead of EST evidence; and (3) software versions were updated. The raw RNA-seq reads are available from the SRA (SUB5372207). The assembly and annotation quality of each genome was validated with the BUSCO pipeline (Simão et al., 2015), which evaluates completeness by characterizing the presence, fragmentation, and/or duplication of highly conserved genes.
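As a rough illustration of how the BUSCO categories relate to one another, the Python sketch below converts category counts into percentages of the searched gene set. The function name and the example counts are hypothetical and are not results from this study; the only assumption about BUSCO itself is that each searched gene is classified as complete single-copy, complete duplicated, fragmented, or missing.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of all searched genes.

    'complete' is the sum of single-copy and duplicated genes.
    """
    total = single + duplicated + fragmented + missing
    if total <= 0:
        raise ValueError("at least one BUSCO gene must have been searched")

    def pct(count):
        return round(100.0 * count / total, 1)

    return {
        "complete": pct(single + duplicated),
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "n": total,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(busco_percentages(single=1300, duplicated=100, fragmented=20, missing=20))
```

A high complete percentage with few fragmented or missing genes suggests a more complete assembly or annotation, while a high duplicated percentage can point to retained duplicates or redundancy in the assembly.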

Data availability