Errors were identified in the previously published G. raimondii sequence (Paterson et al., 2012). In that genome sequence, the chromosomes were named to be consistent with earlier genetic maps. However, a new chromosome naming convention has since been used for diploid and allotetraploid cotton (Zhang et al., 2015; Li et al., 2015; Du et al., 2018; Udall et al., 2018), in which homoeologous chromosomes are organized in sequence pairs (e.g. At_01 - At_13 [Chr. 01 - Chr. 13] are homoeologs of Dt_01 - Dt_13 [Chr. 14 - Chr. 26], respectively). We have adopted this naming convention for the homologous chromosomes of these two genomes; as a result, the chromosome names do not match those of the homologous chromosomes in the previous assembly. Structural errors in the previously published sequence were identified by genome alignments (Figure 1) and by mapping HiC reads to the genome sequence (Figure 2, Supp Figure 1). The largest of these errors was an assembly-derived translocation of D5_04 (previously Chr. 12) onto D5_05 (previously Chr. 09) (Figure 2). Additional, smaller errors were found between Chr. 01 (now D5_07) and Chr. 13 (now D5_13); Chr. 02 (now D5_01) and Chr. 13 (now D5_13); Chr. 03 (now D5_02) and Chr. 13 (now D5_13); Chr. 02 (now D5_01) and Chr. 03 (now D5_02); Chr. 02 (now D5_01) and Chr. 07 (now D5_11); and Chr. 03 (now D5_02) and Chr. 07 (now D5_11) (Supp Figure 1). These corrections, based on alignment and HiC data, were also supported by the Bionano data (data not shown).
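Because the old and new chromosome names differ, a small lookup table can help when comparing annotations between assemblies. The sketch below is illustrative only: it covers just the old-to-new name pairs stated in the paragraph above, the names `OLD_TO_NEW` and `rename_chromosome` are hypothetical, and it converts names only (coordinates differ between assemblies and are not translated).

```python
# Old (Paterson et al., 2012) -> new chromosome names.
# Only the pairs stated in the text are listed; the remaining chromosomes
# are deliberately left out rather than guessed.
OLD_TO_NEW = {
    "Chr01": "D5_07",
    "Chr02": "D5_01",
    "Chr03": "D5_02",
    "Chr07": "D5_11",
    "Chr09": "D5_05",
    "Chr12": "D5_04",
    "Chr13": "D5_13",
}


def rename_chromosome(old_name: str) -> str:
    """Map an old chromosome name (e.g. 'Chr. 12' or 'Chr12') to its new name.

    Raises KeyError for chromosomes whose new name is not given in the text.
    """
    # Normalise 'Chr. 12' / 'Chr 12' / 'Chr12' to the 'Chr12' key style.
    key = old_name.replace(".", "").replace(" ", "")
    return OLD_TO_NEW[key]


if __name__ == "__main__":
    for old in ("Chr. 12", "Chr. 09", "Chr01"):
        print(old, "->", rename_chromosome(old))
```

Raising `KeyError` for unlisted chromosomes is a deliberate choice here, so that an incomplete table fails loudly instead of silently passing an old name through.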
We also inspected a reported mitochondrial genome insertion on Chr. 01 (now D5_07, Figure 3) located between coordinates 23.1 Mb and 25 Mb. This region also appears to have been the result of an assembly error. Alignment of the two genomes (previous D5 genome vs. new D5 genome) identified a 1.26 Mb segment that was inserted into the old sequence and not found in our new de novo assembly. Bionano data also indicated an insertion in the old assembly, while the 'inserted' Bionano contig was unmapped in the new assembly of D5 (Figure 3C).
In general, the reported large-scale nuclear mitochondrial DNA segment (NUMT) exhibited high similarity to the published G. raimondii mitochondrial genome (99.8% identity over 94% of the region Chr01:23100000-25000000). On an individual gene basis, over half of the genes contained within the putative NUMT were over 99% identical to the corresponding sequence in the published G. raimondii mitochondrial genome, with an average of 95% similarity. Given that NUMTs evolve more quickly once in the nucleus (Palmer or Adams citation?), this high level of sequence similarity also suggested that the insertion in Chr. 01 of the previously published G. raimondii genome was either (1) a very recent insertion in the G. raimondii genome or (2) an assembly artifact. Considering the D-genome alignments and Bionano data presented above, it was more likely an assembly artifact that was mistakenly included in the final genome sequence.
Structural Variations between the D-genomes
Comparison of D5 and D10 revealed several structural differences between the two genomes. The largest structural variant is an inversion on D10_08. Inversions were manually curated (20 kb in length, i.e. two adjacent alignment blocks of 10 kb each).
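The 20 kb criterion can be illustrated with a short sketch that flags candidate inversions for manual review. This is not the procedure used in this study. It assumes alignment blocks are available as (start, end, strand) records on one chromosome, treats 20 kb as a minimum length, and merges only directly adjacent reverse-strand blocks; `Block`, `inversion_candidates`, `min_len` and `max_gap` are all hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    start: int   # reference start (0-based, inclusive)
    end: int     # reference end (exclusive)
    strand: str  # "+" or "-" orientation relative to the other genome


def inversion_candidates(
    blocks: List[Block], min_len: int = 20_000, max_gap: int = 0
) -> List[Tuple[int, int]]:
    """Return (start, end) spans of merged reverse-strand runs >= min_len.

    Assumption: min_len is a minimum threshold and max_gap is the largest
    distance allowed between blocks that are still counted as adjacent.
    """
    candidates: List[Tuple[int, int]] = []
    run = None  # current run of reverse-strand blocks as [start, end]

    def flush() -> None:
        # Keep the current run only if it is long enough.
        if run is not None and run[1] - run[0] >= min_len:
            candidates.append((run[0], run[1]))

    for b in sorted(blocks, key=lambda blk: blk.start):
        if b.strand == "-" and run is not None and b.start - run[1] <= max_gap:
            run[1] = max(run[1], b.end)   # extend the current run
        else:
            flush()                       # close any open run
            run = [b.start, b.end] if b.strand == "-" else None
    flush()
    return candidates


if __name__ == "__main__":
    example = [
        Block(0, 10_000, "+"),
        Block(10_000, 20_000, "-"),
        Block(20_000, 30_000, "-"),
        Block(30_000, 40_000, "+"),
        Block(40_000, 50_000, "-"),  # single 10 kb block: below threshold
    ]
    print(inversion_candidates(example))
```

In the example, the two adjacent reverse-strand 10 kb blocks are merged into one 20 kb candidate, while the isolated 10 kb block is dropped. Candidates flagged this way would still need manual curation.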