In 2012, the first reference-quality cotton genome was completed through a large collaborative effort that combined next-generation sequencing technologies with targeted Sanger sequencing (Paterson et al., 2012). Gossypium raimondii, a Mesoamerican diploid species, was selected to represent the cotton genus because of its small genome size and its relationship to the domesticated polyploid species (Chen et al. Plant Physiol. 2007 Dec; 145(4): 1303–1310). This genome has since been widely used by the cotton research community, garnering ~500 citations across a wide spectrum of research. While it has been a reliable resource for over 7 years, the advent of third-generation sequencing and chromosome conformation capture (3C) technologies makes it possible to improve the accuracy of key reference genomes by physically associating sequences separated by greater distances.
One justification for originally sequencing G. raimondii, i.e., its phylogenetic relatedness to the domesticated allopolyploid species and the recruitment of genetic factors from that subgenome during domestication, also makes G. raimondii and its close relatives potential genetic sources for cotton breeding. G. turneri is a species closely related to G. raimondii and is found in Sonora, Mexico (Fryxell 1978, Madroño, Vol. 25, No. 3, pp. 155-159). Like that of G. raimondii, the fiber of G. turneri is unspinnable; however, G. turneri has phenotypic characters with agronomic potential, e.g., caducous bracts, insect resistance, and abiotic stress tolerance (https://link.springer.com/article/10.1007/s10681-018-2118-2; DOI: 10.5772/58387). The two species are generally similar, both having n=13 and relatively small genome sizes (841 Mbp versus 880 Mbp in G. turneri and G. raimondii, respectively); however, they are genetically distinct (Fst=0.76 by SSRs) (dx.doi.org/10.1139/cjb-2012-0192), and a previously published draft genome suggests that gene gain and loss may be elevated in this species (Grover 2018).
Here we describe the resequencing of G. raimondii using PacBio, Bionano, and Hi-C technologies. We identified three significant assembly errors in the initially published G. raimondii genome sequence. We also sequenced the genome of G. turneri using PacBio and Dovetail Hi-C libraries. Comparison of the two genomes provides a firmer foundation for understanding the D-genome contribution to polyploid cottons...
Methods & Materials
Plant material and sequencing
Leaf tissue of mature G. raimondii (accession D5-4) and G. turneri (accession D10-3) plants was collected at the Brigham Young University (BYU) Greenhouse, and DNA was extracted using CTAB techniques (Kidwell et al. 1992). DNA concentration was measured with a Qubit Fluorometer (ThermoFisher, Inc.). Sequencing libraries were constructed according to PacBio recommendations at the BYU DNA Sequencing Center (DNASC). Fragments >18 kb were selected for sequencing using a BluePippin (Sage Science, LLC). Prior to sequencing, the size distribution of fragments in the libraries was evaluated using a Fragment Analyzer (Advanced Analytical Technologies, Inc.). Eight and eleven PacBio cells were sequenced from the G. raimondii and G. turneri libraries, respectively, on the Pacific Biosciences Sequel system. For both genomes, the raw PacBio sequencing reads were assembled with Canu v1.6 using default parameters (Koren, 2017).
Hi-C libraries were constructed from G. raimondii leaf tissue at Northeast Normal University, China. Sequencing was performed at Annoroad Gene Technology Co., Ltd (Beijing, China). The G. raimondii Hi-C data were mapped to the previous genome sequence of G. raimondii using ___. They were also mapped by Phase Genomics to the Canu contigs newly assembled from the PacBio reads. The Hi-C interactions (the association frequency between paired-end reads) were used as evidence of contig proximity and for scaffolding the contig sequences. An initial draft genome sequence of pseudochromosomes (PGA assembly) was created using a custom script from Phase Genomics.
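The use of Hi-C interactions as proximity evidence can be illustrated with a minimal sketch. This is not the Phase Genomics pipeline or its custom script. It only counts read pairs whose two ends map to different contigs and scales the counts by contig length, assuming the mapped pairs are already available as (contig, contig) tuples. The function names, the length normalization, and the toy data are all hypothetical.

```python
from collections import Counter


def count_links(pairs):
    """Count Hi-C read pairs that join each unordered pair of distinct contigs."""
    links = Counter()
    for a, b in pairs:
        if a != b:  # pairs within one contig say nothing about contig adjacency
            links[tuple(sorted((a, b)))] += 1
    return links


def normalized_links(links, lengths):
    """Scale link counts by the product of the two contig lengths (in Mbp).

    Assumption: longer contigs attract more links by chance, so a simple
    length product is used here as the normalizer.
    """
    return {
        (a, b): n / ((lengths[a] / 1e6) * (lengths[b] / 1e6))
        for (a, b), n in links.items()
    }


def best_partner(contig, scores):
    """Return the contig with the highest normalized link score to `contig`."""
    candidates = {
        (b if a == contig else a): s
        for (a, b), s in scores.items()
        if contig in (a, b)
    }
    return max(candidates, key=candidates.get) if candidates else None


# Toy data (hypothetical): each tuple is one mapped read pair.
pairs = (
    [("ctg1", "ctg2")] * 40
    + [("ctg2", "ctg1")] * 10
    + [("ctg1", "ctg3")] * 5
    + [("ctg1", "ctg1")] * 100
)
lengths = {"ctg1": 2_000_000, "ctg2": 1_000_000, "ctg3": 1_000_000}

links = count_links(pairs)
scores = normalized_links(links, lengths)
print(best_partner("ctg1", scores))
```

In this toy example, ctg1 shares far more links per unit length with ctg2 than with ctg3, so ctg2 would be the preferred neighbor when ordering contigs.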
DNA was also extracted from young G. raimondii leaves following the Bionano Plant protocol for high-molecular-weight DNA. DNA was purified, nicked, labeled, and repaired according to Bionano standard operating procedures for the Irys platform. Two optical maps, one for each enzyme (BspQI and BssSI), were assembled using the IrysSolve pipeline on the BYU Fulton SuperComputing cluster (http://fsl.byu.edu). The optical maps were combined into a two-enzyme composite optical map, which was aligned to the PGA assembly using an in silico labeled reference sequence. Conflicts between the Bionano maps and the PGA assembly were identified manually in the Bionano Access software by comparing the mapped Bionano contigs to the Canu contigs (bed file) along the draft genome sequence. Conflicts between datasets were resolved by repositioning and reorienting Canu contigs in the PGA ordering files and then reconstructing the fasta sequence, provided that the optical map either supported the change or did not conflict with it (Durand 2016, Supp. Figure 1). Multiple iterations of mapping, conflict resolution, and draft sequence construction resulted in the final, new genome sequence of G. raimondii.
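The conflict-resolution step, repositioning and reorienting contigs in an ordering and then rebuilding the sequence, can be sketched as follows. This is a minimal illustration, not the scripts or file formats used for the PGA ordering files: the ordering is assumed to be a list of (contig name, orientation) pairs, the gap of N characters between contigs is an assumed convention, and all function names and sequences are hypothetical.

```python
# Complement table for DNA, including N and lowercase (soft-masked) bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip(ordering, name):
    """Return a new ordering with the orientation of one contig reversed."""
    return [
        (n, "-" if s == "+" else "+") if n == name else (n, s)
        for n, s in ordering
    ]


def move(ordering, name, new_index):
    """Return a new ordering with one contig moved to position `new_index`."""
    entry = next(e for e in ordering if e[0] == name)
    rest = [e for e in ordering if e[0] != name]
    rest.insert(new_index, entry)
    return rest


def build_pseudochromosome(contigs, ordering, gap=100):
    """Concatenate contigs in the given order and orientation.

    Contigs are separated by `gap` N characters (assumed placeholder size).
    """
    parts = []
    for name, strand in ordering:
        if strand not in ("+", "-"):
            raise ValueError(f"unknown orientation {strand!r} for {name}")
        seq = contigs[name]
        parts.append(reverse_complement(seq) if strand == "-" else seq)
    return ("N" * gap).join(parts)


# Toy example (hypothetical sequences): suppose the optical map indicates that
# ctgB belongs between ctgA and ctgC, and that ctgC is inverted.
contigs = {"ctgA": "AACC", "ctgB": "GGT", "ctgC": "TTTA"}
draft = [("ctgA", "+"), ("ctgC", "+"), ("ctgB", "+")]

revised = flip(move(draft, "ctgB", 1), "ctgC")
print(revised)
print(build_pseudochromosome(contigs, revised, gap=2))
```

Keeping the ordering separate from the contig sequences means each iteration of mapping and conflict resolution only edits a small list, and the fasta sequence is regenerated from it.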