Methods

Dovetail Chicago library preparation and sequencing:

A Chicago library was prepared as described previously [42]. Briefly, ~500ng of high molecular weight (HMW) gDNA (mean fragment length = 75 kbp) was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was digested withDpnII , the 5’ overhangs filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, crosslinks were reversed, and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350 bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq X to produce 500 million 151 base paired-end reads.

Dovetail Hi-C library preparation and sequencing:

A Dovetail Hi-C library was prepared in a similar manner as described previously [43]. Chromatin was fixed in the nucleus with formaldehyde, extracted, and DpnII digested. 5’ overhangs were filled with biotinylated nucleotides, and then free blunt ends were ligated. Crosslinks were reversed for DNA purification from proteins. Purified DNA was treated to remove biotin outside of ligated fragments. The DNA was then sheared to ~350 bp, and libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotinylated fragments were isolated with streptavidin beads before PCR enrichment of libraries and sequenced on an Illumina HiSeq X to produce 531 million 2x151 bp paired end reads.

Genome Assembly

The H. glycines genome was assembled with Falcon using previously deposited Pacbio sequencing (SRX2692203 - SRX2692222). Falcon unzip 0.4.0 [44] was then used to reduce the heterozygosity in the assembly. Dovetail Genomics scaffolded this assembly with Chicago and Hi-C reads using a modified SNAP read mapper (http://snap.cs.berkeley.edu) and an iterative assembly with HiRise. This Dovetail assembly was further scaffolded with the previously mentioned Pacbio subreads using Sspace-longread v1.1 [45]. Gmcloser 1.6.2 [46] was used to fill gaps using PacBio circular consensus (CCS) reads. Pilon 1.22 [47] was then used to polish the assembly using ~142 million 260bp PE Illumina reads, which were processed with Trim Galore 0.4.5 [48], Hisat2 2.1.0 [49], and Samtools 1.9 [50]. The genome was then scaffolded using the previously mentioned Hi-C reads using Juicer 1.7.6 [43], 3D-DNA 180419 [51], and manually corrected using Juicebox 1.9.8 [52]. Once pseudomolecules were assembled, the genome was polished with Pilon 1.23 [47] using CCS reads, then with the 142 million Illumina PE reads, and then with 38 iterations of polishing using Pacbio CCS reads with Pilon 1.23. A final round of polishing was performed with the 142 million Illumina 250bp reads (SRR8381095) using Pilon 1.23 [47].

Gene Prediction

Genes were predicted using a Mikado 1.24 pipeline [53] that picked consensus transcripts from seven transcriptome assemblies and gene predictions. First, the genome was masked using RepeatModeler 1.0.11 [54] and RepeatMasker 4.0.9 [55]. Previously published Illumina RNA-seq reads (SRX3339090- SRX3339098) were processed with Trim Galore 0.4.5 [48], Hisat2 2.1.0 [49], and Samtools 1.9 [50] on both a masked and an unmasked genome. Previously published NCBI expressed sequence tags (downloaded 06-17-19) and IsoSeq (SRX3702373) were aligned using Gmap (version 2018-03-25) [56]. These data were utilized with Braker 2.1.0 [57] using all three data sources, annotating both an unmasked assembly and a masked assembly to compensate for parasitism-related CNV genes. Transcriptomes were assembled using the guidance of a masked genome with Trinity 2.3-.2 [58, 59], Class2 2.1.7 [60], Stringtie 1.3.4a [61, 62], and Spades 3.13.1 [63]. This first Mikado prediction was utilized in a second round of Mikado, supplemented with masked braker prediction and a Maker 2.31.10 [64] gene annotation from a 368-scaffold version of the assembly. All resulting predictions from the second round of Mikado were collapsed into gene loci via using shared intron/exon borders with Cufflinks gffread (Cufflinks 2.2.1) [65].
The Maker annotation mentioned previously was run over four rounds, with Maker’s internal algorithm first, then Augustus 3.2.1, then Snaphmm 2006-07-28, followed by GeneMark-ES 4.32. Repeatmodeler 1.0.11 and RepeatMasker 4.0.9 were used to perform the softmasking used in the annotation. Maker utilized all transcripts and proteins from related species genomes [66, 67] and UniProt [68], including:Bursaphelenchus xylophilus [69] , Ditylenchus destructor , Globodera pallida [70] , Globodera rostochiensis [71] , Globodera ellingtonae [72] ,Meloidogyne floridensis [73] , Meloidogyne hapla [74], Meloidogyne incognita [75] ,Parastrongyloides trichosuri , Rhabditophanes_KR3021 ,Strongyloides papillosus , Strongyloides ratti ,Strongyloides stercoralis ; all H. glycines ESTs from NCBI [67], and a Braker 1.9 [76] annotation on this unmasked assembly using published RNA-seq (SRX3339090- SRX3339098).

Functional Gene Annotations

Gene annotations were compiled from Interproscan 5.27-66.0 [77] and BLAST [78] searches to NCBI NT and nr databases downloaded on 10-23-19 [67], as well as swissprot/uniprot databases downloaded on 12-09-2019 [68]. Genes encoding transposable element-associated proteins were identified using Bedtools 2.27.1 [79] with exon overlaps to Repeatmodeler-predicted transposable elements.

Differential Gene Expression

The strandedness of the RNA-seq was evaluated with RseQC V4.0 [80, 81], followed by alignment to the genome with HiSat (2.2.0) [49], and converted to bam with Samtools (1.1.0) [50]. Read counts were calculated with FeatureCounts from Subread package (1.6.0) [82], followed by Deseq2 (1.20.0) [83] with P-value cutoffs at 0.05 to determine differential expression between the samples.

BUSCO analysis

Universal single copy orthologous genes were assessed using BUSCO 3.0.2 [84-86] on both the predicted proteins and the genome against the nematoda ODB9 dataset. Missing genes were verified with BLAST [78] to the predicted protein sequences using a 0.01 e-value and 1.6x -0.4x length cutoff (S table 1).

Effector gene mapping

Effector proteins were mapped to the predicted proteome using Diamond 0.9.23 [87]. Effector genes were mapped to the genome using Gmap (2018-03-25) [56]. Secreted proteins were identified with SignalP 5.0 [88] on the predicted proteome.

Repeat Prediction

Multiple repeat predictions were pursued to finely detail genome structure. To comprehensively predict the structure of transposable elements in the genome with Extensive de-novo TE Annotator, EDTA 1.7 [89]. Tandem repeat finder 4.0.9 [90] was run on the genome to identify tandem repeats. A repeat prediction sensitive to copy number variation was also pursued with RepeatModeler 1.0.11 [54] and RepeatMasker 4.0.9 [55].

Synteny

Genome alignments were performed using Mummer3 [91] and merged for display in Circos 0.69-6 [92] using Bedtools merge [79]. By inferring gene orthology from primary mapping sites of the predicted transcripts from our genome with Gmap 2018-03-25 [56], we inferred gene-based synteny with iAdHoRe 3.0.01 [93].