II. MATERIALS AND METHODS

IIA. CROSSING AND SAMPLE COLLECTION

Gynogenetic double haploids were produced by fertilizing eggs with UV irradiated sperm, then pressure shocking embryos immediately following the first mitotic division (as described in Thorgaard et al. 1983; Limborg et al. 2016). Double haploid (DH) offspring were created at Pendill’s Creek National Fish Hatchery using eggs and sperm collected from captive adult Lake Trout from the Seneca Lake brood stock. Due to low survivorship of DH offspring (Komen & Thorgaard 2007), we tested multiple UV and pressure shock treatments on eggs from five different females. Batches of 900 eggs from each female were fertilized with sperm that was irradiated for 140, 280, or 1,260 seconds. Each batch was then split and sub-batches were pressure shocked at 11,000 PSI for five minutes at either 6.5, 7, 7.5, 8, 8.5, 9, 9.5, or 10 hours post-fertilization. One batch of 900 eggs from each female was also exposed to a control treatment which involved no sperm irradiation or pressure shock. Embryos were incubated in heath trays at ambient temperature until eye-up stage (E36 per Balon 1980), with dead embryos being removed from trays on a daily basis. Individuals surviving past post-embryo stage (sensuMarsden et al. 2021) were euthanized using a lethal dose of tricaine methanesulfonate (MS-222) and flash frozen in liquid nitrogen. Prior to sequencing and assembly, we verified that the DH individual chosen for sequencing was completely homozygous at 15 microsatellite loci that are typically highly heterozygous in the Seneca Lake hatchery population (Valiquette et al. 2014). One of the two individuals was selected for high molecular weight DNA extraction and long-read sequencing.

IIB. LABORATORY METHODS

A long-read sequencing library was prepared for the selected individual using the SMRTbell Template Prep Kit 1.0 (Pacific Biosciences, Menlo Park, California), with the optional DNA Damage Repair step after size selection. Size selection was made for >10 kb using a Blue Pippin instrument (Sage Science, Beverly, Massachusetts) according to the manufacturer recommended protocol for 20kb template preparation. 5ug of concentrated DNA was used as input for the library preparation reaction. Library quality and quantity were assessed using a genomic DNA Tape Station assay (Agilent, Santa Clara, California), as well as Broad Range and High Sensitivity Qubit fluorometric assays (Thermo Fisher, Waltham, Massachusetts). Single-Molecule Real Time sequencing was performed on the Pacific Biosciences Sequel instrument at the McGill Genome Centre (McGill University, Montreal, Canada, https://www.mcgillgenomecentre.ca/) using an on-plate concentration ranging from 1.5-7.5pM and the Sequel Sequencing Kit 2.0 with diffusion loading. 38 SMRTCells were run with 600-minute movies and two SMRTCells were run with 1200-minute movies.
Hi-C proximity ligation libraries were generated to aid with assembly scaffolding. Two libraries were prepared from spleen and muscle tissue using library preparation kits manufactured by Kapa Biosystems (Wilmington, Massachusetts) and Lucigen (Middleton, Wisconsin), respectively. Each Hi-C library was spiked into a portion of an Illumina HiSeq lane in order to assess how effectively reads could be mapped against the draft contig assembly. Genpipes version 3.1.5 (Bourgey, Dali et al. 2019) and HiCUP version 0.7.2 (Wingett, Ewels et al. 2015) were used to map Hi-C sequencing reads. The Hi-C library prepared using muscle-derived DNA and prepared using the Arima-Hi-C Lucigen Kit (Arima Genomics, San Diego, CA), was selected for further sequencing. This kit employs a restriction enzyme cocktail that digests chromatin at N^GATC and G^ANTC sequence motifs.
DNA was also extracted from four Lake Trout from the Seneca Lake, Isle Royale, and Green Lake hatchery broodstocks using MagAttract HMW DNA extraction kits (Qiagen, Hilden, Germany) for the purpose of generating Illumina shotgun sequencing data. Sequencing reads from Seneca origin individuals were later used for contig polishing and correction (described below in Assembly and Scaffolding ). Libraries were prepared for these individuals using 100ng of input DNA and the NEBNext Ultra Library Preparation Kit for Illumina (New England Biolabs, Ipswich, Massachusetts). Libraries sheared to approximately 400 bp using a Covaris M220 Ultrasonicator, amplified for eight cycles, and quantified using Quant-It Picogreen dsDNA assays (Thermo Fisher, Waltham, Massachusetts) run in triplicate. Fragment size was assessed using a genomic DNA Tape Station assay (Agilent, Santa Clara, California). Libraries were multiplexed in equal concentrations and sequenced in two HiSeqX lanes using paired end 2x150 format by the Novogene Corporation (Beijing, China).

IIC. ASSEMBLY AND SCAFFOLDING

Assembly was carried out using the polished_falcon_fat assembly workflow run using the SMRT Analysis v3.0 pbsmrtpipe workflow engine provided with an installation of SMRT Link v5.0 (smrtlink-release_6.0.0.47841; https://github.com/PacificBiosciences/pbsmrtpipe). Read metadata were extracted using the SMRT Analysis v3.0 dataset tool with the merge option. Sequencing read metadata, pipeline settings, and an output directory were specified for the polished_falcon_fat pipeline option. Default assembly settings were used except genome size (HGAP_GenomeLength_str) was set to 3 gigabases (GB), seed coverage (HGAP_SeedCoverage_str) was set to 40X, and the minimum read length to use a read as a seed (HGAP_SeedLengthCutoff_str) was set to 1000. Multiple compute settings were also changed. The resulting assembly settings file, read metadata file, and commands used to run the pipeline are available in Supplemental Material 1 -Assembly Parameters).
The polished_falcon_fat workflow uses FALCON assembly algorithm (Chin et al. 2013) and the Quiver/Arrow consensus tool (https://github.com/PacificBiosciences/GenomicConsensus) to generate a polished contig assembly. The Falcon method operates in two phases: First, overlapping sequence reads were compared to generate accurate consensus sequences with read N50 greater than 10.9Kb. Next, overlaps between the corrected longer reads were used to generate a string graph. The graph was reduced so that multiple edges formed by heterozygous structural variation were replaced to represent a single haplotype. Contigs were formed by using the sequences of nonbranching paths. Two supplemental graph cleanup operations were applied to improve assembly quality by removing spurious edges from the string graph: tip removal and chimeric duplication edge removal. Tip removal discards sequences with errors that prevent 5′ or 3′ overlaps. Chimeric duplication edges may result from the raw sequence information or during the first sequence cleanup step and artificially increase the copy number of a duplication. In a second and final workflow stage, the polished_falcon_fat workflow used the Arrow consensus tool to perform error correction on the assembly and generate an initial polished assembly. The resulting contigs were passed through a second round of error correction using Pilon in order to resolve SNP, indel, and local assembly errors before proceeding with scaffolding (https://github.com/broadinstitute/pilon). The Illumina paired-end sequencing dataset described above was used as input for Pilon after removing adapters and trimming reads using the sliding window approach implemented in Trimmomatic v0.32 (Bolger et al. 2014).
We adopted a multifaceted scaffolding approach leveraging information from Hi-C sequencing and a high-density linkage map for Lake Trout (Smith et al. 2020). Hi-C reads were mapped to Pilon corrected contigs with default setting using the Arima Genomics Mapping pipeline (Arima Genomics, https://github.com/ArimaGenomics/mapping_pipeline), which included four primary steps. First, forward and reverse reads were mapped to the reference genome using bwa version 0.7.17 (Li 2013) separately. Next, the 5’ end of the mapped reads were trimmed. Samtools version 1.9 (Li, Handsaker et al. 2009) was then used to filter reads with mapping quality (MAPQ) less than 10. Finally, Picard version 2.17.3 (https://broadinstitute.github.io/picard/) was used to add read group information and mark duplicate reads. The resulting BAM file was used as input for SALSA v2.2 (Ghurye et al. 2017) run with default settings (three iterations). We also tested Salsa2 using five iterations and compared results with those produced using default settings by calculating Spearman’s rank order correlation coefficients between the order of loci on the Lake Trout linkage map (Smith et al. 2020) and the order of loci on the 50 largest scaffolds. Linkage mapped RAD contigs were aligned to the reference assembly using minimap2 using the -asm5 option. RAD contigs with mapping qualities less than 60 were removed before calculating correlation coefficients using the cor function and the method argument set to “spearman.”
Additional scaffolding was carried out using Chromonomer v1.13 (Catchen et al. 2020). The assembly was initially scaffolded using default settings, which yielded chromosome length scaffolds with a high degree of concordance with the linkage map; however, structural differences between the linkage map and scaffolds were apparent on six chromosomes. In order to resolve these inconsistences, we aligned the full set of PacBio subreads to the assembly using Minimap2 (Li 2018) using the preset option for PacBio data. The resulting bam file was sorted, indexed, and per-base coverage was calculated for all positions using samtools depth with the –a option. We then ran a second round of Chromonomer using the –rescaffold, –depth, and depth_stdevs = 2 options, which allowed for gaps to be opened in contigs if the site-specific depth within a sliding window of 1000 base pairs was greater than 2 standard deviations from the mean, suggesting an assembly error. This resulted in an assembly with improved concordance with the linkage map; however, linkage group 41 still exhibited a large inversion relative to the scaffolds. We determined the approximate location of this assembly error by identifying the pair of linkage mapped loci for which the level of discordance between the linkage map and assembly was maximized. The scaffold was manually broken and reoriented using an existing gap that existed between these two loci.
Gaps were filled using PBJelly from PBSuite v15.8.24 (English et al. 2012). All PacBio reads were aligned to the draft assembly using Minimap2 using the -pb preset option and reads mapping within 5000 base pairs of a gap were retained for gap filling using bedtools intersect (Quinlan and Hall 2010). Retained reads were re-mapped with Blasr v5.3.2 (Chaisson et al. 2012) using the options –minMatch 11, –minPctIdentity 75, –bestn 1, –nCandidates 10, –maxScore -500, and –fastSDP. The “maxWiggle” argument was set to 100 kilobases (KB) for the PBJelly assembly stage in order to account for gaps of unknown length. After filling gaps, we corrected single nucleotide and short indel errors by running 3 iterations of Polca (distributed with MaSuRCA v. 3.4.2; Zimin and Salzberg 2020) using Illumina data from a Seneca strain female as input. Default settings were used except alignments overlapping gaps were removed from bam files using bedtools intersect (Quinlan and Hall 2010) prior to running the Polca variant calling step.
Illumina paired end data from the same individual used for genome polishing and PacBio data from one SMRTcell were aligned to the Arctic Char (Salvelinus alpinus ) mitome (RefSeq: NC_000861.1) in order to obtain reads useful for assembling the Lake Trout mitome. Reads were aligned using Minimap2 using the sr and map-pb present options for short-reads and long-reads, respectively. Reads aligning to the Arctic Char mitome were extracted from original fastq files using seqtk subseq (https://github.com/lh3/seqtk) and hybrid assembly was conducted using Unicycler v0.4.8 (Wick et al. 2017) using the settings –min_fasta_length 15000 and –keep 0. Unicycler implements a hybrid-assembly approach using Spades (Bankevich et al. 2012), SeqAn (Döring et al. 2008), and Pilon. First, Spades (v3.13.1 here) was used to assemble Illumina short-reads and contigs with graph coverage less than half the median coverage were removed due to potential contamination from the nuclear genome. Contigs were then scaffolded using long-reads and SeqAn (Döring et al. 2008) was used to generate gap consensus sequences. Finally, Pilon was used to resolve assembly errors using short-read alignments as input.

IIG. ASSEMBLY QUALITY CONTROL

We used multiple approaches to assess the accuracy, contiguity, and completeness of the genome assembly. First, we determined the proportion of the genome that was recovered in our assembly by comparing total assembly size with an estimate of genome size based on the distribution of k-mer frequencies from Illumina paired-end 2x150 data generated using DNA from a Seneca strain female. The frequency of all 19mers in the read data was calculated using the count function in Jellyfish v2.2.6 (Marcais and Kingsford 2012) with the options -m 19 and -C. K-mer counts were then exported to the histogram format using the histo function. This file was used as input for GenomeScope v1.0 (http://qb.cshl.edu/genomescope/; Vurture et al. 2017) with read length set to 150 bp and k-mer length set to 19.
Basic assembly statistics were calculated using the program summarizeAssembly.py from PBSuite v15.8.24 (English et al. 2012). Statistics included total assembly size, contig and scaffold N50s, and minimum and maximum contig and scaffold lengths. Assembly statistics were calculated with and without gaps. Contig and scaffold N50s and counts were obtained for 14 additional salmonid assemblies from NCBI for comparison. Single base consensus accuracy was estimated after each iteration of polishing with Polca.
Next, we calculated percentages of complete singleton, complete duplicated, fragmented, and missing Benchmarking Single Copy Orthologs (BUSCOs) for seven chromosome-level salmonid assemblies and compared these with scores for the Lake Trout assembly discussed here. These included genomes for Brown Trout (Salmo trutta ; GCA_901001165.1), European Whitefish (Coregonus sp. balchen ; GCA_902810595.1; De-Kayne et al. 2020), Atlantic Salmon (Salmo salar; GCA_000233375.4; Lien et al. 2016), Coho Salmon (Oncorhynchus kisutch ; GCA_002021735.1), Rainbow Trout (Oncorhynchus mykiss ; GCA_002163505.1; Pearse et al. 2019), Chinook Salmon (Oncorhynchus tshawytscha ; GCA_002872995.1; Christensen et al. 2018b), and Dolly Varden (Salvelinus malma ; GCA_002910315.1; Christensen et al. 2018a). It should be noted that the assembly originally produced for Arctic Char (GCA_002910315.1; Christensen et al. 2018a, referred to as the Dolly Varden assembly here) was later found to be from a Dolly Varden or potentially a Dolly Varden – Arctic Char hybrid (see Shedko et al. 2019 and Christensen et al. 2021). BUSCO scores were also calculated for the Northern Pike genome (Esox lucius ; GCA_000721915.3), a member of the order Salmoniformes that is commonly used as a pre-Ss4R outgroup species. BUSCO scores were calculated using BUSCO v4.0.6, the actinopterygii_odb10 database (created November 20th, 2019), and the -genome option.
Finally, we aligned the linkage mapped contigs from Smith et al. (2020) to the final assembly and calculated Spearman’s rank order correlation coefficients between physical mapping locations and the order of loci along linkage groups. Linkage mapped contigs were aligned to the reference assembly using minimap2 using the -asm5 preset parameters and the resulting sam file was filtered to exclude contigs with mapping qualities less than 60. Correlation coefficients were calculated using the cor function in R (R Core Team 2017) with the method argument set to “spearman.” Correlation coefficients were then converted to absolute values using the abs function in order to compare chromosomes and linkage groups with reversed orientations.

III. REPETITIVE DNA

A custom repeat library was created using RepeatModeler v2.0.1 (Flynn et al. 2020) and repeats were subsequently classified using RepeatClassifier (Smit et al. 2015). Repeats were then masked using RepeatMasker (Smit et al. 2015) and the output of RepeatMasker was used to determine the genome-wide abundance of different repeat families and the relative density of repeat types across chromosomes. The density of the most abundant repeat type (Tcl-mariner) was visualized across chromosomes using the R-package circlize (Gu et al. 2014; Figure 2).

IIK. HOMEOLOG IDENTIFICATION AND SYNTENY

We performed a self-vs-self synteny analysis using SyMap v5 (Soderlund et al. 2006; Soderlund et al. 2011) to identify Lake Trout homeologs resulting from the Salmonid specific autotetraploid event (Macqueen and Johnston 2014; Lien et al. 2016). Prior to running SyMap, we hard-masked the genome using RepeatMasker v4.1.0 (Smit et al. 2015) using our custom repeat library as input and RMblast as the search engine (-e ncbi). Nucmer was used for SyMap alignments and options were set to –min-dots = 30, top_n = 2, and merge_blocks = 1. We then used Symap to identify blocks of synteny between Lake Trout and Dolly Varden, Rainbow Trout, and Atlantic Salmon. Alignments were conducted using Promer, and we used the options min_dots = 30, top_n = 1, merge_blocks = 1, and no_overlapping_blocks = 1. Results from self-vs-self synteny analysis were visualized using the R-package circlize (Gu et al. 2014). Results from the species-vs-species synteny analysis were visualized using the Chromosome Explorer option in Symap v5 (Supplemental Material 4 – Syntenic Blocks and Between Species Circos Plots ).

IIF. RNA SEQUENCING AND GENE ANNOTATION

RNA samples were obtained from the offspring of Seneca Lake hatchery strain fish held within the Ontario Ministry of Natural Resources and Forestry (OMNRF) hatchery system. Offspring were produced using four males and four females in a full factorial mating cross, by dry-spawning anesthetized fish (anesthetic: 0.1 g L-1 MS-222; Aqua Life, Syndel Laboratories Ltd., B.C., Canada). Eggs (140 mL) were stripped from each female, divided evenly among four jars, and fertilized by pipetting milt directly onto them. After fertilization, embryos were transported to the Codrington Fish Research Facility (Codrington, Ontario, Canada) where they were transferred from the jars into perforated steel boxes with one family per box. These boxes were contained in flow-through tanks receiving freshwater at ambient temperature (5-6℃) and natural photoperiod under dim light. When the embryos fully absorbed their yolk sacs and were ready to feed exogenously (i.e. free embryos; approximately March 2016), 14 individuals from each family were randomly selected and split into two groups of seven, then transferred into one of four larger (200 L) tanks.
Tissue sample collection occurred between June 28 to August 9, 2016. Each fish was euthanized in a bath of 0.3 g L-1 of MS-222 and dissected to remove the whole liver. The liver was gently blotted on a lab wipe and stored in RNAlater (Invitrogen, Thermo Fisher Scientific) for 24-48 hours at room temperature. RNALater was pipetted from the liver tissue and the samples were stored at -80℃ until RNA isolation. Liver tissues were homogenized individually in 2 mL Lysing Matrix D tubes (MP Biomedicals) with 1 mL of Trizol reagent (Invitrogen, Thermo Fisher Scientific). RNA was extracted from the homogenate using phenol-chloroform extraction (Chomczynski & Sacchi, 2006). RNA was precipitated with RNA precipitation solution (Sambrook & Russel, 2001) and isopropanol, and washed with 75% ethanol. RNA samples were resuspended in nuclease-free water (Thermo Fisher Scientific). The purity and concentration of the RNA were initially determined using a NanoDrop-8000 spectrophotometer. RNA quality was also assessed using a Bioanalyzer (Agilent) and resulting RNA integrity numbers (RIN). All RNA samples met our minimum RIN threshold of 7.5.
RNA sequencing was performed over two years. Twenty-four samples were sent to The Centre for Applied Genomics (Sick Kids Hospital, Toronto, Ontario, Canada) in 2018, and another 30 samples were sent to the Centre d’expertise et de services Génome Québec (Montreal, Quebec, Canada; https://cesgq.com/) in 2020. cDNA libraries were produced by enriching the poly(A) tails of mRNA with oligo dT-beads using the NEBNext Ultra II Directional polyA mRNA Library Prep kit (New England Biolabs; Ipswich, Massachusetts). The group of 24 individuals was sequenced in 2.5 Illumina HiSeq 2500 lanes using 2X126 bp paired end reads. The additional thirty individuals were sequenced in three Illumina HiSeq 4000 lanes using 2X126 bp paired end reads. Data were deposited in sequence read archives associated with BioProject PRJNA682236. These sequencing reads, along with those from two previous RNAseq experiments (Goetz et al. 2010; Goetz et al. 2016), were used as input for NCBI’s Eukaryotic Genome Annotation Pipeline (Thibaud-Nissen et al. 2016).

IIE. RECOMBINATION RATES AND CENTROMERES

Sex averaged recombination rates were estimated across chromosomes using the sliding window interpolation approach implemented in MareyMap (Rezvoy et al. 2007). Restriction site associated DNA (RAD) contigs from the Lake Trout linkage map (Smith et al. 2020) were mapped to chromosomes using minimap2 using the -asm5 preset option and reads with mapping qualities less than 60 were removed. At this point, RAD loci overlapping centromere mapping intervals for each linkage group were extracted and the centromere center was considered to be the mean mapping position for centromere associated RAD tags. Centromere positions were visualized using the R-package circlize (Gu et al. 2014).
In order to remove contigs with anomalous mapping positions that could bias recombination rate estimates, we fit a loess model describing linkage map position as a function of physical position for each chromosome, extracted model residuals, and removed markers with residuals that were greater than one standard deviation from the mean. Loess models were fit using the loess function in R with the span argument set to 0.2 and the degree argument set to 2. The remaining markers were output to MareyMap format and were manually curated using MareyMap Online (Siberchicot et al. 2017). A sex averaged recombination map was calculated using sliding window interpolation and exported from the program (Supplemental Material 3 – Recombination Map ).