Raw data pre-processing and genome size estimation
Quality assessment of the raw DNA Illumina sequence data was performed
with FastQC v0.11.8 (Andrews et al., 2010). Low quality reads and
adapters were removed using Trimmomatic v0.39 (Bolger et al., 2014). The
reads were scanned with a 4-base sliding window and cut where the average
Phred quality within the window fell below 15. Leading and trailing bases
with quality scores below 10 were also removed. Reads shorter than 75 bp
and reads with an average quality score below 30 were discarded. The
same process was applied to the RNASeq reads.
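The filtering rules described above can be illustrated with a minimal Python sketch. It is not Trimmomatic's implementation: the function names are hypothetical, the order in which the steps are applied is an assumption (the text does not state it), and qualities are assumed to be Phred+33 encoded.

```python
def phred33(qual_string):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in qual_string]


def trim_read(quals, window=4, window_q=15, end_q=10):
    """Return the (start, end) slice of a read kept after trimming.

    Simplified illustration: cut at the first window whose mean quality
    is below window_q, then strip leading/trailing bases below end_q.
    """
    end = len(quals)
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < window_q:
            end = i  # drop the failing window and everything after it
            break
    start = 0
    while start < end and quals[start] < end_q:
        start += 1
    while end > start and quals[end - 1] < end_q:
        end -= 1
    return start, end


def keep_read(quals, start, end, min_len=75, min_avg=30):
    """Apply the length and mean-quality filters to the trimmed read."""
    kept = quals[start:end]
    return len(kept) >= min_len and sum(kept) / len(kept) >= min_avg


# Hypothetical read: 80 good bases, a low-quality stretch, then 10 good bases.
example = [40] * 80 + [5] * 4 + [40] * 10
start, end = trim_read(example)
print(start, end, keep_read(example, start, end))
```

In this toy read the window first fails where three of its four bases fall in the low-quality stretch, so the read is cut there and the remaining prefix passes both filters.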
Adapter trimming and length filtering of the basecalled ONT data were
performed using Porechop v0.2.4 (https://github.com/rrwick/Porechop)
with default parameters plus the --discard_middle option, which discards
reads with internal adapters.
The genome size was estimated from the Illumina genomic sequencing data
using the k-mer histogram method implemented in KmerGenie v1.7051
(Chikhi and Medvedev 2014).
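The idea behind a k-mer histogram estimate can be sketched in a few lines of Python. This is a deliberately simplified illustration, not KmerGenie's statistical model: it divides the total number of non-erroneous k-mer occurrences by the multiplicity at the histogram peak. The function names and the error cutoff `min_multiplicity` are hypothetical, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_multiplicity=2):
    """Estimate the number of genomic k-mers from a k-mer histogram.

    k-mers below min_multiplicity are treated as sequencing errors
    (an assumption of this sketch) and excluded from the estimate.
    """
    solid = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers above the error cutoff")
    peak = max(solid, key=lambda m: solid[m])      # modal k-mer depth
    total = sum(m * n for m, n in solid.items())   # total k-mer occurrences
    return total / peak


# Toy data: a 16 bp "genome" sequenced ten times without errors.
genome = "ACGTTGCAAGGCTTAC"
hist = kmer_histogram([genome] * 10, k=5)
print(estimate_genome_size(hist))
```

On real data the histogram peak is broadened by sampling noise and repeats, which is why dedicated tools fit a model to the histogram rather than reading off a single peak.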
De novo genome assembly
To build the genome assembly, the long ONT reads were used to construct
an initial de novo assembly, and the Illumina reads were then used in
the polishing stages (Figure 1). The initial assembly was constructed
with Flye v2.6 (Kolmogorov et al. 2019), a repeat graph assembler. The
assembly was
evaluated by assessing: (1) the N50 sizes of contigs, using QUAST v5.0.2
(Gurevich et al. 2013), and (2) a gene completeness score using BUSCO
v3.1.0 (Simão et al. 2015) against the Actinopterygii ortholog dataset
v9, with default parameters.
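For reference, the contig N50 used as the contiguity metric can be computed as in the following minimal Python sketch. It follows the standard definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly); the function name is hypothetical and this is not QUAST's code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assembly of five contigs totalling 20 bp.
print(n50([2, 3, 4, 5, 6]))
```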
The produced assembly was polished with two rounds of Racon v1.4.3
(Vaser et al. 2017), using the preprocessed long reads mapped against
the assembly with Minimap2 v2.17 (Li 2018). Further polishing was
performed with Medaka v0.9.2
(https://github.com/nanoporetech/medaka) and the final polishing
was completed using Pilon v1.23 (Walker et al. 2014) after mapping the
Illumina reads against the partially polished assembly with Minimap2
v2.17.
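The iterative structure of the long-read polishing can be sketched as follows. The exact command lines are not given in the text, so everything here is an assumption: the file names are hypothetical, the Minimap2 preset `map-ont` is assumed for ONT reads, and the helper only builds the commands for the two Racon rounds without running them. The Medaka and Pilon steps are omitted.

```python
def racon_round_commands(reads, draft, rounds=2, prefix="racon"):
    """Build (argv, stdout_path) pairs for successive Racon rounds.

    Each round maps the reads to the current assembly with minimap2
    (PAF written to stdout) and feeds reads, overlaps and assembly to
    racon, whose stdout becomes the input of the next round.
    Nothing is executed here.
    """
    steps = []
    current = draft
    for r in range(1, rounds + 1):
        paf = f"{prefix}_round{r}.paf"          # hypothetical file name
        polished = f"{prefix}_round{r}.fasta"   # hypothetical file name
        steps.append((["minimap2", "-x", "map-ont", current, reads], paf))
        steps.append((["racon", reads, paf, current], polished))
        current = polished
    return steps


# Hypothetical inputs: preprocessed ONT reads and the Flye assembly.
for argv, out in racon_round_commands("ont.fastq", "flye.fasta"):
    print(" ".join(argv), ">", out)
```

Returning the commands as data keeps the sketch runnable without the tools installed; each pair could be passed to `subprocess.run` with `stdout` redirected to the listed file.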