2.2 ∣ Genomic sequencing and assembly
High-quality DNA was extracted from whole first-instar female larvae
using the QIAamp DNA Blood Mini Kit (Qiagen, Germantown, USA). Both
Illumina short-read and third-generation (PacBio) sequencing were used.
For Illumina sequencing, a library with an average insert size of
~350 bp was constructed and sequenced on an Illumina HiSeq 2500 platform at
Novogene (Novogene, Beijing, China). For third-generation sequencing, a
library with an average insert size of ~20 kb was constructed.
The libraries were sequenced on the PacBio Sequel platform with Sequel
SMRT Cells 1M v2 at Novogene.
Raw reads in FASTQ format were first processed with in-house Perl
scripts. Clean reads were obtained by removing reads containing adapter
sequence, reads containing poly-N, and low-quality reads from the raw
data. The Q20, Q30 and GC content of the clean data were calculated at
the same time. All downstream analyses were based on the high-quality
clean data. The genome was assembled with the short-read assembler
SOAPdenovo2. A de Bruijn graph was built by
splitting the reads from the short-insert-size libraries (<1 kb) into
k-mers, without using pairing information. After a
series of graph simplifications, the reads were assembled into contigs.
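As a rough illustration of the contig-building step just described, the following Python sketch builds a de Bruijn graph from k-mers and reads off unambiguous paths as contigs. It is a toy model under stated assumptions, not the SOAPdenovo2 implementation: the function names `build_de_bruijn` and `unambiguous_paths` and the choice of `k` are hypothetical, and reverse-complement handling and the graph simplifications mentioned in the text are omitted.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer.

    Reads are treated independently, so no pairing information is used.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:  # skip k-mers containing ambiguous bases
                continue
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unambiguous_paths(graph):
    """Return the sequences spelled by maximal non-branching paths.

    Isolated cycles are ignored in this simplified sketch.
    """
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    nodes = set(graph) | set(indeg)

    def one_in_one_out(node):
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if one_in_one_out(node):
            continue  # interior of a path, not a start point
        for succ in sorted(graph.get(node, ())):
            path = node + succ[-1]
            while one_in_one_out(succ):
                succ = next(iter(graph[succ]))
                path += succ[-1]
            contigs.append(path)
    return contigs


# Two overlapping toy reads collapse into a single contig.
toy_graph = build_de_bruijn(["ATGGCGT", "GGCGTGC"], k=4)
print(unambiguous_paths(toy_graph))
```

In this toy case every node has at most one predecessor and one successor, so the graph is a single linear path; real data contain branches caused by repeats and sequencing errors, which is why simplification is needed before contigs are emitted.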
All available paired-end reads were realigned onto the contig sequences
to construct links between contigs. A link was removed if the weight of
its supporting paired-end relationships was too low to be reliable. We
simplified the contig linkage graph by subgraph linearization, that is,
by extracting unambiguously linear paths. The
scaffolding process was iterated in order of estimated insert size
using our in-house scaffolder GOBOND. Finally, to fill the
intra-scaffold gaps, GapCloser 1.12 was used to perform a local
assembly of the reads located in each gap region, identified as reads
whose other end mapped uniquely to a contig.
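The link filtering and subgraph linearization used during scaffolding can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the behaviour of GOBOND or SOAPdenovo2: the function names `filter_links` and `linear_chains` and the threshold `min_support` are hypothetical (the text does not state the weight cut-off), and contig orientation, gap-size estimation and iteration over insert sizes are left out.

```python
from collections import Counter, defaultdict


def filter_links(pair_hits, min_support=3):
    """Count read pairs joining two different contigs and keep well-supported links.

    pair_hits: iterable of (contig_a, contig_b), one entry per read pair whose
    two ends map to the named contigs. min_support is an assumed cut-off.
    """
    counts = Counter(tuple(sorted(p)) for p in pair_hits if p[0] != p[1])
    return {link: n for link, n in counts.items() if n >= min_support}


def linear_chains(links):
    """Extract unambiguous linear paths from an undirected contig link graph."""
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)

    chains, seen = [], set()
    for start in sorted(adj):
        # Start only from chain ends (exactly one neighbour) not yet used.
        if start in seen or len(adj[start]) != 1:
            continue
        chain, prev, cur = [start], None, start
        seen.add(start)
        while True:
            nxt = [n for n in adj[cur] if n != prev]
            # Stop at a chain end, at a branching contig, or at a visited one.
            if len(nxt) != 1 or len(adj[nxt[0]]) > 2 or nxt[0] in seen:
                break
            prev, cur = cur, nxt[0]
            chain.append(cur)
            seen.add(cur)
        chains.append(chain)
    return chains


# Toy input: the c3-c9 link has a single supporting pair and is dropped.
hits = [("c1", "c2")] * 5 + [("c3", "c2")] * 4 + [("c3", "c9")]
kept = filter_links(hits, min_support=3)
print(linear_chains(kept))
```

Stopping at any contig with more than two neighbours keeps the sketch conservative: ambiguous joins are left unresolved rather than guessed.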