2.2 ∣ Genomic sequencing and assembly
High-quality DNA was extracted from whole first-instar female larvae using the QIAamp DNA Blood Mini Kit (Qiagen, Germantown, USA). Both Illumina and third-generation (PacBio) sequencing were used. For Illumina sequencing, a library with an average insert size of ~350 bp was constructed and sequenced on an Illumina HiSeq 2500 platform at Novogene (Beijing, China). For third-generation sequencing, a library with an average insert size of ~20 kb was constructed and sequenced on the PacBio Sequel platform with Sequel SMRT Cells 1M v2 at Novogene.
Raw reads in FASTQ format were first processed with in-house Perl scripts. Clean reads were obtained by removing reads containing adapter sequence, reads containing poly-N, and low-quality reads from the raw data. At the same time, the Q20, Q30 and GC content of the clean data were calculated. All downstream analyses were based on the high-quality clean data. The genome was assembled with a short-read assembler, the SOAPdenovo2 package. A de Bruijn graph was built by splitting reads from the short-insert-size libraries (<1 kb) into k-mers, without using pairing information. After a series of graph simplifications, the reads were assembled into contigs. All available paired-end reads were then realigned to the contig sequences to construct linkages between contigs. A linkage was removed if the weight of the paired-end relationships supporting it was unreliable. We used subgraph linearization to simplify the contig linkage graph by extracting unambiguous linear paths. Scaffolding was iterated in order of estimated insert size using our in-house scaffolder GOBOND. Finally, to fill intra-scaffold gaps, GapCloser 1.12 was used to perform a local assembly of reads placed in the gap region, whose other end mapped uniquely to a contig.