The Merge module
The Merge module implements a modified overlap-layout-consensus
algorithm (Li et al. 2012) to merge contigs according to the
overlaps between them. The procedure is as follows:
- Make a copy. Because each input contig may lie on either the sense or
the antisense strand, a reverse-complemented copy of every input contig
is added, so that each contig is present in the sense orientation in at
least one of its two copies.
- Self-to-self alignment. Run blastn (BLAST 2.9.0) with default options
to align the contig set against itself.
- Analyze the BLAST output. Only forward overlaps are kept, that is,
matches in which the query and subject sequences are both on the sense
strand and the match joins the end of the upstream sequence to the
beginning of the downstream sequence. In addition, nested overlaps are
discarded to remove short redundant contigs.
- Generate a directed graph. A directed graph is built from the overlap
information above. The nodes of the graph represent the contigs and the
directed edges represent overlaps between contigs. Since the contigs are
present in two copies with opposite orientations, the graph ideally
contains two major circles.
- Cut edges. Transitively inferable edges, non-branching stretches and
alternative paths that go through the same node are removed. Edges that
connect the two circles are also removed. The removed edges represent
incorrect overlaps, which arise mainly from repeated sequences,
especially the inverted repeat (IR) regions of plastid genomes.
- Extract the full circle found in the graph. Following the overlap
information along each circle, the program merges the contigs to
generate circular sequences.
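The first three steps above can be illustrated with a minimal Python
sketch. It is not the module's actual code: the names
with_reverse_copies, Hit and classify are hypothetical, the hit fields
are assumed to have been parsed from the BLAST output beforehand
(1-based inclusive coordinates plus sequence lengths), and the slack
tolerance is an assumption that the text does not specify.

```python
from typing import Dict, NamedTuple, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_copies(contigs: Dict[str, str]) -> Dict[str, str]:
    """Add a reverse-complemented copy of every contig (hypothetical '_rc' suffix)."""
    out = dict(contigs)
    for name, seq in contigs.items():
        out[name + "_rc"] = reverse_complement(seq)
    return out


class Hit(NamedTuple):
    """One pairwise match; coordinates are 1-based and inclusive (assumed layout)."""
    query: str
    subject: str
    qstart: int
    qend: int
    sstart: int
    send: int
    qlen: int
    slen: int


def classify(hit: Hit, slack: int = 0) -> Optional[Tuple[str, str, int]]:
    """Return (upstream, downstream, overlap_length) for a forward
    end-to-start overlap, or None if the hit should be discarded."""
    if hit.query == hit.subject:
        return None  # trivial self-hit
    if hit.qstart > hit.qend or hit.sstart > hit.send:
        return None  # descending coordinates: one sequence is on the antisense strand
    q_at_start = hit.qstart <= 1 + slack
    q_at_end = hit.qend >= hit.qlen - slack
    s_at_start = hit.sstart <= 1 + slack
    s_at_end = hit.send >= hit.slen - slack
    if (q_at_start and q_at_end) or (s_at_start and s_at_end):
        return None  # nested overlap: one contig lies entirely inside the other
    if q_at_end and s_at_start:
        return (hit.query, hit.subject, hit.send - hit.sstart + 1)
    if s_at_end and q_at_start:
        return (hit.subject, hit.query, hit.qend - hit.qstart + 1)
    return None  # internal match that does not join two contig ends
```

In this sketch a nested hit is simply dropped; removing the contained
contig itself, as the text describes, would be a separate step.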
If the input contigs share enough head-to-tail overlaps, a complete
plastid genome is likely to be assembled, and the program may generate
two circular sequences in opposite orientations. Finally, the
Validation module is called to test the output.
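The graph steps of the procedure (edge cutting and circle extraction)
can be sketched in the same spirit. This is a simplified illustration
under stated assumptions, not the module's implementation: the function
names are hypothetical, only transitive edges implied by a two-step path
are removed, a circle is accepted only if every node on it has exactly
one successor, and the handling of edges between the two circles and of
IR-induced branches is left out.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (upstream contig, downstream contig)


def transitive_reduction(edges: Set[Edge]) -> Set[Edge]:
    """Drop every edge (a, c) for which edges (a, b) and (b, c) both exist."""
    succ: Dict[str, Set[str]] = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    kept = set()
    for a, c in edges:
        inferable = any(c in succ.get(b, ()) for b in succ[a] if b != c)
        if not inferable:
            kept.add((a, c))
    return kept


def find_circle(edges: Set[Edge], start: str) -> Optional[List[str]]:
    """Follow unique successors from start; return the node order if the
    walk closes back on start, else None."""
    succ: Dict[str, List[str]] = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    path, seen, node = [start], {start}, start
    while True:
        nxt = succ.get(node, [])
        if len(nxt) != 1:
            return None  # dead end or unresolved branch
        node = nxt[0]
        if node == start:
            return path
        if node in seen:
            return None  # loop that does not include the start node
        seen.add(node)
        path.append(node)


def merge_circle(path: List[str], seqs: Dict[str, str],
                 overlap: Dict[Edge, int]) -> str:
    """Concatenate contigs along a circle, trimming from each contig the
    prefix it shares with its predecessor (the first contig is trimmed
    against the last one, which closes the circle)."""
    parts = []
    for i, name in enumerate(path):
        prev = path[i - 1]  # i == 0 wraps around to the last contig
        parts.append(seqs[name][overlap[(prev, name)]:])
    return "".join(parts)
```

For example, with three contigs a, b and c that overlap end to start by
three bases each, plus a redundant edge from a to c, the reduction
removes the redundant edge and the merged sequence has a length equal to
the sum of the contig lengths minus the sum of the overlaps.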
Results