The Validation module
When the Assembly module finishes running, the Validation module is
called internally. Users can also invoke this module manually by
providing the sequence to be validated together with a local reference
file or taxon information.
The Validation module uses the “Rotate” algorithm to adjust the
structure of the target and reference sequences and then conducts a
collinearity analysis (Figure 1). It takes the following steps:
- Generate a full-length plastid genome. Extend the target sequence by
appending a copy of it to its end. Regardless of where the original
sequence starts, the doubled sequence contains one complete genome that
begins at the start of the LSC region rather than in a truncated region.
- Perform self-to-self alignment. Call blastn from BLAST 2.9.0
(Madden et al. 2019) with default options to perform a self-to-self
pairwise alignment.
- Locate IR regions. Analyze the BLAST output and find the longest match
that has at least three copies (four if the sequence does not start
with a truncated IR region); this match should be the IR.
- Determine the other regions. Use the boundaries of the IRs to locate
the LSC and SSC regions. Extract a complete plastid genome in the order
LSC-IR-SSC-IR, ensuring that the sequence starts at the 5′ terminus of
the LSC region (or at the 3′ terminus if the LSC region is
reverse-complemented).
- Adjust the reference. Repeat steps 1 to 4 on the reference sequence.
- Align target and reference sequences. Use BLAST to perform a pairwise
alignment of the adjusted target and reference sequences. The sequence
identity threshold of the alignment can range from 0 to 100%; in this
step, the low default threshold (0.7) allows BLAST to ignore mismatches
in the middle of an alignment and continue extending it. Hence, the
alignment emphasizes structural similarity rather than sequence
similarity.
- Analyze the alignment result. Because the target and reference
sequences have the same starting site (steps 4 and 5) and the boundary
of each region is known, the regions of the two sequences can be
compared directly. There are three possible outcomes. If all regions
match but run in the opposite direction, the direction of the whole
sequence needs to be reversed. If the direction of the LSC or SSC is
inconsistent with the reference, the orientation of that region needs
to be adjusted. In any other case, the process treats the sequence as a
problematic assembly, because the structure of the plastid genome is
highly conserved.
- Output validated sequences with a standardized structure. After the
starting site, the direction of the strand, and the orientation of the
four major regions have been adjusted, the verified plastid genome with
a standardized structure is generated. The results of the collinearity
analysis and the unadjusted sequences are also available if needed.
Since both the plus and minus strands are considered when finding the
repeated regions, the program could, in theory, also recognize direct
repeats (DR) in addition to IR regions. For species that do not have a
quadripartite structure, such as species of Erodium (Blazier et al.
2011) and some parasitic plants (Bellot & Renner 2016), the Validation
module may not work as expected.
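The core of the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the function names, the `+`/`-` strand labels, and the region keys `LSC`, `IRa`, `SSC`, and `IRb` are hypothetical, and the LSC start position is taken as given rather than derived from BLAST output.

```python
from typing import Dict, Optional

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def double_sequence(seq: str) -> str:
    """Append a copy of the sequence to its own end (step 1)."""
    return seq + seq


def extract_genome(doubled: str, lsc_start: int, genome_length: int) -> str:
    """Cut one full-length genome out of the doubled sequence,
    starting at the (already located) start of the LSC region."""
    return doubled[lsc_start:lsc_start + genome_length]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def decide_adjustment(strands: Dict[str, Optional[str]]) -> str:
    """Map per-region strand agreement with the reference to an action.

    `strands` gives '+' (same direction), '-' (opposite direction), or
    None (no match) for each of the hypothetical keys below.
    """
    regions = ("LSC", "IRa", "SSC", "IRb")
    values = [strands.get(region) for region in regions]
    if any(value not in ("+", "-") for value in values):
        return "problematic"               # a region has no usable match
    if all(value == "+" for value in values):
        return "no_change"                 # already collinear
    if all(value == "-" for value in values):
        return "reverse_complement_whole"  # flip the whole sequence
    if strands["IRa"] == "+" and strands["IRb"] == "+":
        flipped = [r for r in ("LSC", "SSC") if strands[r] == "-"]
        return "flip:" + ",".join(flipped)  # flip only LSC and/or SSC
    return "problematic"                   # any other pattern
```

Doubling the sequence is what makes the rotation a simple slice: whatever position the assembler happened to start from, a window of one genome length beginning at the LSC start always lies inside the doubled string.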
If the assembly fails validation, the Assembly module is called again
with another seed. The program stops when the Validation module finds
at least one validated assembly or when all seeds have been tried. If
no validated assembly is generated after trying all seeds, the Merge
module tries to build a complete plastid genome from the contigs
created using different seeds.
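The control flow between the three modules can be illustrated with a short sketch. The callables `assemble`, `validate`, and `merge` are hypothetical stand-ins for the Assembly, Validation, and Merge modules; their signatures are assumptions made for this example only.

```python
from typing import Callable, Iterable, List, Tuple


def assemble_with_retries(
    seeds: Iterable[str],
    assemble: Callable[[str], str],
    validate: Callable[[str], bool],
    merge: Callable[[List[str]], str],
) -> Tuple[str, str]:
    """Try each seed in turn and stop at the first validated assembly.

    If no seed yields a validated assembly, hand every failed
    assembly to the merge step and return its result instead.
    """
    failed: List[str] = []
    for seed in seeds:
        assembly = assemble(seed)
        if validate(assembly):
            return "validated", assembly
        failed.append(assembly)
    return "merged", merge(failed)
```

Returning a status label alongside the sequence lets a caller tell a validated assembly apart from a merged one, which has not passed validation.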