The Validation module
When the Assembly module finishes running, the Validation module is called internally. In addition, users can invoke this module manually by providing the sequence to be validated and either a local reference file or taxon information.
The Validation module uses the “Rotate” algorithm to adjust the structure of the target and reference sequences and then conducts a collinearity analysis (Figure 1). The procedure takes several steps:
  1. Generate a full-length plastid genome. Extend the target sequence by appending a copy of itself to its end. Regardless of where the original sequence starts, the extended sequence contains one complete genome that begins at the start of the LSC region rather than in a truncated region.
  2. Perform self-to-self alignment. Call blastn from BLAST 2.9.0 (Madden et al. 2019) with default options to perform a self-to-self pairwise alignment.
  3. Locate IR regions. Analyze the BLAST output and find the longest match that has at least three copies (four if the sequence does not start with a truncated IR region); this match should be the IR.
  4. Determine the other regions. Locate the LSC and SSC regions according to the boundaries of the IRs. Extract a complete plastid genome in the order LSC-IR-SSC-IR, ensuring that the sequence starts at the 5′ terminal of the LSC region, or at the 3′ terminal if the LSC region is reverse-complemented.
  5. Adjust the reference. Repeat steps 1 to 4 on the reference sequence.
  6. Align target and reference sequences. Use BLAST to perform a pairwise alignment of the adjusted target and reference sequences. The sequence identity threshold of the alignment can be set between 0 and 100%; the low default threshold used in this step (0.7) allows BLAST to ignore mismatches in the middle of an alignment and continue extending it. Hence, the alignment emphasizes structural similarity rather than sequence similarity.
  7. Analyze the alignment result. Because the target and reference sequences have the same starting site (steps 4 and 5) and the boundaries of each region are known, the regions of the two sequences can be compared directly. There are three possible outcomes. If all regions match but in the opposite direction, the direction of the whole sequence needs to be reversed. If the direction of the LSC or SSC is inconsistent with the reference, the orientation of that region needs to be adjusted. In any other case, the assembly is treated as problematic, because the structure of the plastid genome is highly conserved.
  8. Output validated sequences with a standardized structure. After the starting site, the direction of the strand, and the orientation of the four major regions have been adjusted, the verified plastid genome with a standardized structure is generated. The results of the collinearity analysis and the unadjusted sequences are also provided in case they are needed.
  9. Since both the plus and minus strands are considered when finding repeated regions, the program could, in theory, also recognize direct repeats (DR) in addition to IR regions. For species that do not have a quadripartite structure, such as species of Erodium (Blazier et al. 2011) and some parasitic plants (Bellot & Renner 2016), the Validation module may not work as expected.
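The sequence-doubling idea in step 1, the extraction in step 4, and the decision logic in step 7 can be illustrated with a minimal Python sketch. This is not the program's actual implementation: all function names, the region labels `IRa` and `IRb`, the `+`/`-` strand encoding, and the returned action strings are hypothetical. The sketch also assumes that the region boundaries and per-region strand agreement have already been derived from the BLAST results, and it adds a "consistent" outcome for the case where no adjustment is needed, which the steps above do not name explicitly.

```python
from typing import Dict

# Hypothetical region labels for the quadripartite structure LSC-IR-SSC-IR.
REGIONS = ("LSC", "IRa", "SSC", "IRb")
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def double_sequence(seq: str) -> str:
    """Step 1: append a copy so that one complete genome unit is present
    regardless of where the linearized sequence starts."""
    return seq + seq


def rotate(seq: str, start: int) -> str:
    """Rotate a circular sequence so that it begins at index `start`."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def extract_unit(doubled: str, lsc_start: int, genome_len: int) -> str:
    """Step 4: cut one full-length genome out of the doubled sequence,
    starting at the (already located) beginning of the LSC region."""
    return doubled[lsc_start:lsc_start + genome_len]


def classify(strands: Dict[str, str]) -> str:
    """Step 7: turn per-region strand agreement with the reference
    ('+' same strand, '-' opposite strand) into an action."""
    values = [strands.get(region) for region in REGIONS]
    if any(v not in ("+", "-") for v in values):
        return "problematic"                # a region has no usable match
    if all(v == "+" for v in values):
        return "consistent"                 # assumed case: nothing to adjust
    if all(v == "-" for v in values):
        return "reverse_complement_whole"   # whole sequence is reversed
    flipped = [r for r in REGIONS if strands[r] == "-"]
    if set(flipped) <= {"LSC", "SSC"}:
        return "flip_" + "_".join(flipped)  # only single-copy regions differ
    return "problematic"


if __name__ == "__main__":
    genome = "AACCGGTTAGCT"                 # toy circular genome
    linear = rotate(genome, 5)              # linearized at an arbitrary point
    doubled = double_sequence(linear)
    start = doubled.find(genome)            # stand-in for the located LSC start
    print(extract_unit(doubled, start, len(genome)) == genome)
    print(classify({"LSC": "+", "IRa": "+", "SSC": "-", "IRb": "+"}))
```

In the toy example, the doubled sequence contains the original genome intact even though the input was cut at an arbitrary position, which is the property step 1 relies on.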
If an assembly fails to pass validation, the Assembly module is called again with another seed. The program stops when the Validation module finds at least one validated assembly or when all seeds have been tried. If no validated assembly is generated after all seeds have been tried, the Merge module tries to build a complete plastid genome from the contigs created with the different seeds.
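The control flow of this retry-and-merge behavior can be sketched as follows. This is only an illustration under stated assumptions: the function `assemble_with_retries` and its `assemble`, `validate`, and `merge` callbacks are hypothetical stand-ins for the Assembly, Validation, and Merge modules, and the real modules exchange richer data than the plain strings used here.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def assemble_with_retries(
    seeds: Iterable[str],
    assemble: Callable[[str], Optional[str]],
    validate: Callable[[str], bool],
    merge: Callable[[List[str]], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Try each seed in turn; stop at the first validated assembly.

    If every seed has been tried without a validated assembly, hand the
    collected contigs to the (hypothetical) merge callback instead.
    """
    contigs: List[str] = []
    for seed in seeds:
        assembly = assemble(seed)       # stand-in for the Assembly module
        if assembly is None:
            continue                    # this seed produced nothing usable
        if validate(assembly):          # stand-in for the Validation module
            return assembly, "validated"
        contigs.append(assembly)        # keep failed results for merging
    return merge(contigs), "merged"     # stand-in for the Merge module
```

The loop returns as soon as one assembly passes validation, so later seeds are never assembled in that case; the merge callback only runs once all seeds are exhausted.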