Introduction
In nature, host-pathogen protein-protein interactions (HP-PPIs) are ubiquitous and highly complex, and understanding them is essential for elucidating infectious diseases (1). During infection, there is continuous cross-talk between pathogens and their hosts, mediated by a variety of effectors including proteins, small molecules, metabolites, and regulatory RNAs (2, 3). Pathogenesis involves interactions between the signalling networks of the host and the pathogen. Recent studies of HP-PPIs focus on the mechanisms that pathogens employ to hijack and exploit the host immune system for their own survival. Molecular mimicry has evolved to enable pathogen proteins to imitate host proteins in order to disrupt their interactions and perturb signalling pathways (4). Thus, the interacting pathways and proteins of the pathogen may be viewed as a continuum with those of the host.
Mimicry of host antigenic determinants as a survival mechanism was described early on in parasites (5). A pathogen's ability to mimic host components may arise through two distinct mechanisms. In the first, the pathogen acquires host genes through horizontal gene transfer. An example is the acquisition of complement escape regulators by pathogens such as the helminths Echinococcus granulosus (6) and Onchocerca volvulus (7). In the second, host and pathogen genes evolved independently and converged on similar structures with different functions, i.e., they underwent convergent evolution (8). A well-known example is the Yersinia pseudotuberculosis effector protein invasin, which structurally mimics the integrin-binding surface of fibronectin (9). Whereas horizontal gene transfer leads to detectable homology between pathogen and host proteins (10, 11), convergent evolution is more likely to produce local similarity between them, as reflected in shared motifs (12). Local similarities between epitopes of infectious agents and antigens present in the host can also lead to autoimmune diseases (13-17).
Molecular mimicry can operate at four distinct levels: (i) similarity in both sequence and structure of a full-length protein or a functional domain, as displayed by the SET-domain-containing proteins of Legionella pneumophila, Chlamydia trachomatis and Burkholderia thailandensis, which mimic host proteins (18); (ii) structural similarity without apparent sequence similarity, as detected in several bacterial and viral pathogens that evolved to structurally mimic host ligands although the sequence similarity between the pathogen molecules and the mimicked host ligands was low (19); (iii) sequence similarity confined to a short linear motif, exemplified by the WxxxE motif found in many bacterial guanine nucleotide exchange factors, such as EspM2 and Map of E. coli and SifA of Salmonella (20, 21); motifs can tolerate mutations and evolve rapidly to alter interactions with the host (22); and (iv) similarity of only the binding-site architecture (interface mimicry) without sequence homology, as displayed by human fibronectin and Y. pseudotuberculosis invasin, both of which bind human integrin (9, 11). These two proteins have similar chemical properties at the binding site in the absence of sequence and structural homology.
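To make motif-level mimicry concrete, the following minimal sketch scans a protein sequence for the WxxxE pattern (tryptophan, any three residues, glutamate). It is an illustration only and not the detection method used in this work: the function name find_wxxxe and the toy sequence are hypothetical, and the pattern deliberately accepts any letter at the three variable positions.

```python
import re

# WxxxE: W, then any three residues, then E.
# The lookahead lets overlapping occurrences be reported as well.
WXXXE = re.compile(r"(?=(W[A-Z]{3}E))")


def find_wxxxe(sequence):
    """Return (1-based start position, matched text) for each WxxxE hit."""
    return [(m.start() + 1, m.group(1)) for m in WXXXE.finditer(sequence.upper())]


# Toy sequence invented for illustration; it is not a real effector protein.
toy_sequence = "MKTWAALEQRSWDDDEPLW"
hits = find_wxxxe(toy_sequence)
print(hits)
```

A short pattern of this kind matches many unrelated proteins by chance, which is one reason a motif hit alone is weak evidence of mimicry.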
Existing methods for detecting mimicry rely simply on identifying sequence or structural similarity. A previously available database, mimicDB (8), provides information on molecular mimicry proteins or epitopes in a limited number of human parasites. Another database, miPepBase (23), lists experimentally verified mimicry peptides involved in autoimmune disease. However, pathogens recruit a wide range of domains and motifs to mimic and hijack the host cell machinery for their survival (20, 21, 24-26). A computational pipeline using BLASTp against the human proteome has also been implemented to predict molecular mimicry candidates in bacterial pathogens (27). Sequence-based methods for discovering protein mimics may nevertheless be inadequate, because they depend on the level of recognizable homology between the host and pathogen proteins. Structure-based methods are better suited to recognizing remote similarity, whereas motif-based methods are suited to recognizing localized regions of similarity between proteins. Pathogenic bacteria are likely to target host proteins by imperfectly mimicking the host interface (28). An interface mimicry-based method, the HMI-PRED server (29), performs structural prediction of given HP-PPIs. However, it is limited by its requirement for the structure of the microbial protein involved in mimicry.
Similarity between the motifs and domains of host and pathogen proteins does not necessarily indicate that they actually interact. An interaction further depends on the proteins being expressed simultaneously and being present in the same cellular compartment. Notably, analysis of PPIs in yeast and human showed that a large majority of interactions occur between proteins in the same subcellular compartment (30, 31). Studies have also shown that functionally related or interacting proteins from the same pathways share Gene Ontology annotations and usually have higher co-expression scores (32, 33). Moreover, imitation of host proteins by the pathogen essentially works by competing with endogenous (host-host) interactions (34, 35). We therefore hypothesize that resemblance between the experimentally validated host and pathogen interactors of the same host protein increases confidence in the identification of molecular mimicry candidates, owing to the colocalization and co-expression of the interacting protein pairs. This is shown schematically in Figure 1a for global structural similarity (domain linear pair, or DLP) and in Figure 1b for local sequence similarity (motif linear pair, or MLP). Delineating the DLPs and MLPs also provides information about the host interactions that are likely to be disrupted by pathogen protein mimicry.
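The hypothesis above can be sketched in a few lines of code. This is one possible reading of the DLP/MLP idea, not the actual ImitateDB pipeline: for every host protein targeted by a pathogen protein, its endogenous host partners are compared with the pathogen protein, and the pair is reported when the two share at least one annotated feature (a domain for a DLP, a motif for an MLP). The function name linear_pairs and all protein and feature identifiers are hypothetical.

```python
from collections import defaultdict


def linear_pairs(hp_ppis, hh_ppis, features):
    """Report pathogen/host-partner pairs that share a feature.

    hp_ppis:  iterable of (host_protein, pathogen_protein) interactions
    hh_ppis:  iterable of (host_protein, host_protein) interactions
    features: dict mapping protein id -> set of domain or motif ids
    Returns a list of (host_target, pathogen, host_partner, shared_features).
    """
    # Index host-host interactions in both directions.
    partners = defaultdict(set)
    for a, b in hh_ppis:
        partners[a].add(b)
        partners[b].add(a)

    results = []
    for host, pathogen in hp_ppis:
        pathogen_features = features.get(pathogen, set())
        for partner in sorted(partners.get(host, ())):
            shared = pathogen_features & features.get(partner, set())
            if shared:
                results.append((host, pathogen, partner, sorted(shared)))
    return results


# Hypothetical toy data: P1 targets H1, which also binds H2 and H3.
hp = [("H1", "P1")]
hh = [("H1", "H2"), ("H3", "H1")]
domains = {"P1": {"SET"}, "H2": {"SET", "SANT"}, "H3": {"Tudor"}}
print(linear_pairs(hp, hh, domains))
```

In this toy case P1 and H2 form a candidate pair through the shared SET domain, suggesting that P1 could compete with H2 for binding to H1. Passing motif annotations instead of domain annotations gives the analogous motif-level result.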
In this work, we collated the entire set of experimental HP-PPIs from interaction databases and computed their DLPs and MLPs, which we organized into a publicly available database, ImitateDB, accessible online at http://imitatedb.sblab-nsit.net. ImitateDB helps researchers search for organism-wise mimicry patterns prominent in the host-pathogen interactome. It houses 206,449 DLPs and 3,845,643 MLPs. Of the 61,215 HP-PPIs collated, 1,549 and 49,266 were characterized by imitated domains and motifs, respectively. Novel potential domain mimics include the SANT (Swi3, Ada2, N-Cor, and TFIIIB) DNA-binding domain, the Tudor domain and the PhoX homology domain, while novel motif mimics include the microbodies C-terminal targeting signal, the ubiquitin-interacting motif and the lipocalin signature. Specific domains or motifs imitated by a large number of pathogens are likely to contribute to microbial virulence and may be suitable targets for drugs or vaccines. Thus, ImitateDB constitutes a source of information on molecular imitation in HP-PPIs for researchers in the fields of infectious diseases and microbiology.