Domain mimicry
Of the 5,569 pathogen proteins from 630 pathogens, 607 proteins from 146 pathogens formed DLPs with host interactor proteins, as indicated in the schematic in Figure 1. Because every mimicked domain occurred in multiple instances, we counted unique domain types: 3,040 unique CDD domain types were shared by pathogens and host. The largest number of DLPs (61,609) was found for the serine/threonine protein kinase US3 (UniProt ID: P04413) from Human herpesvirus 1 strain 17 (HHV-1). The top 10 pathogens involved in molecular mimicry, along with their numbers of DLPs, are shown in Table 2. The two viral pathogens with the most DLPs were HHV-1 and Rous sarcoma virus strain Schmidt-Ruppin A. Among bacteria, Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513) had the largest number and widest diversity of host-like domains (Table 2). This opportunistic human bacterial pathogen has previously been reported to be highly involved in molecular mimicry of host proteins (24, 55).
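The tallies described above (unique domain types, DLPs per pathogen protein, and how often each domain occurs) can be illustrated with a minimal sketch. It assumes that each DLP is stored as a (pathogen protein, host protein, shared domain) record. The record layout, the function name summarize_dlps and all identifiers in the toy data are hypothetical placeholders and are not taken from the actual pipeline or dataset.

```python
from collections import Counter

# Toy, made-up records for illustration only; each tuple is
# (pathogen_protein, host_protein, shared_domain). These are NOT real data.
toy_dlps = [
    ("patho_A", "host_1", "dom_X"),
    ("patho_A", "host_2", "dom_X"),
    ("patho_A", "host_3", "dom_Y"),
    ("patho_B", "host_1", "dom_X"),
    ("patho_C", "host_4", "dom_Z"),
]


def summarize_dlps(records):
    """Return unique domain types, DLP counts per pathogen protein,
    and occurrence counts per domain for a list of DLP records."""
    # A domain seen in many DLPs is counted once as a "type".
    unique_domains = {domain for _, _, domain in records}
    # Number of DLPs each pathogen protein takes part in.
    dlps_per_protein = Counter(protein for protein, _, _ in records)
    # Number of DLPs in which each domain appears.
    domain_frequency = Counter(domain for _, _, domain in records)
    return unique_domains, dlps_per_protein, domain_frequency


unique_domains, dlps_per_protein, domain_frequency = summarize_dlps(toy_dlps)
print("unique domain types:", len(unique_domains))
print("protein with most DLPs:", dlps_per_protein.most_common(1))
print("most frequent domains:", domain_frequency.most_common(2))
```

On real data, the same counters would be ranked with most_common(10) to obtain top-10 lists of the kind reported in the tables and figures.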
The top 10 most frequently observed mimicked domains are shown in Figure 4a. PHA03247 (large tegument protein UL36) was the most frequent among DLPs. UL36 is an important domain family of the tegument proteins of herpes simplex virus (HSV) that is crucial for virus-host interaction and host immune evasion (56). UL36 colocalizes with host and viral membrane proteins and aids in the assembly and cell entry of HSV (57). The top 10 most frequently occurring mimicked domains in the different organism categories are shown in Table 3. One conserved domain family potentially mimicked by viruses was the DEAD-like helicase domain superfamily. DEAD-box helicases share a common D-E-A-D motif and are an emerging class of host proteins mimicked by viruses during infection (58). The conserved domains found most frequently in bacterial, viral and fungal DLPs were Rad50 ATPase and SbcC. Both are involved in DNA repair pathways and are highly conserved among eukaryotes (humans and fungi), bacteria and viruses (59, 60). The pathogens thus seem to have captured DNA repair proteins from their hosts to aid their own replication and survival by disrupting the host DNA repair pathways (61, 62).
Another important mimicked domain found in our data is the glycogen synthase kinase-3 (GSK-3) domain. Bacterial pathogens such as Helicobacter pylori have been found to divert host signalling pathways, such as WNT signalling, by targeting host GSK-3 (61).
The fungal pathogen found to mimic the largest number of host-like domains was Saccharomyces cerevisiae S288C. In the Others category, Dictyostelium discoideum was the predominant pathogen, imitating the largest number of domains. The pathogens with the highest numbers of DLPs and MLPs in each pathogen category (virus, bacteria, fungi, and others) are listed in Supplementary data Tables S1, S2, S3 and S4, respectively.