HP-PPI Data collection and cleaning
HP-PPI data for the database were collected from twelve source databases: BioGrid(36), PHISTO(37), HPIDb(38), MINT(39), IntAct(40), MPIDB(41), UniProt(42), VirHostNet(43), MatrixDB(44), I2D(45), DIP(46) and InnateDB(47). The data obtained from these sources included i) UniProt accession numbers, ii) gene symbols, iii) UniProt entry names, iv) gene symbols for the interacting proteins of the pathogen and the human host, v) the corresponding pathogen name for each pathogen protein, vi) pathogen taxon IDs, and vii) the experimental method used to detect each unique interaction.
To keep the data uniform, the UniProt accession number was used as the unique identifier for proteins extracted from the different sources. Pathogen names from the different databases were examined for variations in syntax and nomenclature and were converted into a single uniform name using the UniProt taxon identifier. Duplicate entries were removed to avoid redundancy, and obsolete entries were either removed or converted into the secondary UniProt accession, if available.
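The cleaning steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the database. The function name `clean_interactions`, the record field names, the lookup tables, and the placeholder accessions are all hypothetical, and the choice to treat two records as duplicates when they share both accessions, the taxon ID, and the detection method is an assumption made for illustration.

```python
from typing import Dict, List, Set


def clean_interactions(
    records: List[Dict[str, str]],
    taxon_names: Dict[str, str],
    replacements: Dict[str, str],
    obsolete: Set[str],
) -> List[Dict[str, str]]:
    """Normalise accessions and pathogen names, then drop duplicates.

    All arguments are hypothetical inputs for this sketch:
    taxon_names  maps a taxon ID to one uniform pathogen name,
    replacements maps an obsolete accession to its replacement,
    obsolete     holds obsolete accessions with no replacement.
    """
    seen = set()
    cleaned = []
    for rec in records:
        row = dict(rec)  # copy so the input records are left unchanged
        drop = False
        for field in ("pathogen_acc", "host_acc"):
            acc = row[field].strip().upper()
            acc = replacements.get(acc, acc)  # convert if a replacement exists
            if acc in obsolete:
                drop = True  # obsolete entry with no replacement is removed
            row[field] = acc
        if drop:
            continue
        taxon_id = row["taxon_id"].strip()
        row["taxon_id"] = taxon_id
        # Use the taxon ID to pick one uniform pathogen name; keep the
        # source name when the ID is not in the lookup table.
        row["pathogen_name"] = taxon_names.get(taxon_id, row["pathogen_name"])
        # Assumed duplicate key: same protein pair, taxon, and method.
        key = (row["pathogen_acc"], row["host_acc"], taxon_id,
               row["method"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Toy input with placeholder accessions (not real database records).
records = [
    {"pathogen_acc": "p00001 ", "host_acc": "Q00001", "taxon_id": "11676",
     "pathogen_name": "HIV-1", "method": "Two hybrid"},
    {"pathogen_acc": "P00001", "host_acc": "Q00001", "taxon_id": "11676",
     "pathogen_name": "Human immunodeficiency virus type 1",
     "method": "two hybrid"},
    {"pathogen_acc": "OLD001", "host_acc": "Q00002", "taxon_id": "11676",
     "pathogen_name": "HIV1", "method": "pull down"},
    {"pathogen_acc": "OLD002", "host_acc": "Q00003", "taxon_id": "11676",
     "pathogen_name": "HIV-1", "method": "pull down"},
]
taxon_names = {"11676": "Human immunodeficiency virus 1"}
replacements = {"OLD001": "P00002"}
obsolete = {"OLD002"}

cleaned = clean_interactions(records, taxon_names, replacements, obsolete)
for row in cleaned:
    print(row["pathogen_acc"], row["host_acc"], row["pathogen_name"])
```

In this toy input, the first two records describe the same interaction under different pathogen spellings and collapse into one entry, the third has its obsolete accession converted, and the fourth is removed because no replacement accession is available.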