KEYWORDS:
Histidine; post-translational modifications; Artificial Neural Network (ANN); Convolutional Neural Network (CNN); Long Short-Term Memory (LSTM); Logistic Regression; protein sequence; UniProt database; accuracy; recall; precision
1. Introduction: Enzyme functions are executed primarily through catalytic residues. With the availability of high-throughput sequence data, a large number of protein sequences are known but lack functional characterization [1]. Computational characterization would facilitate rapid initial screening that can then be verified experimentally. Cysteine (Cys) and histidine (His) are the two most important amino acid residues observed at the catalytic sites of all enzyme classes [2], [3]. The thiol group of the cysteine side chain can undergo oxidation, leading to various chemical and post-translational modifications that affect protein structure and function in different ways. The imidazole side chain of histidine is an electron-deficient heteroaromatic ring (pKa = 6.8), which makes His well suited to proton buffering, metal ion chelation, and antioxidant activity. Because the pKa of the imidazole ring is close to physiological pH (7.4), His participates efficiently in enzyme catalysis. The His residue is particularly important in acid-base catalysis because of its amphoteric character. It also participates in elimination-addition and redox reactions. Various His post-translational modifications involved in protein-protein interactions and catalysis have been characterized experimentally. Extensive computational characterization of cysteine functions has been carried out by our group [4], [5], [6]. However, the post-translational modification of His is less explored than that of Cys or Lys. The computational characterizations of His functions reported so far address single modifications only. For example, histidine phosphorylation sites were predicted using a convolutional neural network (CNN)-based model, PROSPECT [7], and a support vector machine-based model, pHisPred [8].
Transition-metal-binding sites for Cys and His were predicted by exploiting position-specific evolutionary profiles using support vector machines and neural networks [9]. The CNN-based prediction model, PROSPECT, takes a protein sequence as input and returns predicted histidine phosphorylation sites with 72% accuracy. The transition-metal-binding sites of histidine and cysteine were predicted from protein sequences with 73% precision. To the best of our knowledge, prediction of multiple His post-translational modifications has not been reported. Here, for the first time, we attempt to predict eight post-translational modifications of His from (i) the protein sequence and (ii) the His residue position only, using deep neural networks. The convolutional neural network (CNN) performed best. Each model outputs the most probable His modification. The internal evaluation accuracies are comparable to, and in our results better than, those of the existing single-modification prediction methods. The model was also tested blindly, as an external evaluation, on independent phosphorylated histidine data points.
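As an illustration of how a protein sequence and a His residue position can be turned into a network input, the following is a minimal sketch, not the encoding used in this work. It assumes a fixed-length window centred on each His and a one-hot encoding over the 20 standard amino acids plus a padding symbol. The function names, the flank size, and the padding symbol "X" are all hypothetical choices made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # hypothetical padding symbol for positions beyond the sequence ends
ALPHABET = AMINO_ACIDS + PAD
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def histidine_windows(sequence, flank=3):
    """Return (position, window) pairs, one per His residue.

    position is 1-based; each window has length 2 * flank + 1, is centred on
    the His, and is padded with PAD where it runs past either sequence end.
    The flank size is an assumed value, not one taken from the paper.
    """
    seq = sequence.upper()
    padded = PAD * flank + seq + PAD * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue == "H":
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def one_hot(window):
    """Encode a window as one-hot rows; unknown residues are treated as PAD."""
    rows = []
    for residue in window:
        row = [0] * len(ALPHABET)
        row[INDEX.get(residue, INDEX[PAD])] = 1
        rows.append(row)
    return rows


# Toy sequence, made up for the example: list every His with its window.
for position, window in histidine_windows("MKHAGHL", flank=2):
    print(position, window, len(one_hot(window)))
```

Each encoded window is a matrix of shape (2 * flank + 1) by 21. Under these assumptions, a classifier such as a CNN could take that matrix as input and return one of the modification classes as output.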