Histidine; post-translational modifications; Artificial Neural Network
(ANN); Convolutional Neural Network (CNN); Long Short-term Memory (LSTM);
Logistic Regression; protein sequence; UniProt database; accuracy;
recall; precision
1. Introduction: Enzyme functions are primarily executed through catalytic residues.
With the availability of high-throughput sequence data, a large
number of protein sequences are known without functional
characterization [1].
Computational characterization would facilitate rapid initial screening
that can be verified further with experimental observations. Cysteine
(Cys) and histidine (His) are the two most important amino acid residues
observed at the catalytic sites of all enzyme classes [2], [3]. The thiol
group of the cysteine side chain can undergo oxidation, leading to
various chemical and post-translational modifications that affect the
structure and function of proteins in different ways. The histidine
imidazole is an electron-deficient heteroaromatic ring (pKa = 6.8), which
makes it a suitable candidate for proton buffering, metal ion chelation,
and antioxidant activity. Because the pKa of the imidazole ring is close
to physiological pH (= 7.4), His participates efficiently in
enzyme catalysis. The His residue is particularly important in acid-base
catalysis owing to its amphoteric character. In addition, it
participates in elimination-addition and redox reactions. Various His
post-translational modifications that are involved in protein-protein
interactions and catalysis have been characterized experimentally.
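The link between the imidazole pKa and physiological pH can be made concrete with the Henderson-Hasselbalch relation. The sketch below is illustrative only: it uses the pKa and pH values quoted above, and the function name `protonated_fraction` is a hypothetical helper introduced here, not part of any method described in this work.

```python
def protonated_fraction(pka: float, ph: float) -> float:
    """Fraction of a titratable group in its protonated (acid) form.

    Follows from Henderson-Hasselbalch: pH = pKa + log10([base]/[acid]),
    so [acid] / ([acid] + [base]) = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Values quoted in the text: imidazole pKa = 6.8, physiological pH = 7.4.
his_fraction = protonated_fraction(pka=6.8, ph=7.4)
print(f"Protonated His imidazole at pH 7.4: {his_fraction:.2f}")  # prints 0.20
```

Under these values roughly one fifth of the imidazole rings are protonated and the rest are neutral, so both the proton-donating and the proton-accepting forms are present in appreciable amounts at physiological pH.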
Extensive computational characterization of cysteine functions has
been done by our group [4], [5], [6]. However, the
post-translational modification of His is less explored than that
of Cys or Lys.
The computational characterizations of His functions reported so far
address single modifications only. For example, histidine
phosphorylation sites were predicted using a convolutional neural network
(CNN)-based model, PROSPECT [7], and a support
vector machine-based model, pHisPred [8]. Transition
metal-binding sites for Cys and His were predicted by exploiting
position-specific evolutionary profiles using support vector machines
and neural networks [9]. PROSPECT takes a protein sequence as input and
returns predicted histidine phosphorylation sites with 72% accuracy. The
transition-metal-binding sites of histidine and cysteine were predicted
from protein sequences with 73% precision.
To the best of our knowledge, prediction of multiple His
post-translational modifications has not been reported. Here, for the
first time, we attempt to predict eight post-translational modifications
of His from (i) the protein sequence and (ii) the His residue position
only, using deep neural networks. The convolutional neural network (CNN)
performed best. The models output the most probable His modification. The
internal evaluation accuracies are comparable to those of the
single-modification prediction methods, with our results showing better
performance than the existing ones. The model was blindly tested for
external evaluation on independent phosphorylated histidine data points.
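To illustrate what predicting from a protein sequence and a His residue position can look like in practice, the following minimal sketch extracts a fixed-length window around each His and one-hot encodes it. This is an assumption made for illustration and not necessarily the input representation used in this work: the flank size, the padding symbol `X`, and the helper names `his_windows` and `one_hot` are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def his_windows(sequence, flank=3, pad="X"):
    """Return (1-based position, window) for every His in the sequence.

    Each window has length 2 * flank + 1 with the His at its centre;
    positions beyond the sequence termini are filled with the pad symbol.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "H":
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def one_hot(window):
    """Encode a window as a list of 20-dimensional one-hot rows.

    Pad symbols and non-standard residues become all-zero rows.
    """
    rows = []
    for aa in window:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows


example = his_windows("MKHAAH", flank=2)
print(example)  # [(3, 'MKHAA'), (6, 'AAHXX')]
encoded = one_hot(example[1][1])
```

Each encoded window is a (2 * flank + 1) x 20 matrix that could be fed to a convolutional or recurrent classifier with one output per modification class.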