Table 4: Performance of the CNN model on the training dataset with
variable window size
3.2 Selection of the optimal neural network model on the
training dataset:
The training dataset was used to benchmark four classifiers, namely
logistic regression, ANN, LSTM and CNN. Two of these classifiers
(logistic regression and ANN) are simple and computationally less
expensive, whereas the other two (LSTM and CNN) are more complex.
3.2.1 Logistic regression
The overall performance of logistic regression is shown in Table S1. The
precision, recall and F1-score vary between modifications. Logistic
regression was unable to predict methylation and phosphorylation in the
current validation dataset. Possible reasons are that i) the target
label has no linear relationship with the features and/or ii) the sample
sizes in the validation dataset are uneven across classes (Table S1).
The accuracy of this classifier on the validation dataset was 0.67
(67%).
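To illustrate how a reasonable overall accuracy can coexist with classes that are never predicted, the following minimal sketch computes per-class precision, recall, F1-score and support. The labels and counts are hypothetical toy values, not the study's data, and `per_class_report` is an illustrative helper rather than the code used to produce Table S1.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, tp + fn)  # support = true count
    return report


# Hypothetical, deliberately imbalanced validation labels (assumption).
y_true = ["acetylation"] * 8 + ["phosphorylation"] + ["methylation"]
# A classifier that always predicts the majority class.
y_pred = ["acetylation"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_report(y_true, y_pred)

print(f"accuracy = {accuracy:.2f}")
for label, (precision, recall, f1, support) in report.items():
    print(f"{label:16s} P={precision:.2f} R={recall:.2f} F1={f1:.2f} n={support}")
```

In this toy case the minority classes receive a precision, recall and F1-score of zero even though the overall accuracy looks acceptable, which is the pattern that an uneven class distribution can produce.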
3.2.2 ANN
The accuracy achieved with the ANN model was 70%, slightly better than
that of logistic regression. This is presumably because the ANN has
three layers (an input layer, a hidden layer and an output layer),
whereas logistic regression has no hidden layer. Moreover, the
performance of logistic regression degrades when it is trained on noisy
data or when the samples are unevenly distributed between classes. The
prediction results varied between modifications (Table S2). The ANN
model was unable to predict oxidation from the current dataset,
although, unlike logistic regression, it successfully predicted
methylation and phosphorylation. Note that the composition of the
validation dataset varies between classifiers, because the data were
randomly split into training and test (validation) sets at a 2:1 ratio
and each random split contains a different proportion of His
modifications. For example, the support (the number of validation data
points) for phosphorylation was only 8 for logistic regression, in
contrast to 175 for the ANN model. This may explain why logistic
regression was unable to predict phosphorylation whereas the ANN model
did so successfully.
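The effect of an unstratified random split on per-class support can be sketched as follows. This is an illustration under stated assumptions: the class counts are hypothetical, and `random_split` and `stratified_split` are helper names introduced here, not the splitting code used in the study. The stratified variant is shown only as a common way to keep class proportions constant between splits.

```python
import random
from collections import Counter, defaultdict


def random_split(labels, test_fraction=1 / 3, seed=0):
    """Shuffle all indices and hold out test_fraction, ignoring class labels."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = round(len(idx) * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train, test)


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Hold out test_fraction of each class separately to preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical class counts (assumption, not the study's dataset).
labels = ["acetylation"] * 600 + ["phosphorylation"] * 300 + ["methylation"] * 90

# Support per class changes from one random 2:1 split to the next.
for seed in range(3):
    _, rand_test = random_split(labels, seed=seed)
    print("random    ", seed, dict(Counter(labels[i] for i in rand_test)))

# With stratification, support per class is fixed by the class sizes.
_, strat_test = stratified_split(labels)
strat_counts = Counter(labels[i] for i in strat_test)
print("stratified", dict(strat_counts))
```

With a plain random split, the number of validation examples per modification depends on the seed, so two classifiers evaluated on different splits are not scored on the same class composition.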
3.2.3 LSTM
The accuracy obtained from LSTM was 71%, better than that obtained from
the logistic regression and ANN models. The improvement is most likely
due to the recurrent connections of the LSTM, which feed data back
during training. LSTM works best on data with known patterns or
sequences. As mentioned above, His hydroxylation and methylation are
associated with characteristic sequence motifs [23][24]. Moreover,
protein splicing involves multiple conserved His residues at the enzyme
active sites [22]. The conserved patterns for these three His
modifications led to improved recall (that is, a high proportion of true
positives relative to false negatives) (Table S3). Despite its improved
performance, the LSTM classifier was unable to compute precision, recall
and F1-score for the ribosylation and oxidation modifications.
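For reference, the precision, recall and F1-score discussed in this section are assumed to follow the standard per-class definitions, where TP, FP and FN denote the numbers of true positives, false positives and false negatives for a given modification:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}
           {\text{precision} + \text{recall}}
\]
```

Under these definitions, a class for which the classifier makes no positive predictions has TP + FP = 0, so its precision is undefined, which is consistent with metrics that could not be computed for some modifications.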
3.2.4 CNN
The overall accuracy obtained from the CNN model was 75.47%, the best of
all the classifiers. Notably, the CNN model was capable of predicting
all the modifications, unlike the other classifiers (Table 5). As with
the other classifiers, performance varied between modifications. The
superior performance of the CNN model is most likely due to its
convolutional layers, which automatically lower the dimensionality of
the sequences while preserving the information. Logistic regression
works best when there is a predefined relation between input and output,
which was not explicit in the training dataset. As the ANN is a simple
neural network model with only one hidden layer, its learning was less
accurate. LSTM works best at pattern recognition; thus the model was
better at predicting hydroxylation, methylation and protein splicing,
which have known patterns. However, acetylation prediction was
consistently poor across all the classifiers, although sequence
specificity of His acetylation has been reported in the literature
[10][25].
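How a convolutional layer followed by pooling shortens a sequence representation while retaining a local motif signal can be sketched as below. This is a minimal illustration, not the Hist-i-fy architecture: the 15-residue window, the single hand-set filter (a learned filter would be fitted during training), the pooling size and the helper names `one_hot`, `conv1d` and `max_pool` are all assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(window):
    """Encode a peptide window as a list of 20-dimensional one-hot rows."""
    return [[1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS] for aa in window]


def conv1d(rows, kernel):
    """Valid 1D convolution of a (length x 20) input with one (width x 20) filter, then ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(rows) - width + 1):
        total = 0.0
        for k in range(width):
            total += sum(x * w for x, w in zip(rows[start + k], kernel[k]))
        out.append(max(total, 0.0))  # ReLU activation
    return out


def max_pool(values, size):
    """Non-overlapping max pooling; an incomplete trailing group is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical 15-residue window with His at the centre (index 7).
window = "AKRGSLTHDEPTVLK"

# Hand-set filter of width 3 that responds to "H" followed by "D".
kernel = [[0.0] * 20 for _ in range(3)]
kernel[1][AMINO_ACIDS.index("H")] = 1.0
kernel[2][AMINO_ACIDS.index("D")] = 1.0

encoded = one_hot(window)            # 15 positions x 20 channels
features = conv1d(encoded, kernel)   # 13 positions x 1 channel
pooled = max_pool(features, size=2)  # 6 positions x 1 channel

print(len(encoded), len(features), len(pooled))
print(pooled)
```

The representation shrinks from 15 positions to 6, yet the position where the filter matched still stands out in the pooled output, which is the sense in which dimensionality is reduced while the relevant information is kept.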
Based on the overall performance in the above benchmarking exercise
(Table S4), the CNN model was selected for further prediction of His
modifications in unknown proteins. This model, termed Hist-i-fy, was
tested on an independent dataset of His phosphorylation obtained from
mass spectrometry.