Table 4: Performance of the CNN model on the training dataset with variable window size
3.2 Selection of the optimal neural network model on the training dataset
Four classifiers were benchmarked on the training dataset: logistic regression, ANN, LSTM and CNN. Logistic regression and ANN are simple and computationally inexpensive, whereas the other two (LSTM and CNN) are more complex.
3.2.1 Logistic regression
The overall performance of logistic regression is shown in Table S1. Precision, recall and F1-score varied across modifications. Logistic regression was unable to predict methylation and phosphorylation in the current validation dataset. Possible reasons are that i) the target label is not linearly related to the features and/or ii) the classes are unevenly represented in the validation dataset (Table S1). The accuracy of this classifier on the validation dataset was 0.67 (67%).
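The per-class metrics discussed above can be illustrated with a minimal sketch. The labels below are made up for illustration and are not the paper's data; the function names are hypothetical. The sketch shows how a classifier that never predicts a rare class can still reach a reasonable overall accuracy while its precision, recall and F1-score for that class are zero (precision is strictly undefined in that case and is reported as 0.0 here by convention).

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    support = Counter(y_true)      # number of validation points per class
    predicted = Counter(y_pred)    # number of times each class was predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    report = {}
    for label in sorted(support):
        tp = correct[label]
        # Precision is undefined when the class is never predicted; use 0.0.
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support[label])
    return report

def accuracy(y_true, y_pred):
    """Fraction of validation points whose predicted label is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy, made-up labels: the rare class (methylation) is never predicted.
y_true = ["acetylation"] * 8 + ["methylation"] * 2
y_pred = ["acetylation"] * 10

report = per_class_report(y_true, y_pred)
overall = accuracy(y_true, y_pred)
```

In this toy case the overall accuracy is 0.8 even though every metric for the rare class is zero, which is why per-class values and support are worth reading alongside the accuracy.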
3.2.2 ANN
The accuracy achieved with the ANN model was 70%, slightly better than that of logistic regression. This is presumably due to the three-layer architecture of the ANN (an input layer, a hidden layer and an output layer), which logistic regression lacks. Moreover, the performance of logistic regression degrades when it is trained on noisy data or when the samples are unevenly distributed between classes. The prediction results varied across modifications (Table S2). The ANN model was unable to predict oxidation from the current dataset, although it successfully predicted methylation and phosphorylation, unlike logistic regression. Note that the composition of the validation dataset differs between classifiers, because the data were randomly split into training and test (validation) sets at a 2:1 ratio and each random split contains a different proportion of each His modification. For example, the support (the number of validation data points) for phosphorylation was only 8 for logistic regression, in contrast to 175 for the ANN model. This presumably explains why logistic regression was unable to predict phosphorylation whereas the ANN did so successfully.
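The effect of a random 2:1 split on per-class support can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the class sizes are made up, and the stratified variant is shown only as one common way to keep support comparable between runs, not as something the study used.

```python
import random
from collections import Counter

def random_split(labels, test_fraction=1/3, seed=0):
    """Shuffle all indices and hold out `test_fraction` of them for validation."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = round(len(labels) * test_fraction)
    return indices[n_test:], indices[:n_test]

def stratified_split(labels, test_fraction=1/3, seed=0):
    """Split each class separately so every class keeps the same held-out share."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Made-up class sizes, for illustration only.
labels = ["phosphorylation"] * 30 + ["methylation"] * 270

# With a plain random split, the support of the rare class changes with the seed.
for seed in range(3):
    _, test_idx = random_split(labels, seed=seed)
    print(seed, Counter(labels[i] for i in test_idx))

# With a stratified split, the support is fixed by the class sizes.
strat_train, strat_test = stratified_split(labels)
strat_support = Counter(labels[i] for i in strat_test)
```

With the stratified variant, one third of each class is held out regardless of the seed, so the support for a given modification would be the same for every classifier.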
3.2.3 LSTM
The accuracy obtained with the LSTM was 71%, better than that of the logistic regression and ANN models. The improvement is most likely because the LSTM feeds information back through its recurrent connections during training. LSTM works best on known patterns or sequences. As mentioned above, His hydroxylation and methylation occur within characteristic sequence motifs [23] [24]. Moreover, protein splicing involves multiple conserved His residues at the enzyme active sites [22]. The conserved patterns for these three His modifications led to improved recall, that is, a higher proportion of true positives relative to false negatives (Table S3). Despite the improved performance of the LSTM, precision, recall and F1-score could not be computed for the ribosylation and oxidation modifications.
3.2.4 CNN
The overall accuracy obtained with the CNN model was 75.47%, the best of all the classifiers. Notably, the CNN model was able to predict all the modifications, unlike the other classifiers (Table 5). As with the other classifiers, performance varied across modifications. The superior performance of the CNN model is most likely due to its convolutional layers, which automatically reduce the dimensionality of the sequences while preserving the information. Logistic regression works best when there is a predefined relation between input and output, which was not explicit in the training dataset. As the ANN is a simple neural network with only one hidden layer, its learning was less accurate. LSTM works best for pattern recognition; the model was therefore better at predicting hydroxylation, methylation and protein splicing, which have known patterns. However, acetylation prediction was consistently poor across all the classifiers, although sequence specificity of His acetylation has been reported in the literature [10][25]. Based on the overall performance in the above benchmarking exercise (Table S4), the CNN model was selected for predicting His modifications in unknown proteins. This model, termed Hist-i-fy, was tested on an independent dataset of His phosphorylation obtained from mass spectrometry.
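How a convolutional layer condenses a sequence window while retaining motif information can be sketched in plain Python. This is a minimal illustration, not the Hist-i-fy architecture: the 9-residue window, the one-hot encoding, the single hand-set filter and the pooling size are all assumptions chosen for readability, whereas a trained CNN would learn many filters from the data.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(window):
    """Encode a peptide window as a list of 20-dimensional one-hot vectors."""
    return [[1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS] for aa in window]

def conv1d(encoded, kernel):
    """Slide one filter along the sequence axis (no padding), then apply ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            row, weights = encoded[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(row, weights))
        out.append(max(total, 0.0))  # ReLU activation
    return out

def max_pool(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# Hypothetical 9-residue window centred on His, and a hand-set filter of
# width 2 that responds most strongly to the dipeptide "HG".
window = "AKRGHGSLT"
kernel = [[0.0] * 20 for _ in range(2)]
kernel[0][AMINO_ACIDS.index("H")] = 1.0
kernel[1][AMINO_ACIDS.index("G")] = 1.0

features = conv1d(one_hot(window), kernel)  # 9 - 2 + 1 = 8 values
pooled = max_pool(features, size=2)         # 4 values
```

In this sketch the 9 x 20 one-hot input is reduced to four numbers for this filter, and the largest pooled value still marks where the "HG" pattern occurs. The shortening comes from the unpadded convolution and the pooling step together.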