2.4 Test dataset generation:
The test dataset was curated from the mass spectrometry data
reported in [28], available
in Supplementary Table 1 of that reference. The UniProt ID of each
protein was used to retrieve its sequence from the UniProt database, and
the corresponding His residue number was recorded. All these His residues are
phosphorylated. This independent test dataset consists of 34
phosphorylated His residues.
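As an illustration of the consistency check implied above, the following minimal sketch parses a FASTA record and confirms that a reported residue number points to a His. The function names `parse_fasta` and `is_histidine` and the example record are hypothetical and are not taken from the original pipeline; residue numbers are assumed to be 1-based, as in UniProt.

```python
def parse_fasta(text: str) -> str:
    """Return the sequence of a single-record FASTA string (header line dropped)."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def is_histidine(sequence: str, residue_number: int) -> bool:
    """Check that the 1-based residue number exists and is a His ('H')."""
    if not 1 <= residue_number <= len(sequence):
        return False
    return sequence[residue_number - 1] == "H"


# Hypothetical FASTA record, used only to exercise the functions.
example_fasta = ">sp|P00000|EXAMPLE hypothetical protein\nMKTAYIAKQR\nHISFVKSHFSRQ\n"
example_sequence = parse_fasta(example_fasta)
```

A check of this kind would flag entries whose residue numbering does not match the retrieved sequence, for example when the numbering refers to a different isoform.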
2.5 Processing of the training dataset
2.5.1 Selection of input parameters for deep learning models:
The training dataset was pre-processed by selecting a stretch of amino
acids from each protein sequence, with the His of interest at the centre
and flanked on each side by n amino acids, where the window size n was
varied from three to ten. The length of each extracted sequence is
therefore 2n + 1 for window size n; for example, the sequence length is
seven for window size three. Hence, for a given window size, all the
training sequences have an equal number of amino acids, with the His of
interest at the centre. This set of sequences (one set per window size)
was used as the input (X-parameter) for the deep neural network models.
The Y-parameters were the post-translational modifications. A
representative input file for the deep neural network model is shown in
Table 2. The relative performance of the different window sizes was
tested.
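The window extraction described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `extract_window` is hypothetical, and padding with "X" when the His lies closer than n residues to a terminus is an assumption, since the text does not state how such cases were handled.

```python
def extract_window(sequence: str, his_position: int, n: int, pad: str = "X") -> str:
    """Return the 2n + 1 residue window centred on the 1-based His position.

    Padding with `pad` near the termini is an assumption made for this sketch.
    """
    index = his_position - 1  # convert to 0-based indexing
    if not 0 <= index < len(sequence):
        raise ValueError("residue number is outside the sequence")
    if sequence[index] != "H":
        raise ValueError("residue at the given position is not His")
    left = sequence[max(0, index - n):index]
    right = sequence[index + 1:index + 1 + n]
    # Pad each flank to exactly n residues so every window has length 2n + 1.
    return left.rjust(n, pad) + "H" + right.ljust(n, pad)


# One window per window size, from three to ten, for a hypothetical sequence.
sequence = "MKTAYIAKQRHISFVKSHFSRQ"
windows = {n: extract_window(sequence, 11, n) for n in range(3, 11)}
```

For window size three, the sequence above gives the seven-residue window KQRHISF. Discarding sites that lie too close to a terminus would be an equally valid way to keep all inputs the same length.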