2.4 Test dataset generation:
The test dataset was curated from the mass spectrometry data of [28], available in Supplementary Table 1 of that reference. For each entry, the UniProt ID was used to retrieve the protein sequence from the UniProt database, together with the corresponding His residue number. All of these His residues are phosphorylated. This independent test dataset consists of 34 phosphorylated His residues.
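As an illustration of the curation step, the following minimal sketch checks that a reported residue number actually points to a His in the retrieved sequence. It assumes the FASTA text for the UniProt entry has already been downloaded. The function names `parse_fasta_sequence` and `is_his_at` and the toy record are hypothetical and are not taken from the original pipeline.

```python
def parse_fasta_sequence(fasta_text: str) -> str:
    """Return the amino acid sequence from a single-record FASTA string."""
    lines = [ln.strip() for ln in fasta_text.strip().splitlines()]
    # Skip the header line (starts with '>') and join the sequence lines.
    return "".join(ln for ln in lines if ln and not ln.startswith(">"))


def is_his_at(sequence: str, position: int) -> bool:
    """Return True if the 1-based residue `position` is histidine (H)."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == "H"


# Toy record for illustration only; this is not a real UniProt entry.
toy_fasta = ">toy|X00000|EXAMPLE\nMKTAHLLG\nSHAAK\n"
seq = parse_fasta_sequence(toy_fasta)

# Residue numbers as they might be listed in a supplementary table (assumed).
for pos in (5, 10):
    print(pos, is_his_at(seq, pos))
```

A check of this kind can flag entries in which the residue numbering of the source table and the current UniProt sequence disagree.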
2.5 Processing of the training dataset
2.5.1 Selection of input parameters for deep learning models:
The training dataset was pre-processed by selecting, from each protein sequence, a stretch of amino acids with the His of interest at the centre, flanked on each side by n amino acids, where the window size n was varied from three to ten. The length of the resulting amino acid sequence is 2n + 1 for window size n. For example, the sequence length is seven for window size three. Hence, for a given window size, all the training sequences have an equal number of amino acids, with the His of interest at the centre. This set of sequences (one set per window size) was used as the input (X-parameter) for the deep neural network models. The Y-parameters were the post-translational modifications. A representative input file for the deep neural network model is shown in Table 2. The relative performance of the different window sizes was tested.
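The window extraction described above can be sketched as follows. This is a minimal illustration, not the original implementation. The function name `extract_window` is hypothetical, and the padding of windows that run past a sequence terminus with the placeholder character "X" is an assumption, since the text does not state how His residues near the termini were handled.

```python
def extract_window(sequence: str, position: int, n: int, pad: str = "X") -> str:
    """Return the 2n + 1 residues centred on the His at 1-based `position`.

    Padding with `pad` at the termini is an assumption made for this sketch.
    """
    idx = position - 1  # convert to 0-based index
    if not 0 <= idx < len(sequence):
        raise ValueError("position is outside the sequence")
    if sequence[idx] != "H":
        raise ValueError("residue at position is not His")
    # Pad both ends so that every window has the same length.
    padded = pad * n + sequence + pad * n
    # The centre sits at idx + n in `padded`, so the window starts at idx.
    return padded[idx : idx + 2 * n + 1]


# Toy sequence for illustration only.
toy_seq = "MKTAHLLGSHAAK"

# One window per window size, from three to ten, for the His at position 5.
windows = {n: extract_window(toy_seq, 5, n) for n in range(3, 11)}
print(windows[3])  # KTAHLLG
```

For every window size n, the returned string has length 2n + 1 and the His of interest at index n, which matches the fixed-length input described in the text.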