Table 3: Tokenized 1584 × 15 matrix (1584 entries × 15 amino acids)
2.6 Algorithms
Four classification models were used to train and benchmark the dataset: logistic regression and three deep learning algorithms, namely convolutional neural networks (CNNs), artificial neural networks (ANNs), and long short-term memory (LSTM) networks. Each model adjusts weights on the input data to produce a score for each class, either positive or negative, and learns a relationship between the features and the target variable. Logistic regression is a simple, efficient algorithm for binary classification problems; it works well when the relationship between the features and the target variable is approximately linear. Deep learning algorithms such as ANNs, LSTMs, and CNNs are more powerful than logistic regression for complex classification problems: they automatically learn hierarchical representations of the data, which can capture complex non-linear relationships between the features and the target variable. In general, deep learning algorithms require more data and computing resources than logistic regression but can achieve higher accuracy on complex classification tasks.
2.6.1 Logistic regression
Logistic regression is commonly used for prediction and classification problems. In the current study the model distinguishes eight classes (eight His modifications). Because logistic regression is inherently a binary classification method, the One-Versus-Rest (OVR) strategy was used: a separate binary model is trained for each class to decide whether an observation belongs to that class or not, reducing the multi-class task to eight binary problems. The method assumes that each of these binary classification problems (class 0 versus the rest, class 1 versus the rest, and so on) is independent of the others.
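The OVR setup above can be sketched with scikit-learn, which wraps one binary logistic model per class. The data here are random placeholders with the shape of the tokenized matrix in Table 3 (15-residue windows, 8 modification classes), not the study's dataset.

```python
# Sketch of One-Versus-Rest logistic regression for 8 classes.
# X and y are random stand-ins, not the His-modification data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 21, size=(200, 15)).astype(float)  # 15 tokenized residues
y = rng.integers(0, 8, size=200)                       # 8 modification classes

# OneVsRestClassifier fits one independent binary classifier per class.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(clf.estimators_))  # 8, one binary model per class
```

Each of the eight fitted estimators answers only "is this observation class k or not"; prediction picks the class whose binary model scores highest.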
2.6.2 ANN
The artificial neural network (ANN) used here is a single-hidden-layer neural network (consisting of an input layer, a hidden layer, and an output layer) that assigns each observation to one of several discrete classes. The input features may be categorical or numeric, while the dependent variable (Y-parameter) must be categorical.
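The forward pass of such a single-hidden-layer network can be written directly in NumPy. This is a minimal sketch of the architecture described above (with an assumed ReLU hidden layer and softmax output over 8 classes), not the study's trained model; all weights are random.

```python
# Single-hidden-layer ANN forward pass: input -> hidden -> output.
import numpy as np

rng = np.random.default_rng(1)

def forward(x, W1, b1, W2, b2):
    h = np.maximum(0.0, x @ W1 + b1)            # hidden layer (ReLU assumed)
    logits = h @ W2 + b2                         # output layer
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax over the 8 classes

x = rng.normal(size=(4, 15))                     # 4 samples, 15 features
W1 = rng.normal(size=(15, 32)); b1 = np.zeros(32)  # hidden width 32 (assumed)
W2 = rng.normal(size=(32, 8));  b2 = np.zeros(8)
probs = forward(x, W1, b1, W2, b2)
print(probs.shape)  # (4, 8); each row sums to 1
```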
2.6.3 LSTM
Long short-term memory (LSTM) networks are specialized recurrent neural networks. A recurrent neural network (RNN) contains cycles that feed the network activations from the previous time step as input to the current time step. These activations are stored in the internal states of the network for a duration that is not fixed a priori, giving the network a long short-term memory. LSTM-RNNs can therefore exploit dynamically changing contextual information to transform an input sequence into an output sequence. In contrast to the single-hidden-layer ANN, the LSTM model has multiple hidden layers.
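The internal states mentioned above are the LSTM's cell state (long-term memory) and hidden state (short-term output). A single time step of a standard LSTM cell, sketched in NumPy with random weights and hypothetical dimensions (not the study's configuration), makes this explicit:

```python
# One forward step of a standard LSTM cell, unrolled over a sequence.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W: (4H, D), U: (4H, H), b: (4H,) hold the stacked gate parameters
    # in the order [input, forget, output, candidate].
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new info to store
    f = sigmoid(z[H:2*H])          # forget gate: how much old state to keep
    o = sigmoid(z[2*H:3*H])        # output gate: how much state to expose
    g = np.tanh(z[3*H:4*H])        # candidate values
    c_t = f * c_prev + i * g       # cell state carries long-term memory
    h_t = o * np.tanh(c_t)         # hidden state is the short-term output
    return h_t, c_t

rng = np.random.default_rng(2)
D, H = 5, 8                        # assumed input and hidden sizes
W = rng.normal(size=(4*H, D)); U = rng.normal(size=(4*H, H)); b = np.zeros(4*H)
h = np.zeros(H); c = np.zeros(H)
for t in range(15):                # unroll over a 15-element input sequence
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h.shape)  # (8,)
```

The forget gate is what lets the network keep an activation in its state for an amount of time that is not fixed in advance.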
2.6.4 CNN
A convolutional neural network (CNN) consists of an input layer, multiple hidden layers, and an output layer. At least one hidden layer performs convolutions between a convolution kernel and that layer's input matrix. (Convolution here is a mathematical operation that amounts to a sliding dot product of two functions.) As the kernel slides along the input matrix, the convolution operation generates a feature map, which serves as the input to the next layer. Convolutional layers are typically followed by pooling layers, fully connected layers, and normalization layers, which together enable automated spatial hierarchical feature learning via backpropagation.
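The kernel-sliding operation that produces a feature map can be sketched in a few lines of NumPy. The input and kernel sizes here are hypothetical, and, as is conventional in deep learning, "convolution" is implemented as cross-correlation (the kernel is not flipped):

```python
# Valid 2-D convolution: each output entry is the dot product of the
# kernel with the patch of the input it currently overlaps.
import numpy as np

def conv2d_valid(x, k):
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))   # feature map ("valid" size)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

rng = np.random.default_rng(3)
x = rng.normal(size=(15, 15))   # hypothetical input matrix for the layer
k = rng.normal(size=(3, 3))     # hypothetical 3x3 convolution kernel
fmap = conv2d_valid(x, k)
print(fmap.shape)  # (13, 13): the kernel fits 13 positions along each axis
```

A pooling layer would then downsample this feature map (e.g. taking the maximum over each 2 × 2 block) before it is passed on, which is what builds the spatial hierarchy of features.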