2.5.2 Tokenization of the data: Character tokenization was performed to convert the text (Table 2) into lists of characters using the Keras pre-processing library [29]. The tokenizer builds a corpus of all characters and assigns an integer to each character. After tokenization, the 1584x15 matrix of characters was converted to a matrix of integers (Table 3); 15 is the sequence length, which results from a window size of 7. These integers, rather than the alphabetical characters (column 2 of Table 2), were then used as the X parameters.
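As an illustration, a minimal sketch of this tokenization step is given below, assuming the Keras Tokenizer with char_level=True; the example windows are hypothetical stand-ins for the sequences of Table 2.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical 15-character sequence windows standing in for Table 2;
# the real windows come from the dataset described above.
windows = ["ACDEFGHIKLMNPQR", "STVWYACDEFGHIKL"]

# char_level=True makes the tokenizer build a corpus of individual
# characters and assign an integer index to each one.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(windows)

# Each 15-character window becomes a list of 15 integers (cf. Table 3).
X = tokenizer.texts_to_sequences(windows)
```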
The Y parameter was encoded using scikit-learn's LabelBinarizer [30], which accepts categorical data as input and returns a binary NumPy array.
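A corresponding sketch of the label encoding, with hypothetical class labels standing in for the actual categories:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical categorical Y labels; the actual classes are those of
# the dataset described above.
labels = ["positive", "negative", "positive", "negative"]

lb = LabelBinarizer()
Y = lb.fit_transform(labels)  # binary NumPy array, one row per data point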
The dataset was randomly split into training and test sets in a ratio of approximately 2:1 using the scikit-learn function train_test_split. Thus, the 1584 data points yielded 1061 entries for training and 523 for testing.
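The split can be reproduced as sketched below on stand-in arrays of the stated dimensions; a test fraction of 0.33 recovers the 1061/523 partition of 1584 points, and the random seed shown is an arbitrary assumption rather than the one actually used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the dimensions stated above (1584 windows of
# length 15, one binarized label per window).
X = np.zeros((1584, 15), dtype=int)
Y = np.zeros((1584, 1), dtype=int)

# test_size=0.33 yields 1061 training and 523 test entries;
# random_state=0 is an arbitrary choice for reproducibility.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=0
)
```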