2.2. Data Preprocessing
There were no missing values in the available dataset regarding COVID-19
severity and other attributes of the patients. However, there was a
problem of class imbalance between categories of COVID-19 severity. The
class imbalance problem affects data when class distributions are
strongly imbalanced. Many classification learning algorithms in this
context have poor predictive accuracy for the rare class. The proportion
of cases in a data set that belongs to each class plays an important
role in machine learning. The real-world data, however, often suffer
from class imbalances. It is harder to deal with multi-class tasks that
require different class misclassification costs than with two-class
tasks (8). The Sample (Balance) operator in the RapidMiner software was
used to solve the class imbalance problem that emerged in this study. An
automated balancing of an example set with labels is achieved by the
Sample (Balance) operator, which allows up and downsampling (9). For
feature/attribute selection, the most important attributes of the given
data set were selected by the Optimize Selection (Evolutionary operator
in the RapidMiner software, which uses a Genetic Algorithm. A genetic
algorithm (GA) is a search heuristic that imitates the natural evolution
process. This heuristic is regularly used to construct helpful solutions
to problems of optimization. Genetic algorithms belong to the broader
class of evolutionary algorithms (EA), which use techniques inspired by
natural evolution, such as inheritance, mutation, selection, and
crossover, to produce solutions to optimization problems (10).