Dataset Description
The dataset contains records of 366 patients. Each record has 12 clinical features and 22 histopathological features, all of which are used to train the models. The family history attribute is categorical: it has a value of 1 if any one of these diseases has been observed in the patient's family, and 0 otherwise. The age attribute is numerical and gives the age of the patient. The remaining features are categorical and take values from 0 to 3.
Here:
0 - Symptom not present in the patient
1 - Symptom present in small amounts
2 - Symptom present in moderate amounts
3 - Symptom present in large amounts
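As a minimal sketch of the encoding described above, the following Python snippet checks one patient record against it. The feature names (`symptom_a`, `symptom_b`, `family_history`, `age`) and the function `validate_record` are hypothetical, since the dataset's actual column names are not given here.

```python
# Meaning of the graded values 0-3, as listed in the dataset description.
SEVERITY = {
    0: "not present",
    1: "small amounts",
    2: "moderate amounts",
    3: "large amounts",
}


def validate_record(record, graded, binary=("family_history",), age_key="age"):
    """Return a list of problems found in one patient record (empty if valid).

    `graded` names the features expected to hold a grade from 0 to 3.
    All feature names are placeholders, not the dataset's real column names.
    """
    problems = []
    for name in graded:
        value = record.get(name)
        if value not in SEVERITY:
            problems.append(f"{name}: expected a grade from 0 to 3, got {value!r}")
    for name in binary:
        value = record.get(name)
        if value not in (0, 1):
            problems.append(f"{name}: expected 0 or 1, got {value!r}")
    age = record.get(age_key)
    if not isinstance(age, (int, float)) or age < 0:
        problems.append(f"{age_key}: expected a non-negative number, got {age!r}")
    return problems


# Hypothetical records, for illustration only.
good = {"symptom_a": 2, "symptom_b": 0, "family_history": 1, "age": 45}
bad = {"symptom_a": 5, "symptom_b": 1, "family_history": 2, "age": 30}

good_problems = validate_record(good, graded=["symptom_a", "symptom_b"])
bad_problems = validate_record(bad, graded=["symptom_a", "symptom_b"])
for line in bad_problems:
    print(line)
```

A check of this kind is cheap to run before training and catches values that fall outside the documented ranges.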
Modeling
Multinomial regression is first applied to the dataset as an initial benchmark. Other machine learning techniques, such as decision tree models (rpart and C5.0), random forest, and gradient boosting, are then applied. Most of the models gave acceptable results as an initial benchmark, but hyperparameter tuning is still needed because the predictions must be definite and accurate.
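To illustrate the kind of benchmark model named above, here is a minimal sketch of multinomial (softmax) regression trained by stochastic gradient descent in plain Python. It is not the implementation used in this work. The data is synthetic, generated only so that features take values from 0 to 3 as in the dataset, and the function names, learning rate, and epoch count are assumptions.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_multinomial(X, y, n_classes, lr=0.1, epochs=300):
    """Fit one weight vector per class (bias stored last) with plain SGD."""
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            row = list(xi) + [1.0]  # append a constant 1 for the bias term
            probs = softmax([sum(w * v for w, v in zip(Wk, row)) for Wk in W])
            for k in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the class-k score.
                err = probs[k] - (1.0 if k == yi else 0.0)
                for j, v in enumerate(row):
                    W[k][j] -= lr * err * v
    return W


def predict(W, xi):
    """Return the index of the class with the highest score."""
    row = list(xi) + [1.0]
    scores = [sum(w * v for w, v in zip(Wk, row)) for Wk in W]
    return scores.index(max(scores))


# Synthetic data (NOT the patient data): three classes, three features in 0-3.
# The feature matching the class label is graded 2 or 3, the others 0 or 1.
rng = random.Random(0)
X, y = [], []
for _ in range(60):
    label = rng.randrange(3)
    X.append([rng.choice([2, 3]) if j == label else rng.choice([0, 1])
              for j in range(3)])
    y.append(label)

W = fit_multinomial(X, y, n_classes=3)
accuracy = sum(predict(W, xi) == yi for xi, yi in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

Training accuracy on such an easy synthetic set says nothing about the patient data; the sketch only shows the mechanics of the model.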
Plan of Action
Hyperparameter tuning of the models is needed to produce a better system. The data also needs further evaluation to yield new insights and a better understanding of the need for classification.
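The tuning planned above could be organised as a grid search scored by k-fold cross-validation. The sketch below is one possible arrangement in plain Python, not the tuning procedure of this work: `k_fold_indices`, `cross_val_accuracy`, and `grid_search` are hypothetical helpers, the nearest-neighbour classifier is only a stand-in for the models listed earlier, and the ten data points are made up.

```python
import itertools
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size.

    Assumes the rows have already been shuffled (or interleaved by class).
    """
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(fit, predict, X, y, params, k=5):
    """Mean held-out accuracy of fit(..., **params) over k folds."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        hits = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)


def grid_search(fit, predict, X, y, grid, k=5):
    """Try every combination in `grid`; keep the first one with the best score."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit, predict, X, y, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in model: k-nearest neighbours with one tunable hyperparameter.
def fit_knn(X, y, n_neighbors=1):
    return (X, y, n_neighbors)


def predict_knn(model, x):
    X, y, n = model
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
    return Counter(y[i] for i in order[:n]).most_common(1)[0][0]


# Made-up points with features in 0-3, classes interleaved across the folds.
X = [[0, 0], [3, 3], [0, 1], [3, 2], [1, 0], [2, 3], [1, 1], [2, 2], [0, 0], [3, 3]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

best_params, best_score = grid_search(fit_knn, predict_knn, X, y,
                                      grid={"n_neighbors": [1, 3]}, k=5)
print(best_params, best_score)
```

Because `grid_search` only needs a `fit` and a `predict` callable, the same loop could wrap any of the models under consideration, with the grid holding that model's own hyperparameters.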