2.6 Model training and validation
After feature selection and visualisation (including potential reclassification of behaviour types), the user can train a supervised machine learning model (XGBoost in this package) on the selected, most relevant features with the function train_model. The construction and evaluation of supervised machine learning models usually comprises three steps: (i) tuning the model hyperparameters by cross-validation, (ii) training the model with the optimal hyperparameter set, and (iii) evaluating model performance through validation on a test dataset. train_model is a wrapper function that uses the relevant functions from the “caret” package to conduct these three steps automatically.
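For readers unfamiliar with this workflow, the three steps can be sketched directly with “caret”. This is only an illustration of what such a wrapper does, not the package’s actual implementation; the toy feature columns (X1–X3) and the label column “behaviour” are hypothetical.

```r
# Minimal sketch of the three steps, using "caret" directly.
# The data frame and column names here are invented for illustration.
library(caret)

set.seed(1)
df <- data.frame(X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100),
                 behaviour = factor(sample(c("rest", "fly"), 100, replace = TRUE)))

# Hold out a test set for the later validation step (iii)
idx   <- createDataPartition(df$behaviour, p = 0.75, list = FALSE)
train_set <- df[idx, ]
test_set  <- df[-idx, ]

# (i) hyperparameter tuning by cross-validation and (ii) model training
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(behaviour ~ ., data = train_set,
              method = "xgbTree", trControl = ctrl)

# (iii) evaluate performance on the held-out test set
confusionMatrix(predict(fit, test_set), test_set$behaviour)
```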
Four arguments of train_model control the training and validation process. The features used for model building are set by “df”, which in the following example is set to “selection$features[1:6]” (i.e. the first six features retained by the feature selection procedure). The “vec_label” argument passes a vector of behaviour types. The hyperparameter set is chosen via “hyper_choice”, which has two options: “defaults” lets XGBoost use its default hyperparameters (nrounds = 10), while “tune” runs repeated cross-validations to find the best set. The hyperparameter grid searched inside this function is based on our previous experience with a range of different ACC datasets (Hui et al., in prep): nrounds = c(5, 10, 50, 100), max_depth = c(2, 3, 4, 5, 6), eta = c(0.01, 0.1, 0.2, 0.3), gamma = c(0, 0.1, 0.5), colsample_bytree = 1, min_child_weight = 1, subsample = 1. Finally, “train_ratio” determines the proportion of the data used to train the model, the remainder being used for model validation.
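Put together, a call with these four arguments might look as follows. The objects “selection” and “df_labels” are assumed to come from the earlier feature-selection step of the workflow; their exact names in the user’s session may differ.

```r
# Hedged example of a train_model call using the four arguments described above.
# "selection" (feature-selection output) and "df_labels" are assumed objects.
model_output <- train_model(df           = selection$features[1:6],  # first six selected features
                            vec_label    = df_labels$behaviour,      # vector of behaviour types
                            hyper_choice = "tune",                   # or "defaults"
                            train_ratio  = 0.75)                     # 75% train, 25% validation
```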
The output consists of four parts. The first is a confusion matrix depicting how well the final behaviour classification model predicts the different behaviours, based on the validation part of the dataset only (i.e. 25% of the dataset in our stork example, using a train_ratio of 0.75). In this table, observed behaviours are organised in columns and predicted behaviours in rows, so correct predictions appear on the diagonal and all incorrect predictions off the diagonal. The overall performance statistics are presented next; their meaning is explained in detail at <https://topepo.github.io/caret/measuring-performance.html>. The third part of the output, statistics by class, presents a range of performance statistics for the individual behavioural categories, which are explained in detail at the same page. Finally, the importance of the various features in the behaviour classification model is presented.
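The relationship between the confusion matrix and the overall accuracy can be illustrated with a small invented example (the behaviour categories and counts below are hypothetical, not results from the stork dataset):

```r
# Toy confusion matrix: observed behaviour in columns, predicted in rows,
# as in the train_model output described above. Counts are invented.
cm <- matrix(c(40, 2,    # observed "rest": 40 predicted rest, 2 predicted fly
               3, 35),   # observed "fly":  3 predicted rest, 35 predicted fly
             nrow = 2,
             dimnames = list(predicted = c("rest", "fly"),
                             observed  = c("rest", "fly")))

# Overall accuracy = correct predictions (diagonal) / all predictions
sum(diag(cm)) / sum(cm)   # (40 + 35) / 80 = 0.9375
```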
Another way of calculating and visualising the performance of the behavioural classification model uses cross-validation through the function plot_confusion_matrix. Here the entire dataset is randomly partitioned into five parts. In five consecutive rounds, each part serves once as the validation set while the remaining four parts are used for model training. This procedure thus resembles running the “classification model training and validation” five times with a train_ratio of 0.8, except that the dataset is divided systematically and every point is used for validation at some stage (see the function createFolds in “caret” for more details). After all five training and validation rounds, every behavioural observation therefore has an associated predicted behaviour; these predictions are stored in the data frame returned by plot_confusion_matrix, in addition to a plot of the confusion table (Fig. 7).
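The systematic five-fold partitioning behind this procedure can be sketched with caret::createFolds itself; the label vector below is invented for illustration.

```r
# Sketch of the five-fold partitioning underlying the cross-validation above.
# "labels" is a hypothetical vector of observed behaviour types.
library(caret)

set.seed(1)
labels <- factor(sample(c("rest", "fly", "forage"), 100, replace = TRUE))

# createFolds returns five roughly class-balanced sets of row indices
folds <- createFolds(labels, k = 5)

# In round i, folds[[i]] is the validation set and the other four folds
# train the model, so every observation is validated exactly once.
sapply(folds, length)   # ~20 observations per fold
```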