K-Fold Cross-Validation.
‘K-fold cross-validation’ (Figure 2) and ‘bootstrapping’ (Figure 3) are
two popular methods that improve upon the split-sample method,
producing better results in terms of bias and variability. K-fold
cross-validation and bootstrapping are also preferable when the sample
size is small and when external validation is not readily available.
Cross-validation is a resampling procedure used primarily to evaluate
the performance of a prediction model on unseen data, particularly
when the data set is small. The purpose is to see how the model performs
in general when it is used to predict data that were not used to develop
it. K-fold cross-validation has only one parameter, "k", which refers to
the number of groups (folds) into which a given data set is to be split.
If a specific value of k is chosen, such as k = 10, the procedure is
accordingly called 10-fold cross-validation.
In k-fold cross-validation, each observation in the data set is allotted
to a specific subsample and remains in that subsample for the entire
duration of the procedure. The procedure starts by randomly partitioning
the original sample into k subsamples of roughly equal size. One of
these k subsamples is then held out as the validation data to test the
model, and the remaining k − 1 subsamples are used as training data to
derive the model. This process is repeated k times (the folds), with
each of the k subsamples used exactly once as the validation data.
Finally, the results of the k runs are combined, typically by averaging,
to produce a single performance estimate.
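As an illustrative sketch (not from the source), the partition-train-validate loop described above can be written in plain Python. The "model" here is a hypothetical stand-in that simply predicts the mean of its training targets, so the focus stays on the fold mechanics:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [idx[i::k] for i in range(k)]

def cross_validate(x, y, k=10):
    """Run k-fold CV with a toy model that predicts the training mean."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i in range(k):
        val = set(folds[i])                      # one fold held out for validation
        train_y = [y[j] for j in range(len(y)) if j not in val]
        mean_pred = sum(train_y) / len(train_y)  # "derive" the model on k-1 folds
        # Mean squared error on the held-out fold
        mse = sum((y[j] - mean_pred) ** 2 for j in folds[i]) / len(folds[i])
        scores.append(mse)
    return sum(scores) / k                       # average the k fold results
```

Note that every index lands in exactly one fold, mirroring the rule that each observation serves as validation data exactly once.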
Choosing an appropriate value of k is important to avoid
misrepresenting the performance of the model [4].
When choosing the value of k, we need to be careful that each subsample
(particularly the validation set) is large enough to reasonably
represent the whole data set. More splits reduce the size of the
validation set, leaving too few observations to evaluate model
performance fairly and confidently [4]. On the other hand, too few
splits do not provide enough trained models to evaluate [4]. In
addition, a higher value of k is associated with less bias (the
difference between the estimated and true values of performance) but
more variability (the performance of the model may change according to
the data set used to fit it) and more computation, whereas a lower value
of k is associated with more bias but less variability and computation.
Although there is no formal rule, k is usually chosen as 5 or 10, which
often provides a good compromise in this bias-variance tradeoff [4].
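A quick arithmetic illustration of this tradeoff (my own example, with a hypothetical sample size, not from the source): as k grows, each validation fold shrinks, which is why very large k leaves little data with which to evaluate each trained model:

```python
n = 100  # hypothetical sample size, chosen only for illustration
# For each k, roughly n/k observations validate and the rest train.
sizes = {k: (n - n // k, n // k) for k in (2, 5, 10, 20, 50)}
for k, (train, val) in sizes.items():
    print(f"k={k:2d}: ~{train} training / ~{val} validation observations per fold")
# k=2 halves the data, while k=50 leaves only 2 observations
# to validate each model.
```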
One disadvantage of k-fold cross-validation is its high variance, which
makes it less attractive [4]. However, with a large training set,
repeating the whole k-fold validation process multiple times (e.g., 50
repetitions of 10-fold cross-validation) yields stable results that
effectively increase the precision of the model estimates while still
maintaining a small bias [2].
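This repetition can be sketched as follows (an illustrative assumption, again using a toy model that predicts the training mean): run the whole k-fold procedure several times with a different random shuffle each time, then average the per-run estimates:

```python
import random
import statistics

def one_kfold_estimate(y, k, seed):
    """One full k-fold run with a toy mean-predictor model; returns mean MSE."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    mses = []
    for fold in folds:
        held_out = set(fold)
        train = [y[j] for j in idx if j not in held_out]
        pred = sum(train) / len(train)          # "model" fit on k-1 folds
        mses.append(sum((y[j] - pred) ** 2 for j in fold) / len(fold))
    return sum(mses) / k

def repeated_kfold(y, k=10, repeats=50):
    """Average the estimate over many reshuffled k-fold runs for stability."""
    return statistics.mean(one_kfold_estimate(y, k, seed) for seed in range(repeats))
```

Each repetition reshuffles before partitioning, so the variability due to any one particular split is averaged away.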
K-fold cross-validation has the big advantage that all observations are
used both to derive and to validate the model, with each observation
used exactly once for validation. As a result, the procedure is less
likely to suffer from a biased division of the data.