K-Fold Cross-Validation.
‘K-fold cross-validation’ (Figure 2) and ‘bootstrapping’ (Figure 3) are two popular methods that improve upon the split-sample method and produce better results in terms of bias and variability. They are also better suited to situations where the sample size is small or external validation is not readily available.
Cross-validation is a resampling procedure used primarily to evaluate the performance of prediction models on an unseen data set, particularly when the data set is small. The purpose is to see how the model performs, in general, when used to predict data that were not used to develop it. K-fold cross-validation has only one parameter, "k", which refers to the number of groups (folds) into which a given data set is to be split. If a specific value of k is chosen, such as k = 10, the procedure is accordingly called 10-fold cross-validation.
K-fold cross-validation starts by randomly partitioning the original sample into k roughly equal-size subsamples; each observation is allotted to a specific subsample and remains in it for the entire procedure. One of the k subsamples is kept as the validation data to test the model, and the remaining k − 1 subsamples are used as training data to derive the model. This process is repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. Finally, the results from the k folds are summarized into a single estimate by averaging (or otherwise combining) them.
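The procedure above can be sketched in plain Python. The helper `k_fold_cv` and the toy mean-predictor "model" below are illustrative assumptions for this sketch, not part of any particular library:

```python
import random

def k_fold_cv(data, k, fit, score):
    """Estimate model performance with k-fold cross-validation.

    data : list of (x, y) observations
    k    : number of folds
    fit  : function(train) -> model
    score: function(model, validation) -> float (e.g., an error measure)
    """
    data = data[:]                          # copy so the caller's list is untouched
    random.shuffle(data)                    # random partition of the original sample
    folds = [data[i::k] for i in range(k)]  # k roughly equal-size subsamples
    results = []
    for i in range(k):                      # each subsample is the validation set once
        validation = folds[i]
        train = [obs for j, fold in enumerate(folds) if j != i for obs in fold]
        model = fit(train)                  # derive the model on the other k-1 folds
        results.append(score(model, validation))
    return sum(results) / k                 # average the k fold results

# Toy illustration: the "model" is just the mean of the training targets,
# scored by mean absolute error on the held-out fold.
random.seed(0)
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]
fit_mean = lambda train: sum(y for _, y in train) / len(train)
mae = lambda model, val: sum(abs(y - model) for _, y in val) / len(val)
estimate = k_fold_cv(data, k=10, fit=fit_mean, score=mae)
```

Any real model-fitting routine and performance measure can be substituted for `fit_mean` and `mae`; the fold bookkeeping stays the same.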
Choosing an appropriate value of k is important to avoid misrepresenting the performance of the model [4]. The value of k should be chosen so that each subsample (in particular, the validation set) is large enough to reasonably represent the whole data set. Too many splits shrink the validation set, leaving too few observations to evaluate the model's performance fairly and confidently [4]; too few splits, on the other hand, do not provide enough trained models to evaluate [4]. In addition, a higher value of k is associated with less bias (the difference between the estimated and true values of performance) but more variability (the model's performance may change according to the data set used to fit it) and more computation, whereas a lower value of k is associated with more bias but less variability and computation. Although there is no formal rule, k is usually chosen to be 5 or 10, as this often provides a good compromise in the bias-variance tradeoff [4].
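A quick way to see the tradeoff is to tabulate how the training and validation sets divide for different values of k (the sample size of 100 here is hypothetical):

```python
n = 100  # hypothetical total number of observations

for k in (2, 5, 10, 20, 50):
    val_size = n // k            # approximate size of each validation fold
    train_size = n - val_size    # observations left to derive the model
    print(f"k = {k:2d}: derive on {train_size} observations, validate on {val_size}")
```

As k grows, each model is derived on more data (less bias), but each validation fold shrinks, so the fold-level estimates become more variable, and more models must be fitted.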
One disadvantage of k-fold cross-validation is its high variance, which makes it less attractive [4]. However, repeating the whole k-fold procedure multiple times on a large training set (e.g., 50 repetitions of 10-fold cross-validation) yields stable results that effectively increase the precision of the model estimates while still maintaining a small bias [2]. A major advantage of k-fold cross-validation is that every observation is used both to derive and to validate the model, with each observation used exactly once for validation. As a result, the process is less likely to suffer from a biased division of the data.
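The repetition idea can be sketched as a wrapper that reruns the whole k-fold procedure with a fresh random partition each time and averages the estimates. The function names and the toy mean-predictor "model" are assumptions for illustration, not a standard API:

```python
import random

def one_k_fold_pass(data, k, rng):
    """One complete k-fold run: average mean-absolute-error of a
    mean-predictor 'model' across the k validation folds."""
    data = data[:]
    rng.shuffle(data)                       # a fresh random partition each repetition
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        mean = sum(train) / len(train)      # the toy "model"
        scores.append(sum(abs(y - mean) for y in folds[i]) / len(folds[i]))
    return sum(scores) / k

def repeated_k_fold(data, k, repeats, seed=0):
    """Repeat the whole k-fold procedure `repeats` times and average the
    estimates, reducing the variance of the final performance estimate."""
    rng = random.Random(seed)
    return sum(one_k_fold_pass(data, k, rng) for _ in range(repeats)) / repeats

rng = random.Random(42)
data = [rng.gauss(0, 1) for _ in range(200)]
stable_estimate = repeated_k_fold(data, k=10, repeats=50)
```

A single 10-fold pass would return a value that shifts with the particular random partition; averaging over 50 such passes damps that partition-to-partition variability.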