Bootstrap Validation.
The bootstrap method is a resampling technique often used to estimate statistics on a population and to validate a model by sampling a dataset with replacement. It allows us to use a computer to mimic the process of obtaining new data sets, so that the variability of the estimates can be assessed without collecting additional samples. Instead of repeatedly obtaining independent data sets from the population, which is often not realistic, bootstrapping obtains distinct data sets by repeatedly sampling from the original data set with replacement. The idea behind bootstrapping is that the original observed data take the place of the population of interest, and each bootstrap sample represents a sample from that population.
Bootstrap samples are the same size as the original sample and are drawn randomly with replacement from it. In sampling with replacement, after a data point (observation) is selected for the subsample, it remains available for further selection. As a result, some observations are represented multiple times in a bootstrap sample while others may not be selected at all. Because of this overlap with the original data, on average roughly two-thirds (about 63.2%) of the original data points appear in each bootstrap sample4. The observations not included in a bootstrap sample are called “out-of-bag” samples. When performing the bootstrap, two things must be specified: the size of the sample and the number of repetitions of the procedure. A common practice is to use a sample size equal to that of the original data set and a large number of repetitions (50-200) to obtain a stable performance estimate2, 4.
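The sampling-with-replacement mechanics above can be sketched as follows; the data and the helper function are illustrative assumptions, not part of any particular software package. With a reasonably large sample, the fraction of unique in-bag observations settles near 1 − 1/e ≈ 63.2%, matching the two-thirds figure quoted in the text.

```python
import random

def bootstrap_sample(data, rng):
    """Draw one bootstrap sample: same size as the data, with replacement."""
    n = len(data)
    indices = [rng.randrange(n) for _ in range(n)]  # duplicates are allowed
    chosen = set(indices)
    in_bag = [data[i] for i in indices]
    out_of_bag = [data[i] for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(0)
data = list(range(1000))  # stand-in for 1,000 original observations
in_bag, oob = bootstrap_sample(data, rng)

# Roughly 63.2% of unique observations land in the bag on average,
# leaving about a third "out of bag".
unique_fraction = len(set(in_bag)) / len(data)
oob_fraction = len(oob) / len(data)
print(round(unique_fraction, 3), round(oob_fraction, 3))
```

Note that the in-bag list still has the full original length; it is the number of *distinct* observations that shrinks to about two-thirds.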
In the bootstrap method, a prediction model is developed in each bootstrap sample, and measures of predictive ability, such as the C-statistic, are estimated in that sample. Each bootstrap model is then applied to the original dataset, and the same predictive measure (C-statistic) is estimated there. The difference between the two performances indicates optimism, which is estimated by averaging these differences across all bootstrap samples. Finally, this estimate of optimism is subtracted from the performance of the prediction model developed in the original data to give an optimism-adjusted measure of the model's predictive ability2.
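The optimism-correction procedure can be sketched in code. This is a minimal illustration, not a clinical implementation: a least-squares linear score stands in for a full regression model, the data are synthetic, and all names are illustrative assumptions. The structure, however, follows the text: fit in each bootstrap sample, score on both the bootstrap sample and the original data, average the differences, and subtract from the apparent performance.

```python
import numpy as np

def c_statistic(y, scores):
    """Concordance (C-statistic): chance a random event outranks a random non-event."""
    pos, neg = scores[y == 1], scores[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def fit(X, y):
    """Least-squares linear score as a stand-in for a regression model."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

rng = np.random.default_rng(0)
n, p = 300, 5
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

# Apparent performance: model developed and evaluated on the same (original) data.
beta_orig = fit(X, y)
apparent = c_statistic(y, predict(beta_orig, X))

B = 100  # number of bootstrap repetitions
optimism = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                          # sample with replacement
    beta_b = fit(X[idx], y[idx])                              # model from bootstrap sample
    perf_boot = c_statistic(y[idx], predict(beta_b, X[idx]))  # performance in bootstrap sample
    perf_orig = c_statistic(y, predict(beta_b, X))            # same model on original data
    optimism.append(perf_boot - perf_orig)

adjusted = apparent - np.mean(optimism)  # optimism-adjusted C-statistic
print(round(apparent, 3), round(adjusted, 3))
```

The adjusted value is typically slightly below the apparent one, reflecting the optimism that evaluating a model on its own training data introduces.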
Bootstrap samples overlap substantially with the original data (roughly two-thirds), which makes in-bag performance estimates optimistic: the true prediction error is underestimated. This is considered a disadvantage of the method. However, the issue can be addressed by making predictions only on the observations that were not selected by the bootstrap (the out-of-bag samples) and estimating model performance on them. Bootstrapping is more complex to analyze and interpret because of the methods involved and the amount of computation required. However, with a large number of repetitions it provides more stable results (less variance) than other methods.
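The out-of-bag idea can be sketched briefly: in each repetition, the "model" (here a deliberately simple threshold rule on synthetic data, an illustrative assumption) is fit on the in-bag observations and evaluated only on the observations the bootstrap draw never selected.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = (x + rng.normal(scale=0.5, size=n) > 0).astype(int)  # noisy binary outcome

oob_accuracies = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)          # bootstrap draw, with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # observations never selected (out-of-bag)
    threshold = x[idx].mean()                 # toy "model" fit on in-bag data only
    pred = (x[oob] > threshold).astype(int)
    oob_accuracies.append((pred == y[oob]).mean())

print(round(float(np.mean(oob_accuracies)), 3))
```

Because the out-of-bag observations never inform the fit, averaging performance over them avoids the optimism caused by the in-bag overlap.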
It is clear that each internal model validation technique has advantages and disadvantages, and no one method is uniformly better than another4. Researchers hold differing opinions on the appropriate method for internal model validation. Several factors, such as sample size, the best indicators of a model's performance, and the need to choose between models, should be considered before making the choice4.
The procedures described above pertain to internal validation, which does not examine the generalizability of a model. To assess generalizability, it is necessary to use new data that were not used in the development process, collected from an appropriate (representative) patient population.