Figure 1. The machine learning and evaluation scheme of rwTTD prediction. a. Calculation of future time in a censored population. b. Simulation of rwTTD data capturing a variety of factors potentially affecting performance. c-e. Three evaluation schemes used in the study: absolute error, cumulative error and absolute number of error days when 50% of the population is terminated.
We developed three metrics to evaluate model performance (Fig. 1c-e). For the first metric, “absolute error”, we summed the difference between the predicted curve and the gold standard curve from day 0 to a specified date (day 1000, unless otherwise specified in this paper), and then divided the total by the number of days. Because the differences are signed, if the predicted curve runs above the gold standard curve in the first half but below it in the latter half, the errors can cancel out under this metric. For the second metric, “cumulative error”, we summed the absolute difference between the two curves on each day from day 0 to the specified date, and then divided by the number of days; positive and negative errors therefore aggregate rather than cancel. For the third metric, “absolute date error at 50% terminated”, we calculated the absolute difference in days between the point at which the gold standard curve and the point at which the predicted curve reach 50% of patients terminated (0.5 on the y-axis of the termination curve). Together, the three metrics capture aspects important in drug administration.
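The three metrics can be sketched as follows. The function and argument names are illustrative, and both curves are assumed to be arrays of daily survival fractions (the fraction of patients not yet terminated on each day):

```python
import numpy as np

def rwttd_metrics(pred_curve, gold_curve, max_days=1000):
    """Evaluate a predicted termination curve against the gold standard.

    Both inputs are arrays of the fraction of patients not yet
    terminated, indexed by day, covering at least `max_days` days.
    """
    p = np.asarray(pred_curve[:max_days], dtype=float)
    g = np.asarray(gold_curve[:max_days], dtype=float)

    # Metric 1: "absolute error" -- signed per-day differences, so
    # over- and under-estimation on different days can cancel out.
    absolute_error = np.sum(p - g) / max_days

    # Metric 2: "cumulative error" -- per-day absolute differences,
    # which aggregate regardless of sign.
    cumulative_error = np.sum(np.abs(p - g)) / max_days

    # Metric 3: difference in days between the two curves' first
    # crossing of the 50%-terminated threshold (0.5 on the y-axis).
    day_pred = int(np.argmax(p <= 0.5))
    day_gold = int(np.argmax(g <= 0.5))
    date_error_at_50 = abs(day_pred - day_gold)

    return absolute_error, cumulative_error, date_error_at_50
```

Note that metric 3 assumes both curves actually cross 0.5 within `max_days`; curves that never reach 50% terminated would need separate handling.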
Of note, when trained as a standard machine learning model, a model can only generate a prediction of each individual’s expected future time in the test set. When we aggregate these predictions into a termination curve, the resulting curve is tightly centered around the average expected future time and deviates substantially from the true distribution (Fig. 2a-c). This is an innate property of most machine learning algorithms: when minimizing squared error or a similar loss function, the predicted values tend to concentrate around the mean.
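A minimal numpy-only illustration of this shrinkage effect, with invented numbers: a binned-mean predictor stands in for any squared-loss regression model, since such models approximate the conditional mean E[y | x]. The spread of its predictions is far narrower than the spread of the true values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 1, size=n)
# Simulated future time: weak dependence on x plus large individual noise.
y = 500 + 100 * x + rng.normal(scale=200, size=n)

# A squared-loss-optimal predictor approximates E[y | x]; here the
# per-bin mean of y plays that role.
bins = np.clip((x * 20).astype(int), 0, 19)
bin_means = np.array([y[bins == b].mean() for b in range(20)])
pred = bin_means[bins]

# Well below 1: the predictions cluster near the overall mean, so the
# aggregated curve misses the tails of the true distribution.
shrinkage = np.std(pred) / np.std(y)
```

The same concentration occurs for any regressor trained on a weakly predictive signal, which is why the aggregated test-set curve deviates from the observed one.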
To counteract this effect, we further divided the training set into a train set, from which the model parameters are learned, and a validation set, from which the distribution of prediction values is obtained. The prediction values from the validation set and the corresponding future times serve as a reference for interpolating the prediction results on the test set. In this study, we used first-order interpolation, with extrapolation when test-set prediction values fell outside the range of the validation set. Through this interpolation, we generated a distribution resembling the observed future-time distribution of the test set. To further illustrate the three metrics used in this study, we plotted the percentage error under both the absolute error and the cumulative error for ExtraTreeRegressor across different numbers of maximal dates considered, as well as the absolute date error when 50% of the population is terminated (Fig. 2d-e).
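One plausible reading of this interpolation step, sketched in numpy, is a quantile-style mapping of test-set prediction values onto the validation set's observed future times. All names are illustrative; note that `np.interp` clamps at the endpoints, so first-order (linear) extrapolation beyond the validation range is added by hand, assuming at least two distinct reference points:

```python
import numpy as np

def interp_with_extrap(x, xp, fp):
    """First-order interpolation that also extrapolates linearly
    beyond the reference range (np.interp alone clamps at the ends)."""
    y = np.interp(x, xp, fp)
    below, above = x < xp[0], x > xp[-1]
    y[below] = fp[0] + (x[below] - xp[0]) * (fp[1] - fp[0]) / (xp[1] - xp[0])
    y[above] = fp[-1] + (x[above] - xp[-1]) * (fp[-1] - fp[-2]) / (xp[-1] - xp[-2])
    return y

def calibrate(test_pred, val_pred, val_future_time):
    """Map raw test-set predictions onto the validation set's observed
    future-time distribution via rank-matched (quantile) interpolation."""
    xp = np.sort(np.asarray(val_pred, dtype=float))
    fp = np.sort(np.asarray(val_future_time, dtype=float))
    return interp_with_extrap(np.asarray(test_pred, dtype=float), xp, fp)
```

Because the reference future times come from observed validation data, the calibrated test-set predictions recover the spread of the true distribution rather than collapsing toward the mean.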