Calibration.
The agreement between observed outcomes and predictions made by the
model is referred to as calibration [1]. Model
calibration measures the validity of the predictions and determines
whether the predictions based on the risk prediction model align with
what is observed within the study cohort. For example, if we predict a
20% risk that a person will develop hypertension, the observed
frequency of hypertension should be 20 out of 100 people with such a
prediction. Calibration plot is a method that visually inspects
calibration and presents plot for predicted against observed
probabilities. It also uses the Hosmer-Lemeshow test to assess
calibration. In a calibration plot, predictions are plotted on the
x-axis and the observed outcome on the y-axis. For binary outcomes, the
y-axis holds only the values 0 and 1. Different smoothing
techniques (e.g., the loess algorithm) can be employed to estimate the
observed probabilities of the outcome for the predicted probabilities.
Perfect predictions lie on the 45° line, indicating that predicted
risks are correct. An alternative assessment of calibration is to
categorize predicted risk into groups (e.g., deciles) and assess whether
the event rate corresponds to the average predicted risk in each risk
group. The Hosmer-Lemeshow goodness-of-fit test formalizes this grouped
comparison, testing whether the observed event rates match the
expected event rates in subgroups of the model population.
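As an illustrative sketch (not taken from the cited work), the decile-based comparison and the Hosmer-Lemeshow statistic can be computed with NumPy; the function name `decile_calibration` and the simulated data are assumptions for the example:

```python
import numpy as np

def decile_calibration(y_true, y_prob, n_bins=10):
    """Split subjects into risk deciles and compare the mean predicted
    risk with the observed event rate in each group; also return the
    Hosmer-Lemeshow chi-square statistic over the groups."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    groups = np.array_split(np.argsort(y_prob), n_bins)
    rows, hl = [], 0.0
    for idx in groups:
        n = len(idx)
        expected = y_prob[idx].mean()   # mean predicted risk
        observed = y_true[idx].mean()   # observed event rate
        rows.append((expected, observed, n))
        # Hosmer-Lemeshow contribution: n * (O - E)^2 / (E * (1 - E))
        hl += n * (observed - expected) ** 2 / (expected * (1 - expected))
    return rows, hl

# Simulated, perfectly calibrated predictions: events are drawn with
# exactly the predicted probability, so the deciles should fall near
# the 45-degree line and the HL statistic should be small.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=5000)
y = rng.binomial(1, p)
rows, hl = decile_calibration(y, p)
for expected, observed, n in rows:
    print(f"predicted {expected:.2f}  observed {observed:.2f}  n={n}")
print(f"Hosmer-Lemeshow chi-square: {hl:.1f}")
```

Plotting the `expected` column against the `observed` column gives the grouped version of the calibration plot described above.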
For survival data, calibration is usually assessed at fixed time
points [2]. At each time point, survival rates are
calculated by the Kaplan-Meier method for a group of patients. Then this
observed survival is compared with the mean predicted survival from the
prediction model [2].
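To make the fixed-time-point comparison concrete, here is a minimal Kaplan-Meier estimator in NumPy (a sketch; a real analysis would use a survival library such as lifelines). The observed survival it returns for a group of patients at time t can then be compared with the group's mean predicted survival:

```python
import numpy as np

def km_survival_at(time, event, t):
    """Kaplan-Meier estimate of the survival probability at time t.
    `event` is 1 for an observed event and 0 for censoring."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    surv = 1.0
    for u in np.sort(np.unique(time[event == 1])):  # event times only
        if u > t:
            break
        at_risk = np.sum(time >= u)                 # still under observation
        deaths = np.sum((time == u) & (event == 1))
        surv *= 1.0 - deaths / at_risk
    return surv

# Toy cohort: events at t=1 and t=2, one subject censored at t=3.
time = [1.0, 2.0, 3.0, 4.0]
event = [1, 1, 0, 1]
print(km_survival_at(time, event, 2.0))  # 3/4 * 2/3 = 0.5
```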
Besides the above-mentioned major measures of model assessment, there
are other measures occasionally used to assess a model. Although
calibration and discrimination are considered the most important aspects
of model assessment, they do not capture the
clinical usefulness of a model. Assessing clinical usefulness shows
whether a model supports better decisions than would be made
without it. The measures associated with
clinical usefulness are generally tied to a cutoff, a decision
threshold that classifies individuals into low- and high-risk
groups while balancing the likelihood of benefit against the likelihood of harm. Net
benefit (NB) is one such measure that can be used to assess the clinical
usefulness of a model.
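Net benefit at a threshold p_t is commonly defined as TP/n − FP/n · p_t/(1 − p_t). The sketch below (the function name and toy data are assumptions for illustration) compares a model's net benefit with the treat-all and treat-none strategies:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating subjects whose predicted probability
    meets the threshold: NB = TP/n - FP/n * threshold / (1 - threshold)."""
    y_true = np.asarray(y_true, dtype=int)
    high_risk = np.asarray(y_prob, dtype=float) >= threshold
    n = y_true.size
    tp = np.sum(high_risk & (y_true == 1))  # true positives
    fp = np.sum(high_risk & (y_true == 0))  # false positives
    return tp / n - fp / n * threshold / (1 - threshold)

# Toy data: two events and two non-events.
y = [1, 1, 0, 0]
p = [0.9, 0.8, 0.7, 0.1]
pt = 0.5
nb_model = net_benefit(y, p, pt)        # model-guided treatment
nb_all = net_benefit(y, [1.0] * 4, pt)  # treat everyone
nb_none = 0.0                           # treat no one
print(nb_model, nb_all, nb_none)
```

A model is clinically useful at a given threshold when its net benefit exceeds both the treat-all and treat-none strategies; plotting net benefit across a range of thresholds yields a decision curve.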