Results
Figure S1 contains a flowchart of the reference search, and Table S1 presents an overview of study characteristics for the eight articles that met our inclusion criteria.
Data synthesis
The measurement characteristics, i.e. validity, reliability and validation context, are summarised in the following for each type of assessment tool: 1) global rating scale, 2) global and procedure rating tools combined, 3) task-specific rating tools and 4) non-procedure-specific error assessment. Table S4 presents an overview of each assessment tool using Kane’s validity argument.
1) Global rating scales
Objective Structured Assessment of Technical Skills (OSATS)
Currently, the most widely used and validated assessment scale is OSATS,14 which originally consisted of a task-specific checklist and a global rating scale, the latter of which has been shown to have high reliability and validity and to be applicable at various trainee levels and for a variety of surgical procedures.15
Hiemstra et al.16 used an objective assessment tool to establish learning curves, analysing the OSATS scores of nine trainees over a three-month period. Nineteen types of procedures were identified among the 319 procedures assessed.
The surgical procedures consisted of abdominal hysterectomy (39%), labioplasty (31%), a vaginal approach (20%) and hysteroscopies (10%).
The trainees were instructed to fill out an OSATS assessment sheet after every procedure. A consultant would then perform supervision, discuss the result with the trainee and provide constructive feedback. Summed across the six OSATS domains, scores range from 6 to 30 points, and a score of 24 was selected as the threshold for good surgical performance.
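The scoring arithmetic described here can be illustrated with a minimal sketch. It assumes that each of the six domains is rated from 1 to 5 (which is what the 6 to 30 range implies, but is not stated explicitly) and that a total of 24 or more counts as meeting the threshold. The function names are hypothetical and are not taken from Hiemstra et al.

```python
from typing import Sequence

N_DOMAINS = 6          # six OSATS domains, as reported in the text
MIN_ITEM = 1           # assumption: lowest rating per domain
MAX_ITEM = 5           # assumption: highest rating per domain
THRESHOLD = 24         # threshold for good surgical performance


def osats_total(domain_scores: Sequence[int]) -> int:
    """Sum the six domain ratings into a total score between 6 and 30."""
    if len(domain_scores) != N_DOMAINS:
        raise ValueError(f"expected {N_DOMAINS} domain scores")
    for score in domain_scores:
        if not MIN_ITEM <= score <= MAX_ITEM:
            raise ValueError(f"domain score {score} outside {MIN_ITEM}-{MAX_ITEM}")
    return sum(domain_scores)


def meets_threshold(domain_scores: Sequence[int]) -> bool:
    """Assumption: a total equal to or above the threshold counts as 'good'."""
    return osats_total(domain_scores) >= THRESHOLD
```

Under these assumptions, a trainee rated 4 in every domain reaches exactly 24 points, whereas a single domain rated 3 with the rest at 4 gives 23 points and falls below the threshold.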
To establish construct validity, the authors hypothesised that surgical performance improves over time with increasing procedure-specific experience.16 They found that performance improved by 1.10 OSATS points per assessed procedure (p=0.008; 95% confidence interval (CI) 0.44–1.77) and that the learning curve for a specific procedure passed the threshold of 24 points at a caseload of five procedures. Furthermore, a performance plateau was reached after eight procedures of the same type had been performed.
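As a rough illustration of how a per-procedure slope and a threshold caseload relate, the sketch below fits a straight line to caseload and score pairs and finds the first caseload at which the fitted line reaches the threshold. The statistical model actually used in the study is not described here, so ordinary least squares is an assumption. The intercept of 19.0 in the usage note is purely hypothetical, chosen only for illustration, and the function names are invented.

```python
import math
from typing import Sequence, Tuple


def fit_line(caseloads: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit; returns (slope, intercept).

    Assumption: a simple straight-line model, not necessarily the
    model used in the original study.
    """
    if len(caseloads) != len(scores) or len(caseloads) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(caseloads) / len(caseloads)
    mean_y = sum(scores) / len(scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(caseloads, scores))
    sxx = sum((x - mean_x) ** 2 for x in caseloads)
    if sxx == 0:
        raise ValueError("caseloads must not all be identical")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def caseload_at_threshold(slope: float, intercept: float, threshold: float) -> int:
    """Smallest whole caseload n with intercept + slope * n >= threshold."""
    if slope <= 0:
        raise ValueError("a non-increasing line never reaches the threshold")
    return max(0, math.ceil((threshold - intercept) / slope))
```

For example, with the reported slope of 1.10 points per procedure and a hypothetical intercept of 19.0, `caseload_at_threshold(1.10, 19.0, 24)` gives a caseload of five, which is consistent with the reported finding but does not reproduce the authors' analysis.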