Results
Figure S1 contains a flowchart of the reference search, and Table S1
presents an overview of study characteristics for the eight articles
that met our inclusion criteria.
Data synthesis
The measurement characteristics, i.e. validity, reliability and
validation context, are summarised below for each type of assessment
tool: 1) global rating scales, 2) global and procedure rating tools
combined, 3) task-specific rating tools and 4) non-procedure-specific
error assessment. Table S4 presents an overview of each assessment tool
using Kane’s validity argument.
1) Global rating scales
Objective Structured Assessment of Technical Skills (OSATS)
Currently, the most widely used and validated assessment scale is
OSATS,14 which originally consisted of a task-specific checklist and a
global rating scale. The latter has been shown to have high reliability
and validity and to be applicable at various trainee levels and for a
variety of surgical procedures.15
Hiemstra et al.16 used an objective assessment tool to establish
learning curves, analysing the OSATS scores of nine trainees over a
three-month period. Nineteen types of procedures were identified among
the 319 procedures assessed. These consisted of abdominal hysterectomy
(39%), labioplasty (31%), procedures with a vaginal approach (20%) and
hysteroscopies (10%).
The trainees were instructed to fill out an OSATS assessment sheet after
every procedure. A consultant would then perform supervision, discuss
the result with the trainee and provide constructive feedback. Summed
across the six OSATS domains, total scores range from 6 to 30 points,
and a score of 24 was selected as the threshold for good surgical
performance.
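The scoring rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The domain labels are hypothetical placeholders, since the text only states that there are six domains. The per-domain range of 1 to 5 is an assumption inferred from the total range of 6 to 30, and treating a total of exactly 24 as meeting the threshold is likewise an assumption.

```python
from typing import Mapping

# Hypothetical placeholder labels: the text only says there are six domains.
DOMAINS = tuple(f"domain_{i}" for i in range(1, 7))

# Assumed per-domain range, inferred from the total range of 6 to 30 points.
MIN_RATING, MAX_RATING = 1, 5

# Threshold for good surgical performance, as reported in the text.
THRESHOLD = 24


def osats_total(ratings: Mapping[str, int]) -> int:
    """Sum the six domain ratings into a total OSATS score (6 to 30)."""
    missing = set(DOMAINS) - set(ratings)
    if missing:
        raise ValueError(f"missing domain ratings: {sorted(missing)}")
    for domain in DOMAINS:
        if not MIN_RATING <= ratings[domain] <= MAX_RATING:
            raise ValueError(f"rating out of range for {domain}")
    return sum(ratings[domain] for domain in DOMAINS)


def meets_threshold(ratings: Mapping[str, int]) -> bool:
    """Assumption: a total equal to the threshold counts as meeting it."""
    return osats_total(ratings) >= THRESHOLD
```

For example, a trainee rated 4 in every domain would have a total of 24 and would just meet the threshold under these assumptions.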
To support construct validity, the authors hypothesised that surgical
performance improves over time with increasing procedure-specific
experience.16 They found that performance improved by 1.10 OSATS points
per assessed procedure (p=0.008, 95% confidence interval (CI)
0.44–1.77) and that the learning curve for a specific procedure passed
the threshold of 24 points at a caseload of five procedures.
Furthermore, a performance plateau was reached after eight procedures
of the same type.
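To make the learning-curve arithmetic concrete, the following Python sketch treats the reported improvement of 1.10 points per procedure as the slope of a straight line and finds the first caseload at which the predicted score reaches 24. The linear form is a simplifying assumption and does not model the reported plateau. The baseline score is a hypothetical input, because the text does not report an intercept.

```python
from typing import Optional

SLOPE = 1.10     # OSATS points gained per assessed procedure (from the text)
THRESHOLD = 24   # threshold for good surgical performance (from the text)


def predicted_score(caseload: int, baseline: float, slope: float = SLOPE) -> float:
    """Predicted OSATS total after `caseload` procedures under an assumed linear trend.

    `baseline` is a hypothetical starting score; it is not reported in the text.
    """
    return baseline + slope * caseload


def caseload_to_threshold(
    baseline: float,
    slope: float = SLOPE,
    threshold: float = THRESHOLD,
    max_caseload: int = 50,
) -> Optional[int]:
    """Return the first caseload whose predicted score reaches the threshold.

    Returns None if the threshold is not reached within `max_caseload` procedures.
    """
    for n in range(1, max_caseload + 1):
        if predicted_score(n, baseline, slope) >= threshold:
            return n
    return None
```

Under this sketch, a hypothetical baseline of 19 points would cross the threshold at the fifth procedure, which is consistent with the reported caseload of five. This only illustrates the calculation and does not reconstruct the authors' model.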