DISCUSSION
Compared to composite score modelling20, the IRT
approach differentiates the items by their sensitivity level and has
shown the potential to reduce trial sample size for detecting drug
effects9,22. The sample-size saving is an attractive
proposition, especially as the field advances towards increasingly
personalized medicine, where a given therapy is expected to be
effective in only a small patient subpopulation.
Multi-variable IRT models with item-level interactions across domains
have been published, but they were not readily adaptable to the
analysis of Part III alone12,22. In this work, we
used only items in Part III, aiming to support early development of PD
drugs where a Go/No-Go decision hinges on their effect on (the more
objective) motor examinations. There is also a differentiating
methodological feature of our analysis: the analyses reported by others
used the IRT model to simulate the total scores, applied hypothetical
drug effects to both the severity endpoint and the simulated total
scores, and compared the two endpoints (severity and total score) for
the sample size required to detect the drug effect. This approach
could potentially bias against the total-score endpoint if the
simulation inflated the noise in the total score. In contrast, we
applied the drug effect directly to the SoS, just as to the severity,
so that the two endpoints were treated more fairly.
To compare the sample size requirement between the IRT and the
conventional SoS methods, we applied a range of relevant potential
reduction in progression rate that a new agent could cause. The normally
distributed effects centered at 0.3 and had a 5th–95th percentile
range of 0.1 to 0.5, which has been considered a
clinically meaningful effect range for neurodegenerative indications
such as Parkinson’s disease and Alzheimer’s
disease9,22. While the center of the range represented
an effect that is highly relevant and reasonably plausible, the lower and
higher tails were, respectively, less clinically relevant and less
plausible. Accordingly, effect levels further from the center carried
less weight in the computation of the overall PoS, which is then
effectively the collective power weighted by the distribution of the
effect level. We consider this a useful approach to account for the
uncertainty in the eventual
effect size that a new agent could produce. Figure 4 lower panel
illustrates the (expected) difference between the PoS under this effect
distribution and the power under the more extreme effect sizes. For the
same sample size, the power for detecting a large treatment effect would
be higher than the PoS for detecting a range of potential effects. Under
this condition, we found that the IRT method could lead to a tremendous
saving of about 50% in sample size compared to the conventional SoS
method. This magnitude of sample size savings is consistent with our
recent analysis of a placebo-controlled clinical trial of ropinirole –
an established dopaminergic agent.33
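The weighting of power by the effect-size distribution described above can be sketched in a few lines. The z-test power formula, the standard-error value, and the Monte Carlo averaging below are illustrative assumptions, not the trial-simulation method actually used in our analysis.

```python
import math
import random

Z_CRIT = 1.959964  # critical value for a two-sided test at alpha = 0.05

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(effect, se):
    """Approximate power of a two-sided z-test when the true effect is
    `effect` and its standard error is `se` (both hypothetical units)."""
    z = effect / se
    return phi(z - Z_CRIT) + phi(-z - Z_CRIT)

def prob_of_success(se, mean=0.3, p95=0.5, n_draws=100_000, seed=1):
    """PoS = power averaged over a normal effect-size distribution
    centered at `mean` with 95th percentile `p95`, mirroring the
    5th-95th range of 0.1 to 0.5 used in the text."""
    sd = (p95 - mean) / 1.6449  # 95th percentile of N(0,1) is about 1.6449
    rng = random.Random(seed)
    return sum(power(rng.gauss(mean, sd), se)
               for _ in range(n_draws)) / n_draws
```

With an assumed standard error of 0.1, the power at the upper-tail effect of 0.5 exceeds the distribution-weighted PoS, which is the pattern the lower panel of Figure 4 illustrates.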
The tremor tests showed poor discrimination power; they each and
collectively held very little information (Table 2). For most of the
tremor items, the probability of score 0 (normal) was disproportionally
high, regardless of a patient’s severity as defined by the overall
instrument (Figure 2, lower left and right). Consistent with these
observations, the clinical trial PoS was not affected by whether the
tremor items were included in the analyses or not (Figure 4 upper).
Interestingly, a Rasch measurement theory analysis revealed disordered
thresholds for several tremor-related items.34 These
observations supported the view that the tremor tests might measure a
different construct, hence perhaps should be assessed using a separate
and more sensitive scale.22,31,35
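The flat response pattern of the tremor items can be illustrated with a graded response model, a standard IRT formulation for ordered 0–4 item scores. The parameter values below are hypothetical, chosen only to contrast a poorly discriminating, tremor-like item with a highly discriminating one; they are not estimates from our analysis.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def grm_category_probs(theta, a, bs):
    """Graded response model: P(score = k | theta) for k = 0..len(bs).
    `a` is the discrimination parameter (aj in the text); `bs` are the
    ordered difficulty thresholds (bj1 to bj4 for a 0-4 item)."""
    cum = [logistic(a * (theta - b)) for b in bs]  # P(score >= k), k = 1..4
    probs = [1.0 - cum[0]]
    probs += [cum[i] - cum[i + 1] for i in range(len(cum) - 1)]
    probs.append(cum[-1])
    return probs

# Hypothetical items: a tremor-like item (low discrimination, high
# thresholds) versus a well-behaved, highly discriminating item.
tremor_like = (0.4, [1.5, 2.5, 3.5, 4.5])
informative = (2.0, [-1.0, 0.0, 1.0, 2.0])

for theta in (-2.0, 0.0, 2.0):
    p0_t = grm_category_probs(theta, *tremor_like)[0]
    p0_i = grm_category_probs(theta, *informative)[0]
    print(f"theta={theta:+.0f}: P(score 0) tremor-like={p0_t:.2f}, "
          f"informative={p0_i:.2f}")
```

For the low-discrimination item, the probability of score 0 stays high across the whole severity range, which is the behaviour seen in the lower panels of Figure 2.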
Interestingly, all seven left-side non-tremor items were among the most
informative ones (Table 2). Compared to their right-side counterparts,
they showed higher discriminatory power (aj) and
generally lower values and narrower ranges of the difficulty parameters
(bj1 to bj4). This was
also reflected by the left side’s better differentiated ICCs (Figure 2
lower left) and slightly higher proportion of higher scores (Figure 2,
lower right). Similarly, Gottipati et al. identified
“left
hand finger tapping” as the most informative among the sided
items12. In a previously-reported analysis, we
explored the PoS for four different approaches: by IRT and SoS, using
all items or only the seven left-side items. For the same sample size,
the order of estimated trial PoS was: IRT on all items >
IRT on seven items > SoS on seven items > SoS
on all items.34 This order illustrated IRT’s ability
to enhance the signal-to-noise ratio through item differentiation; indeed,
its advantage over SoS was reduced when only the most informative items were
included in the analysis. These findings were consistent with earlier
analysis of combined Part II and Part III data by Buatois et
al.22
A recent cross-sectional analysis also found the discrimination parameters
to be higher and the difficulty parameters to be lower for the left-side
items than for the right-side items.35 Similar
findings were reported from an item-response analysis of multiple latent
variables, although that analysis also reported a majority (58%) of the
patients having more advanced baseline disability on the right side of
the body.12 The lower difficulty parameters, or worse
test performance, for the left-side items may be a reflection of most
people being right-handed, despite neuroimaging and meta-analyses
suggesting the dominant side might be affected
earlier25,26,27. Change of hand preference while the
disease progresses has also been reported.36 This is
an area to be investigated further, in different datasets and at
different stages of the symptom progression. Another possible reason for
the consistently worse performance on the left side is that this side is
always examined later in the MDS-UPDRS form. Conceivably, this hypothesis
could be tested by randomizing the order of the sided tests.
We introduced an inter-occasion (visit) variability in the longitudinal
model to reflect the commonly recognized disease fluctuation; this
improved the estimation of the progression rate. The model suggested
that patients with lower baseline severity had faster progression,
supporting the report that progression, when measured by MDS-UPDRS Part
III, was slower at the more advanced stage21. The
effects of other factors such as genotype, comorbidity, age, disease
history and diagnostic biomarkers on disease progression remain to be
assessed.23,24,30
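The two longitudinal features discussed here, inter-occasion variability and a baseline-dependent progression rate, can be sketched as a simple latent-trajectory simulation. The linear functional form, the coupling between baseline and rate, and all parameter values below are hypothetical, not the fitted model.

```python
import random

def simulate_severity(theta0, t_visits, rate_pop=0.2, beta=0.1,
                      iov_sd=0.15, rng=None):
    """Simulate a latent severity trajectory with inter-occasion
    (visit-level) variability. A lower baseline theta0 yields a faster
    progression rate via the hypothetical coupling
    rate_i = rate_pop - beta * theta0. All values are illustrative."""
    rng = rng or random.Random(0)
    rate = rate_pop - beta * theta0  # faster progression when theta0 is low
    # Each visit gets its own random perturbation: the IOV term that
    # reflects day-to-day disease fluctuation at the occasion level.
    return [theta0 + rate * t + rng.gauss(0.0, iov_sd) for t in t_visits]
```

Setting `iov_sd=0` recovers the deterministic trend, which makes the baseline-rate coupling easy to inspect: a patient starting lower progresses faster over the same visit schedule.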
That IRT analysis of MDS-UPDRS Part III required a smaller sample size is
relevant to composite scales used in other indications. Because of their
less informative items, composite scores could compromise the
signal-to-noise ratio. Some instruments are also long, and hence
physically and mentally exhausting for debilitated patients, leading to
incomplete or poor-quality data. A bespoke, shorter instrument is
therefore often desired; however, its development, validation and user
training are costly and time-consuming, and a new instrument risks
missing relevant information when used to assess a new drug with an
unestablished profile, while also lacking comparability with existing
data. The IRT approach can enhance signal-detection power and reduce
sample size by directly accessing and weighting the item-level data of a
well-established instrument that is accepted by regulators. When item
scores are used directly, incomplete data are still useful. By
extension, it may be possible to reduce patient burden by asking each
patient to take only a stratified partial test. Other potential
applications of this approach include bridging between different
versions of an evolving instrument for meta-analysis or cross-study
comparison,28 and translating clinical trial results
to patient outcome expectations. These areas require extensive further
research and experience building by the clinical research community.
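The point that incomplete data remain useful follows from the item-level likelihood: items a patient did not complete are simply omitted rather than imputed. The sketch below illustrates this with a graded response model and a grid-search estimate of latent severity; all item parameters, and the grid-search estimator itself, are hypothetical simplifications for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def grm_loglik(theta, responses, items):
    """Log-likelihood of observed item scores under a graded response
    model. Items with missing responses (None) are skipped, not imputed,
    which is why incomplete data remain usable."""
    ll = 0.0
    for resp, (a, bs) in zip(responses, items):
        if resp is None:
            continue  # missing item contributes nothing
        cum = [logistic(a * (theta - b)) for b in bs]
        upper = [1.0] + cum   # P(score >= k) for k = 0..K
        lower = cum + [0.0]
        ll += math.log(max(upper[resp] - lower[resp], 1e-12))
    return ll

def estimate_theta(responses, items, grid=None):
    """Grid-search maximum-likelihood estimate of latent severity."""
    grid = grid or [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda th: grm_loglik(th, responses, items))
```

A patient who completed only two of three hypothetical items still yields a severity estimate, and higher observed scores pull the estimate upward as expected.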