1 Introduction
The latest round of the Critical Assessment of Protein Structure
Prediction experiment (CASP15), held in 2022, introduced a novel category
for protein-ligand interaction prediction (CASP-PLI), aiming to evaluate
cutting-edge methodologies on a blind target set of experimentally
resolved complexes. In contrast to typical ligand docking benchmark
experiments like Teach Discover Treat (TDT)1,
Continuous Evaluation of Ligand Prediction Performance (CELPP)2, Drug
Discovery Data Resource (D3R)3–6,
or Community Structure-Activity Resource (CSAR)7,8,
the prediction task in CASP consisted of predicting both the structure
of the receptor protein and the position and conformation of the
ligand, hereafter referred to as protein-ligand complex (PLC)
prediction. The evaluation results of this experiment are presented
elsewhere in this issue9, as
are the technical details and challenges encountered during the
establishment of the new category as part of CASP10. These
challenges included: (1) PLC with incomplete ligands or of suboptimal
quality for use as ground truth ligand poses, (2) the need for
extensive manual verification of data input and prediction output, and
(3) the lack of suitable scoring metrics that consider both protein
structure and ligand pose prediction accuracy, which necessitated the
development of novel scores.
By integrating the insights and developments from the CASP-PLI
experiment, automated systems for the continuous benchmarking of
combined PLC prediction can be established. We discuss challenges and
insights associated with the development of two complementary approaches
for PLC benchmarking: a continuous evaluation of newly released PLC in
the Protein Data Bank (PDB)11, as
implemented in Continuous Automated Model EvaluatiOn (CAMEO,
https://beta.cameo3d.org/)12, and a
comprehensive evaluation of PLC prediction tools based on a diverse,
curated, and annotated benchmark dataset of PLC.
CAMEO is a benchmarking platform that conducts fully automated blind
evaluations of three-dimensional protein structure prediction servers,
based on the weekly prerelease of sequences for structures to be
published in the upcoming release of the Protein Data Bank13–15.
Since 2012, the 3D structure prediction category has been assessing the
accuracy of single-chain predictions. Additional assessment categories
have been implemented over time to serve the structural bioinformatics
community, in particular around the assessment of quality estimates
(QE). Recently, efforts were made towards the assessment of
protein-protein complexes (quaternary structures) and protein-ligand
pose prediction12.
While CAMEO allows for continuous validation of newly developed methods,
it depends on the distribution of PLC released in the PDB during the
evaluation period. CAMEO results for a given time window may therefore
not be representative of the entire PLC space, and method developers may
not have immediate access to problem cases or to specific sets of PLC
where their algorithm underperforms or overperforms. This suggests a second,
complementary angle to automated benchmarking, namely the creation of a
diverse dataset of PLC with representative complexes from across
protein-ligand space, which would allow both global comparative scoring
as well as pinpointing cases that method developers would need to
address to improve their global performance. While many recent
deep-learning docking methods train and validate their approach on the
time-split PDBBind set16 of PLC
(where 363 protein-ligand pockets are used for benchmarking), we
demonstrate that this approach has shortcomings arising from the lack of
crystal structure quality verification and the lack of consistent
redundancy removal.
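The consistent redundancy removal mentioned above can be illustrated with a greedy clustering scheme over pairwise sequence identity. This is a hedged sketch, not the curation pipeline used in this work; the `identity` callback and the 30% cutoff are illustrative assumptions, and a real pipeline would additionally consider ligand and binding-site similarity.

```python
def greedy_nonredundant(entries, identity, cutoff=0.3):
    """Greedy selection of a non-redundant benchmark set: an entry is
    kept only if its sequence identity to every already-kept
    representative is below `cutoff`. `identity(a, b)` must return a
    fraction in [0, 1]. Illustrative sketch only; real curation also
    filters on ligand and pocket similarity."""
    kept = []
    for entry in entries:
        # Compare against all representatives selected so far.
        if all(identity(entry, rep) < cutoff for rep in kept):
            kept.append(entry)
    return kept
```

Because selection is greedy, the result depends on input order; deterministic pipelines typically sort entries (e.g., by structure quality) before clustering.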
Previous research has shown that the quality of experimentally resolved
structures can vary significantly17.
Efforts have been made to establish criteria for assessing the quality
of such structures, like the Iridium criteria18.
Comparing prediction results to lower quality structures can skew the
perception of their performance, an especially important consideration
when assessing deep learning-based tools which have been trained to
reproduce results seen in experimentally resolved structures.
Additionally, many crystal structures with ligands contain missing atoms
or missing residues in the binding site, complicating their use as
ground truth.
Even in the era of deep learning, determining the difficulty of
predicting a PLC still relies, to some degree, on previously
experimentally resolved structures. This was exemplified in this year’s
CASP-PLI results9, where
template-based docking methods outperformed others due to the
availability of previously solved highly similar PLC for many of the
targets. Thus, incorporating the novelty of a PLC into automated
benchmarking setups is crucial for a fair and comprehensive evaluation.
For CAMEO, this consists of filtering out “easy” targets based on
sequence and ligand information available in the PDB pre-release. For
the generation of a representative benchmark set, one can additionally
look at the novelty of the binding site and ligand pose on a structural
level.
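As a minimal sketch of such a novelty filter (not the actual CAMEO implementation), a target can be flagged as “easy” when a released PDB entry combines both a highly similar receptor sequence and a near-identical ligand; the thresholds below are hypothetical placeholders, not the values used by CAMEO.

```python
def is_novel_target(max_seq_identity: float,
                    max_ligand_similarity: float,
                    seq_cutoff: float = 0.7,
                    lig_cutoff: float = 0.9) -> bool:
    """Classify a PLC target as novel (hard) versus easy.

    `max_seq_identity`: highest sequence identity of the receptor to
    any released PDB chain (0..1).
    `max_ligand_similarity`: highest chemical similarity of the ligand
    to any ligand co-crystallized with such a chain (0..1).
    Thresholds are illustrative assumptions, not CAMEO settings."""
    # A target is "easy" only if a close homolog with a near-identical
    # ligand already exists; either condition failing keeps it novel.
    easy = (max_seq_identity >= seq_cutoff
            and max_ligand_similarity >= lig_cutoff)
    return not easy
```

For the benchmark-set setting, the same idea extends to structural comparisons of the binding site and ligand pose rather than sequence and chemistry alone.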
Proteins are inherently flexible, exhibiting a range of conformations in
line with their functions. Not every observed conformation is compatible
with ligand binding, and this can significantly impact the accuracy of
docking predictions even when using high quality experimentally resolved
structures19,20.
These factors are further complicated by the use of computationally
predicted protein structures, as previous studies indicate that even
state-of-the-art methods for structure prediction are not always suited
for the task of ligand docking, due to inaccuracies in conformations and
side-chain positioning21.
Moreover, some ligands have highly flexible regions that mainly interact
with the solvent; evaluating the conformation of such flexible parts may
be less meaningful than evaluating the parts of the ligand that form
crucial interactions with protein residues. Thus, it is necessary to develop and
employ evaluation metrics that extend beyond rigid ligand pose
assessments.
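One way such a metric can be sketched (as an illustration only, not one of the CASP-PLI scores) is to restrict the pose deviation to ligand atoms in contact with the protein in the reference structure, so that solvent-exposed flexible tails do not dominate the evaluation. Atom correspondence by index and the 4 Å cutoff are assumptions of this sketch.

```python
import math

def contact_rmsd(ref_ligand, model_ligand, protein_atoms, cutoff=4.0):
    """RMSD computed only over ligand atoms lying within `cutoff`
    angstroms of any protein atom in the reference structure.

    Coordinates are (x, y, z) tuples; reference and model ligand atoms
    are assumed matched by index. Illustrative sketch, not an official
    CASP-PLI evaluation score."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Select ligand atoms in contact with the protein in the reference.
    contact_idx = [
        i for i, atom in enumerate(ref_ligand)
        if any(dist2(atom, p) <= cutoff ** 2 for p in protein_atoms)
    ]
    if not contact_idx:
        return None  # no protein-contacting atoms within the cutoff

    sq = [dist2(ref_ligand[i], model_ligand[i]) for i in contact_idx]
    return math.sqrt(sum(sq) / len(sq))
```

A symmetry-aware atom matching step (omitted here for brevity) would be needed in practice, since chemically equivalent atoms can be swapped between reference and model.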