2.2 | Classifying evaluation units into evolutionary
prediction classes
Historically, the outcome of a protein structure modeling exercise was
largely predetermined by the evolutionary relationship between the
target and experimentally determined structures. Proteins with apparent
homology to available structures were typically easier to model, while
targets without detectable homologs sat at the harder end of the prediction
difficulty spectrum. Since targets of different difficulty required
different modeling approaches, yielded different degrees of model
accuracy, and thus required different evaluation approaches, CASP
previously assessed modeling results separately for different target
difficulties. The names of the difficulty categories changed over time,
but the major factor defining difficulty remained the same:
availability of structural templates. The classical difficulty schema
was shaken in CASP14, where the DeepMind group showed that highly
accurate models can be built with AlphaFold 2 (AF2) for practically all
targets, independently of template availability. This suggested that
the classical division into largely homology-based difficulty categories
may no longer be needed. Acting upon these developments, CASP
organizers recommended that tertiary structure prediction in CASP15 be
assessed in one batch. This analysis is presented elsewhere in this issue21. Nevertheless,
similarly to splitting targets into EUs (above), the assignment of EUs
to evolutionary prediction classes is still needed for comparing CASP15
results with the earlier ones.
In previous CASPs, EUs were classified into difficulty categories based
on the availability of similar structures in the PDB, as detected by
sequence- and structure-based searches (reflecting estimated difficulty)
and predictors’ performance (reflecting actual difficulty)9,10.
Since performance has become more uniform across the whole range of
targets, it no longer discriminates between them. To adapt to
the situation, we explored automated approaches to target
classification, aiming to recapitulate the outcomes of previous CASPs as
far as possible, but working solely with the results of automated PDB
searches. Each EU was assigned a sequence-based and structure-based
similarity score. The sequence-based score was defined as the HHscore 10,
which is the product of the HHsearch probability and the alignment
coverage of the query for the top-ranked template identified by
HHsearch. The
structure-based score was the LGA_S score of the highest-ranked
structural match according to the procedure described in section
2.1, Step 3. These scores were used to automatically assign EUs to
prediction classes (see Results, section 3.2).
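
The two-score assignment described above can be sketched in a few lines of Python. This is a minimal illustration and not the CASP pipeline itself: the names `EUScores`, `hhscore` and `score_eu` are hypothetical, the HHsearch probability is assumed to be given in percent (0-100), the coverage is assumed to be the fraction of query residues aligned to the top-ranked template, and the LGA_S value is taken as already computed by the structural search of section 2.1. The class boundaries are deliberately left out, since they are defined in section 3.2 and not here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EUScores:
    """Pair of similarity scores attached to one evaluation unit (EU)."""
    hhscore: float  # sequence-based score
    lga_s: float    # structure-based score (best structural match)


def hhscore(probability: float, aligned_residues: int, query_length: int) -> float:
    """Return HHsearch probability multiplied by query alignment coverage.

    Assumption: probability is in percent (0-100) and coverage is a
    fraction in [0, 1], so the result lies on a 0-100 scale.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    if not 0 <= aligned_residues <= query_length:
        raise ValueError("aligned_residues must be between 0 and query_length")
    coverage = aligned_residues / query_length
    return probability * coverage


def score_eu(top_hit_probability: float, aligned_residues: int,
             query_length: int, best_lga_s: float) -> EUScores:
    """Combine the sequence- and structure-based scores for one EU."""
    return EUScores(
        hhscore=hhscore(top_hit_probability, aligned_residues, query_length),
        lga_s=best_lga_s,
    )


if __name__ == "__main__":
    # Made-up numbers for illustration only, not CASP15 data.
    print(score_eu(98.0, 150, 200, 62.5))
```

With these made-up inputs, a top hit with 98% probability that covers 150 of 200 query residues gives an HHscore of 98 × 0.75 = 73.5. Keeping the two scores together in one record makes it straightforward to apply whatever class boundaries are chosen downstream.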