3 Methods

3.1 PLC validation criteria

PLC were obtained from the PDB, release 2023-03-15. The PDB Chemical Component Dictionary41 was downloaded on 2023-03-17. X-ray validation information was extracted from the XML files provided by the PDB. Additional information, including the entry ID to polymer entity ID mapping, the release date and polymer composition for each entry, and the canonical one-letter code sequence for each entity in the dataset, was retrieved with the GraphQL-based API of the RCSB PDB Web Services42 on 2023-03-28. A total of 37 entries marked as obsolete in the API results were discarded.
Ligands were defined as any non-polymer entity. A PLC was defined as a PDB entry with at least one polymer and one non-polymer entity (ion or small molecule). PDB entries for which the “polymer composition” was one of “DNA”, “RNA”, “DNA/RNA”, “NA-hybrid”, “other type pair”, “NA/oligosaccharide” or “other type composition”, as well as any remaining entry containing DNA or RNA polymers, were excluded.
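The entry-level filter described above can be sketched as follows. This is a minimal illustration and not the code used in this work: the dictionary layout of an entry (keys polymer_composition, polymer_types, n_nonpolymer) and the per-entity type labels are assumptions made for the example.

```python
# Polymer compositions excluded in the text above.
EXCLUDED_COMPOSITIONS = {
    "DNA", "RNA", "DNA/RNA", "NA-hybrid", "other type pair",
    "NA/oligosaccharide", "other type composition",
}
# Hypothetical per-entity type labels marking nucleic acid polymers.
NUCLEIC_ACID_TYPES = {"DNA", "RNA"}


def is_plc(entry):
    """Return True if a PDB entry qualifies as a PLC under the stated rules.

    `entry` is a hypothetical dict with the keys
      polymer_composition: entry-level composition string
      polymer_types:       list with one type label per polymer entity
      n_nonpolymer:        number of non-polymer entities (ligands)
    """
    if entry["polymer_composition"] in EXCLUDED_COMPOSITIONS:
        return False
    # Drop any remaining entry that still contains a DNA or RNA polymer.
    if any(t in NUCLEIC_ACID_TYPES for t in entry["polymer_types"]):
        return False
    # At least one polymer and at least one non-polymer entity.
    return len(entry["polymer_types"]) >= 1 and entry["n_nonpolymer"] >= 1
```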
Binding pockets were defined as the set of amino acid residues in the reference structure with at least one heavy atom within a 6 Å radius of any heavy ligand atom.
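As an illustration of this pocket definition, the sketch below selects residues from plain coordinate tuples. The input layout (residue identifier, element symbol, coordinates) is an assumption made for the example, and treating "heavy" as "element is not H" is a simplification.

```python
import math

CUTOFF = 6.0  # Å, radius used in the pocket definition above


def pocket_residues(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """Return the set of residue ids with a heavy atom near the ligand.

    protein_atoms: iterable of (residue_id, element, (x, y, z))
    ligand_atoms:  iterable of (element, (x, y, z))
    Both layouts are hypothetical and chosen for this example only.
    """
    # Keep only heavy (non-hydrogen) ligand atoms.
    ligand_xyz = [xyz for element, xyz in ligand_atoms if element != "H"]
    pocket = set()
    for residue_id, element, xyz in protein_atoms:
        if element == "H" or residue_id in pocket:
            continue  # skip hydrogens and residues already selected
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_xyz):
            pocket.add(residue_id)
    return pocket
```

A brute-force distance loop is sufficient for a single complex; for a dataset of this size a spatial index would be the more practical choice.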
The filtering thresholds for the Iridium criteria were extracted from the original manuscript18. The suggestion to filter PLC where atoms from crystal packing are within 6 Å of any ligand atom was not used as this information could not easily be extracted from the PDB validation report.

3.2 PLC clustering and novelty assessment

For PLC clustering, the set of PLC described in section 3.1 was used. PLC were grouped together based on the cluster identifier of all the unique polymer entities and the chemical component 3-letter code of the ligands (i.e. identical ligands) they contained. Polymer entity cluster identifiers were obtained by performing sequence-based clustering of all polymer entities in the dataset with the cluster module from the MMseqs2 software (version 13.45111)43. Six different sequence-based clusterings were obtained using minimum sequence identity thresholds of 100%, 95%, 90%, 70%, 50% and 30%, respectively. For the sequence alignment, a coverage threshold of 90% (-c 0.9) of both the query and target sequences was used (--cov-mode 0). The sensitivity of the prefiltering was set to 8.0 (-s 8.0). Clustering was performed with the connected component algorithm (--cluster-mode 1) with the option (--cluster-reassign) to reassign cluster members to other clusters if they no longer fulfill the clustering criteria after each iteration. Each PLC entry in the dataset was subsequently given an identifying string consisting of the cluster ids of the entities and the 3-letter codes of the unique ligands present in the PLC.
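The clustering call and the identifier construction can be sketched as below. The argument list only assembles the options named in the text. The positional arguments follow the usual "mmseqs cluster" layout, and --min-seq-id is assumed to be the option used to set the identity threshold (it is not named above). The separators in the identifier string are likewise an assumption, since the exact format is not specified here.

```python
def mmseqs_cluster_args(seq_db, out_db, tmp_dir, min_seq_id):
    """Build an argument list for one MMseqs2 clustering run.

    seq_db, out_db and tmp_dir are placeholder paths; min_seq_id is a
    fraction (e.g. 0.3 for 30%). --min-seq-id is an assumption, the
    remaining options are those quoted in the text.
    """
    return [
        "mmseqs", "cluster", seq_db, out_db, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", "0.9", "--cov-mode", "0",   # 90% coverage of query and target
        "-s", "8.0",                      # prefilter sensitivity
        "--cluster-mode", "1",            # connected component clustering
        "--cluster-reassign",
    ]


def plc_identifier(entity_cluster_ids, ligand_codes):
    """Combine unique cluster ids and unique ligand codes into one string.

    Sorting makes the identifier independent of entity and ligand order.
    The "_" and "__" separators are hypothetical.
    """
    clusters = "_".join(sorted(set(entity_cluster_ids)))
    ligands = "_".join(sorted(set(ligand_codes)))
    return f"{clusters}__{ligands}"
```

One identifier per PLC would be computed for each of the six identity thresholds, using the cluster ids from the corresponding clustering run.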
The assessment of the novelty of a given PLC with respect to a different set of PLC, at a given minimum sequence identity threshold, was performed by comparing its PLC identifier to the set of all PLC identifiers of the other set.
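Under this definition, novelty assessment reduces to a set-membership test on identifier strings. A minimal sketch, assuming both sets of identifiers were derived from the same clustering run at the same sequence identity threshold (the function name and the example identifiers are hypothetical):

```python
def novel_plcs(query_ids, reference_ids):
    """Return the query PLC identifiers that are absent from the reference set.

    Both arguments are iterables of PLC identifier strings built at the
    same minimum sequence identity threshold.
    """
    reference = set(reference_ids)
    return [plc_id for plc_id in query_ids if plc_id not in reference]
```

For example, novel_plcs(["c1__ATP", "c2__HEM"], ["c1__ATP"]) keeps only "c2__HEM", i.e. the PLC whose combination of sequence clusters and ligands does not occur in the reference set.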

3.3 Benchmarking state-of-the-art docking tools

A Nextflow36 pipeline (20.10.0) was developed to run and assess 5 state-of-the-art PLC prediction tools. This is available at https://github.com/PickyBinders/PickyBinder

3.3.1 Benchmark dataset

The 363 PLC in the PDBBind time-split test set that were not used as training data by TANKBind and DiffDock were used as a test set to demonstrate the automated benchmarking workflow16. To compare docking on experimental and predicted structures, AlphaFold v2.3.0 39 was used to predict models for 256 monomeric proteins in this set, using the canonical one-letter code sequence with default parameters and relaxation. Results are presented for the best relaxed model (according to average pLDDT) for each protein.
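Selecting the best model by average pLDDT can be sketched as follows. The sketch assumes that the average is the mean over per-residue pLDDT values and that these values are already available per model; the model names are placeholders.

```python
from statistics import mean


def best_model(plddt_by_model):
    """Return the name of the model with the highest mean pLDDT.

    plddt_by_model: dict mapping a model name to its list of
    per-residue pLDDT values (hypothetical input layout).
    """
    return max(plddt_by_model, key=lambda name: mean(plddt_by_model[name]))
```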

3.3.2 Molecule preparation

Each ligand was prepared starting from the SMILES string. Ligands were first standardized by neutralizing the charges, and their protonation states were then adjusted to pH 7 using protonation rules. Explicit hydrogen atoms were then added. The 3D conformation was generated using the ETKDG method from RDKit44 and stored in SDF format. For docking tools related to the AutoDock family, the Python package Meeko (v0.4.0) was used to generate the PDBQT input files45.

3.3.3 PLC prediction tools

The predictions were run with the default parameters of each tool unless stated otherwise below.
(1) AutoDock Vina30,31 (version 1.2.3) docking was performed with exhaustiveness set to 64 within a Conda37 environment containing the required Python bindings. Meeko v0.4.0 was used to convert the PDBQT output file into an SDF file for use by the evaluation tools.
(2) SMINA32 was run within a Conda environment (v2020.12.10, conda-forge:b08c07c, based on AutoDock Vina 1.1.2) with exhaustiveness set to 64.
(3) GNINA33 was run using a Singularity image downloaded from https://hub.docker.com/r/nmaus/gnina (digest: 7087cbf4dafd, gnina v1.0.2 (master:0cb5eb8, built Sep 29 2022)) with exhaustiveness set to 64.
(4) For TANKBind35, input preparation and inference were run according to the code provided at https://github.com/luwei0917/TankBind using a Singularity image for the dependencies downloaded from https://hub.docker.com/r/qizhipei/tankbind_py38.
(5) DiffDock34 inference was run using --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise within a Conda environment built according to the setup guide (master:2c7d438, built Mar 13 2023).
Each tool except DiffDock allows the definition of a pocket center and grid size that restrict the search space for ligand conformations. To assess predictions for different pockets, P2Rank40 (v2.4) was used to predict and rank multiple binding pockets, with default parameters for experimental structures and the -c alphafold option for AlphaFold-predicted models. The box in which AutoDock Vina, GNINA and SMINA search for binding poses was constructed around each predicted P2Rank pocket center. The size of the search box was the diameter of the ligand conformer generated by RDKit plus an additional 10 Å on each of the 6 sides of the box. Thus, for each tool, (p+1)*n predicted ligand poses were obtained as outputs, where p is the number of pockets predicted by P2Rank and n is the number of poses returned by the tool.
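The search box construction and the resulting pose count can be sketched as below. Two points are interpretations rather than statements from the text: the ligand diameter is taken as the largest pairwise atom distance in the conformer, and "10 Å on all 6 sides" is read as extending each box edge by 2 × 10 Å. The function names are hypothetical.

```python
import math
from itertools import combinations

PADDING = 10.0  # Å added on each of the 6 sides of the box


def ligand_diameter(coords):
    """Largest pairwise distance between conformer atoms (assumed definition)."""
    return max((math.dist(a, b) for a, b in combinations(coords, 2)), default=0.0)


def search_box(pocket_center, ligand_coords, padding=PADDING):
    """Cubic box centred on a predicted pocket center.

    Padding on opposite faces adds 2 * padding to each edge.
    """
    edge = ligand_diameter(ligand_coords) + 2 * padding
    return {"center": tuple(pocket_center), "size": (edge, edge, edge)}


def n_poses(p, n):
    """Total poses per tool: (p + 1) * n, as stated in the text."""
    return (p + 1) * n
```

With p = 3 predicted pockets and n = 9 poses per run, n_poses gives 36 poses for one tool and ligand.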

3.3.4 Scoring

BiSyRMSD (shortened to RMSD throughout this manuscript) and lDDT-PLI scores were calculated with OpenStructure46 (version 2.4.0) using default parameters. The methods are identical to those described in the CASP15 CASP-PLI assessment paper9. Every ligand was scored separately, and a summary CSV file containing scores for each ligand pose, pocket, and blind docking was generated.