# Fragment Oriented Molecular Shapes

Abstract

Molecular shape is an important concept in drug design and virtual screening. Shape similarity typically uses either alignment methods, which dynamically optimize molecular poses with respect to the query molecular shape, or feature vector methods, which are computationally less demanding but less accurate. The computational cost of alignment can be reduced by pre-aligning shapes, as is done with the Volumetric-Aligned Molecular Shapes (VAMS) method. Here we introduce and evaluate Fragment Oriented Molecular Shapes (FOMS), where shapes are aligned based on molecular fragments. FOMS enables the use of shape constraints, a novel method for precisely specifying molecular shape queries that provides the ability to perform partial shape matching and supports search algorithms that function on an interactive time scale. When evaluated using the challenging Maximum Unbiased Validation dataset, shape constraints were able to extract significantly enriched subsets of compounds for the majority of targets, and FOMS matched or exceeded the performance of both VAMS and an optimizing alignment method of shape similarity search.

Molecular shape is a fundamental concept in medicinal chemistry (Nicholls 2010), and shape-based virtual screens have successfully identified novel inhibitors (Rush III 2005, McMasters 2009, Muchmore 2006, Naylor 2009). Shape is used both to assess the similarity of a candidate molecule to a set of known actives and to evaluate the complementarity of a molecular shape to the shape of the binding site on the target receptor. Here we describe a novel method for exactly specifying shape constraints for querying a large library of molecular shapes. The shape constraints are derived from both the receptor, which describes where a molecule cannot be, and from the shape of a known binder, which describes where the molecule should be. Our method is unique in relying on a manually positioned ligand anchor fragment. This anchor fragment requirement makes our approach particularly applicable to a fragment-based drug discovery workflow (Rees 2004, Congreve 2008) as shape constraints are a natural way to search for compounds that extend an identified fragment structure while remaining complementary to the receptor. Additionally, the anchor fragment, by defining a fixed coordinate system, enables the indexing of large libraries of molecular shapes. This indexing allows search times to scale sub-linearly with the size of the library, resulting in search performance that is on an interactive time scale.

Shape-based virtual screening typically attempts to identify the most similar molecules in a virtual library to known active molecules or to a pseudo-ligand that is derived from the desired binding site (Ebalunode 2008). Shape similarity is usually assessed either through alignment methods, which seek to maximize the three dimensional overlap of two shapes, or through feature vector methods, which transform shapes into a low-dimension vector of features that can be efficiently compared. As part of the similarity calculation, molecular shapes may be further annotated with electrostatic or pharmacophore features (Vainio 2009, Cheeseright 2006, Thorner 1996, Tervo 2005, Marí-n 2008, Sastry 2011).

Alignment methods try to find the optimal overlay of two molecules to either maximize the overlapping volume or the correspondence between feature points, such as molecular field extrema (Vainio 2009, Cheeseright 2006). The predominant method of maximizing volume overlap is to represent the molecular shapes as a collection of Gaussians (Good 1993, Grant 1996), sample several starting points, and use numerical optimization to find a local maximum (Hawkins 2007). An alternative method is to represent the molecular shape by discrete features and use point correspondence algorithms to generate the alignment. Potential features include pharmacophore features (Sastry 2011), field points (Thorner 1996, Vainio 2009, Cheeseright 2006), or hyperbolical paraboloid representations of patches of molecular surface (Proschak 2008). A number of performance optimizations to the alignment process have been described (Grant 1996, Rush III 2005, Sastry 2011, Fontaine 2007), but alignment methods remain computationally expensive relative to feature vector methods, unless shapes can be pre-aligned to a canonical coordinate system, as is done with Volumetric Aligned Molecular Shapes (VAMS) (Koes 2014).

Feature vector methods reduce molecular shapes to a simple vector of Boolean or numerical features. Shape similarity is then determined by comparing these vectors using a metric such as Tanimoto or Euclidean distance. The numerical features can be computed using geometric moments (Ballester 2007, Schreyer 2012), ray-tracing histograms (Zauhar 2003), or a small set of reference shapes (Haigh 2005, Putta 2002). Feature vectors enable computationally efficient screening (millions of shape comparisons per a second) (Ballester 2007), but lack the accuracy and interpret-ability of alignment methods (Nicholls 2010). Critically, a feature vector similarity does not generate a molecular overlay suitable for visual inspection and analysis.

In our approach, fragment oriented molecular shapes (FOMS), we eliminate the computational burden of alignment by requiring the presence of a common anchor fragment. Molecules are trivially aligned by a direct superposition of anchor fragments, and the fragment defines a standard coordinate system for describing the shape of the molecule. Prepositioned molecular fragments are a common component of de novo drug design (Schneider 2005) where ligands are ‘grown’ from a prepositioned fragment to fit the binding site. Prepositioned fragments have also been successfully used in structure-based design to identify high-affinity inhibitors for a multitude of targets (Kick 1997, Murray 1997, Li 1998, Liebeschuetz 2002). Notably, our AnchorQuery web service (Koes 2012) provides interactive pharmacophore virtual screening of billions of custom compounds for protein-protein interaction (PPI) inhibitors by pre-aligning compounds to amino acid side-chain motifs corresponding to anchor residues (Rajamani 2004, Meireles 2010) in the PPI interface. Fragment-based drug discovery workflows (Rees 2004, Congreve 2008) can provide a physical basis for the selection and positioning of an appropriate anchor fragment. Alternatively, virtual docking methods may be used (Brenke 2009).

Anchor fragments present a different modality for shape-based screening: the user is required to identify a fragment structure with a meaningful binding mode and the search space is limited to compounds that contain the specified fragment. These requirements enable a new type of search language that supports explicit shape constraints. In essence, a partial similarity search (Bronstein 2009) can be performed, where instead of optimizing similarity with the entirety of a query shape, the shape constraints specify only part of the shape in detail (e.g., within the binding site) while leaving other parts unspecified (e.g., interactions with solvent). In addition, the use of anchor fragments enable a new mechanism of search. Instead of evaluating the query against every molecule in the virtual screening library, the molecular shapes of the library can be indexed so that searches need only evaluate a fraction of the library. This allows large libraries of millions of shapes to be searched on an interactive time scale of a few seconds.

Here we describe the retrospective virtual screening performance of FOMS and explore the potential for our explicit shape constraints to define highly specific filters for the creation of significantly enriched subsets of large virtual libraries.

# Methods

We describe our representation of molecular shapes, how we define and construct shape constraints for searching, and briefly discuss how these shapes can be indexed to support efficient searches. We also describe our benchmarking methodology for performing a retrospective evaluation of FOMS.

\label{voxelshape} The Rho-Kinase inhibitor from PDB 2H9V and the voxelization of its molecular, solvent-excluded volume. Voxlized image generated with Sproxel (sproxel, r173).

## Shape Representation

A molecular shape in FOMS is a discretized solvent-excluded volume that is calculated from the heavy atoms of a molecular conformation using a water probe with radius 1.4Å. The volume is discretized into 0.5Å$$^3$$ voxels (three dimensional pixels) and stored as an oct-tree, an efficient data structure for representing volumetric data (Zhang 2009). Most oct-tree operations take time proportional to the surface area of a shape ($$\approx n^2$$) instead of the volume ($$\approx n^3$$). An example of a voxelized shape is shown in Figure \ref{voxelshape}. The grid coordinates of a molecular shape are all computed relative to the identified anchor fragment which is aligned to the axes at the origin. If a molecule contains multiple instances of an anchor fragment or if there are inherent symmetries in the fragment, then a distinct voxelized shape is generated for every valid alignment of the fragment(s). For example, a fragment consisting of ring with a single exit vector will map to four distinct alignments on a ligand that contains a ring with two exit vectors. We grow/shrink molecular shapes by adding/removing the appropriate number of surface voxels.

\label{iptsshape} (Left) The Rho-Kinase inhibitor (sticks) and receptor (surface) from PDB 2H9V shown with the identified interaction points (spheres). (Right) Shape constraints derived from this structure. The interaction points become a minimum shape constraint (yellow) by applying a radius of 1Å. Here the receptor volume has been shrunk by 1Å (two layers of surface voxels were removed) to form a maximum shape constraint (inverse shown in cyan). Voxlized image generated with Sproxel (sproxel, r173).

## Shape Constraints

Since we assume all shapes are registered to a common coordinate system defined by the anchor fragment, it is possible to exactly specify regions of space within this coordinate system that a molecule should and should not occupy. Following the nomenclature of VAMS (Koes 2014), we refer to these constraints as minimum and maximum shape constraints. These constraints combine to form an expressive and exacting specification for a desired target shape. A minimum shape constraint sets a strict lower bound on the volumetric shape of a target shape. Every voxel within the minimum shape constraint must be contained within the target shape. A minimum shape constraint can be used to require that a target shape has a specific binding mode and minimum bulkiness. A maximum shape constraint sets a strict upper bound on the volumetric shape of a target shape. Every voxel of the target shape must be contained within the maximum shape constraint. The maximum shape can be used to constrain the total volume of the target shape and prevent the target shape from overlapping undesirable areas, such as space filled by a receptor. Shape constraints are distinct from shape similarity. Unlike shape similarity, which produces a continuous ranking of similarity with respect to a query shape, shape constraints are binary filters: a shape either matches the constraints or does not.

The structure of a receptor-ligand complex provides a natural starting point for the development of shape constraints. The inverse of the receptor shape provides a starting point for defining a maximum shape constraint. The constraint can be expanded within the binding site as needed to incorporate tolerance for small steric clashes and receptor flexibility while the shape can be shrunk to avoid excessive exposure of the target ligand to solvent. For visualization purposes we display the inverse of the maximum shape, which corresponds to the region of space the desired shape is excluded from, to more clearly visualize the minimum shape, which is, by definition, fully subsumed in the maximum shape.

Similarily, the ligand shape can be used to develop a minimum shape constraint. However, constructing a minimum shape constraint directly from the ligand shape, even if the shape is shrunk or skeletonized, results in a highly specific shape query that may be limited in its ability to retrieve novel chemical scaffolds. As an alternative to directly using the shape of a bound ligand, we derive interaction points from the ligand-receptor complex. Interaction points are points at the center of clusters of ligand atoms that are interacting with the receptor. Interacting ligand atoms are defined to be those no more than 6Å away from a receptor atom, as measured between atom centers. These atoms are then clustered into groups of three or more that are no more than 4Å across. The center of the cluster is an interaction point. An example of interaction points identified from a ligand-receptor complex is shown with the corresponding minimum and maximum shape constraints in Figure \ref{iptsshape}. Interaction points provide a more general specification of the binding mode of a ligand that is less dependent on the ligand chemical scaffold than the full ligand shape and ignores non-interacting components of the ligand.

## Shape Indexing

As with VAMS(Koes 2014), we use a matching and packing(Koes 2014a) bulk-loading algorithm to initialize an efficient data structure for volumetric shape constraint searches. Briefly, shapes are stored in a GSS-tree(Keim 1999) where each leaf of the tree is a single molecular shape and each internal node includes a maximum included volume (MIV) and minimum surrounding volume (MSV). The MSV is the union of all the molecular shapes beneath the node in the tree while the MIV is the intersection. By appropriately applying minimum and maximum shape constraints to the MIV and MSV, it can be determined if any of the shapes lower in the tree have the potential to match the constraints. If not, the entire subtree of molecular shapes can be eliminated from consideration, resulting in a sub-linear running time.

## Shape Similarity

We use the shape Tanimoto (Rush III 2005) to compute the similarity of two voxelized shapes A and B: $\delta(A,B) = \frac{A \cap B}{A \cup B}$ where a larger score indicates a greater degree of similarity. For shape similarity evaluations we consider only similarity with the query ligand.

## Dataset

The specific modality of fragment-oriented molecular shapes requires the creation of a custom benchmark for assessing the virtual screening performance of the method. In order to construct the shape constraints, a receptor-ligand structure is required, and, in order to screen a library, all compounds in the library must contain the desired anchor fragment.

We use the Maximum Unbiased Validation (MUV) benchmark (Rohrer 2009) as a starting point. MUV includes sets of 30 active and 15000 property-matched decoy compounds for each of 17 targets. Compounds are selected from PubChem bioactivity data using a methodology that both reduces the similarity of actives (to avoid analogue bias (Good 2008)) and increases the similarity between actives and decoys (which helps prevent artificial enrichment (Verdonk 2004)). MUV is also noteworthy in that the decoys were all assessed to be inactive in the initial high-throughput screen against the target providing some measure of experimental evidence that the negatives are true negatives (as opposed to schemes that generate decoys through random sampling (Mysinger 2012)). The construction of the MUV dataset makes it particularly challenging for ligand similarity approaches. In an evaluation of several screening protocols (including molecular shape), poor enrichments were obtained for all protocols for the majority of query molecules (Tiikkainen 2009). The use of such a challenging benchmark allows us to critically evaluate FOMS and shape constraints in the context of a dataset where traditional approaches often fail.

Of the 17 targets in the MUV dataset, we identified 10 that had a receptor-ligand structure in the Protein Data Bank (PDB) where the ligand had sub-$$\mu$$M affinity. The interaction diagrams of these structures and their PDB codes are shown in Figure \ref{targets}. For each of these structures we manually identified interacting fragments that could potentially serve as anchor fragments. For each target we selected relatively generic functional groups (at most 7 atoms) that were sufficiently common among both the actives and decoys to yield meaningful results and that were clearly forming interactions with the receptor. An exit vector atom was included in generic ring fragments to limit the number of innate symmetries in the fragment and reduce the number for aligned poses. Chosen fragments are shown in Table \ref{fragtable} which includes the number of matching actives and decoys for each benchmark. A second fragment is chosen for the PKA target to illustrate the effect of fragment specificity.

Target SMARTS Confirmatory Assay ID Actives Decoys
CathG c1ccccc1[!H] 832 24 12847
ER$$\alpha$$ agonist c1ccccc1[!H] 737 29 13390
ER$$\alpha$$ c1ccccc1[!H] 713 22 13063
ER$$\beta$$ c1ccccc1[!H] 733 23 12839
FAK a1aaaaa1[!H] 810 29 13891
FXIa a1aaaa1[!H] 846 24 7691
HIVrt a1aaaa1[!H] 652 18 9535
HSP90 c1ccccc1[!H] 712 29 12915
PKA c1[c,n]cccn1 466 25 5582
PKA$$_\mathrm{alt}$$ a1aaaaa1[!H] 466 29 13985
Rho c1[c,n]cccn1 644 18 4593

\label{targets} Graphical depictions of the ten receptor-ligand structures used to define shape constraints. The fragment of the ligand used for alignment is highlighted. Diagrams generated with PoseView (Stierand 2006).

## Virtual Screening Evaluation

In order to investigate the utility of the FOMS approach we consider both the ability of shape constraint filters to generate enriched subsets and the quality of shape similarity rankings generated using fragment aligned molecules. We compare to VAMS, which aligns all molecules to a canonical reference system based on their moments of inertia, and rdMolAlign, the shape alignment module of RDKit (Landrum) which aligns molecules using Open3DALIGN (Tosco 2011). Open3DALIGN uses shape and property information to perform an atom-based alignment that attempts to match the largest number of atoms having similar partial charges and MMFF94 atom types. The input conformers for RDKit alignment were the same aligned poses used with VAMS, and the computed similarity score is the Tanimoto coefficient.

For each target, conformers of the active and decoy compounds were generated using RDKit (Landrum). A maximum of 100 conformers with a minimum RMSD difference of 0.7Å and a maximum energy difference of 10 kcal/mol were generated for each compound. For each fragment considered, the corresponding conformers were extracted into fragment-specific subsets. The subsets were then preprocessed to create VAMS and FOMS search databases. The VAMS database stores a single pose for each conformation aligned along its moments of inertia. For FOMS, if a compound contains multiple instances of the anchor fragment or the fragment contains symmetries, multiple poses per a conformation are stored to account for the multiple fragments/symmetries.

For each target, a single reference structure was identified by searching BindingDB (Liu 2007), PDBbind (Wang 2005), and Binding MOAD (Hu 2005) for the complex with the best binding affinity. The structures were visually inspected to identify buried anchor fragments that made key contacts with the protein, as shown in Figure \ref{targets} and Table \ref{fragtable}. All shape queries were constructed using the reference complexes.

We evaluated both shape constraints and shape similarity using the single reference complex. For the shape constraint search we evaluated multiple maximum shape constraints by shrinking the receptor shape by 0, 0.5, 1.0, 1.5, and 2.0 Ångstroms. For the minimum shape constraints we used interaction points with minimal radius (one voxel). We evaluated every possible query of interaction points. That is, if $$n$$ interaction points were identified, we evaluated $$2^n$$ minimum shapes. Selecting subsets of interaction points reduces the dependency of the query on the full shape of the reference molecule and more closely mimics the use case where the shape constraints are manually constructed by an expert to select only the most important interactions. As each shape constraint query is a filter, it does not return a total ranking of database compounds, so we report the true positive and false positive rate of the selected subset. We calculate a p-value for the returned subset (relative to the null hypothesis of random selection) using a hypergeometric test. Reported p-values are corrected for multiple comparisons using the Bonferroni correction (the calculated p-value is multiplied by the number of comparisons).

Shape similarity results are reported using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. This metric is not biased by the number of compounds and has well-developed statistical properties (Jain 2007). The AUC is equivalent to the probability that a randomly selected active and randomly selected inactive compound will be correctly ranked by a method. An AUC of 0.5 corresponds to random chance while a perfect predictor exhibits an AUC of 1.0.

# Results

Consistent with previous studies (Tiikkainen 2009), we find the MUV dataset to be a challenging target for shape-based screening with few targets demonstrating AUCs far from random performance. Overall FOMS either matched or exceeded the virtual screening performance of VAMS and RDKit while retaining most of the benefits of pre-aligned molecules. Specifically, it is orders of magnitude faster than the optimizing alignment of RDKit suggesting, at a minimum, that FOMS is a viable method for rapidly pre-screening large libraries. In general, the Pareto frontier of the shape constraint queries delineates virtual screen performance equivalent or better to a full shape similarity search while performing queries orders of magnitude faster.

For each target, we plot the ROC curves for FOMS, VAMS, and RDKit similarity search as well as the results for every interaction point shape constraint query. Shape constraint results along the Pareto frontier are highlighted as solid circles and the most statistically significant result is annotated with its Bonferroni-corrected $$p$$-value. We also plot the total time required for the FOMS, VAMS, and RDKit similarity search and provide a box plot of the distribution of times for the shape constraint searches. All time plots are plotted on a logarithmic scale due to the orders of magnitude difference between the methods. Shape constraint search time varies considerably depending on the number of hits retrieved. Most queries are highly selective, return few or no results, and take less than a hundredth of a second. We do not include the time needed to generate the shape constraints in the query time since the receptor-based constraints are generated once and cached. We plot times for only those queries that generate results (queries that were too selective to return any matches were nearly instantaneous). For these queries the median time, as shown in the box plots, remains below a tenth of a second. A few queries are non-selective and return subsets of compounds that are comparable in size to the full database. These queries approach or exceed the running time of a full similarity search. The statistically most significant shape constraint query, annotated with a p-value in the ROC plot, has its running time plotted as a circle. These informative queries typically exceed the median time, but still take less than a second.

\label{cathg} Cathepsin G virtual screening performance and search time for various methods.

\label{cathgip} Cathepsin G interaction points that defined the shape constraint query with the most significant enrichment. The interaction points (magenta spheres) and fragment (magenta sticks) delineate only half the molecule and are located near a charge-charge interaction and aromatic interaction. Unlike molecular similarity, shape constraints are capable of selecting molecules that sterically match these features while ignoring the remainder of the query ligand shape and avoiding significant clashes with the receptor shape (for this query the receptor was reduced by 1.5Å).

## Cathepsin G

Virtual screening results for cathepsin G are shown in Figure \ref{cathg}. All three shape similarity methods exhibited poor performance with this query with AUCs near 0.5, but many of the interaction point shape constraint queries were significantly better than random. The statistically most significant interaction point query is shown in Figure \ref{cathgip}. The fragment for the query is deeply buried within the S1 pocket and the interaction points match the S2 pocket and phosphonate moiety of the ligand. However, there are no interaction points selected in the S3/S4 pocket. Interactions in this pocket can increase affinity, but are not necessary for binding (Garavilla 2005). Unlike whole-ligand similarity methods, shape constraint search can ignore a significant part of the geometry of the ligand (the naphthalene in the S3 pocket) and select for only specific parts of the query shape. In this case, this selectivity results in substantially better virtual screening results that is two orders of magnitude faster than they pre-aligned similarity search methods.

\label{eralphapot} ER$$\alpha$$ agonists virtual screening performance and search time for various methods.

\label{eralpha} ER$$\alpha$$ inhibitors virtual screening performance and search time for various methods.

## Estrogen Receptor $$\alpha$$

Virtual screening results for estrogen receptor $$\alpha$$ agonists and inhibitors are shown in Figures \ref{eralphapot} and \ref{eralpha}. RDKit was marginally successful at identifying inhibitors with an AUC of 0.59 but the remaining methods were less successful with AUCs ranging from 0.43 to 0.52 and none of the shape-based methods exhibited early enrichment. The shape constraint queries lacked statistical significance and the most significant queries identified a large percentage of the total compounds in the database, resulting in running times that sometimes exceeded a second.

\label{erbeta} ER$$\beta$$ virtual screening performance and search times for various methods.

## Estrogen Receptor $$\beta$$

Virtual screening results for estrogen receptor $$\beta$$ are shown in Figure \ref{erbeta}. None of the methods outperform random selection and all methods fail to exhibit early enrichment. Although shape constraints are capable of improving on the performance of the similarity methods, they still lack significance.

\label{fak} FAK virtual screening performance and search times for various methods.

Virtual screening results for focal adhesion kinase (FAK) are shown in Figure \ref{fak}. Similar to ER$$\beta$$, none of the methods outperform random selection. However, there is at least one shape constraint query capable of generating an enriched subset with statistical significance. Note that unlike the other kinases, Rho and PKA, for FAK a nitrogen containing ring pattern was not chosen for fragment matching since fewer than half of the actives in the benchmark had such a pattern. Unlike the other kinases, using such a pattern for FAK does not improve the virtual screening results (data not shown).

\label{fxia} Factor XIa virtual screening performance and search times for various methods.

## Factor XIa

Virtual screening results for factor XIa (FXIA) are shown in Figure \ref{fxia}. All three shape similarity methods perform worse than random with AUCs ranging from 0.31 (VAMS) to 0.42 (RDKit and FOMS), with both VAMS and FOMS failing to identify any actives early in the screen. Shape constraints do not select enriched subsets and the pre-alignment methods exhibit no early enrichment behavior.

\label{hivrt} HIV reverse transcriptase virtual screening performance and search times for various methods.

## HIV Reverse Transcriptase

Virtual screening results for HIV reverse transcriptase (HIV-RT) are shown in Figure \ref{hivrt}. Again, none of the methods outperform random selection. However, FOMS displays the best early enrichment up to a true positive rate of 17% with RDKit also achieving some early enrichment. The shape constraint queries are statistically significant and can identify enriched subsets with greater proficiency than the shape similarity methods.

\label{hsp90} Heat shock protein 90 virtual screening performance and search times for various methods.

## Heat Shock Protein 90

Virtual screening results for heat shock protein 90 (HSP90) are shown in Figure \ref{hsp90}. All three shape similarity methods perform poorly with AUCs from 0.38 (VAMS) to 0.49 (RDKit). The shape constraint queries can produce statistically significant enrichments and the Pareto frontier matches or exceeds the ROC curve of RDKit.

\label{pka} Protein kinase A virtual screening performance and search times for various methods.

\label{pka5} Protein kinase A virtual screening performance and search times for various methods with a less selective fragment choice.

## Protein Kinase A

Virtual screening results for protein kinase A (PKA) are shown in Figure \ref{pka}. For this kinase FOMS substantially outperforms both RDKit (AUC 0.64) and VAMS (AUC 0.55) with an AUC of 0.90. Shape constraints can perform even better than the full similarity search. Much of this performance is due to the use of a more selective fragment that properly orients the nitrogen in the ring fragment. Figure \ref{pka5} shows the result when a more generic fragment, an arbitrary aromatic ring (PKA$$_\mathrm{alt}$$ in Table \ref{fragtable}), is used. The enhanced performance of FOMS disappears, and the ability of shape constraints to generated enriched subsets is reduced. In comparison, the optimizing shape similarity of RDKit performs about the same (AUC 0.62). The use of this fragment pulls in thousands more decoys and introduces many more valid fragment alignments for both actives and decoys. This implies that the good performance of FOMS and shape constraints with this target is likely due the ability of the fragment to properly position compounds to make conserved interactions with the hinge region of the kinase, as shown in Figure \ref{targets}. Compounds that contain the specified fragment but cannot position the fragment to make these interactions while maintaining the steric compatibility of the whole ligand with the query ligand and receptor are filtered out.

\label{rho} Rho-kinase 2 virtual screening performance and search times for various methods.

## Rho-Kinase2

Virtual screening results Rho-Kinase2 are shown in Figure \ref{rho} and are similar to those of PKA. FOMS achieves an AUC of 0.94, VAMS an AUC of 0.56, and RDKit an AUC of 0.66. The best shape constraints match or exceed the performance of FOMS. As with PKA, if a less selective fragment is used (a generic aromatic ring), virtual performance is substantially reduced (data not shown).

\label{aucs} AUCs for shape similarity methods across all targets shown with their 95% confidence intervals. Confidence intervals were calculated using 2000 iterations of bootstrapping.

\label{pvaltable} Significance of the best shape constraint queries and the percentage of all evaluated queries with a Bonferroni-corrected $$p$$-value below 0.01.
Target Best $$p$$-value Sig. Queries
CathG 4.23e-06 1%
ER$$\alpha$$ 0.0505 0%
ER$$\alpha$$ agonist 0.0106 0%
ER$$\beta$$ 0.0328 0%
FAK 0.000289 4%
FXIa 0.387 0%
HIVrt 0.000951 5%
HSP90 0.000755 2%
PKA 1.14e-23 80%
Rho 8.41e-15 80%

\label{confs} (Left) The effect of the maximum number of conformers generated for each molecule on screening performance, and (Right) the change in distribution of similarity scores for active and decoy compounds using FOMS for different numbers of conformers.

# Discussion

The MUV dataset, with its focus on eliminating analogue bias, is particularly resistant to single-query shape-based virtual screens (Tiikkainen 2009). This is reflected in our overall results, shown in Figure \ref{aucs}, where only two targets (Rho and PKA) achieve AUCs where the 95% confidence interval does not overlap with 0.5 (random performance). The remaining targets likely lack meaningful whole-molecule shape complementary between the query ligand and the active compounds of the benchmark. One exception may be HIV-rt, where there is clear early enrichment which indicates that a subset of the actives may be compatible with the query molecule. FOMS dramatically outperformed other methods for the Rho and PKA targets due to correct positioning of a fragment with key, conserved interactions. For comparison, Figure \ref{aucs} also shows the performance of a 2D fingerprint, the OpenBabel FP2 (OBoyle 2011) path-based fingerprint. 2D information is more successful for three targets (ER$$\beta$$, FXIa, HSP90), and has comparable or worse performance for the remaining targets, illustrating the orthogonality of 2D and 3D approaches.

Our assumption in the design of this study was that, since the evaluated methods utilize rigid conformers, it would be necessary to generate a large sampling of conformers to ensure the biologically relevant conformer was screened and was correctly registered as a hit. Surprisingly, when we tested this assumption with the Rho Kinase target (results were similar for PKA), we found that the virtual screening performance for all three shape methods was relatively insensitive to the number of conformers used (Figure \ref{confs}). Interestingly, this is not because the top ranked conformer is the most representative conformer. Instead, as demonstrated in Figure \ref{confs}, reducing the number of conformers does result in a reduction in similarity scores for active compounds. However, a compensating reduction in scores is observed in the decoy set as the number of conformers sampled is decreased, resulting in similar virtual screening results. This suggests that sampling a large number of conformations will be beneficial in terms of producing meaningful poses with high similarity to the query ligand, a key advantage of shape-based methods, but, counter-intuitively, may not be strictly necessary for virtual screening purposes.

FOMS essentially provides a rapid means of template docking (Ruiz-Carmona 2014, Abagyan 2015, Koes 2012) using shape-based scoring. The disadvantage of fragment-oriented approaches is they are critically dependent on the choice of fragment and its proper positioning in defining the query. Provided these requirements can be met, there are several advantages to shape-based fragment alignment search. By enforcing the fragment alignment, key interaction are guaranteed to be conserved. Previous studies have demonstrated the importance of adding pharmacophoric properties (or ‘color’) to shape similarity (Hawkins 2007). Fragment alignment introduces a hard bias toward matching a key portion of the query molecule without introducing any additional computation, as required by more general methods. In fact, as we have shown, pre-alignment substantially reduces the computational overhead.

Prealignment, whether to fragments (FOMS) or canonical internal coordinates (VAMS) is orders of magnitude faster than methods that dynamically optimize the alignment. This holds true even if the cost to create the search database is taken into account. The time to create the databases scales with the number of molecular shapes (about 10 shapes a second on our system) and compares favorable with RDKit search (2 molecules a second). As the common case is for a fragment oriented database to be re-used to for queries by multiple users investigating multiple targets, in practice the cost of database creation gets amortized into insignificance.

A major advantage of fragment alignment is that it enables the use of shape constraints. Shape constraint search generally tracked or improved upon the performance of FOMS similarity ranking (e.g. Figure \ref{cathg}). As shown in Table \ref{pvaltable}, shape constraints were able to generate statistically significant ($$p < 0.01$$) enriched subsets for six of the ten targets. Unlike whole-molecule shape similarity, shape constraints can select for a subshape of the query ligand (i.e., perform partial shape matching) and specifically filter out potential clashes with the receptor. Furthermore, shape constraints support an indexing search algorithm that scales sub-linearly. In our evaluation, shape constraint searches were orders of magnitude faster than even VAMS similarity search and generally took well under a second.

Shape constraints provide a novel and unique method for specifying molecular shape queries. Although we have investigated automatically generated shape constraints, our assumption is that intelligently designed constraints created by human experts will substantially outperform these interaction point based constraints. The speed of shape constraint searches enables interactive analysis and refinement of shape queries where the expert user sculpts their desired molecule guided by the results of searches of full-sized databases. We anticipate that fragment-oriented shape constraints will prove a useful tool in both fragment-based drug discovery workflows, which are naturally centered around a fragment, and lead optimization workflows, where the core scaffold of the lead compound can serve as a fragment.

## Acknowledgements

This work was supported by the National Institute of Health [R01GM097082, R01GM108340].

### References

1. A. Nicholls, G. B. McGaughey, R. P. Sheridan, A. C. Good, G. Warren, M. Mathieu, S. W. Muchmore, S. P. Brown, J. A. Grant, J. A. Haigh. Molecular shape and medicinal chemistry: a perspective. J Med Chem note = [PubMed:20158188] [PubMed Central:PMC2874267] [doi:10.1021/jm900818s] 53, 3862 American Chemical Society, 2010.

2. T. S. Rush III, J. A. Grant, L. Mosyak, A. Nicholls. A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J Med Chem note = [PubMed:15743191] [doi:10.1021/jm040163o] 48, 1489–1495 ACS Publications, 2005.

3. D. R. McMasters, M. Garcia-Calvo, V. Maiorov, M. E. McCann, R. D. Meurer, H. G. Bull, J. M. Lisnock, K. L. Howell, R. J. DeVita. Spiroimidazolidinone NPC1L1 inhibitors. 1: Discovery by 3D-similarity-based virtual screening. Bioorganic & medicinal chemistry letters note = [PubMed:19410454] [doi:10.1016/j.bmcl.2009.04.031] 19, 2965–2968 Elsevier, 2009.

4. S. W. Muchmore, A. J. Souers, I. Akritopoulou-Zanze. The Use of Three-Dimensional Shape and Electrostatic Similarity Searching in the Identification of a Melanin Concentrating Hormone Receptor 1 Antagonist. Chemical biology & drug design note = [PubMed:16492165] [doi:10.1111/j.1747-0285.2006.00341.x] 67, 174–176 Wiley Online Library, 2006.

5. Edmund Naylor, Abdelilah Arredouani, Sridhar R Vasudevan, Alexander M Lewis, Raman Parkesh, Akiko Mizote, Daniel Rosen, Justyn M Thomas, Minoru Izumi, A Ganesan, Antony Galione, Grant C Churchill. Identification of a chemical probe for NAADP by virtual screening. Nature Chemical Biology 5, 220–226 Nature Publishing Group, 2009. Link

6. D. C. Rees, M. Congreve, C. W. Murray, R. Carr. Fragment-based lead discovery. Nature Reviews Drug Discovery note = [PubMed:15286733] [doi:10.1038/nrd1467] 3, 660–672 Nature Publishing Group, 2004.

7. Miles Congreve, Gianni Chessari, Dominic Tisi, Andrew J. Woodhead. Recent Developments in Fragment-Based Drug Discovery. J Med Chem 51, 3661–3680 American Chemical Society, 2008.

8. Jerry Osagie Ebalunode, Zheng Ouyang, Jie Liang, Weifan Zheng. Novel Approach to Structure-Based Pharmacophore Search Using Computational Geometry and Shape Matching Techniques. J. Chem. Inf. Model. 48, 889–901 American Chemical Society, 2008.

9. M. J. Vainio, J. S. Puranen, M. S. Johnson. ShaEP: molecular overlay based on shape and electrostatic potential. Journal of chemical information and modeling note = [PubMed:19434847] [doi:10.1021/ci800315d] 49, 492–502 ACS Publications, 2009.

10. T. Cheeseright, M. Mackey, S. Rose, A. Vinter. Molecular field extrema as descriptors of biological activity: definition and validation. Journal of chemical information and modeling note = [PubMed:16562997] [doi:10.1021/ci050357s] 46, 665–676 ACS Publications, 2006.

11. D. A. Thorner, D. J. Wild, P. Willett, P. M. Wright. Similarity searching in files of three-dimensional chemical structures: flexible field-based searching of molecular electrostatic potentials. J. Chem. Inf. Comput. Sci note = [doi:10.1021/ci960002w] 36, 900–908 (1996).

12. A. J. Tervo, T. Rönkkö, T. H. Nyrönen, A. Poso. BRUTUS: optimization of a grid-based similarity function for rigid-body molecular superposition. 1. Alignment and virtual screening applications. Journal of medicinal chemistry note = [PubMed:15943481] [doi:10.1021/jm049123a] 48, 4076–4086 ACS Publications, 2005.

13. R. M. Marí-n, N. F. Aguirre, E. E. Daza. Graph theoretical similarity approach to compare molecular electrostatic potentials. Journal of chemical information and modeling note = [PubMed:18166018] [doi:10.1021/ci7001878] 48, 109–118 ACS Publications, 2008.

14. M. Sastry, S. Dixon, W. Sherman. Rapid Shape-Based Ligand Alignment and Virtual Screening Method Based on Atom/Feature-Pair Similarities and Volume Overlap Scoring. J. Chem. Inf. Model. note = [PubMed:21870862] [doi:10.1021/ci2002704] ACS Publications, 2011.

15. A. C. Good, W. G. Richards. Rapid evaluation of shape similarity using Gaussian functions. J. Chem. Inf. Model. 33, 112–116 American Chemical Society, 1993.

16. J. A. Grant, M. A. Gallardo, B. T. Pickup. A fast method of molecular shape comparison: A simple application of a Gaussian description of molecular shape. Journal of Computational Chemistry 17, 1653–1666 John Wiley & Sons, 1996.

17. Paul C. D. Hawkins, A. Geoffrey Skillman, Anthony Nicholls. Comparison of Shape-Matching and Docking as Virtual Screening Tools. J. Med. Chem. 50, 74–82 American Chemical Society (ACS), 2007. Link

18. Ewgenij Proschak, Matthias Rupp, Swetlana Derksen, Gisbert Schneider. Shapelets: Possibilities and limitations of shape-based virtual screening. Journal of Computational Chemistry note = [PubMed:17516427] [doi:10.1002/jcc.20770] 29, 108–114 Wiley Subscription Services, Inc., A Wiley Company, 2008.

19. F. Fontaine, E. Bolton, Y. Borodina, S. H. Bryant. Fast 3D shape screening of large chemical databases through alignment-recycling. Chemistry Central Journal note = [PubMed:17880744] [PubMed Central:PMC1994057] [doi:10.1186/1752-153X-1-12] 1, 12 BioMed Central, 2007.

20. David Ryan Koes, Carlos J. Camacho. Shape-based virtual screening with volumetric aligned molecular shapes. Journal of Computational Chemistry 35, 1824–1834 (2014). Link

21. P. J. Ballester, W. G. Richards. Ultrafast shape recognition to search compound databases for similar molecular shapes. J. Comp. Chem. note = [PubMed:17342716] [doi:10.1002/jcc.20681] 28, 1711 John Wiley & Sons, Ltd, 2007.

22. Adrian M Schreyer, Tom Blundell. USRCAT: real-time ultrafast shape recognition with pharmacophoric constraints. Journal of Cheminformatics 4, 27 Springer Science $$\mathplus$$ Business Media, 2012. Link

23. R. J. Zauhar, G. Moyna, L. F. Tian, Z. J. Li, W. J. Welsh. Shape signatures: a new approach to computer-aided ligand-and receptor-based drug design. J. Med. Chem note = [PubMed:14667221] [doi:10.1021/jm030242k] 46, 5674–5690 (2003).

24. J. A. Haigh, B. T. Pickup, J. A. Grant, A. Nicholls. Small molecule shape-fingerprints. J. Chem. Inf. Model note = [PubMed:15921457] [doi:10.1021/ci049651v] 45, 673–684 ACS Publications, 2005.

25. S. Putta, C. Lemmen, P. Beroza, J. Greene. A novel shape-feature based approach to virtual library screening. J. Chem. Inf. Comput. Sci note = [PubMed:12377013] 42, 1230–1240 (2002).

26. Gisbert Schneider, Uli Fechner. Computer-based de novo design of drug-like molecules. Nat Rev Drug Discov 4, 649–663 (2005). Link

27. E. K. Kick, D. C. Roe, A. Geoffrey Skillman, G. Liu, T. J. A. Ewing, Y. Sun, I. D. Kuntz, J. A. Ellman. Structure-based design and combinatorial chemistry yield low nanomolar inhibitors of cathepsin D. Chemistry & Biology note = [PubMed:9195867] 4, 297–307 Elsevier, 1997.

28. C. W. Murray, D. E. Clark, T. R. Auton, M. A. Firth, J. Li, R. A. Sykes, B. Waszkowycz, D. R. Westhead, S. C. Young. PRO_SELECT: combining structure-based drug design and combinatorial chemistry for rapid lead discovery. 1. Technology. J Comput Aided Mol Des note = [PubMed:9089436] 11, 193–207 Springer, 1997.

29. J. Li, C. W. Murray, B. Waszkowycz, S. C. Young. Targeted molecular diversity in drug discovery: integration of structure-based design and combinatorial chemistry. Drug discovery today 3, 105–112 Elsevier, 1998.

30. J. W. Liebeschuetz, S. D. Jones, P. J. Morgan, C. W. Murray, A. D. Rimmer, J. M. E. Roscoe, B. Waszkowycz, P. M. Welsh, W. A. Wylie, S. C. Young. PRO_SELECT: combining structure-based drug design and array-based chemistry for rapid lead discovery. 2. The development of a series of highly potent and selective factor Xa inhibitors. Journal of medicinal chemistry note = [PubMed:11881991] 45, 1221–1232 ACS Publications, 2002.

31. David Koes, Kareem Khoury, Yijun Huang, Wei Wang, Michal Bista, Grzegorz M. Popowicz, Siglinde Wolf, Tad A. Holak, Alexander Dömling, Carlos J. Camacho. Enabling Large-Scale Design Synthesis and Validation of Small Molecule Protein-Protein Antagonists. PLoS ONE 7, e32839 Public Library of Science (PLoS), 2012. Link

32. D. Rajamani, S. Thiel, S. Vajda, C. J. Camacho. Anchor residues in protein-protein interactions. Proceedings of the National Academy of Sciences 101, 11287–11292 Proceedings of the National Academy of Sciences, 2004. Link

33. L. M. C. Meireles, A. S. Domling, C. J. Camacho. ANCHOR: a web server and database for analysis of protein-protein interaction binding pockets for drug discovery. Nucleic Acids Research 38, W407–W411 Oxford University Press (OUP), 2010. Link

34. R. Brenke, D. Kozakov, G. Y. Chuang, D. Beglov, D. Hall, M. R. Landon, C. Mattos, S. Vajda. Fragment-based identification of druggable hot spots of proteins using Fourier domain correlation techniques. Bioinformatics note = [PubMed:19176554] [PubMed Central:PMC2647826] [doi:10.1093/bioinformatics/btp036] 25, 621 Oxford Univ Press, 2009.

35. A. M. Bronstein, M. M. Bronstein, A. M. Bruckstein, R. Kimmel. Partial similarity of objects, or how to compare a centaur to a horse. International Journal of Computer Vision 84, 163–183 Springer, 2009.

36. sproxel, r173.

37. J. Zhang, S. Smith. Shape Similarity Matching With Octree Representations. Journal of Computing and Information Science in Engineering 9, 034503 (2009).

38. David Ryan Koes, Carlos J. Camacho. Indexing volumetric shapes with matching and packing. Knowledge and Information Systems 43, 157–180 Springer Science $$\mathplus$$ Business Media, 2014. Link

39. Daniel A. Keim. Efficient geometry-based similarity search of 3D spatial databases. 419–430 In Proc. of the Intl. Conf. on Management of Data. ACM, 1999. Link

40. S. G. Rohrer, K. Baumann. Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. J Chem Inf Model note = [PubMed:19434821] [doi:10.1021/ci8002649] 49, 169–84 (2009).

41. A. C. Good, T. I. Oprea. Optimization of CAMD techniques 3. Virtual screening enrichment studies: a help or hindrance in tool selection?. Journal of computer-aided molecular design 22, 169–178 Springer, 2008.

42. M. L. Verdonk, V. Berdini, M. J. Hartshorn, W. T. Mooij, C. W. Murray, R. D. Taylor, P. Watson. Virtual screening using protein-ligand docking: avoiding artificial enrichment.. Journal of chemical information and computer sciences note = [PubMed:15154744] [doi:10.1021/ci034289q] 44, 793 (2004).

43. Michael M. Mysinger, Michael Carchia, John. J. Irwin, Brian K. Shoichet. Directory of Useful Decoys Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 55, 6582–6594 American Chemical Society (ACS), 2012. Link

44. Pekka Tiikkainen, Patrick Markt, Gerhard Wolber, Johannes Kirchmair, Simona Distinto, Antti Poso, Olli Kallioniemi. Critical Comparison of Virtual Screening Methods against the MUV Data Set. Journal of Chemical Information and Modeling 49, 2168–2178 American Chemical Society (ACS), 2009. Link

45. K. Stierand, P. C. Maass, M. Rarey. Molecular complexes at a glance: automated generation of two-dimensional complex diagrams. Bioinformatics 22, 1710–1716 Oxford University Press (OUP), 2006. Link

46. Greg Landrum. RDKit: Open-source cheminformatics.

47. Paolo Tosco, Thomas Balle, Fereshteh Shiri. Open3DALIGN: an open-source software aimed at unsupervised ligand alignment. Journal of Computer-Aided Molecular Design 25, 777–783 Springer Science $$\mathplus$$ Business Media, 2011. Link

48. T. Liu, Y. Lin, X. Wen, R. N. Jorissen, M. K. Gilson. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Research 35, D198–D201 Oxford University Press (OUP), 2007. Link

49. Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-Yie Yang, Shaomeng Wang. The PDBbind Database:  Methodologies and Updates. J. Med. Chem. 48, 4111–4119 American Chemical Society (ACS), 2005. Link

50. Liegi Hu, Mark L. Benson, Richard D. Smith, Michael G. Lerner, Heather A. Carlson. Binding MOAD (Mother Of All Databases). Proteins 60, 333–340 Wiley-Blackwell, 2005. Link

51. Ajay N. Jain. Bias reporting, and sharing: computational evaluations of docking methods. Journal of Computer-Aided Molecular Design 22, 201–212 Springer Science $$\mathplus$$ Business Media, 2007. Link

52. L. de Garavilla, M. N. Greco, N. Sukumar, Z.-W. Chen, A. O. Pineda, F. S. Mathews, E. Di Cera, E. C. Giardino, G. I. Wells, B. J. Haertlein, J. A. Kauffman, T. W. Corcoran, C. K. Derian, A. J. Eckardt, B. P. Damiano, P. Andrade-Gordon, B. E. Maryanoff. A Novel Potent Dual Inhibitor of the Leukocyte Proteases Cathepsin G and Chymase: MOLECULAR MECHANISMS AND ANTI-INFLAMMATORY ACTIVITY IN VIVO. Journal of Biological Chemistry 280, 18001–18007 American Society for Biochemistry & Molecular Biology (ASBMB), 2005. Link

53. Noel M OBoyle, Michael Banck, Craig A James, Chris Morley, Tim Vandermeersch, Geoffrey R Hutchison. Open Babel: An open chemical toolbox. Journal of Cheminformatics 3, 33 Springer Science $$\mathplus$$ Business Media, 2011. Link

54. Sergio Ruiz-Carmona, Daniel Alvarez-Garcia, Nicolas Foloppe, A. Beatriz Garmendia-Doval, Szilveszter Juhos, Peter Schmidtke, Xavier Barril, Roderick E. Hubbard, S. David Morley. rDock: A Fast Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids. PLoS Comput Biol 10, e1003571 Public Library of Science (PLoS), 2014. Link

55. RA Abagyan, A Orry, E Raush, M and Totrov. ICM Manual, 3.8. MolSoft LLC, La Jolla, CA (2015).