Title: Example Poster
CLIP: The Candidate Ligand Identification Program
Nicholas Rhodes1, Peter Willett1, Alain Calvet2
and Christine Humblet2
- Background
- Recent improvements in combinatorial synthesis
techniques have resulted in the availability of
very large numbers of molecules for
high-throughput screening (HTS) systems.
Although HTS is very efficient in operation,
considerations of cost-effectiveness mean that
screening should be restricted as far as possible
to molecules that have a reasonable probability
of being active. This has led to much interest
in methods for virtual screening, i.e., the
ranking of a set of compounds in decreasing
probability of activity so that biological
testing can be restricted to a (hopefully) small
fraction of the total number of molecules
available for consideration.
- One of the most important virtual screening
techniques is ligand docking. This involves
determining whether a molecule is complementary
(in terms of its steric, hydrophobic and
electrostatic characteristics) to the binding
site of a protein for which the 3D structure is
available (typically from X-ray crystallography).
Several programs for ligand docking are now
widely available and, although effective in
operation, they can be quite slow, especially when
an attempt is made to explore the conformational
space of the potential ligands for the chosen
target.
- CLIP was designed to provide a fast alternative
to docking methods and, specifically, to meet the
following criteria:
- It should be based on the 3D structures of
ligands, rather than the 2D structures used in
conventional similarity searching.
- It should be able to utilise information about
the binding site if a protein 3D structure is
available.
- It should be sufficiently fast in operation to
permit the virtual screening of a million
compounds in an overnight run.
- The programs
- CLIP takes as inputs modified MOL2 files that
have been pre-classified to include information
about donors, acceptors, electronegativity, etc.
This classification is done once and for all by a
Python script (CAP.py), using the classification
scheme proposed by Pepperell et al. One of these
inputs is the query template; the others are
candidate molecules in a database. The query is
then matched successively against each entry in
the database, and the results are sorted and
presented.
- Results
- The actives clustered into 4 groups (UNITY RNN
clustering) consisting of one group of 36 and
three singletons. When each of the actives was
used as a template, all of the structures from the
major cluster retrieved a majority of the actives
from the same cluster in the top 100; indeed, 11
retrieved all 36 cluster members. As expected,
the singletons retrieved only themselves and, in
one case, one other active molecule, indicating
that CLIP is highly discriminating.
- The data are presented as cumulative recall plots
in Figures 2-4 (right). Each plot shows:
- the ideal situation (actives ranked 1-39)
- the average (random) case, i.e. one hit every
5000/39 (about 128) structures
- the ranking obtained by docking of 3D structures
into the HIV protease binding site using GOLD in
command-line mode with default parameters. The
cavity was centred on atom 242 (D25 OD1) by GOLD
flood-fill with a radius of 15 Å.
- the rankings obtained from CLIP and UNITY 2D
searches for each of two templates
- Using the templates 0154385 (6 nodes) and 0162034
(3 nodes), CLIP ranked respectively 12 and 61
molecules with a coefficient of unity. Note that
CLIP will not analyse any structures that have
fewer nodes than the minimum clique size
parameter. These analyses were performed with
MINCLQ 3, and only compounds containing matching
cliques were ranked, so for template 0154386 only
1822 of a possible 4981 were ranked and only 1155
for template 0162034. Because CLIP ranked many
structures with identical Simpson's coefficients,
their ranks were averaged, so that, for example,
12 top-ranked molecules (Simpson's coefficient of
1) all received an equal ranking of 6.5. This
technique results in some discontinuous jumps in
the CLIP data series.
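- The rank averaging described above can be
illustrated with a minimal Python sketch. It is
not taken from CLIP itself: the function name
average_tied_ranks and the toy scores are
assumptions made for illustration only.

```python
def average_tied_ranks(scores):
    """Return 1-based ranks (best score first), averaging ranks across ties."""
    # Indices of the scores, ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next score equals the first of the block.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        mean_rank = (start + 1 + end + 1) / 2.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


# Toy example: twelve candidates tied at a coefficient of 1.0,
# followed by two candidates with lower (made-up) coefficients.
scores = [1.0] * 12 + [0.8, 0.75]
ranks = average_tied_ranks(scores)
print(ranks[0], ranks[12], ranks[13])
```

Twelve tied molecules occupy positions 1 to 12, whose
mean is 6.5; the next molecule is then ranked 13,
which is the kind of jump seen in the CLIP series.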
- The same two molecules were also used as UNITY
queries for a default 2D similarity search; here,
by reducing the minimum similarity to zero,
UNITY effectively ranked the entire data set.
UNITY is probably marginally more effective than
CLIP and, whilst GOLD is an improvement on random
selection, it is considerably less effective at
ranking actives in this data set. However, it
should be noted that GOLD was designed to
identify the binding modes of small numbers of
molecules and not for this type of approach.
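- For readers unfamiliar with cumulative recall
plots, the following Python sketch shows how such
a curve, and the ideal and random reference lines,
might be computed from a ranked list. The function
names and the toy identifiers (m1 to m5) are
hypothetical; only the totals of 5000 structures
and 39 actives come from the poster.

```python
def cumulative_recall(ranked_ids, active_ids):
    """Number of actives found among the top-k structures, for k = 1..N."""
    actives = set(active_ids)
    found = 0
    curve = []
    for mol_id in ranked_ids:
        if mol_id in actives:
            found += 1
        curve.append(found)
    return curve


def ideal_curve(n_total, n_actives):
    """All actives ranked first: recall rises by one per structure, then plateaus."""
    return [min(k, n_actives) for k in range(1, n_total + 1)]


def random_curve(n_total, n_actives):
    """Expected recall for a random ordering: n_actives / n_total per structure."""
    return [k * n_actives / n_total for k in range(1, n_total + 1)]


# Toy ranking of five molecules, two of which are active.
curve = cumulative_recall(["m3", "m1", "m4", "m2", "m5"], {"m1", "m2"})
print(curve)

# Reference lines for the data set size used in the poster.
ideal = ideal_curve(5000, 39)
expected_random = random_curve(5000, 39)
```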
Figure 2
- The experiments
- Using a set of 5k candidate HIV protease
inhibitors containing 39 known actives, we
present performance comparisons of CLIP against:
- UNITY 2D fingerprints
- docking of 3D structures into the HIV protease
binding site using GOLD (Jones et al.)
- In the first of these, though performing a 3D
match, CLIP is effectively acting as a similarity
tool. Ranking was performed using a similarity
metric based on Simpson's coefficient,
$S = a / \min(b, c)$,
where a is the clique size, b the candidate
molecule size and c the template size, and where
the sizes are the number of vertices in that
graph or subgraph. To remove bias towards large
molecules, the coefficient was normalised using a
correction based on the differences between the
template and candidate intra-node distances, the
aim being to increase the similarity for cases
where there was a high measure of agreement in
the matched distances from the template graph and
a candidate graph.
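- As a worked illustration, the standard Simpson's
(overlap) coefficient divides the size of the
common part by the size of the smaller of the two
sets, i.e. a / min(b, c) in the notation above.
The Python sketch below computes only this
uncorrected coefficient; the distance-based
correction is not specified on this poster and is
deliberately left out. The function name is an
assumption.

```python
def simpson_coefficient(clique_size, candidate_size, template_size):
    """Uncorrected Simpson's coefficient: a / min(b, c)."""
    smaller = min(candidate_size, template_size)
    if smaller <= 0:
        raise ValueError("graph sizes must be positive")
    return clique_size / smaller


# A 3-node clique against a 3-node template always scores 1.0,
# whatever the size of the candidate (here 7 nodes).
print(simpson_coefficient(3, 7, 3))

# The same 3-node clique against a 6-node template scores 0.5.
print(simpson_coefficient(3, 7, 6))
```

With a 3-node template and a minimum clique size of
3, every ranked candidate has the same uncorrected
coefficient, which is why a correction (or tie
handling) is needed to separate them.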
Figure 3
- Theory
- CLIP is based on mapping the 3D arrangement of
pharmacophore features, typically donors and
acceptors, in a target molecule against either:
- corresponding donors and acceptors in other
database structures, this being an example of 3D
similarity searching, or
- complementary acceptors and donors in a protein
binding site, which we will refer to as
complementary searching.
- In both cases, the mapping is generated using a
3D maximum common subgraph (MCS) isomorphism
algorithm, specifically the clique-detection
algorithm of Bron and Kerbosch, which has been
used, both by us and by other workers, in several
previous studies. This algorithm was chosen for
two reasons: it has been shown to be both
effective and efficient in operation (Brint and
Willett; Gardiner, Artymiuk and Willett), and it
is also fairly easy to implement, in contrast to
several of the other algorithms for MCS detection
that have been described in the graph-theoretic
literature.
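- The clique-based matching described above can be
sketched in a few lines of Python. This is an
illustration of the general technique, not CLIP's
C implementation: the feature representation
(type, coordinates), the 0.5 distance tolerance,
the identity compatibility rule and the toy
coordinates are all assumptions, and the basic
Bron-Kerbosch recursion is shown without
pivoting.

```python
from itertools import combinations
from math import dist


def correspondence_graph(template, candidate, compatible, tol=0.5):
    """Nodes are (i, j) pairs of compatible features; an edge joins two pairs
    when the template and candidate inter-feature distances agree within tol."""
    nodes = [(i, j)
             for i, (t_type, _) in enumerate(template)
             for j, (c_type, _) in enumerate(candidate)
             if compatible(t_type, c_type)]
    adj = {node: set() for node in nodes}
    for (i1, j1), (i2, j2) in combinations(nodes, 2):
        if i1 == i2 or j1 == j2:
            continue  # a feature may be matched only once
        d_template = dist(template[i1][1], template[i2][1])
        d_candidate = dist(candidate[j1][1], candidate[j2][1])
        if abs(d_template - d_candidate) <= tol:
            adj[(i1, j1)].add((i2, j2))
            adj[(i2, j2)].add((i1, j1))
    return adj


def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield every maximal clique of the graph (basic version, no pivoting)."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}


# Toy data: a donor/acceptor/acceptor triangle, and a candidate containing
# the same triangle translated by 10 units plus one distant extra donor.
template = [("D", (0, 0, 0)), ("A", (3, 0, 0)), ("A", (0, 4, 0))]
candidate = [("D", (10, 0, 0)), ("A", (13, 0, 0)), ("A", (10, 4, 0)),
             ("D", (20, 20, 20))]

adj = correspondence_graph(template, candidate, lambda a, b: a == b)
cliques = list(bron_kerbosch(adj))
largest = max(cliques, key=len)
print(sorted(largest))
```

The largest clique pairs each template feature with
its translated counterpart in the candidate. A
minimum clique size such as MINCLQ 3 would discard
the smaller maximal cliques found along the way.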
- Test data
- A subset (5000) of the in-house HIV protease
database (SMILES) with activities was supplied.
This subset (the candidates) contained a total of
39 actives, which were marked by renaming from
xyz-0000 to xyz- to facilitate their
identification by scripts processing result
files. The SMILES were converted to 3D MOL2
representations using CONCORD with default
parameters; 19 molecules failed the conversion,
none of them active. The resulting MOL2 file was
then passed to the Python preprocessor, CAP.py,
giving two files, both containing information on
likely H-bond formers and one containing
additional information on aromatic and
hydrophobic moieties. The results described here
were all obtained using the H-bond-only file, as
aromatic and hydrophobic interactions in aspartyl
protease binding sites were observed to be
long-range and non-directional.
- Templates were constructed from 3D structures
with bound inhibitors and also from each of the
actives. These were used for similarity searches
against the whole subset.
Figure 4
- Runtimes
- With regard to timings, GOLD processed between
1.25 and 3.77 structures per hour, depending on
processor speed and machine load. Assuming two
structures per hour, the total GOLD runtime was
around 100 CPU days. CLIP will rank about
250,000 structures per hour for a 3-node
template, and about 150,000 for a 6-node one;
for the dataset in question, CLIP took just under
two minutes for the 6-node template and 72
seconds for the smaller one (both well within the
design criterion of one million compounds in an
overnight run). The runtimes for UNITY 2D
similarity searching are comparable to those for
CLIP (about 3 minutes per search), but the modus
operandi makes it difficult to time UNITY
searches accurately.
- It is difficult to compare CLIP against the SYBYL
3D searches: there seems to be little difference
between the two approaches in terms of
effectiveness; there is, however, a substantial
difference in terms of efficiency. Though
difficult to time because of the way it operates,
SYBYL takes around 4-5 minutes to search 4755
compounds. CLIP is considerably faster, taking
only 72 seconds for the same search. However, it
is when taking into account the combinatorial
problem of matching a larger template that the
real advantage is seen. To search for all 3-point
matches for a 6-entity template would take SYBYL
approximately 80 to 100 minutes, whereas CLIP
takes around 150 seconds (2.5 minutes).
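- The throughput figures quoted above can be
cross-checked with some simple arithmetic, shown
here as a short Python sketch. It assumes that the
timed CLIP runs covered the 4981 structures that
survived 3D conversion and takes 120 seconds as an
upper bound for "just under two minutes"; both are
assumptions, as are the variable names.

```python
N_RANKED = 4981            # assumed: 5000 structures minus 19 failed conversions
GOLD_RATE_PER_HOUR = 2.0   # the rate assumed in the text


def structures_per_hour(n_structures, seconds):
    """Convert one timed run into a throughput in structures per hour."""
    return n_structures * 3600.0 / seconds


# GOLD: total runtime in CPU days at two structures per hour.
gold_cpu_days = N_RANKED / GOLD_RATE_PER_HOUR / 24.0

# CLIP: throughput implied by the two timed runs.
clip_3node_rate = structures_per_hour(N_RANKED, 72)
clip_6node_rate = structures_per_hour(N_RANKED, 120)   # upper bound on the time

# Design criterion: hours needed for one million compounds at the slower rate.
hours_per_million = 1_000_000 / clip_6node_rate

# SYBYL (80 to 100 minutes) against CLIP (150 seconds) for the 6-entity template.
speedup_low = 80 * 60 / 150
speedup_high = 100 * 60 / 150

print(f"GOLD: about {gold_cpu_days:.0f} CPU days")
print(f"CLIP: {clip_3node_rate:,.0f} (3-node) and {clip_6node_rate:,.0f} (6-node) per hour")
print(f"One million compounds: about {hours_per_million:.1f} hours")
print(f"CLIP vs SYBYL: {speedup_low:.0f}x to {speedup_high:.0f}x faster")
```

Under these assumptions the numbers agree with the
rounded figures in the text: roughly 100 CPU days
for GOLD, and about 250,000 and 150,000 structures
per hour for CLIP.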
- Conclusions
- CLIP is capable of both similarity and
complementary matches. In most cases, when doing
a complementary match with a binding site, the
sought-for positions of the entities are those of
the bound ligand, so the problem reduces to one of
taking their positions and inverting the
donor/acceptor status, and is thus equivalent to a
similarity search in 3D space. The program can,
however, additionally be used when just the
protein structure is available without a bound
ligand.
- CLIP proved comparable in retrieval performance
and speed with the fingerprint search, and
outperformed the docking search (which is, after
all, designed for more exhaustive exploration of
a much smaller number of ligands) in both
respects. CLIP is capable of ranking between
150k and 250k structures per hour and thus
provides a fast 3D alternative to traditional 2D
screening methods.
- Implementation
- Written entirely in C, CLIP employs a
user-supplied file of rules to determine whether
or not two nodes are compatible and a match has
been made. The current implementation supports 8
types of node (donor, acceptor, donor-acceptor,
electronegative, electropositive, ambivalent,
hydrophobic and aromatic), so there are 8 × 8
possible matches. In addition, the user specifies
which one of four match modes (rule sets) is to be
used:
- SIMPLE: e.g. donors match with donors and
donor-acceptors
- IDENTITY: e.g. donors match only with donors
- COMPLEMENTARY: e.g. donors match with acceptors
and donor-acceptors
- FUZZY: anything else the user might wish
- The four match modes in CLIP are all equally
fast. They have been implemented systematically
rather than specifically: CLIP has not been
programmed with rules relating DONORs, ACCEPTORs,
etc., but can only apply the following predicate:
- if (entity1) is compatible with (entity2) then
  return (result)
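- A table-driven predicate of this kind might look
like the following Python sketch. It is not CLIP's
rule-file format, which is not described here:
the names RULE_SETS and compatible are
hypothetical, and only the donor examples are
taken from the list of match modes; the acceptor
and donor-acceptor entries are assumed by
symmetry. FUZZY is omitted because its rules are
whatever the user supplies.

```python
from itertools import product

NODE_TYPES = ("donor", "acceptor", "donor-acceptor", "electronegative",
              "electropositive", "ambivalent", "hydrophobic", "aromatic")

# Every ordered pair of node types that a rule file could mention (8 x 8).
ALL_PAIRS = list(product(NODE_TYPES, repeat=2))

IDENTITY = {(t, t) for t in NODE_TYPES}

# Assumed rule sets: each is simply the set of compatible ordered pairs.
RULE_SETS = {
    "IDENTITY": IDENTITY,
    "SIMPLE": IDENTITY | {
        ("donor", "donor-acceptor"), ("donor-acceptor", "donor"),
        ("acceptor", "donor-acceptor"), ("donor-acceptor", "acceptor"),
    },
    "COMPLEMENTARY": {
        ("donor", "acceptor"), ("acceptor", "donor"),
        ("donor", "donor-acceptor"), ("donor-acceptor", "donor"),
        ("acceptor", "donor-acceptor"), ("donor-acceptor", "acceptor"),
        ("donor-acceptor", "donor-acceptor"),
    },
}


def compatible(mode, entity1, entity2):
    """The single predicate: look the pair up in the active rule set."""
    return (entity1, entity2) in RULE_SETS[mode]


print(len(ALL_PAIRS))
print(compatible("SIMPLE", "donor", "donor-acceptor"))
print(compatible("COMPLEMENTARY", "donor", "donor"))
```

Because the matcher only performs a table lookup, the
cost of the test does not depend on which rule set
is loaded, which is consistent with the match modes
being equally fast.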
- CLIP has three levels of configuration:
hard-coded defaults, configuration files and
command-line parameters for running in script
mode. For large datasets, writing to disk takes
place at user-specified intervals (CHUNKSIZE).
Results are summarised in the main output file,
and details of the cliques (the bulk of the
output) are written to a separate file that is
optionally gzip-compressed using the zlib library
routines.
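- The layered configuration and chunked, optionally
compressed output can be sketched in Python as
follows. This is an illustration under stated
assumptions, not CLIP's C code: the precedence
order (command line over configuration file over
defaults), the default values, the COMPRESS
setting and the function names are all assumed;
only MINCLQ and CHUNKSIZE are parameter names
mentioned on this poster.

```python
import gzip
import os
import tempfile
from collections import ChainMap

# Assumed hard-coded defaults (values chosen for illustration only).
DEFAULTS = {"MINCLQ": 3, "CHUNKSIZE": 1000, "COMPRESS": True}


def resolve_settings(config_file=None, command_line=None):
    """Look settings up in command-line values first, then the config file,
    then the hard-coded defaults."""
    return ChainMap(command_line or {}, config_file or {}, DEFAULTS)


def write_clique_details(records, path, settings):
    """Write records in chunks of CHUNKSIZE lines; return the number of writes."""
    opener = gzip.open if settings["COMPRESS"] else open
    chunk, flushes = [], 0
    with opener(path, "wt", encoding="utf-8") as out:
        for record in records:
            chunk.append(record)
            if len(chunk) >= settings["CHUNKSIZE"]:
                out.write("\n".join(chunk) + "\n")
                chunk.clear()
                flushes += 1
        if chunk:  # write whatever is left after the last full chunk
            out.write("\n".join(chunk) + "\n")
            flushes += 1
    return flushes


# Usage: the command-line CHUNKSIZE overrides the config-file value.
records = [f"clique {i}" for i in range(5)]
settings = resolve_settings(config_file={"CHUNKSIZE": 500},
                            command_line={"CHUNKSIZE": 2})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cliques.txt.gz")
    flushes = write_clique_details(records, path, settings)
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        lines_read = fh.read().splitlines()
print(settings["CHUNKSIZE"], flushes, len(lines_read))
```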
- References
- Brint, A.T., Willett, P. "Algorithms for the
identification of three-dimensional maximal
common substructures." JCICS 27, 1987, 152-158
- Bron, C., Kerbosch, J. "Finding all cliques of
an undirected graph." Communications of the ACM
16, 1973, 575-577
- Gardiner, E.J., Artymiuk, P.J., Willett, P.
"Clique-detection algorithms for matching
three-dimensional molecular structures." JMGM 15,
1998, 245-253
- Jones, G., Willett, P., Glen, R.C., Leach, A.R.,
Taylor, R. "Development and validation of a
genetic algorithm for flexible docking." JMB
267, 1997, 727-748
- Pepperell, C.A., Poirrette, A.R., Willett, P.,
Taylor, R. "Development of an atom-mapping
procedure for similarity searching in databases
of three-dimensional chemical structures."
Pestic. Sci. 33, 1991, 97-111
1 Department of Information Studies, University
of Sheffield, Western Bank, Sheffield, S10 2TN.
2 Pfizer Global Research and Development, Ann
Arbor, MI 48105, USA.
Acknowledgements: This work
was funded by Parke-Davis and Pfizer.
Computational facilities were provided by the
BBSRC.
Figure 1: Template created from 1hvi showing the 9
nodes of the inhibitor A77003 and their
interaction nodes in HIV protease.