Title: Virtual Screening of Natural Product Databases by
Similarity Searching and Data Fusion
Martin Whittle (1), Peter Willett (1), Werner Klaffke (2)
and Paula van Noort (2)
Introduction
The development of high-throughput screening (HTS) techniques has galvanized the search for biologically active compounds, but the success of these methods can be significantly enhanced if the pool of test compounds has already been enriched by other means. Similarity searching is one method of creating such a pool: a database is virtually screened to pre-select those molecules having the strongest similarity to a chosen target structure [1]. Typically, the target will be a known bioactive molecule, and the method rests on the assumption that the desired activity is most likely to be exhibited by compounds with a similar structure to the target. Natural products comprise a rich source of novel bioactive compounds that include molecules of high complexity. Additionally, since not all of them can be readily synthesized, they represent a unique pool of compounds distinct from collections of artificial materials [2]. The results presented here have been obtained for the Chapman & Hall Dictionary of Natural Products (DNP), which provides a suitably large and comprehensive collection of compounds.
Table: Six similarity coefficients used in this study (columns: Coefficient, Expression)
Figure 2
Figure 3
Recall analysis
By choosing a group of known actives from the database we can examine the effectiveness of retrieval by computing the recall R [3]. Suppose that there are N compounds in a database and that a target set contains A known actives. If a similarity search using one of these actives as the target retrieves a number a of active-group members amongst the top n nearest (most similar) neighbours, then the fraction of the original active set that is retrieved (excluding the target molecule) is the recall, R = a / (A - 1). Figure 1 illustrates the calculation of the recall for a search using the first member in a group of 7 actives as the target structure, where a total of n compounds are recovered, of which 5 (the target itself and 4 near neighbours) are members of the original active group.
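
The recall calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `recall`, the compound identifiers and the toy ranking are all hypothetical, and the sketch assumes that the target is removed from the ranked list before the top n neighbours are taken.

```python
def recall(ranked_ids, active_ids, target_id, n):
    """Fraction of the other actives found among the top-n neighbours of the target."""
    # The A - 1 actives that could be retrieved (the target itself is excluded).
    others = set(active_ids) - {target_id}
    if not others:
        raise ValueError("the active set needs at least one compound besides the target")
    # Assumption: the target is not counted as its own neighbour.
    neighbours = [c for c in ranked_ids if c != target_id][:n]
    a = sum(1 for c in neighbours if c in others)
    return a / len(others)


# Toy data mirroring the Figure 1 description: 7 actives, first one is the target.
actives = ["A1", "A2", "A3", "A4", "A5", "A6", "A7"]
# Hypothetical ranking, most similar first; X* are inactive compounds.
ranking = ["A1", "A3", "X1", "A5", "X2", "A2", "A7", "X3", "A4", "A6"]

r = recall(ranking, actives, target_id="A1", n=6)
print(r)  # 4 of the 6 other actives are in the top 6 neighbours
```

Averaging this value over searches that use each active in turn as the target gives a single figure per representation and coefficient.
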
Conclusions
A full analysis of our results using five different active sets showed that data fusion can yield a worthwhile increase in the retrieval of actives, provided that appropriate individual searches are chosen for combination. Optimal improvement is achieved by using three or, at most, four individual searches. The most consistent enhancements were obtained by using molecular holograms and UNITY substructural fingerprints to represent the structures, combined with similarity searches using the Squared Euclidean similarity coefficient. Our results suggest that this choice should give the best chance of enrichment with an unknown set. The popular Tanimoto and Cosine coefficients rarely appeared among the best fusion combinations. Unusually, the Tanimoto coefficient does not seem to be particularly suited to these datasets.

In the example of Figure 2, the BCI substructural fingerprint representation teamed with the Squared Euclidean similarity coefficient was the most successful in recovering the original actives in a search recovering 1000 compounds in total. The horizontal red line shows the expected recall for random retrieval.
Similarity Searching
To enable the computational comparison of molecules, the structures must first be translated into a condensed machine-readable representation. This usually consists of an ordered list of numbers, each coding some aspect of the structure. Six such representations have been used in this study, as detailed in the table "Six representations used in this study".
Data Fusion
Data fusion is a method for combining results from several different sensors or inputs [4], with the goal of synergistically increasing the value of the available data over that of the uncombined techniques. For example, data from our own senses are frequently combined, as in the use of eyes and ears in crossing the road. In our case we have several methods of assessing similarity through a range of representations and similarity coefficients. Ranked lists from these searches are combined by using a sum rule previously shown to be effective. If r_i denotes the rank of a particular molecule obtained by the i-th method, we generate the sum r_s over all m methods for that molecule, r_s = r_1 + r_2 + ... + r_m, and then rank the combined scores. The resulting fused ranking is again used to generate recall values to monitor the retrieval efficiency.
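
The sum rule can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `fuse_ranks`, the molecule identifiers and the three toy rankings are hypothetical, every search is assumed to rank the same set of molecules, and ties in the rank sum are broken by identifier, a choice the poster does not specify.

```python
def fuse_ranks(rank_lists):
    """Combine m rankings by summing each molecule's ranks (rank 1 = most similar)."""
    molecules = set(rank_lists[0])
    for ranks in rank_lists[1:]:
        if set(ranks) != molecules:
            raise ValueError("every search must rank the same set of molecules")
    # r_s = r_1 + r_2 + ... + r_m for each molecule.
    rank_sums = {mol: sum(ranks[mol] for ranks in rank_lists) for mol in molecules}
    # Smallest rank sum first; ties broken by identifier (an assumption).
    return sorted(rank_sums.items(), key=lambda item: (item[1], item[0]))


# Three hypothetical searches over the same four molecules.
search_a = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}
search_b = {"m1": 3, "m2": 1, "m3": 4, "m4": 2}
search_c = {"m1": 2, "m2": 1, "m3": 4, "m4": 3}

fused = fuse_ranks([search_a, search_b, search_c])
for position, (mol, rank_sum) in enumerate(fused, start=1):
    print(position, mol, rank_sum)
```

The fused list can then be scored with the same recall measure as an individual search.
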
Table: Six representations used in this study
In Figure 1, a = 4 near neighbours are retrieved out of a possible 6.
References
[1] Willett, P.; Barnard, J. M.; Downs, G. M. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996.
[2] Henkel, T.; Brunne, R. M.; Müller, H.; Reichel, F. Angew. Chem. Int. Ed. 1999, 38, 643-647.
[3] Edgar, S. J.; Holliday, J. D.; Willett, P. J. Mol. Graphics Mod. 2000, 18, 343-357.
[4] Ginn, C. M. R.; Willett, P.; Bradshaw, J. Persp. Drug Discov. Design 2000, 20, 1-16.
For the Figure 1 example, the recall is R = a / (A - 1) = 4 / 6, or about 0.67.

The lists of numbers representing the structures are then compared by using one of several similarity coefficients designed to provide a quantitative measure of the structural similarity between a pair of representations. Specifically, we define the following quantities for two number lists A and B of length n, where x_jA and x_jB are the j-th elements of the respective representations. In terms of these, the similarity coefficients used are given in the table of six similarity coefficients.
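
As an illustration of how such coefficients operate on two number lists, the sketch below implements three of the coefficients named on this poster (Tanimoto, Cosine and Squared Euclidean) in their common textbook forms for non-binary vectors. The expressions in the study's own table are not reproduced here and may differ, for example in normalization; the function names and the example vectors are hypothetical.

```python
import math


def _sums(a, b):
    """Return the sums of x_jA * x_jB, x_jA^2 and x_jB^2 over all positions j."""
    if len(a) != len(b):
        raise ValueError("both representations must have the same length")
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_aa = sum(x * x for x in a)
    sum_bb = sum(y * y for y in b)
    return sum_ab, sum_aa, sum_bb


def tanimoto(a, b):
    sum_ab, sum_aa, sum_bb = _sums(a, b)
    denominator = sum_aa + sum_bb - sum_ab
    if denominator == 0:
        raise ValueError("Tanimoto is undefined for two all-zero lists")
    return sum_ab / denominator


def cosine(a, b):
    sum_ab, sum_aa, sum_bb = _sums(a, b)
    if sum_aa == 0 or sum_bb == 0:
        raise ValueError("Cosine is undefined for an all-zero list")
    return sum_ab / math.sqrt(sum_aa * sum_bb)


def squared_euclidean(a, b):
    if len(a) != len(b):
        raise ValueError("both representations must have the same length")
    # A distance: smaller values mean more similar structures.
    return sum((x - y) ** 2 for x, y in zip(a, b))


list_a = [1, 0, 1, 1]
list_b = [1, 1, 0, 1]
print(tanimoto(list_a, list_b), cosine(list_a, list_b), squared_euclidean(list_a, list_b))
```

Note that Tanimoto and Cosine increase with similarity, whereas the squared Euclidean value is a distance, so a ranking based on it would sort in ascending order.
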
Results for the six top-scoring searches of Figure 2 are shown in Figure 3, which plots the fused recall at rank 1000 for all possible combinations of the individual results against the order of fusion, i.e. the number of combined sets. Here a combination of three individual results gives the best recovery. This particular combination involved the BCI, UNITY and Molecular Hologram representations, using the squared Euclidean distance as the similarity coefficient.
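
To give a sense of how many fused results such a plot contains, the following sketch enumerates every combination of six individual searches, grouped by order of fusion. The search labels are placeholders, and including order 1 (the unfused individual results) is an assumption made here for completeness.

```python
from itertools import combinations


def fusion_combinations(search_names):
    """Group all combinations of the given searches by order of fusion."""
    by_order = {}
    for order in range(1, len(search_names) + 1):
        by_order[order] = list(combinations(search_names, order))
    return by_order


# Placeholder labels for the six top-scoring individual searches.
searches = ["s1", "s2", "s3", "s4", "s5", "s6"]
by_order = fusion_combinations(searches)
counts = {order: len(combos) for order, combos in by_order.items()}
print(counts)
```

Each combination would be fused with the sum rule and scored by its recall at rank 1000, giving one point per combination at the corresponding order of fusion.
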
Figure 1
1. The Krebs Institute for Biomolecular Research, Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK
2. Unilever Research Vlaardingen, Olivier van Noortlaan 120, 3133 AT Vlaardingen, The Netherlands
It is normal to average the recall over all members of the active group, each used in turn as the target. Typical results for a single representation and similarity coefficient are shown in Figure 2 for a group of 50 target compounds chosen from the DNP with a spread of pharmacological activities.