Title: Molecular Diversity - A Shell Game?
1 Molecular Diversity - A Shell Game? Experiments in Measuring Molecular Diversity
C. John Blankley
Daylight User Group Meeting MUG 97, Laguna Beach, CA, February 25-28, 1997
Parke-Davis Pharmaceutical Research, Division of Warner Lambert Company
2800 Plymouth Road, Ann Arbor, MI 48105
2 Basic Concepts
- What do we mean by molecular diversity?
- Structural Diversity
  - templates / scaffolds / backbones
  - functional groups / fragments
  - bridges / bioisosteres
  - aromatic / aliphatic
  - geometry (shape, chirality, connectivity, spatial disposition)
- Property Diversity
  - lipophilicity
  - acid/base
  - H-bonding
  - dipolarity
  - charge
  - size
3 Basic Concepts (cont.)
- Parameters/metrics/descriptors
- continuous, discrete, categorical
- Structural Descriptors
  - topological indices
  - molecular fingerprints
  - atom / group / fragment counts
  - molecular dimensions (volume, area, moments)
  - distances between key groups (atom pairs, pharmacophores)
- Property Descriptors
  - log P
  - pKa
  - molecular orbital indices
  - charge
  - spectroscopic data
  - molecular fields
- composite descriptors (principal properties)
- similarity/dissimilarity metrics
4 Issues in diversity
- basis for comparison
- perspective
- macro or micro
- expansionary or inclusionary: expanding space, filling holes, increasing density
- biologically relevant vs. chemical
- correlative
- how much diversity is necessary, possible, desirable (random, bias)
- concordance between quantitative and qualitative notions
- tailor to information available and purpose required
5 Types of structural diversity
- Global or macro diversity: implies neither consistent nor significant similar features
- Local or micro diversity (variety): implies a consistent common feature(s)
- template or scaffold
- common functional group
6 Small (< ca. 500) datasets of practical interest
- Building block datasets (functional group based)
- Combinatorial arrays (template based)
- FAQs
- select a diverse subset(s) for screening
- is one dataset more diverse than another
- what subset will represent the diversity of a library
- what compounds will increase/extend the diversity of an existing set
7 Other small datasets
- SAR datasets
- defined by potency/selectivity
- defined by activity for an enzyme/receptor family or subtype
- Benchmark datasets
- miscellaneous drugs
- miscellaneous chemical compounds
- 20 natural amino acids
- 400 natural dipeptides
8 Questions of a structural diversity measure
- How does it perform for different types of diversity? (suitability, sensitivity)
- How does it accord with measures of property diversity?
- Is there a saturation effect?
- How does it behave on partition or combination of datasets?
- How can it be validated?
- How does it accord with chemists' visual perceptions?
- Can it capture unperceived aspects of diversity?
9 Questions for a given dataset
- Which class?
- Extent of common feature in dataset
- Variation around common feature
- template variation
- appendage variation
- Outliers and their influence
10 Possibilities for quantitation
- Statistics on bit counts
- univariate measures
- comparisons to mean e.g., modal
- Statistics on fragment counts
- Statistics on dissimilarities
- all or partial pairwise
- comparison to mean e.g., modal, centroid
- Parametric on topological indices
11 Some proposed database diversity measures
- mean pairwise dissimilarity (Willett et al.)
- self-similarity (Tripos) (mean similarity to 1st nearest neighbor)
- maximum bits set (E. Martin et al.) (= union bit set, or modal fingerprint at t = 0)
- diversity density (E. Martin et al.) (bits per molecular mass)
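As a rough illustration of the four measures listed above, here is a minimal Python sketch. It assumes fingerprints are held as sets of on-bit positions and that similarity is Tanimoto. The function names, the toy data, and the reading of "bits per molecular mass" as a per-compound mean are illustrative assumptions, not the cited authors' implementations.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_pairwise_dissimilarity(fps):
    """Mean of (1 - Tanimoto) over all distinct pairs."""
    pairs = list(combinations(fps, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


def self_similarity(fps):
    """Mean Tanimoto similarity of each fingerprint to its first nearest neighbor."""
    nearest = [
        max(tanimoto(a, b) for j, b in enumerate(fps) if j != i)
        for i, a in enumerate(fps)
    ]
    return sum(nearest) / len(nearest)


def union_bits(fps):
    """Number of bit positions set in at least one fingerprint."""
    return len(frozenset().union(*fps))


def diversity_density(fps, masses):
    """Mean of bits set per unit molecular mass (an assumed reading of the slide)."""
    return sum(len(fp) / m for fp, m in zip(fps, masses)) / len(fps)


# Hypothetical toy data: three fingerprints and made-up molecular masses.
fps = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({1, 6, 7, 8})]
masses = [100.0, 100.0, 200.0]

print(round(mean_pairwise_dissimilarity(fps), 3))
print(round(self_similarity(fps), 3))
print(union_bits(fps))
print(round(diversity_density(fps, masses), 4))
```

Sets of bit positions keep the sketch independent of any fingerprint toolkit; real fingerprints would be converted to their on-bit lists first.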
12 Stigmata
- extracts similarity of bit strings with flexible threshold
- low bits at high stringency: not much in common
- high bits at low stringency: much variety
- plateau bits at intermediate stringencies: significant common element
- similarity to common element: large range signals diversity of dataset
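The threshold behaviour described above can be sketched in Python as follows. This is a minimal sketch, not the Stigmata code: it assumes a modal fingerprint keeps every bit that is set in at least a fraction t of the dataset's fingerprints, with a floor of two compounds at the lowest threshold. The function name and toy data are hypothetical.

```python
import math
from collections import Counter


def modal_fingerprint(fps, t):
    """Bits set in at least a fraction t of the fingerprints.

    Assumption: at very low thresholds a bit must still occur in at
    least two compounds, so bits unique to one compound never count.
    """
    n = len(fps)
    counts = Counter(bit for fp in fps for bit in fp)
    # Small epsilon guards against float round-up in t * n.
    min_count = max(2, math.ceil(t * n - 1e-9))
    return frozenset(bit for bit, c in counts.items() if c >= min_count)


# Hypothetical toy dataset: bit 1 is shared by all, bit 2 by most.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 6, 7}),
    frozenset({1, 8, 9, 10}),
    frozenset({1, 2, 11, 12}),
]

for t in (0.0, 0.5, 1.0):
    modal = modal_fingerprint(fps, t)
    print(t, len(modal), sorted(modal))
```

Sweeping t from low to high stringency shows the bit count falling; a plateau over a range of t would indicate a common element shared by that fraction of the dataset.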
13 Diversity measures and modal fingerprints
- modal fingerprint → degree of similarity
- existing metrics in Stigmata (Daylight fingerprints)
- modp, msim
- alab
- Rminfp, Rmaxfp
- new metrics
- alab_av (regional or partial similarity(?))
- Rfp = Rmaxfp / Rminfp
14 Extend threshold analysis
- t → 0 (2 compounds)
- maximal modal (= non-unique bits; cf. total bits set)
- concept of maximal, median and minimal modal
15 Relative vs. absolute
- Rminfp (Rifp(Rm)) = f(modp, msim)
- 1/Rm = modp + modp/msim - 1
- bits set for modal fp
- bcom = bits(i) x Rm(i)
- Thus
- maxbcom = modal bits at t → 0
- medbcom = modal bits at t = 0.5
- minbcom = modal bits at t = 1.0
- Rtmax = maxbcom / minbcom
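The relations on this slide can be checked numerically. The sketch below assumes that modp is the fraction (not percent) of modal bits present in a compound's fingerprint, that msim is the Tanimoto similarity between the two, and that Rm is the ratio of modal-fingerprint bits to compound bits. Under those assumptions 1/Rm = modp + modp/msim - 1 follows directly from the Tanimoto definition. The function names and toy bit sets are hypothetical.

```python
def modp(modal, fp):
    """Fraction of modal bits that are also set in the compound fingerprint."""
    return len(modal & fp) / len(modal)


def msim(modal, fp):
    """Tanimoto similarity between modal and compound fingerprints."""
    return len(modal & fp) / len(modal | fp)


def rm_from_metrics(modp_value, msim_value):
    """Rm from modp and msim; undefined when the two share no bits (msim = 0)."""
    return 1.0 / (modp_value + modp_value / msim_value - 1.0)


# Hypothetical toy case: 4 modal bits, 6 compound bits, 3 in common.
modal = frozenset({1, 2, 3, 4})
fp = frozenset({1, 2, 3, 5, 6, 7})

rm = rm_from_metrics(modp(modal, fp), msim(modal, fp))
bcom = len(fp) * rm  # bits(i) x Rm(i) recovers the modal bit count

print(rm, bcom)

# Rtmax = maxbcom / minbcom, using the kappa values tabulated in this deck.
print(round(1052 / 43, 2))
```

The point of the identity is that the modal bit count can be recovered from per-compound modp and msim values alone, without access to the modal fingerprint itself.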
16 Other measures derived from bits or similarities
- Average bits
- Mean similarities (msim) at t = 0, 0.5, 1.0
- fraction of pairwise similarities > 0.85 or < 0.50
- mean distance from dataset centroid
- standard deviations or coefficients of variation
- normalize by
- dataset size
- average bits
- molecular mass
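Some of these summary measures are simple to compute once pairwise similarities are available. The following is a minimal sketch under the assumption that fingerprints are sets of on-bit positions compared by Tanimoto similarity. The 0.85 and 0.50 cutoffs are taken from the slide; the function names and toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pairwise_similarities(fps):
    """Tanimoto similarity for every distinct pair."""
    return [tanimoto(a, b) for a, b in combinations(fps, 2)]


def fraction_above(values, cutoff):
    return sum(v > cutoff for v in values) / len(values)


def fraction_below(values, cutoff):
    return sum(v < cutoff for v in values) / len(values)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Hypothetical toy data: two near-duplicates and one outlier.
fps = [
    frozenset(range(1, 10)),    # bits 1..9
    frozenset(range(1, 11)),    # bits 1..10
    frozenset({1, 20, 21}),
]

sims = pairwise_similarities(fps)
bit_counts = [len(fp) for fp in fps]

print(fraction_above(sims, 0.85), fraction_below(sims, 0.50))
print(mean(bit_counts), coefficient_of_variation(bit_counts))
```

The coefficient of variation is one way to compare spread across datasets whose average bit counts differ, which is the motivation for the normalizations listed above.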
17
dataset          type1  type2     N   av_mwt  source
BIOLOGICAL
kappa            act    p        59   409.46  CIPSLINE
nonkappa         act    n        97   441.95  CIPSLINE
pipopiate        act    p        48   382.84
opiate           act    p        32   383.37
piperidine       act    SS       18   374.76
peptide_op       act    p        24   612.41
D2_ag            act    p        33   249.94  Seeman et al.
D2_antag         act    n        25   373.97  Seeman et al.
renin_hisleu     act    SS      112   752.66  CIPSLINE
ci976            sar    p        74   410.47  Roth et al.
acathet          sar    p        41   453.82  White et al.
BUILDING BLOCK
bbd_phnco        bbd    T       129   185.51  ACD
bbd_arncs        bbd    SS      202   216.40  ACD
aa20             bbd    T        20   136.92
bbd_allaa        bbd    T       651   218.63  ACD
COMBINATORIAL
dipeptides       comb   T       400   255.83
dhydantoin       comb   T        40   242.47  deWitt et al.
benzodiaz1       comb   SS       40   312.37  deWitt et al.
benzodiaz2       comb   SS      160   382.43  Ellman et al.
REFERENCE
newtopdrugs_95   misc   n        56   395.35  Med. Ad. News
introdrugs_9295  misc   n       141   403.51  Ann. Repts. Med. Chem.
ACDrandom2       misc   n        51   181.29  ACD
ACDrandom3       misc   n        51   221.37  ACD
18
                 modal bits                     compound bits
dataset            max  median   min    Rtmax    min   max    mean    Rfp
BIOLOGICAL
kappa             1052     169    43    24.47    163   734  249.30   4.50
nonkappa          1467     164     7   209.33     88   689  305.42   7.83
pipopiate         1219     276    47    25.94    138   689  378.63   4.99
opiate             451     404    88    11.69    164   689  439.71   4.20
piperidine         570     169    75     7.60    138   368  249.61   2.67
peptide_op         425     170    40    10.63     88   329  206.96   3.74
D2_ag              606     145    43    14.09     80   390  203.82   4.87
D2_antag           841     115    33    25.48    165   327  235.01   1.98
renin_hisleu       998     278   179     5.58    249   466  337.04   1.87
ci976              371     138    68     5.46    117   222  152.50   1.90
acathet            695     195    93     7.47    167   303  245.80   1.81
BUILDING BLOCK
bbd_phnco          433      63    61     7.10     61   136   92.48   2.23
bbd_arncs          883      64    45    19.64     61   286  107.37   4.69
aa20               155      43    34     4.56     34   171   67.00   5.03
bbd_allaa         1785      55    34    52.50     34   429  127.10  12.62
COMBINATORIAL
dipeptides         448      78    53     8.45     53   269  126.53   5.08
dhydantoin         437     129    71     6.15     71   323  176.70   4.55
benzodiaz1         586     220   183     3.20    208   410  284.06   1.97
benzodiaz2         533     294   222     2.40    229   431  308.92   1.88
REFERENCE
newtopdrugs_95    1629      80     4   407.75     52   640  254.85  12.31
introdrugs_9295   1983      84     4   495.75     44   709  265.80  16.11
ACDrandom2         592      19     0    >1000     23   166   73.75   7.22
ACDrandom3        1067      38     0    >1000     23   497  130.08  21.61
19
                 average msim                mean centroid
dataset          t = 0   t = 0.5   t = 1     distance
BIOLOGICAL
kappa             0.23     0.56     0.18       0.34
nonkappa          0.21     0.42     0.03       0.48
pipopiate         0.31     0.50     0.15       0.40
opiate            0.42     0.62     0.24       0.26
piperidine        0.39     0.54     0.32       0.36
peptide_op        0.45     0.68     0.21       0.23
D2_ag             0.32     0.55     0.25       0.35
D2_antag          0.26     0.34     0.15       0.53
renin_hisleu      0.33     0.75     0.54       0.12
ci976             0.40     0.75     0.45       0.14
acathet           0.34     0.64     0.38       0.25
BUILDING BLOCK
bbd_phNCO         0.21     0.68     0.67       0.20
bbd_arNCS         0.12     0.61     0.45       0.28
aa20              0.35     0.70     0.60       0.25
bbd_allaa         0.07     0.41     0.35       0.49
COMBINATORIAL
dipeptides        0.28     0.62     0.48       0.27
dhydantoin        0.40     0.58     0.47       0.28
benzodiaz1        0.48     0.76     0.66       0.12
benzodiaz2        0.58     0.81     0.73       0.07
REFERENCE
newtopdrugs_95    0.15     0.23     0.02       0.68
introdrugs_9295   0.22     0.24     0.04       0.68
ACDrandom2        0.11     0.19     0.00       0.74
ACDrandom3        0.11     0.20     0.00       0.75
20
dataset            mps  ss_nn1    mdd  mfrgs_BCI
BIOLOGICAL
kappa             0.45    0.83   0.62     117.69
nonkappa          0.33    0.77   0.74     110.47
pipopiate         0.40    0.82   0.98     122.52
opiate            0.53    0.86   1.13     133.13
piperidine        0.45    0.72   0.66     229.94
peptide_op        0.57    0.79   0.35     277.92
D2_ag             0.46    0.88   0.80      78.48
D2_antag          0.32    0.71   0.64     195.64
renin_hisleu      0.67    0.90   0.45     112.26
ci976             0.67    0.94   0.37      48.01
acathet           0.55    0.89   0.54     120.10
BUILDING BLOCK
bbd_phnco         0.58    0.88   0.51      23.87
bbd_arncs         0.49    0.89   0.50      32.87
aa20              0.57    0.81   0.48      78.05
bbd_allaa         0.31    0.87   0.57      21.79
COMBINATORIAL
dipeptides        0.51    0.96   0.49       7.77
dhydantoin        0.52    0.92   0.72      56.83
benzodiaz1        0.68    0.94   0.92      59.68
benzodiaz2        0.75    0.98   0.81      16.64
REFERENCE
newtopdrugs_95    0.19    0.48   0.72     238.73
introdrugs_9295   0.19    0.50   0.70     142.89
ACDrandom2        0.16    0.49   0.44     115.45
ACDrandom3        0.15    0.38   0.58     156.53
21 Principal Components (n = 23, k = 18)
EigenValue    9.19   4.35   1.85   1.15   0.44
Percent      51.04  24.14  10.27   6.40   2.44
CumPercent   51.04  75.18  85.45  91.86  94.30
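The Percent and CumPercent rows are consistent with each eigenvalue being divided by k = 18. The short sketch below checks that arithmetic, assuming the analysis was run on the correlation matrix of the 18 measures so that total variance equals k. Small differences from the slide come from the eigenvalues being rounded to two decimals.

```python
from itertools import accumulate

# Eigenvalues as printed on the slide (rounded to two decimals).
eigenvalues = [9.19, 4.35, 1.85, 1.15, 0.44]
k = 18  # number of variables; assumed equal to total variance

percent = [100.0 * ev / k for ev in eigenvalues]
cumulative = list(accumulate(percent))

for ev, p, c in zip(eigenvalues, percent, cumulative):
    print(f"{ev:5.2f}  {p:6.2f}  {c:6.2f}")
```

On this reading, the first four components (eigenvalue above 1) carry a little over 90% of the variance among the 18 measures.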
22 Rotated Factor Pattern (dissimilarities and ln)
variable      factor 1  factor 2  factor 3  factor 4
N               -0.04     -0.12     -0.95     -0.06
av_mwt           0.16      0.12      0.17      0.94
Rfp             -0.84      0.06     -0.21     -0.14
ln_Rtmax        -0.98     -0.01     -0.01     -0.01
mdsim0          -0.77     -0.27     -0.38     -0.10
mdsim5          -0.97      0.04      0.02     -0.08
mdsim1          -0.88      0.15      0.23      0.19
ln_mxbcom       -0.65      0.34     -0.42      0.45
ln_mdbcom        0.56      0.63      0.14      0.50
ln_mnbcom        0.93      0.18     -0.12      0.17
minbits          0.60      0.40      0.15      0.54
maxbits         -0.39      0.77     -0.06      0.39
avbits           0.14      0.81      0.18      0.51
mpds            -0.98      0.02     -0.02     -0.10
msds_nn1        -0.91     -0.08      0.30     -0.02
mcentr_dist     -0.98      0.00      0.04     -0.11
mdd              0.03      0.97      0.10     -0.18
ln_mBCI         -0.45      0.12      0.72      0.41
23 Datasets Plotted vs. First Two Rotated Factors
24 Datasets plotted vs. mean modal dissimilarity (t = 0.5) and average bits
25 msds_nn1
Relative diversity by dataset type (groups along x axis)
[Figure: dataset labels as extracted, top to bottom of plot: ACDrandom3, newtopdrugs, ACDrandom2, introdrugs9295, D2antag, piperidine, nonkappa, peptide_op, aa20, pipopiate, kappa, opiate, bbd_allaa, D2ag, bbd_phnco, acathet, bbd_arncs, renin_hisleu, dhydantoin, ci976, benzodiaz1, dipeptides, benzodiaz2]
26 mean centr_dist
Relative diversity by dataset type (groups along x axis)
[Figure: dataset labels as extracted, top to bottom of plot: ACDrandom3, ACDrandom2, newtopdrugs, introdrugs9295, D2antag, bbd_allaa, nonkappa, pipopiate, piperidine, D2ag, kappa, bbd_arncs, dhydantoin, dipeptides, opiate, acathet, aa20, peptide_op, bbd_phnco, ci976, benzodiaz1, renin_hisleu, benzodiaz2]
27 ln_mxbcom
Relative diversity by dataset type (groups along x axis)
[Figure: dataset labels as extracted, top to bottom of plot: introdrugs9295, bbd_allaa, newtopdrugs, nonkappa, pipopiate, ACDrandom3, kappa, renin_hisleu, bbd_arncs, D2antag, acathet, D2ag, ACDrandom2, benzodiaz1, piperidine, benzodiaz2, opiate, dipeptides, dhydantoin, bbd_phnco, peptide_op, ci976, aa20]
28 Directions
- behavior of metrics, e.g.
- on combining or subsetting datasets
- other similarity functions - same or different?
- calibration of metrics with chemists' perception
- other fingerprints
- BCI, MACCS, Tripos
- 3D information
- use of modal similarities as dataset parameters for correlation or classification
- how to discover the congruence between molecular similarity and biological function
29 Conclusions to date
- Different metrics can capture different aspects of structural diversity
- One metric will not suffice to provide adequate discrimination for all different types of diversity
- Modal fingerprints and similarities may prove to be useful additions to the measurement of diversity
30 Acknowledgments
- Parke-Davis
- Biomolecular Structure and Drug Design
- Christine Humblet
- Daylight CIS, Inc.
- Norah Shemetulskis
- David Weininger
- Jeremy Yang
31 Correlations among diversity measures (n = 23)
32 Correlations (cont.)
33 Datasets plotted vs. mean pairwise dissimilarity and average bits
34 Datasets plotted vs. mean centroid distance and average bits
35 ?