Title: Outline
1 Outline
- Motivation: Imprecise Data in Bioinformatics
- Drug Discovery and High Throughput Screening
- Finding Clusters in Chemistry Space
- Synergies: Clever Algorithms and Domain Knowledge
- Mining Molecular Fragments
- What the Chemist Really Wants: Imprecision (Fuzzy Atoms and Flexible Chains)
- Some Experimental Results (NCI HIV screens)
- Conclusions/Outlook: Learning from Past Experience
- Bits & Pieces of Evidence
- Storing and Retrieving Knowledge
2 Drug Discovery
- Classic
- Expert knowledge available
- Metabolic pathway information
- Binding site information
- After a specific target is identified
- Generate an assay to identify the desirable effect
- Assemble a test (focused) library of compounds
- First phase: High Throughput Screening (HTS). Often hundreds of thousands of molecules are tested in a highly automated fashion
- After clever data analysis
- Second phase: test a few hundred compounds more carefully (IC50)
3 Drug Discovery
- And then (in the remaining 8-9 years)
- Animal testing
- Several rounds of clinical testing
- Approval procedures
- And most often: late-stage failure
- Go back to start, do not collect 1,000,000,000
- Lead rescue: eliminate side effects (ADME/Tox, cardiac effects, sometimes also avoid patents) → avoid bad areas in drug space (lead hopping)
4 High Throughput Screening
- Rapidly screen hundreds of thousands of candidates.
- Problems
- Often thousands of actives
- Data extremely noisy (up to 50% false positives, unknown false negatives!)
- Positives are almost always active for different reasons → separate, diverse clusters!
- Goal: find common properties among similar subsets of active molecules (help the user understand activity patterns!)
5 What is Similarity?
(Figure: Tropacocaine, 1518-12246, Local Anesthetic)
6 Types of Similarity
- Structural similarity
- Same basic layout of the overall graph
- or at least existence of a common subgraph
- Geometrical similarity
- Roughly the same shape in 3D, independent of exact atom matches
- Instead of simple shape, other properties (surface charge) can also be compared
- Global properties
- Molecular weight
- Number of hydrogen donors/acceptors
- And many others
7 Motivation
- Goal: find (and describe!) structural groups of molecules that share activity.
- For a few molecules, manual inspection is feasible.
- For more molecules, automated methods are needed
12 Molecular Fragment Miner (MoFa)
Ch. Borgelt, M. R. Berthold, IEEE Data Mining, 2002.
- Goal
- Find fragments that are discriminative for a class of interest (high activity, good synthesis result, ...)
- Appear often in positives: freq(high activity) > threshold
- Appear rarely in negatives: freq(low activity) < delta
- MoFa
- Based on Market Basket Analysis (Eclat algorithm)
- Grow fragment candidates from scratch, atom by atom
- Only report significant and unique fragments
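The two frequency conditions in the goal can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each molecule has already been reduced to the set of fragment labels it contains; the function names `frequency` and `is_discriminative` are hypothetical and not part of MoFa itself.

```python
from typing import FrozenSet, List


def frequency(fragment: str, molecules: List[FrozenSet[str]]) -> float:
    """Fraction of the molecules that contain the fragment."""
    if not molecules:
        return 0.0
    return sum(fragment in m for m in molecules) / len(molecules)


def is_discriminative(
    fragment: str,
    actives: List[FrozenSet[str]],
    inactives: List[FrozenSet[str]],
    threshold: float,
    delta: float,
) -> bool:
    """freq(high activity) > threshold and freq(low activity) < delta."""
    return (
        frequency(fragment, actives) > threshold
        and frequency(fragment, inactives) < delta
    )


# Toy data (made up): molecules given as sets of fragment labels.
actives = [frozenset({"S-C-N", "C-S"}), frozenset({"S-C-N"}), frozenset({"C-S"})]
inactives = [frozenset({"C-S"}), frozenset({"C-S"}), frozenset(), frozenset()]

print(is_discriminative("S-C-N", actives, inactives, threshold=0.5, delta=0.1))
print(is_discriminative("C-S", actives, inactives, threshold=0.5, delta=0.1))
```

In this toy data, S-C-N is frequent in the actives and absent from the inactives, while C-S is frequent in both classes and is therefore rejected.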
13 Example
- 6 example molecules
- Find all unique fragments that occur in ≥ 4 molecules
(Figure: the six example molecules, made up of C, N, O and S atoms)
14-23 Examples (a)-(f)
(Figure sequence: the search tree is grown step by step over the example molecules (a)-(f). It starts from the single atoms C, N, O and S, extends them to two-atom fragments such as C-C, C-S, N-S and S-N, and then extends the fragments that start at S to three-atom fragments such as S-C-C. The nodes are annotated with their supports.)
24 Duplicate Fragments!
- How do Apriori, Eclat & Co. avoid duplicate itemsets? → Prefix tree
(Figure: prefix tree over the items a, b, c)
BUT: a prefix tree requires a global order defined on the items
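A minimal Python sketch of the prefix-tree idea for plain itemsets: a node may only be extended by items that come later in the global order, so every itemset is generated exactly once. The function name `enumerate_itemsets` and the choice of alphabetical order as the global order are assumptions made for this illustration.

```python
from typing import List, Tuple


def enumerate_itemsets(items: List[str]) -> List[Tuple[str, ...]]:
    """Enumerate every non-empty itemset exactly once.

    A node of the prefix tree is only extended by items that follow its
    last item in the global order, so {a, b} is reached via a -> b and
    never via b -> a.
    """
    ordered = sorted(items)  # the global order (here: alphabetical)
    result: List[Tuple[str, ...]] = []

    def grow(prefix: Tuple[str, ...], start: int) -> None:
        for i in range(start, len(ordered)):
            node = prefix + (ordered[i],)
            result.append(node)
            grow(node, i + 1)  # only later items may extend this node

    grow((), 0)
    return result


for itemset in enumerate_itemsets(["a", "b", "c"]):
    print(itemset)
```

For the three items a, b, c this yields the 7 non-empty subsets without repetition. Molecules have no such global order on their atoms, which is the problem the slide points out.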
25 Local Order on Atoms/Bonds
- A global order on atoms/bonds is not possible
- Use a local order on atoms: C < N < O < S
- In case of the same atom type, use a secondary order based on the bond: single (-) < aromatic < double (=) < triple
- Higher (or equal) extensions are only allowed on the last atom extended, and
- all extensions are allowed on atoms inserted after the last atom extended.
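One possible reading of these rules, as a Python sketch. It assumes that the atoms of a fragment are numbered in the order in which they were inserted, and that the last extension is remembered as (atom extended, bond, new atom). The bond symbols `:` for aromatic and `#` for triple, and the names `extension_key` and `extension_allowed`, are assumptions for this illustration and not MoFa's actual code.

```python
from typing import Tuple

# Local order on atoms, with a secondary order on bond types.
ATOM_RANK = {"C": 0, "N": 1, "O": 2, "S": 3}
BOND_RANK = {"-": 0, ":": 1, "=": 2, "#": 3}  # single, aromatic, double, triple


def extension_key(bond: str, atom: str) -> Tuple[int, int]:
    """Sort key: atom type first, bond type as the tie-breaker."""
    return (ATOM_RANK[atom], BOND_RANK[bond])


def extension_allowed(
    anchor: int, bond: str, atom: str,
    last_anchor: int, last_bond: str, last_atom: str,
) -> bool:
    """Decide whether adding `atom` via `bond` at atom index `anchor` is allowed.

    `last_anchor`, `last_bond` and `last_atom` describe the previous extension.
    """
    if anchor > last_anchor:
        return True  # atoms inserted after the last atom extended: anything goes
    if anchor == last_anchor:
        # same atom as last time: only higher (or equal) extensions
        return extension_key(bond, atom) >= extension_key(last_bond, last_atom)
    return False  # atoms before the last atom extended are closed


# Previous step: atom 0 was extended by a single-bonded N.
print(extension_allowed(0, "-", "C", 0, "-", "N"))  # C ranks below N
print(extension_allowed(0, "=", "N", 0, "-", "N"))  # same atom type, higher bond
print(extension_allowed(1, "-", "C", 0, "-", "N"))  # later atom
```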
26-28 Examples (a)-(f)
(Figure sequence: the search tree after applying the local order. Of the 13 three-atom candidates, 7 remain, e.g. S-C-C, S-N N and S-N O; they are then annotated with their supports.)
29 Support-Based Pruning
- Support of a fragment A: supp(A) = frequency of appearance in molecules
- Monotone condition: support declines with the size of the fragment. Fragment A is contained in fragment B ⇒ supp(A) ≥ supp(B)
- If supp(node) in a branch is below the threshold, then all child nodes will also be below the threshold.
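The pruning rule can be sketched in Python for plain itemsets in the style of Eclat, using sets of transaction ids. This is a simplification: MoFa applies the same idea to molecular fragments rather than itemsets. The function name `frequent_itemsets` and the toy data are assumptions for this illustration.

```python
from typing import Dict, List, Set, Tuple


def frequent_itemsets(
    transactions: List[Set[str]], min_support: int
) -> Dict[Tuple[str, ...], int]:
    """Depth-first search over itemsets with support-based pruning."""
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets: Dict[str, Set[int]] = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    items = sorted(tidsets)
    result: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], tids: Set[int], start: int) -> None:
        for i in range(start, len(items)):
            item_tids = tidsets[items[i]]
            new_tids = item_tids if not prefix else tids & item_tids
            if len(new_tids) < min_support:
                continue  # prune: every superset has support <= this one
            node = prefix + (items[i],)
            result[node] = len(new_tids)
            grow(node, new_tids, i + 1)

    grow((), set(), 0)
    return result


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, support in frequent_itemsets(db, min_support=3).items():
    print(itemset, support)
```

Because the support of an extension can never exceed the support of its parent, the branch below {a, b, c} is cut as soon as its support falls below the threshold.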
30-31 Examples (a)-(f)
(Figure sequence: the search tree with support-based pruning for a threshold of 4. Of the 7 three-atom candidates, whose supports range from 2 to 4, only the three with support 4 remain.)
32 Resulting Fragments for supp(A) ≥ 4
(Figure: example molecules (a)-(f) with the resulting fragments S-C O, S-C N, C-S and C-S-N; supports shown: 4, 4, 6, 4)
33 Some fragments which are not reported (due to redundant support)
(Figure: the unreported fragments S N, C and S, with supports 4, 6 and 6, shown next to the resulting fragments for supp(A) ≥ 4: S-C O, S-C N, C-S and C-S-N)
34 Discriminative Fragments
- Just finding frequent fragments is usually not interesting
- Find fragments that are
- frequent in one class of molecules
- and infrequent in the remainder of the molecules
- Discriminative fragments summarize shared properties.
- The number of actives and inactives (and the ratio) that contain a fragment indicates relevance.
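35 Example: NCI HIV dataset, 45000 compounds (400 active), threshold 15%
(Figure: a discriminative fragment, frequency in actives vs. inactives)
15.08% vs. 0.02%
36 A few more fragments
(Figure: six further fragments with their frequencies)
5.23% vs. 0.05%
5.23% vs. 0.08%
4.92% vs. 0.07%
9.85% vs. 0.07%
9.85% vs. 0.0%
10.15% vs. 0.04%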
37 Problems
- However, some fragments puzzled our chemists
38 Small Differences
39 Chemist's View
- A strict graph-based view of molecules is too restrictive
- Some tolerances do not affect function, e.g.
- In a specific context, some atoms may be of a different type (e.g. N/C equivalence in aromatic rings, all halogens are equivalent, ...)
- The exact length of a chain connecting two rigid substructures does not matter (e.g. chains of CH2 can be 2-4 carbons long, ...)
40 Fuzzy Matches
H. Hofer, Ch. Borgelt, M. R. Berthold, IDA, Berlin, 2003
- Specifying wildcards via equivalence classes, here:
- Meta atoms: certain atoms can be matched
- Maximum number of fuzzy atoms allowed
- Equivalence classes can overlap (e.g. O,C and C,N)
- Fuzzy chains: model flexible chains explicitly
- Specify min/max length of chains
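A minimal Python sketch of both kinds of tolerance. To keep it short, it assumes that fragments and molecules are plain sequences of element symbols rather than graphs; the names `matches_with_fuzzy_atoms` and `chain_pattern` are hypothetical and not taken from the cited work.

```python
import re
from typing import List, Pattern, Sequence, Set


def matches_with_fuzzy_atoms(
    pattern: Sequence[str],
    atoms: Sequence[str],
    classes: List[Set[str]],
    max_fuzzy: int,
) -> bool:
    """Position-wise match that tolerates a limited number of fuzzy atoms.

    Two different atoms match if some equivalence class contains both;
    each such match uses up one of the `max_fuzzy` allowed fuzzy atoms.
    """
    if len(pattern) != len(atoms):
        return False
    fuzzy_used = 0
    for wanted, found in zip(pattern, atoms):
        if wanted == found:
            continue
        if not any(wanted in c and found in c for c in classes):
            return False
        fuzzy_used += 1
        if fuzzy_used > max_fuzzy:
            return False
    return True


def chain_pattern(left: str, right: str, min_len: int, max_len: int) -> Pattern[str]:
    """Regex for `left`, then min_len..max_len carbon atoms, then `right`."""
    return re.compile(
        f"{re.escape(left)}C{{{min_len},{max_len}}}{re.escape(right)}"
    )


classes = [{"O", "C"}, {"C", "N"}]  # overlapping classes, as on the slide
print(matches_with_fuzzy_atoms(["S", "C", "N"], ["S", "N", "N"], classes, max_fuzzy=1))
print(bool(chain_pattern("N", "O", 2, 4).fullmatch("NCCCO")))
```

Overlapping classes are not merged in this sketch: O matches C and C matches N, but O does not match N.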
41 Fuzzy Atoms and Chains
42 MoFa - Summary
- Search based on parallel embeddings and a large-scale data mining algorithm (Apriori/Eclat)
- Computationally very efficient
- Discovered knowledge is immediately meaningful
- Fragments understandable to chemists
- Better than rules/decision trees on mystic attributes
- Really useful after incorporating expert feedback
- Markush structures allow for wildcards in fragments (fuzzy atoms and chains of flexible length)
- Applied successfully to HTS data analysis and chemical synthesis success prediction.
43 Conclusions
- Data analysis in the life sciences is inherently
- multi-disciplinary
- imprecise
- interactive
- context-dependent notions of similarity
- Focus is not exclusively on building good predictors
- Instead, the user wants understandable pieces of knowledge (Information Mining).
- Value of knowledge depends on archival
- Store & retrieve past experience
- and on usability
44 Knowledge Recycling
(Diagram: HIV activity related fragments; New structure: good candidate?)
45 Knowledge Recycling
(Diagram: New structure: good candidate? Shown with the following knowledge sources:)
- Synthesis success fragments
- HIV activity related fragments
- Metabolic pathway information
- 1000 molecule ion channel side effects (hERG assay)
- 5000 molecule rat liver toxicity tests
- Rat cancer CoMFA model
- Cluster model (3D similarity), kidney side effects
- Gene expression data for other diseased cells
- Competitors' patent space
46 Knowledge Recycling
- Hardly ever do we find precise fits
- find similar structures
- chemical similarity
- activity-related similarity
- ...
- determine related context
- cardiac effects vs. ion channel effects (hERG assay)
- appear in the same metabolic pathway
- related gene expression profiles
- ...
- and finally draw (inherently imprecise!) inferences
- Knowledge archival, management and usability are crucial.
47 Thank you.
Preprints/remarks/further questions: send email to berthold_at_ieee.org