Outline - PowerPoint PPT Presentation

About This Presentation

Title:

Outline

Description:

Synergies: Clever Algorithms and Domain Knowledge. Mining Molecular Fragments ... Based on Market Basket Analysis (Eclat Algorithm) ... – PowerPoint PPT presentation

Slides: 49

Provided by: michaelb

Transcript and Presenter's Notes

Title: Outline

1
Outline

Motivation: Imprecise Data in Bioinformatics
Drug Discovery and High Throughput Screening
Finding Clusters in Chemistry Space
Synergies: Clever Algorithms and Domain Knowledge
Mining Molecular Fragments
What the Chemist really wants: Imprecision (Fuzzy Atoms and flexible Chains)
Some Experimental Results (NCI HIV screens)
Conclusions & Outlook: Learning from Past Experience
Bits & Pieces of Evidence
Storing and Retrieving Knowledge

2
Drug Discovery

Classic
Expert Knowledge available
Metabolic pathway information
Binding site information
After Specific Target is identified
Generate Assay to identify desirable effect
Assemble Test (focused) library of compounds
First Phase: High Throughput Screening (HTS). Often hundreds of thousands of molecules tested in a highly automated fashion
After clever data analysis
Second Phase: Test a few hundred compounds more carefully (IC50)

3
Drug Discovery

And then (in the remaining 8-9 years)
Animal Testing
Several rounds of clinical testing
Approval procedures
And most often late stage failure
Go back to start, do not collect $1,000,000,000
Lead Rescue: eliminate side effects (ADME/Tox, cardiac effects; sometimes also avoid patents)
-> avoid bad areas in drug space (lead hopping)

4
High Throughput Screening

Rapidly screen hundreds of thousands of candidates.
Problems
Often thousands of actives
Data extremely noisy (up to 50% false positives, unknown false negatives!)
Positives almost always active for different reasons -> Separate, diverse clusters!
-> Goal: Find common properties among similar subsets of active molecules (help user understand activity patterns!)

5
What is Similarity?
Tropacocaine
1518-12246 Local Anesthetic
6
Types of Similarity

Structural similarity
Same basic layout of overall graph
or at least existence of a common subgraph
Geometrical similarity
Roughly same shape in 3D, independent of exact
atom matches
Instead of simple shape, also other properties
(surface charge) can be compared
Global properties
Molecular weight
Number of hydrogen-bond donors/acceptors
And many others

8
Motivation

Goal: Find (and describe!) structural groups of molecules that share activity.
For few molecules, manual inspection is feasible.
For more molecules, automated methods are needed.

12
Molecular Fragment Miner (MoFa): Ch. Borgelt, M. R. Berthold, IEEE Data Mining, 2002.

Goal
Find Fragments that are discriminative for a class of interest (high activity, good synthesis result, ...)
Appear often in Positives: freq(high activity) > threshold
Appear rarely in Negatives: freq(low activity) < delta
MoFa
Based on Market Basket Analysis (Eclat Algorithm)
Grow Fragment-Candidates from scratch, atom-by-atom
Only report significant and unique fragments
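
The two frequency conditions in the goal can be read as a simple filter. The sketch below is illustrative only and is not the MoFa implementation: it assumes each fragment has already been matched against the molecules, so a fragment is represented by the set of IDs of the molecules that contain it. The names `frequency` and `is_discriminative`, the toy data, and the values chosen for `threshold` and `delta` are all hypothetical.

```python
def frequency(containing, molecules):
    """Fraction of `molecules` that contain the fragment."""
    if not molecules:
        return 0.0
    return len(containing & molecules) / len(molecules)


def is_discriminative(containing, actives, inactives, threshold, delta):
    """True if the fragment is frequent in actives and rare in inactives."""
    return (frequency(containing, actives) > threshold
            and frequency(containing, inactives) < delta)


# Toy data (made up): molecule IDs 1-4 are active, 5-10 are inactive.
actives = {1, 2, 3, 4}
inactives = {5, 6, 7, 8, 9, 10}
fragments = {
    "frag_A": {1, 2, 3},        # in 3 of 4 actives, 0 of 6 inactives
    "frag_B": {1, 2, 5, 6, 7},  # in 2 of 4 actives, 3 of 6 inactives
}

selected = [name for name, ids in fragments.items()
            if is_discriminative(ids, actives, inactives,
                                 threshold=0.5, delta=0.2)]
print(selected)
```

Here only `frag_A` passes both conditions; `frag_B` is neither frequent enough among the actives nor rare enough among the inactives.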

13
Example

6 Example Molecules
Find all unique fragments that occur in ≥ 4 molecules
[Figure: the six example molecules, built from C, N, O and S atoms]
14
[Figure: example molecules (a)-(f)]
15
[Figure: search tree over examples (a)-(f); seed atoms C, N, O, S]
16
[Figure: search tree; seed atoms C, N, O, S with support labels 6, 4, 6, 6]
17
[Figure: search tree extended by one bond; two-atom fragments C-C, C-S, N=S, N-S, O=S, S-N, S=O, S-C, S=N]
18
[Figure: same tree with support labels 3, 4, 6, 4, 4, 4, 6, 4, 4 on the two-atom fragments]
19
[Figure: search tree, same labels as slide 18]
20
[Figure: search tree; two-atom fragments S-N, S=O, S-C, S=N with support labels 4, 6, 4, 4 (C-C, C-S, N=S, N-S, O=S no longer shown)]
21
[Figure: search tree with three-atom candidates (2-D layout lost in extraction): S-C-C, S-C N, S-C N, S-C O, S=N O, S-N C, S-N N, S-N O, S=N C, S=N N, S=O C, S=O N, S=O N]
22
[Figure: search tree, same labels as slide 21]
23
[Figure: search tree, same labels as slide 21]
24
Duplicate Fragments!

How do Apriori, Eclat & Co. avoid duplicate itemsets?
-> Prefix Tree
{a,b,c}
BUT: Prefix Tree requires a global order defined on items
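
For plain itemsets, the prefix-tree idea amounts to extending a set only with items that come later in a fixed global order, so every set is reached along exactly one path. A minimal sketch of that enumeration follows; the function name `enumerate_itemsets` is illustrative and not taken from any Apriori or Eclat implementation.

```python
def enumerate_itemsets(items):
    """Yield every non-empty itemset exactly once, in prefix-tree order."""
    ordered = sorted(items)  # the global order on items

    def grow(prefix, start):
        for i in range(start, len(ordered)):
            itemset = prefix + (ordered[i],)
            yield itemset
            # Only items after position i may extend this node, so
            # {a, b} is reached via a -> b and never via b -> a.
            yield from grow(itemset, i + 1)

    yield from grow((), 0)


itemsets = list(enumerate_itemsets({"a", "b", "c"}))
print(itemsets)
```

For three items this produces the seven non-empty subsets, each once. The trick depends on `sorted` giving one order that is valid everywhere, which is exactly what molecular graphs lack.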
25
Local Order on Atoms/Bonds

Global order on atoms/bonds is not possible
Use local order on atoms: C < N < O < S
In case of same atom type, use secondary order based on bond: single (-) < aromatic < double (=) < triple
Higher (or equal) extensions are only allowed on last atom extended, and
All extensions are allowed on atoms inserted after last atom extended.
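
One possible reading of these rules, written as a small filter over candidate extensions. This is a sketch under stated assumptions, not the MoFa code: atoms of a fragment are numbered in the order they were inserted, an extension is a hypothetical `(anchor_index, bond, atom)` triple, "atoms inserted after the last atom extended" is interpreted as atoms with a larger insertion index, and `#` is used here as an ad hoc symbol for a triple bond.

```python
ATOM_RANK = {"C": 0, "N": 1, "O": 2, "S": 3}
# single < aromatic < double < triple
BOND_RANK = {"-": 0, "aromatic": 1, "=": 2, "#": 3}


def extension_key(atom, bond):
    """Order extensions by atom type first, then by bond type."""
    return (ATOM_RANK[atom], BOND_RANK[bond])


def is_allowed(candidate, last):
    """Apply the two rules to a candidate extension.

    Both arguments are (anchor_index, bond, atom) triples, where
    anchor_index is the insertion position of the atom being extended.
    """
    anchor, bond, atom = candidate
    last_anchor, last_bond, last_atom = last
    if anchor > last_anchor:   # atom inserted after the last atom extended
        return True
    if anchor == last_anchor:  # same atom: only higher-or-equal extensions
        return extension_key(atom, bond) >= extension_key(last_atom, last_bond)
    return False               # earlier atoms may no longer be extended


# The previous step attached an N by a single bond to atom 0.
last = (0, "-", "N")
candidates = [(0, "-", "C"), (0, "=", "N"), (0, "-", "O"), (1, "-", "C")]
allowed = [c for c in candidates if is_allowed(c, last)]
print(allowed)
```

In this example the single-bonded C on atom 0 is rejected because it ranks below the previous extension, while the other three candidates are kept.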

26
[Figure: search tree, same labels as slide 21]
27
[Figure: search tree; three-atom candidates reduced to S-C-C, S-C N, S-C N, S-C O, S=N O, S-N N, S-N O]
28
[Figure: same tree with support labels 2, 2, 2, 4, 4, 4, 3 on the three-atom candidates S-C-C, S-C N, S-C N, S-C O, S-N N, S-N O, S=N O]
29
Support Based Pruning

Support of fragment A: supp(A) = frequency of appearance in molecules
Monotone condition: support declines (or stays the same) with the size of the fragment:
fragment A is contained in fragment B => supp(A) ≥ supp(B)
If supp(node) in a branch is below the threshold, then all child nodes will also be below the threshold.
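
The pruning rule is easiest to see on plain itemsets, the market-basket setting that Eclat comes from. The sketch below is a simplified depth-first search with transaction-ID sets; it treats each molecule as a bag of atom symbols and ignores bonds and graph embeddings, so it illustrates the pruning step only. The function name `frequent_itemsets` and the toy transactions are made up for illustration.

```python
def frequent_itemsets(transactions, min_support):
    """Depth-first (Eclat-style) search with support-based pruning."""
    # Vertical layout: item -> set of transaction ids containing it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    result = {}

    def grow(prefix, prefix_tids, candidates):
        for i, item in enumerate(candidates):
            tids = prefix_tids & tidsets[item]
            if len(tids) < min_support:
                continue  # prune: no superset of this node can be frequent
            itemset = prefix + (item,)
            result[itemset] = len(tids)
            grow(itemset, tids, candidates[i + 1:])

    all_tids = set(range(len(transactions)))
    grow((), all_tids, sorted(tidsets))
    return result


transactions = [{"C", "S", "N"}, {"C", "S", "O"}, {"C", "S", "N", "O"}, {"C", "N"}]
found = frequent_itemsets(transactions, min_support=3)
print(found)
```

Because O occurs in only two of the four transactions, the branch below O is never expanded, and no set containing O is ever counted.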

30
[Figure: search tree; three-atom candidates S-C-C, S-C N, S-C N, S-C O, S-N N, S-N O, S=N O with support labels 2, 2, 4, 4, 4, 3, 2]
31
[Figure: search tree; seed atoms C, N, O, S (support labels 6, 4, 6, 6), two-atom fragments S-N, S=O, S-C, S=N (4, 6, 4, 4), remaining three-atom fragments S-C N, S-C N, S-C O (4, 4, 4)]
32
Resulting Fragments for supp(A) ≥ 4
[Figure: reported fragments S-C O, S-C N, C-S, C-S-N with support labels 4, 4, 6, 4; examples (a)-(f)]
33
Resulting Fragments for supp(A) ≥ 4
Some fragments are not reported (due to redundant support)
[Figure: reported fragments as on slide 32; unreported fragments S N, C, S with support labels 4, 6, 6]
34
Discriminative Fragments

Just finding frequent fragments is usually not interesting
Find fragments that are
frequent in one class of molecules
and infrequent in the remainder of molecules
Discriminative Fragments summarize shared properties.
The numbers of actives and inactives that contain a fragment (and their ratio) indicate its relevance.
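
As a rough illustration of using the active/inactive ratio as a relevance score, the sketch below ranks fragments by the ratio of their two percentages. All counts are hypothetical and chosen only to make the arithmetic easy; the function name `relevance` and the choice to treat a fragment with no inactive hits as having an unbounded ratio are assumptions, not something stated on the slides.

```python
def relevance(n_active_hits, n_active, n_inactive_hits, n_inactive):
    """Return (% of actives, % of inactives, ratio) for one fragment."""
    pct_active = 100.0 * n_active_hits / n_active
    pct_inactive = 100.0 * n_inactive_hits / n_inactive
    # A fragment never seen in inactives has an unbounded ratio.
    ratio = float("inf") if pct_inactive == 0 else pct_active / pct_inactive
    return pct_active, pct_inactive, ratio


# Hypothetical counts: fragment -> (hits in actives, hits in inactives).
counts = {"frag_A": (60, 10), "frag_B": (40, 0), "frag_C": (20, 400)}
n_active, n_inactive = 100, 1000

ranked = sorted(
    counts,
    key=lambda f: relevance(counts[f][0], n_active, counts[f][1], n_inactive)[2],
    reverse=True,
)
print(ranked)
```

A ratio alone can overrate rare fragments, which is why the slide lists the raw numbers of actives and inactives alongside it.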

35
Example: NCI HIV dataset, 45000 (400 active) compounds, threshold 15%
[Figure: discovered fragment, 15.08% vs. 0.02%]
36
A few more fragments
5.23% vs. 0.05%
5.23% vs. 0.08%
4.92% vs. 0.07%
9.85% vs. 0.07%
9.85% vs. 0.0%
10.15% vs. 0.04%
37
Problems

However, some fragments puzzled our chemists

38
Small Differences
39
Chemist's view

Strict graph-based view of molecules is too restrictive
Some tolerances do not affect function, e.g.
In a specific context, some atoms may be of different type (e.g. N/C equivalence in aromatic rings, all halogens are equivalent, ...)
The exact length of a chain connecting two rigid substructures does not matter (e.g. chains of CH2 can be 2-4 carbons long, ...)

40
Fuzzy Matches: H. Hofer, Ch. Borgelt, M. R. Berthold, IDA, Berlin, 2003

Specifying wildcards via equivalence classes, here
Meta Atoms: Certain atoms can be matched
Maximum number of fuzzy-atoms allowed
Equivalence classes can overlap (e.g. {O,C} and {C,N})
Fuzzy Chains: Model flexible chains explicitly
Specify min/max length of chains
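
A minimal sketch of how equivalence classes and a cap on fuzzy atoms could work together. It matches flat atom sequences rather than molecular graphs, so it shows only the matching rule, not fragment mining. The class list, the function names, and the `max_fuzzy` parameter are hypothetical.

```python
# Hypothetical equivalence classes; classes may overlap (C is in two of them).
EQUIVALENCE_CLASSES = [{"O", "C"}, {"C", "N"}, {"F", "Cl", "Br", "I"}]


def atoms_equivalent(a, b):
    """True if a and b are identical or share an equivalence class."""
    return a == b or any(a in cls and b in cls for cls in EQUIVALENCE_CLASSES)


def fuzzy_match(pattern, atoms, max_fuzzy):
    """Match two equal-length atom sequences position by position.

    Exact matches are free; each equivalence-class match counts as one
    fuzzy atom, and at most `max_fuzzy` of them are allowed.
    """
    if len(pattern) != len(atoms):
        return False
    fuzzy_used = 0
    for p, a in zip(pattern, atoms):
        if p == a:
            continue
        if not atoms_equivalent(p, a):
            return False
        fuzzy_used += 1
    return fuzzy_used <= max_fuzzy


def chain_matches(chain_length, min_length, max_length):
    """A flexible chain matches if its length lies in the allowed range."""
    return min_length <= chain_length <= max_length


print(fuzzy_match(["C", "N", "S"], ["C", "C", "S"], max_fuzzy=1))  # one fuzzy atom
print(fuzzy_match(["O", "N", "S"], ["C", "C", "S"], max_fuzzy=1))  # two fuzzy atoms
print(chain_matches(3, min_length=2, max_length=4))
```

Note that overlapping classes make the relation non-transitive: with the classes above, O matches C and C matches N, but O does not match N.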

41
Fuzzy Atoms and Chains
42
MoFa - Summary

Search based on parallel embeddings and large scale data mining algorithm (Apriori/Eclat)
Computationally very efficient
Discovered knowledge is immediately meaningful
Fragments understandable to chemist
Better than rules/decision trees on mystic attributes
Really useful after incorporating Expert Feedback
Markush structures allow for wildcards in fragments (fuzzy atoms and chains of flexible length)
Applied successfully to HTS data analysis, chemical synthesis success prediction.

43
Conclusions

Data Analysis in Life Sciences is inherently
multi-disciplinary
imprecise
interactive
based on context-dependent notions of similarity
Focus is not exclusively on building good predictors
Instead, the user wants understandable pieces of knowledge (Information Mining).
Value of knowledge depends on archival (Store & Retrieve past experience) and on usability

44
Knowledge Recycling
HIV activity related fragments
New Structure: Good Candidate?
45
Knowledge Recycling
Synthesis Success fragments
HIV activity related fragments
Metabolic Pathway Information
1000-molecule ion channel side effects (hERG assay)
5000-molecule Rat Liver Toxicity tests
New Structure: Good Candidate?
Rat Cancer CoMFA model
Cluster model (3D similarity) kidney side effects
Gene Expression data for other diseased cells
Competitors' Patent Space
46
Knowledge Recycling