Outline - PowerPoint PPT Presentation

About This Presentation
Title:

Outline

Description:

Synergies: Clever Algorithms and Domain Knowledge. Mining Molecular Fragments ... Based on Market Basket Analysis (Eclat Algorithm) ... – PowerPoint PPT presentation

Slides: 49
Provided by: michaelb

Transcript and Presenter's Notes

Title: Outline


1
Outline
  • Motivation: Imprecise Data in Bioinformatics
  • Drug Discovery and High Throughput Screening
  • Finding Clusters in Chemistry Space
  • Synergies: Clever Algorithms and Domain Knowledge
  • Mining Molecular Fragments
  • What the Chemist really wants: Imprecision (Fuzzy
    Atoms and flexible Chains)
  • Some Experimental Results (NCI HIV screens)
  • Conclusions/Outlook: Learning from Past
    Experience
  • Bits & Pieces of Evidence
  • Storing and Retrieving Knowledge

2
Drug Discovery
  • Classic
  • Expert Knowledge available
  • Metabolic pathway information
  • Binding site information
  • After Specific Target is identified
  • Generate Assay to identify desirable effect
  • Assemble Test (focused) library of compounds
  • First Phase: High Throughput Screening (HTS).
    Often hundreds of thousands of molecules
    tested in highly automated fashion
  • After clever data analysis
  • Second Phase: Test a few hundred compounds more
    carefully (IC50)

3
Drug Discovery
  • And then (in the remaining 8-9 years)
  • Animal Testing
  • Several rounds of clinical testing
  • Approval procedures
  • And most often late stage failure
  • Go back to start, do not collect $1,000,000,000
  • Lead Rescue: eliminate side effects (ADME/Tox,
    cardiac effects, sometimes also avoid patents)
    → avoid bad areas in drug space (lead hopping)

4
High Throughput Screening
  • Rapidly screen hundreds of thousands of candidates.
  • Problems
  • Often thousands of actives
  • Data extremely noisy (up to 50% false positives,
    unknown false negatives!)
  • Positives almost always active for different
    reasons → Separate, diverse clusters!
  • → Goal: Find common properties among similar
    subsets of active molecules (help user understand
    activity patterns!)

5
What is Similarity?
Tropacocaine
1518-12246 Local Anesthetic
6
Types of Similarity
  • Structural similarity
  • Same basic layout of overall graph
  • or at least existence of a common subgraph
  • Geometrical similarity
  • Roughly same shape in 3D, independent of exact
    atom matches
  • Instead of simple shape, also other properties
    (surface charge) can be compared
  • Global properties
  • Molecular weight
  • Number of hydrogen-bond donors/acceptors
  • And many others

7
Motivation
  • Goal: Find (and describe!) structural groups of
    molecules that share activity.
  • For few molecules, manual inspection is feasible.

8
Motivation
  • Goal: Find (and describe!) structural groups of
    molecules that share activity.
  • For few molecules, manual inspection is feasible.
  • For more molecules, automated methods are needed.

9
Motivation
10
Motivation
11
Motivation
12
Molecular Fragment Miner (MoFa)
(Ch. Borgelt, M. R. Berthold, IEEE Data Mining, 2002)
  • Goal
  • Find Fragments that are discriminative for a
    class of interest (high activity, good synthesis
    result, ...)
  • Appear often in Positives: freq(high
    activity) > threshold
  • Appear rarely in Negatives: freq(low
    activity) < delta
  • MoFa
  • Based on Market Basket Analysis (Eclat Algorithm)
  • Grow Fragment-Candidates from scratch,
    atom-by-atom
  • Only report significant and unique fragments
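
The two frequency conditions on this slide can be read as a simple filter. The sketch below is only an illustration and not MoFa's code: the function name `is_discriminative`, the representation of a fragment's occurrences as sets of molecule ids, and the toy numbers are all assumptions.

```python
def is_discriminative(pos_hits, neg_hits, n_pos, n_neg, threshold, delta):
    """Return True if a fragment is frequent in positives and rare in negatives.

    pos_hits / neg_hits: ids of the positive / negative molecules that contain
    the fragment (hypothetical representation).
    threshold, delta: relative frequencies in [0, 1].
    """
    freq_pos = len(pos_hits) / n_pos
    freq_neg = len(neg_hits) / n_neg
    return freq_pos > threshold and freq_neg < delta


# Made-up toy data: 10 positive and 100 negative molecules.
frequent_and_rare = is_discriminative({1, 2, 3, 4}, {17}, 10, 100, threshold=0.3, delta=0.05)
too_infrequent = is_discriminative({1, 2}, {17}, 10, 100, threshold=0.3, delta=0.05)
print(frequent_and_rare, too_infrequent)  # True False
```

Both cut-offs are relative frequencies here; whether the original tool uses relative or absolute counts is not stated on the slide.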

13
Example
  • 6 Example Molecules
  • Find all unique fragments that occur in ≥ 4
    molecules

(Figure: the six example molecules, built from C, N, O and S atoms.)
14
(Figure: example molecules (a) to (f).)
15
(Figure: search tree over examples (a) to (f), starting from the single atoms C, N, O and S.)
16
(Figure: the same tree with a support count on each of the atoms C, N, O and S.)
17
(Figure: the tree extended by one bond to two-atom fragments, e.g. C-C, C-S, N-S, S-N and S-C.)
18
(Figure: the two-atom fragments annotated with their support counts.)
19
(Figure: same tree as slide 18.)
20
(Figure: the tree restricted to the four two-atom fragments that start at S, with their support counts.)
21
(Figure: the S branch extended to thirteen three-atom candidates, e.g. S-C-C; some labels appear more than once.)
22
(Figure: same tree as slide 21.)
23
(Figure: same tree as slide 21.)
24
Duplicate Fragments!
  • How do Apriori, Eclat & Co. avoid duplicate
    itemsets? → Prefix tree
(Figure: example with the items a, b, c.)
BUT: a prefix tree requires a global order defined
on items
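
For plain itemsets, the prefix-tree idea can be sketched as follows: once the items have a global order, a set is only ever extended with items that come after its last item, so every set is reached along exactly one path. This is a generic illustration under that assumption; the function name `enumerate_itemsets` is made up for the example.

```python
def enumerate_itemsets(items):
    """Yield every non-empty subset of `items` exactly once, as a sorted tuple.

    A prefix is only extended with items that follow its last item in the
    global order, so no subset is generated twice.
    """
    ordered = sorted(items)

    def extend(prefix, start):
        for i in range(start, len(ordered)):
            candidate = prefix + (ordered[i],)
            yield candidate
            yield from extend(candidate, i + 1)

    yield from extend((), 0)


itemsets = list(enumerate_itemsets({"a", "b", "c"}))
print(itemsets)
# [('a',), ('a', 'b'), ('a', 'b', 'c'), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

The set {a, b, c} is produced only as ('a', 'b', 'c'), never as ('b', 'a', 'c') or any other permutation.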
25
Local Order on Atoms/Bonds
  • Global order on atoms/bonds is not possible
  • Use local order on atoms: C < N < O < S
  • In case of same atom type, use secondary order
    based on bond: single (-) < aromatic < double
    (=) < triple
  • Higher (or equal) extensions are only allowed on
    last atom extended, and
  • All extensions are allowed on atoms inserted
    after last atom extended.
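
One possible reading of these rules as code is sketched below. The encoding is an assumption made for illustration: an extension is a triple `(anchor_index, bond, atom)`, where `anchor_index` is the insertion position of the fragment atom being extended. The names `extension_key` and `is_allowed` are hypothetical and not taken from MoFa.

```python
ATOM_RANK = {"C": 0, "N": 1, "O": 2, "S": 3}
BOND_RANK = {"single": 0, "aromatic": 1, "double": 2, "triple": 3}


def extension_key(bond, atom):
    """Rank an extension by atom type first, bond type second."""
    return (ATOM_RANK[atom], BOND_RANK[bond])


def is_allowed(candidate, last):
    """Decide whether `candidate` may follow the previous extension `last`."""
    c_anchor, c_bond, c_atom = candidate
    l_anchor, l_bond, l_atom = last
    if c_anchor > l_anchor:
        return True   # atom inserted after the last atom extended: anything goes
    if c_anchor < l_anchor:
        return False  # atoms before the last atom extended may not be extended
    # Same atom as last time: only higher (or equal) extensions are allowed.
    return extension_key(c_bond, c_atom) >= extension_key(l_bond, l_atom)


last = (0, "single", "N")  # fragment S-N: S has index 0, the new N has index 1
print(is_allowed((0, "single", "O"), last))  # True: O >= N on the same atom
print(is_allowed((0, "single", "C"), last))  # False: C < N on the same atom
print(is_allowed((1, "single", "C"), last))  # True: extends the newly added N
```

Under this reading, a fragment containing both S-N and S-C is only generated by adding C first and N second, not the other way round.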

26
(Figure: search tree from slide 21 with thirteen three-atom candidates in the S branch.)
27
(Figure: the same tree with seven of the thirteen three-atom candidates remaining.)
28
(Figure: the seven remaining three-atom fragments annotated with their support counts.)
29
Support Based Pruning
  • Support of fragment A: supp(A) = frequency of
    appearance in molecules
  • Monotone condition: support declines with size
    of fragment: fragment A is contained in fragment
    B ⇒ supp(A) ≥ supp(B)
  • If supp(node) in a branch is below the threshold, then
    all child nodes will also be below the threshold.
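
The pruning rule can be illustrated on ordinary itemsets with an Eclat-style depth-first search. This is a simplified stand-in for fragment mining: it works on sets of items rather than molecular graphs, and the function name `frequent_itemsets` and the toy baskets are assumptions.

```python
def frequent_itemsets(transactions, min_support):
    """Depth-first, Eclat-style search with support-based pruning.

    transactions: list of sets of items.
    Returns a dict mapping each frequent itemset (sorted tuple) to its support.
    """
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    ordered = sorted(tidsets)
    result = {}

    def grow(prefix, prefix_tids, start):
        for i in range(start, len(ordered)):
            tids = prefix_tids & tidsets[ordered[i]]
            if len(tids) < min_support:
                continue  # pruned: every superset has support <= len(tids)
            candidate = prefix + (ordered[i],)
            result[candidate] = len(tids)
            grow(candidate, tids, i + 1)

    grow((), set(range(len(transactions))), 0)
    return result


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = frequent_itemsets(baskets, min_support=3)
print(frequent)
```

With these baskets the pairs reach support 3, while {a, b, c} occurs only twice and is cut off without exploring anything beneath it.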

30
(Figure: same tree as slide 28; three of the seven three-atom fragments have support 4, the others support 2 or 3.)
31
(Figure: the tree after pruning; the three remaining three-atom fragments, S-C N (listed twice) and S-C O, each have support 4.)
32
Resulting Fragments for supp(A) ≥ 4
(Figure: resulting fragments S-C O, S-C N, C-S and C-S-N, with the support counts 4, 4, 6 and 4 in the order listed on the slide.)
33
Some fragments which are not reported (due to
redundant support)
(Figure: unreported fragments S N, C and S, with the support counts 4, 6 and 6 in the order listed on the slide; the resulting fragments from slide 32 are shown again.)
34
Discriminative Fragments
  • Just finding frequent fragments usually not
    interesting
  • Find fragments that are
  • frequent in one class of molecules
  • and infrequent in the remainder of molecules
  • Discriminative Fragments summarize shared
    properties.
  • Number of actives and inactives (and the ratio)
    that contain fragment indicates relevance.
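
A small sketch of how such counts could be turned into a ranking follows. The fragment names and all counts are made up for illustration, and the helper `relevance` is hypothetical; nothing here reproduces the NCI HIV results.

```python
def relevance(active_hits, n_active, inactive_hits, n_inactive):
    """Return (% of actives, % of inactives, ratio) for one fragment."""
    pct_active = 100.0 * active_hits / n_active
    pct_inactive = 100.0 * inactive_hits / n_inactive
    ratio = pct_active / pct_inactive if pct_inactive > 0 else float("inf")
    return pct_active, pct_inactive, ratio


# Made-up counts per hypothetical fragment: (hits in actives, hits in inactives).
counts = {"frag_1": (30, 5), "frag_2": (40, 400), "frag_3": (12, 0)}
n_active, n_inactive = 200, 10000

scores = {
    name: relevance(a, n_active, i, n_inactive) for name, (a, i) in counts.items()
}
ranked = sorted(scores, key=lambda name: scores[name][2], reverse=True)
for name in ranked:
    pct_active, pct_inactive, _ = scores[name]
    print(f"{name}: {pct_active:.2f} vs. {pct_inactive:.2f}")
```

Ranking by ratio alone favours fragments that never occur in inactives, however rare they are among actives, which is one reason to look at the raw counts as well.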

35
Example: NCI HIV dataset, 45000 (400 active)
compounds, threshold 15%
(Figure: fragment)
15.08% vs. 0.02%
36
A few more fragments
5.23% vs. 0.05%
5.23% vs. 0.08%
4.92% vs. 0.07%
9.85% vs. 0.07%
9.85% vs. 0.0%
10.15% vs. 0.04%
37
Problems
  • However, some fragments puzzled our chemists

38
Small Differences
39
Chemist's view
  • Strict graph-based view of molecules is too
    restrictive
  • Some tolerances do not affect function, e.g.
  • In a specific context, some atoms may be of
    different type (e.g. N/C equivalence in aromatic
    rings, all halogens are equivalent, ...)
  • The exact length of a chain connecting two rigid
    substructures does not matter (e.g. chains of CH2
    can be 2-4 carbons long, ...)

40
Fuzzy Matches
(H. Hofer, Ch. Borgelt, M. R. Berthold, IDA, Berlin, 2003)
  • Specifying wildcards via equivalence classes,
    here
  • Meta Atoms: Certain atoms can be matched
  • Maximum number of fuzzy atoms allowed
  • Equivalence classes can overlap (e.g. {O, C} and
    {C, N})
  • Fuzzy Chains: Model flexible chains explicitly
  • Specify min/max length of chains
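
The two kinds of wildcard can be sketched on linear atom sequences. This is a deliberately simplified illustration, since real fragments are graphs; the meta-atom names `X` and `A`, the functions `matches` and `chain_matches`, and the use of a run of `C` atoms as a stand-in for a CH2 chain are all assumptions.

```python
# Hypothetical equivalence classes: X = any halogen, A = carbon or nitrogen.
META_ATOMS = {"X": {"F", "Cl", "Br", "I"}, "A": {"C", "N"}}


def matches(pattern, atoms, max_fuzzy):
    """Match a linear pattern against an atom sequence of the same length.

    Pattern tokens are element symbols or meta-atom names from META_ATOMS.
    At most `max_fuzzy` meta atoms may be used.
    """
    if len(pattern) != len(atoms):
        return False
    fuzzy_used = 0
    for token, atom in zip(pattern, atoms):
        if token in META_ATOMS:
            if atom not in META_ATOMS[token]:
                return False
            fuzzy_used += 1
        elif token != atom:
            return False
    return fuzzy_used <= max_fuzzy


def chain_matches(atoms, head, tail, min_len, max_len):
    """True if `atoms` is head + a run of C atoms + tail, with the run
    length between min_len and max_len."""
    middle = atoms[len(head):len(atoms) - len(tail)]
    return (
        len(atoms) >= len(head) + len(tail)
        and atoms[:len(head)] == head
        and atoms[len(atoms) - len(tail):] == tail
        and min_len <= len(middle) <= max_len
        and all(atom == "C" for atom in middle)
    )


print(matches(["S", "X"], ["S", "Cl"], max_fuzzy=1))                  # True
print(matches(["A", "X"], ["N", "Br"], max_fuzzy=1))                  # False: two meta atoms
print(chain_matches(["N", "C", "C", "C", "O"], ["N"], ["O"], 2, 4))   # True
print(chain_matches(["N", "C", "O"], ["N"], ["O"], 2, 4))             # False: chain too short
```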

41
Fuzzy Atoms and Chains
42
MoFa - Summary
  • Search based on parallel embeddings and large
    scale data mining algorithm (Apriori/Eclat)
  • Computationally very efficient
  • Discovered knowledge is immediately meaningful
  • Fragments understandable to chemist
  • Better than rules/decision trees on mystic
    attributes
  • Really useful after incorporating Expert
    Feedback
  • Markush structures allow for wildcards in
    fragments (fuzzy atoms and chains of flexible
    length)
  • Applied successfully to HTS data analysis,
    chemical synthesis success prediction.

43
Conclusions
  • Data Analysis in Life Sciences is inherently
  • multi-disciplinary
  • Imprecise
  • Interactive
  • context-dependent notions of similarity
  • Focus is not exclusively on building good
    predictors
  • Instead the user wants understandable pieces of
    knowledge (Information Mining).
  • Value of knowledge depends on archival
  • Store & Retrieve past experience
  • and on usability

44
Knowledge Recycling
HIV activity related fragments
New Structure: Good Candidate?
45
Knowledge Recycling
Synthesis Success fragments
HIV activity related fragments
Metabolic Pathway Information
1000 molecule ion channel side effects (hERG
assay)
5000 molecule Rat Liver Toxicity tests
New Structure: Good Candidate?
Rat Cancer CoMFA model
Cluster model (3D similarity), kidney side effects
Gene Expression data for other diseased cells
Competitors' Patent Space
46
Knowledge Recycling
  • Hardly ever do we find precise fits
  • find similar structures
  • chemical similarity
  • activity related similarity
  • determine related context
  • cardiac effects vs. ion channel effects (hERG
    assay)
  • appear in same metabolic pathway
  • Related gene expression profiles
  • and finally draw (inherently imprecise!)
    inferences
  • Knowledge Archival, Management and Usability are
    crucial.

47
Thank you. Preprints/Remarks/further
Questions: send email to berthold_at_ieee.org
48
(No Transcript)