Problems and Opportunities for Machine Learning in Drug Discovery (Can you find lessons for Systems Biology?)

1
Problems and Opportunities for Machine Learning
in Drug Discovery (Can you find lessons for
Systems Biology?)
  • George S. Cowan, Ph.D.
  • Computer Aided Drug Discovery
  • Pfizer Global Research and Development, Ann Arbor
    Labs

CSSB, Rovereto, Italy
19 April 2004
2
Working as a Computer Scientist in a Life
Sciences field requires an array of supporting
scientists. Thanks to:
  • Project Colleagues: David Wild, Kjell Johnson
  • Cheminformatics Mentors and Colleagues: John
    Blankley, Alain Calvet, David Moreland
  • Risk Takers: Eric Gifford, Mark Snow, Christine
    Humblet, Mike Rafferty
  • Academic: Peter Willett, Robert Pearlman
3
Drug Discovery and Development
  • Discern unmet medical need
  • Discover mechanism of action of disease
  • Identify target protein
  • Screen known compounds against target
  • Synthesize promising leads
  • Find 1-2 potential drugs
  • Toxicity, ADME
  • Clinical Trials

Biology
Chemistry
Pharmacology
4
(No Transcript)
5
Lock and Key Model
6
Virtual HTS Screening
  • Virtual Screening Definition
      • estimate some biological behavior of new
        compounds
      • identify characteristics of compounds related to
        that biological behavior
      • only use some computer representation of the
        compounds
  • HTS Virtual Screening is Not QSAR/QSPR
      • Based on large amounts of easy-to-measure
        observations
      • Uses early-stage data from multiple chemical
        series (no X-ray Crystallography)
      • Observations are not refined (Percent
        Inhibition at a single concentration)
      • Looking for research direction, not best activity

7
Promise of Data Mining
  • Data Mining
      • Works with large sets of data
      • Efficient Processing
      • Finds non-intuitive information
      • Methods do not depend on the Domain (Marketing,
        Fraud detection, Chemistry, ...)
  • Alternative Data Mining Approaches
      • Regression - Linear or Non-Linear - PLS
      • Principal Components
      • Association Rules
      • Clustering Approach - Unsupervised - Concept
        Formation
      • Classification Approach - Supervised

8
Virtual Screening Challenges to Machine Learning
Overview (1)
  • No single computer representation captures all
    the important information about a molecule
  • The candidate features for representing molecules
    are highly correlated
  • Features are entangled
  • Multiple binding modes use different combinations
    of features
  • Multiple chemical series / scaffolds use the same
    binding mode
  • Evidence that some ligands take on multiple
    conformations when binding to a target
  • Any 4 out of 5 important features may be
    sufficient

9
More Challenges to Machine Learning
Overview (2)
  • Training data and validation data are not
    representative
  • Measurements of activity are inherently noisy
  • Activity is a rare event: target populations are
    unbalanced
  • Classification requires choosing cutoffs for
    activity
  • There is no good measure for a successful
    prediction
  • Many data mining methods characterize activity in
    ways that are meaningless to a chemist
  • Data mining results must be reversible to assist
    a chemist in inventing new molecules that will be
    active (inverse QSAR)

10
Deep Challenges to Machine Learning
Overview (3)
  • No free lunch theorem
  • Science is different from marketing

11
No Single Computer Representation captures all
the important information
  • How do we characterize the electronic 'face' that
    the molecule presents to the protein?
  • Grid of surface or surrounding points with field
    calculations
  • Conformational flexibility
  • 3-D relationships of pharmacophores
  • Complementary volumes and surfaces
  • Complementary charges
  • Complementary hydrogen bonding atoms
  • Similar Hydrophobicity/Hydrophilicity
  • Connectivity Bonding between Atoms (2-D)
  • pharmacophore info is implicitly present to some
    extent
  • not biased toward any particular conformation
  • Presence of molecular fragments (fingerprints)
  • Other Linear (SLN, SMILES)? Free-tree?

12
Pharmacophores
13
Representation of Chemical Structures (2D)
  • Aspirin

14
BCI Chemical Descriptors
  • Descriptors are binary and represent

15
We don't have the right descriptors, but we have
thousands that are easy to compute
  • Thousands of molecular fragments
  • Hundreds of calculated quasi-physical properties
  • Hundreds of structural connectivity indicators
  • Much of this information is redundant
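
One common way to act on this redundancy is to drop descriptors that are nearly perfectly correlated with ones already kept. The sketch below is only an illustration under stated assumptions: the descriptor names, the toy values, and the 0.95 threshold are all hypothetical, and the greedy filter is a generic technique, not the method used in the talk.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined for a constant column; treat it as 0.
        return 0.0
    return sxy / sqrt(sxx * syy)

def drop_redundant(columns, threshold=0.95):
    """Greedily keep a descriptor only if |r| with every kept one is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy table: 6 hypothetical compounds described by 4 hypothetical descriptors.
descriptors = {
    "has_carboxyl":  [1, 0, 1, 1, 0, 0],
    "has_COOH_frag": [1, 0, 1, 1, 0, 0],  # duplicates the first column
    "ring_count":    [1, 2, 1, 3, 0, 2],
    "no_carboxyl":   [0, 1, 0, 0, 1, 1],  # exact complement of the first column
}
selected = drop_redundant(descriptors)
print(selected)  # ['has_carboxyl', 'ring_count']
```

The result depends on column order, because the filter keeps whichever member of a correlated group it sees first. A pairwise filter like this also cannot detect a descriptor that is redundant only in combination with several others.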

16
Feature Interaction and Multiple Configurations
for Activity Require Disjunctive Models
  • Multiple binding modes where different
    combinations of features contribute to the
    activity (including non-competitive ligands)
  • Multiple chemical series / scaffolds use the same
    binding mode
  • Any 3 out of 4 important features may be
    sufficient
  • Evidence that some targets require multiple
    conformations from a ligand in order to bind
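
The "any 3 out of 4" case shows why a single conjunctive rule is not enough. As a minimal sketch, with hypothetical feature names, the code below expands an m-of-n rule into its equivalent OR-of-ANDs form and checks that both forms agree on two made-up molecules.

```python
from itertools import combinations

def m_of_n(features, important, m):
    """True if at least m of the important features are present."""
    return sum(1 for f in important if f in features) >= m

def as_disjunction(important, m):
    """Expand the m-of-n rule into its equivalent OR-of-ANDs terms."""
    return [frozenset(c) for c in combinations(important, m)]

def dnf_matches(features, terms):
    """True if any conjunctive term is fully contained in the feature set."""
    return any(term <= features for term in terms)

# Hypothetical feature names, for illustration only.
important = ["donor", "acceptor", "aromatic_ring", "hydrophobe"]
terms = as_disjunction(important, 3)

mol_a = {"donor", "acceptor", "hydrophobe"}  # has 3 of the 4 features
mol_b = {"donor", "aromatic_ring"}           # has only 2 of the 4

print(len(terms))                                            # 4
print(m_of_n(mol_a, important, 3), dnf_matches(mol_a, terms))  # True True
print(m_of_n(mol_b, important, 3), dnf_matches(mol_b, terms))  # False False
```

A 3-of-4 rule already needs four conjunctive terms, and the count grows combinatorially with the number of features, which is one reason purely conjunctive learners struggle with this kind of activity.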

17
Non-competitive Binding
18
Non-competitive Binding
19
Unbalanced target populations (activity is a
rare event)
  • About 1% of drug-like molecules have interesting
    activity
  • Most of our experience in classification methods
    is with roughly balanced classes
  • Predictive methods are most accurate where they
    have the most data (interpolation), but where we
    need the most accuracy is with the extremely
    active compounds (extrapolation)
  • Warning: Your data may look balanced
  • True population of interest
      • new and different compounds
  • Unrepresentative HTS training data
      • What chemists made in the past
  • Unrepresentative follow-up compounds for
    validation
      • What chemists' intuition led them to submit to
        testing
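
To see why a 1% active rate undermines experience gained on balanced classes, consider a model that labels every compound inactive. The sketch below uses a hypothetical screen of 1000 compounds and shows that accuracy alone looks excellent while no active is found.

```python
def accuracy(actual, predicted):
    """Fraction of compounds whose predicted class matches the assay label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall(actual, predicted):
    """Fraction of the true actives that the model flags as active."""
    actives = sum(actual)
    hits = sum(1 for a, p in zip(actual, predicted) if a and p)
    return hits / actives if actives else 0.0

# Hypothetical screen: 1000 compounds, 1% of them active.
actual = [True] * 10 + [False] * 990
always_inactive = [False] * len(actual)

baseline_accuracy = accuracy(actual, always_inactive)
baseline_recall = recall(actual, always_inactive)
print(f"accuracy={baseline_accuracy:.2f} recall={baseline_recall:.2f}")
# accuracy=0.99 recall=0.00
```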

20
Populations
21
Cipsline, Anti-infectives
Our models are accurate on the compounds made by
our labs
22
(No Transcript)
23
Kappa = 0.147
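
The slide does not say which kappa statistic is meant. Assuming it is Cohen's kappa, which measures agreement beyond what the class frequencies alone would produce, a minimal sketch of the calculation from a 2x2 confusion table looks like this. The counts are made up and are not the data behind the 0.147 figure.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of predicted vs. measured activity."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate table: no room for agreement beyond chance
    return (observed - expected) / (1.0 - expected)

# Hypothetical counts for 1000 compounds with 10 true actives.
kappa_model = cohen_kappa(tp=5, fp=20, fn=5, tn=970)
kappa_all_inactive = cohen_kappa(tp=0, fp=0, fn=10, tn=990)
print(round(kappa_model, 3), kappa_all_inactive)
```

In this made-up example the model is 97.5% accurate, yet its kappa is below 0.3, and the predict-everything-inactive baseline scores exactly 0 despite 99% accuracy.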
24
Choosing cutoffs for activity and cutoffs for
compounds to pursue
  • Overlapping ranges of Inactive and Active
  • Cost of missing an active vs. cost of pursuing
    an inactive
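
One way to make this trade-off explicit is to assign a cost to each kind of mistake and pick the cutoff that minimizes the total. The sketch below is a generic illustration: the scores, labels, and cost values are hypothetical, and real costs are rarely known this precisely.

```python
def total_cost(scores, labels, cutoff, cost_missed_active, cost_pursued_inactive):
    """Total cost of pursuing every compound whose score is at or above cutoff."""
    cost = 0.0
    for score, active in zip(scores, labels):
        pursued = score >= cutoff
        if active and not pursued:
            cost += cost_missed_active
        elif pursued and not active:
            cost += cost_pursued_inactive
    return cost

def best_cutoff(scores, labels, cost_missed_active, cost_pursued_inactive):
    """Try each observed score (plus 'pursue nothing') and return the cheapest cutoff."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda c: total_cost(scores, labels, c,
                                 cost_missed_active, cost_pursued_inactive),
    )

# Hypothetical percent-inhibition scores with overlapping active/inactive ranges.
scores = [5, 12, 20, 35, 40, 48, 55, 60, 72, 90]
labels = [False, False, False, False, True, False, True, False, True, True]

# Missing an active is 10x worse than chasing an inactive: cast a wide net.
print(best_cutoff(scores, labels, cost_missed_active=10, cost_pursued_inactive=1))  # 40
# Follow-up is expensive relative to a miss: pursue only the strongest signals.
print(best_cutoff(scores, labels, cost_missed_active=1, cost_pursued_inactive=5))   # 72
```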

25
Ideal vs. Actual HTS Observations
26
(No Transcript)
27
Virtual Screening: % of actives retrieved vs. % of
compounds tested
28
We use the log-linear graph to compare methods at
different follow-up levels. See how 3 different
methods perform at selecting 5, 50, or 500
compounds to test
RP
SOM
LVQ
Reference
Random
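
The points on such a graph can be computed from a ranked list of compounds. As a sketch under stated assumptions (a hypothetical library of 1000 compounds with 10 actives at made-up ranks), the code below returns the percent of compounds tested and the percent of actives retrieved at each follow-up size.

```python
def enrichment_curve(ranked_labels, checkpoints):
    """For each follow-up size k: (% of compounds tested, % of actives retrieved)."""
    n = len(ranked_labels)
    total_actives = sum(ranked_labels)
    curve = []
    for k in checkpoints:
        found = sum(ranked_labels[:k])  # actives among the top k ranked compounds
        curve.append((100.0 * k / n, 100.0 * found / total_actives))
    return curve

# Hypothetical ranking: position 0 is the compound the model likes best.
active_ranks = {0, 2, 3, 10, 30, 45, 120, 300, 480, 900}
ranked_labels = [i in active_ranks for i in range(1000)]

curve = enrichment_curve(ranked_labels, checkpoints=[5, 50, 500])
for pct_tested, pct_found in curve:
    print(f"{pct_tested:5.1f}% tested -> {pct_found:5.1f}% of actives retrieved")
```

A random selection is expected to retrieve actives in proportion to the compounds tested, so its reference line has equal x and y values. Plotting the tested axis on a log scale keeps the 5- and 50-compound levels from being squeezed into the left edge.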
29
Noise in measurement of activity
  • Suppose 1% active and 1% error; then our
    predicted actives are 50% false positives
  • This is out of the range of data-mining methods
    (but see Identifying Mislabeled Training Data,
    Brodley & Friedl, JAIR, 1999)
  • Luckily, the error in measuring inactives is
    dampened
  • Methods can take advantage of the accuracy in
    inactive information in order to characterize
    actives
  • On the other hand, inactives have nothing in
    common, except that they are the other 99%
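
The arithmetic behind the first bullet can be checked directly. This sketch assumes a symmetric error, meaning the same chance of mislabeling an active as an inactive and vice versa, which is a simplification the slide does not state. The screen size of 10,000 is arbitrary.

```python
def label_purity(n, active_rate, error_rate):
    """Return (fraction of measured actives that are truly active,
               fraction of measured inactives that are truly inactive)."""
    true_actives = n * active_rate
    true_inactives = n - true_actives
    tp = true_actives * (1 - error_rate)    # actives measured as active
    fn = true_actives * error_rate          # actives measured as inactive
    fp = true_inactives * error_rate        # inactives measured as active
    tn = true_inactives * (1 - error_rate)  # inactives measured as inactive
    return tp / (tp + fp), tn / (tn + fn)

active_purity, inactive_purity = label_purity(10_000, 0.01, 0.01)
print(f"measured actives that are real:     {active_purity:.1%}")
print(f"measured inactives that are real:   {inactive_purity:.2%}")
```

With these numbers, 99 true actives and 99 mislabeled inactives land in the measured-active pool, so half of it is false positives, while the measured-inactive pool remains almost entirely correct.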

30
Mysterious Accuracy, OR Neural Networks are great,
but what are they telling me?
  • We have a decision to make about data mining
    goals
  • Do we try to outperform the chemist or engage
    the chemist?
  • We need to assist a chemist in inventing new
    molecules that will be active (inverse QSAR)
  • We need to characterize activity in ways that are
    meaningful to a chemist

31
No Free Lunch Theorem
  • Proteins recognize molecules
  • Proteins compute a recognition function over the
    set of molecules
  • Proteins have a very general architecture
  • Proteins can recognize very complex or very
    simple characteristics of molecules
  • Proteins can compute any recognition function(?)
  • No single data-mining/machine-learning method can
    outperform all others on arbitrary functions
  • Therefore every new target protein requires its
    own modeling method
  • Cheap Brunch Hypothesis
  • Maybe proteins have a bias

32
Science, Not Marketing
  • We are looking for hypotheses that are worth the
    effort of experimental validation (not
    e-marketing opportunities)
  • Data-mining rules and models need to be in the
    form of a hypothesis comparable to the chemists'
    hypotheses
  • Chemists need tools that help them design
    experiments to validate or invalidate these
    competing hypotheses
  • HTS is an experiment in need of a design

33
Conclusion
  • Machine-learning tools provide an opportunity for
    processing the new quantities of data that a
    chemist is seeing
  • The naïve data-mining expert has a lot to learn
    about chemical information
  • The naïve chemist has a lot to learn about
    data-mining for information

34
If there are so many problems, why are we having
so much fun? Maybe we've stumbled into the
cheap brunch