Problems and Opportunities for Machine Learning in Drug Discovery (Can you find lessons for Systems Biology?)

1
Problems and Opportunities for Machine Learning
in Drug Discovery (Can you find lessons for
Systems Biology?)

George S. Cowan, Ph.D.
Computer Aided Drug Discovery
Pfizer Global Research and Development, Ann Arbor
Labs

CSSB, Rovereto, Italy
19 April 2004
2
Working as a Computer Scientist in a Life
Sciences field requires an array of supporting
scientists. Thanks to:
Project Colleagues: David Wild, Kjell Johnson
Cheminformatics Mentors and Colleagues: John
Blankley, Alain Calvet, David Moreland
Risk Takers: Eric Gifford, Mark Snow, Christine
Humblet, Mike Rafferty
Academic: Peter Willett, Robert Pearlman
3
Drug Discovery and Development

Discern unmet medical need
Discover mechanism of action of disease
Identify target protein
Screen known compounds against target
Synthesize promising leads
Find 1-2 potential drugs
Toxicity, ADME
Clinical Trials

Biology
Chemistry
Pharmacology
4
(No Transcript)
5
Lock and Key Model
6
Virtual HTS Screening

Virtual Screening Definition
estimate some biological behavior of new
compounds
identify characteristics of compounds related to
that biological behavior
only use some computer representation of the
compounds
HTS Virtual Screening is Not QSAR/QSPR
Based on large amounts of easy to measure
observations
Uses early stage data from multiple chemical
series (no X-ray Crystallography)
Observations are not refined (Percent
Inhibition at a single concentration)
Looking for research direction, not best activity

7
Promise of Data Mining

Data Mining
Works with large sets of data
Efficient Processing
Finds non-intuitive information
Methods do not depend on the Domain (Marketing,
Fraud detection, Chemistry, ...)
Alternative Data Mining Approaches
Regression - Linear or Non-Linear - PLS
Principal Components
Association Rules
Clustering Approach - Unsupervised - Concept
Formation
Classification Approach - Supervised

8
Virtual Screening Challenges to Machine Learning
Overview (1)

No single computer representation captures all
the important information about a molecule
The candidate features for representing molecules
are highly correlated
Features are entangled
Multiple binding modes use different combinations
of features
Multiple chemical series / scaffolds use the same
binding mode
Evidence that some ligands take on multiple
conformations when binding to a target
Any 4 out of 5 important features may be
sufficient

9
More Challenges to Machine Learning
Overview (2)

Training data and validation data are not
representative
Measurements of activity are inherently noisy
Activity is a rare event: target populations are
unbalanced
Classification requires choosing cutoffs for
activity
There is no good measure for a successful
prediction
Many data mining methods characterize activity in
ways that are meaningless to a chemist
Data mining results must be reversible to assist
a chemist in inventing new molecules that will be
active (inverse QSAR)

10
Deep Challenges to Machine Learning
Overview (3)

No free lunch theorem
Science is different from marketing

11
No Single Computer Representation captures all
the important information

How do we characterize the electronic 'face' that
the molecule presents to the protein?
Grid of surface or surrounding points with field
calculations
Conformational flexibility
3-D relationships of pharmacophores
Complementary volumes and surfaces
Complementary charges
Complementary hydrogen bonding atoms
Similar Hydrophobicity/Hydrophilicity
Connectivity: Bonding between Atoms (2-D)
pharmacophore info is implicitly present to some
extent
not biased toward any particular conformation
Presence of molecular fragments (fingerprints)
Other: Linear (SLN, SMILES)? Free-tree?
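
A minimal sketch of the fragment-presence (fingerprint) representation listed above: each molecule becomes a binary vector with one bit per fragment, and two molecules are compared with the Tanimoto coefficient. The five-entry fragment dictionary and the hand-assigned fragment sets are assumptions for illustration only; real fingerprints are computed from the molecular graph and use far larger dictionaries.

```python
# Hypothetical fragment dictionary: one bit per fragment, in this order.
FRAGMENTS = ["aromatic ring", "carboxylic acid", "ester", "hydroxyl", "amine"]


def fingerprint(present):
    """Return a binary fingerprint: 1 where the fragment is present."""
    return [1 if frag in present else 0 for frag in FRAGMENTS]


def tanimoto(a, b):
    """Bits set in both fingerprints divided by bits set in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Fragment sets assigned by hand for this sketch, not by a substructure search.
aspirin = fingerprint({"aromatic ring", "carboxylic acid", "ester"})
salicylic_acid = fingerprint({"aromatic ring", "carboxylic acid", "hydroxyl"})

print(aspirin)                            # [1, 1, 1, 0, 0]
print(tanimoto(aspirin, salicylic_acid))  # 0.5
```

The two molecules share two of the four fragments that occur in either, so the similarity is 0.5. Note that nothing in the vector records where the fragments sit in 3-D, which is the limitation the slide points at.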

12
Pharmacophores
13
Representation of Chemical Structures (2D)

Aspirin

14
BCI Chemical Descriptors

Descriptors are binary and represent

15
We don't have the right descriptors, but we have
thousands that are easy to compute

Thousands of molecular fragments
Hundreds of calculated quasi-physical properties
Hundreds of structural connectivity indicators
Much of this information is redundant
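
One common way to cope with redundant descriptors is to drop any descriptor that is highly correlated with one already kept. The sketch below is an assumed illustration of that idea, not the method used in the talk; the descriptor names, the values, and the 0.95 threshold are all made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(columns, threshold=0.95):
    """Greedily keep a descriptor only if |r| with every kept one is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up descriptor values for four hypothetical compounds.
descriptors = {
    "mol_weight": [180.0, 138.0, 151.0, 206.0],
    "heavy_atoms": [13.0, 10.0, 11.0, 15.0],   # tracks mol_weight closely
    "h_bond_donors": [1.0, 2.0, 1.0, 2.0],
}
kept = drop_redundant(descriptors)
print(kept)  # ['mol_weight', 'h_bond_donors']
```

The result depends on the order in which descriptors are visited, and pairwise filtering does not untangle features that only matter in combination.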

16
Feature Interaction and Multiple Configurations
for Activity Require Disjunctive Models

Multiple binding modes where different
combinations of features contribute to the
activity (including non-competitive ligands)
Multiple chemical series / scaffolds use the same
binding mode
Any 3 out of 4 important features may be
sufficient
Evidence that some targets require multiple
conformations from a ligand in order to bind
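
To see why an "any 3 out of 4" rule calls for a disjunctive model, the sketch below expands it into the equivalent OR of AND-terms. The four feature names are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical names for the four important features.
FEATURES = ["donor", "acceptor", "aromatic", "hydrophobe"]


def m_of_n(present, features=FEATURES, m=3):
    """True if at least m of the listed features are present."""
    return sum(1 for f in features if f in present) >= m


def as_disjunction(features=FEATURES, m=3):
    """Expand 'at least m of n' into an OR over every m-feature AND-term."""
    return [set(term) for term in combinations(features, m)]


def matches_disjunction(present, terms):
    """True if any single AND-term is fully contained in the molecule's features."""
    return any(term <= present for term in terms)


terms = as_disjunction()
print(len(terms))  # 4

molecule = {"donor", "aromatic", "hydrophobe"}
print(m_of_n(molecule), matches_disjunction(molecule, terms))  # True True
```

A single conjunction of features cannot express this rule; it takes four alternative conjunctions, and the count grows combinatorially with more features.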

17
Non-competitive Binding
18
Non-competitive Binding
19
Unbalanced target populations (activity is a
rare event)

About 1% of drug-like molecules have interesting
activity
Most of our experience in classification methods
is with roughly balanced classes
Predictive methods are most accurate where they
have the most data (interpolation), but where we
need the most accuracy is with the extremely
active compounds (extrapolation)
Warning: Your data may look balanced
True population of interest
new and different compounds
Unrepresentative HTS training data
What chemists made in the past
Unrepresentative follow-up compounds for
validation
What chemists' intuition led them to submit to
testing
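
A small sketch of why roughly balanced-class experience misleads here: with about 1 active per 100 compounds, a model that calls everything inactive scores 99% accuracy while finding no actives. The 1000-compound set is made up for illustration.

```python
def accuracy_and_recall(actual, predicted):
    """Return (fraction of correct calls, fraction of actives found)."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    true_pos = sum(1 for a, p in zip(actual, predicted) if a and p)
    actives = sum(actual)
    recall = true_pos / actives if actives else 0.0
    return correct / len(actual), recall


actual = [1] * 10 + [0] * 990      # 10 actives in 1000 compounds
always_inactive = [0] * 1000       # trivial model: predict inactive for everything

acc, rec = accuracy_and_recall(actual, always_inactive)
print(acc, rec)  # 0.99 0.0
```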

20
Populations
21
Cipsline, Anti-infectives
Our models are accurate on the compounds made by
our labs
22
(No Transcript)
23
Kappa = 0.147
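
Assuming the figure on this slide is Cohen's kappa, the sketch below shows how that statistic is computed from a two-class confusion matrix: observed agreement corrected for the agreement expected by chance. The counts are made up and are not intended to reproduce the slide's value.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 confusion matrix of predicted vs. measured activity."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of the marginal rates for each class, summed.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class appears in both rows and columns
    return (observed - expected) / (1 - expected)


# Made-up counts: 10 true actives in 1000 compounds, 25 predicted active.
kappa = cohen_kappa(tp=5, fp=20, fn=5, tn=970)
print(round(kappa, 3))

# Calling every compound inactive gives 99% accuracy but zero kappa.
print(cohen_kappa(tp=0, fp=0, fn=10, tn=990))  # 0.0
```

Kappa near 0 means the predictions agree with the measurements little better than chance would, even when raw accuracy looks high on unbalanced data.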
24
Choosing cutoffs for activity and cutoffs for
compounds to pursue

Overlapping ranges of Inactive and Active
Cost of missing an active vs. cost of pursuing
an inactive
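
The trade-off between the two costs can be sketched as a search for the score cutoff with the lowest total cost. The scores, labels, and cost values below are invented for illustration; only cutoffs equal to an observed score are tried, so "pursue nothing" is not among the candidates.

```python
def best_cutoff(scored, cost_missed_active, cost_pursued_inactive):
    """scored: list of (score, is_active); compounds with score >= cutoff are pursued.

    Returns (cutoff, total_cost) for the cheapest cutoff among the observed scores.
    """
    best = None
    for cutoff in sorted({score for score, _ in scored}):
        missed = sum(1 for s, active in scored if active and s < cutoff)
        wasted = sum(1 for s, active in scored if not active and s >= cutoff)
        cost = missed * cost_missed_active + wasted * cost_pursued_inactive
        if best is None or cost < best[1]:
            best = (cutoff, cost)
    return best


# Made-up model scores with overlapping active and inactive ranges.
scored = [(0.9, True), (0.8, False), (0.7, True), (0.6, False),
          (0.4, True), (0.3, False), (0.2, False), (0.1, False)]

print(best_cutoff(scored, cost_missed_active=10, cost_pursued_inactive=1))  # (0.4, 2)
print(best_cutoff(scored, cost_missed_active=1, cost_pursued_inactive=10))  # (0.9, 2)
```

When a missed active is expensive the cutoff drops to catch every active; when follow-up is expensive the cutoff rises and actives are sacrificed.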

25
Ideal vs. Actual HTS Observations
26
(No Transcript)
27
Virtual Screening: % of actives retrieved vs. % of
compounds tested
28
We use the log-linear graph to compare methods at
different follow-up levels. See how 3 different
methods perform at selecting 5, 50, or 500
compounds to test
[Graph legend: RP, SOM, LVQ, Reference, Random]
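
The quantity plotted in a retrieval graph of this kind can be sketched as the fraction of all actives found among the top-k compounds of a ranked list, evaluated at several follow-up levels. The ranked list below is made up; it does not represent any of the methods in the legend.

```python
def actives_retrieved(ranked_labels, k):
    """Fraction of all actives found among the top-k ranked compounds."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0


# Made-up ranking of 500 compounds, best score first (1 = active, 0 = inactive).
ranked = ([1, 1, 0, 1, 0]                 # top 5: three actives
          + [0] * 20 + [1] + [0] * 24     # ranks 6-50: one more active
          + [0] * 100 + [1] + [0] * 349)  # ranks 51-500: the last active

for k in (5, 50, 500):
    random_expectation = k / len(ranked)  # what random selection would retrieve on average
    print(k, actives_retrieved(ranked, k), random_expectation)
```

Plotting k on a logarithmic axis spreads out the small follow-up levels (5 and 50), which is where the methods differ most from random selection.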
29
Noise in measurement of activity

Suppose 1% active and 1% error; then our
predicted actives are 50% false positives
This is out of the range of data-mining methods
(but see "Identifying Mislabeled Training Data",
Brodley & Friedl, JAIR, 1999)
Luckily, the error in measuring inactives is
dampened
Methods can take advantage of the accuracy in
inactive information in order to characterize
actives
On the other hand, inactives have nothing in
common, except that they are the other 99%
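
The arithmetic behind the false-positive figure can be checked with a short sketch. It reads the slide's numbers as 1% true actives and a 1% chance that any single measurement is wrong in either direction; the symmetric error model is an assumption.

```python
def false_positive_fraction(active_rate, error_rate):
    """Share of measured actives that are really inactive, under a symmetric error rate."""
    true_hits = active_rate * (1 - error_rate)     # actives correctly measured as active
    false_hits = (1 - active_rate) * error_rate    # inactives wrongly measured as active
    return false_hits / (true_hits + false_hits)


print(false_positive_fraction(0.01, 0.01))  # 0.5
```

The same 1% error barely touches the inactives: about 99% of measured inactives out of every 100 remain correct, which is why the error on that side is dampened.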

30
Mysterious Accuracy, OR: Neural Networks are great,
but what are they telling me?

We have a decision to make about data mining
goals
Do we try to outperform the chemist or engage
the chemist?
We need to assist a chemist in inventing new
molecules that will be active (inverse QSAR)
We need to characterize activity in ways that are
meaningful to a chemist

31
No Free Lunch Theorem

Proteins recognize molecules
Proteins compute a recognition function over the
set of molecules
Proteins have a very general architecture
Proteins can recognize very complex or very
simple characteristics of molecules
Proteins can compute any recognition function(?)
No single data-mining/machine-learning method can
outperform all others on arbitrary functions
Therefore every new target protein requires its
own modeling method
Cheap Brunch Hypothesis
Maybe proteins have a bias

32
Science, Not Marketing

We are looking for hypotheses that are worth the
effort of experimental validation (not
e-marketing opportunities)
Data-mining rules and models need to be in the
form of a hypothesis comparable to the chemists'
hypotheses
Chemists need tools that help them design
experiments to validate or invalidate these
competing hypotheses
HTS is an experiment in need of a design

33
Conclusion

Machine-learning tools provide an opportunity for
processing the new quantities of data that a
chemist is seeing
The naïve data-mining expert has a lot to learn
about chemical information
The naïve chemist has a lot to learn about
data-mining for information

34
If there are so many problems, why are we having
so much fun? Maybe we've stumbled into the
cheap brunch
