Title: Problems and Opportunities for Machine Learning in Drug Discovery Can you find lessons for Systems B
1 Problems and Opportunities for Machine Learning in Drug Discovery
(Can you find lessons for Systems Biology?)
- George S. Cowan, Ph.D.
- Computer Aided Drug Discovery
- Pfizer Global Research and Development, Ann Arbor Labs
CSSB, Rovereto, Italy
19 April 2004
2 Working as a Computer Scientist in a Life Sciences field requires an array of supporting scientists
Thanks to
- Project Colleagues: David Wild, Kjell Johnson
- Cheminformatics Mentors and Colleagues: John Blankley, Alain Calvet, David Moreland
- Risk Takers: Eric Gifford, Mark Snow, Christine Humblet, Mike Rafferty
- Academic: Peter Willett, Robert Pearlman
3 Drug Discovery and Development
- Discern unmet medical need
- Discover mechanism of action of disease
- Identify target protein
- Screen known compounds against target
- Synthesize promising leads
- Find 1-2 potential drugs
- Toxicity, ADME
- Clinical Trials
Biology
Chemistry
Pharmacology
5 Lock and Key Model
6 Virtual HTS Screening
- Virtual Screening Definition
  - estimate some biological behavior of new compounds
  - identify characteristics of compounds related to that biological behavior
  - only use some computer representation of the compounds
- HTS Virtual Screening is Not QSAR/QSPR
  - Based on large amounts of easy-to-measure observations
  - Uses early-stage data from multiple chemical series (no X-ray Crystallography)
  - Observations are not refined (Percent Inhibition at a single concentration)
  - Looking for research direction, not best activity
7 Promise of Data Mining
- Data Mining
  - Works with large sets of data
  - Efficient Processing
  - Finds non-intuitive information
  - Methods do not depend on the Domain (Marketing, Fraud detection, Chemistry, ...)
- Alternative Data Mining Approaches
  - Regression: Linear or Non-Linear, PLS
  - Principal Components
  - Association Rules
  - Clustering Approach: Unsupervised, Concept Formation
  - Classification Approach: Supervised
8 Virtual Screening Challenges to Machine Learning
Overview (1)
- No single computer representation captures all the important information about a molecule
- The candidate features for representing molecules are highly correlated
- Features are entangled
  - Multiple binding modes use different combinations of features
  - Multiple chemical series / scaffolds use the same binding mode
  - Evidence that some ligands take on multiple conformations when binding to a target
  - Any 4 out of 5 important features may be sufficient
9 More Challenges to Machine Learning
Overview (2)
- Training data and validation data are not representative
- Measurements of activity are inherently noisy
- Activity is a rare event: target populations are unbalanced
- Classification requires choosing cutoffs for activity
- There is no good measure for a successful prediction
- Many data mining methods characterize activity in ways that are meaningless to a chemist
- Data mining results must be reversible to assist a chemist in inventing new molecules that will be active (inverse QSAR)
10 Deep Challenges to Machine Learning
Overview (3)
- No free lunch theorem
- Science is different from marketing
11 No Single Computer Representation captures all the important information
- How do we characterize the electronic face that the molecule presents to the protein?
- Grid of surface or surrounding points with field calculations
- Conformational flexibility
- 3-D relationships of pharmacophores
- Complementary volumes and surfaces
- Complementary charges
- Complementary hydrogen bonding atoms
- Similar Hydrophobicity/Hydrophilicity
- Connectivity: Bonding between Atoms (2-D)
  - pharmacophore info is implicitly present to some extent
  - not biased toward any particular conformation
- Presence of molecular fragments (fingerprints)
- Other: Linear (SLN, SMILES)? Free-tree?
12 Pharmacophores
13 Representation of Chemical Structures (2D)
14 BCI Chemical Descriptors
- Descriptors are binary and represent
15 We don't have the right descriptors, but we have thousands that are easy to compute
- Thousands of molecular fragments
- Hundreds of calculated quasi-physical properties
- Hundreds of structural connectivity indicators
- Much of this information is redundant
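The redundancy noted above can be illustrated with a small sketch. It assumes binary (0/1) descriptor columns and uses the phi coefficient (Pearson correlation for binary data) with a greedy filter; the descriptor names, the toy data, and the 0.95 threshold are all hypothetical, and this is not presented as the method used in the talk.

```python
from math import sqrt

def phi(a, b):
    """Phi (Pearson) correlation between two equal-length 0/1 columns."""
    n = len(a)
    n11 = sum(x & y for x, y in zip(a, b))   # rows where both bits are set
    a1, b1 = sum(a), sum(b)                  # rows where each bit is set
    denom = sqrt(a1 * (n - a1) * b1 * (n - b1))
    if denom == 0:
        return 0.0                           # a constant column correlates with nothing
    return (n * n11 - a1 * b1) / denom

def drop_redundant(columns, threshold=0.95):
    """Greedily keep a column only if it is not (anti-)correlated
    above the threshold with a column that was already kept."""
    kept = []
    for name, col in columns.items():
        if all(abs(phi(col, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical fragment descriptors for six molecules.
columns = {
    "has_carboxyl":   [1, 0, 1, 0, 1, 0],
    "has_acid_group": [1, 0, 1, 0, 1, 0],   # duplicates has_carboxyl
    "no_carboxyl":    [0, 1, 0, 1, 0, 1],   # exact complement of has_carboxyl
    "has_ring":       [1, 1, 0, 0, 1, 0],
}
kept = drop_redundant(columns)
print(kept)
```

A filter like this only removes pairwise redundancy; it says nothing about whether the surviving descriptors are the right ones.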
16 Feature Interaction and Multiple Configurations for Activity Require Disjunctive Models
- Multiple binding modes where different combinations of features contribute to the activity (including non-competitive ligands)
- Multiple chemical series / scaffolds use the same binding mode
- Any 3 out of 4 important features may be sufficient
- Evidence that some targets require multiple conformations from a ligand in order to bind
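To make the "any 3 out of 4" point concrete, here is a minimal sketch of why such a rule needs a disjunctive model: it expands into several alternative conjunctions, none of which covers all actives on its own. The feature names are hypothetical placeholders.

```python
from itertools import combinations

def m_of_n(feature_values, m):
    """True when at least m of the boolean feature values are present."""
    return sum(bool(v) for v in feature_values) >= m

def as_disjunction(names, m):
    """Expand 'at least m of these features' into its minimal
    conjunctions, i.e. the terms of a disjunctive normal form."""
    return [set(c) for c in combinations(names, m)]

def matches(molecule_features, terms):
    """A molecule matches if it satisfies any one conjunction."""
    return any(term <= molecule_features for term in terms)

# Hypothetical pharmacophore-style features.
names = ["donor", "acceptor", "aromatic_ring", "hydrophobe"]
terms = as_disjunction(names, 3)

print(len(terms))                                          # number of alternative conjunctions
print(matches({"donor", "acceptor", "hydrophobe"}, terms)) # three of four present
print(matches({"donor", "acceptor"}, terms))               # only two present
```

With 4 features and m = 3 there are already 4 alternative conjunctions; a method that searches for one conjunctive rule can describe at most one of them.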
17 Non-competitive Binding
18 Non-competitive Binding
19 Unbalanced target populations (activity is a rare event)
- About 1% of drug-like molecules have interesting activity
- Most of our experience in classification methods is with roughly balanced classes
- Predictive methods are most accurate where they have the most data (interpolation), but where we need the most accuracy is with the extremely active compounds (extrapolation)
- Warning: Your data may look balanced
  - True population of interest
    - new and different compounds
  - Unrepresentative HTS training data
    - What chemists made in the past
  - Unrepresentative follow-up compounds for validation
    - What chemists' intuition led them to submit to testing
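A small sketch of why a 1% active rate is awkward for standard classification measures: a model that calls every compound inactive scores very high accuracy while finding no actives. The 1,000-compound population below is a hypothetical illustration, not data from the talk.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on actives) for 0/1 labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    actives = sum(y_true)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    recall = found / actives if actives else 0.0
    return correct / len(y_true), recall

# Hypothetical screen: 10 actives among 1,000 compounds (1%).
y_true = [1] * 10 + [0] * 990
always_inactive = [0] * 1000          # trivial "model"

acc, rec = accuracy_and_recall(y_true, always_inactive)
print(acc, rec)
```

The accuracy is 99% and the recall is zero, which is why accuracy alone is a poor guide when the class of interest is rare.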
20 Populations
21 Cipsline, Anti-infectives
Our models are accurate on the compounds made by our labs
23 Kappa 0.147
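The slide gives only the value, so the following is a sketch under the assumption that "Kappa" means Cohen's kappa, the chance-corrected agreement between predicted and observed classes. The confusion-matrix counts are hypothetical and do not reproduce the 0.147 reported here.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 confusion matrix:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance from the row and column totals.
    expected = ((tp + fp) / n) * ((tp + fn) / n) \
             + ((fn + tn) / n) * ((fp + tn) / n)
    if expected == 1:
        return 0.0                    # degenerate case: only one class present
    return (observed - expected) / (1 - expected)

# Hypothetical result on 1,000 compounds with 10 true actives.
tp, fp, fn, tn = 5, 20, 5, 970
accuracy = (tp + tn) / (tp + fp + fn + tn)
kappa = cohen_kappa(tp, fp, fn, tn)
print(round(accuracy, 3), round(kappa, 3))
```

In this made-up case the accuracy is 97.5% while kappa is below 0.3, because most of the agreement comes from the abundant inactives.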
24 Choosing cutoffs for activity and cutoffs for compounds to pursue
- Overlapping ranges of Inactive and Active
- Cost of missing an active vs. cost of pursuing an inactive
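One way to make the cost trade-off explicit is to scan candidate cutoffs and pick the one with the lowest total cost. This is a minimal sketch: the scores, labels, and cost values are hypothetical, and in practice the true labels are themselves uncertain.

```python
def best_cutoff(scores, labels, cost_miss, cost_pursue):
    """Return (cutoff, total_cost). Compounds with score >= cutoff are pursued.
    cost_miss   : cost of each active left below the cutoff
    cost_pursue : cost of each inactive at or above the cutoff"""
    best = None
    for cutoff in sorted(set(scores)) + [float("inf")]:
        missed = sum(1 for s, y in zip(scores, labels) if y and s < cutoff)
        wasted = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
        cost = cost_miss * missed + cost_pursue * wasted
        if best is None or cost < best[1]:
            best = (cutoff, cost)
    return best

# Hypothetical percent-inhibition scores; actives and inactives overlap.
scores = [5, 12, 20, 35, 40, 55, 60, 80]
labels = [0,  0,  0,  1,  0,  0,  1,  1]

print(best_cutoff(scores, labels, cost_miss=10, cost_pursue=1))
print(best_cutoff(scores, labels, cost_miss=1, cost_pursue=1))
```

When a missed active is ten times as costly as a wasted follow-up, the cutoff drops to 35 and two inactives are pursued; with equal costs it rises to 60 and one active is sacrificed.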
25 Ideal vs. Actual HTS Observations
27 Virtual Screening: % of actives retrieved vs. % of compounds tested
28 We use the log-linear graph to compare methods at different follow-up levels
See how 3 different methods perform at selecting 5, 50, or 500 compounds to test
[Graph legend: RP, SOM, LVQ, Reference, Random]
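The comparison in the graph can be sketched as follows: rank compounds by model score, then report the fraction of all actives found among the top 5, 50, and 500. The ranking below is a hypothetical construction (10 actives placed at fixed ranks among 1,000 compounds), not the RP, SOM, or LVQ results from the talk.

```python
def fraction_actives_retrieved(scores, labels, n_tested):
    """Fraction of all actives found among the n_tested top-scoring compounds."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total = sum(labels)
    found = sum(label for _, label in ranked[:n_tested])
    return found / total if total else 0.0

# Hypothetical model output: 1,000 compounds, actives at these ranks.
n_compounds = 1000
active_ranks = {1, 2, 4, 8, 30, 60, 120, 300, 600, 900}
scores = [n_compounds - rank for rank in range(1, n_compounds + 1)]
labels = [1 if rank in active_ranks else 0 for rank in range(1, n_compounds + 1)]

levels = [5, 50, 500]                 # log-spaced follow-up sizes
model = [fraction_actives_retrieved(scores, labels, n) for n in levels]
random_expected = [n / n_compounds for n in levels]   # expected for random picking
print(model)
print(random_expected)
```

Log-spaced follow-up levels matter because the methods differ most at the small end, where a linear axis would compress 5 and 50 compounds into the same corner of the plot.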
29 Noise in measurement of activity
- Suppose 1% active and 1% error, then our predicted actives are 50% false positives
- This is out of the range of data-mining methods (but see Identifying Mislabeled Training Data, Brodley & Friedl, JAIR, 1999)
- Luckily, the error in measuring inactives is dampened
- Methods can take advantage of the accuracy in inactive information in order to characterize actives
- On the other hand, inactives have nothing in common, except that they are the other 99%
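The 50% figure in the first bullet follows from simple counting. The sketch below assumes a symmetric error, meaning the measurement flips an active or an inactive label with the same probability; the population size of 10,000 is arbitrary.

```python
def false_positive_share(n, active_rate, error_rate):
    """Share of measured 'actives' that are really inactive,
    assuming the same mislabeling rate for both classes."""
    actives = n * active_rate
    inactives = n - actives
    true_hits = actives * (1 - error_rate)    # actives measured as active
    false_hits = inactives * error_rate       # inactives measured as active
    return false_hits / (true_hits + false_hits)

share = false_positive_share(n=10_000, active_rate=0.01, error_rate=0.01)
print(share)
```

Out of 10,000 compounds, 100 are active and 99 of them are measured as active, while 99 of the 9,900 inactives are also measured as active, so half of the apparent hits are false.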
30 Mysterious Accuracy OR Neural Networks are great, but what are they telling me?
- We have a decision to make about data mining goals
  - Do we try to outperform the chemist or engage the chemist?
- We need to assist a chemist in inventing new molecules that will be active (inverse QSAR)
- We need to characterize activity in ways that are meaningful to a chemist
31 No Free Lunch Theorem
- Proteins recognize molecules
- Proteins compute a recognition function over the set of molecules
- Proteins have a very general architecture
- Proteins can recognize very complex or very simple characteristics of molecules
- Proteins can compute any recognition function(?)
- No single data-mining/machine-learning method can outperform all others on arbitrary functions
- Therefore every new target protein requires its own modeling method
- Cheap Brunch Hypothesis
  - Maybe proteins have a bias
32 Science, Not Marketing
- We are looking for hypotheses that are worth the effort of experimental validation (not e-marketing opportunities)
- Data-mining rules and models need to be in the form of a hypothesis comparable to the chemist's hypotheses
- Chemists need tools that help them design experiments to validate or invalidate these competing hypotheses
- HTS is an experiment in need of a design
33 Conclusion
- Machine-learning tools provide an opportunity for processing the new quantities of data that a chemist is seeing
- The naïve data-mining expert has a lot to learn about chemical information
- The naïve chemist has a lot to learn about data-mining for information
34 If there are so many problems, why are we having so much fun? Maybe we've stumbled into the cheap brunch