Title: Machine Learning in Drug Design
1. Machine Learning in Drug Design
- David Page
- Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences
2. Collaborators
- Michael Waddell
- Paul Finn
- Ashwin Srinivasan
- John Shaughnessy
- Bart Barlogie
- Frank Zhan
- Stephen Muggleton
- Arno Spatola
- Sean McIlwain
- Brian Kay
3. Outline
- Overview of Drug Design
- How Machine Learning Fits Into the Process
- Target Search: Single Nucleotide Polymorphisms (SNPs)
- Machine Learning from Feature Vectors
- Decision Trees
- Support Vector Machines
- Voting/Ensembles
- Predicting Molecular Activity: Learning from Structure
4. Drugs Typically Are
- Small organic molecules that
- Modulate disease by binding to some target protein
- At a location that alters the protein's behavior (e.g., antagonist or agonist).
- Target protein might be human (e.g., ACE for blood pressure) or belong to an invading organism (e.g., surface protein of a bacterium).
5. Example of Binding
6. So To Design a Drug
- Identify Target Protein: knowledge of proteome/genome; relevant biochemical pathways.
- Determine Target Site Structure: crystallography, NMR; difficult if membrane-bound.
- Synthesize a Molecule that Will Bind: imperfect modeling of structure; structures may change at binding.
- And even then...
7. Molecule Binds Target But May
- Bind too tightly or not tightly enough.
- Be toxic.
- Have other effects (side-effects) in the body.
- Break down as soon as it gets into the body, or not leave the body soon enough.
- Not get to where it should in the body (e.g., fail to cross the blood-brain barrier).
- Not diffuse from gut to bloodstream.
8. And Every Body is Different
- Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials).
- A molecule may work for some people but not others.
- A molecule may cause harmful side-effects in some people but not others.
10. Places to use Machine Learning
- Finding target proteins.
- Inferring target site structure.
- Predicting who will respond positively/negatively.
12. Healthy vs. Disease
[Figure with two panels: Healthy, Diseased]
13. If We Could Sequence DNA Quickly and Cheaply, We Could
- Sequence DNA of people taking a drug, and use ML to identify consistent differences between those who respond well and those who do not.
- Sequence DNA of cancer cells and healthy cells, and use ML to detect dangerous mutations; proteins these genes code for may be useful targets.
- Sequence DNA of people who get a disease and those who don't, and use ML to determine genes related to susceptibility; proteins these genes code for may be useful targets.
14. Problem: Can't Sequence Quickly
- Can quickly test single positions where variation is common: Single Nucleotide Polymorphisms (SNPs).
- Can quickly test degree to which every gene is being transcribed: Gene Expression Microarrays (e.g., Affymetrix Gene Chips).
- Can (moderately) quickly test which proteins are present in a sample (Proteomics).
16. Example of SNP Data
17. Problem: SNPs are not Genes
- If we find a predictive SNP, it may not be part of a gene; we can only infer that the SNP is near a gene that may be involved in the disease.
- Even if the SNP is part of a gene, it may be another nearby gene that is the key gene.
18. Problem: Even SNPs are Costly
- Typically cannot use all known SNPs.
- Can focus on a particular chromosome and area if knowledge permits that.
- Can use a scattering of SNPs, since SNPs that are very close together may be redundant: use one SNP per haplotype block, or region where recombination is rare.
19. Why Machine Learning?
- There may be no single SNP in our data that distinguishes disease vs. healthy.
- A combination of SNPs may still be predictive, and we can gain insight from this combination.
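A toy case makes this point concrete. The sketch below uses invented genotype data (two hypothetical binary SNPs, not taken from the study) in which disease status is the exclusive-or of the two SNPs. Each SNP alone has zero information gain, yet the pair determines the label completely.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Invented toy data: disease occurs when exactly one of the two variants is present.
snp_a = [0, 0, 1, 1] * 5
snp_b = [0, 1, 0, 1] * 5
disease = [a ^ b for a, b in zip(snp_a, snp_b)]

gain_a = info_gain(snp_a, disease)                       # SNP A on its own
gain_b = info_gain(snp_b, disease)                       # SNP B on its own
gain_pair = info_gain(list(zip(snp_a, snp_b)), disease)  # the two SNPs jointly
print(gain_a, gain_b, gain_pair)
```

A learner that scores SNPs one at a time would discard both, which is one reason to consider methods that can combine features.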
21. Decision Trees in One Picture
23. Naïve Bayes in One Picture
[Figure with nodes labeled: Age, SNP 1, SNP 2, ..., SNP 3000]
24. Voting Approach
- Score SNPs using information gain.
- Choose the top-scoring 1% of SNPs.
- To classify a new case, let these SNPs vote (majority or weighted majority vote).
- We use majority vote here.
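The voting steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not say how an individual SNP casts its vote, so the sketch assumes each selected SNP votes for the class most often seen with that genotype in the training data. The function names, the `top_fraction` parameter, and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Reduction in label entropy obtained by splitting on one SNP."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def train_voter(X, y, top_fraction):
    """Keep the top-scoring SNPs and build one genotype -> class table per SNP."""
    n_snps = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_snps)]
    ranked = sorted(range(n_snps), key=lambda j: info_gain(columns[j], y), reverse=True)
    k = max(1, round(top_fraction * n_snps))
    default = Counter(y).most_common(1)[0][0]  # used for genotypes unseen in training
    tables = {}
    for j in ranked[:k]:
        counts = {}
        for genotype, label in zip(columns[j], y):
            counts.setdefault(genotype, Counter())[label] += 1
        # ASSUMPTION: a SNP votes for the class most often seen with that genotype.
        tables[j] = {g: c.most_common(1)[0][0] for g, c in counts.items()}
    return tables, default

def predict(model, row):
    """Unweighted majority vote over the selected SNPs."""
    tables, default = model
    votes = Counter(table.get(row[j], default) for j, table in tables.items())
    return votes.most_common(1)[0][0]

# Invented toy data: 6 patients x 4 SNPs, genotypes coded 0/1/2.
X = [[2, 1, 0, 2], [2, 1, 1, 2], [2, 0, 2, 1],
     [0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 2, 0]]
y = [1, 1, 1, 0, 0, 0]
model = train_voter(X, y, top_fraction=0.75)  # keeps 3 of the 4 toy SNPs
print(sorted(model[0]), predict(model, [2, 0, 1, 2]), predict(model, [0, 1, 2, 0]))
```

Keeping an odd number of voters avoids ties in a two-class majority vote; a weighted vote would replace the unit count in `predict` with a per-SNP weight such as its information gain.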
25. Task: Predict Early Onset Disease From SNP Data
- Only 3000 SNPs, coarsely sampled over entire genome.
- 80 patients (examples), 40 with early onset.
- Using technology from Orchid.
- Can a predictor be learned that performs significantly better than chance on unseen data?
26. Results
- Use all data, only top 1% of features, or only top 10% of features (according to the decision tree purity measure).
- Use Trees, SVMs, Voting.
- SVMs with top 10% achieve 71% accuracy. Significantly better than chance (50%).
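One way to check a claim of this kind is an exact one-sided binomial test against the 50% chance rate. The slides do not say which test was used, and the number of correct predictions is not given, so the count of 57 below is an assumption (71% of 80 patients, rounded). The test also treats the 80 predictions as independent, which holds only approximately for cross-validated predictions.

```python
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p): a one-sided exact test."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

n_patients = 80   # from the task description
n_correct = 57    # ASSUMPTION: 71% of 80, rounded; the exact count is not stated
p_value = binomial_tail(n_correct, n_patients)
print(f"accuracy = {n_correct / n_patients:.3f}, one-sided p = {p_value:.2e}")
```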
27. Lessons
- Feature selection is important for performance.
- Methodology note for machine learning specialists: must repeat this entire process on each fold of cross-validation, or results will be overly optimistic.
- SNP approach is promising: get funding to measure more SNPs.
- More work needed on SVM comprehensibility.
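The methodology note can be illustrated on pure noise. The sketch below is not the study's pipeline: it uses invented random binary features (3000 features, 80 examples, mirroring the task's dimensions), a simple agreement score instead of information gain, and a majority-vote classifier. Because no feature is truly related to the label, an honest accuracy estimate should sit near 50%. Selecting features once on all the data, before cross-validation, lets the test labels leak into the model.

```python
import random
from collections import Counter

def agreement(column, labels):
    """Score a binary feature by how far its agreement with the labels departs from 50%."""
    matches = sum(x == y for x, y in zip(column, labels))
    return abs(matches / len(labels) - 0.5)

def select_features(X, y, k):
    """Indices of the k best-scoring features, computed from the given rows only."""
    scores = [agreement([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

def fit_votes(X, y, features):
    """For each feature, record whether it votes for its own value (True) or the opposite."""
    model = {}
    for j in features:
        matches = sum(row[j] == label for row, label in zip(X, y))
        model[j] = matches * 2 >= len(y)
    return model

def predict(model, row):
    votes = Counter(row[j] if same else 1 - row[j] for j, same in model.items())
    return votes.most_common(1)[0][0]

def cross_validate(X, y, k_features, n_folds, select_inside_fold):
    idx = list(range(len(y)))
    # Selecting on ALL rows uses the test labels: this is the leak.
    leaky_features = None if select_inside_fold else select_features(X, y, k_features)
    correct = 0
    for f in range(n_folds):
        test = idx[f::n_folds]
        train = [i for i in idx if i % n_folds != f]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        feats = select_features(Xtr, ytr, k_features) if select_inside_fold else leaky_features
        model = fit_votes(Xtr, ytr, feats)
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(y)

# Invented pure-noise data: no feature carries real signal.
rng = random.Random(0)
X = [[rng.randint(0, 1) for _ in range(3000)] for _ in range(80)]
y = [1] * 40 + [0] * 40
rng.shuffle(y)
leaky = cross_validate(X, y, k_features=31, n_folds=10, select_inside_fold=False)
honest = cross_validate(X, y, k_features=31, n_folds=10, select_inside_fold=True)
print(f"selection outside folds: {leaky:.2f}, inside folds: {honest:.2f}")
```

With thousands of noise features and only 80 examples, some features agree with the labels by chance; choosing them with the test labels in view is what inflates the first estimate.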
29. Places to use Machine Learning
- Finding target proteins.
- Inferring target site structure.
- Predicting who will respond positively/negatively.
30. Typical Practice when Target Structure is Unknown
- Test many molecules (1,000,000) to find some that bind to target (ligands).
- Infer (induce) shape of target site from 3D structural similarities.
- Shared 3D substructure is called a pharmacophore.
- Perfect example of a machine learning task with spatial target.
31. An Example of Structure Learning
[Figure with two groups of molecules: Inactive, Active]
32. Inductive Logic Programming
- Represents data points in mathematical logic
- Uses Background Knowledge
- Returns results in logic
33. The Logical Representation of a Pharmacophore
34. Background Knowledge I
- Information about atoms and bonds in the molecules

    atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
    atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
    atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
    bond(m1,a1,a2,1).
    bond(m1,a2,a3,1).
35. Background knowledge II
- Definition of distance equivalence

    % dist/5 holds when Atom1 and Atom2 are Dist apart, to within Error.
    dist(Drug,Atom1,Atom2,Dist,Error) :-
        number(Error),
        coord(Drug,Atom1,X1,Y1,Z1),
        coord(Drug,Atom2,X2,Y2,Z2),
        euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
        Diff is Dist1-Dist,
        absolute_value(Diff,E1),
        E1 < Error.

    % Euclidean distance between two 3D points.
    euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
        Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
        D is sqrt(Dsq).
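For readers less familiar with logic programs, the same distance test can be written as an ordinary function. This is a rough Python equivalent, assuming the atoms of one molecule are held in a dictionary; the name `within_tolerance` and the dictionary layout are hypothetical. The coordinates are the three `atm` facts given for molecule m1 on the Background Knowledge I slide.

```python
from math import dist as euclidean

def within_tolerance(coords, atom1, atom2, target, error):
    """True if the two atoms lie `target` apart, to within `error` (same units as coords)."""
    return abs(euclidean(coords[atom1], coords[atom2]) - target) < error

# Coordinates of molecule m1, copied from the atm facts in the background knowledge.
m1 = {
    "a1": (5.915800, -2.441200, 1.799700),
    "a2": (0.574700, -2.773300, 0.337600),
    "a3": (0.408000, -3.511700, -1.314000),
}

print(round(euclidean(m1["a1"], m1["a2"]), 3))
print(within_tolerance(m1, "a1", "a2", target=5.5, error=0.75))
print(within_tolerance(m1, "a1", "a2", target=7.0, error=0.75))
```

Unlike the Prolog predicate, this function only checks a proposed distance; it cannot be run "backwards" to enumerate atom pairs, which is part of what the logical representation offers the learner.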
36. Central Idea: Generalize by searching a lattice
37. Conformational model
- Conformational flexibility modelled as multiple conformations
- Sybyl RandomSearch
- Catalyst
38. Pharmacophore description
- Atom and site centred
- Hydrogen bond donor
- Hydrogen bond acceptor
- Hydrophobe
- Site points (limited at present)
- User definable
- Distance based
39. Example 1: Dopamine agonists
- Agonists taken from Martin data set on QSAR society web pages
- Examples (5-50 conformations/molecule)
40. Pharmacophore identified
- Molecule A has the desired activity if
- in conformation B molecule A contains a hydrogen acceptor at C, and
- in conformation B molecule A contains a basic nitrogen group at D, and
- the distance between C and D is 7.05966 +/- 0.75 Angstroms, and
- in conformation B molecule A contains a hydrogen acceptor at E, and
- the distance between C and E is 2.80871 +/- 0.75 Angstroms, and
- the distance between D and E is 6.36846 +/- 0.75 Angstroms, and
- in conformation B molecule A contains a hydrophobic group at F, and
- the distance between C and F is 2.68136 +/- 0.75 Angstroms, and
- the distance between D and F is 4.80399 +/- 0.75 Angstroms, and
- the distance between E and F is 2.74602 +/- 0.75 Angstroms.
41. Example II: ACE inhibitors
- 28 angiotensin converting enzyme inhibitors taken from literature
- D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 3-16 (1987)
42. Experiment 1
- Attempt to identify pharmacophore using original Mayer et al. data (final conformations).
- Initial failed attempt traced to bugs in background knowledge definition.
- 4 pharmacophores found with corrected code (variations on common theme)
43. ACE pharmacophore
- Molecule A is an ACE inhibitor if
- molecule A contains a zinc-site B,
- molecule A contains a hydrogen acceptor C,
- the distance between B and C is 7.899 +/- 0.750 Å,
- molecule A contains a hydrogen acceptor D,
- the distance between B and D is 8.475 +/- 0.750 Å,
- the distance between C and D is 2.133 +/- 0.750 Å,
- molecule A contains a hydrogen acceptor E,
- the distance between B and E is 4.891 +/- 0.750 Å,
- the distance between C and E is 3.114 +/- 0.750 Å,
- the distance between D and E is 3.753 +/- 0.750 Å.
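Once a rule like this has been learned, testing it against a new molecule is a small constraint-satisfaction problem. The sketch below is a brute-force checker for one conformation, not the ILP search that produced the rule. The feature types, distances, and tolerance come from the rule above; the function name `matches`, the data layout, and the example coordinates are invented for illustration.

```python
from itertools import permutations
from math import dist

# The ACE rule: one feature type per variable, and pairwise distances in Angstroms.
TYPES = {"B": "zinc_site", "C": "h_acceptor", "D": "h_acceptor", "E": "h_acceptor"}
DISTANCES = {("B", "C"): 7.899, ("B", "D"): 8.475, ("C", "D"): 2.133,
             ("B", "E"): 4.891, ("C", "E"): 3.114, ("D", "E"): 3.753}
TOLERANCE = 0.750

def matches(features, types=TYPES, distances=DISTANCES, tol=TOLERANCE):
    """features: list of (type, (x, y, z)) for one conformation.
    Returns the first variable -> feature-index assignment satisfying the rule, or None."""
    names = list(types)
    for combo in permutations(range(len(features)), len(names)):
        assign = dict(zip(names, combo))
        if any(features[assign[v]][0] != types[v] for v in names):
            continue  # wrong kind of chemical feature for this variable
        if all(abs(dist(features[assign[a]][1], features[assign[b]][1]) - d) < tol
               for (a, b), d in distances.items()):
            return assign
    return None

# Invented conformation whose geometry is close to the rule.
candidate = [("zinc_site", (0.0, 0.0, 0.0)),
             ("h_acceptor", (7.9, 0.0, 0.0)),
             ("h_acceptor", (8.2, 2.1, 0.0)),
             ("h_acceptor", (4.85, 0.5, 0.4))]
# Same acceptors, but the zinc site moved far away.
decoy = [("zinc_site", (20.0, 0.0, 0.0))] + candidate[1:]

print(matches(candidate))
print(matches(decoy))
```

With a 0.75 Angstrom tolerance, more than one assignment of the three acceptors can satisfy the rule for the same conformation; the checker simply returns the first one it finds. For a molecule with several conformations, the check would be repeated per conformation.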
44. Pharmacophore discovered
[Figure labels: Zinc site, H-bond acceptor]
45. Experiment 2
- Definition of zinc ligand added to background knowledge
- based on crystallographic data
- Multiple conformations
- Sybyl RandomSearch
46. Experiment 2
- Original pharmacophore rediscovered plus one other
- different zinc ligand position
- similar to alternative proposed by Ciba-Geigy
47. Example III: Thermolysin inhibitors
- 10 inhibitors for which crystallographic data is available in PDB
- Conformationally challenging molecules
- Experimentally observed superposition
48. Key binding site interactions
[Figure labels: Asn112-NH, OC Asn112, S2, Arg203-NH, S1, OC Ala113, Zn]
49. Interactions made by inhibitors
50. Pharmacophore Identification
- Structures considered: 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMN
- Conformational analysis using Best conformer generation in Catalyst
- 98-251 conformations/molecule
51. Thermolysin Results
- Ten 5-point pharmacophores identified, falling into 2 groups (7/10 molecules)
- 3 acceptors, 1 hydrophobe, 1 donor
- 4 acceptors, 1 donor
- Common core of Zn ligands, Arg203 and Asn112 interactions identified
- Correct assignments of functional groups
- Correct geometry to 1 Angstrom tolerance
52. Thermolysin results
- Increasing tolerance to 1.5 Angstroms finds common 6-point pharmacophore including one extra interaction
53. Example IV: Antibacterial peptides
- Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa
- 6 actives (<64 mg/ml IC50)
- 5 inactives
54. Pharmacophore Identified
- A molecule M is active against Pseudomonas aeruginosa if it has a conformation B such that
- M has a hydrophobic group C,
- M has a hydrogen acceptor D,
- the distance between C and D in conformation B is 11.7 Angstroms,
- M has a positively-charged atom E,
- the distance between C and E in conformation B is 4 Angstroms,
- the distance between D and E in conformation B is 9.4 Angstroms,
- M has a positively-charged atom F,
- the distance between C and F in conformation B is 11.1 Angstroms,
- the distance between D and F in conformation B is 12.6 Angstroms,
- the distance between E and F in conformation B is 8.7 Angstroms.
- Tolerance: 1.5 Angstroms
56. Ongoing ILP developments (pharmacophores)
- Continue to extend method validation
- Extending to combinatorial mixtures
- Quantitative models
- Mixing different datatypes in background knowledge
- Developing graphical front-end
57. Ongoing developments (Other)
- Analysis of HTS datasets
- Analysis of drug-likeness
- Derivation of new descriptors
- e.g., empirical binding functions