Title: 6. Machine Learning and Other Predictive Methods
Chemical Space

              Stars        Small Molecules
Existing      ~10^22       ~10^7
Virtual       0            ~10^60 (?)
Mode          Real         Virtual
Access        Difficult    Easy
Predictive Methods
- Predict physical, chemical, and biological properties
- For example: 3D structure, NMR and mass spectra, boiling point, melting point, solubility, log P, toxicity, reaction rates, binding affinities, QSAR
- Dock PDB to PubChem
Methods
- Spectrum of methods
- Schrödinger Equation
- Molecular Dynamics
- Machine Learning (e.g. secondary structure prediction)
Chemical Informatics
- Informatics must be able to deal with variable-size structured data
- Graphical Models
- (Recursive) Neural Networks
- ILP (Inductive Logic Programming)
- GA (Genetic Algorithms)
- SGs
- Kernels
Neural Networks
- Feedforward applied to fingerprints (1D); see the sketch after this list
- Recursive applied to bond graph (2D)
- Directed Acyclic Graph
- State vectors
- Weight sharing
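A minimal sketch of the first idea above: a feedforward network applied to fixed-length fingerprint vectors. The random fingerprints, labels, and network size are placeholders for illustration; the recursive network over the bond DAG is not sketched here.

```python
# Minimal sketch: feedforward classifier on fixed-length binary fingerprints.
# Fingerprints, labels, and network size are placeholders (assumption), not real data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))   # 200 molecules, 1024-bit fingerprints
y = rng.integers(0, 2, size=200)           # e.g. toxic / non-toxic labels (dummy)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```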
Chemo/Bio Informatics
- Two Key Ingredients
- 1. Data
- 2. Similarity Measures
- Bioinformatics analogy and differences
- Data (GenBank, Swissprot, PDB)
- Similarity (BLAST)
Fundamental Importance of Similarity Measures
- Rapid Search of Large Databases
- Protein Receptor (Docking)
- Small Molecule/Ligand (Similarity)
- Predictive Methods (Kernel Methods)
Classification
- Learning to Classify
- Limited number of training examples (molecules, patients, sequences, etc.)
- Learning algorithm (how to build the classifier?)
- Generalization: should correctly classify test data
- Formalization
- X is the input space
- Y (e.g. toxic/non-toxic, or +1/-1) is the target class
- f: X → Y is the classifier
Linear Classifiers
Classification
- Fundamental Point
- f is entirely determined by the dot products <x_i, x_j>, which measure the similarity between pairs of data points (see the sketch below)
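A small sketch of this point, assuming a toy linearly separable dataset: the dual-form perceptron below touches the data only through dot products <x_i, x_j>, both during training and at prediction time.

```python
# Sketch: dual-form perceptron -- data enter only through dot products <x_i, x_j>.
import numpy as np

def dual_perceptron(X, y, epochs=10):
    """X: (n, d) inputs, y: labels in {-1, +1}. Returns dual coefficients alpha."""
    G = X @ X.T                                   # Gram matrix of all pairwise dot products
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            if np.sign(np.sum(alpha * y * G[:, i])) != y[i]:
                alpha[i] += 1.0                   # mistake-driven update
    return alpha

def predict(X_train, y, alpha, x_new):
    # prediction also needs only dot products between x_new and the training points
    return np.sign(np.sum(alpha * y * (X_train @ x_new)))

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])  # toy data (assumption)
y = np.array([1, 1, -1, -1])
alpha = dual_perceptron(X, y)
print(predict(X, y, alpha, np.array([1.0, 1.0])))  # expected +1
```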
Non-Linear Classification (Kernel Methods)
- We can transform a nonlinear problem into a linear one using a kernel.
Non-Linear Classification (Kernel Methods)
- We can transform a nonlinear problem into a linear one using a kernel K.
- Fundamental property: the linear decision surface depends on K(x_i, x_j) = <φ(x_i), φ(x_j)>.
- All we need is the Gram (similarity) matrix K; K defines the local metric of the embedding space (see the sketch below).
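An illustrative sketch of the point that the Gram matrix is all that is needed, using scikit-learn's SVC with a precomputed kernel. The Gaussian (RBF) kernel and random data here are stand-ins, not the chemical kernels discussed in these slides.

```python
# Sketch: train an SVM directly from a precomputed Gram matrix K.
# The RBF kernel and random data are illustrative stand-ins (assumption).
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 > 1.0).astype(int)  # nonlinear labels
X_test = rng.normal(size=(10, 3))

K_train = rbf_kernel(X_train, X_train)   # (40, 40) Gram matrix
K_test = rbf_kernel(X_test, X_train)     # (10, 40) test-vs-train similarities

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test))
```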
Finding a Good Kernel
- Given: two molecules.
- Task: systematically compute a relevant similarity while being storage/time efficient.
- Motivation: enable efficient application of search and kernel algorithms.
Similarity Data Representations
NC(O)C(O)O
1D SMILES Kernel
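One simple way to build a string kernel on SMILES is to count shared substrings (k-mers); the spectrum-kernel sketch below illustrates the idea on the two SMILES strings that appear in these slides, but it is not necessarily the exact 1D kernel used here.

```python
# Sketch: a simple k-mer (spectrum) kernel between SMILES strings.
# Illustrative only -- not necessarily the exact 1D SMILES kernel from the slides.
from collections import Counter

def kmer_counts(smiles, k=3):
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1 if m in c2)   # dot product of k-mer counts

print(spectrum_kernel("NC(O)C(O)O", "NC(CO)C(O)O"))
```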
2D Molecule Graph Kernel
- For chemical compounds:
  - atom/node labels: A = {C, N, O, H, ...}
  - bond/edge labels: B = {s (single), d (double), t (triple), ar (aromatic), ...}
- Count labeled paths
- Fingerprints, e.g. the labeled path CsNsCdO (see the sketch below)
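A sketch of the path-counting idea: enumerate labeled paths up to a fixed depth in the bond graph and compare the resulting count fingerprints, here with a MinMax similarity (as in the result tables later). The tiny example graphs, labels, and depth are illustrative assumptions.

```python
# Sketch: labeled-path fingerprints on a bond graph + MinMax similarity.
# The tiny example graphs, labels, and depth are illustrative assumptions.
from collections import Counter

def labeled_paths(atoms, bonds, depth):
    """atoms: {node: label}; bonds: {(i, j): label}, undirected. Returns path-label counts."""
    adj = {}
    for (i, j), b in bonds.items():
        adj.setdefault(i, []).append((j, b))
        adj.setdefault(j, []).append((i, b))
    counts = Counter()
    def walk(node, label, visited, d):
        counts[label] += 1                        # record the labeled path seen so far
        if d == depth:
            return
        for nxt, b in adj.get(node, []):
            if nxt not in visited:                # simple paths only
                walk(nxt, label + b + atoms[nxt], visited | {nxt}, d + 1)
    for start in atoms:
        walk(start, atoms[start], {start}, 0)
    return counts

def minmax(c1, c2):
    keys = set(c1) | set(c2)
    den = sum(max(c1[k], c2[k]) for k in keys)
    return sum(min(c1[k], c2[k]) for k in keys) / den if den else 0.0

mol_a = labeled_paths({0: "N", 1: "C", 2: "O"}, {(0, 1): "s", (1, 2): "d"}, depth=2)
mol_b = labeled_paths({0: "C", 1: "C", 2: "O"}, {(0, 1): "s", (1, 2): "d"}, depth=2)
print(minmax(mol_a, mol_b))
```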
Similarity for Binary Fingerprints
- Tally features:
  - unique to each fingerprint (a, b)
  - in common (c)
- Similarity formulas (see the sketch below):
  - Tanimoto = c / (a + b + c)
  - Tversky(α, β) = c / (αa + βb + c)
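The two formulas above, computed on binary fingerprints represented as Python sets of "on" bit positions; the toy fingerprints are placeholders.

```python
# Tanimoto and Tversky similarity for binary fingerprints given as sets of "on" bits.
# The toy fingerprints below are placeholders (assumption).
def tanimoto(fp1, fp2):
    c = len(fp1 & fp2)                       # features in common
    a = len(fp1 - fp2)                       # features unique to fp1
    b = len(fp2 - fp1)                       # features unique to fp2
    return c / (a + b + c) if (a + b + c) else 0.0

def tversky(fp1, fp2, alpha=0.5, beta=0.5):
    c = len(fp1 & fp2)
    a = len(fp1 - fp2)
    b = len(fp2 - fp1)
    den = alpha * a + beta * b + c
    return c / den if den else 0.0

fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
print(tanimoto(fp1, fp2), tversky(fp1, fp2))
```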
Similarity Measures
3D Coordinate Kernel
Datasets
Examples of Results: Mutag and PTC
Results
Example of Results (NCI)
Example of Results: NCI
(Accuracy/ROC plots)
Comparison of Kernels (NCI)
Regression: Aqueous Solubility
30-fold cross-validation, Delaney dataset, 1440 examples

Kernel                               R²     RMSE   MAE
1D AS                                0.88   0.75   0.55
1D VAS (weight factor 1)             0.89   0.71   0.53
2D MinMax (depth 2, no cycles)       0.92   0.61   0.44
2D Tanimoto (depth 10, no cycles)    0.86   0.79   0.56
2.5D Tanimoto (depth 4)              0.77   1.02   0.72
3D CH (bin width 0.1)                0.83   0.87   0.67
Published results (train-test): 0.69, 0.75
XLogP
40-fold cross-validation, dataset size 1991

Kernel                               R²     RMSE   MAE
1D AS                                0.91   0.47   0.32
1D VAS (weight factor 1)             0.91   0.46   0.33
2D MinMax (depth 5, no cycles)       0.94   0.39   0.25
2D Tanimoto (depth 10, cycles)       0.88   0.54   0.35
3D CH (bin width 0.05)               0.67   0.88   0.68
S. J. Swamidass, J. Chen, P. Phung, J. Bruand, L. Ralaivola, and P. Baldi. Kernels for Small Molecules and the Prediction of Mutagenicity, Toxicity, and Anti-Cancer Activity. Proceedings of the 2005 Conference on Intelligent Systems for Molecular Biology (ISMB 2005). Bioinformatics, 21, Supplement 1, i359-i368, 2005.
Additional Representations
- 1D: SMILES string (e.g. NC(CO)C(O)O)
- 2D: atomic connection table
- 3D: XYZ coordinates of labeled points
- 2.5D: 2D surface in 3D space
- 4D: bag of conformers as XYZ coordinates of labeled points
- 3.5D: bag of conformers of 2D surfaces in 3D space
- Multiple conformers
2.5D Surface Kernel
- Build a graph G (V = atoms) which approximates the surface (convex hull).
- Use spectral graph kernels on G.
2.5D Surface Kernel
- Compute a regular/Delaunay tessellation (tetrahedrization) of the convex hull of the atoms in the molecule.
- Use the alpha-shape algorithm to detect surface triangles at the relevant scale (keep interior and regular edges, remove singular edges; r on the order of the water/carbon radius).
- This yields a triangulated graph that approximates the surface (average degree 6).
- Use a spectral kernel with paths (length 3-4) on the triangulated surface graph.
Alpha Shape
- The shape formed by a set of points.
- Closely related to the solvent-accessible surface.
- Calculated in O(n log n) using CGAL: http://www.cgal.org/Manual/doc_html/cgal_manual/Alpha_shapes_3/Chapter_main.html
- See the sketch below.
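The slides use CGAL for the alpha shape; as a rough illustrative sketch in Python (an assumption, not the pipeline above), one can approximate a 3D alpha complex by computing a Delaunay tetrahedrization with scipy and keeping only the tetrahedra whose circumradius is at most alpha. The regular/singular edge classification from the previous slide is not handled here.

```python
# Rough sketch (assumption; the slides use CGAL): approximate a 3D alpha complex by
# keeping Delaunay tetrahedra whose circumradius is <= alpha. Edge classification
# (regular vs. singular) from the previous slide is not handled here.
import numpy as np
from scipy.spatial import Delaunay

def circumradius(p):
    """Circumradius of a tetrahedron given its 4 vertices, p of shape (4, 3)."""
    A = 2.0 * (p[1:] - p[0])                      # 3x3 linear system for the circumcenter
    b = (p[1:] ** 2).sum(axis=1) - (p[0] ** 2).sum()
    center = np.linalg.solve(A, b)
    return np.linalg.norm(center - p[0])

def alpha_complex_tetrahedra(points, alpha):
    tri = Delaunay(points)                        # Delaunay tetrahedrization
    keep = [s for s in tri.simplices if circumradius(points[s]) <= alpha]
    return np.array(keep)

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))                    # placeholder "atom" coordinates
print(len(alpha_complex_tetrahedra(pts, alpha=1.0)))
```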
The Conformer Problem
- Atoms connected by proximity
- Different conformers have different graphs and
features.
2.5D + Conformers → 3.5D
(Figure: conformers of Molecule A and Molecule B)
Molecular Representations and Kernels
- 1D SMILES strings
- 2D Graph of bonds
- 2D Surfaces
- 2.5D Conformers
- 3D Atomic coordinates
- (Pharmacophores, Epitopes)
- 3.5D Conformers
- 4D Temporal evolution
- 4D Isomers
Summary
- ChemDB and other resources
- Variety of kernels for small molecules
- State-of-the-art performance on several benchmark datasets
- For now, 2D kernels slightly better than 1D and 3D kernels
- Many possible extensions: 2.5D, 3D, 3.5D, 4D kernels
- Need for larger data sets and new models of cooperation in the chemistry community
- Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules/reactions, retrosynthesis, prediction of reaction rates, information retrieval from literature, docking, matching table of all proteins against all known compounds, origin of life, etc.)