Title: 6. Machine Learning and Other Predictive Methods
Chemical Space

              Stars        Small Molecules
Existing      ~10^22       ~10^7
Virtual       0            ~10^60 (?)
Mode          Real         Virtual
Access        Difficult    Easy
Predictive Methods
- Predict physical, chemical, and biological properties
- For example: 3D structure, NMR and mass spectra, boiling point, melting point, solubility, log P, toxicity, reaction rates, binding affinities, QSAR
- Dock PDB to PubChem
Methods
- Spectrum of methods
- Schrödinger Equation
- Molecular Dynamics
- Machine Learning (e.g. secondary structure prediction)
Chemical Informatics
- Informatics must be able to deal with variable-size structured data
- Graphical Models
- (Recursive) Neural Networks
- ILP (Inductive Logic Programming)
- GA (Genetic Algorithms)
- SGs
- Kernels
Neural Networks
- Feedforward applied to fingerprints (1D); see the sketch after this list
- Recursive applied to bond graph (2D)
- Directed Acyclic Graph
- State vectors
- Weight sharing
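A minimal sketch of the first idea above: a feedforward network applied to fixed-length fingerprint vectors. The random fingerprints, labels, and network size are placeholders for illustration; the recursive network over the bond DAG is not sketched here.

```python
# Minimal sketch: feedforward classifier on fixed-length binary fingerprints.
# Fingerprints, labels, and network size are placeholders (assumption), not real data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))   # 200 molecules, 1024-bit fingerprints
y = rng.integers(0, 2, size=200)           # e.g. toxic / non-toxic labels (dummy)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```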
Chemo/Bio Informatics
- Two Key Ingredients
- 1. Data
- 2. Similarity Measures
- Bioinformatics analogy and differences
- Data (GenBank, Swissprot, PDB)
- Similarity (BLAST)
Fundamental Importance of Similarity Measures
- Rapid Search of Large Databases
- Protein Receptor (Docking)
- Small Molecule/Ligand (Similarity)
- Predictive Methods (Kernel Methods)
Classification
- Learning to Classify
- Limited number of training examples (molecules, patients, sequences, etc.)
- Learning algorithm (how to build the classifier?)
- Generalization: should correctly classify test data
- Formalization
- X is the input space
- Y (e.g. toxic/non-toxic, or +1/-1) is the target class
- f: X → Y is the classifier
Linear Classifiers
Classification
- Fundamental Point
- f is entirely determined by the dot products <x_i, x_j>, which measure the similarity between pairs of data points (see the sketch below)
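A small sketch of this point, assuming a toy linearly separable dataset: the dual-form perceptron below touches the data only through dot products <x_i, x_j>, both during training and at prediction time.

```python
# Sketch: dual-form perceptron -- data enter only through dot products <x_i, x_j>.
import numpy as np

def dual_perceptron(X, y, epochs=10):
    """X: (n, d) inputs, y: labels in {-1, +1}. Returns dual coefficients alpha."""
    G = X @ X.T                                   # Gram matrix of all pairwise dot products
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            if np.sign(np.sum(alpha * y * G[:, i])) != y[i]:
                alpha[i] += 1.0                   # mistake-driven update
    return alpha

def predict(X_train, y, alpha, x_new):
    # prediction also needs only dot products between x_new and the training points
    return np.sign(np.sum(alpha * y * (X_train @ x_new)))

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])  # toy data (assumption)
y = np.array([1, 1, -1, -1])
alpha = dual_perceptron(X, y)
print(predict(X, y, alpha, np.array([1.0, 1.0])))  # expected +1
```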
Non-Linear Classification (Kernel Methods)
- We can transform a nonlinear problem into a linear one using a kernel.
Non-Linear Classification (Kernel Methods)
- We can transform a nonlinear problem into a linear one using a kernel K.
- Fundamental property: the linear decision surface depends on K(x_i, x_j) = <φ(x_i), φ(x_j)>.
- All we need is the Gram (similarity) matrix K; K defines the local metric of the embedding space (see the sketch below).
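An illustrative sketch of the point that the Gram matrix is all that is needed, using scikit-learn's SVC with a precomputed kernel. The Gaussian (RBF) kernel and random data here are stand-ins, not the chemical kernels discussed in these slides.

```python
# Sketch: train an SVM directly from a precomputed Gram matrix K.
# The RBF kernel and random data are illustrative stand-ins (assumption).
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 > 1.0).astype(int)  # nonlinear labels
X_test = rng.normal(size=(10, 3))

K_train = rbf_kernel(X_train, X_train)   # (40, 40) Gram matrix
K_test = rbf_kernel(X_test, X_train)     # (10, 40) test-vs-train similarities

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test))
```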
Finding a Good Kernel
- Given: two molecules.
- Task: systematically compute a relevant similarity while being storage/time efficient.
- Motivation: enable efficient application of search and kernel algorithms.
Similarity Data Representations
NC(O)C(O)O
1D SMILES Kernel
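One simple way to build a string kernel on SMILES is to count shared substrings (k-mers); the spectrum-kernel sketch below illustrates the idea on the two SMILES strings that appear in these slides, but it is not necessarily the exact 1D kernel used here.

```python
# Sketch: a simple k-mer (spectrum) kernel between SMILES strings.
# Illustrative only -- not necessarily the exact 1D SMILES kernel from the slides.
from collections import Counter

def kmer_counts(smiles, k=3):
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1 if m in c2)   # dot product of k-mer counts

print(spectrum_kernel("NC(O)C(O)O", "NC(CO)C(O)O"))
```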
2D Molecule Graph Kernel
- For chemical compounds:
  - atom/node labels: A = {C, N, O, H, ...}
  - bond/edge labels: B = {s (single), d (double), t (triple), ar (aromatic), ...}
- Count labeled paths
- Fingerprints, e.g. the labeled path CsNsCdO (see the sketch below)
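A sketch of the path-counting idea: enumerate labeled paths up to a fixed depth in the bond graph and compare the resulting count fingerprints, here with a MinMax similarity (as in the result tables later). The tiny example graphs, labels, and depth are illustrative assumptions.

```python
# Sketch: labeled-path fingerprints on a bond graph + MinMax similarity.
# The tiny example graphs, labels, and depth are illustrative assumptions.
from collections import Counter

def labeled_paths(atoms, bonds, depth):
    """atoms: {node: label}; bonds: {(i, j): label}, undirected. Returns path-label counts."""
    adj = {}
    for (i, j), b in bonds.items():
        adj.setdefault(i, []).append((j, b))
        adj.setdefault(j, []).append((i, b))
    counts = Counter()
    def walk(node, label, visited, d):
        counts[label] += 1                        # record the labeled path seen so far
        if d == depth:
            return
        for nxt, b in adj.get(node, []):
            if nxt not in visited:                # simple paths only
                walk(nxt, label + b + atoms[nxt], visited | {nxt}, d + 1)
    for start in atoms:
        walk(start, atoms[start], {start}, 0)
    return counts

def minmax(c1, c2):
    keys = set(c1) | set(c2)
    den = sum(max(c1[k], c2[k]) for k in keys)
    return sum(min(c1[k], c2[k]) for k in keys) / den if den else 0.0

mol_a = labeled_paths({0: "N", 1: "C", 2: "O"}, {(0, 1): "s", (1, 2): "d"}, depth=2)
mol_b = labeled_paths({0: "C", 1: "C", 2: "O"}, {(0, 1): "s", (1, 2): "d"}, depth=2)
print(minmax(mol_a, mol_b))
```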
Similarity for Binary Fingerprints
- Tally features:
  - unique to each fingerprint (a, b)
  - in common (c)
- Similarity formulas (see the sketch below):
  - Tanimoto = c / (a + b + c)
  - Tversky(α, β) = c / (αa + βb + c)
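The two formulas above, computed on binary fingerprints represented as Python sets of "on" bit positions; the toy fingerprints are placeholders.

```python
# Tanimoto and Tversky similarity for binary fingerprints given as sets of "on" bits.
# The toy fingerprints below are placeholders (assumption).
def tanimoto(fp1, fp2):
    c = len(fp1 & fp2)                       # features in common
    a = len(fp1 - fp2)                       # features unique to fp1
    b = len(fp2 - fp1)                       # features unique to fp2
    return c / (a + b + c) if (a + b + c) else 0.0

def tversky(fp1, fp2, alpha=0.5, beta=0.5):
    c = len(fp1 & fp2)
    a = len(fp1 - fp2)
    b = len(fp2 - fp1)
    den = alpha * a + beta * b + c
    return c / den if den else 0.0

fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
print(tanimoto(fp1, fp2), tversky(fp1, fp2))
```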
Similarity Measures
3D Coordinate Kernel
Datasets
Examples of Results: Mutag and PTC
Results
Example of Results (NCI)
Example of Results: NCI
(Accuracy/ROC plots)
Comparison of Kernels (NCI)
Regression: Aqueous Solubility
30-fold cross-validation, Delaney dataset, 1440 examples

Kernel                               R²     RMSE   MAE
1D AS                                0.88   0.75   0.55
1D VAS (weight factor 1)             0.89   0.71   0.53
2D MinMax (depth 2, no cycles)       0.92   0.61   0.44
2D Tanimoto (depth 10, no cycles)    0.86   0.79   0.56
2.5D Tanimoto (depth 4)              0.77   1.02   0.72
3D CH (bin width 0.1)                0.83   0.87   0.67
Published results (train-test): 0.69, 0.75
XLogP
40-fold cross-validation, dataset size 1991

Kernel                               R²     RMSE   MAE
1D AS                                0.91   0.47   0.32
1D VAS (weight factor 1)             0.91   0.46   0.33
2D MinMax (depth 5, no cycles)       0.94   0.39   0.25
2D Tanimoto (depth 10, cycles)       0.88   0.54   0.35
3D CH (bin width 0.05)               0.67   0.88   0.68
S. J. Swamidass, J. Chen, P. Phung, J. Bruand, L. Ralaivola, and P. Baldi. Kernels for Small Molecules and the Prediction of Mutagenicity, Toxicity, and Anti-Cancer Activity. Proceedings of the 2005 Conference on Intelligent Systems for Molecular Biology (ISMB 2005). Bioinformatics, 21, Supplement 1, i359-i368, 2005.
Additional Representations
- 1D: SMILES string (e.g. NC(CO)C(O)O)
- 2D: atomic connection table
- 3D: XYZ coordinates of labeled points
- 2.5D: 2D surface in 3D space
- 4D: bag of conformers as XYZ coordinates of labeled points
- 3.5D: bag of conformers of 2D surfaces in 3D space
- Multiple conformers
2.5D Surface Kernel
- Build a graph G (V = atoms) which approximates the surface (convex hull).
- Use spectral graph kernels on G.
2.5D Surface Kernel
- Compute a regular/Delaunay tessellation (tetrahedrization) of the convex hull of the atoms in the molecule.
- Use the alpha-shape algorithm to detect surface triangles at the relevant scale (keep interior and regular edges, remove singular edges; r on the order of the water/carbon radius).
- This yields a triangulated graph that approximates the surface (average degree 6).
- Use a spectral kernel with paths (length 3-4) on the triangulated surface graph.
Alpha Shape
- The shape formed by a set of points.
- Closely related to the solvent-accessible surface.
- Calculated in O(n log n) using CGAL: http://www.cgal.org/Manual/doc_html/cgal_manual/Alpha_shapes_3/Chapter_main.html
- See the sketch below.
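The slides use CGAL for the alpha shape; as a rough illustrative sketch in Python (an assumption, not the pipeline above), one can approximate a 3D alpha complex by computing a Delaunay tetrahedrization with scipy and keeping only the tetrahedra whose circumradius is at most alpha. The regular/singular edge classification from the previous slide is not handled here.

```python
# Rough sketch (assumption; the slides use CGAL): approximate a 3D alpha complex by
# keeping Delaunay tetrahedra whose circumradius is <= alpha. Edge classification
# (regular vs. singular) from the previous slide is not handled here.
import numpy as np
from scipy.spatial import Delaunay

def circumradius(p):
    """Circumradius of a tetrahedron given its 4 vertices, p of shape (4, 3)."""
    A = 2.0 * (p[1:] - p[0])                      # 3x3 linear system for the circumcenter
    b = (p[1:] ** 2).sum(axis=1) - (p[0] ** 2).sum()
    center = np.linalg.solve(A, b)
    return np.linalg.norm(center - p[0])

def alpha_complex_tetrahedra(points, alpha):
    tri = Delaunay(points)                        # Delaunay tetrahedrization
    keep = [s for s in tri.simplices if circumradius(points[s]) <= alpha]
    return np.array(keep)

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))                    # placeholder "atom" coordinates
print(len(alpha_complex_tetrahedra(pts, alpha=1.0)))
```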
The Conformer Problem
- Atoms connected by proximity
- Different conformers have different graphs and
features.
2.5D + Conformers → 3.5D
(Figure: conformers of Molecule A and Molecule B)
Molecular Representations and Kernels
- 1D SMILES strings
- 2D Graph of bonds
- 2D Surfaces
- 2.5D Conformers
- 3D Atomic coordinates
- (Pharmacophores, Epitopes)
- 3.5D Conformers
- 4D Temporal evolution
- 4D Isomers
Summary
- ChemDB and other resources
- Variety of kernels for small molecules
- State-of-the-art performance on several benchmark datasets
- For now, 2D kernels slightly better than 1D and 3D kernels
- Many possible extensions: 2.5D, 3D, 3.5D, 4D kernels
- Need for larger data sets and new models of cooperation in the chemistry community
- Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules/reactions, retrosynthesis, prediction of reaction rates, information retrieval from literature, docking, matching table of all proteins against all known compounds, origin of life, etc.)