Chemoinformatics Theory - PowerPoint PPT Presentation

1
Chemoinformatics Theory
Yoon Soo Pyon ysp2_at_case.edu October 19th, 2007
2
Outline

Chemoinformatics-What is it?
Molecular descriptors and chemical spaces
Chemical spaces and molecular similarity
Molecular similarity, dissimilarity, diversity
Modification and Simplification of chemical
spaces
Compound Classification and Selection
Similarity Searching
Machine Learning Methods
Library Design
Quantitative Structure Activity Relationship
Analysis (QSAR)
Virtual Screening and compound filtering

3
Chemoinformatics-What is it?

The use of computer and information techniques,
applied to a range of problems in the field of
chemistry.
These in silico techniques are used by
pharmaceutical companies in the drug discovery
process.

4
Chemoinformatics-What is it?
5
Chemoinformatics-What is it?
6
Molecular descriptors and chemical spaces

Chemical reference spaces are the spaces into
which molecular data sets are projected and in
which analysis or design is carried out.
The definition of a chemical space critically
depends on the computational descriptors used
for molecular structure and for physical or
chemical properties.

7
Molecular descriptors and chemical spaces
8
Molecular descriptors and chemical spaces

There are no generally preferred descriptor
spaces.
Reference spaces must be generated for each
specific application on a case-by-case basis.

9
Chemical spaces and molecular similarity

Similar Property Principle: Molecules having
similar structures and properties should also
exhibit similar activity (often, but not always,
true).
Thus, molecules that are located closely
together in chemical reference space are often
considered to be functionally related.
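
As a minimal sketch of "closeness" in a chemical reference space, each compound can be treated as a vector of descriptor values and compared by Euclidean distance. The compound names, descriptor choices and values below are made up for illustration, and unscaled descriptors are used only to keep the example short.

```python
import math

# Hypothetical descriptor vectors: (molecular weight, logP, H-bond donors).
# All values are invented for illustration.
compounds = {
    "A": (180.2, 1.2, 1),
    "B": (182.0, 1.3, 1),
    "C": (420.5, 4.8, 3),
}

def euclidean(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor(name, space):
    """Return the compound closest to `name` in the descriptor space."""
    others = (k for k in space if k != name)
    return min(others, key=lambda k: euclidean(space[name], space[k]))
```

Under the similar property principle, the nearest neighbor of a known active would be a reasonable first candidate for testing.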

10
Chemical spaces and molecular similarity
11
Molecular similarity, dissimilarity, and diversity

Diversity analysis
Select different compounds from a given
population.
Evenly populate a given chemical space with
candidate molecules by selecting only compounds
that are at least a pre-defined minimum distance
away from the others.
Dissimilarity: the inverse of molecular similarity
Dissimilarity analysis has played a major role in
the pharmaceutical industry.
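
The minimum-distance rule for diversity selection can be sketched as a greedy pass over the candidates. This is one possible reading of the slide, not necessarily the procedure the presenter had in mind; the function name and the order-dependent greedy strategy are assumptions.

```python
import math

def pick_diverse(points, min_dist):
    """Keep a point only if it lies at least min_dist from every kept point.

    Greedy and order-dependent: earlier points win ties.
    """
    selected = []
    for p in points:
        if all(math.dist(p, q) >= min_dist for q in selected):
            selected.append(p)
    return selected
```

A larger `min_dist` yields a smaller, more spread-out subset.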

12
Molecular similarity, dissimilarity, and diversity

Dissimilarity algorithm
Select a subset of k maximally dissimilar
compounds
→ a non-trivial challenge because of the
combinatorial problem
Alternative dissimilarity algorithm
Decide on a desired size, n, of a final subset
Select a seed compound and place it in the
subset
Calculate the dissimilarity between each of the
other compounds and those in the subset
Choose the next compound as the one most
dissimilar to those in the subset
If fewer than n in the subset, repeat the
calculation of the dissimilarity until n is
achieved
Complexity varies as the square of n
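
The stepwise algorithm above can be sketched as follows. The slide does not say how "most dissimilar to those in the subset" is measured; this sketch assumes the MaxMin criterion (maximize the distance to the nearest subset member) and Euclidean distance, and it caches nearest distances rather than recomputing them on every pass.

```python
import math

def maxmin_select(points, n, seed_index=0):
    """Return indices of n points chosen by the MaxMin criterion."""
    subset = [seed_index]
    # Distance from every point to its nearest subset member so far.
    nearest = [math.dist(p, points[seed_index]) for p in points]
    while len(subset) < min(n, len(points)):
        candidates = (i for i in range(len(points)) if i not in subset)
        nxt = max(candidates, key=lambda i: nearest[i])
        subset.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return subset
```

The result depends on the seed compound, so different seeds can give different subsets.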

13
Modification and Simplification of Chemical
Spaces

High-dimensional chemical space is often too
complex for meaningful analysis.
Why?
1) Large regions of a high-dimensional chemical
space may be unpopulated and remain empty.
2) Correlation effects between selected
descriptors can dramatically distort the
reference space.
Therefore,
1) Design low-dimensional reference spaces
2) Simplify high-dimensional spaces
3) Reduce their dimensionality

14
Modification and Simplification of Chemical
Spaces (contd.)

Auto scaling or variance scaling
Why? Descriptors with a large value range would
otherwise dominate those with a smaller range.
Dimension reduction
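
Auto scaling can be sketched as subtracting each descriptor's mean and dividing by its standard deviation. The function name is illustrative, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def autoscale(rows):
    """Scale each descriptor column to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    # A constant column has zero spread; divide by 1.0 instead.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, mus, sigmas))
        for row in rows
    ]
```

After scaling, a descriptor measured in hundreds (such as molecular weight) no longer outweighs one measured in single digits.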

15
Modification and Simplification of Chemical
Spaces (contd.) Dimension reduction

Assumption: High-dimensional descriptor spaces
have at least some intrinsic redundancy.
Two approaches
1) Identify the descriptors that are most
important for representing the original data
set, and the relationships they form between
objects, for a lower-dimensional representation.
ex) multidimensional scaling (Agrafiotis
et al., 2001)
2) Generate new descriptors for lower-dimensional
spaces by combining important contributors from
the original space.
ex) Principal Component Analysis (PCA)
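
To illustrate the second approach, the first principal component can be approximated by power iteration on the covariance matrix of the descriptors. This is a teaching sketch under stated assumptions (fixed iteration count, fixed start vector), not a substitute for a numerical library; it fails if the start vector happens to be orthogonal to the dominant direction.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return a unit vector along which the centered data varies most."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix of the descriptors.
    cov = [
        [sum(c[i] * c[j] for c in centered) / n for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 + j for j in range(d)]  # arbitrary, non-uniform start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left along this direction
        v = [x / norm for x in w]
    return v
```

Projecting each centered compound onto this vector gives a single new descriptor that combines the original ones.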

16
Modification and Simplification of Chemical
Spaces (contd.) - Simplification

Simplification of n-dimensional descriptor
spaces
ex) Binary descriptor transformation
above mean → 1, below mean → 0
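
A minimal sketch of the binary transformation for one descriptor column. Mapping values exactly equal to the mean to 0 is an assumption; the slide does not cover that case.

```python
def binarize(values):
    """Map each value to 1 if it is above the column mean, else 0."""
    m = sum(values) / len(values)
    return [1 if v > m else 0 for v in values]
```

Applied to every descriptor, this turns each compound into a bit string.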

17
Compound Classification and Selection- CLUSTER
ANALYSIS

The aim is to divide a group of compounds into
clusters such that objects within a cluster are
similar to one another and dissimilar to objects
in other clusters.
There are many algorithms for doing this.
Hierarchical methods seem to be better than
non-hierarchical ones.
Sometimes called a distance-based approach to
compound selection, because distance is measured
between pairs of compounds.

18
Compound Classification and Selection- CLUSTER
ANALYSIS
19
Compound Classification and Selection-
Hierarchical Clustering

The composition of each cluster depends on the
one from which it was derived
Agglomerative methods start at the bottom and
merge similar clusters (bottom-up)
Ward's method: clusters are formed so as to
minimize the variance (i.e., the sum of the
squared deviations from the mean)
Others: the centroid method and the median method
Divisive hierarchical clustering starts with all
compounds in a single cluster and partitions the
data (top-down)
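
A naive sketch of agglomerative clustering with Ward's criterion: at each step, merge the pair of clusters whose merger increases the within-cluster sum of squares the least. The function names are illustrative, and the brute-force pair search is chosen for clarity, not speed.

```python
def centroid(cluster):
    """Mean position of the points in a cluster."""
    d = len(cluster[0])
    return [sum(p[j] for p in cluster) / len(cluster) for j in range(d)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq

def agglomerate(points, k):
    """Merge clusters bottom-up until k clusters remain (k >= 1)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(
            pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]])
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the merges in order, instead of stopping at `k`, would give the full hierarchy.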

20
Compound Classification and Selection-
Non-Hierarchical Clustering

Organize compounds into an initially defined
number of independent clusters.
Methods
Nearest neighbor: Jarvis-Patrick clustering
Relocation: K-means
21
Compound Classification and Selection-
Partitioning

Rather than comparing molecular positions,
establish a coordinate or reference system in
chemical space.
Compounds that populate the same partition are
considered to be similar.
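
One simple way to picture partitioning is a regular grid over the descriptor axes, with compounds grouped by the cell they fall into. The uniform bin width and the compound names are assumptions for illustration, and the descriptors are assumed to be scaled already so that one bin width suits every axis.

```python
from collections import defaultdict

def partition(space, bin_width):
    """Group compound names by the grid cell of their descriptor vector."""
    cells = defaultdict(list)
    for name, vec in space.items():
        cell = tuple(int(x // bin_width) for x in vec)
        cells[cell].append(name)
    return dict(cells)
```

No pairwise distances are computed, which is why partitioning scales well to large collections.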

22
Compound Classification and Selection-
Partitioning

Diversity-based selection: Aims at generating a
small representative subset of a compound
collection. The goal is to generate evenly
populated partitions.
Activity-based selection: Known active compounds
are added to the source database prior to
partitioning. Database compounds that map close
to known actives are then selected as candidates
for testing to identify new hits.

23
Compound Classification and Selection-
Statistical Partitioning

Recursive partitioning: the most popular
statistical partitioning approach; a decision
tree method.
Divides data sets along decision trees formed by
sequences of molecular descriptors.
ex) The compounds could be divided according to
molecular weight.
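
A single step of recursive partitioning can be sketched as a search for the descriptor threshold that best separates active from inactive compounds. The slide names no splitting criterion; Gini impurity is used here as an assumption, and the function names are illustrative.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 activity labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, impurity) for the purest split on one descriptor.

    Compounds with value < threshold go left, the rest go right.
    Returns None if all values are identical.
    """
    best = None
    for t in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        score = (
            len(left) * gini(left) + len(right) * gini(right)
        ) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

Repeating the same search inside each resulting group, possibly on a different descriptor each time, builds the decision tree.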

24
Compound Classification and Selection-
Statistical Partitioning

Statistical partitioning methods such as
recursive partitioning are also very attractive
tools for the analysis of HTS data sets.

25
Similarity Searching Structural queries and
graphs

Detection of structural fragments or
substructures is a simple but popular form of
similarity searching.

26
Similarity Searching Structural queries and
graphs

Contemporary substructure search methods are
mostly based on dictionaries of predefined
molecular fragments.
Queries can be transformed into a
machine-readable format such as Simplified
Molecular Input Line Entry System (SMILES)
code.
SMILES encodes 2D representations of molecules
as linear strings of alphanumeric characters.

27
Similarity Searching Structural queries and
graphs (SMILES)
28
Similarity Searching Structural queries and
graphs

Subgraph isomorphism
Common substructures can also be determined by
systematically mapping corresponding node
positions in graphs.
However, this is computationally expensive.
Reduced graph
Nodes do not represent atoms but features such as
functionally important groups or whole ring
systems.
Reduced graphs are better suited to node-matching
procedures and similarity searching.

29
Similarity Searching Structural queries and
graphs (Reduced graph )
30
Similarity Searching Pharmacophore

A molecular framework that carries the essential
features responsible for a drug's biological
activity
Spatial arrangements of atoms or groups that are
responsible for biological activity
Often used as 3D queries for database searching

31
Similarity Searching Fingerprints

Fingerprints
Widely used similarity search tools.
Consist of various descriptors that are encoded
as bit strings.
Bit strings of the query and database compounds
are compared using a similarity metric such as
the Tanimoto coefficient.
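
For two bit strings, the Tanimoto coefficient is the number of bits set in both divided by the number of bits set in either. A minimal sketch, representing each fingerprint as the set of its "on" bit positions (this representation, and returning 0.0 for two empty fingerprints, are assumptions):

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```

The value ranges from 0 (no bits in common) to 1 (identical fingerprints).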

32
Machine Learning Methods

Machine learning plays an important role in
chemoinformatics.
For example, it is usually difficult to predict
which types of descriptors are most suitable for
a given search or classification task.
Therefore, machine learning techniques are often
used to facilitate descriptor selection.
They are also applied to generate complex
predictive models by iterative processing of
molecular learning sets.
Genetic algorithms
Neural Networks
Self Organizing Maps (SOM)

33
Machine Learning Methods Genetic algorithms

Different parameters and model solutions to a
given problem are encoded in chromosomes and
subjected to iterative random variation, thus
generating a population.
The solutions provided by these chromosomes are
evaluated by a fitness function that assigns
high scores to desired results.
Chromosomes yielding the best intermediate
solutions are subjected to mutation and
crossover operations that correspond to random
genetic mutations and gene recombination events.
The resulting modified chromosomes represent the
next generation, and the process continues until
the results meet a satisfactory convergence
criterion.
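
The loop described above can be sketched with bit-string chromosomes, for example one bit per descriptor to mark it as selected or not. Everything specific here is an assumption made for illustration: keeping the best half as parents, one-point crossover, a single bit flip per child, and a fixed generation count standing in for a real convergence criterion. The caller supplies the fitness function.

```python
import random

def evolve(fitness, n_genes, pop_size=20, generations=50, seed=0):
    """Return the fittest bit-string chromosome found (n_genes >= 2)."""
    rng = random.Random(seed)
    pop = [
        [rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # best intermediate solutions survive
        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)  # one-point crossover
            child = mom[:cut] + dad[cut:]
            i = rng.randrange(n_genes)       # point mutation: flip one bit
            child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

Because the parents are carried over unchanged, the best score never decreases from one generation to the next.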

34
Library Design

Diverse Library
Focused Library

35
Quantitative Structure Activity Relationship
Analysis (QSAR)

Goal: Evaluation of the molecular features that
determine biological activity, and the prediction
of compound potency as a function of structural
modification
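
In its simplest form, a QSAR model can be pictured as a straight-line fit of measured potency against a single descriptor. The sketch below uses ordinary least squares with one descriptor, which is an assumption chosen for brevity; practical QSAR models typically use many descriptors and more elaborate regression methods.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def predict(model, x_new):
    """Predicted potency for a new descriptor value."""
    slope, intercept = model
    return slope * x_new + intercept
```

Once fitted on compounds with known potency, the model gives an estimate for a structurally modified compound from its descriptor value alone.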

36
Virtual Screening and Compound Filtering

Virtual screening (VS): the process of
computationally screening large databases for
molecules having desired properties and
biological activity.
A major application of VS techniques is the
identification of novel active molecules in large
compound databases.
A series of known active compounds is added to a
source database as search templates. Compounds
identified as similar to these templates by the
VS calculations are then selected as candidate
molecules for experimental evaluation.
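
A template-based screen of this kind can be sketched by scoring each database compound against the known actives and keeping those above a similarity cutoff. Fingerprints as sets of on-bits, Tanimoto similarity, the best-match scoring rule and the threshold are all assumptions of this sketch.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def screen(database, templates, threshold):
    """Return (name, score) hits, best first.

    A compound's score is its highest similarity to any template.
    """
    hits = []
    for name, fp in database.items():
        score = max(tanimoto(fp, t) for t in templates)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

The hit list is what would be passed on for experimental evaluation.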

38
Virtual Screening and Compound Filtering- Filter
Functions

Filter functions are very popular tools for VS.
They attempt to identify compounds with desired
properties and discard the others.
They have been implemented for the analysis of
diverse molecular properties, including chemical
reactivity, toxicity, drug-like character, and
absorption, distribution, metabolism, excretion
(ADME) parameters.
Ex) Aqueous solubility, passive absorption,
blood-brain-barrier penetration,
metabolic stability,
oral availability
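
As one example of a filter for drug-like character, the sketch below applies Lipinski's rule of five, which is a well-known filter but is not named in the slides. The dictionary keys are hypothetical, the property values are assumed to be precomputed, and tolerating one violation is a common convention rather than a fixed rule.

```python
def rule_of_five_violations(mol):
    """Count violations of Lipinski's rule of five for one compound."""
    return sum([
        mol["mol_weight"] > 500,
        mol["logp"] > 5,
        mol["h_donors"] > 5,
        mol["h_acceptors"] > 10,
    ])

def filter_druglike(mols, max_violations=1):
    """Keep compounds with at most max_violations violations."""
    return [
        m for m in mols if rule_of_five_violations(m) <= max_violations
    ]
```

Filters for other properties, such as solubility or metabolic stability, would follow the same keep-or-discard pattern with their own criteria.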