Title: Chemoinformatics Theory
1Chemoinformatics Theory
Yoon Soo Pyon ysp2_at_case.edu October 19th, 2007
2Outline
- Chemoinformatics-What is it?
- Molecular descriptors and chemical spaces
- Chemical spaces and molecular similarity
- Molecular similarity, dissimilarity, diversity
- Modification and Simplification of chemical spaces
- Compound Classification and Selection
- Similarity Searching
- Machine Learning Methods
- Library Design
- Quantitative Structure Activity Relationship Analysis (QSAR)
- Virtual Screening and compound filtering
3Chemoinformatics-What is it?
- Use of computer and information techniques,
applied to a range of problems in the field of
chemistry.
- These in silico techniques are used in
pharmaceutical companies in the process of drug
discovery.
6Molecular descriptors and chemical spaces
- Chemical reference spaces: the spaces into which
molecular data sets are projected and in which
analysis or design is carried out.
- The definition of chemical spaces critically
depends on the use of computational descriptors of
molecular structure and of physical or chemical
properties.
8Molecular descriptors and chemical spaces
- There are no generally preferred descriptor
spaces.
- Reference spaces need to be generated for each
specific application on a case-by-case basis.
9Chemical spaces and molecular similarity
- Similar Property Principle: Molecules having
similar structures and properties should also
exhibit similar activity. (Often but not always
true.)
- Thus, molecules that are located close
together in chemical reference space are often
considered to be functionally related.
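As a minimal sketch of "located close together", the snippet below measures Euclidean distance between descriptor vectors and looks up the nearest neighbor of a compound. The compound names, the two descriptors, and their values are hypothetical, and Euclidean distance is only one of several possible choices.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical compounds described by two already-scaled descriptors.
compounds = {
    "A": (0.10, 0.20),
    "B": (0.15, 0.25),
    "C": (0.90, 0.80),
}

def nearest_neighbor(name, space):
    """Return the compound closest to `name` in the reference space."""
    others = (k for k in space if k != name)
    return min(others, key=lambda k: euclidean(space[name], space[k]))
```

Under the similar property principle, the nearest neighbor of a known active would be a reasonable first candidate for testing.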
11Molecular similarity, dissimilarity, and diversity
- Diversity analysis
- Select different compounds from a given
population
- Evenly populate a given chemical space with
candidate molecules, selecting only compounds
that are at least a pre-defined minimum distance
away from the others.
- Dissimilarity: the inverse of molecular similarity
- Dissimilarity analysis has played a major role in
the pharmaceutical industry.
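The minimum-distance idea above can be sketched as a single greedy pass. This is an illustration under assumptions: the points, the threshold, and the Euclidean metric are hypothetical, and the result depends on the order in which compounds are visited.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_min_distance(points, min_dist):
    """Keep a point only if it is at least `min_dist` away
    from every point that has already been kept."""
    selected = []
    for p in points:
        if all(euclidean(p, q) >= min_dist for q in selected):
            selected.append(p)
    return selected

# Hypothetical 2D descriptor coordinates.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.1), (0.0, 1.0)]
subset = select_min_distance(points, min_dist=0.5)
```

Here the two near-duplicates of already selected points are skipped, so three of the five points remain.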
12Molecular similarity, dissimilarity, and diversity
- Dissimilarity algorithm
- Select a subset of k maximally dissimilar
compounds
- → non-trivial challenge, due to the combinatorial
problem
- Other dissimilarity algorithm
- Decide on a desired size, n, of a final subset
- Select a seed compound and place it in the
subset
- Calculate the dissimilarity between each of the
other compounds and those in the subset
- Choose the next compound as the one most
dissimilar to those in the subset
- If there are fewer than n compounds in the
subset, repeat the dissimilarity calculation
until n is reached
- Complexity varies as the square of n
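The stepwise algorithm above can be sketched as follows. One assumption is made loudly here: "most dissimilar to those in the subset" is read as the largest minimum distance to any subset member (often called MaxMin); a MaxSum reading would be equally compatible with the slide. The data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def maxmin_select(points, n, seed_index=0):
    """Return indices of n compounds chosen by stepwise dissimilarity."""
    if not 0 < n <= len(points):
        raise ValueError("n must be between 1 and len(points)")
    selected = [seed_index]
    # min_dist[i]: distance from compound i to its nearest subset member.
    min_dist = [euclidean(p, points[seed_index]) for p in points]
    while len(selected) < n:
        candidates = (i for i in range(len(points)) if i not in selected)
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        # Only distances to the newly added compound need to be checked.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], euclidean(p, points[nxt]))
    return selected

# Hypothetical 2D descriptor coordinates.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
chosen = maxmin_select(points, n=3)
```

Caching the nearest-member distance avoids recomputing every pairwise dissimilarity at each step.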
13Modification and Simplification of Chemical
Spaces
- High-dimensional chemistry space is often too
complex for carrying out meaningful analyses.
- Why?
- 1) Major areas of high-dimensional chemical space
might not be populated and remain empty.
- 2) Correlation effects between selected
descriptors can dramatically distort the reference
space.
- Therefore,
- 1) Design low-dimensional reference spaces
- 2) Simplify high-dimensional spaces
- 3) Reduce their dimensionality
14Modification and Simplification of Chemical
Spaces (contd.)
- Auto scaling or variance scaling
- Why? Descriptors with a large value range will
dominate those with a smaller one.
- Dimension reduction
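A minimal sketch of auto scaling, assuming it means shifting each descriptor to zero mean and dividing by its standard deviation. The descriptor columns below are hypothetical.

```python
import statistics

def autoscale(columns):
    """Scale each descriptor column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        mean = statistics.mean(col)
        sd = statistics.pstdev(col)  # population standard deviation
        # A constant column carries no information; map it to zeros.
        scaled.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return scaled

# Hypothetical descriptors with very different value ranges.
mol_weight = [100.0, 200.0, 300.0]
logp = [1.0, 2.0, 3.0]
scaled = autoscale([mol_weight, logp])
```

After scaling, both columns span the same range, so neither dominates a distance calculation.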
15Modification and Simplification of Chemical
Spaces (contd.) Dimension reduction
- Assumption: High-dimensional descriptor spaces
have at least some intrinsic redundancy.
- Two approaches
- Identify those descriptors that are most
important for representing the original dataset
and the relationships they form between objects
for lower-dimensional representation
- ex) multidimensional scaling (Agrafiotis
et al. 2001)
- Attempt to generate new descriptors for
lower-dimensional spaces by combining important
contributors from the original ones.
- ex) Principal Component Analysis (PCA)
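To illustrate how PCA combines original descriptors into a new one, the sketch below computes only the first principal component by power iteration on the covariance matrix. This is a teaching simplification with hypothetical data; a real analysis would normally use a linear algebra library and several components.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered descriptors.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    # Power iteration converges to the dominant eigenvector of cov.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # New descriptor: projection of each compound onto the direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical data: the second descriptor is twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(rows)
```

Because the two descriptors are perfectly correlated, a single new descriptor captures all of the variance.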
16Modification and Simplification of Chemical
Spaces (contd.) - Simplification
- Simplification of n-dimensional descriptor
spaces
- ex) Binary descriptor transformation
- above mean → 1, below mean → 0
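The binary transformation can be sketched in a few lines. The slide does not say how a value exactly equal to the mean is handled; mapping it to 0 is an assumption made here, and the values are hypothetical.

```python
def binarize_by_mean(columns):
    """Map each descriptor value to 1 if above its column mean, else 0."""
    result = []
    for col in columns:
        mean = sum(col) / len(col)
        result.append([1 if v > mean else 0 for v in col])
    return result

# Hypothetical descriptor column for four compounds (mean is 4.0).
bits = binarize_by_mean([[1.0, 2.0, 3.0, 10.0]])
```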
17Compound Classification and Selection- CLUSTER
ANALYSIS
- The aim is to divide a group into clusters where
objects in the same cluster are similar, but
objects in other clusters are dissimilar
- Many algorithms for doing this
- Hierarchical methods seem to be better than
non-hierarchical
- Sometimes called a distance-based approach to
compound selection, because distance is measured
between pairs of compounds
19Compound Classification and Selection-
Hierarchical Clustering
- The composition of each cluster depends on the
one from which it was derived
- Agglomerative methods start at the bottom and
merge similar clusters (bottom-up)
- Ward's method: clusters are formed to minimize
the variance (i.e., the sum of the squared
deviations from the mean)
- Others: centroid method and the median method
- Divisive hierarchical clustering starts with all
compounds in a single cluster and partitions the
data (top-down)
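A minimal sketch of agglomerative clustering with Ward's criterion: at each step, the two clusters whose merger gives the smallest increase in the within-cluster sum of squares are joined. The stopping rule (a requested number of clusters) and the data are assumptions, and the brute-force pair search is for clarity, not speed.

```python
def centroid(cluster, points):
    """Mean position of the compounds whose indices are in `cluster`."""
    dims = len(points[0])
    return [sum(points[i][j] for i in cluster) / len(cluster)
            for j in range(dims)]

def ward_cost(a, b, points):
    """Increase in within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a, points), centroid(b, points)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_clustering(points, n_clusters):
    """Bottom-up merging until `n_clusters` clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: ward_cost(
            clusters[ij[0]], clusters[ij[1]], points))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2D descriptor coordinates.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = ward_clustering(points, n_clusters=3)
```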
20Compound Classification and Selection-
Non-Hierarchical Clustering
- Organize compounds into an initially defined
number of independent clusters.
- Methods
- nearest neighbor: Jarvis-Patrick clustering
- relocation: K-means
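As an illustration of the relocation approach, here is a small K-means sketch. Using the first k points as starting centroids is a simplification chosen to keep the example deterministic; practical implementations use better initialization. The data are hypothetical.

```python
def kmeans(points, k, iterations=100):
    """Return (assignment, centroids) after relocation converges."""
    centroids = [list(p) for p in points[:k]]  # naive initialization
    assignment = None
    for _ in range(iterations):
        # Relocate each compound to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sum(
                (x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no compound moved, so the clustering is stable
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids

# Hypothetical 2D descriptor coordinates forming two groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0),
          (5.0, 6.0), (1.0, 0.0), (6.0, 5.0)]
assignment, centroids = kmeans(points, k=2)
```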
21Compound Classification and Selection-
Partitioning
- Rather than comparing molecular positions,
establish a coordinate or reference system in
chemical space.
- Compounds that populate the same partition are
considered to be similar.
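One simple way to realize such a reference system is a regular grid over the descriptor axes. The sketch below assumes every descriptor has been scaled into the range [lo, hi]; the grid resolution and the data are hypothetical.

```python
from collections import defaultdict

def partition(points, bins, lo=0.0, hi=1.0):
    """Group compound indices by the grid cell they fall into.

    Assumes every descriptor value lies within [lo, hi].
    """
    width = (hi - lo) / bins
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        # Values equal to `hi` are placed in the last bin.
        cell = tuple(min(int((v - lo) / width), bins - 1) for v in p)
        cells[cell].append(idx)
    return dict(cells)

# Hypothetical compounds in a scaled 2D descriptor space.
points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.8), (0.6, 0.1)]
cells = partition(points, bins=2)
```

Compounds 0 and 1 share a cell and would be treated as similar, without any pairwise distance being computed.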
22Compound Classification and Selection-
Partitioning
- Diversity-based selection: Aims at generating a
small representative subset of a compound
collection. The goal is to generate evenly
populated partitions.
- Activity-based selection: Known active compounds
are added to the source database prior to
partitioning. Database compounds that map close
to known actives are then selected as
candidates for testing to identify new hits.
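Both selection modes can be sketched on top of a simple grid partition. Everything here is illustrative: the grid, the compound names, the descriptor values, and the reading of "close to known actives" as "in the same cell as a known active" are assumptions.

```python
def cell_of(point, bins=4, lo=0.0, hi=1.0):
    """Grid cell of a point whose descriptors lie within [lo, hi]."""
    width = (hi - lo) / bins
    return tuple(min(int((v - lo) / width), bins - 1) for v in point)

def diversity_based_selection(database, bins=4):
    """Pick one representative (the first seen) per occupied cell."""
    representatives = {}
    for name, point in database.items():
        representatives.setdefault(cell_of(point, bins), name)
    return sorted(representatives.values())

def activity_based_selection(database, actives, bins=4):
    """Pick database compounds that share a cell with a known active."""
    active_cells = {cell_of(p, bins) for p in actives.values()}
    return sorted(name for name, p in database.items()
                  if cell_of(p, bins) in active_cells)

# Hypothetical scaled descriptors for database compounds and one active.
database = {"d1": (0.10, 0.10), "d2": (0.15, 0.20),
            "d3": (0.80, 0.90), "d4": (0.55, 0.30)}
actives = {"a1": (0.20, 0.05)}

diverse = diversity_based_selection(database)
candidates = activity_based_selection(database, actives)
```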
23Compound Classification and Selection-
Statistical Partitioning
- Recursive partitioning: the most popular
statistical partitioning method; a decision tree
method
- Divides datasets along decision trees formed by
sequences of molecular descriptors.
- ex) The compounds could be divided according to
molecular weight.
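A minimal sketch of recursive partitioning on active/inactive labels. The split criterion (Gini impurity), the depth limit, and the molecular weight values are assumptions for illustration; the slides do not specify them.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 activity labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (descriptor index, threshold) that lowers impurity most."""
    best, best_score = None, gini(labels)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            low = [l for r, l in zip(rows, labels) if r[j] <= t]
            high = [l for r, l in zip(rows, labels) if r[j] > t]
            if not low or not high:
                continue
            score = (len(low) * gini(low)
                     + len(high) * gini(high)) / len(labels)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow_tree(rows, labels, depth=0, max_depth=2):
    """Split recursively; leaves report their fraction of actives."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return {"active_fraction": sum(labels) / len(labels),
                "size": len(labels)}
    j, t = split
    low = [(r, l) for r, l in zip(rows, labels) if r[j] <= t]
    high = [(r, l) for r, l in zip(rows, labels) if r[j] > t]
    return {"descriptor": j, "threshold": t,
            "low": grow_tree([r for r, _ in low], [l for _, l in low],
                             depth + 1, max_depth),
            "high": grow_tree([r for r, _ in high], [l for _, l in high],
                              depth + 1, max_depth)}

# Hypothetical data: one descriptor (molecular weight), 1 = active.
rows = [(250.0,), (300.0,), (320.0,), (480.0,), (510.0,), (560.0,)]
labels = [1, 1, 1, 0, 0, 0]
tree = grow_tree(rows, labels)
```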
24Compound Classification and Selection-
Statistical Partitioning
- Statistical partitioning methods such as
recursive partitioning are also very attractive
tools for the analysis of high-throughput
screening (HTS) data sets.
25Similarity Searching Structural queries and
graphs
- Detection of structural fragments or
substructures is a simple but popular form of
similarity searching.
26Similarity Searching Structural queries and
graphs
- Contemporary substructure search methods are
mostly based on dictionaries of predefined
molecular fragments.
- Queries can be transformed into a
machine-readable format such as Simplified
Molecular Input Line Entry System (SMILES)
code.
- SMILES encodes 2D representations of molecules as
linear strings of alphanumeric characters.
27Similarity Searching Structural queries and
graphs (SMILES)
28Similarity Searching Structural queries and
graphs
- Subgraph isomorphism
- Common substructures can also be determined by
systematic mapping of corresponding node
positions in graphs.
- However, computationally expensive
- Reduced graph
- Nodes do not represent atoms but features such as
functionally important groups or whole ring
systems.
- Becomes more suitable for node matching
procedures and similarity searching.
29Similarity Searching Structural queries and
graphs (Reduced graph )
30Similarity Searching Pharmacophore
- A molecular framework that carries the essential
features responsible for a drug's biological
activity
- Spatial arrangements of atoms or groups that are
responsible for biological activity
- Often used as 3D queries for database searching
31Similarity Searching Fingerprints
- Fingerprints
- widely used similarity search tools.
- consist of various descriptors that are encoded
as bit strings
- Bit strings of the query and database compounds
are compared using a similarity metric such as
the Tanimoto coefficient
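For bit strings, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either, Tc = c / (a + b - c). The sketch below ranks a small database against a query; the fingerprints and compound names are hypothetical, and returning 0.0 for two all-zero fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length '0'/'1' bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    on_a = fp_a.count("1")
    on_b = fp_b.count("1")
    common = sum(1 for x, y in zip(fp_a, fp_b) if x == y == "1")
    denom = on_a + on_b - common
    return common / denom if denom else 0.0

def similarity_search(query, database):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical 6-bit fingerprints.
query = "110100"
database = {"cpd1": "110000", "cpd2": "001011", "cpd3": "110100"}
ranking = similarity_search(query, database)
```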
32Machine Learning Methods
- Important role in chemoinformatics
- For example, it is usually difficult to predict
which types of descriptors are most suitable for
a given search or classification problem.
- Therefore, machine learning techniques are often
used to facilitate descriptor selection
- Applied to generate complex predictive models by
iterative processing of molecular learning sets
- Genetic algorithms
- Neural Networks
- Self Organizing Maps (SOM)
33Machine Learning Methods Genetic algorithms
- Different parameters and model solutions to given
problems are encoded in a chromosome and
subjected to iterative random variation, thus
generating a population.
- Solutions provided by these chromosomes are
evaluated by a fitness function that assigns high
scores to desired results.
- Chromosomes yielding the best intermediate
solutions are subjected to mutation and crossover
operations that correspond to random genetic
mutations and gene recombination events.
- The resulting modified chromosomes represent the
next generation, and the process is continued
until the obtained results meet a satisfactory
convergence criterion.
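The loop described above can be sketched for a toy descriptor-selection problem, where each chromosome is a bit list marking which descriptors are switched on. The fitness function, the `RELEVANT` mask, and all parameter values are hypothetical stand-ins; in practice the fitness would score a model built from the selected descriptors.

```python
import random

# Hypothetical "ideal" descriptor selection used only by the toy fitness.
RELEVANT = [1, 0, 1, 1, 0, 0, 1, 0]

def fitness(chromosome):
    """Toy fitness: number of positions that agree with RELEVANT."""
    return sum(1 for g, r in zip(chromosome, RELEVANT) if g == r)

def run_ga(fitness, n_bits, pop_size=20, generations=50,
           mutation_rate=0.05, target=None, seed=0):
    """Return (best chromosome, best fitness per generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        if target is not None and history[-1] >= target:
            break  # convergence criterion met
        parents = population[: pop_size // 2]  # best half may reproduce
        children = [population[0][:]]          # keep the best unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)     # one-point crossover
            child = mum[:cut] + dad[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]           # random bit-flip mutation
            children.append(child)
        population = children
    return max(population, key=fitness), history

best, history = run_ga(fitness, n_bits=len(RELEVANT), target=len(RELEVANT))
```

Carrying the best chromosome over unchanged means the best score can never decrease from one generation to the next.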
34Library Design
- Diverse Library
- Focused Library
35Quantitative Structure Activity Relationship
Analysis (QSAR)
- Goal: Evaluation of molecular features that
determine biological activity and the prediction
of compound potency as a function of structural
modification
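In its simplest form, a QSAR model is a regression of potency on a descriptor. The sketch below fits a one-descriptor straight line by least squares; the descriptor (logP), the potency scale (pIC50), and all values are hypothetical, and real QSAR models typically use many descriptors and proper validation.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x_new):
    """Predicted potency for a new descriptor value."""
    slope, intercept = model
    return slope * x_new + intercept

# Hypothetical training set: descriptor values and measured potencies.
logp = [1.0, 2.0, 3.0, 4.0]
pic50 = [5.1, 5.9, 7.1, 7.9]
model = fit_line(logp, pic50)
```

The fitted slope estimates how much potency changes per unit change in the descriptor, which is the "function of structural modification" in miniature.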
36Virtual Screening and Compound Filtering
- VS (Virtual Screening): the process of screening
large databases on the computer for molecules
having desired properties and biological
activity.
- A major application of VS techniques is the
identification of novel active molecules in large
compound databases.
- Series of known active compounds are added as
search templates to a source DB, and compounds
that are identified as similar to these
templates based on VS calculations are then
selected as candidate molecules for experimental
evaluation.
38Virtual Screening and Compound Filtering- Filter
Functions
- Filter functions are very popular tools for VS
- Attempt to identify compounds with desired
properties and discard others.
- Have been implemented for analysis of diverse
molecular properties including chemical
reactivity, toxicity, drug-like character, and
absorption, distribution, metabolism, excretion
(ADME) parameters.
- Ex) Aqueous solubility, passive absorption,
blood-brain-barrier penetration,
metabolic stability, oral availability
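A filter function can be sketched as a set of property tests that a compound must pass. The thresholds below are borrowed from Lipinski's rule of five as one familiar drug-likeness example; they are not taken from the slides, they are applied strictly here (every test must pass), and the property names and compound values are hypothetical.

```python
# Property thresholds in the style of Lipinski's rule of five.
FILTERS = {
    "mol_weight": lambda v: v <= 500,
    "logp": lambda v: v <= 5,
    "h_donors": lambda v: v <= 5,
    "h_acceptors": lambda v: v <= 10,
}

def passes_filters(compound, filters=FILTERS):
    """True if the compound satisfies every property test."""
    return all(test(compound[name]) for name, test in filters.items())

def filter_library(library, filters=FILTERS):
    """Keep the names of compounds that pass all filters."""
    return [c["name"] for c in library if passes_filters(c, filters)]

# Hypothetical precomputed properties for three compounds.
library = [
    {"name": "cpd1", "mol_weight": 320.4, "logp": 2.1,
     "h_donors": 2, "h_acceptors": 5},
    {"name": "cpd2", "mol_weight": 612.7, "logp": 4.3,
     "h_donors": 3, "h_acceptors": 9},
    {"name": "cpd3", "mol_weight": 450.0, "logp": 6.2,
     "h_donors": 1, "h_acceptors": 4},
]
kept = filter_library(library)
```

Filters for solubility, toxicity, or other ADME parameters would slot into the same dictionary as additional tests.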
40Thank You