Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search presentation

1
Persistent object-oriented hyper-graph model for
Maximal Common Substructure (MCS) search
Milorad Tosic, Ph.D.
Rutgers, The State University of New Jersey
Department of Chemistry
2
Databases of Chemical Structures: Similarity Searching Features
(each feature is listed from less general to more general)
Size of the database: a couple of hundred thousand structures
Nature of structure data: purified, consistent data; raw, inconsistent data
Search type: structure search; substructure search [DOW96, BAR93]; superstructure search (structures contained in the target structure); substructure similarity search [HAG92, GWW98, ART92]
Type of similarity: graph isomorphism; subgraph isomorphism; maximal common subgraph
3
Substructure similarity search
[DOW96, BAR93, HAG92, GWW98, ART92]

Screening search
Based on substructural features, typically small fragment substructures
Processes many thousands of structures per second
Precedes the detailed and time-consuming atom-by-atom search
Atom-by-atom search (MCS, Maximal Common Substructure search)
The MCS of a pair of structures is the largest substructure that is present in both structures.
The MCS is interpreted as a similarity measure between two structures that corresponds well to the intuitive notion of chemical similarity.
The MCS is our primary concern because of its importance for search quality and its exponential computational complexity.
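
The two-stage flow on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: the fragment keys, the Tanimoto coefficient used as the screening score, the 0.5 threshold, and the names `tanimoto` and `screen` are all assumptions made for the example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fragment sets (assumed screening score)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def screen(query_fragments, database, threshold=0.5):
    """Return the names of structures that pass the cheap fragment screen.

    Only these candidates would go on to the expensive atom-by-atom (MCS) step.
    """
    return [
        name
        for name, fragments in database.items()
        if tanimoto(query_fragments, fragments) >= threshold
    ]


# Hypothetical fragment keys, invented for illustration only.
database = {
    "A": {"C-C", "C=O", "C-O"},
    "B": {"C-N"},
    "C": {"C-C", "C=O"},
}
candidates = screen({"C-C", "C=O"}, database)
print(candidates)  # ['A', 'C']
```

Structure B shares no fragment with the query, so it is discarded before any atom-by-atom comparison is attempted.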

4
MCS - Maximal Common Substructure search

NP-complete problem
Subgraph isomorphism is proven to be an NP-complete problem, which implies that the MCS problem is also NP-complete.
(At least) exponential computational complexity
Average run-time can be reduced by:
using a faster computer
using various heuristics
carrying out some computation in a pre-processing phase

[BAR93, XUJ96]
5
Our strategy for MCS search

Back-tracking
Back-tracking is used as a common underlying algorithm for problems with exponential complexity.
Distributed objects
Distributed computing is explored as a way to increase processing speed.
Persistent objects are essential for the robustness of the search engine.
Topology-based comparison criteria
Topology-based features of chemical structures are attractive for describing structures efficiently [XUJ96, EST98, WAN98].
Topological queries and indexing in collections of distributed objects are considered a promising approach in similar applications [PSV99].
Our heuristics are based on substructure-by-substructure search instead of atom-by-atom search. They aim to reduce the average search time and to postpone the computational explosion to structures that are as large as possible.
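
As a point of reference for the back-tracking item above, a plain atom-by-atom back-tracking search could look like the sketch below. It is not the presenter's substructure-by-substructure engine. It assumes that a structure is given as a dictionary of atom labels plus an adjacency dictionary, and it searches for a maximum common induced subgraph that need not be connected. The function name `max_common_subgraph` and the toy structures are hypothetical.

```python
def max_common_subgraph(labels1, adj1, labels2, adj2):
    """Return the largest label- and adjacency-preserving partial mapping."""
    nodes1 = list(labels1)
    best = {}

    def extend(i, mapping, used):
        nonlocal best
        # Bound: even mapping every remaining atom cannot beat the best so far.
        if len(mapping) + (len(nodes1) - i) <= len(best):
            return
        if i == len(nodes1):
            best = dict(mapping)  # strictly larger, thanks to the bound above
            return
        u = nodes1[i]
        for v in labels2:
            if v in used or labels1[u] != labels2[v]:
                continue
            # Adjacency must agree with every pair that is already mapped.
            if all((p in adj1[u]) == (q in adj2[v]) for p, q in mapping.items()):
                mapping[u] = v
                used.add(v)
                extend(i + 1, mapping, used)
                del mapping[u]  # back-track
                used.discard(v)
        extend(i + 1, mapping, used)  # also try leaving u unmatched

    extend(0, {}, set())
    return best


# Toy structures (assumed): chain C-C-O versus chain C-O-C.
labels1 = {"a": "C", "b": "C", "c": "O"}
adj1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
labels2 = {"x": "C", "y": "O", "z": "C"}
adj2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}

common = max_common_subgraph(labels1, adj1, labels2, adj2)
print(len(common))  # 2
```

The only common part of the two toy chains is a single C-O bond, so the mapping has two atoms. The search tree grows exponentially with structure size, which is the behaviour the heuristics on this slide are meant to postpone.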
6
Experimental results - question
Is there any search speed-up due to the introduction of topology-based comparison criteria?

Compare the search time with and without topology-based criteria, for the same set of target structures and the same set of database structures.
The topology criterion used is based on the loop number: an atom X matches an atom Y iff they have the same atom type and the number of loops that X belongs to is not greater than the number of loops that Y belongs to.
To examine how atom types influence the search process, the same set of target structures is searched both with and without hydrogens.
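
The loop-number criterion stated above can be written down directly. The sketch below assumes that the loops of a structure are already known and given as sets of atom identifiers; how the loops are detected is outside its scope. The names `loop_counts` and `atom_matches` and the sample data are hypothetical.

```python
def loop_counts(atoms, rings):
    """Count, for every atom, how many of the given loops contain it."""
    counts = {atom: 0 for atom in atoms}
    for ring in rings:
        for atom in ring:
            counts[atom] += 1
    return counts


def atom_matches(x_type, x_loops, y_type, y_loops):
    """X matches Y iff the types are equal and X lies in no more loops than Y."""
    return x_type == y_type and x_loops <= y_loops


# Assumed data: two loops that share the atoms t1 and t2; t5 lies in no loop.
counts = loop_counts(
    ["t1", "t2", "t3", "t4", "t5"],
    [{"t1", "t2", "t3"}, {"t1", "t2", "t4"}],
)
print(counts["t1"], counts["t3"], counts["t5"])  # 2 1 0

# A carbon that lies in one loop can match t1 (two loops) but not t5 (no loop).
print(atom_matches("C", 1, "C", counts["t1"]))  # True
print(atom_matches("C", 1, "C", counts["t5"]))  # False
```

The test is asymmetric: an atom from a single loop may match an atom shared by two loops, but not the other way round.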

7
Search with Hydrogens excluded
8
Search with Hydrogens excluded
9
Search with Hydrogens included
10
Search with Hydrogens included
11
Experimental results - answer
Is there any search speed-up due to the introduction of topology-based comparison criteria? YES

The search speed-up is evident when topology-based criteria are applied.
Oscillations in the search time indicate further potential for improving speed.
Exponential complexity remains (both curves show the same growth trend), but introducing topology-based criteria shifts the point of run-time explosion toward much more complex structures.
The relative improvement is higher when structures without hydrogens are considered. If such a conclusion holds for specific atom types, then much better results can be expected for specific substructure types.

12
Experimental results - question
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria?

Do topology-based comparison criteria improve the substructure similarity measure?
Compare the result sets obtained by searching with and without topology-based criteria, for the same set of target structures and the same set of database structures.

13
Target structure
14
Two of the resulting structures
The structure is eliminated
15
Experimental results - answer
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria? YES

The number of resulting structures decreases.
The probability that the expected structures are found in the result set increases.

16
Serializable hyper-graph

Different characteristic substructures are represented in a uniform way.
Efficient implementation of topology-based comparison criteria
Pointer-based data structure with no extra delay due to serialization
Persistent storage of such objects is straightforward.
Easy to adapt to any distributed-object technology

17
Hyper-graph definitions

Definition: A hyper-graph $HG$ is an ordered two-tuple
$HG = (C, E)$,
where $C$ is the set of hyper-graphs that are containers of $HG$, and $E$ is the set of hyper-graphs that are elements of $HG$:
$C = \{c \mid c > HG\}$, $E = \{e \mid e < HG\}$.
Definition: An undirected hyper-graph $HG$ is an ordered two-tuple
$HG = ((C, E), I)$,
where $(C, E)$ is a hyper-graph and $I$ is the set of undirected hyper-graphs that are neighbors of $HG$. We say that $HG$ is in an undirected connection relation with its neighbors.
Definition: The undirected connection relation is an equivalence relation.

18
Hyper-graph definitions (cont)

Definition: A directed hyper-graph $HG$ is an ordered three-tuple
$HG = ((C, E), I, O)$,
where $(C, E)$ is a hyper-graph, $I$ is the set of directed hyper-graphs that are input neighbors of $HG$, and $O$ is the set of directed hyper-graphs that are output neighbors of $HG$. We say that $HG$ is in a directed connection relation with its neighbors.
Definition: The directed connection relation is an order relation.
Note: We use the undirected hyper-graph in MCS.
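
The undirected definition can be mirrored by a small pointer-based class, sketched below. The class name `HyperGraph`, its attribute names, and the methods `add_element` and `connect` are assumptions made for illustration, not the presenter's class hierarchy. The sketch enforces only the symmetry of the connection relation, not the full equivalence property.

```python
class HyperGraph:
    """Undirected hyper-graph HG = ((C, E), I), held as object references."""

    def __init__(self, ident, kind):
        self.ident = ident
        self.kind = kind      # for example "VERTEX", "EDGE" or "GRAPH"
        self.containers = []  # C: hyper-graphs that contain this one
        self.elements = []    # E: hyper-graphs contained in this one
        self.neighbors = []   # I: hyper-graphs connected to this one

    def add_element(self, child):
        """Record that `child` is an element of self, and self its container."""
        self.elements.append(child)
        child.containers.append(self)

    def connect(self, other):
        """Put self and `other` in the (symmetric) connection relation."""
        if not any(n is other for n in self.neighbors):
            self.neighbors.append(other)
            other.neighbors.append(self)


# A two-vertex fragment: vertices v1 and v2 joined by the edge e12 inside G1.
g1 = HyperGraph("G1", "GRAPH")
v1 = HyperGraph("v1", "VERTEX")
v2 = HyperGraph("v2", "VERTEX")
e12 = HyperGraph("e12", "EDGE")
for part in (v1, v2, e12):
    g1.add_element(part)
v1.connect(e12)
v2.connect(e12)

print([n.ident for n in e12.neighbors])  # ['v1', 'v2']
```

Because vertices, edges and whole graphs are all instances of the same class, a graph can itself be an element or a neighbor of another hyper-graph, which is what gives the model its hierarchy.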

19
Hyper-graph example
[Figure: graph G1 with vertices v1 to v8 and edges e12, e23, e24, e35, e45, e46, e57, e67, e68]
id: v1; type: VERTEX; Container: G1; Elements: (none); InElements: e12
id: v2; type: VERTEX; Container: G1; Elements: (none); InElements: e12, e23, e24
. . .
id: e12; type: EDGE; Container: G1; Elements: (none); InElements: v1, v2
id: e23; type: EDGE; Container: G1; Elements: (none); InElements: v2, v3
. . .
id: G1; type: GRAPH; Container: (none); Elements: v1, ..., v8, e12, e23, ..., e68; InElements: (none)
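
Records with the fields id, type, Container, Elements and InElements can be read as a flat, pointer-free form of the object structure. The sketch below shows one way such records might be turned into linked objects and back again. It uses a trimmed two-vertex fragment of G1, so the v2 record lists only e12. The JSON encoding, the lower-case field names, and the names `Node`, `load` and `dump` are assumptions, not the presenter's serialization method.

```python
import json

# Trimmed fragment of G1 (assumed): two vertices and the edge between them.
RECORDS = [
    {"id": "G1", "type": "GRAPH", "container": None,
     "elements": ["v1", "v2", "e12"], "in_elements": []},
    {"id": "v1", "type": "VERTEX", "container": "G1",
     "elements": [], "in_elements": ["e12"]},
    {"id": "v2", "type": "VERTEX", "container": "G1",
     "elements": [], "in_elements": ["e12"]},
    {"id": "e12", "type": "EDGE", "container": "G1",
     "elements": [], "in_elements": ["v1", "v2"]},
]


class Node:
    def __init__(self, ident, kind):
        self.ident = ident
        self.kind = kind
        self.container = None
        self.elements = []
        self.in_elements = []


def load(records):
    """Create all objects first, then replace identifiers by references."""
    nodes = {r["id"]: Node(r["id"], r["type"]) for r in records}
    for r in records:
        node = nodes[r["id"]]
        node.container = nodes.get(r["container"])
        node.elements = [nodes[i] for i in r["elements"]]
        node.in_elements = [nodes[i] for i in r["in_elements"]]
    return nodes


def dump(nodes):
    """Flatten linked objects back into identifier-based records."""
    return [
        {
            "id": n.ident,
            "type": n.kind,
            "container": n.container.ident if n.container else None,
            "elements": [e.ident for e in n.elements],
            "in_elements": [e.ident for e in n.in_elements],
        }
        for n in nodes.values()
    ]


text = json.dumps(RECORDS)       # what would be stored or sent
nodes = load(json.loads(text))   # linked objects in memory
restored = dump(nodes)
print(restored == RECORDS)  # True
```

Loading in two passes lets records refer to each other in any order, including the mutual references between a vertex and its edge.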
20
Hyper-graph example (cont): after simple-loop reduction
[Figure: graph G1 and the reduced graph G2, whose nodes g1, g2, g3, g4 are connected by edges e1, e2, e3]
id: G2; type: GRAPH; Container: (none); Elements: g1, g2, g3, g4, e1, e2, e3, e4; InElements: (none)
id: e1; type: EDGE; Container: G2; Elements: v2; InElements: g1, g2
id: e2; type: EDGE; Container: G2; Elements: v4, v5, e45; InElements: g2, g3
id: g1; type: GRAPH; Container: G2; Elements: v1, v2, e12; InElements: e1
id: g2; type: LOOP; Container: G2; Elements: v2, v3, v4, v5, e23, e24, e35, e45; InElements: e1, e2
21
Hyper-graph class hierarchy
22
Conclusions

Experimental analysis confirmed the observation, already made in the literature, that topological information about a chemical structure (information about loops, in these experiments) can improve substructure similarity searching.
Because MCS is an NP-complete problem, the efficiency of the applied computing model is very important. Distributed objects are currently the most promising computational approach and should therefore be applied to substructure similarity search in chemical structure databases.
The proposed hyper-graph model can efficiently represent both the topology and the behavioral characteristics of a chemical structure, in a hierarchical way.
Thanks to an efficient serialization method, the object representation of the hyper-graph can be incorporated into any distributed technology (e.g. CORBA) without decreasing execution efficiency.

23
References
[ART92] Artymiuk, J., et al. (1992), Similarity searching of three-dimensional molecules and macromolecules, J. Chem. Inf. Comput. Sci., 32, 617-630.
[BAR93] Barnard, J.M. (1993), Substructure searching methods: Old and new, J. Chem. Inf. Comput. Sci., 33, 532-538.
[DOW96] Downs, G.M., and Willett, P. (1995), Similarity searching in databases of chemical structures, Rev. Comput. Chem., 7, 1-66.
[EST98] Estrada, E. (1998), Spectral moments of the edge adjacency matrix in molecular graphs, J. Chem. Inf. Comput. Sci., 38, 23-27.
[GWW98] Gillet, V.J., Wild, D.J., Willett, P., and Bradshaw, J. (1998), Similarity and dissimilarity methods for processing chemical structure databases, The Computer Journal, 41, No. 8, 547-558.
[HAG92] Hagadone, T.R. (1992), Molecule substructure similarity searching: Efficient retrieval in two-dimensional structure databases, J. Chem. Inf. Comput. Sci., 32, 515-521.
[PSV99] Papadimitriou, C.H., Suciu, D., and Vianu, V. (1999), Topological queries in spatial databases, Journal of Comput. and Sys. Sci., 58, 29-53.
[WAN98] Wang, T., and Zhou, J. (1998), 3DFS: A new 3D flexible searching system for use in drug design, J. Chem. Inf. Comput. Sci., 38, 71-77.
[XUJ96] Xu, J. (1996), GMA: A generic match algorithm for structural homomorphism, isomorphism, and maximal common substructure match and its applications, J. Chem. Inf. Comput. Sci., 36, 25-34.
