Title: Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search
1 Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search
Milorad Tosic, Ph.D.
Rutgers, The State University of New Jersey, Department of Chemistry
2 Databases of Chemical Structures: Similarity Searching Features
Values for each feature are listed from less general to more general.
- Size of the database: a few hundred thousand structures
- Nature of structure data: purified, consistent data; raw, inconsistent data
- Search type: structure search; substructure search (DOW96, BAR93); superstructure search (structures contained in the target structure); substructure similarity search (HAG92, GWW98, ART92)
- Type of similarity: graph isomorphism; subgraph isomorphism; maximal common subgraph
3 Substructure similarity search
DOW96, BAR93, HAG92, GWW98, ART92
- Screening search
  - Based on substructural features that are typically small fragment substructures
  - Many thousands of structures per second
  - Precedes the detailed and time-consuming atom-by-atom search
- Atom-by-atom search (MCS, Maximal Common Substructure search)
  - The MCS of a pair of structures is the largest substructure that is present in both structures.
  - The MCS is interpreted as a similarity measure between two structures that corresponds well to an intuitive notion of chemical similarity.
  - The MCS is our primary concern because of its importance for search quality and its exponential computational complexity.
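As a rough illustration of the two-stage search described on this slide, the following Python sketch screens a database by shared fragment features before the survivors are handed to the expensive atom-by-atom stage. The fragment keys, the `min_shared` threshold, and the function name `screen` are hypothetical; the slides do not say which screening features were actually used.

```python
from typing import Dict, FrozenSet, List


def screen(target: FrozenSet[str],
           database: Dict[str, FrozenSet[str]],
           min_shared: int) -> List[str]:
    """Return ids of database structures that share at least `min_shared`
    fragment keys with the target. Only cheap set intersections are used."""
    return [sid for sid, frags in database.items()
            if len(target & frags) >= min_shared]


# Hypothetical fragment keys; a real system derives them from the structures.
db = {
    "s1": frozenset({"C-C", "C=O", "ring6"}),
    "s2": frozenset({"C-N", "ring5"}),
    "s3": frozenset({"C-C", "ring6", "C-O"}),
}
candidates = screen(frozenset({"C-C", "ring6", "C=O"}), db, min_shared=2)
# Only the structures in `candidates` would go on to the MCS search.
```

The screen can only discard structures; ranking by similarity is left to the atom-by-atom stage.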
4 MCS - Maximal Common Substructure search
- NP-complete problem
  - Subgraph isomorphism is proven to be an NP-complete problem, which implies that the MCS problem is at least as hard.
  - Exponential worst-case running time for the known exact algorithms
- Average run time can be reduced by
  - Using a faster computer
  - Using various heuristics
  - Carrying out some computation in a pre-processing phase
BAR93, XUJ96
5 Our strategy for MCS search
- Back-tracking
  - Back-tracking is used as the common underlying algorithm for problems with exponential complexity.
- Distributed objects
  - Distributed computing is explored as a way to increase processing speed.
  - Persistent objects are essential for the robustness of the search engine.
- Topology-based comparison criteria
  - Topology-based features of chemical structures are attractive for describing structures efficiently.
  - Topological queries and indexing in collections of distributed objects are considered a promising approach in similar applications.
- Our heuristics for reducing the average search time, and for postponing the computational explosion to structures that are as large as possible, are based on substructure-by-substructure rather than atom-by-atom search.
XUJ96, EST98, WAN98, PSV99
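The back-tracking idea can be sketched in Python as follows. This is a minimal atom-by-atom illustration, not the search engine described in the slides: it assumes the MCS is taken as the largest common induced subgraph (not necessarily connected), and the names `mcs_backtrack`, `adj1`, `lab1`, and `match` are all hypothetical. The `match` parameter is where a different comparison criterion could be plugged in.

```python
from typing import Callable, Dict, Hashable, Set

Adjacency = Dict[Hashable, Set[Hashable]]


def mcs_backtrack(adj1: Adjacency, lab1: Dict[Hashable, str],
                  adj2: Adjacency, lab2: Dict[Hashable, str],
                  match: Callable[[str, str], bool] = lambda a, b: a == b
                  ) -> Dict[Hashable, Hashable]:
    """Return a largest node mapping graph1 -> graph2 that preserves
    labels (via `match`) and both edges and non-edges."""
    nodes1 = list(adj1)
    best: Dict[Hashable, Hashable] = {}

    def extend(i: int, mapping: Dict[Hashable, Hashable]) -> None:
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        # Bound: stop if even matching every remaining node cannot beat `best`.
        if len(mapping) + (len(nodes1) - i) <= len(best):
            return
        u = nodes1[i]
        used = set(mapping.values())
        for v in adj2:
            if v in used or not match(lab1[u], lab2[v]):
                continue
            # u-v must agree with every pair already in the mapping.
            if all((p in adj1[u]) == (q in adj2[v]) for p, q in mapping.items()):
                mapping[u] = v
                extend(i + 1, mapping)
                del mapping[u]          # back-track
        extend(i + 1, mapping)          # also try leaving u unmatched

    extend(0, {})
    return best


# Two three-atom chains: C-C-O and C-C-N share the C-C fragment.
adj1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
lab1 = {"a": "C", "b": "C", "c": "O"}
adj2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
lab2 = {"x": "C", "y": "C", "z": "N"}
m = mcs_backtrack(adj1, lab1, adj2, lab2)
```

The worst case remains exponential; the bound only trims branches that cannot improve on the best mapping found so far.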
6 Experimental results - question
Is there any search speed-up due to the introduction of topology-based comparison criteria?
- Compare search time with and without topology-based criteria, for the same set of target structures and the same set of database structures.
- The topology criterion used is based on loop number: an atom X matches an atom Y iff they have the same atom type and the number of loops that X belongs to is not greater than the number of loops that Y belongs to.
- To examine how atom types influence the search process, the same set of target structures is used both with and without hydrogens.
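The loop-number criterion stated on this slide can be written as a small predicate. In the Python sketch below, the `Atom` class and both function names are hypothetical, and the loop count of each atom is assumed to have been computed beforehand (the slides do not describe how loops are perceived).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str  # atom type, e.g. "C"
    loops: int    # number of loops the atom belongs to (assumed precomputed)


def type_only_match(x: Atom, y: Atom) -> bool:
    """Baseline criterion: atom types must be equal."""
    return x.element == y.element


def loop_match(x: Atom, y: Atom) -> bool:
    """Topology-based criterion: same atom type, and X belongs to
    no more loops than Y does."""
    return x.element == y.element and x.loops <= y.loops


chain_c = Atom("C", loops=0)
ring_c = Atom("C", loops=1)
ring_n = Atom("N", loops=1)
```

Note that the criterion is not symmetric: `loop_match(chain_c, ring_c)` holds while `loop_match(ring_c, chain_c)` does not, so it matters which structure supplies X and which supplies Y.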
7 Search with hydrogens excluded
8 Search with hydrogens excluded
9 Search with hydrogens included
10 Search with hydrogens included
11 Experimental results - answer
Is there any search speed-up due to the introduction of topology-based comparison criteria? - YES
- A search speed-up is evident when topology-based criteria are applied.
- Oscillations in search time indicate further potential for improving speed.
- Exponential complexity remains (both curves show the same growth trend), but introducing topology-based criteria shifts the point of run-time explosion toward much more complex structures.
- The relative improvement is higher when structures without hydrogens are considered. If such a conclusion can be drawn for specific atom types, then much better results can be expected for specific substructure types.
12 Experimental results - question
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria?
- Do topology-based comparison criteria improve the substructure similarity measure?
- Compare the sets of resulting structures obtained by searching with and without topology-based criteria, for the same set of target structures and the same set of database structures.
13 Target structure
14 Two of the resulting structures
The structure is eliminated
15 Experimental results - answer
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria? - YES
- The number of resulting structures decreases.
- The probability that the expected structures are found in the set of resulting structures increases.
16 Serializable hyper-graph
- Different characteristic substructures are represented in a uniform way
- Efficient implementation of topology-based comparison criteria
- Pointer-based data structure with no extra delay due to serialization
- Persistent storage of such objects is straightforward
- Easy to adapt to any distributed-object technology
17 Hyper-graph definitions
- Definition: A hyper-graph HG is an ordered two-tuple
  HG = (C, E),
  where C is the set of hyper-graphs that are containers of HG, and E is the set of hyper-graphs that are elements of HG:
  C = {c | c > HG}, E = {e | e < HG}
- Definition: An undirected hyper-graph HG is an ordered two-tuple
  HG = ((C, E), I),
  where (C, E) is a hyper-graph and I is the set of undirected hyper-graphs that are neighbors of HG. We say that HG is in an undirected connection relation with its neighbors.
- Definition: The undirected connection relation is an equivalence relation.
18 Hyper-graph definitions (cont)
- Definition: A directed hyper-graph HG is an ordered three-tuple
  HG = ((C, E), I, O),
  where (C, E) is a hyper-graph, I is the set of directed hyper-graphs that are input neighbors of HG, and O is the set of directed hyper-graphs that are output neighbors of HG. We say that HG is in a directed connection relation with its neighbors.
- Definition: The directed connection relation is an order relation.
- Note: We use the undirected hyper-graph in MCS.
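One way to picture the undirected definition as an object is the Python sketch below. It is an assumption-laden illustration, not the class hierarchy from the slides: the class name `HyperGraph`, its field names, and the helper methods are hypothetical, and only the undirected case (C, E, I) is modelled. The directed variant would add a separate output-neighbor set O.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Set


@dataclass(eq=False)  # identity-based hashing, so objects can be stored in sets
class HyperGraph:
    name: str
    kind: str = "GRAPH"
    containers: Set[HyperGraph] = field(default_factory=set, repr=False)  # C
    elements: Set[HyperGraph] = field(default_factory=set, repr=False)    # E
    neighbors: Set[HyperGraph] = field(default_factory=set, repr=False)   # I

    def add_element(self, e: HyperGraph) -> None:
        """Record that `e` is an element of self and self a container of `e`."""
        self.elements.add(e)
        e.containers.add(self)

    def connect(self, other: HyperGraph) -> None:
        """Put self and `other` in the undirected connection relation."""
        self.neighbors.add(other)
        other.neighbors.add(self)


g1 = HyperGraph("G1")
v1 = HyperGraph("v1", "VERTEX")
v2 = HyperGraph("v2", "VERTEX")
e12 = HyperGraph("e12", "EDGE")
for part in (v1, v2, e12):
    g1.add_element(part)
v1.connect(e12)
v2.connect(e12)
```

Because vertices, edges, and whole graphs are all the same kind of object, a sub-graph can later be used as an element of a larger graph without any special casing. The sketch keeps the relation symmetric but does not enforce the other properties of an equivalence relation.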
19 Hyper-graph example
[Figure: graph G1 with vertices v1, ..., v8 and edges e12, e23, e24, e35, e45, e46, e57, e67, e68]
Object records shown in the figure:
- id: v1, type: VERTEX, Container: G1, Elements: (none), InElements: e12
- id: v2, type: VERTEX, Container: G1, Elements: (none), InElements: e12, e23, e24
- ...
- id: e12, type: EDGE, Container: G1, Elements: (none), InElements: v1, v2
- id: e23, type: EDGE, Container: G1, Elements: (none), InElements: v2, v3
- ...
- id: G1, type: GRAPH, Container: (none), Elements: v1, ..., v8, e12, e23, ..., e68, InElements: (none)
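The records of this example can be reproduced with a few lines of Python. The edge list is read off the labels in the figure; the dictionary field names and the use of JSON as the storage format are assumptions made for illustration, since the slides do not describe the serialization method itself.

```python
import json

# Edge labels as they appear in the figure; "e12" joins v1 and v2, and so on.
EDGES = ["e12", "e23", "e24", "e35", "e45", "e46", "e57", "e67", "e68"]
VERTICES = [f"v{i}" for i in range(1, 9)]


def build_records() -> dict:
    """Build one flat, id-keyed record per object of graph G1."""
    records = {"G1": {"type": "GRAPH", "container": None,
                      "elements": VERTICES + EDGES, "in_elements": []}}
    for v in VERTICES:
        records[v] = {"type": "VERTEX", "container": "G1",
                      "elements": [], "in_elements": []}
    for e in EDGES:
        ends = [f"v{e[1]}", f"v{e[2]}"]
        records[e] = {"type": "EDGE", "container": "G1",
                      "elements": [], "in_elements": ends}
        for v in ends:
            records[v]["in_elements"].append(e)
    return records


records = build_records()
restored = json.loads(json.dumps(records))  # round trip through text
```

Since every reference is stored as an id rather than a pointer, the records survive the round trip unchanged and could be written to any persistent store.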
20 Hyper-graph example (cont): after simple-loop reduction
[Figure: the vertices and edges of G1 regrouped into sub-graphs g1, g2, g3, g4, which are connected by edges e1, e2, e3 and form graph G2]
Object records shown in the figure:
- id: G2, type: GRAPH, Container: (none), Elements: g1, g2, g3, g4, e1, e2, e3, e4, InElements: (none)
- id: e1, type: EDGE, Container: G2, Elements: v2, InElements: g1, g2
- id: g1, type: GRAPH, Container: G2, Elements: v1, v2, e12, InElements: e1
- id: g2, type: LOOP, Container: G2, Elements: v2, v3, v4, v5, e23, e24, e35, e45, InElements: e1, e2
- id: e2, type: EDGE, Container: G2, Elements: v4, v5, e45, InElements: g2, g3
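In the records listed for this slide, the elements of e1 are exactly the objects shared by g1 and g2, and the elements of e2 are exactly those shared by g2 and g3. The Python sketch below assumes this holds in general and derives the connecting edges of G2 as pairwise intersections. The contents of g1 and g2 come from the slide; the contents of g3 and g4 are not listed there and are inferred from the edge labels in the figure, so they should be treated as assumptions. The function name `connecting_edges` is hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

groups: Dict[str, Set[str]] = {
    "g1": {"v1", "v2", "e12"},                                   # from the slide
    "g2": {"v2", "v3", "v4", "v5", "e23", "e24", "e35", "e45"},  # from the slide
    "g3": {"v4", "v5", "v6", "v7", "e45", "e46", "e57", "e67"},  # assumed
    "g4": {"v6", "v8", "e68"},                                   # assumed
}


def connecting_edges(groups: Dict[str, Set[str]]
                     ) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of overlapping groups to the objects they share."""
    return {(a, b): groups[a] & groups[b]
            for a, b in combinations(sorted(groups), 2)
            if groups[a] & groups[b]}


shared = connecting_edges(groups)
```

Under these assumptions the sketch yields three connecting edges, which agrees with the three edge labels e1, e2, e3 visible in the figure.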
21 Hyper-graph class hierarchy
22 Conclusions
- The experimental analysis confirmed the fact, already pointed out in the literature, that topological information about a chemical structure (information about loops, in these experiments) can improve substructure similarity searching.
- Because MCS is an NP-complete problem, the efficiency of the applied computing model is very important. Distributed objects are currently the most promising computational approach; hence, they should be applied to substructure similarity search in chemical structure databases.
- The proposed hyper-graph model can efficiently represent both the topology and the behavioral characteristics of a chemical structure, in a hierarchical way.
- Owing to an efficient serialization method, the object representation of the hyper-graph can be incorporated into any distributed technology (e.g. CORBA) without decreasing execution efficiency.
23 References
ART92 Artymiuk, J., et al. (1992). Similarity searching of three-dimensional molecules and macromolecules. J. Chem. Inf. Comput. Sci., 32, 617-630.
BAR93 Barnard, J.M. (1993). Substructure searching methods: Old and new. J. Chem. Inf. Comput. Sci., 33, 532-538.
DOW96 Downs, G.M., and Willett, P. (1995). Similarity searching in databases of chemical structures. Rev. Comput. Chem., 7, 1-66.
EST98 Estrada, E. (1998). Spectral moments of the edge adjacency matrix in molecular graphs. J. Chem. Inf. Comput. Sci., 38, 23-27.
GWW98 Gillet, V.J., Wild, D.J., Willett, P., and Bradshaw, J. (1998). Similarity and dissimilarity methods for processing chemical structure databases. The Computer Journal, 41, No. 8, 547-558.
HAG92 Hagadone, T.R. (1992). Molecule substructure similarity searching: Efficient retrieval in two-dimensional structure databases. J. Chem. Inf. Comput. Sci., 32, 515-521.
PSV99 Papadimitriou, C.H., Suciu, D., and Vianu, V. (1999). Topological queries in spatial databases. Journal of Comput. and Sys. Sci., 58, 29-53.
WAN98 Wang, T., and Zhou, J. (1998). 3DFS: A new 3D flexible searching system for use in drug design. J. Chem. Inf. Comput. Sci., 38, 71-77.
XUJ96 Xu, J. (1996). GMA: A generic match algorithm for structural homomorphism, isomorphism, and maximal common substructure match and its applications. J. Chem. Inf. Comput. Sci., 36, 25-34.