Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search

1
Persistent object-oriented hyper-graph model for
Maximal Common Substructure (MCS) search
Milorad Tosic, Ph.D. Rutgers, The State
University of New Jersey Department of Chemistry
2
Databases of Chemical Structures Similarity
Searching Features
Less general
Size of the database
Nature of structures data
Search type
Type of similarity
Couple of hundreds of thousands of structures
Purified, consistent data
Structure search
Graph isomorphism
Substructure search DOW96, BAR93
Raw, inconsistent data
Subgraph isomorphism
Superstructure search (structures contained in
target structure)
Substructure similarity search HAG92, GWW98,
ART92
Maximal common subgraph
More general
3
Substructure similarity search
DOW96, BAR93, HAG92, GWW98, ART92
  • screening search
  • based on substructural features that are
    typically small, fragment substructures
  • many thousands of structures per second
  • precedes detailed and time-consuming atom-by-atom
    search
  • atom-by-atom search (Maximal Common
    Substructure search, MCS)
  • The MCS of a pair of structures is the largest
    substructure that is present in both structures.
  • The MCS is interpreted as a similarity measure
    between two structures that corresponds well
    to an intuitive notion of chemical similarity.
  • The MCS is our primary concern because of its
    importance for search quality and its
    exponential computational complexity.

4
MCS - Maximal Common Substructure search
  • NP-complete problem
  • Subgraph isomorphism is proven to be an NP-complete
    problem and is a special case of MCS, which implies
    that the decision version of MCS is also
    NP-complete
  • (At least) exponential worst-case run-time for
    known exact algorithms
  • Average run-time can be reduced in three ways:
  • Use a faster computer
  • Use various heuristics
  • Carry out some computation in a pre-processing phase

BAR93
XUJ96
BAR93
5
Our strategy for MCS search
  • Back-tracking
  • Back-tracking is used as a common underlying
    algorithm for problems with exponential
    complexity
  • Distributed objects
  • Distributed computing is explored to increase
    processing speed
  • Persistent objects are essential for the robustness
    of the search engine
  • Topology-based comparison criteria
  • Topology-based features of chemical structures
    are attractive for describing structures
    efficiently
  • Topological queries and indexing in collections of
    distributed objects are considered a promising
    approach in similar applications
  • Our heuristics for reducing average search
    time, and for postponing the computational explosion
    to structures that are as large as possible, are
    based on substructure-by-substructure rather than
    atom-by-atom search

XUJ96, EST98, WAN98
PSV99
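
The back-tracking idea above can be sketched in a few lines. This is a minimal illustration, not the search engine described in these slides: it assumes graphs are given as adjacency dictionaries, it looks for the largest common induced subgraph (connectivity is not enforced), and the names `max_common_subgraph` and `compatible` are hypothetical. The `compatible` hook is where an atom-level comparison criterion, such as a topology-based one, would plug in.

```python
from typing import Callable, Dict, Hashable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of adjacent nodes


def max_common_subgraph(
    g1: Graph,
    g2: Graph,
    labels1: Dict[Hashable, str],
    labels2: Dict[Hashable, str],
    compatible: Optional[Callable[[Hashable, Hashable], bool]] = None,
) -> Dict[Hashable, Hashable]:
    """Return a largest node mapping g1 -> g2 that preserves adjacency and
    non-adjacency (a common induced subgraph). Worst case is exponential."""
    if compatible is None:
        # Default criterion (assumption): atom types must be equal.
        compatible = lambda a, b: labels1[a] == labels2[b]

    nodes1 = list(g1)
    best: Dict[Hashable, Hashable] = {}

    def extend(i: int, mapping: Dict, used: Set) -> None:
        nonlocal best
        # Bound: even mapping every remaining node cannot beat the best so far.
        if len(mapping) + (len(nodes1) - i) <= len(best):
            return
        if i == len(nodes1):
            best = dict(mapping)  # strictly larger, thanks to the bound above
            return
        a = nodes1[i]
        for b in g2:
            if b in used or not compatible(a, b):
                continue
            # The new pair must agree with every pair already in the mapping.
            if all((a2 in g1[a]) == (b2 in g2[b]) for a2, b2 in mapping.items()):
                mapping[a] = b
                used.add(b)
                extend(i + 1, mapping, used)
                del mapping[a]  # back-track
                used.remove(b)
        extend(i + 1, mapping, used)  # also try leaving node a unmatched

    extend(0, {}, set())
    return best


# Usage: a C-C-O chain against an N-C-C-O chain.
chain1 = {0: {1}, 1: {0, 2}, 2: {1}}
chain2 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
mcs = max_common_subgraph(
    chain1, chain2,
    {0: "C", 1: "C", 2: "O"},
    {0: "N", 1: "C", 2: "C", 3: "O"},
)
```

A stricter `compatible` predicate rejects more candidate pairs early, which is how an additional comparison criterion can prune the search tree without changing the algorithm.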
6
Experimental results - question
Is there any searching speed-up due to
introduction of topology-based comparison
criteria ?
  • Compare searching time with and without
    topology-based criteria, for the same set of
    target structures and the same set of database
    structures.
  • The topology criterion used is based on loop
    number: an atom X matches an atom Y iff they have
    the same atom type and the number of loops that X
    belongs to is not greater than the number that Y
    belongs to.
  • To examine how atom types influence the
    search, the same set of target
    structures is searched both with and
    without hydrogens.
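
The loop-number criterion stated above is small enough to write out directly. The following is a sketch under stated assumptions: the `Atom` record and the `atoms_match` function are hypothetical names, and the per-atom loop count is assumed to be precomputed elsewhere (ring perception is not shown).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str     # atom type, e.g. "C" or "H"
    loop_count: int  # number of loops the atom belongs to (assumed precomputed)


def atoms_match(x: Atom, y: Atom, use_topology: bool = True) -> bool:
    """X matches Y iff the atom types are equal and, when the topology
    criterion is enabled, X belongs to no more loops than Y does."""
    if x.element != y.element:
        return False
    if use_topology and x.loop_count > y.loop_count:
        return False
    return True


chain_carbon = Atom("C", 0)
ring_carbon = Atom("C", 1)
fused_ring_carbon = Atom("C", 2)
```

Note that the criterion is asymmetric: a chain carbon in the target may match a ring carbon in the database structure, but not the other way round. Passing `use_topology=False` gives the baseline comparison on atom type only, which is the "without topology-based criteria" side of the experiment.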

7
Search with Hydrogens excluded
8
Search with Hydrogens excluded
9
Search with Hydrogens included
10
Search with Hydrogens included
11
Experimental results - answer
Is there any searching speed-up due to
introduction of topology-based comparison
criteria ? - YES
  • A search speed-up is evident when topology-based
    criteria are applied.
  • Oscillations in search time indicate further
    potential for improving speed.
  • Exponential complexity remains (both curves show
    the same growth trend), but introducing
    topology-based criteria shifts the point of
    run-time explosion toward much
    more complex structures.
  • The relative improvement is higher when
    structures without hydrogens are considered. If
    such a conclusion holds for specific atom
    types, then much better results can be expected
    for specific substructure types.

12
Experimental results - question
Is there any improvement in quality of the
searching results due to introduction of
topology-based comparison criteria ?
  • Do topology-based comparison criteria improve
    the substructure similarity measure?
  • Compare structures from the sets of resulting
    structures obtained by searching with and without
    topology-based criteria, for the same set of
    target structures and the same set of database
    structures.

13
Target structure
14
Two of the resulting structures
The structure is eliminated
15
Experimental results - answer
Is there any improvement in quality of the
searching results due to introduction of
topology-based comparison criteria ? - YES
  • Fewer resulting structures.
  • Increased probability that the expected structures
    are found in the set of resulting structures.

16
Serializable hyper-graph
  • Different characteristic substructures are
    represented in a uniform way
  • Efficient implementation of topology-based
    comparison criteria
  • Pointer-based data structure with no extra delay
    due to serialization
  • Persistent storage of such objects is
    straightforward
  • Easy to adapt to any distributed-object
    technology

17
Hyper-graph definitions
  • Definition: A hyper-graph HG is an ordered
    two-tuple
  • $HG = (C, E)$,
  • where C is the set of hyper-graphs that are
    containers of HG, and E is the set of hyper-graphs
    that are elements of HG:
  • $C = \{c \mid c > HG\}$, $E = \{e \mid e < HG\}$
  • Definition: An undirected hyper-graph HG is an
    ordered two-tuple
  • $HG = ((C, E), I)$,
  • where (C, E) is a hyper-graph, and I is the set of
    undirected hyper-graphs that are neighbors of
    HG. We say that HG is in an undirected connection
    relation with its neighbors.
  • Definition: The undirected connection relation is
    an equivalence relation.

18
Hyper-graph definitions (cont)
  • Definition: A directed hyper-graph HG is an
    ordered three-tuple
  • $HG = ((C, E), I, O)$,
  • where (C, E) is a hyper-graph, I is the set of directed
    hyper-graphs that are input neighbors of HG,
    and O is the set of directed hyper-graphs that are
    output neighbors of HG. We say that HG is in a
    directed connection relation with its neighbors.
  • Definition: The directed connection relation is
    an order relation.
  • Note: We use the undirected hyper-graph in MCS.
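
The undirected definition can be mirrored by a small pointer-based class. This is a minimal sketch, not the class hierarchy of the presentation: the class name, field names, and the `add_element` and `connect` helpers are assumptions chosen to follow the symbols C, E, and I.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(eq=False)  # identity-based equality and hashing, so objects fit in sets
class HyperGraph:
    id: str
    kind: str = "GRAPH"  # e.g. "VERTEX", "EDGE", "LOOP", "GRAPH"
    containers: Set["HyperGraph"] = field(default_factory=set, repr=False)  # C
    elements: Set["HyperGraph"] = field(default_factory=set, repr=False)    # E
    neighbors: Set["HyperGraph"] = field(default_factory=set, repr=False)   # I

    def add_element(self, child: "HyperGraph") -> None:
        """Record that child < self, keeping C and E mutually consistent."""
        self.elements.add(child)
        child.containers.add(self)

    def connect(self, other: "HyperGraph") -> None:
        """Undirected connection: each side lists the other as a neighbor."""
        self.neighbors.add(other)
        other.neighbors.add(self)


# Usage: two vertices joined by one edge, all contained in a graph G1.
G1 = HyperGraph("G1", "GRAPH")
v1, v2 = HyperGraph("v1", "VERTEX"), HyperGraph("v2", "VERTEX")
e12 = HyperGraph("e12", "EDGE")
for part in (v1, v2, e12):
    G1.add_element(part)
v1.connect(e12)
v2.connect(e12)
```

Because vertices, edges, loops, and whole graphs are all instances of the same class, a substructure can stand in wherever an atom could, which is what a substructure-by-substructure comparison needs.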

19
Hyper-graph example
[Figure: graph G1 with vertices v1, ..., v8 and edges e12, e23, e24, e35, e45, e46, e57, e67, e68]
id: v1; type: VERTEX; Container: G1; Elements: (empty); InElements: e12
id: v2; type: VERTEX; Container: G1; Elements: (empty); InElements: e12, e23, e24
. . .
id: e12; type: EDGE; Container: G1; Elements: (empty); InElements: v1, v2
id: e23; type: EDGE; Container: G1; Elements: (empty); InElements: v2, v3
. . .
id: G1; type: GRAPH; Container: (empty); Elements: v1, ..., v8, e12, e23, ..., e68; InElements: (empty)
20
Hyper-graph example (cont): after simple-loop
reduction
[Figure: graph G1 (vertices v1, ..., v8; edges e12, ..., e68) grouped into hyper-graphs g1, g2, g3, g4, and the reduced graph G2 in which g1, ..., g4 are joined by edges e1, e2, e3]
id: G2; type: GRAPH; Container: (empty); Elements: g1, g2, g3, g4, e1, e2, e3, e4; InElements: (empty)
id: g1; type: GRAPH; Container: G2; Elements: v1, v2, e12; InElements: e1
id: g2; type: LOOP; Container: G2; Elements: v2, v3, v4, v5, e23, e24, e35, e45; InElements: e1, e2
id: e1; type: EDGE; Container: G2; Elements: v2; InElements: g1, g2
id: e2; type: EDGE; Container: G2; Elements: v4, v5, e45; InElements: g2, g3
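
One way the reduced records could support the loop-number criterion is to count, for each atom, the LOOP-type hyper-graphs that contain it. The sketch below is illustrative only: the dictionary layout and the `loop_count` function are hypothetical, the g1 and g2 entries copy the records above, and the g3 entry is an assumption inferred from the edge labels in the figure (no g3 record appears in the transcript).

```python
from typing import Dict

# Records keyed by id; only the fields needed for loop counting are kept.
records: Dict[str, dict] = {
    "g1": {"type": "GRAPH", "elements": {"v1", "v2", "e12"}},
    "g2": {"type": "LOOP",
           "elements": {"v2", "v3", "v4", "v5", "e23", "e24", "e35", "e45"}},
    # ASSUMPTION: g3 is not listed in the transcript; this entry is inferred
    # from the figure's edge labels e45, e46, e57, e67.
    "g3": {"type": "LOOP",
           "elements": {"v4", "v5", "v6", "v7", "e45", "e46", "e57", "e67"}},
}


def loop_count(atom_id: str, recs: Dict[str, dict]) -> int:
    """Number of LOOP-type hyper-graphs whose element set contains atom_id."""
    return sum(
        1 for rec in recs.values()
        if rec["type"] == "LOOP" and atom_id in rec["elements"]
    )


counts = {v: loop_count(v, records) for v in ("v1", "v2", "v4")}
```

Under these assumptions v1 lies on no loop, v2 on one, and v4 on two, so v4 could only be matched against atoms that also belong to at least two loops.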
21
Hyper-graph class hierarchy
22
Conclusions
  • Experimental analysis confirmed the observation
    made in the literature that topological
    information about a chemical structure (information
    about loops, in these experiments) can improve
    substructure similarity searching.
  • Because the MCS is an NP-complete problem, the
    efficiency of the applied computing model is very
    important. Distributed objects are currently the
    most promising computational approach. Hence, they
    should be applied to substructure similarity
    search in chemical structure databases.
  • The proposed hyper-graph model is able to
    efficiently represent both the topology and the
    behavioral characteristics of a chemical
    structure, in a hierarchical way.
  • Due to an efficient serialization method, the object
    representation of the hyper-graph can be
    incorporated into any distributed technology (e.g.
    CORBA) without decreasing execution efficiency.

23
References
ART92 Artymiuk, J., et al. (1992),
Similarity searching of three-dimensional
molecules and macromolecules, J. Chem. Inf.
Comput. Sci., 32, 617-630.
BAR93 Barnard, J.M. (1993), Substructure
searching methods: Old and new, J. Chem. Inf.
Comput. Sci., 33, 532-538.
DOW96 Downs, G.M., and Willett, P. (1995),
Similarity searching in databases of chemical
structures, Rev. Comput. Chem., 7, 1-66.
EST98 Estrada, E. (1998), Spectral moments of
the edge adjacency matrix in molecular graphs,
J. Chem. Inf. Comput. Sci., 38, 23-27.
GWW98 Gillet, V.J., Wild, D.J., Willett, P., and
Bradshaw, J. (1998), Similarity and dissimilarity
methods for processing chemical structure
databases, The Computer Journal, 41, No. 8,
547-558.
HAG92 Hagadone, T.R. (1992), Molecule
substructure similarity searching: Efficient
retrieval in two-dimensional structure
databases, J. Chem. Inf. Comput. Sci., 32,
515-521.
PSV99 Papadimitriou, C.H., Suciu, D., and
Vianu, V. (1999), Topological queries in spatial
databases, Journal of Comput. and Sys. Sci.,
58, 29-53.
WAN98 Wang, T., and Zhou, J. (1998), 3DFS: A
new 3D flexible searching system for use in drug
design, J. Chem. Inf. Comput. Sci., 38, 71-77.
XUJ96 Xu, J. (1996), GMA: A generic match
algorithm for structural homomorphism,
isomorphism, and maximal common substructure
match and its applications, J. Chem. Inf.
Comput. Sci., 36, 25-34.