Mining Frequent Subgraphs - PowerPoint PPT Presentation

Title:

Mining Frequent Subgraphs


Slides: 28
Provided by: McMi83

Transcript and Presenter's Notes


1
Mining Frequent Subgraphs
  • EECS435
  • Fall 2008

2
Overview
  • Introduction
  • Finding recurring subgraphs from graph databases.
  • gSpan
  • FFSM

3
Labeled Graph
  • We define a labeled graph G as a five-element
    tuple G = (V, E, Σ_V, Σ_E, δ) where
  • V is the set of vertices of G,
  • E ⊆ V × V is the set of undirected edges of G,
  • Σ_V (Σ_E) is the set of vertex (edge) labels,
  • δ is the labeling function V → Σ_V and E → Σ_E
    that maps vertices and edges to their labels.
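
The definition above can be sketched in Python as follows. This is a minimal illustration, assuming integer vertex ids and string labels; the class name LabeledGraph and its methods are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph: vertex labels plus labeled edges."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        # Both endpoints must already exist; self-loops are not modelled here.
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added before the edge")
        if u == v:
            raise ValueError("self-loops are not supported in this sketch")
        # A frozenset key makes (u, v) and (v, u) the same undirected edge.
        self.edge_labels[frozenset((u, v))] = label

    def label(self, *item):
        """Labeling function: label(v) for a vertex, label(u, v) for an edge."""
        if len(item) == 1:
            return self.vertex_labels[item[0]]
        return self.edge_labels[frozenset(item)]


g = LabeledGraph()
g.add_vertex(0, "a")
g.add_vertex(1, "b")
g.add_edge(0, 1, "y")
print(g.label(0), g.label(1), g.label(1, 0))  # a b y
```

Storing each edge under an unordered key is one simple way to respect the "undirected" part of the definition; an adjacency-list layout would work equally well.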

4
Frequent Subgraph Mining
Input: A set GD of labeled undirected graphs
  • σ = 2/3

Output: All frequent subgraphs (w.r.t. σ) from
GD.

5
Finding Frequent Subgraphs
  • Given a graph database GD = {G_0, G_1, …, G_n}, find
    all subgraphs appearing in at least a fraction σ of the graphs.
  • Isomorphic subgraphs are considered the same
    subgraph.
  • Apriori approaches
  • Generation of subgraph candidates is complicated
    and expensive.
  • Subgraph isomorphism is an NP-complete problem,
    so pruning is expensive.
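
As a rough illustration of support counting, the sketch below handles only the simplest case: single-edge patterns, which are fully described by a label triple and therefore need no isomorphism test. The graph encoding (a dict of vertex labels plus a list of edge triples), the function names, and the toy database are all assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction


def one_edge_patterns(vertex_labels, edges):
    """Return the set of distinct 1-edge patterns occurring in one graph.

    A pattern is (smaller endpoint label, edge label, larger endpoint label),
    so both orientations of an undirected edge map to the same pattern.
    """
    patterns = set()
    for u, v, edge_label in edges:
        lu, lv = sorted((vertex_labels[u], vertex_labels[v]))
        patterns.add((lu, edge_label, lv))
    return patterns


def frequent_one_edge_patterns(database, sigma):
    """Patterns that occur in at least a fraction `sigma` of the graphs."""
    counts = Counter()
    for vertex_labels, edges in database:
        # Each graph contributes at most once per pattern.
        counts.update(one_edge_patterns(vertex_labels, edges))
    n = len(database)
    return {p: Fraction(c, n) for p, c in counts.items() if Fraction(c, n) >= sigma}


# Toy database of three graphs (made up for illustration).
GD = [
    ({0: "a", 1: "b", 2: "b"}, [(0, 1, "y"), (1, 2, "x")]),
    ({0: "b", 1: "a"}, [(0, 1, "y")]),
    ({0: "b", 1: "b", 2: "c"}, [(0, 1, "x"), (1, 2, "y")]),
]
result = frequent_one_edge_patterns(GD, Fraction(2, 3))
for pattern, support in sorted(result.items()):
    print(pattern, support)
```

For larger patterns the per-graph test becomes a subgraph isomorphism check, which is exactly the expensive step the slide refers to.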

6
gSpan
  • DFS without candidate generation
  • Relabels graph representation to support DFS.
  • Discovers all frequent subgraphs without
    candidate generation or pruning.
  • DFS Representation
  • Map each graph to a DFS code (sequence).
  • Lexicographically order the codes.
  • Construct a search tree based on the
    lexicographic order.

7
Depth-First Search Tree
[Figure: panels (a) to (d)]

8
DFS Codes
  • Given e_1 = (i_1, j_1) and e_2 = (i_2, j_2), e_1 < e_2 if
  • i_1 = i_2 and j_1 < j_2, or
  • i_1 < j_1 and j_1 = i_2
  • code(G, T) = the edge sequence (e_i) with e_i < e_{i+1}

[Figure: panels (a) to (d)]

9
DFS Lexicographic Order
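  • α = code(G_α, T_α) = (a_0, a_1, …, a_m)
  • β = code(G_β, T_β) = (b_0, b_1, …, b_n)
  • α ≤ β iff (1) or (2)
  • (1) there is a t, 0 ≤ t ≤ min(m, n), with a_k = b_k for all k < t and a_t < b_t in the edge order
  • (2) a_k = b_k for all 0 ≤ k ≤ m, and n ≥ m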
  • Minimum DFS code
  • The minimum DFS code min(G), in DFS lexicographic
    order, is the canonical label of graph G.
  • Graphs A and B are isomorphic iff min(A) = min(B).

10
DFS Codes: Parents and Children
  • If α = (a_0, a_1, …, a_m) and β = (a_0, a_1, …, a_m, b)
  • β is the child of α.
  • α is the parent of β.
  • A valid DFS code requires that b grows from a
    vertex on the rightmost path.
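
The rightmost path can be recovered from the index pairs of a DFS code alone. The sketch below assumes the usual gSpan conventions, which the slides do not spell out: vertices are numbered from 0 in discovery order, a forward edge has i < j, and a backward edge has i > j. Labels are omitted, and the function name is hypothetical.

```python
def rightmost_path(code):
    """Return the rightmost path (root first) of a DFS code.

    `code` is a list of (i, j) DFS-index pairs in DFS-code order.
    """
    path = []
    expected = None          # vertex whose incoming forward edge we still need
    for i, j in reversed(code):
        if i >= j:
            continue         # backward edge: never part of the rightmost path
        if expected is None:
            # The last forward edge ends at the rightmost vertex.
            path.append(j)
        elif j != expected:
            continue         # forward edge into a different branch
        path.append(i)
        expected = i
    return path[::-1]


code = [(0, 1), (1, 2), (2, 0), (1, 3)]
print(rightmost_path(code))  # [0, 1, 3]
```

In gSpan, a child code extends its parent either with a backward edge from the rightmost vertex to another vertex on this path, or with a forward edge from any vertex on the path.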

11
DFS Code Trees
  • Organize DFS code nodes as parent-child.
  • Pre-order traversal follows DFS lexicographic
    order.
  • If s and s' are the same graph with different DFS
    codes and s' is not the minimum, s' can be pruned.

12
gSpan
  • D is the set of all graphs.
  • S is the result set.

Algorithm 1: GraphSet_Projection(D, S)
  sort labels in D by frequency
  remove infrequent vertices and edges
  relabel remaining vertices and edges
  S1 ← all frequent 1-edge graphs in D
  sort S1 in DFS lexicographic order
  S ← S1
  foreach edge e in S1 do
    s ← graph defined by e
    s.D ← subgraphs in D containing e
    Subgraph_Mining(D, S, s)
    D ← D − e
    if |D| < minSup
      break

Subprocedure 1: Subgraph_Mining(D, S, s)
  if s ≠ min(s)
    return
  S ← S ∪ {s}
  enumerate the 1-edge children of s in s.D
  foreach child c of s do
    if support(c) ≥ minSup
      Subgraph_Mining(Ds, S, c)

13
Runtime: Synthetic
[Plot; y-axis: Runtime (sec)]

14
Runtime: Chemical
[Plot comparing Apriori (FSG) and gSpan]

15
gSpan Advantages
  • Lower memory requirements.
  • Faster than naïve FSG by an order of magnitude.
  • No candidate generation.
  • Lexicographic ordering minimizes search tree.
  • False positives pruning.
  • Any disadvantage?

16
FFSM: Fast Frequent Subgraph Mining -- An
Overview
  • How to solve the graph isomorphism problem?
  • A novel graph canonical form: CAM
  • How to tackle the subgraph isomorphism problem
    (NP-complete)?
  • Incrementally kept embeddings
  • How to enumerate subgraphs?
  • An efficient data structure: the CAM tree
  • Two operations: CAM-join, CAM-extension.

17
Adjacency Matrix
  • Every diagonal entry of adjacency matrix M
    corresponds to a distinct vertex in G and is
    filled with the label of this vertex.
  • Every off-diagonal entry in the lower triangle
    part of M corresponds to a pair of vertices in
    G and is filled with the label of the edge
    between the two vertices, or zero if there is no
    edge. (For an undirected graph, the upper triangle
    is always a mirror of the lower triangle.)

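
A minimal sketch of this construction, assuming vertices are numbered 0 to n-1 in the row order of M, labels are single characters, and "0" marks a missing edge. The function name and the small example graph are illustrative assumptions.

```python
def adjacency_matrix(vertex_labels, edges):
    """Build the labeled adjacency matrix M as a list of rows.

    vertex_labels: list of labels; position k is vertex k (row/column k of M).
    edges: iterable of (u, v, label) with 0-based vertex indices.
    """
    n = len(vertex_labels)
    m = [["0"] * n for _ in range(n)]
    for k, label in enumerate(vertex_labels):
        m[k][k] = label      # diagonal: vertex label
    for u, v, label in edges:
        m[u][v] = label      # off-diagonal: edge label ...
        m[v][u] = label      # ... mirrored, since G is undirected
    return m


# A small example graph with vertex labels a, b, b, c, d.
labels = ["a", "b", "b", "c", "d"]
edges = [(1, 0, "y"), (2, 0, "y"), (2, 1, "x"), (3, 1, "y"), (4, 2, "y")]
M = adjacency_matrix(labels, edges)
for row in M:
    print(" ".join(row))
```

Only the lower triangle (including the diagonal) is needed afterwards, because the upper triangle carries no extra information for an undirected graph.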
18
Code
  • The code of an n × n adjacency matrix M is defined as
    the sequence of its lower triangular entries (including
    the diagonal entries) in the order
  • M_{1,1} M_{2,1} M_{2,2} … M_{n,1} M_{n,2} … M_{n,n-1} M_{n,n}

Code(M1) = aybyxb0y0c00y0d > Code(M2) =
aybyxb00yd0y00c > Code(M3) = bxby0d0y0cyy00a

M2 (lower triangle):
a
y b
y x b
0 0 y d
0 y 0 0 c
  • The Canonical Adjacency Matrix (CAM) is the one that
    produces the maximal code, using lexicographic
    order.
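
The code and the canonical form can be sketched by brute force over all vertex orderings. This is only an illustration, not how FFSM computes CAMs. It assumes the symbol order a > b > c > d > x > y > 0, which is inferred from the inequalities between Code(M1), Code(M2) and Code(M3) on the slide; the function names and the graph encoding are hypothetical.

```python
from itertools import permutations

# Assumed symbol order, inferred from the slide example: a > b > c > d > x > y > 0.
RANK = {s: r for r, s in enumerate("0yxdcba")}


def code(vertex_labels, edge_label, order):
    """Code of the adjacency matrix whose k-th row/column is vertex order[k]."""
    entries = []
    for r, u in enumerate(order):
        for v in order[:r]:
            entries.append(edge_label.get(frozenset((u, v)), "0"))
        entries.append(vertex_labels[u])  # diagonal entry closes the row
    return "".join(entries)


def canonical_code(vertex_labels, edge_label):
    """Maximal code over all vertex orderings (brute force, n! matrices)."""
    candidates = (
        code(vertex_labels, edge_label, p)
        for p in permutations(range(len(vertex_labels)))
    )
    return max(candidates, key=lambda c: [RANK[s] for s in c])


labels = ["a", "b", "b", "c", "d"]
edge_label = {
    frozenset((0, 1)): "y", frozenset((0, 2)): "y", frozenset((1, 2)): "x",
    frozenset((1, 3)): "y", frozenset((2, 4)): "y",
}
print(code(labels, edge_label, (0, 1, 2, 3, 4)))  # aybyxb0y0c00y0d
print(code(labels, edge_label, (0, 1, 2, 4, 3)))  # aybyxb00yd0y00c
print(canonical_code(labels, edge_label))         # aybyxb0y0c00y0d
```

Under the assumed symbol order, the ordering behind M1 yields the maximal code, so M1 is the canonical adjacency matrix of this example graph. The factorial search is only workable for very small graphs.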

19
MP Submatrix
  • For an m × m matrix M, an n × n matrix N is M's
    maximal proper submatrix (MP submatrix) iff N is
    obtained by removing the last non-zero entry from
    M.
  • A CAM is connected iff the
    corresponding graph is connected.
  • Theorem I: A CAM's MP submatrix is a CAM.
  • Theorem II: A connected CAM's MP submatrix is
    connected.
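
A sketch of the MP submatrix operation under one reading of the definition above. It takes "the last non-zero entry" to mean the last edge entry in the last row, and drops the last row and column entirely when that row holds only one edge entry. This reading is an assumption, as are the function name and the list-of-rows encoding.

```python
def mp_submatrix(lower):
    """Maximal proper submatrix of a lower-triangular labeled matrix.

    `lower` is a list of rows; row k holds k edge entries followed by the
    vertex label on the diagonal. "0" marks a missing edge.
    """
    if len(lower) < 2:
        raise ValueError("need at least two vertices")
    last = lower[-1]
    edge_positions = [k for k, entry in enumerate(last[:-1]) if entry != "0"]
    if not edge_positions:
        raise ValueError("last vertex has no edge entry")
    if len(edge_positions) == 1:
        # Removing the only edge would isolate the last vertex: drop its row.
        return [row[:] for row in lower[:-1]]
    # Otherwise zero out the last edge entry and keep the matrix size.
    result = [row[:] for row in lower]
    result[-1][edge_positions[-1]] = "0"
    return result


M1 = [["a"], ["y", "b"], ["y", "x", "b"], ["0", "y", "0", "c"], ["0", "0", "y", "0", "d"]]
step1 = mp_submatrix(M1)       # drops the row of vertex d
step2 = mp_submatrix(step1)    # drops the row of vertex c
step3 = mp_submatrix(step2)    # zeroes the x edge between the two b vertices
print(step3)  # [['a'], ['y', 'b'], ['y', '0', 'b']]
```

Applying the operation repeatedly walks from a matrix back toward the single-vertex root, which is the parent relation a CAM tree is built on.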

20
CAM Tree Subgraphs
[Figure: CAM tree]

21
CAM Tree Frequent Subgraphs
σ = 2/3

22
How to Enumerate Nodes in a CAM Tree?
  • Two operations to explore CAM tree
  • CAM-Join
  • CAM-Extension
  • Augmenting the CAM tree with suboptimal CAMs
  • Objectives
  • No false dismissals
  • No redundancy
  • Plus: we want to do this efficiently!

23
Suboptimal Tree
A suboptimal CAM is a matrix whose
MP submatrix is a CAM.
[Figure: suboptimal CAM tree]

24
Summary
  • Theorem
  • For a graph G, let C_{K-1} (C_K) be the set of
    suboptimal CAMs of all size-(K-1) (size-K)
    subgraphs of G (K ≥ 2). Every member of C_K
    can be enumerated unambiguously either by joining
    two members of C_{K-1} or by extending a member
    of C_{K-1}.

25
Experimental Study
  • Predictive Toxicology Evaluation Competition
    (PTE)
  • Contains 337 compounds
  • Each graph contains 27 nodes and 27 edges on
    average
  • NIH DTP Anti-Viral Screen Test (DTP CA/CM)
  • Chemicals are classified as Confirmed Active
    (CA), Confirmed Moderately Active (CM), and
    Confirmed Inactive (CI).
  • We formed a dataset containing CA (423) and CM
    (1083).
  • Each graph contains 25 nodes and 27 edges on
    average

26
Performance (PTE)
[Plots; x-axis: Support Threshold]

27
Performance (DTP CACM)
[Plots; x-axis: Support Threshold]