Title: Mining Frequent Subgraphs
1Mining Frequent Subgraphs
2Overview
- Introduction
- Finding recurring subgraphs from graph databases.
- gSpan
- FFSM
3Labeled Graph
- We define a labeled graph G as a five-element tuple G = (V, E, Σ_V, Σ_E, δ), where
- V is the set of vertices of G,
- E ⊆ V × V is a set of undirected edges of G,
- Σ_V (Σ_E) is the set of vertex (edge) labels,
- δ is the labeling function V → Σ_V and E → Σ_E that maps vertices and edges to their labels.
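A minimal Python sketch of this five-element tuple may help make the definition concrete. It assumes vertices are arbitrary hashable identifiers and that an undirected edge is stored as a two-element frozenset; the class and method names are illustrative and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Labeled undirected graph: vertices, edges, label sets, labeling function."""
    vertex_labels: dict = field(default_factory=dict)  # vertex -> vertex label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> edge label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        # Both endpoints must already be vertices of the graph.
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added first")
        # An undirected edge is an unordered pair, so (u, v) equals (v, u).
        self.edge_labels[frozenset((u, v))] = label

    @property
    def V(self):
        return set(self.vertex_labels)

    @property
    def E(self):
        return set(self.edge_labels)

    @property
    def sigma_V(self):
        return set(self.vertex_labels.values())

    @property
    def sigma_E(self):
        return set(self.edge_labels.values())

    def delta(self, x):
        """Labeling function: maps a vertex or an edge (pair) to its label."""
        if x in self.vertex_labels:
            return self.vertex_labels[x]
        return self.edge_labels[frozenset(x)]


g = LabeledGraph()
g.add_vertex(1, "a")
g.add_vertex(2, "b")
g.add_edge(1, 2, "y")
```

Storing edges as frozensets means `g.delta((1, 2))` and `g.delta((2, 1))` return the same label, which matches the undirected setting.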
4Frequent Subgraph Mining
Input: A set GD of labeled undirected graphs.
Output: All frequent subgraphs (w.r.t. a support threshold σ) from GD.
5Finding Frequent Subgraphs
- Given a graph database GD = {G0, G1, ..., Gn}, find all subgraphs appearing in at least σ graphs (σ is the support threshold).
- Isomorphic subgraphs are considered the same subgraph.
- Apriori approaches
- Generation of subgraph candidates is complicated and expensive.
- Subgraph isomorphism is an NP-complete problem, so pruning is expensive.
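To illustrate what "appearing in at least a threshold number of graphs" means, and why the subgraph isomorphism test is the costly part, here is a brute-force sketch in Python. It assumes each graph is a pair (vertex label dict, edge label dict keyed by frozenset); the helper names `contains`, `support`, and `edge` are hypothetical, and the exhaustive mapping search is only practical for tiny graphs.

```python
from itertools import permutations


def contains(graph, pattern):
    """Brute-force subgraph isomorphism test (exponential in pattern size)."""
    g_vlabel, g_elabel = graph
    p_vlabel, p_elabel = pattern
    p_nodes = list(p_vlabel)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_vlabel, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_vlabel[p] != g_vlabel[m[p]] for p in p_nodes):
            continue  # a vertex label does not match
        # Every pattern edge must exist in the graph with the same label.
        if all(g_elabel.get(frozenset(m[x] for x in e)) == lbl
               for e, lbl in p_elabel.items()):
            return True
    return False


def support(db, pattern):
    """Number of graphs in db that contain the pattern at least once."""
    return sum(contains(g, pattern) for g in db)


def edge(u, v):
    return frozenset((u, v))


g0 = ({1: "a", 2: "b", 3: "b"}, {edge(1, 2): "y", edge(2, 3): "x"})
g1 = ({1: "b", 2: "a"}, {edge(1, 2): "y"})
g2 = ({1: "a", 2: "c"}, {edge(1, 2): "y"})
db = [g0, g1, g2]

pattern = ({"p": "a", "q": "b"}, {edge("p", "q"): "y"})
threshold = 2
is_frequent = support(db, pattern) >= threshold
```

Each graph counts at most once toward the support, no matter how many embeddings of the pattern it contains.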
6gSpan
- DFS without candidate generation
- Relabels graph representation to support DFS.
- Discovers all frequent subgraphs without candidate generation or pruning.
- DFS Representation
- Map each graph to a DFS code (sequence).
- Lexicographically order the codes.
- Construct a search tree based on the lexicographic order.
7Depth-First Search Tree
(Figure: depth-first search trees, panels (a) to (d); images not preserved.)
8DFS Codes
- Given e1 = (i1, j1) and e2 = (i2, j2), e1 < e2 if
- i1 = i2 and j1 < j2, or
- i1 < j1 and j1 = i2.
- code(G, T) is the edge sequence (e_i) ordered so that e_i < e_{i+1}.
(Figure: panels (a) to (d); images not preserved.)
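The construction of one DFS code can be sketched in Python as follows. This is an illustration under assumptions rather than gSpan's implementation: the graph is given as an adjacency dict plus label dicts, neighbors are visited in sorted order, and each edge is written as a 5-tuple (i, j, label of i, edge label, label of j), with the label fields added on top of the (i, j) pairs shown on the slide. Backward edges from a vertex are emitted right after the forward edge that discovers it.

```python
def dfs_code(adj, vlabel, elabel, start):
    """Return one DFS code of a connected labeled graph."""
    index = {start: 0}   # vertex -> DFS discovery index
    used = set()         # undirected edges already written to the code
    code = []

    def emit(u, v):
        e = frozenset((u, v))
        used.add(e)
        code.append((index[u], index[v], vlabel[u], elabel[e], vlabel[v]))

    def visit(u):
        # Backward edges from u come first, ordered by destination index.
        back = [w for w in adj[u]
                if w in index and frozenset((u, w)) not in used]
        for w in sorted(back, key=index.get):
            emit(u, w)
        # Then forward edges, each followed by the subtree it opens.
        for w in sorted(adj[u]):
            if w not in index:
                index[w] = len(index)
                emit(u, w)
                visit(w)

    visit(start)
    return code


# Triangle p-q-r with a pendant vertex s attached to q.
adj = {"p": ["q", "r"], "q": ["p", "r", "s"], "r": ["p", "q"], "s": ["q"]}
vlabel = {"p": "X", "q": "Y", "r": "X", "s": "Z"}
elabel = {
    frozenset(("p", "q")): "a",
    frozenset(("q", "r")): "b",
    frozenset(("r", "p")): "a",
    frozenset(("q", "s")): "c",
}
code = dfs_code(adj, vlabel, elabel, "p")
```

A different start vertex or neighbor order produces a different code for the same graph, which is why a canonical (minimum) code is needed.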
9DFS Lexicographic Order
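- α = code(Gα, Tα) = (a0, a1, ..., am)
- β = code(Gβ, Tβ) = (b0, b1, ..., bn)
- α ≤ β iff (1) or (2):
- (1) there is a t, 0 ≤ t ≤ min(m, n), such that a_k = b_k for all k < t and a_t < b_t;
- (2) a_k = b_k for 0 ≤ k ≤ m, and n ≥ m.
- Minimum DFS code
- The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G.
- Graphs A and B are isomorphic if and only if min(A) = min(B).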
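The comparison can be sketched in Python as below, assuming the usual reading of the two conditions: (1) the codes first differ at some position and the edge of the first code precedes the edge of the second there, or (2) the first code is a prefix of the second. The edge order `edge_lt` defaults to Python's built-in tuple comparison, which is a simplifying stand-in and not gSpan's actual edge order; the function names are hypothetical.

```python
def dfs_lex_le(alpha, beta, edge_lt=lambda a, b: a < b):
    """True iff alpha <= beta under the (stand-in) DFS lexicographic order."""
    for a, b in zip(alpha, beta):
        if a != b:
            # Condition (1): the first differing edge decides.
            return edge_lt(a, b)
    # Condition (2): alpha is a prefix of beta (or equal to it).
    return len(alpha) <= len(beta)


def minimum_code(codes, edge_lt=lambda a, b: a < b):
    """Pick the smallest code from a non-empty list of DFS codes."""
    best = codes[0]
    for c in codes[1:]:
        if not dfs_lex_le(best, c, edge_lt):
            best = c
    return best


# Hand-written DFS codes of the labeled path X -a- Y -b- Z.
# Each edge is (i, j, label_i, edge_label, label_j).
from_x = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "Z")]
from_z = [(0, 1, "Z", "b", "Y"), (1, 2, "Y", "a", "X")]
from_y = [(0, 1, "Y", "a", "X"), (0, 2, "Y", "b", "Z")]

canonical = minimum_code([from_z, from_y, from_x])
```

Any total order on codes gives a usable canonical label when the minimum is taken over all DFS codes of a graph; gSpan's specific order is additionally chosen so that the search tree can be traversed and pruned efficiently.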
10DFS Codes Parents and Children
- If α = (a0, a1, ..., am) and β = (a0, a1, ..., am, b):
- β is the child of α.
- α is the parent of β.
- A valid DFS code requires that b grows from a vertex on the rightmost path.
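A small Python sketch of the rightmost-path rule follows. It assumes a DFS code given as a list of (i, j) index pairs, where i < j marks a forward edge and i > j a backward edge; `rightmost_path` and `can_grow` are hypothetical names. The check covers only the structural condition named above, not minimality of the resulting code.

```python
def rightmost_path(code):
    """Vertices on the path from the root (index 0) to the rightmost vertex."""
    parent = {}
    for i, j in code:
        if i < j:               # forward edge: j is discovered from i
            parent[j] = i
    v = max(parent, default=0)  # rightmost vertex = last discovered vertex
    path = [v]
    while v != 0:
        v = parent[v]
        path.append(v)
    return path[::-1]


def can_grow(code, b):
    """Check whether edge b = (i, j) grows from the rightmost path of code."""
    path = rightmost_path(code)
    rightmost = path[-1]
    i, j = b
    if frozenset(b) in {frozenset(e) for e in code}:
        return False            # the edge is already part of the code
    if i < j:
        # Forward edge: from a vertex on the rightmost path to a new vertex.
        return i in path and j == rightmost + 1
    # Backward edge: from the rightmost vertex back to a vertex on the path.
    return i == rightmost and j in path


code = [(0, 1), (1, 2), (2, 0), (1, 3)]
path = rightmost_path(code)
```

Restricting growth to the rightmost path is what lets each child code be produced by appending a single edge to its parent.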
11DFS Code Trees
- Organize DFS code nodes as parent-child.
- Pre-order traversal follows DFS lexicographic order.
- If s and s' are the same graph with different DFS codes, s' is not the minimum and can be pruned.
12gSpan
- D is the set of all graphs.
- S is the result set.
Algorithm 1: GraphSet_Projection(D, S)
  sort labels in D by frequency
  remove infrequent vertices and edges
  relabel remaining vertices and edges
  S1 ← all frequent 1-edge graphs in D
  sort S1 in DFS lexicographic order
  S ← S1
  foreach edge e in S1 do
    s ← graph defined by e
    s.D ← subgraphs in D containing e
    Subgraph_Mining(D, S, s)
    D ← D − e
    if |D| < minSup
      break
Subprocedure 1: Subgraph_Mining(D, S, s)
  if s ≠ min(s)
    return
  S ← S ∪ {s}
  enumerate the 1-edge children of s in s.D
  foreach child c of s do
    if support(c) ≥ minSup
      Subgraph_Mining(Ds, S, c)
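The first stage of the algorithm, collecting the frequent 1-edge graphs, can be sketched in Python. This is a simplified illustration, not the gSpan implementation: graphs are assumed to be (vertex label dict, edge label dict keyed by frozenset) pairs, a 1-edge graph is summarized as a triple (smaller vertex label, edge label, larger vertex label), and plain tuple sorting stands in for DFS lexicographic order. The function name is hypothetical.

```python
from collections import Counter


def frequent_one_edge_graphs(db, min_sup):
    """Return the sorted 1-edge patterns contained in at least min_sup graphs."""
    counts = Counter()
    for vlabel, elabel in db:
        patterns = set()
        for e, lbl in elabel.items():
            u, v = tuple(e)
            # Sort endpoint labels so that both edge directions coincide.
            a, b = sorted((vlabel[u], vlabel[v]))
            patterns.add((a, lbl, b))
        counts.update(patterns)  # a graph supports each pattern at most once
    return sorted(p for p, c in counts.items() if c >= min_sup)


def edge(u, v):
    return frozenset((u, v))


db = [
    ({1: "a", 2: "b", 3: "b"}, {edge(1, 2): "y", edge(2, 3): "x"}),
    ({1: "b", 2: "a"}, {edge(1, 2): "y"}),
    ({1: "a", 2: "c"}, {edge(1, 2): "y"}),
]
seeds = frequent_one_edge_graphs(db, min_sup=2)
```

Each returned pattern would then seed one call of the recursive mining subprocedure, which grows it one edge at a time.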
13Runtime Synthetic
(Figure: runtime plot on synthetic data; y-axis: Runtime (sec). Image not preserved.)
14Runtime Chemical
(Figure: runtime plot on chemical data comparing Apriori (FSG) and gSpan. Image not preserved.)
15gSpan Advantages
- Lower memory requirements.
- Faster than naïve FSG by an order of magnitude.
- No candidate generation.
- Lexicographic ordering minimizes search tree.
- Prunes false positives.
- Any disadvantages?
16FFSM: Fast Frequent Subgraph Mining, An Overview
- How to solve the graph isomorphism problem?
- A novel graph canonical form: CAM
- How to tackle the subgraph isomorphism problem (NP-complete)?
- Incrementally kept embeddings
- How to enumerate subgraphs?
- An efficient data structure: CAM tree
- Two operations: CAM-join and CAM-extension
17Adjacency Matrix
- Every diagonal entry of the adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of this vertex.
- Every off-diagonal entry in the lower triangle part of M [1] corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or zero if there is no edge.
[1] For an undirected graph, the upper triangle is always a mirror of the lower triangle.
18Code
- The code of an n × n adjacency matrix M is defined as the sequence of its lower triangular entries (including the diagonal entries) in the order
M1,1 M2,1 M2,2 ... Mn,1 Mn,2 ... Mn,n-1 Mn,n
Code(M1) = aybyxb0y0c00y0d > Code(M2) = aybyxb00yd0y00c > Code(M3) = bxby0d0y0cyy00a
M2 =
a
y b
y x b
0 0 y d
0 y 0 0 c
- The Canonical Adjacency Matrix (CAM) is the adjacency matrix that produces the maximal code under lexicographic order.
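The code and the CAM can be computed by brute force for small graphs, as in the Python sketch below. It rebuilds the five-vertex graph behind the three codes on this slide and tries every vertex order. Two assumptions are made here and are not stated on the slide: labels are single characters, and symbols are ranked a > b > ... > z > 0, which is the order under which the slide's inequalities hold. The function names are hypothetical.

```python
from itertools import permutations


def matrix_code(order, vlabel, elabel):
    """Code of the adjacency matrix whose rows follow the given vertex order."""
    entries = []
    for r, u in enumerate(order):
        # Off-diagonal entries of row r: edge labels to earlier vertices.
        for v in order[:r]:
            entries.append(elabel.get(frozenset((u, v)), "0"))
        entries.append(vlabel[u])  # diagonal entry: the vertex label
    return "".join(entries)


def rank(symbol):
    # Assumed symbol order: a > b > ... > z > 0 (no edge).
    return -1 if symbol == "0" else ord("z") - ord(symbol)


def cam_code(vlabel, elabel):
    """Maximal code over all vertex orders (factorial time, small graphs only)."""
    codes = (matrix_code(p, vlabel, elabel) for p in permutations(vlabel))
    return max(codes, key=lambda c: [rank(s) for s in c])


vlabel = {1: "a", 2: "b", 3: "b", 4: "c", 5: "d"}
elabel = {
    frozenset((1, 2)): "y",
    frozenset((1, 3)): "y",
    frozenset((2, 3)): "x",
    frozenset((2, 4)): "y",
    frozenset((3, 5)): "y",
}
code_m1 = matrix_code((1, 2, 3, 4, 5), vlabel, elabel)
code_m2 = matrix_code((1, 2, 3, 5, 4), vlabel, elabel)
code_m3 = matrix_code((3, 2, 5, 4, 1), vlabel, elabel)
```

Under the assumed symbol order, `cam_code(vlabel, elabel)` returns the same string as `code_m1`, so M1 is the canonical adjacency matrix of this graph.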
19MP Submatrix
- For an m × m matrix M, an n × n matrix N is M's maximal proper submatrix (MP submatrix) iff N is obtained by removing the last non-zero entry from M.
- We define a CAM to be connected iff the corresponding graph is connected.
- Theorem I: A CAM's MP submatrix is a CAM.
- Theorem II: A connected CAM's MP submatrix is connected.
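A possible reading of "removing the last non-zero entry" is sketched below in Python. The interpretation is an assumption, not something the slide spells out: the entry in question is taken to be the last edge entry of the last row, and when that is the only edge in the row, the whole row (and thus the vertex) is dropped, giving n = m − 1; otherwise only that entry is zeroed and n = m. The matrix is stored as a list of lower-triangular rows with the diagonal entry last, and the function name is hypothetical.

```python
def mp_submatrix(rows):
    """Maximal proper submatrix of a lower-triangular labeled matrix."""
    last = rows[-1]
    # Columns of the last row that hold an edge (diagonal entry excluded).
    edge_cols = [c for c, x in enumerate(last[:-1]) if x != "0"]
    if len(edge_cols) <= 1:
        # Removing the only edge of the last vertex removes the vertex too.
        return [list(r) for r in rows[:-1]]
    trimmed = list(last)
    trimmed[edge_cols[-1]] = "0"  # erase only the last edge entry
    return [list(r) for r in rows[:-1]] + [trimmed]


# The matrix with code aybyxb0y0c00y0d, one list per row.
m = [
    ["a"],
    ["y", "b"],
    ["y", "x", "b"],
    ["0", "y", "0", "c"],
    ["0", "0", "y", "0", "d"],
]
step1 = mp_submatrix(m)      # drops the row of vertex d
step2 = mp_submatrix(step1)  # drops the row of vertex c
step3 = mp_submatrix(step2)  # zeroes the x entry, size stays 3 x 3
```

Applying the operation repeatedly walks from a matrix toward the root of the tree, one edge at a time, which is the parent relation the two theorems rely on.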
20CAM Tree Subgraphs
(Figure: CAM tree of subgraphs; the matrix entries are not recoverable from the extracted text.)
21CAM Tree Frequent Subgraphs
σ = 2/3
22How to Enumerate Nodes in a CAM Tree?
- Two operations to explore CAM tree
- CAM-Join
- CAM-Extension
- Augmenting CAM tree with Suboptimal CAMs
- Objectives
- no false dismissals
- no redundancy
- Plus: we want to do this efficiently!
23Suboptimal Tree
We define a suboptimal CAM as a matrix whose MP submatrix is a CAM.
(Figure: suboptimal CAM tree; the matrix entries are not recoverable from the extracted text.)
24Summary
- Theorem
- For a graph G, let C_{K-1} (C_K) be the set of the suboptimal CAMs of all the size-(K−1) (size-K) subgraphs of G (K ≥ 2). Every member of C_K can be enumerated unambiguously either by joining two members of C_{K-1} or by extending a member of C_{K-1}.
25Experimental Study
- Predictive Toxicology Evaluation Competition (PTE)
- Contains 337 compounds
- Each graph contains 27 nodes and 27 edges on average
- NIH DTP Anti-Viral Screen Test (DTP CA/CM)
- Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), and Confirmed Inactive (CI).
- We formed a dataset containing CA (423) and CM (1083).
- Each graph contains 25 nodes and 27 edges on average
26Performance (PTE)
(Figure: two performance plots; x-axis: Support Threshold. Images not preserved.)
27Performance (DTP CACM)
(Figure: two performance plots; x-axis: Support Threshold. Images not preserved.)