Mining Frequent Subgraphs - PowerPoint PPT Presentation


Provided by: McMi83


Transcript and Presenter's Notes

1
Mining Frequent Subgraphs

EECS435
Fall 2008

2
Overview

Introduction
Finding recurring subgraphs from graph databases.
gSpan
FFSM

3
Labeled Graph

We define a labeled graph G as a five-element
tuple $G = (V, E, \Sigma_V, \Sigma_E, \lambda)$ where
V is the set of vertices of G,
$E \subseteq V \times V$ is a set of undirected edges of G,
$\Sigma_V$ ($\Sigma_E$) is the set of vertex (edge) labels,
$\lambda$ is the labeling function $V \to \Sigma_V$ and $E \to \Sigma_E$
that maps vertices and edges to their labels.
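
The definition above can be sketched as a small data structure. This is only an illustration: the class name `LabeledGraph` and its methods are hypothetical, and storing each undirected edge as a `frozenset` of its endpoints is an assumption made for convenience.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph G = (V, E, Sigma_V, Sigma_E, lambda)."""
    vertex_labels: dict                               # lambda restricted to V
    edge_labels: dict = field(default_factory=dict)   # lambda restricted to E

    def add_edge(self, u, v, label):
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise ValueError("edge endpoints must be vertices of the graph")
        # A frozenset makes (u, v) and (v, u) the same undirected edge.
        self.edge_labels[frozenset((u, v))] = label

    @property
    def sigma_v(self):
        """Vertex labels that actually occur in the graph."""
        return set(self.vertex_labels.values())

    @property
    def sigma_e(self):
        """Edge labels that actually occur in the graph."""
        return set(self.edge_labels.values())

    def label(self, *item):
        """lambda: one argument is a vertex, two arguments are an edge."""
        if len(item) == 1:
            return self.vertex_labels[item[0]]
        return self.edge_labels[frozenset(item)]
```

For example, after `g = LabeledGraph({1: "a", 2: "b"})` and `g.add_edge(1, 2, "y")`, both `g.label(1, 2)` and `g.label(2, 1)` look up the same edge.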

4
Frequent Subgraph Mining
Input: A set GD of labeled undirected graphs

$\sigma = 2/3$

Output: All frequent subgraphs (w.r.t. $\sigma$) from
GD.
5
Finding Frequent Subgraphs

Given a graph database $GD = \{G_0, G_1, \ldots, G_n\}$, find
all subgraphs whose support is at least the threshold $\sigma$.
Isomorphic subgraphs are considered the same
subgraph.
Apriori approaches
Generation of subgraph candidates is complicated
and expensive.
Subgraph isomorphism is an NP-complete problem,
so pruning is expensive.
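
To make the cost of support counting concrete, here is a minimal brute-force sketch. It assumes each graph is a pair of dicts (vertex to label, and `frozenset({u, v})` to edge label), treats support as the fraction of database graphs containing the pattern, and uses the non-induced notion of subgraph; the function names `occurs_in` and `support` are hypothetical. It tries every injective vertex mapping, which is exactly the exponential work that practical miners try to avoid.

```python
from itertools import permutations


def occurs_in(pattern, graph):
    """True if `pattern` is subgraph-isomorphic to `graph` (brute force)."""
    p_vl, p_el = pattern
    g_vl, g_el = graph
    p_nodes = list(p_vl)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_vl, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_vl[v] != g_vl[m[v]] for v in p_nodes):
            continue  # some vertex label does not match
        # Every pattern edge must map to a graph edge with the same label.
        if all(g_el.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_el.items()):
            return True
    return False


def support(pattern, database):
    """Fraction of graphs in `database` that contain `pattern`."""
    return sum(occurs_in(pattern, g) for g in database) / len(database)
```

A pattern is then reported as frequent when `support(pattern, database)` is at least the chosen threshold.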

6
gSpan

DFS without candidate generation
Relabels graph representation to support DFS.
Discovers all frequent subgraphs without
candidate generation or pruning.
DFS Representation
Map each graph to a DFS code (sequence).
Lexicographically order the codes.
Construct a search tree based on the
lexicographic order.

7
Depth-First Search Tree
[Figure: panels (a)-(d)]
8
DFS Codes

Given $e_1 = (i_1, j_1)$ and $e_2 = (i_2, j_2)$, $e_1 < e_2$ if
$i_1 = i_2$ and $j_1 < j_2$, or
$i_1 < j_1$ and $j_1 = i_2$.
$code(G, T)$ is the edge sequence in which $e_i < e_{i+1}$.

[Figure: panels (a)-(d)]
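
As a rough illustration of how one DFS traversal yields an edge sequence, the sketch below numbers vertices in discovery order and emits each edge as a 5-tuple `(i, j, label_i, edge_label, label_j)`. The 5-tuple layout, the function name `dfs_code`, and the rule of visiting neighbors in sorted order are assumptions for the example. It produces the code of a single DFS tree, not the minimum DFS code.

```python
def dfs_code(vertex_labels, edges, start):
    """Edge sequence for the DFS tree rooted at `start`.

    vertex_labels: dict vertex -> label
    edges: iterable of (u, v, edge_label) for an undirected graph
    """
    adj = {v: {} for v in vertex_labels}
    for u, v, lab in edges:
        adj[u][v] = lab
        adj[v][u] = lab

    index = {}   # vertex -> discovery index
    code = []

    def visit(v, parent):
        index[v] = len(index)
        # Backward edges: from v to already discovered vertices (its
        # ancestors), in increasing order of their discovery index.
        back = sorted((index[u], u) for u in adj[v]
                      if u in index and u not in (v, parent))
        for iu, u in back:
            code.append((index[v], iu, vertex_labels[v],
                         adj[v][u], vertex_labels[u]))
        # Forward edges: grow the tree to each undiscovered neighbor.
        for w in sorted(adj[v]):
            if w not in index:
                code.append((index[v], len(index), vertex_labels[v],
                             adj[v][w], vertex_labels[w]))
                visit(w, v)

    visit(start, None)
    return code
```

Different start vertices or neighbor orders give different codes for the same graph, which is why a canonical choice among them is needed.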
9
DFS Lexicographic Order

$\alpha = code(G_\alpha, T_\alpha) = (a_0, a_1, \ldots, a_m)$
$\beta = code(G_\beta, T_\beta) = (b_0, b_1, \ldots, b_n)$
$\alpha \le \beta$ iff (1) or (2)
(1)
(2)
Minimum DFS code
The minimum DFS code min(G), in DFS lexicographic
order, is the canonical label of graph G.
Graphs A and B are isomorphic if and only if min(A) = min(B).

10
DFS Codes: Parents and Children

If $\alpha = (a_0, a_1, \ldots, a_m)$ and $\beta = (a_0, a_1, \ldots, a_m, b)$:
$\beta$ is the child of $\alpha$.
$\alpha$ is the parent of $\beta$.
A valid DFS code requires that b grows from a
vertex on the rightmost path.

11
DFS Code Trees

Organize DFS code nodes as parent-child.
Pre-order traversal follows DFS lexicographic
order.
If s and s' are the same graph with different DFS
codes, s' is not the minimum and can be pruned.

12
gSpan

D is the set of all graphs.
S is the result set.

Algorithm 1: GraphSet_Projection(D, S)
  sort labels in D by frequency
  remove infrequent vertices and edges
  relabel remaining vertices and edges
  S^1 ← all frequent 1-edge graphs in D
  sort S^1 in DFS lexicographic order
  S ← S^1
  foreach edge e in S^1 do
    s ← graph defined by e
    s.D ← subgraphs in D containing e
    Subgraph_Mining(D, S, s)
    D ← D - e
    if |D| < minSup
      break

Subprocedure 1: Subgraph_Mining(D, S, s)
  if s ≠ min(s)
    return
  S ← S ∪ {s}
  enumerate 1-edge children of s in s.D
  foreach child c of s do
    if support(c) ≥ minSup
      Subgraph_Mining(D_s, S, c)
13
Runtime: Synthetic
[Plot: runtime (sec)]
14
Runtime: Chemical
[Plot: Apriori (FSG) vs. gSpan]
15
gSpan Advantages

Lower memory requirements.
Faster than naïve FSG by an order of magnitude.
No candidate generation.
Lexicographic ordering minimizes search tree.
No false-positive pruning.
Any disadvantage?

16
FFSM: Fast Frequent Subgraph Mining -- An Overview

How to solve the graph isomorphism problem?
A novel graph canonical form: CAM
How to tackle the subgraph isomorphism problem
(NP-complete)?
Incrementally kept embeddings
How to enumerate subgraphs?
An efficient data structure: CAM Tree
Two operations: CAM-join, CAM-extension.

17
Adjacency Matrix

Every diagonal entry of adjacency matrix M
corresponds to a distinct vertex in G and is
filled with the label of this vertex.
Every off-diagonal entry in the lower triangle
part of M corresponds to a pair of vertices in
G and is filled with the label of the edge
between the two vertices, or zero if there is no
edge.

Note: for an undirected graph, the upper triangle is
always a mirror of the lower triangle.
18
Code

The code of an $n \times n$ adjacency matrix M is defined as
the sequence of lower triangular entries (including
the diagonal entries) in the order
$M_{1,1} M_{2,1} M_{2,2} \ldots M_{n,1} M_{n,2} \ldots M_{n,n-1} M_{n,n}$

Code(M1) = aybyxb0y0c00y0d > Code(M2) =
aybyxb00yd0y00c > Code(M3) = bxby0d0y0cyy00a
M2 =
a
y b
y x b
0 0 y d
0 y 0 0 c

The Canonical Adjacency Matrix (CAM) is the one that
produces the maximal code, using lexicographic
order.
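
The code and canonical form can be sketched by brute force. The snippet assumes single-character labels and the symbol order a > b > c > d > x > y > 0; that order is an assumption chosen to agree with the comparisons Code(M1) > Code(M2) > Code(M3) above. The names `RANK`, `matrix_code`, and `cam_code` are hypothetical. Trying all vertex orderings is factorial in the number of vertices, so this is only for illustration and is not how FFSM computes CAMs.

```python
from itertools import permutations

# Assumed symbol order: a > b > c > d > x > y > 0 (larger rank = larger symbol).
RANK = {s: r for r, s in enumerate("0yxdcba")}


def matrix_code(order, vlabel, elabel):
    """Code of the adjacency matrix whose rows follow `order`.

    vlabel: dict vertex -> label; elabel: dict frozenset({u, v}) -> label.
    Entries are read row by row over the lower triangle, diagonal included.
    """
    code = []
    for i, u in enumerate(order):
        for v in order[:i]:
            code.append(elabel.get(frozenset((u, v)), "0"))
        code.append(vlabel[u])
    return "".join(code)


def cam_code(vlabel, elabel):
    """Maximal code over all vertex orderings (brute force)."""
    return max((matrix_code(p, vlabel, elabel) for p in permutations(vlabel)),
               key=lambda c: [RANK[s] for s in c])
```

With the five-vertex graph encoded by M1 (vertices labeled a, b, b, c, d), the ordering used for M1 reproduces `aybyxb0y0c00y0d`, and two graphs can be compared for isomorphism by comparing their `cam_code` values.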

19
MP Submatrix

For an $m \times m$ matrix M, an $n \times n$ matrix N is M's
maximal proper submatrix (MP submatrix) iff N is
obtained by removing the last non-zero entry from
M.

We say a CAM is connected iff the
corresponding graph is connected.
Theorem I: A CAM's MP submatrix is a CAM.
Theorem II: A connected CAM's MP submatrix is
connected.

20
CAM Tree: Subgraphs
[Figure: CAM tree of subgraphs, each node shown as an adjacency matrix]
21
CAM Tree: Frequent Subgraphs
$\sigma = 2/3$
22
How to Enumerate Nodes in a CAM Tree?

Two operations to explore CAM tree
CAM-Join
CAM-Extension
Augmenting CAM tree with Suboptimal CAMs
Objectives
no false dismissal
no redundancy
Plus: we want to do this efficiently!

23
Suboptimal Tree
We define a suboptimal CAM as a matrix whose
MP submatrix is a CAM.
[Figure: CAM tree augmented with suboptimal CAMs]
24
Summary

Theorem
For a graph G, let $C_{K-1}$ ($C_K$) be the set of
suboptimal CAMs of all size-(K-1) (size-K)
subgraphs of G ($K \ge 2$). Every member of $C_K$
can be enumerated unambiguously either by joining
two members of $C_{K-1}$ or by extending a member
of $C_{K-1}$.

25
Experimental Study

Predictive Toxicology Evaluation Competition
(PTE)
Contains 337 compounds
Each graph contains 27 nodes and 27 edges on
average
NIH DTP Anti-Viral Screen Test (DTP CA/CM)
Chemicals are classified as Confirmed Active
(CA), Confirmed Moderately Active (CM), or
Confirmed Inactive (CI).
We formed a dataset containing CA (423) and CM
(1083).
Each graph contains 25 nodes and 27 edges on
average