Title: Frequent Subgraph Pattern Mining on Uncertain Graph Data
1 Frequent Subgraph Pattern Mining on Uncertain Graph Data
- Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang
- Harbin Institute of Technology, China
- CIKM '09, Hong Kong
- Nov 4, 2009
2 Outline
- Background
- Problem Definition
- Algorithm
- Experimental Results
- Conclusions
3 Background
- Graph mining plays an important role in a range of real-world applications:
- medicine: structures of molecules
- bioinformatics: biological networks
- technology: the WWW
- social science: social networks
- many others
4Directions of Graph Mining
Models of graphse.g. Leskovec et al. KDD05
Patterns of graphse.g., Yan et al. ICDM02
Uncertainties of graphs
Privacy of graphse.g., Zou et al. VLDB09
Evolution of graphse.g., Faloutsos et al.
SIGMOD07
5 Uncertainties of Graphs: Example I
- Protein-Protein Interaction (PPI) Networks
- Vertices: proteins
- Edges: interactions between proteins
- Uncertainties: probabilities that the interactions really exist
[Figure: an example PPI network over the proteins TIF34, FET3, NTG1, SMT3, RAD59, and RPC40, with each edge labeled by its existence probability (e.g., 0.375, 0.639, 0.867, 0.651, 0.147, 0.698).]
The data are taken from the STRING Database (http://string-db.org).
6 Uncertainties of Graphs: Example II
- Topologies of wireless sensor networks (WSNs)
- Vertices: sensor nodes
- Edges: wireless links between sensor nodes
- Uncertainties: probabilities that the wireless links function at any given time
[Figure: an example WSN topology with each wireless link labeled by its functioning probability (0.75, 0.95, 0.88, 0.92, 0.69).]
7The Goal of This Paper
Models of graphse.g. Leskovec et al. KDD05
Patterns of graphse.g., Yan et al. ICDM02
Uncertainties of graphs
Privacy of graphse.g., Zou et al. VLDB09
Evolution of graphse.g., Faloutsos et al.
SIGMOD07
8 Outline
- Background
- Problem Definition
- Algorithm
- Experimental Results
- Conclusions
9 Preliminaries
[Figure: an example graph database and two subgraph patterns, one with support 1.0 and one with support 0.5.]
The support of S = (the number of graphs containing S) / (the total number of graphs).
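Below is a minimal sketch of this support computation on a (deterministic) graph database, using networkx's VF2 matcher; the graphs and the pattern are made-up toy examples, not the ones from the slide.

```python
# A sketch of support(S) on a deterministic graph database (slide 9).
import networkx as nx
from networkx.algorithms import isomorphism

def support(pattern, database):
    """Fraction of database graphs that contain `pattern` as a subgraph."""
    count = 0
    for g in database:
        # GraphMatcher(g, pattern) searches for `pattern` inside `g`.
        matcher = isomorphism.GraphMatcher(g, pattern)
        if matcher.subgraph_is_monomorphic():  # non-induced subgraph containment
            count += 1
    return count / len(database)

# Example: a 2-edge path pattern against two small graphs.
pattern = nx.path_graph(3)   # A - B - C
g1 = nx.cycle_graph(4)       # contains the path
g2 = nx.empty_graph(3)       # no edges, does not contain it
print(support(pattern, [g1, g2]))  # 0.5
```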
10Frequent Subgraph Pattern Mining Problem
- Input a graph database D, and a support
threshold minsup - Output all subgraph patterns with support no
less than minsup - FSP mining on biological networks (e.g., PPI
networks) is an important tool for discovering
functional modules Koyutürk et al.
Bioinformatics 04, Turanalp et al. BMC
Bioinformatics 08. - PPI networks are subject to uncertainties.
- How do we define support?
11Model of Uncertain Graphs
(1 0.5) 0.6 0.7 0.8 0.168
Uncertain Graph
0.5 (1 0.6) 0.7 0.8 0.112
12 Model of Uncertain Graphs (Cont'd)
Theorem: An uncertain graph represents a
probability distribution over all its implicated
graphs.
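To make the theorem concrete, here is a minimal sketch (with made-up edge probabilities matching the 0.5/0.6/0.7/0.8 example above) that enumerates every implicated graph of one uncertain graph and checks that their probabilities sum to 1.

```python
# Enumerate all implicated graphs of an uncertain graph (slides 11-12).
from itertools import combinations
import math

edge_probs = {("a", "b"): 0.5, ("b", "c"): 0.6, ("c", "d"): 0.7, ("d", "a"): 0.8}

def implicated_graphs(edge_probs):
    """Yield (present_edges, probability) for every implicated graph."""
    edges = list(edge_probs)
    for k in range(len(edges) + 1):
        for present in combinations(edges, k):
            prob = 1.0
            for e in edges:
                prob *= edge_probs[e] if e in present else 1.0 - edge_probs[e]
            yield set(present), prob

total = sum(p for _, p in implicated_graphs(edge_probs))
print(math.isclose(total, 1.0))  # True: the 2**4 = 16 implicated graphs form a distribution
```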
13 Uncertain Graph Databases
Theorem: An uncertain graph DB represents a probability distribution over all its implicated graph DBs.
In total, there are 2^4 × 2^3 = 128 implicated graph databases for the example DB of two uncertain graphs (with 4 and 3 edges).
[Figure: one implicated graph database, whose probability is ((1 − 0.5) × 0.6 × 0.7 × 0.8) × (0.8 × 0.1 × (1 − 0.7)) = 4.032 × 10^-3.]
14 Expected Support
Let D be an uncertain graph DB whose implicated graph DBs are d1, d2, ..., dn, where
- pi = Pr(D implicates di)
- si = the support of S in di
The expected support of S is esup(S) = p1·s1 + p2·s2 + ... + pn·sn.
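As a concrete check of this definition, the sketch below brute-forces the expected support over every implicated graph database of a tiny made-up two-graph example. The containment test is simplified to edge-set inclusion (ignoring subgraph isomorphism) to keep the sketch short, and it reuses the same per-graph world enumeration as the sketch above.

```python
# Brute-force esup(S) over all implicated graph databases (slide 14).
from itertools import combinations, product

def worlds(edge_probs):
    """All (present_edges, probability) pairs implicated by one uncertain graph."""
    edges = list(edge_probs)
    for k in range(len(edges) + 1):
        for present in combinations(edges, k):
            p = 1.0
            for e in edges:
                p *= edge_probs[e] if e in present else 1.0 - edge_probs[e]
            yield set(present), p

def expected_support(pattern_edges, uncertain_db):
    """esup(S) = sum over implicated DBs of Pr(DB) * support of S in that DB."""
    esup = 0.0
    for db_world in product(*(list(worlds(g)) for g in uncertain_db)):
        prob = 1.0
        hits = 0
        for present, p in db_world:
            prob *= p
            hits += pattern_edges <= present    # simplified containment test
        esup += prob * (hits / len(uncertain_db))
    return esup

uncertain_db = [
    {("a", "b"): 0.5, ("b", "c"): 0.6, ("c", "d"): 0.7, ("d", "a"): 0.8},
    {("a", "b"): 0.8, ("b", "c"): 0.1, ("c", "a"): 0.7},
]
print(expected_support({("a", "b"), ("b", "c")}, uncertain_db))  # ~0.19
```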
15FSP Mining Problem on Uncertain Graphs
- Input an uncertain graph database D, and an
expected support threshold minsup - Output all subgraph patterns with expected
support no less than minsup - It is P-hard to count the number of frequent
subgraph patterns. - Reduction from the problem of counting the number
of satisfying truth assignments of a monotone
k-CNF formula. - The FSP mining problem on uncertain graphs is
NP-hard.
16 Outline
- Background
- Problem Definition
- Algorithm
- Experimental Results
- Conclusions
17Approximation Method
- It is P-hard to compute the expected support of
a subgraph pattern. - We develop an approximation method to find an
approximate set of frequent subgraph patterns. - Let e (0 lt e lt 1) be a relative error tolerance.
Output
Discard
Arbitrary
expected support
1
0
minsup
(1-e) minsup
18 Objective I
- Difficulty I: The number of frequent subgraph patterns can be exponentially large.
- Objective I: Examine subgraph patterns as efficiently as possible to find all frequent ones.
19 Method for Objective I
- Step 1: Build a search tree T of subgraph patterns.
- Step 2: Examine the subgraph patterns in T in depth-first order; if a pattern S is infrequent, then all its descendants can be pruned (see the sketch below).
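A minimal sketch of this depth-first search with pruning follows. Here `children(S)` (generating the child patterns of S in the search tree) and `esup(S)` are hypothetical stand-ins for the paper's pattern-extension and expected-support routines; the pruning relies on expected support being anti-monotone (a superpattern never has higher expected support than its subpatterns).

```python
# Depth-first pattern search with apriori-style pruning (slide 19).
def mine(root, children, esup, minsup):
    """Return all frequent patterns reachable from `root` in the search tree."""
    frequent = []
    stack = [root]
    while stack:
        s = stack.pop()
        if esup(s) < minsup:
            continue                  # s is infrequent: prune its whole subtree
        frequent.append(s)
        stack.extend(children(s))     # descend only below frequent patterns
    return frequent
```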
20Objective II
- Difficulty II It is P-hard to compute the
expected support esup(S) of a subgraph pattern S. - Objective II Make the following judgments
without computing esup(S) exactly. - If esup(S) is surely not in the green region,
then discard. - If esup(S) is probable to be in the green region
and surely not in the red region, then output.
21Method for Objective II
- Step 1 Approximate esup(S) by an interval l, u
such that esup(S)?l, u. - Step 2 Decide whether S can be output or not by
testing the following conditions.
Output
Discard
Shrink
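The sketch below shows one way this interval-based decision could look. The exact thresholds are an assumption inferred from the green/red regions on slide 17: discard when esup(S) is surely below minsup, output when it is surely not below (1 − ε)·minsup, and otherwise shrink [l, u].

```python
# Decision on a pattern S given bounds l <= esup(S) <= u (slides 20-21).
def decide(l, u, minsup, eps):
    """Return 'output', 'discard', or 'shrink' for a pattern with esup in [l, u]."""
    if u < minsup:
        return "discard"                 # surely not in the green region
    if l >= (1 - eps) * minsup:
        return "output"                  # surely not in the red region, possibly green
    return "shrink"                      # interval still too wide to decide

print(decide(0.42, 0.48, minsup=0.4, eps=0.1))  # 'output'
print(decide(0.20, 0.30, minsup=0.4, eps=0.1))  # 'discard'
print(decide(0.30, 0.45, minsup=0.4, eps=0.1))  # 'shrink'
```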
22Approximating esup(S) by l,u
A subgraph pattern S occurs in an uncertain graph
G if S is contained in at least one implicated
graph of G.
Algorithm Approximate esup(S) by l,u Step 1
For each uncertain graph Gi in D, approximate
Pr(S occurs in Gi) by an interval li, ui of
width at most eminsup. Step 2
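A tiny sketch of Step 2, under the assumption spelled out above that the per-graph intervals are simply averaged:

```python
# Combine per-graph occurrence-probability intervals into bounds on esup(S).
def combine(intervals):
    """intervals: list of (li, ui) bounds on Pr(S occurs in Gi), one per graph."""
    n = len(intervals)
    l = sum(li for li, _ in intervals) / n
    u = sum(ui for _, ui in intervals) / n
    return l, u

print(combine([(0.25, 0.35), (0.60, 0.70)]))  # (0.425, 0.525)
```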
23 Approximating Pr(S occurs in Gi) by [li, ui]
- Step 1: Find all embeddings of S in Gi (4 embeddings in the example).
- Step 2: Assign boolean variables to the edges in the embeddings: Pr(x1) = 0.5, Pr(x2) = 0.6, Pr(x3) = 0.7, Pr(x4) = 0.8.
- Step 3: Construct a conjunctive formula for each embedding: C1 = (x1 ∧ x2), C2 = (x1 ∧ x4), C3 = (x2 ∧ x3), C4 = (x3 ∧ x4).
- Step 4: Construct a DNF formula: F = C1 ∨ C2 ∨ C3 ∨ C4.
- Step 5: Estimate Pr(F = TRUE) by p using Karp and Luby's Markov-Chain Monte-Carlo method, with absolute error ε·minsup/2 and confidence δ (δ ∈ (0, 1)) (see the sketch below).
- Step 6: Set [li, ui] = [p − ε·minsup/2, p + ε·minsup/2].
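The sketch below shows one way a Karp-Luby-style Monte-Carlo estimator for Step 5 could look, using the toy clauses and probabilities from this slide. The fixed sample size is illustrative only and is not derived from the paper's (ε, δ) analysis.

```python
# Karp-Luby-style estimator for Pr(F) where F is a DNF over independent variables.
import random

def karp_luby(clauses, var_prob, samples=100_000, seed=0):
    """Estimate Pr(F) for F = OR of AND-clauses over independent boolean variables."""
    rng = random.Random(seed)
    # Weight of each clause = probability that all of its variables are true.
    weights = []
    for c in clauses:
        w = 1.0
        for v in c:
            w *= var_prob[v]
        weights.append(w)
    total = sum(weights)
    hits = 0
    for _ in range(samples):
        # Pick a clause with probability proportional to its weight.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample an assignment conditioned on clause i being true.
        assign = {v: (v in clauses[i]) or (rng.random() < var_prob[v])
                  for v in var_prob}
        # Count the sample only if i is the first clause it satisfies;
        # this makes (total * hits / samples) an unbiased estimate of Pr(F).
        first = next(j for j, c in enumerate(clauses)
                     if all(assign[v] for v in c))
        hits += (first == i)
    return total * hits / samples

var_prob = {"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8}
clauses = [{"x1", "x2"}, {"x1", "x4"}, {"x2", "x3"}, {"x3", "x4"}]
print(karp_luby(clauses, var_prob))  # close to the exact value Pr(F) = 0.782
```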
24 Outline
- Background
- Problem Definition
- Algorithm
- Experimental Results
- Conclusions
25 Experimental Results
- Data: the STRING Database (http://string-db.org)
26 Time Efficiency
27 Approximation Quality
28 Scalability
29 Conclusions
- A new model of uncertain graph data has been proposed.
- The frequent subgraph pattern mining problem on uncertain graph data has been formalized.
- The computational complexity of the problem has been formally proved to be NP-hard.
- An approximate mining algorithm has been proposed.
- The proposed algorithm has high efficiency, high approximation quality, and high scalability.
30 Thank you