Mining, Indexing and Searching Graphs in Biological Databases - PowerPoint PPT Presentation

About This Presentation
Title:

Mining, Indexing and Searching Graphs in Biological Databases

Description:

Title: No Slide Title Author: Jiawei Han Last modified by: dlewis Created Date: 6/19/1998 4:38:52 AM Document presentation format: On-screen Show Company – PowerPoint PPT presentation

Slides: 77

Transcript and Presenter's Notes

Title: Mining, Indexing and Searching Graphs in Biological Databases


1
Mining, Indexing and Searching Graphs in
Biological Databases
  • Jiawei Han
  • Department of Computer Science
  • Institute of Genomic Biology
  • University of Illinois at Urbana-Champaign
  • www.cs.uiuc.edu/hanj
  • In collaboration with Xifeng Yan (UIUC Ph.D. '06
    and IBM Watson), Philip S. Yu (IBM Watson), et
    al.
  • (Core material for tutorials at ICDM'05 and KDD'06)

2
References Covering Five Papers
  • X. Yan and J. Han, gSpan: Graph-Based
    Substructure Pattern Mining, Proc. 2002 Int.
    Conf. on Data Mining (ICDM'02) (Google Scholar
    ranked #3 out of 83,800 entries on Graph Pattern
    Mining on November 2, 2015)
  • X. Yan and J. Han, CloseGraph: Mining Closed
    Frequent Graph Patterns, Proc. 2003 ACM SIGKDD
    Int. Conf. Knowledge Discovery and Data Mining
    (KDD'03) (Google Scholar ranked #1 out of 83,800
    entries on Graph Pattern Mining on November 2,
    2015)
  • X. Yan, P. S. Yu, and J. Han, Graph Indexing: A
    Frequent Structure-based Approach, Proc. 2004
    ACM-SIGMOD Int. Conf. Management of Data
    (SIGMOD'04) (invited to TODS and published 2005,
    Google Scholar ranked #1 out of 39,300 entries
    on Graph Indexing on November 2, 2015)
  • X. Yan, P. S. Yu, and J. Han, Substructure
    Similarity Search in Graph Databases, Proc. 2005
    ACM-SIGMOD Int. Conf. on Management of Data
    (SIGMOD'05) (invited and published in ACM
    TODS'06)
  • H. Hu, X. Yan, H. Yu, J. Han and X. J. Zhou,
    Mining Coherent Dense Subgraphs across Massive
    Biological Networks for Functional Discovery,
    Proc. 2005 Int. Conf. Intelligent Systems for
    Molecular Biology (ISMB'05) (Also in
    Bioinformatics, 2005)

3
(No Transcript)
4
Graph, Graph, Everywhere
from H. Jeong et al Nature 411, 41 (2001)
Aspirin
Yeast protein interaction network
Co-author network
An Internet Web
5
Why Graph Mining and Searching?
  • Graphs are ubiquitous
  • Chemical compounds (Cheminformatics)
  • Protein structures, biological pathways/networks
    (Bioinformatics)
  • Program control flow, traffic flow, and workflow
    analysis
  • XML databases, Web, and social network analysis
  • Graph is a general model
  • Trees, lattices, sequences, and items are
    degenerate graphs
  • Diversity of graphs
  • Directed vs. undirected, labeled vs. unlabeled
    (edges and vertices), weighted, with angles and
    geometry (topological vs. 2-D/3-D)
  • Complexity of algorithms: many problems are of
    high complexity!

6
Outline
  • Mining frequent graph patterns
  • Graph indexing methods
  • Similarity search in graph databases
  • Biological network analysis
  • Some recent progress on graph mining

7
Graph Pattern Mining
  • Frequent subgraphs
  • A (sub)graph is frequent if its support
    (occurrence frequency) in a given dataset is no
    less than a minimum support threshold
  • Applications of graph pattern mining
  • Mining biochemical structures
  • Program control flow analysis
  • Mining XML structures or Web communities
  • Building blocks for graph classification,
    clustering, comparison, and correlation analysis
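
The definition of a frequent subgraph above can be illustrated with a minimal sketch. The graph format (`make_graph`), the brute-force containment test and the toy dataset are all assumptions made for illustration; edge labels are ignored, and real miners such as gSpan avoid this exhaustive test.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Hypothetical minimal format: vertex -> label, plus undirected edges."""
    return {"labels": labels, "edges": {frozenset(e) for e in edges}}

def contains(graph, pattern):
    """Brute-force subgraph isomorphism: try every injective vertex mapping."""
    p_nodes = list(pattern["labels"])
    for image in permutations(graph["labels"], len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(pattern["labels"][v] == graph["labels"][m[v]] for v in p_nodes)
        edges_ok = all(frozenset(m[v] for v in e) in graph["edges"]
                       for e in pattern["edges"])
        if labels_ok and edges_ok:
            return True
    return False

def support(dataset, pattern):
    """Number of dataset graphs that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

dataset = [
    make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    make_graph({0: "C", 1: "C", 2: "O", 3: "N"}, [(0, 1), (1, 2), (2, 3)]),
    make_graph({0: "C", 1: "N"}, [(0, 1)]),
]
pattern = make_graph({"a": "C", "b": "C", "c": "O"}, [("a", "b"), ("b", "c")])

min_support = 2
is_frequent = support(dataset, pattern) >= min_support
```

The pattern C-C-O occurs in the first two toy graphs, so with a minimum support of 2 it counts as frequent.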

8
Example Frequent Subgraphs
Graph Dataset
(A)
(B)
(C)
Frequent Patterns (min support is 2)
(1)
(2)
9
Frequent Subgraph Mining Approaches
  • Apriori-based approach
  • AGM/AcGM: Inokuchi, et al. (PKDD'00)
  • FSG: Kuramochi and Karypis (ICDM'01)
  • PATH: Vanetik and Gudes (ICDM'02, ICDM'04)
  • FFSM: Huan, et al. (ICDM'03)
  • Pattern growth-based approach
  • MoFa: Borgelt and Berthold (ICDM'02)
  • gSpan: Yan and Han (ICDM'02)
  • Gaston: Nijssen and Kok (KDD'04)

10
Properties of Graph Mining Algorithms
  • Search order
  • breadth vs. depth
  • Generation of candidate subgraphs
  • apriori vs. pattern growth
  • Elimination of duplicate subgraphs
  • passive vs. active
  • Support calculation
  • embedding store or not
  • Discover order of patterns
  • path → tree → graph

11
Apriori-Based Approach
(Figure: frequent k-edge subgraphs are joined (JOIN) to generate (k+1)-edge candidate graphs G1, G2, ..., Gn.)
12
Pattern Growth-Based Span and Pruning
(Figure: patterns grow one edge at a time, from 1-edge to 2-edge to 3-edge graphs such as G1. If a graph is redundant, prune it!)
13
gSpan (Yan and Han, ICDM'02)
Right-Most Extension
Theorem (Completeness): The enumeration of graphs using right-most
extension is complete.
14
DFS Code
  • Flatten a graph into a sequence using depth first
    search
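
As a rough illustration of flattening a graph into a sequence, the sketch below emits one DFS code as a list of `(i, j, label_i, label_j)` tuples, where `i` and `j` are depth-first discovery indices. The function name, the input format and the omission of edge labels are assumptions. It produces a single DFS code for the chosen start vertex and neighbor order, not the minimum DFS code that gSpan uses as a canonical form.

```python
def dfs_code(labels, adj, start):
    """Return one DFS code of an undirected, vertex-labeled graph.

    labels: vertex -> label; adj: vertex -> list of neighbors.
    """
    index = {start: 0}      # vertex -> discovery index
    seen = set()            # edges already written to the code
    code = []

    def visit(v):
        # Backward edges from v to earlier-discovered vertices come first.
        for u in sorted((u for u in adj[v] if u in index), key=index.get):
            e = frozenset((u, v))
            if e not in seen:
                seen.add(e)
                code.append((index[v], index[u], labels[v], labels[u]))
        # Forward edges follow, each one discovering a new vertex.
        for u in sorted(adj[v]):
            if u not in index:
                index[u] = len(index)
                seen.add(frozenset((u, v)))
                code.append((index[v], index[u], labels[v], labels[u]))
                visit(u)

    visit(start)
    return code

# Toy triangle with labels X, Y, Z (illustrative data only).
labels = {"a": "X", "b": "Y", "c": "Z"}
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
triangle_code = dfs_code(labels, adj, "a")
```

For the triangle, the code contains two forward edges, (0, 1) and (1, 2), followed by the backward edge (2, 0).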

(Figure: an example graph whose vertices are numbered 0 to 4.)
15
DFS Code Extension
  • Let a be the minimum DFS code of a graph G and b
    be a non-minimum DFS code of G. For any DFS code
    d generated from b by one right-most extension,

(i) d is not a minimum DFS code,
(ii) dfs(d) cannot be extended from b, and
(iii) dfs(d) is either less than a or can be extended from a.
Theorem (Right-Extension): The DFS code of a graph
extended from a non-minimum DFS code is not
minimum.
16
GASTON (Nijssen and Kok, KDD04)
  • Extend graphs directly
  • Store embeddings
  • Separate the discovery of different types of
    graphs
  • path → tree → graph
  • Simple structures are easier to mine and
    duplication detection is much simpler

17
Graph Pattern Explosion Problem
  • If a graph is frequent, all of its subgraphs are
    frequent - the Apriori property
  • An n-edge frequent graph may have 2^n subgraphs
  • Among 422 chemical compounds which are confirmed
    to be active in an AIDS antiviral screen dataset,
    there are 1,000,000 frequent graph patterns if
    the minimum support is 5%

18
Closed Frequent Graphs
  • Motivation: handling the graph pattern explosion
    problem
  • Closed frequent graph
  • A frequent graph G is closed if there exists no
    supergraph of G that carries the same support as
    G
  • If some of G's subgraphs have the same support,
    it is unnecessary to output these subgraphs
    (nonclosed graphs)
  • Lossless compression: still ensures that the
    mining result is complete
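
A minimal sketch of the closedness check follows. It assumes each pattern is given as a set of uniquely named edges, so that "is a subgraph of" reduces to set inclusion; real graph patterns would need a subgraph isomorphism test at that point. The pattern names and support values are made up for illustration.

```python
def closed_patterns(supports):
    """supports: dict mapping pattern (frozenset of edge names) -> support."""
    closed = {}
    for p, s in supports.items():
        # p is closed if no proper super-pattern carries the same support.
        if not any(p < q and s == t for q, t in supports.items()):
            closed[p] = s
    return closed

supports = {
    frozenset({"a-b"}): 3,
    frozenset({"b-c"}): 2,
    frozenset({"a-b", "b-c"}): 2,
}
result = closed_patterns(supports)
```

Here the pattern {b-c} is dropped because its super-pattern {a-b, b-c} has the same support, while {a-b} stays because its support is strictly higher.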

19
CLOSEGRAPH (Yan and Han, KDD'03)
A Pattern-Growth Approach
(Figure: a k-edge graph G and its (k+1)-edge children G1, G2, ..., Gn.)
Under what condition can we stop searching their
children, i.e., terminate early?
Suppose G and G' are frequent and G is a subgraph of G'.
If, in every part of a graph in the dataset where
G occurs, G' also occurs, then we need not grow
G, since none of G's children will be closed
except those of G'.
20
Handling Tricky Exception Cases
(Figure: pattern 1 and pattern 2, built from vertices labeled a, b, c, d, shown with graph 1 and graph 2.)
21
Experimental Result
  • The AIDS antiviral screen compound dataset from
    NCI/NIH
  • The dataset contains 43,905 chemical compounds
  • Among these 43,905 compounds, 423 belong
    to CA, 1081 to CM, and the remaining are in
    class CI

22
Discovered Patterns
(Figure: example patterns discovered at support levels 20, 10 and 5.)
23
Number of Patterns: Frequent vs. Closed
(Plot for class CA: number of patterns vs. minimum support.)
24
Runtime: Frequent vs. Closed
(Plot for class CA: runtime (sec) vs. minimum support.)
25
Do the Odds Beat the Curse of Complexity?
  • Potentially exponential number of frequent
    patterns
  • The worst-case complexity vs. the expected
    probability
  • Ex. Suppose Walmart has 10^4 kinds of products
  • The chance to pick up one product: 10^-4
  • The chance to pick up a particular set of 10
    products: 10^-40
  • What is the chance for this particular set of 10
    products to be frequent, occurring 10^3 times in 10^9
    transactions?
  • Have we solved the NP-hard problem of subgraph
    isomorphism testing?
  • No. But the real graphs in biology and chemistry are not
    so bad
  • A carbon has only 4 bonds and most proteins in a
    network have distinct labels
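
The Walmart arithmetic above can be checked with exact fractions. This is only a back-of-the-envelope sketch, assuming (as the slide's numbers imply) that each product is picked independently with the same probability; the variable names are illustrative.

```python
from fractions import Fraction

n_products = 10**4
p_one = Fraction(1, n_products)        # chance of one particular product: 10^-4
p_set = p_one ** 10                    # one particular set of 10 products: 10^-40

n_transactions = 10**9
expected_occurrences = n_transactions * p_set   # 10^-31 expected occurrences

# Far below the 10^3 occurrences needed for the set to be frequent.
far_from_frequent = expected_occurrences < 10**3
```

Under this simplified model, the particular 10-product set is expected to appear about 10^-31 times in 10^9 transactions, which is why a random set of that size is essentially never frequent.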

26
Outline
  • Mining frequent graph patterns
  • Graph indexing methods
  • Similarity search in graph databases
  • Biological network analysis
  • Some recent progress on graph mining

27
Graph Search Querying Graph Databases
  • Querying graph databases
  • Given a graph database and a query graph, find
    all graphs containing this query graph

28
Scalability Issue
  • Sequential scan
  • Disk I/O
  • Subgraph isomorphism testing
  • An indexing mechanism is needed
  • Daylight: Daylight.com (commercial)
  • GraphGrep: Dennis Shasha, et al., PODS'02
  • Grace: Srinath Srinivasa, et al., ICDE'03

Sample database
29
Indexing Strategy
Graph (G)
Query graph (Q)
If graph G contains query graph Q, G should
contain any substructure of Q
Substructure
  • Remarks
  • Index substructures of a query graph to prune
    graphs that do not contain these substructures

30
Framework
  • Two steps in processing graph queries
  • Step 1. Index Construction
  • Enumerate structures in the graph database, build
    an inverted index between structures and graphs
  • Step 2. Query Processing
  • Enumerate structures in the query graph
  • Calculate the candidate graphs containing these
    structures
  • Prune the false positive answers by performing
    subgraph isomorphism test
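
The two steps above can be sketched as a filter-and-verify loop. Structures are represented by plain strings, the database contents are made up, and `verify` is a hypothetical callback standing in for the subgraph isomorphism test; none of these names come from the original system.

```python
def build_index(db_features):
    """Step 1: inverted index from structure -> ids of graphs containing it."""
    index = {}
    for gid, feats in db_features.items():
        for f in feats:
            index.setdefault(f, set()).add(gid)
    return index

def candidates(index, all_ids, query_features):
    """Step 2a: intersect the id lists of every indexed query structure."""
    result = set(all_ids)
    for f in query_features:
        if f in index:          # structures that are not indexed cannot prune
            result &= index[f]
    return result

def answer(index, all_ids, query_features, verify):
    """Step 2b: remove false positives with the (expensive) exact test."""
    return {g for g in candidates(index, all_ids, query_features) if verify(g)}

db_features = {
    "a": {"C", "C-C", "C-O"},
    "b": {"C", "C-C", "C-N"},
    "c": {"C", "C-C", "C-O", "C-C-O"},
}
index = build_index(db_features)
cands = candidates(index, db_features, {"C-C", "C-O"})
final = answer(index, db_features, {"C-C", "C-O"}, verify=lambda g: g == "c")
```

The index narrows three graphs down to two candidates, and only those two are passed to the exact test.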

31
Cost Analysis
Query response time = T_index + |Cq| × (T_io + T_iso_test)
T_index: graph index access time
|Cq|: size of candidate answer set
T_io: disk I/O time
T_iso_test: isomorphism testing time
Remark: make |Cq| as small as possible
32
Path-Based Approach
Sample database
(a)
(b)
(c)
Paths
0-length: C, O, N, S
1-length: C-C, C-O, C-N, C-S, N-N, S-O
2-length: C-C-C, C-O-C, C-N-C, ...
3-length: ...
Build an inverted index between paths and graphs
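
A minimal sketch of the path-based idea: enumerate the label sequences of all simple paths up to a maximum length, then record which graphs contain each one. The function name, graph format and two toy graphs are assumptions for illustration; GraphGrep itself hashes paths and differs in detail.

```python
def label_paths(labels, adj, max_len):
    """Label sequences of all simple paths with at most max_len edges."""
    found = set()

    def extend(path):
        seq = tuple(labels[v] for v in path)
        found.add(min(seq, seq[::-1]))     # same path read in either direction
        if len(path) - 1 < max_len:
            for u in adj[path[-1]]:
                if u not in path:
                    extend(path + [u])

    for v in labels:
        extend([v])
    return {"-".join(s) for s in found}

database = {
    "g1": ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]}),  # C-C-O chain
    "g2": ({0: "C", 1: "N"}, {0: [1], 1: [0]}),                     # C-N
}

# Inverted index between paths and graphs.
path_index = {}
for gid, (lab, adj) in database.items():
    for p in label_paths(lab, adj, max_len=2):
        path_index.setdefault(p, set()).add(gid)
```

In this toy index, the path C is shared by both graphs and prunes nothing, while C-O points only to g1.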
33
Problems of Path-Based Approach
Sample database
(a)
(b)
(c)
Query graph
Only graph (c) contains this query graph.
However, if we only index paths C, C-C, C-C-C,
C-C-C-C, we cannot prune graph (a) and (b).
34
gIndex Indexing Graphs by Data Mining
  • Our methodology on graph index
  • Identify frequent structures in the database, the
    frequent structures are subgraphs that appear
    quite often in the graph database
  • Prune redundant frequent structures to maintain a
    small set of discriminative structures
  • Create an inverted index between discriminative
    frequent structures and graphs in the database

35
IDEAS Indexing with Two Constraints
discriminative (10^3)
frequent (10^5)
structure (> 10^6)
36
Why Discriminative Subgraphs?
Sample database
(a)
(b)
(c)
  • All graphs contain structures C, C-C, C-C-C
  • Why bother indexing these redundant frequent
    structures?
  • Only index structures that provide more
    information than existing structures

37
Discriminative Structures
  • Pinpoint the most useful frequent structures
  • Given a set of structures f1, f2, ..., fn and a
    new structure x, we measure the extra indexing
    power provided by x as P(x | f1, f2, ..., fn),
    where each fi is a substructure of x
  • When P is small enough, x is a discriminative
    structure and should be included in the index
  • Index discriminative frequent structures only
  • Reduce the index size by an order of magnitude
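
One way to read the measure P is as the fraction of graphs that contain all already-indexed substructures of x and also contain x itself. The sketch below assumes that reading, with P estimated from the inverted lists; the function name, the graph id sets and the 0.5 cutoff are hypothetical.

```python
def conditional_prob(d_x, d_subs):
    """Estimate P(x | f1, ..., fn) from inverted lists.

    d_x: ids of graphs containing x.
    d_subs: non-empty list of id sets, one per indexed substructure fi of x.
    """
    common = set.intersection(*d_subs)   # graphs containing every fi
    return len(d_x) / len(common)

d_subs = [{1, 2, 3, 4}, {1, 2, 3}]       # graphs containing f1 and f2
p_rare = conditional_prob({1}, d_subs)          # x appears in few of them
p_redundant = conditional_prob({1, 2, 3}, d_subs)   # x adds no pruning power

threshold = 0.5                          # hypothetical cutoff
is_discriminative = p_rare < threshold
```

A small P means x filters out many graphs that its substructures let through, so indexing x is worthwhile. A P close to 1 means x is redundant given what is already indexed.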

38
Why Frequent Structures?
  • We cannot index (or even search) all of
    substructures
  • Large structures will likely be indexed well by
    their substructures
  • Size-increasing support threshold

(Plot: minimum support threshold as a function of structure size; axes are support and size.)
39
Experimental Setting
  • The AIDS antiviral screen compound dataset from
    NCI/NIH, containing 43,905 chemical compounds
  • Query graphs are randomly extracted from the
    dataset.
  • GraphGrep: maximum length (edges) of paths is set
    at 10
  • gIndex: maximum size (edges) of structures is set
    at 10

40
Experiments: Index Size
(Plot: number of features vs. database size.)
41
Experiments: Answer Set Size
(Plot: number of candidates vs. query size.)
42
Experiments: Incremental Maintenance
Frequent structures are stable under database
updates. The index can be built from a small
portion of a graph database and then used for the
whole database.
43
Outline
  • Mining frequent graph patterns
  • Graph indexing methods
  • Similarity search in graph databases
  • Biological network analysis
  • Some recent progress on graph mining

44
Structure Similarity Search
  • CHEMICAL COMPOUNDS

(a) caffeine
(b) diurobromine
(c) viagra
  • QUERY GRAPH

45
Some Straightforward Methods
  • Method 1: Directly compute the similarity between
    the graphs in the DB and the query graph
  • Sequential scan
  • Subgraph similarity computation
  • Method 2: Form a set of subgraph queries from the
    original query graph and use the exact subgraph
    search
  • Costly: if we allow 3 edges to be missed in a
    20-edge query graph, it may generate 1,140
    subgraphs
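
The count of 1,140 is the number of ways to choose which 3 of the 20 query edges to drop. A quick check, with illustrative variable names:

```python
from math import comb

n_edges = 20      # edges in the query graph
n_missed = 3      # edges allowed to be missed

# One relaxed subgraph query per choice of the 3 dropped edges.
n_subqueries = comb(n_edges, n_missed)
```

Since each of those subqueries would need its own exact subgraph search, the cost grows quickly with the number of edges allowed to be missed.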

46
Index Precise vs. Approximate Search
  • Precise Search
  • Use frequent patterns as indexing features
  • Select features in the database space based on
    their selectivity
  • Build the index
  • Approximate Search
  • Hard to build indices covering similar
    subgraphs: explosive number of subgraphs in
    databases
  • Idea: (1) keep the index structure
  • (2) select features in the query space

47
Substructure Similarity Measure
  • Query relaxation measure
  • The number of edges that can be relabeled or
    missed, while the positions of these edges are not
    fixed

QUERY GRAPH

48
Substructure Similarity Measure
  • Feature-based similarity measure
  • Each graph is represented as a feature vector
    X = {x1, x2, ..., xn}
  • The similarity is defined by the distance of
    their corresponding vectors
  • Advantages
  • Easy to index
  • Fast
  • Rough measure

49
Intuition Feature-Based Similarity Search
Graph (G1)
  • If graph G contains the major part of a query
    graph Q, G should share a number of common
    features with Q

Query (Q)
Graph (G2)
  • Given a relaxation ratio, calculate the maximal
    number of features that can be missed !

Substructure
At least one of them should be contained
50
Feature-Graph Matrix
(rows: features; columns: graphs in database)

      G1  G2  G3  G4  G5
f1     0   1   0   1   1
f2     0   1   0   0   1
f3     1   0   1   1   1
f4     1   0   0   0   1
f5     0   0   1   1   0

Assume a query graph has 5 features and at
most 2 features may be missed under the relaxation
threshold
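
The filtering implied by this matrix can be worked through in a few lines. The sketch assumes the query contains each of f1 to f5 exactly once, so a graph survives only if it has at least 5 - 2 = 3 of the features; the variable names are illustrative.

```python
graphs = ["G1", "G2", "G3", "G4", "G5"]
matrix = {                     # feature -> occurrence in G1..G5 (from the slide)
    "f1": [0, 1, 0, 1, 1],
    "f2": [0, 1, 0, 0, 1],
    "f3": [1, 0, 1, 1, 1],
    "f4": [1, 0, 0, 0, 1],
    "f5": [0, 0, 1, 1, 0],
}
query_features = 5
max_misses = 2

# Count how many of the query's features each graph contains.
hits = {g: sum(row[i] for row in matrix.values()) for i, g in enumerate(graphs)}

# Keep graphs that miss at most max_misses features.
candidate_set = [g for g in graphs if query_features - hits[g] <= max_misses]
```

G1, G2 and G3 each contain only two of the five features and are pruned without any isomorphism test, leaving G4 and G5 as candidates.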
51
Edge Relaxation Feature Misses
  • If we allow k edges to be relaxed, J is the
    maximum number of features to be hit by k
    edges; it becomes the maximum coverage problem
  • NP-complete
  • A greedy algorithm exists
  • We design a heuristic to refine the bound of
    feature misses
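
The greedy algorithm for maximum coverage mentioned above can be sketched as follows. The mapping from each query edge to the features it touches (`edge_hits`) is a hypothetical input, and this is the textbook greedy, not the refined heuristic the slide refers to.

```python
def greedy_feature_hits(edge_hits, k):
    """Greedily pick k edges, each time the one hitting the most new features.

    edge_hits: dict edge -> set of features that contain that edge.
    Returns the number of distinct features hit by the chosen edges.
    """
    hit = set()
    remaining = dict(edge_hits)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda e: len(remaining[e] - hit))
        hit |= remaining.pop(best)
    return len(hit)

edge_hits = {
    "e1": {"f1", "f2"},
    "e2": {"f2", "f3"},
    "e3": {"f3", "f4", "f5"},
}
hits_k1 = greedy_feature_hits(edge_hits, 1)
hits_k2 = greedy_feature_hits(edge_hits, 2)
```

The greedy count can fall below the true maximum (it is guaranteed to reach at least a 1 - (1 - 1/k)^k fraction of it), so a filter that must never discard a true answer has to turn it into a safe upper bound rather than use it directly.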

52
Query Processing Framework
  • Three steps in processing approximate graph
    queries
  • Step 1. Index Construction
  • Select small structures as features in a graph
    database, and build the feature-graph matrix
    between the features and the graphs in the
    database

53
Framework (cont.)
  • Step 2. Feature Miss Estimation
  • Determine the indexed features belonging to the
    query graph
  • Calculate the upper bound of the number of
    features that can be missed for an approximate
    matching, denoted by J
  • On the query graph, not the graph database

54
Framework (cont.)
  • Step 3. Query Processing
  • Use the feature-graph matrix to calculate the
    difference in the number of features between
    graph G and query Q, FG - FQ
  • If FG - FQ > J, discard G. The remaining graphs
    constitute a candidate answer set

55
Performance Study
  • Database
  • Chemical compounds of Anti-Aids Drug from
    NCI/NIH, randomly select 10,000 compounds
  • Query
  • Randomly select 30 graphs with 16 and 20 edges as
    query graphs
  • Competitive algorithms
  • Grafil: Graph Filter (our algorithm)
  • Edge: use edges only
  • All: use all the features

56
Comparison of the Three Algorithms
(Plot: number of candidates vs. edge relaxation.)
57
Outline
  • Mining frequent graph patterns
  • Graph indexing methods
  • Similarity search in graph databases
  • Biological network analysis
  • Some recent progress on graph mining

58
Biological Networks
  • Protein-protein interaction network
  • Metabolic network
  • Transcriptional regulatory network
  • Co-expression network
  • Genetic Interaction network

59
Data Mining Across Multiple Networks
(Figure: five networks drawn over the same vertices, labeled a through k.)
60
Data Mining Across Multiple Networks
(Figure: two networks drawn over the same vertices, labeled a through k.)
61
Identify Frequent Co-expression Clusters across
Multiple Microarray Data Sets
62
Our Solution
  • We develop a novel algorithm, called CODENSE, to
    mine frequent coherent dense subgraphs.
  • The target subgraphs have three characteristics
  • All edges occur in > k graphs (frequency)
  • All edges should exhibit correlated occurrences
    in the given graph set (coherency)
  • The subgraph is dense, where density d is higher
    than a given threshold and d = 2m/(n(n-1)) (density)
  • m: number of edges, n: number of nodes
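
The density formula d = 2m/(n(n-1)) is the fraction of all possible vertex pairs that are actually connected. A small sketch, with an illustrative function name and a hypothetical threshold value:

```python
def density(n_nodes, n_edges):
    """d = 2m / (n(n-1)) for an undirected graph with n >= 2 nodes."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

d_clique = density(4, 6)    # complete graph on 4 nodes
d_path = density(4, 3)      # simple path on 4 nodes

threshold = 0.7             # hypothetical density threshold
is_dense = d_clique > threshold
```

A complete graph reaches the maximum density of 1, while a 4-node path reaches only 0.5.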

63
CODENSE Mine Coherent Dense Subgraphs
(1) Build a summary graph by eliminating
infrequent edges
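
Step (1) can be sketched as counting, for each edge, the number of input graphs it occurs in and keeping only edges above the frequency threshold. The input format (each graph as a set of undirected edges over a shared vertex set) and the toy data are assumptions.

```python
from collections import Counter

def summary_graph(graphs, k):
    """Keep the edges that occur in more than k of the input graphs."""
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c > k}

graphs = [
    {frozenset("ab"), frozenset("bc")},
    {frozenset("ab"), frozenset("cd")},
    {frozenset("ab"), frozenset("bc")},
]
summary = summary_graph(graphs, k=1)
```

With k = 1, the edge c-d occurs in only one graph and is dropped from the summary graph.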
64
CODENSE Mine Coherent Dense Subgraphs
(2) Identify dense subgraphs of the summary graph
Observation: If a frequent subgraph is dense, it
must be a dense subgraph in the summary graph.
However, the converse is not true.
65
CODENSE Mine Coherent Dense Subgraphs
(3) Construct the edge occurrence profiles for
each dense summary subgraph
66
CODENSE Mine Coherent Dense Subgraphs
(4) Build a second-order graph for each dense
summary subgraph
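
Steps (3) and (4) can be sketched as follows: each edge gets an occurrence profile across the graph set, and the second-order graph links two edges whose profiles are similar. The slides do not state the similarity measure, so the fraction of agreeing positions used here, the threshold and the toy profiles are all assumptions.

```python
from itertools import combinations

def second_order_graph(profiles, min_sim):
    """profiles: dict edge -> tuple of 0/1 occurrences across the graph set.

    Returns the second-order edges: pairs of first-order edges whose
    profiles agree in at least a min_sim fraction of the graphs.
    """
    def sim(p, q):
        return sum(a == b for a, b in zip(p, q)) / len(p)

    return {frozenset((e, f)) for e, f in combinations(profiles, 2)
            if sim(profiles[e], profiles[f]) >= min_sim}

profiles = {
    "a-b": (1, 1, 0, 1),
    "b-c": (1, 1, 0, 1),
    "c-d": (0, 0, 1, 0),
}
second_order = second_order_graph(profiles, min_sim=0.75)
```

Edges a-b and b-c occur in exactly the same graphs and become neighbors in the second-order graph, while c-d stays isolated.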
67
CODENSE Mine Coherent Dense Subgraphs
(5) Identify dense subgraphs of the second-order
graph
Observation: If a subgraph is coherent (its edges
show high correlation in their occurrences across
a graph set), then its 2nd-order graph must be
dense
68
CODENSE Mine Coherent Dense Subgraphs
(6) Identify the coherent dense subgraphs
69
CODENSE Mine Coherent Dense Subgraphs
70
Applying CoDense to 39 Yeast Microarray Data Sets
(Figure: four expression matrices, with rows g1, g2, ... and columns c1, c2, ..., cm, each shown alongside networks over vertices a through k.)
71
Discovery of New Genes Based on Similar Genes
ATP17
72
Network of Known Similar Genes
Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18
GO:0019538 (protein metabolism, p-value = 0.001122)
73
Network Involved in the New Genes
YDR115W
MRP49
MRPL51
PHB1
PET100
ATP12
MRPL37
ATP17
MRPL38
ACN9
MRPL39
MRPL32
FMC1
MRPS18
Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100
GO:0006091 (generation of precursor metabolites and energy,
p-value = 0.001339)
74
Outline
  • Mining frequent graph patterns
  • Graph indexing methods
  • Similarity search in graph databases
  • Biological network analysis
  • Some recent progress on graph mining

75
Recent Developments Graph Mining
  • Colossal pattern mining: F. Zhu, X. Yan, J. Han,
    P. S. Yu, and H. Cheng, Mining Colossal Frequent
    Patterns by Core Pattern Fusion, in Proc. 2007
    Int. Conf. on Data Engineering (ICDE'07), April
    2007 (Best student paper award)
  • Constraint-based mining: F. Zhu, X. Yan, J. Han,
    and P. S. Yu, gPrune: A Constraint Pushing
    Framework for Graph Pattern Mining, in Proc.
    2007 Pacific-Asia Conf. on Knowledge Discovery
    and Data Mining (PAKDD'07), May 2007 (Best
    student paper award)
  • Approximate graph mining: C. Chen, X. Yan, F.
    Zhu, and J. Han, gApprox: Mining Frequent
    Approximate Patterns from a Massive Network,
    Proc. 2007 Int. Conf. on Data Mining (ICDM'07),
    Oct. 2007

76
Recent Developments Graph Mining
  • Graph-containment indexing: C. Chen, X. Yan, P.
    S. Yu, J. Han, D. Zhang, and X. Gu, Towards
    Graph Containment Search and Indexing, in Proc.
    2007 Int. Conf. on Very Large Data Bases
    (VLDB'07), Vienna, Austria, Sept. 2007
  • Pattern-based classification: H. Cheng, X. Yan,
    J. Han, and C.-W. Hsu, Discriminative Frequent
    Pattern Analysis for Effective Classification,
    in Proc. 2007 Int. Conf. on Data Engineering
    (ICDE'07), Istanbul, Turkey, April 2007
  • DDPMine: H. Cheng, X. Yan, J. Han, and P. S. Yu,
    Direct Discriminative Pattern Mining for
    Effective Classification, Proc. 2008 Int. Conf.
    on Data Engineering (ICDE'08), Cancun, Mexico,
    April 2008

77
Discriminative Frequent Pattern Analysis for
Effective Classification (ICDE'07)
78
Conclusions
  • Graph mining has wide applications
  • Frequent and closed subgraph mining methods
  • gSpan and CloseGraph: pattern-growth depth-first
    search approach
  • Graph indexing techniques
  • Frequent and discriminative subgraphs as indexing
    features
  • Similarity search in graph databases
  • Indexing and approximate matching help similar
    subgraph search
  • Biological network analysis
  • Mining coherent, dense, multiple biological
    networks
  • Many new developments along the line of graph
    pattern mining

79
Thanks and Questions