Graph Mining Applications in Machine Learning Problems - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Graph Mining Applications in Machine Learning Problems


1
Graph Mining Applicationsin Machine Learning
Problems
  • Max Planck Institute for Biological Cybernetics
  • Koji Tsuda

2
Motivations for graph analysis
  • Existing methods assume tables
  • Structured data beyond this framework
  • → New methods for analysis

3
Graphs..
4
Graph Structures in Biology
  • DNA Sequence
  • RNA
  • Texts in literature
  • Compounds

[Figure: a chemical compound drawn as a graph of C, H, and O atoms, and an example sentence from the literature: "Amitriptyline inhibits adenosine uptake".]
5
Overview
  • Path representation
  • Graph kernels and their disadvantages
  • Substructure representation
  • Graph Mining
  • EM-based Graph Clustering (Tsuda and Kudo, ICML
    2006)

6
Path Representations: Marginalized Graph Kernels
7
Marginalized Graph Kernels
(Kashima, Tsuda, Inokuchi, ICML 2003)
  • Goal: define a kernel function between labeled graphs
  • Both vertices and edges are labeled

8
Label path
  • A label path is a sequence of vertex and edge labels
  • Generated by a random walk on the graph (sketched below)
  • Uniform initial, transition, and termination probabilities
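
To make the random-walk generation concrete, here is a minimal Python sketch of drawing one label path; the graph encoding (dicts of vertex and edge labels) and the value of the termination probability are illustrative assumptions, not taken from the slides.

import random

def sample_label_path(vertex_labels, adjacency, edge_labels, p_stop=0.1):
    """Sample one label path (v-label, e-label, v-label, ...) by a random
    walk with uniform initial, transition, and termination probabilities."""
    # uniform initial probability over all vertices
    v = random.choice(list(vertex_labels))
    path = [vertex_labels[v]]
    while adjacency[v] and random.random() > p_stop:
        # uniform transition probability over the neighbours of v
        u = random.choice(adjacency[v])
        path.append(edge_labels.get((v, u), edge_labels.get((u, v))))
        path.append(vertex_labels[u])
        v = u
    return path

# toy labelled graph: vertices 0, 1, 2 with labels A, B, A and labelled edges
vertex_labels = {0: "A", 1: "B", 2: "A"}
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
edge_labels = {(0, 1): "a", (1, 2): "c", (0, 2): "b"}
print(sample_label_path(vertex_labels, adjacency, edge_labels))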

9
Path-probability vector
10
Kernel definition
  • Kernels for paths
  • Take expectation over all possible paths!
  • Marginalized kernels for graphs
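
In symbols, this is the marginalized graph kernel of Kashima et al. (2003), reconstructed here in our own notation: p(h | G) is the random-walk distribution over label paths h, and K_L is a kernel between label paths (for discrete labels, 1 if the two label sequences match and 0 otherwise).

  K(G, G') = \sum_{h} \sum_{h'} p(h \mid G) \, p(h' \mid G') \, K_L(h, h')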

11
Computation
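
The slide's derivation is not reproduced here; as a hedged sketch, the expectation over all pairs of walks can be computed exactly by a linear solve on the direct product graph. The function below assumes discrete labels, uniform walk probabilities, and a simple dict/list graph encoding; all names are ours.

import numpy as np

def marginalized_kernel(vl1, adj1, el1, vl2, adj2, el2, p_stop=0.1):
    """Sketch of the marginalized graph kernel via the direct product graph.
    vl: list of vertex labels, adj: {vertex: [neighbours]},
    el: {(u, v): edge label}."""
    def elabel(el, u, v):
        return el.get((u, v), el.get((v, u)))
    # product-graph nodes: vertex pairs whose labels match
    nodes = [(i, j) for i in range(len(vl1)) for j in range(len(vl2))
             if vl1[i] == vl2[j]]
    if not nodes:
        return 0.0
    idx = {p: k for k, p in enumerate(nodes)}
    T = np.zeros((len(nodes), len(nodes)))
    for (i, j), k in idx.items():
        for a in adj1[i]:
            for b in adj2[j]:
                if (a, b) in idx and elabel(el1, i, a) == elabel(el2, j, b):
                    # matching transition: product of the two uniform
                    # (1 - p_stop) / degree transition probabilities
                    T[k, idx[(a, b)]] = ((1 - p_stop) / len(adj1[i])) * \
                                        ((1 - p_stop) / len(adj2[j]))
    s = np.full(len(nodes), 1.0 / (len(vl1) * len(vl2)))  # uniform start
    q = np.full(len(nodes), p_stop ** 2)                  # both walks stop
    # sum over all matching walk pairs: s^T (I - T)^{-1} q
    return float(s @ np.linalg.solve(np.eye(len(nodes)) - T, q))

The linear solve over the product graph is where the polynomial (roughly cubic) running time mentioned on a later slide comes from.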
12
Graph Kernel Applications
  • Chemical Compounds (Mahe et al., 2005)
  • Protein 3D structures (Borgwardt et al., 2005)
  • RNA graphs (Karklin et al., 2005)
  • Pedestrian detection
  • Signal Processing

13
Strong points of MGK
  • Polynomial-time computation: O(n³)
  • Positive definite kernel
  • Support Vector Machines
  • Kernel PCA
  • Kernel CCA
  • And so on

14
Drawbacks of graph kernels
  • Global similarity measure
  • Fails to capture subtle differences
  • Long paths suppressed
  • Results not interpretable
  • Structural features ignored (e.g. loops)
  • No labels → kernel always 1

15
Substructure Representation: Graph Mining
16
Substructure Representation
  • 0/1 vector of pattern indicators
  • Huge dimensionality!
  • Need Graph Mining for selecting features

17
Graph Mining
  • Subfield of Data Mining
  • KDD, ICDM, PKDD
  • not popular in ICML, NIPS
  • Analysis of Graph Databases
  • Frequent Substructure Mining
  • Combinatorial algorithm
  • Recently developed
  • AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston
    (Nijssen and Kok, 2004)

18
Graph Mining
  • Frequent Substructure Mining
  • Enumerate all patterns that occur in at least m graphs
  • Indicator of pattern k in graph i

Support(k): the number of graphs in which pattern k occurs
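
Written out with the indicator mentioned above (our notation):

  x_{ik} = 1 \text{ if pattern } k \text{ occurs in graph } i, \text{ else } 0,
  \qquad \mathrm{support}(k) = \sum_{i=1}^{n} x_{ik}

and frequent substructure mining enumerates \{\, k : \mathrm{support}(k) \ge m \,\}.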
19
Enumeration on Tree-shaped Search Space
  • Each node has a pattern
  • Generate nodes from the root
  • Add an edge at each step

20
Tree Pruning
Support(g): the number of graphs in which pattern g occurs
  • Anti-monotonicity: support can only shrink when a pattern grows
  • If support(g) < m, stop exploring! The whole subtree below g is
    never generated (see the sketch below)
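
A minimal Python sketch of this pattern-growth search with anti-monotone pruning. To stay short it deliberately simplifies the pattern language: a pattern is a set of labelled edges and containment is plain set inclusion, which sidesteps subgraph isomorphism; real graph miners such as gSpan grow DFS codes instead. The pruning logic is the point of the sketch.

def mine_frequent(extend, support, m):
    """Depth-first pattern growth: each node of the search tree adds one
    edge; if support(child) < m the whole subtree is never generated."""
    frequent = []
    def grow(pattern):
        for child in extend(pattern):
            if support(child) < m:
                continue              # anti-monotone pruning
            frequent.append(child)
            grow(child)               # add one more edge and recurse
    grow(frozenset())
    return frequent

# toy database: each "graph" is reduced to its set of labelled edges
graphs = [frozenset({"A-a-B", "B-c-A"}),
          frozenset({"A-a-B", "A-b-C"}),
          frozenset({"A-a-B", "B-c-A", "A-b-C"})]
alphabet = sorted(set().union(*graphs))

def extend(p):
    # add one edge label in lexicographic order so each set is visited once
    last = max(p) if p else ""
    return [p | {e} for e in alphabet if e > last]

def support(p):
    return sum(p <= g for g in graphs)   # number of graphs containing p

print(mine_frequent(extend, support, m=2))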
21
gSpan (Yan and Han, 2002)
  • Efficient Frequent Substructure Mining Method
  • DFS Code
  • Efficient detection of isomorphic patterns
  • We extend gSpan for our work

22
Depth First Search (DFS) Code
[Figure: DFS Code Tree on G. Each pattern is encoded as a sequence of edge tuples such as (0,1,A,a,B), (1,2,B,c,A), (2,0,A,b,A), (0,3,A,b,C). The isomorphic patterns G0 and G1 receive different DFS codes; only the minimum DFS code is kept, and the non-minimum DFS code is pruned.]
23
Discriminative patterns
  • w_i > 0: positive class
  • w_i < 0: negative class
  • Weighted Substructure Mining
  • Find patterns with a large frequency difference between the classes
  • Not anti-monotonic → use a bound for pruning (see the note below)
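
A hedged formalization of these bullets in our own notation, with x_{ik} the pattern indicator from before:

  \mathrm{gain}(k) = \Bigl|\sum_{i=1}^{n} w_i \, x_{ik}\Bigr|,
  \qquad w_i > 0 \text{ (positive class)}, \quad w_i < 0 \text{ (negative class)}

Unlike support, this gain is not anti-monotone, so the search tree is pruned with an upper bound on the gain attainable anywhere in a subtree rather than with the plain support threshold.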

24
Multiclass version
  • Multiple weight vectors, one per class
  • The weight of graph i is positive if the graph belongs to the class,
    and negative otherwise
  • Search for patterns overrepresented in a class

25
Summary of Graph Mining
  • Efficient way of searching patterns satisfying
    predetermined conditions
  • NP-hard in the worst case
  • But the actual speed depends on the data
  • Faster for:
  • Sparse graphs
  • Diverse kinds of labels

26
EM-based clustering of graphs (Tsuda and Kudo, ICML 2006)
27
EM-based graph clustering
  • Motivation
  • Learning a mixture model in the feature space of
    patterns
  • Basis for more complex probabilistic inference
  • L1 regularization combined with graph mining
  • E-step → Mining → M-step

28
Probabilistic Model
  • Binomial Mixture
  • Each Component

Mixing weight for each cluster
Parameter vector for each cluster
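
A hedged reconstruction of the model in our own symbols, writing x ∈ {0,1}^d for a graph's pattern-indicator vector, \pi_\ell for the mixing weight of cluster \ell, and \theta_\ell for its parameter vector:

  p(\mathbf{x}) = \sum_{\ell=1}^{L} \pi_\ell \prod_{k=1}^{d}
      \theta_{\ell k}^{\,x_k} \, (1 - \theta_{\ell k})^{\,1 - x_k}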
29
Ordinary EM algorithm
  • Maximizing the log likelihood
  • E-step: get the posterior probabilities
  • M-step: estimate the parameters using the posterior probabilities
  • Both are computationally prohibitive (!) over the full pattern
    space, as the naïve sketch below illustrates
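
For reference, a naïve (unregularized) EM sketch for such a mixture over explicit 0/1 vectors; everything here (names, the tiny random data) is illustrative. It also shows where the trouble lies: d would have to range over all subgraph patterns, so the matrices below could not even be materialized.

import numpy as np

def em_bernoulli_mixture(X, L, n_iter=50, seed=0):
    """Naive EM for a mixture of multivariate Bernoulli ("binomial")
    distributions on an explicit (n, d) 0/1 indicator matrix X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(L, 1.0 / L)
    theta = rng.uniform(0.25, 0.75, size=(L, d))
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each graph
        log_p = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and Bernoulli parameters
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta

X = (np.random.default_rng(1).random((30, 8)) < 0.4).astype(float)
print(em_bernoulli_mixture(X, L=2)[0])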

30
Regularization
  • L1-Regularized log likelihood
  • Baseline constant: the ML parameter estimate from a single
    binomial distribution
  • In the solution, most parameters are exactly equal to the
    baseline constants (see the objective sketched below)
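
A sketch of the objective implied by these bullets, with \bar\theta_k the baseline constant (the ML estimate from a single binomial distribution fitted to all graphs) and \lambda the regularization strength; the exact formulation in the paper may differ in detail:

  \max_{\pi,\,\theta} \;\; \sum_{i=1}^{n} \log \sum_{\ell=1}^{L}
      \pi_\ell \, p(\mathbf{x}_i \mid \theta_\ell)
  \; - \; \lambda \sum_{\ell=1}^{L} \sum_{k}
      \bigl|\theta_{\ell k} - \bar{\theta}_k\bigr|

Because the penalty is an L1 norm centred at the baseline, most \theta_{\ell k} are driven exactly to \bar\theta_k, and only the remaining ("active") patterns have to be handled explicitly.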

31
E-step
  • Active pattern: a pattern whose parameter deviates from the
    baseline constant in some cluster
  • The E-step can be computed with the active patterns only (computable!)

32
M-step
  • Putative cluster assignment
  • Each parameter is solved separately
  • Naïve way: solve it for all parameters, then identify the active
    patterns
  • Instead, use graph mining to find the active patterns directly

33
Solution
  • Occurrence probability in a cluster
  • Overall occurrence probability

34
Solution
35
Important Observation
For active pattern k, the occurrence probability
in a graph cluster is significantly different
from the average
36
Mining for Active Patterns
  • Active pattern
  • Equivalently written as
  • F can be found by graph mining! (multiclass)

37
Experiments RNA graphs
  • Each stem is a node
  • Secondary structure predicted by RNAfold
  • 0/1 vertex labels (self-loop or not)

38
Clustering RNA graphs
  • Three Rfam families
  • Intron GP I (Int, 30 graphs)
  • SSU rRNA 5 (SSU, 50 graphs)
  • RNase bact a (RNase, 50 graphs)
  • Three bipartition problems
  • Results evaluated by ROC scores (Area under the
    ROC curve)

39
Examples of RNA Graphs
40
ROC Scores
41
Number of Patterns and Computation Time
42
Found Patterns
43
Conclusion
  • Substructure representation is better than paths
  • Probabilistic inference helped by graph mining
  • Many possible extensions
  • Naïve Bayes
  • Graph PCA, LFD, CCA
  • Semi-supervised learning
  • Applications in Biology?

44
Ongoing work..