Graph Mining Applications to Machine Learning Problems - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Graph Mining Applications to Machine Learning Problems


1
Graph Mining Applications to Machine Learning
Problems
  • Max Planck Institute for Biological Cybernetics
  • Koji Tsuda

2
Graphs
3
Graph Structures in Biology
  • DNA Sequence
  • RNA
  • Texts in literature
  • Compounds

[Figure: a compound drawn as a labeled graph of C, H, O atoms, and the example sentence "Amitriptyline inhibits adenosine uptake" illustrating text as a graph]
4
Substructure Representation
  • 0/1 vector of pattern indicators
  • Huge dimensionality!
  • Need Graph Mining for selecting features
  • Better than paths (Marginalized graph kernels)

[Figure: 0/1 pattern-indicator vector, one dimension per subgraph pattern]
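As an illustration of this representation (not part of the original slides), the following minimal sketch builds a 0/1 indicator vector with NetworkX subgraph-isomorphism tests. The toy molecule, the pattern, and the helper name indicator_vector are invented for the example; a real system would mine the patterns with gSpan rather than test a fixed list.

```python
# Hypothetical sketch: build a 0/1 pattern-indicator vector for one molecule
# using NetworkX induced-subgraph isomorphism tests (brute force, for
# illustration only; real systems mine patterns with gSpan instead of
# enumerating them up front).
import networkx as nx
from networkx.algorithms import isomorphism

def indicator_vector(graph, patterns):
    """Return x with x[k] = 1 iff pattern k occurs as an (induced) subgraph of `graph`."""
    x = []
    for p in patterns:
        gm = isomorphism.GraphMatcher(
            graph, p,
            node_match=isomorphism.categorical_node_match("label", None))
        x.append(1 if gm.subgraph_is_isomorphic() else 0)
    return x

# toy molecule: a C-C-O chain with labeled nodes
mol = nx.Graph()
mol.add_nodes_from([(0, {"label": "C"}), (1, {"label": "C"}), (2, {"label": "O"})])
mol.add_edges_from([(0, 1), (1, 2)])

pat = nx.Graph()  # pattern: a single C-O edge
pat.add_nodes_from([(0, {"label": "C"}), (1, {"label": "O"})])
pat.add_edge(0, 1)

print(indicator_vector(mol, [pat]))  # -> [1]
```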
5
Overview
  • Quick Review on Graph Mining
  • EM-based Clustering algorithm
  • Mixture model with L1 feature selection
  • Graph Boosting
  • Supervised Regression for QSAR Analysis
  • Linear programming meets graph mining

6
Quick Review of Graph Mining
7
Graph Mining
  • Analysis of Graph Databases
  • Find all patterns satisfying predetermined
    conditions
  • Frequent Substructure Mining
  • Combinatorial, Exhaustive
  • Recently developed
  • AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)

8
Graph Mining
  • Frequent Substructure Mining
  • Enumerate all patterns that occur in at least m graphs
  • Indicator of pattern k in graph i

Support(k): number of graphs in which pattern k occurs (formalized below)
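In standard notation (the slide's own formulas are not present in this transcript, so the symbols below are mine), the indicator and the support are:

```latex
x_{ik} =
\begin{cases}
1 & \text{if pattern } k \text{ occurs in graph } G_i,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\mathrm{support}(k) = \sum_{i=1}^{n} x_{ik},
```

where n is the number of graphs in the database; pattern k is reported iff support(k) ≥ m.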
9
gSpan (Yan and Han, 2002)
  • Efficient frequent substructure mining method
  • DFS code: a canonical code for each pattern
  • Efficient detection of isomorphic patterns
  • We extend gSpan for our work

10
Enumeration on Tree-shaped Search Space
  • Each node has a pattern
  • Generate nodes from the root
  • Add an edge at each step

11
Tree Pruning
Support(g): number of graphs in which pattern g occurs
  • Anti-monotonicity: support can only decrease when a pattern is extended
  • If support(g) < m, stop exploring! (a small sketch follows)

[Figure: the subtree below a pruned pattern is "Not generated"]
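A minimal sketch of the pruning principle, using frequent item sets as a stand-in for subgraph patterns (the slides enumerate DFS-code extensions of graphs, but the anti-monotone pruning rule is identical). All names and the toy database are invented for the example.

```python
# Minimal sketch of anti-monotone pruning on a tree-shaped search space.
# Item sets stand in for subgraph patterns: if support(pattern) < m, no
# extension of the pattern can reach support m, so the whole subtree is skipped.
def enumerate_frequent(transactions, items, m):
    frequent = []

    def support(pattern):
        return sum(1 for t in transactions if pattern <= t)

    def grow(pattern, start):
        for i in range(start, len(items)):
            child = pattern | {items[i]}      # extend by one item ("add an edge")
            if support(child) < m:            # anti-monotonicity: prune this subtree
                continue
            frequent.append(child)
            grow(child, i + 1)                # extend only rightward to avoid duplicates

    grow(frozenset(), 0)
    return frequent

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(enumerate_frequent(db, ["a", "b", "c"], m=3))
```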
12
Discriminative Patterns: Weighted Substructure Mining
  • w_i > 0: positive class
  • w_i < 0: negative class
  • Weighted substructure mining
  • Find patterns with a large frequency difference between the classes
  • The weighted gain is not anti-monotonic: use a bound (one example follows)
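One simple valid bound of this kind (my notation; not necessarily the exact bound used on the slide): since a supergraph pattern can only occur in fewer graphs, x_{ik'} ≤ x_{ik} for every pattern k' that contains k, and therefore

```latex
\mathrm{gain}(k) = \sum_i w_i\, x_{ik},
\qquad
\bigl|\mathrm{gain}(k')\bigr| \;\le\;
\max\Bigl(\sum_{i:\,w_i>0} w_i\, x_{ik},\; \sum_{i:\,w_i<0} |w_i|\, x_{ik}\Bigr)
\quad \text{for all } k' \supseteq k .
```

If this bound is already below the best gain found so far, the whole subtree rooted at k can be pruned even though the gain itself is not anti-monotonic.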

13
Multiclass version
  • Multiple weight vectors, one per class
  • Positive weight if the graph belongs to the class
  • Negative weight otherwise
  • Search for patterns overrepresented in a class

14
EM-based clustering of graphs
Tsuda, K. and Kudo, T.: Clustering Graphs by Weighted Substructure Mining. ICML 2006, pp. 953-960.
15
EM-based graph clustering
  • Motivation
  • Learning a mixture model in the feature space of
    patterns
  • Basis for more complex probabilistic inference
  • L1 regularization + graph mining
  • E-step → mining → M-step

16
Probabilistic Model
  • Binomial mixture over 0/1 pattern-indicator vectors
  • Each component: a product of per-pattern binomial (Bernoulli) distributions

Mixing weight for each cluster
Parameter vector for each cluster
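In the usual notation for such a model (the transcript omits the slide's formulas, so the symbols below are mine), with x in {0,1}^d the pattern-indicator vector:

```latex
p(\mathbf{x}) \;=\; \sum_{\ell=1}^{L} \pi_\ell \prod_{k=1}^{d}
\theta_{\ell k}^{\,x_k}\,(1-\theta_{\ell k})^{\,1-x_k},
```

where π_ℓ is the mixing weight and θ_ℓ = (θ_{ℓ1}, …, θ_{ℓd}) the parameter vector of cluster ℓ.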
17
Function to minimize
  • L1-regularized log likelihood
  • Baseline constants: ML parameter estimates from a single binomial distribution fitted to all graphs
  • In the solution, most parameters are exactly equal to these constants
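A sketch of an objective consistent with these bullets (my reconstruction; the slide's own formula is not in the transcript): the negative log likelihood plus an L1 penalty on deviations from the baseline constants θ̄_k,

```latex
\min_{\{\pi_\ell,\,\theta_\ell\}} \;
-\sum_{i=1}^{n} \log \sum_{\ell=1}^{L} \pi_\ell\, p(\mathbf{x}_i \mid \theta_\ell)
\;+\; \lambda \sum_{\ell=1}^{L} \sum_{k} \bigl|\theta_{\ell k} - \bar{\theta}_k\bigr| .
```

The L1 term drives most θ_{ℓk} exactly onto θ̄_k, so only a small set of "active" patterns carries cluster-specific parameters.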

18
E-step
  • Active pattern: a pattern whose parameter deviates from the baseline constant
  • The E-step can be computed using only the active patterns (so it is computable!)

19
M-step
  • Putative cluster assignments come from the E-step
  • Each parameter is solved for separately
  • Use graph mining to find the active patterns
  • Then solve only for the active patterns (a compact sketch follows)
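A compact NumPy sketch of one possible EM loop on an explicit (small) pattern-indicator matrix. It illustrates only the E- and M-steps of a Bernoulli mixture; the L1 penalty and the mining of active patterns that make the real method scale are omitted, and all names and the toy data are invented for the example.

```python
# Minimal EM sketch for a Bernoulli ("binomial") mixture over 0/1 pattern
# indicators. It works on an explicit feature matrix; the paper's point is
# that the same updates can be evaluated using only the active patterns
# found by weighted substructure mining.
import numpy as np

def em_bernoulli_mixture(X, n_clusters, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(n_clusters, 1.0 / n_clusters)              # mixing weights
    theta = rng.uniform(0.25, 0.75, size=(n_clusters, d))   # occurrence probabilities
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, l] proportional to pi[l] * p(x_i | theta_l)
        log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_r = np.log(pi) + log_p
        log_r -= log_r.max(axis=1, keepdims=True)
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and per-cluster occurrence probabilities
        Nl = gamma.sum(axis=0)
        pi = Nl / n
        theta = (gamma.T @ X + 1e-3) / (Nl[:, None] + 2e-3)  # light smoothing keeps theta in (0, 1)
    return pi, theta, gamma

X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]], dtype=float)
pi, theta, gamma = em_bernoulli_mixture(X, n_clusters=2)
print(gamma.argmax(axis=1))   # putative cluster assignments
```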

20
Solution
  • Occurrence probability in a cluster
  • Overall occurrence probability

21
Important Observation
For active pattern k, the occurrence probability
in a graph cluster is significantly different
from the average
22
Mining for Active Patterns F
  • F can be rewritten as a weighted substructure mining criterion
  • Active patterns can therefore be found by (multiclass) weighted graph mining!

23
Experiments: RNA graphs
  • Stem as a node
  • Secondary structure predicted by RNAfold
  • 0/1 vertex label (self-loop or not)

24
Clustering RNA graphs
  • Three Rfam families
  • Intron GP I (Int, 30 graphs)
  • SSU rRNA 5 (SSU, 50 graphs)
  • RNase bact a (RNase, 50 graphs)
  • Three bipartition problems
  • Results evaluated by ROC scores (Area under the
    ROC curve)

25
Examples of RNA Graphs
26
ROC Scores
27
Number of Patterns and Computation Time
28
Found Patterns
29
Summary (EM)
  • Probabilistic clustering based on substructure
    representation
  • Inference helped by graph mining
  • Many possible extensions
  • Naïve Bayes
  • Graph PCA, LFD, CCA
  • Semi-supervised learning
  • Applications in Biology?

30
Graph Boosting
Saigo, H., T. Kadowaki and K. Tsuda A Linear
Programming Approach for Molecular QSAR
analysis. International Workshop on Mining and
Learning with Graphs, 85-96, 2006
31
Graph Regression Problem
  • Known as the QSAR problem in chemical informatics
  • QSAR: Quantitative Structure-Activity Relationship
  • Given a graph, predict a real value
  • Typically, features (descriptors) are given

32
QSAR with conventional descriptors
atoms   bonds   rings   Activity
 22      25               3
 20      21               1.2
 23      24               0.77
 11      11              -3.52
 21      22              -4
33
Motivation of Graph Boosting
  • Descriptors are not always available
  • Create new features by finding informative patterns (i.e., subgraphs)
  • Greedy pattern discovery by Boosting + gSpan
  • Linear Programming (LP) Boosting reduces the number of graph mining calls
  • Accurate prediction + interpretable results

34
Molecule as a labeled graph
35
QSAR with patterns
pattern 1   pattern 2   pattern 3   Activity
    1           1           1         3
   -1           1          -1         1.2
   -1           1          -1         0.77
   -1           1          -1        -3.52
    1           1          -1        -4
36
Sparse regression in a very high dimensional space
  • G: the set of all possible patterns (intractably large)
  • |G|-dimensional feature vector x for each molecule
  • Linear regression on x
  • Use an L1 regularizer to make the weight vector a sparse
  • Select a tractable number of patterns (an illustrative sketch follows)
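As a small illustration of the sparsity idea (not the paper's algorithm), the sketch below runs scikit-learn's Lasso on the tiny ±1 pattern-indicator table from the slide above. The real method solves an ε-insensitive LP by column generation so that the pattern set never has to be materialized, but the L1-induced selection of a few patterns is the same idea.

```python
# Illustrative sketch only: L1-regularized linear regression (Lasso) on a
# small, explicit pattern-indicator matrix. The paper instead solves an
# epsilon-insensitive LP by column generation over implicitly defined patterns.
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[ 1, 1,  1],      # rows: molecules, columns: subgraph patterns
              [-1, 1, -1],
              [-1, 1, -1],
              [-1, 1, -1],
              [ 1, 1, -1]], dtype=float)
y = np.array([3.0, 1.2, 0.77, -3.52, -4.0])   # activities from the slide's table

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)        # patterns that receive non-zero weight
print(selected, model.coef_[selected])
```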

37
Problem formulation
We introduce the ε-insensitive loss and an L1 regularizer.
m: number of training graphs;  d: dimensionality (= |G|);  ξ, ξ⁻: slack variables;  ε: parameter of the ε-insensitive loss
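One standard way to write ε-insensitive, L1-regularized regression as a linear program (my reconstruction along the lines of the legend above, not necessarily the exact LP1 of the paper):

```latex
\min_{\boldsymbol{a},\,\boldsymbol{\xi},\,\boldsymbol{\xi}^-}
\;\sum_{j=1}^{d} |a_j| \;+\; C \sum_{i=1}^{m} (\xi_i + \xi_i^-)
\quad \text{s.t.}\quad
\begin{aligned}
& y_i - \textstyle\sum_j a_j x_{ij} \le \varepsilon + \xi_i,\\
& \textstyle\sum_j a_j x_{ij} - y_i \le \varepsilon + \xi_i^-,\\
& \xi_i,\ \xi_i^- \ge 0, \qquad i = 1,\dots,m .
\end{aligned}
```

Splitting a_j = a_j⁺ - a_j⁻ with a_j⁺, a_j⁻ ≥ 0 turns the |a_j| terms into linear ones, giving a plain LP.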
38
Dual LP
  • Primal: huge number of weight variables
  • Dual: huge number of constraints

LP1-Dual
39
Column Generation Algorithm for LP Boost (Demiriz
et al., 2002)
  • Start from the dual with no constraints
  • Add the most violated constraint each time
  • Guaranteed to converge

[Figure: constraint matrix, with only the generated ("used") part highlighted]
40
Finding the most violated constraint
  • Constraint for a pattern (shown again)
  • Finding the most violated one
  • Searched by weighted substructure mining

41
Algorithm Overview
  • Iteration:
  • Find a new pattern by graph mining with weights u
  • If all constraints are satisfied, break
  • Otherwise, add the new constraint
  • Update u by solving LP1-Dual
  • Return:
  • Convert the dual solution to obtain the primal solution a (a generic sketch of this loop follows)
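A generic constraint-generation loop with SciPy, purely to illustrate the control flow above. The toy LP and its random data are invented for this sketch and are not the paper's LP1-Dual; in the paper, the "most violated constraint" is found by weighted substructure mining rather than by a full scan.

```python
# Generic constraint (column) generation: solve a restricted LP, find the most
# violated constraint of the full problem, add it, and repeat until none is violated.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_vars, n_constraints = 5, 1000
c = -rng.random(n_vars)                        # linprog minimizes, so negate to "maximize"
A = rng.normal(size=(n_constraints, n_vars))   # one row per candidate "pattern"
b = rng.random(n_constraints)

active = []                                    # constraints generated so far
for it in range(n_constraints):
    # Solve the restricted problem containing only the active constraints.
    res = linprog(c,
                  A_ub=A[active] if active else None,
                  b_ub=b[active] if active else None,
                  bounds=[(0, 1)] * n_vars, method="highs")
    u = res.x
    # "Mining" step: find the most violated constraint of the full problem.
    violation = A @ u - b
    worst = int(np.argmax(violation))
    if violation[worst] <= 1e-9:               # all constraints satisfied -> optimal for the full LP
        break
    active.append(worst)

print(f"stopped after {it + 1} restricted solves, using {len(active)} of {n_constraints} constraints")
```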

42
Speed-up by adding multiple patterns (multiple
pricing)
  • So far, the most violated pattern is chosen
  • Mining and inclusion of top k patterns at each
    iteration
  • Reduction of the number of mining calls

43
Speed-up by multiple pricing
44
Clearly negative data
atoms   bonds   rings   Activity
 22      25               3
 20      21               1.2
 23      24               0.77
 11      11              -3.52
 21      22              -4
 22      20              -10000
 23      19              -10000
45
Inclusion of clearly negative data
LP2-Primal
l: number of clearly negative data;  z: predetermined upper bound;  slack variables for the clearly negative data
46
Experiments
  • Data from the Endocrine Disruptors Knowledge Base
  • 59 compounds labeled with a real-valued activity and 61 compounds labeled with a large negative number
  • The label (target) is the log-transformed relative proliferative potency (log(RPP)), normalized between -1 and 1
  • Comparison with
  • Marginalized graph kernel + ridge regression
  • Marginalized graph kernel + kNN regression

47
Results with or without clearly negative data
[Figure: results for LP1 (without clearly negative data) and LP2 (with clearly negative data)]
48
Extracted patterns
The extracted patterns are interpretable, unlike the implicitly represented features of the marginalized graph kernel.
49
Summary (Graph Boosting)
  • Graph Boosting simultaneously generates patterns and learns their weights
  • Finite convergence guaranteed by column generation
  • Results are potentially interpretable by chemists
  • Flexible constraints and speed-ups via LP

50
Concluding Remarks
  • Using graph mining as a part of machine learning
    algorithms
  • Weights are essential
  • Please include weights when you implement your
    item-set/tree/graph mining algorithms
  • Make it available on the web!
  • Then ML researchers can use it