Graph Mining Applications to Machine Learning Problems

Learn more at: https://ibisml.org


1
Graph Mining Applications to Machine Learning
Problems

Max Planck Institute for Biological Cybernetics
Koji Tsuda

2
Graphs
3
Graph Structures in Biology

DNA Sequence
RNA
Texts in literature

Compounds

[Figure: a compound drawn as a graph of H, C, and O atoms; example text graph: "Amitriptyline inhibits adenosine uptake"]
4
Substructure Representation

0/1 vector of pattern indicators
Huge dimensionality!
Need Graph Mining for selecting features
Better than paths (Marginalized graph kernels)

patterns
5
Overview

Quick Review on Graph Mining
EM-based Clustering algorithm
Mixture model with L1 feature selection
Graph Boosting
Supervised Regression for QSAR Analysis
Linear programming meets graph mining

6
Quick Review of Graph Mining
7
Graph Mining

Analysis of Graph Databases
Find all patterns satisfying predetermined
conditions
Frequent Substructure Mining
Combinatorial, Exhaustive
Recently developed
AGM (Inokuchi et al., 2000), gSpan (Yan and Han,
2002), Gaston (2004)

8
Graph Mining

Frequent Substructure Mining
Enumerate all patterns that occur in at least m
graphs
Indicator of pattern k in graph i (1 if the pattern occurs, 0 otherwise)

Support(k): # of occurrences of pattern k
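
The support definition above can be sketched on a toy indicator matrix. This is a minimal illustration, not the speaker's code: the matrix `X` and its values are made up, and the names `support` and `frequent_patterns` are hypothetical.

```python
# Toy indicator matrix: rows are graphs i, columns are patterns k.
# X[i][k] == 1 means pattern k occurs in graph i (values are made up).
X = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
]

def support(X, k):
    """Number of graphs in which pattern k occurs."""
    return sum(row[k] for row in X)

def frequent_patterns(X, m):
    """Indices of patterns that occur in at least m graphs."""
    n_patterns = len(X[0])
    return [k for k in range(n_patterns) if support(X, k) >= m]

print([support(X, k) for k in range(3)])  # [3, 2, 1]
print(frequent_patterns(X, m=2))          # [0, 1]
```

In real graph mining the matrix is never materialized, because the number of columns is intractably large; the miner enumerates only the columns that pass the threshold.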
9
gSpan (Yan and Han, 2002)

Efficient Frequent Substructure Mining Method
DFS Code
Efficient detection of isomorphic patterns
We extend gSpan for our work

10
Enumeration on Tree-shaped Search Space

Each node has a pattern
Generate nodes from the root
Add an edge at each step

11
Tree Pruning
Support(g): # of occurrences of pattern g

Anti-monotonicity: the support of a pattern never exceeds the support of its subpatterns
If support(g) < m, stop exploring!

Not generated
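
The tree-shaped enumeration with support pruning can be sketched on itemsets, which are far simpler than graphs because no isomorphism check is needed. This is an assumption-laden stand-in for gSpan, not gSpan itself: each node holds a tuple of items, and a child is made by appending one larger item rather than one edge. The function name `mine_frequent` is hypothetical.

```python
def mine_frequent(transactions, m):
    """Depth-first enumeration of itemsets with support >= m.

    Each search-tree node holds a pattern (a tuple of items). A child is
    made by appending one item larger than the last one, so every itemset
    is generated exactly once.
    """
    items = sorted(set().union(*transactions))
    found = {}

    def grow(pattern, covering):
        start = items.index(pattern[-1]) + 1 if pattern else 0
        for item in items[start:]:
            child_cover = [t for t in covering if item in t]
            if len(child_cover) < m:
                # Anti-monotonicity: no extension of this child can be
                # frequent, so the whole subtree is not generated.
                continue
            child = pattern + (item,)
            found[child] = len(child_cover)
            grow(child, child_cover)

    grow((), transactions)
    return found

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(mine_frequent(transactions, m=2))
```

With `m=2` the pattern `("a", "b", "c")` is pruned because it occurs in only one transaction, and nothing below it is explored.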
12
Discriminative patterns: Weighted Substructure
Mining

w_i > 0: positive class
w_i < 0: negative class
Weighted Substructure Mining
Patterns with large frequency difference
Not anti-monotonic: use a bound
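
A hedged sketch of weighted mining with a pruning bound, again on itemsets instead of graphs. It assumes the gain of a pattern is |sum_i w_i x_ik| with 0/1 indicators and a threshold `tau`; neither the gain formula nor the names `mine_weighted` and `tau` appear in the transcript. The bound rests on one fact: any extension of a pattern occurs in a subset of the graphs that contain the pattern, so its gain cannot exceed the larger of the positive and the negative weight mass in that set.

```python
def mine_weighted(transactions, weights, tau):
    """Itemsets whose weighted gain |sum_i w_i * x_ik| is at least tau."""
    items = sorted(set().union(*transactions))
    found = {}

    def grow(pattern, cover):
        start = items.index(pattern[-1]) + 1 if pattern else 0
        for item in items[start:]:
            child_cover = [i for i in cover if item in transactions[i]]
            pos = sum(weights[i] for i in child_cover if weights[i] > 0)
            neg = -sum(weights[i] for i in child_cover if weights[i] < 0)
            # Upper bound on the gain of every extension of this child.
            if max(pos, neg) < tau:
                continue
            child = pattern + (item,)
            gain = abs(pos - neg)
            if gain >= tau:
                found[child] = gain
            # Keep exploring even if the child itself failed the threshold.
            grow(child, child_cover)

    grow((), list(range(len(transactions))))
    return found

transactions = [{"a", "b"}, {"a", "b"}, {"a"}, {"a"}]
weights = [1.0, 1.0, -1.0, -1.0]
print(mine_weighted(transactions, weights, tau=2.0))
```

In this toy run the pattern `("a",)` has gain 0 and is not reported, yet its extension `("a", "b")` has gain 2. This is why the gain itself cannot be used for pruning and a bound is needed.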

13
Multiclass version

Multiple weight vectors
(graph belongs to class )
(otherwise)
Search patterns overrepresented in a class

14
EM-based clustering of graphs
Tsuda, K. and T. Kudo: Clustering Graphs by
Weighted Substructure Mining. ICML 2006,
953-960.
15
EM-based graph clustering

Motivation
Learning a mixture model in the feature space of
patterns
Basis for more complex probabilistic inference
L1 regularization and graph mining
E-step -> Mining -> M-step

16
Probabilistic Model

Binomial Mixture
Each Component

Mixing weight for cluster
Parameter vector for cluster
17
Function to minimize

L1-Regularized log likelihood
Baseline constant
ML parameter estimate using single binomial
distribution
In the solution, most parameters are exactly equal to
the baseline constants

18
E-step

Active pattern
E-step computed only with active patterns
(computable!)
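
A minimal sketch of the E-step, assuming a mixture in which each cluster models the 0/1 pattern indicators as independent. The argument names `mix` and `theta` are hypothetical. Restricting the computation to active patterns is safe under this assumption: a pattern whose parameter is the same in every cluster contributes the same factor to every cluster, and that factor cancels in the normalization.

```python
import math

def e_step(X, mix, theta):
    """Posterior cluster responsibilities for each graph.

    X[i][k]     : 0/1 indicator of active pattern k in graph i
    mix[l]      : mixing weight of cluster l
    theta[l][k] : probability that pattern k occurs in a graph of cluster l
    """
    resp = []
    for x in X:
        log_joint = []
        for weight, probs in zip(mix, theta):
            ll = math.log(weight)
            for xk, p in zip(x, probs):
                ll += math.log(p) if xk else math.log(1.0 - p)
            log_joint.append(ll)
        top = max(log_joint)  # subtract the maximum for numerical stability
        unnorm = [math.exp(v - top) for v in log_joint]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp

X = [[1, 0], [0, 1]]
mix = [0.5, 0.5]
theta = [[0.9, 0.1], [0.1, 0.9]]
print(e_step(X, mix, theta))
```

Each returned row sums to 1 and gives the putative cluster assignment that the M-step consumes.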

19
M-step

Putative cluster assignment by E-step
Each parameter is solved separately
Use graph mining to find active patterns
Then, solve it only for active patterns

20
Solution

Occurrence probability in a cluster
Overall occurrence probability

21
Important Observation
For active pattern k, the occurrence probability
in a graph cluster is significantly different
from the average
22
Mining for Active Patterns F

F is rewritten in the following form
Active patterns can be found by graph mining!
(multiclass)

23
Experiments: RNA graphs

Stem as a node
Secondary structure by RNAfold
0/1 Vertex label (self loop or not)

24
Clustering RNA graphs

Three Rfam families
Intron GP I (Int, 30 graphs)
SSU rRNA 5 (SSU, 50 graphs)
RNase bact a (RNase, 50 graphs)
Three bipartition problems
Results evaluated by ROC scores (Area under the
ROC curve)
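
For reference, the area under the ROC curve can be computed as the fraction of (positive, negative) pairs that the score ranks correctly. The sketch below assumes that, in a bipartition problem, `scores` are the posterior probabilities of one cluster and `labels` mark the true family; the function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

A score of 1.0 means the two families are perfectly separated, and 0.5 corresponds to random assignment.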

25
Examples of RNA Graphs
26
ROC Scores
27
No. of Patterns and Time
28
Found Patterns
29
Summary (EM)

Probabilistic clustering based on substructure
representation
Inference helped by graph mining
Many possible extensions
Naïve Bayes
Graph PCA, LFD, CCA
Semi-supervised learning
Applications in Biology?

30
Graph Boosting
Saigo, H., T. Kadowaki and K. Tsuda: A Linear
Programming Approach for Molecular QSAR
Analysis. International Workshop on Mining and
Learning with Graphs, 85-96, 2006.
31
Graph Regression Problem

Known as QSAR problem in chemical informatics
Quantitative Structure-Activity Relationship (QSAR) analysis
Given a graph, predict a real value
Typically, features (descriptors) are given

32
QSAR with conventional descriptors
#atoms  #bonds  #rings  Activity
22      25              3
20      21              1.2
23      24              0.77
11      11              -3.52
21      22              -4
33
Motivation of Graph Boosting

Descriptors are not always available
New features by obtaining informative patterns
(i.e., subgraphs)
Greedy pattern discovery by Boosting + gSpan
Linear Programming (LP) Boosting for reducing the
number of graph mining calls
Accurate prediction and interpretable results

34
Molecule as a labeled graph
35
QSAR with patterns
[pattern 1]  [pattern 2]  [pattern 3]  Activity
 1            1            1            3
-1            1           -1            1.2
-1            1           -1            0.77
-1            1           -1            -3.52
 1            1           -1            -4
36
Sparse regression in a very high dimensional space

G: all possible patterns (intractably large)
|G|-dimensional feature vector x for a molecule
Linear Regression
Use an L1 regularizer to obtain a sparse a
Select a tractable number of patterns
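
Because a is sparse, prediction only needs the few patterns with nonzero weight. The sketch below assumes the +1/-1 pattern encoding shown in the "QSAR with patterns" table and an optional bias term. The pattern names and weights are invented for illustration, and `predict` is a hypothetical name.

```python
def predict(patterns_in_molecule, a, bias=0.0):
    """f(x) = bias + sum_k a_k * x_k, with x_k = +1 if pattern k occurs, else -1.

    a maps each selected pattern to its nonzero weight; every other
    pattern in G has weight zero and is never touched.
    """
    return bias + sum(
        w if pattern in patterns_in_molecule else -w
        for pattern, w in a.items()
    )

a = {"C-O": 1.5, "C=C": -0.5}   # hypothetical selected patterns and weights
print(predict({"C-O"}, a))       # 2.0
```

Deciding whether a selected pattern occurs in a new molecule still requires a subgraph test, but only for the handful of selected patterns.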

37
Problem formulation
We introduce the ε-insensitive loss and an L1
regularizer
m: # of training graphs
d = |G|
ξ+, ξ-: slack variables
ε: parameter
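
The formula on this slide did not survive extraction. As a hedged reconstruction, one standard way to write an L1-regularized regression with ε-insensitive loss as a linear program is shown below. The trade-off constant C is an assumption, and the slide's exact form (for example, a bias term or a different scaling) may differ.

```latex
\begin{aligned}
\min_{a,\,\xi^{+},\,\xi^{-}} \quad
  & \|a\|_1 + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right) \\
\text{s.t.} \quad
  & a^{\top} x_i - y_i \le \epsilon + \xi_i^{+}, \\
  & y_i - a^{\top} x_i \le \epsilon + \xi_i^{-}, \\
  & \xi_i^{+} \ge 0, \quad \xi_i^{-} \ge 0, \qquad i = 1, \dots, m.
\end{aligned}
```

Here a has d = |G| components, one per pattern, which is why the primal cannot be solved directly.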
38
Dual LP

Primal: huge number of weight variables
Dual: huge number of constraints

LP1-Dual
39
Column Generation Algorithm for LP Boost (Demiriz
et al., 2002)

Start from the dual with no constraints
Add the most violated constraint each time
Guaranteed to converge

Constraint Matrix
Used Part
40
Finding the most violated constraint

Constraint for a pattern (shown again)
Finding the most violated one
Searched by weighted substructure mining
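
A sketch of the search for the most violated constraint, assuming each pattern k contributes a dual constraint of the form |sum_i u_i x_ik| <= bound. The constraint form and the default `bound=1.0` are assumptions, since the slide's formula is missing from the transcript. The exhaustive scan over a small explicit matrix stands in for weighted substructure mining, which finds the same maximizer without enumerating every pattern.

```python
def most_violated(X, u, bound=1.0):
    """Return (pattern index, gain) of the most violated constraint,
    or None if every constraint |sum_i u_i * X[i][k]| <= bound holds."""
    best_k, best_gain = None, bound
    for k in range(len(X[0])):
        gain = abs(sum(ui * row[k] for ui, row in zip(u, X)))
        if gain > best_gain:
            best_k, best_gain = k, gain
    return None if best_k is None else (best_k, best_gain)

# Toy +1/-1 pattern matrix (3 graphs, 3 patterns) and dual weights.
X = [[1, -1, 1], [1, 1, -1], [-1, 1, 1]]
u = [0.8, 0.6, -0.5]
print(most_violated(X, u))
```

Here pattern 0 has gain about 1.9 and is returned. With `u = [0.1, 0.1, 0.1]` no gain exceeds the bound and the function returns `None`.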

41
Algorithm Overview

Iteration
Find a new pattern by graph mining with weight u
If all constraints are satisfied, break
Add a new constraint
Update u by LP1-Dual
Return
Convert dual solution to obtain primal solution a
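
The control flow of this loop can be sketched as follows. The mining routine and the LP solver are passed in as callables and are not implemented here. Their names and signatures (`mine`, `solve_dual`) are hypothetical, and returning the primal weights a from the same call that updates u is a simplification.

```python
def graph_boost(mine, solve_dual, u0, max_iter=100):
    """Column-generation loop over patterns.

    mine(u)            -> (pattern, violated): the best pattern under dual
                          weights u, and whether its constraint is violated
    solve_dual(chosen) -> (u, a): updated dual weights, and the primal
                          weights recovered for the chosen patterns
    """
    chosen, u, a = [], u0, {}
    for _ in range(max_iter):
        pattern, violated = mine(u)
        if not violated:
            break               # all constraints are satisfied
        chosen.append(pattern)  # add a new constraint
        u, a = solve_dual(chosen)
    return chosen, a
```

The LP stays small throughout because it only ever contains the constraints of the patterns selected so far.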

42
Speed-up by adding multiple patterns (multiple
pricing)

So far, the most violated pattern is chosen
Mining and inclusion of top k patterns at each
iteration
Reduction of the number of mining calls
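
Multiple pricing can be sketched by returning the k most violated patterns from a single pass instead of only the top one. As before, the constraint form |sum_i u_i x_ik| <= bound and the explicit matrix are assumptions made for illustration, and `top_k_violated` is a hypothetical name.

```python
import heapq

def top_k_violated(X, u, k, bound=1.0):
    """Indices of up to k patterns with the largest constraint violation."""
    gains = [
        (abs(sum(ui * row[j] for ui, row in zip(u, X))), j)
        for j in range(len(X[0]))
    ]
    violated = [g for g in gains if g[0] > bound]
    return [j for _, j in heapq.nlargest(k, violated)]

X = [[1, 1, -1, 1], [1, -1, -1, -1], [1, 1, 1, 1]]
u = [2.0, 1.0, -1.0]
print(top_k_violated(X, u, k=2))  # [2, 0]
```

Adding several constraints per LP update means fewer iterations, and therefore fewer expensive mining calls, at the price of a slightly larger LP.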

43
Speed-up by multiple pricing
44
Clearly negative data
#atoms  #bonds  #rings  Activity
22      25              3
20      21              1.2
23      24              0.77
11      11              -3.52
21      22              -4
22      20              -10000
23      19              -10000
45
Inclusion of clearly negative data
LP2-Primal
l: # of clearly negative data
z: predetermined upper bound
ξ': slack variable
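
One reading of this slide, offered as an assumption because the LP2 formula is missing: ordinary training examples are penalized with the ε-insensitive loss, while a clearly negative example is penalized only when its prediction rises above the upper bound z. The two per-example penalties would then look like this; both function names are hypothetical.

```python
def eps_insensitive(pred, y, eps):
    """Zero inside the tube |pred - y| <= eps, linear outside it."""
    return max(0.0, abs(pred - y) - eps)

def clearly_negative_penalty(pred, z):
    """One-sided penalty: only predictions above the upper bound z cost anything."""
    return max(0.0, pred - z)

print(eps_insensitive(2.0, 1.0, eps=0.5))      # 0.5
print(clearly_negative_penalty(0.5, z=-1.0))   # 1.5
print(clearly_negative_penalty(-3.0, z=-1.0))  # 0.0
```

Under this reading the arbitrary label -10000 never enters the objective; the model is only asked to keep such compounds below z.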
46
Experiments

Data from Endocrine Disruptors Knowledge Base
59 compounds labeled by a real number and 61
compounds labeled by a large negative number
Label (target) is the log-transformed relative
proliferative potency (log(RPP)), normalized
between -1 and 1
Comparison with
Marginalized Graph Kernel ridge regression
Marginalized Graph Kernel kNN regression

47
Results with or without clearly negative data
LP2
LP1
48
Extracted patterns
Interpretable compared with implicitly expressed
features by Marginalized Graph Kernel
49
Summary (Graph Boosting)

Graph Boosting simultaneously generates patterns
and learns their weights
Finite convergence by column generation
Potentially interpretable by chemists
Flexible constraints and speed-up by LP

50
Concluding Remarks

Using graph mining as a part of machine learning
algorithms
Weights are essential
Please include weights when you implement your
item-set/tree/graph mining algorithms
Make it available on the web!
Then ML researchers can use it
