Title: Graph Mining Applications to Machine Learning Problems
1Graph Mining Applications to Machine Learning Problems
- Max Planck Institute for Biological Cybernetics
- Koji Tsuda
2Graphs
3Graph Structures in Biology
- DNA Sequence
- RNA
- Texts in literature
[Figure: example graphs, including a molecule-like graph with atom-labeled nodes (H, C, O) and the text "Amitriptyline inhibits adenosine uptake"]
4Substructure Representation
- Represent each graph as a 0/1 vector of pattern indicators
- Huge dimensionality!
- Need graph mining to select features
- Better than path features (marginalized graph kernels)
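As a minimal sketch of the indicator representation, the Python below tests by brute force whether each pattern occurs in a small labeled graph and collects the answers as a 0/1 vector. The graph encoding, the helper names (`make_graph`, `contains`, `indicator_vector`) and the toy molecule are illustrative assumptions, not part of the talk. Occurrence is taken to mean a label-preserving, not necessarily induced, subgraph, and the exhaustive search is only feasible for tiny graphs.

```python
from itertools import permutations


def make_graph(labels, edges):
    """Hypothetical encoding: (node -> label dict, set of undirected edges)."""
    return labels, {frozenset(e) for e in edges}


def contains(graph, pattern):
    """Brute-force test: does `pattern` occur in `graph` as a labeled subgraph?"""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern nodes to graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue  # node labels do not match
        if all(frozenset(mapping[v] for v in e) in g_edges for e in p_edges):
            return True  # every pattern edge is present in the graph
    return False


def indicator_vector(graph, patterns):
    """0/1 vector with one entry per pattern."""
    return [1 if contains(graph, p) else 0 for p in patterns]


# Toy molecule (assumed): H-C-C-O chain.
mol = make_graph({0: "C", 1: "C", 2: "O", 3: "H"}, [(0, 1), (1, 2), (0, 3)])
patterns = [
    make_graph({"a": "C", "b": "C"}, [("a", "b")]),                      # C-C
    make_graph({"a": "C", "b": "O"}, [("a", "b")]),                      # C-O
    make_graph({"a": "O", "b": "H"}, [("a", "b")]),                      # O-H
    make_graph({"a": "C", "b": "C", "c": "O"}, [("a", "b"), ("b", "c")]),  # C-C-O
]
x = indicator_vector(mol, patterns)
print(x)
```

With all possible subgraphs as patterns the vector becomes intractably long, which is why the features have to be selected by mining rather than enumerated up front.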
5Overview
- Quick Review on Graph Mining
- EM-based Clustering algorithm
  - Mixture model with L1 feature selection
- Graph Boosting
  - Supervised Regression for QSAR Analysis
  - Linear programming meets graph mining
6Quick Review of Graph Mining
7Graph Mining
- Analysis of Graph Databases
- Find all patterns satisfying predetermined conditions
- Frequent Substructure Mining
- Combinatorial, Exhaustive
- Recently developed
- AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
8Graph Mining
- Frequent Substructure Mining
- Enumerate all patterns that occur in at least m graphs
- $x_{ik} \in \{0, 1\}$: indicator of pattern k in graph i
- $\mathrm{support}(k) = \sum_i x_{ik}$: number of graphs in which pattern k occurs
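To make the definition concrete, here is a small sketch that computes the support of each pattern from an explicit indicator matrix and keeps the patterns that occur in at least m graphs. The matrix `X` and the function names are made up for illustration. A real miner never materializes this matrix, because the number of columns is far too large.

```python
# Rows are graphs i, columns are patterns k; entries are 0/1 indicators.
X = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]


def support(X, k):
    """Number of graphs in which pattern k occurs."""
    return sum(row[k] for row in X)


def frequent_patterns(X, m):
    """Indices of patterns whose support is at least m."""
    return [k for k in range(len(X[0])) if support(X, k) >= m]


supports = [support(X, k) for k in range(len(X[0]))]
frequent = frequent_patterns(X, m=2)
print(supports, frequent)
```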
9gSpan (Yan and Han, 2002)
- Efficient Frequent Substructure Mining Method
- DFS Code
- Efficient detection of isomorphic patterns
- We extend gSpan for our work
10Enumeration on Tree-shaped Search Space
- Each node has a pattern
- Generate nodes from the root
- Add an edge at each step
11Tree Pruning
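- support(g): number of graphs in which pattern g occurs
- Anti-monotonicity: $\mathrm{support}(g') \le \mathrm{support}(g)$ whenever g is a subgraph of g'
- If support(g) < m, stop exploring! The subtree below g is not generated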
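The pruning rule can be sketched on a simpler pattern class. The code below is not gSpan: it uses item sets as a stand-in for graphs (an assumption made to keep the example short), grows a pattern by one item per step, and cuts a branch as soon as the support drops below m. The function name, the toy transactions and the `visited` counter are illustrative.

```python
def mine_frequent(transactions, m):
    """Depth-first enumeration of all item sets with support >= m."""
    items = sorted(set().union(*transactions))
    found = {}
    visited = 0  # number of candidate patterns whose support was counted

    def support(pattern):
        return sum(1 for t in transactions if pattern <= t)

    def expand(pattern, start):
        nonlocal visited
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}  # add one item per step
            visited += 1
            s = support(child)
            if s < m:
                continue  # anti-monotonicity: no superset can reach m
            found[child] = s
            expand(child, idx + 1)

    expand(frozenset(), 0)
    return found, visited


transactions = [{"a", "b", "c", "d"}, {"a", "b"}, {"a", "c"}, {"b"}]
found, visited = mine_frequent(transactions, m=2)
for pattern, s in sorted(found.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pattern), s)
```

On this toy input 13 of the 15 non-empty item sets are examined; on realistic graph databases the pruned part of the search tree is what makes mining feasible at all.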
12Discriminative patterns: Weighted Substructure Mining
- $w_i > 0$: positive class
- $w_i < 0$: negative class
- Weighted Substructure Mining
- Patterns with large frequency difference
- Not anti-monotonic: use a bound
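One way to prune without anti-monotonicity is sketched below, again with item sets standing in for graphs. It assumes the mining criterion is a weighted score $|\sum_i w_i x_{ik}| \ge \tau$; the threshold name `tau` and this exact form are assumptions, since the slide does not show the formula. Because a larger pattern can only occur in a subset of the graphs, the score of any extension is at most the larger of the positive and the negative weight mass covered by the current pattern. This is one valid bound of this kind, not necessarily the one used in the talk.

```python
def mine_weighted(transactions, weights, tau):
    """Find item sets whose weighted score |sum_i w_i x_i| is at least tau."""
    items = sorted(set().union(*transactions))
    found = {}

    def gain_and_bound(pattern):
        covered = [w for t, w in zip(transactions, weights) if pattern <= t]
        pos = sum(w for w in covered if w > 0)   # positive mass covered
        neg = sum(-w for w in covered if w < 0)  # negative mass covered
        # Score of this pattern, and an upper bound for all of its extensions.
        return abs(pos - neg), max(pos, neg)

    def expand(pattern, start):
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            gain, bound = gain_and_bound(child)
            if bound < tau:
                continue  # no extension of child can reach tau
            if gain >= tau:
                found[child] = gain
            expand(child, idx + 1)

    expand(frozenset(), 0)
    return found


transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"c"}]
weights = [1, 1, -1, -1]  # first two positive class, last two negative class
found = mine_weighted(transactions, weights, tau=2)
print({tuple(sorted(p)): g for p, g in found.items()})
```

In this example the pattern {a} scores only 1, yet its extension {a, b} scores 2, which is why the search cannot stop on the score itself and has to rely on the bound.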
13Multiclass version
- Multiple weight vectors
- (graph belongs to class )
- (otherwise)
- Search patterns overrepresented in a class
14EM-based clustering of graphs
Tsuda, K. and T. Kudo: Clustering Graphs by Weighted Substructure Mining. ICML 2006, 953-960, 2006
15EM-based graph clustering
- Motivation
- Learning a mixture model in the feature space of patterns
- Basis for more complex probabilistic inference
- L1 regularization + graph mining
- E-step -> Mining -> M-step
16Probabilistic Model
- Binomial Mixture
- Each Component
- Mixing weight for each cluster
- Parameter vector for each cluster
17Function to minimize
- L1-Regularized log likelihood
- Baseline constant
- ML parameter estimate using a single binomial distribution
- In the solution, most parameters are exactly equal to the baseline constants
18E-step
- Active pattern: a pattern whose parameter differs from the baseline constant in at least one cluster
- The E-step is computed only with active patterns (computable!)
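Why the active patterns suffice can be checked numerically. The sketch below assumes a mixture of independent 0/1 features parameterized directly by occurrence probabilities `q[l][k]`, which may differ from the parameterization used in the talk; `alpha`, `q`, `active` and the numbers are made up. Patterns whose probability is the same in every cluster contribute the same factor to each cluster and cancel when the posterior is normalized.

```python
import math


def posterior(x, alpha, q, patterns):
    """Cluster posteriors for one graph, using only the listed pattern indices."""
    log_post = []
    for a, q_l in zip(alpha, q):
        lp = math.log(a)
        for k in patterns:
            lp += math.log(q_l[k]) if x[k] else math.log(1.0 - q_l[k])
        log_post.append(lp)
    top = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


alpha = [0.4, 0.6]            # mixing weights (assumed values)
q = [[0.9, 0.3, 0.5, 0.2],    # occurrence probabilities in cluster 0
     [0.1, 0.3, 0.5, 0.7]]    # occurrence probabilities in cluster 1
active = [0, 3]               # patterns 1 and 2 are identical across clusters
x = [1, 0, 1, 0]              # indicator vector of one graph

full = posterior(x, alpha, q, range(4))
reduced = posterior(x, alpha, q, active)
print(full, reduced)
```

Both calls return the same posterior, so the E-step never needs the parameters of inactive patterns.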
19M-step
- Putative cluster assignment by E-step
- Each parameter is solved separately
- Use graph mining to find active patterns
- Then, solve it only for active patterns
20Solution
- Occurrence probability in a cluster
- Overall occurrence probability
21Important Observation
For active pattern k, the occurrence probability
in a graph cluster is significantly different
from the average
22Mining for Active Patterns F
- F is rewritten in the following form
- Active patterns can be found by graph mining (multiclass version)!
23Experiments RNA graphs
- Stem as a node
- Secondary structure by RNAfold
- 0/1 Vertex label (self loop or not)
24Clustering RNA graphs
- Three Rfam families
- Intron GP I (Int, 30 graphs)
- SSU rRNA 5 (SSU, 50 graphs)
- RNase bact a (RNase, 50 graphs)
- Three bipartition problems
- Results evaluated by ROC scores (area under the ROC curve)
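For reference, the ROC score of a bipartition can be computed as the fraction of (positive, negative) pairs that are ranked correctly. The sketch below is a generic implementation, not the evaluation script behind the reported results; the scores stand for, say, the posterior probability of one cluster, and all numbers are invented.

```python
def roc_score(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]  # hypothetical cluster posteriors
labels = [1, 1, 1, 0, 0, 0]               # true family membership
auc = roc_score(scores, labels)
print(round(auc, 3))
```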
25Examples of RNA Graphs
26ROC Scores
27Number of Patterns and Time
28Found Patterns
29Summary (EM)
- Probabilistic clustering based on substructure representation
- Inference helped by graph mining
- Many possible extensions
- Naïve Bayes
- Graph PCA, LFD, CCA
- Semi-supervised learning
- Applications in Biology?
30Graph Boosting
Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR Analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006
31Graph Regression Problem
- Known as the QSAR problem in chemical informatics
- QSAR: Quantitative Structure-Activity Relationship
- Given a graph, predict a real value
- Typically, features (descriptors) are given
32QSAR with conventional descriptors
atoms bonds rings Activity
22 25 3
20 21 1.2
23 24 0.77
11 11 -3.52
21 22 -4
33Motivation of Graph Boosting
- Descriptors are not always available
- New features by obtaining informative patterns (i.e., subgraphs)
- Greedy pattern discovery by Boosting + gSpan
- Linear Programming (LP) Boosting for reducing the number of graph mining calls
- Accurate prediction and interpretable results
34Molecule as a labeled graph
35QSAR with patterns
pattern1 pattern2 pattern3 Activity
1 1 1 3
-1 1 -1 1.2
-1 1 -1 0.77
-1 1 -1 -3.52
1 1 -1 -4
36Sparse regression in a very high dimensional space
- $G$: set of all possible patterns (intractably large)
- $|G|$-dimensional feature vector $x$ for a molecule
- Linear regression
- Use an L1 regularizer to obtain a sparse weight vector $\alpha$
- Select a tractable number of patterns
37Problem formulation
We introduce the $\epsilon$-insensitive loss and an L1 regularizer
- $m$: number of training graphs
- $d = |G|$
- $\xi^+, \xi^-$: slack variables
- $\epsilon$: parameter
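The objective can be illustrated with a short sketch. It assumes the common form "L1 norm of the weights plus C times the sum of $\epsilon$-insensitive losses", with no bias term; the trade-off constant `C`, the function names and the weight vector are assumptions, since the LP itself is not reproduced in the text. At the optimum the slack variables equal these per-example losses. The feature matrix reuses the ±1 pattern table from the "QSAR with patterns" slide.

```python
def eps_insensitive(residual, eps):
    """Zero inside the eps-tube, linear outside."""
    return max(0.0, abs(residual) - eps)


def objective(alpha, X, y, eps, C):
    """L1 regularizer plus C times the total eps-insensitive loss (assumed form)."""
    l1 = sum(abs(a) for a in alpha)
    loss = sum(
        eps_insensitive(sum(a * x for a, x in zip(alpha, row)) - t, eps)
        for row, t in zip(X, y)
    )
    return l1 + C * loss


# Pattern indicators (+1 present, -1 absent) and activities.
X = [[1, 1, 1], [-1, 1, -1], [-1, 1, -1], [-1, 1, -1], [1, 1, -1]]
y = [3, 1.2, 0.77, -3.52, -4]

value = objective([0.0, 0.5, 1.5], X, y, eps=0.5, C=1.0)
print(round(value, 2))
```

Because the regularizer is the L1 norm, minimizing this objective drives most weights to exactly zero, which is what keeps the number of selected patterns tractable.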
38Dual LP
- Primal: huge number of weight variables
- Dual: huge number of constraints
LP1-Dual
39Column Generation Algorithm for LP Boost (Demiriz et al., 2002)
- Start from the dual with no constraints
- Add the most violated constraint each time
- Guaranteed to converge
[Figure: constraint matrix and its used part]
40Finding the most violated constraint
- Constraint for a pattern (shown again)
- Finding the most violated one
- Searched by weighted substructure mining
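A minimal sketch of the search criterion follows, assuming each pattern k contributes a dual constraint of the form $|\sum_i u_i x_{ik}| \le 1$. That form, the names `gain` and `most_violated`, and the dual weights are assumptions, because the constraint itself is not shown in the text. Here the candidates are three explicit ±1 columns; in the actual method the maximization runs over all subgraph patterns by weighted substructure mining with weights u.

```python
def gain(u, column):
    """Left-hand side of the assumed dual constraint for one pattern."""
    return abs(sum(ui * xi for ui, xi in zip(u, column)))


def most_violated(u, columns, tol=1e-9):
    """Return (pattern, gain) for the most violated constraint, or (None, gain)."""
    best = max(columns, key=lambda name: gain(u, columns[name]))
    g = gain(u, columns[best])
    return (best, g) if g > 1.0 + tol else (None, g)


# Candidate patterns as +1/-1 columns over five molecules (hypothetical names).
columns = {
    "p1": [1, -1, -1, -1, 1],
    "p2": [1, 1, 1, 1, 1],
    "p3": [1, -1, -1, -1, -1],
}
u = [1.0, 0.5, 0.0, -0.5, -1.0]  # made-up dual weights

print(most_violated(u, columns))
print(most_violated([0.25, 0.0, 0.0, 0.0, -0.25], columns))
```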
41Algorithm Overview
- Iteration
  - Find a new pattern by graph mining with weight $u$
  - If all constraints are satisfied, break
  - Add a new constraint
  - Update $u$ by LP1-Dual
- Return
  - Convert dual solution to obtain primal solution $\alpha$
42Speed-up by adding multiple patterns (multiple pricing)
- So far, the most violated pattern is chosen
- Mining and inclusion of top k patterns at each iteration
- Reduction of the number of mining calls
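Multiple pricing can be sketched as a top-k selection. The example assumes that a pattern violates its dual constraint when $|\sum_i u_i x_{ik}| > 1$ (an assumed form) and ranks a handful of explicit candidate columns; the function name, the pattern names and the weights are illustrative. In the actual method the top k patterns come out of a single mining call.

```python
import heapq


def top_k_violated(u, columns, k, tol=1e-9):
    """Names of the k most violated patterns, most violated first."""
    scored = (
        (abs(sum(ui * xi for ui, xi in zip(u, col))), name)
        for name, col in columns.items()
    )
    violated = [(g, name) for g, name in scored if g > 1.0 + tol]
    return [name for g, name in heapq.nlargest(k, violated)]


columns = {
    "p1": [1, -1, -1, -1, 1],
    "p2": [1, 1, 1, 1, 1],
    "p3": [1, -1, -1, -1, -1],
    "p4": [1, 1, -1, -1, -1],
}
u = [1.0, 0.5, 0.0, -0.5, -1.0]  # made-up dual weights

print(top_k_violated(u, columns, k=2))
```

Adding several constraints per iteration means each LP solve does more work, but fewer mining calls are needed overall.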
43Speed-up by multiple pricing
44Clearly negative data
atoms bonds rings Activity
22 25 3
20 21 1.2
23 24 0.77
11 11 -3.52
21 22 -4
22 20 -10000
23 19 -10000
45Inclusion of clearly negative data
LP2-Primal
- $l$: number of clearly negative data
- $z$: predetermined upper bound
- an additional slack variable for the clearly negative data
46Experiments
- Data from Endocrine Disruptors Knowledge Base
- 59 compounds labeled by a real number and 61 compounds labeled by a large negative number
- Label (target) is the log-transformed relative proliferative potency (log(RPP)), normalized between -1 and 1
- Comparison with
- Marginalized graph kernel + ridge regression
- Marginalized graph kernel + kNN regression
47Results with or without clearly negative data
[Figure: results of LP1 (without clearly negative data) and LP2 (with clearly negative data)]
48Extracted patterns
Interpretable, in contrast to the features implicitly expressed by the marginalized graph kernel
49Summary (Graph Boosting)
- Graph Boosting simultaneously generates patterns and learns their weights
- Finite convergence by column generation
- Potentially interpretable by chemists
- Flexible constraints and speed-up by LP
50Concluding Remarks
- Using graph mining as a part of machine learning algorithms
- Weights are essential
- Please include weights when you implement your item-set/tree/graph mining algorithms
- Make it available on the web!
- Then ML researchers can use it