05/2009
Large-Scale Graph Mining Using Backbone
Refinement Classes
Andreas Maunz 1), Christoph Helma 1,2), and Stefan
Kramer 3)
1) FDM, Universität Freiburg (D)
2) in-silico toxicology, Basel (CH)
3) Technische Universität München (D)
BACKBONE REFINEMENT CLASS MINING
- Efficient diverse substructure mining from a
large class-labelled graph database
BBRC Rationale
Typical substructure frequencies for databases of
small molecules
Trees are the most frequent substructure type, yet
they are efficiently enumerable. However:
- Excessively large result sets are obtained even
for high correlation and minimum frequency
constraints.
04
BBRC Definitions
GASTON (GrAph, Sequence and Tree ExtractiON) by
Nijssen and Kok [1]
- Backbone of a tree: the longest path with the
lowest sequence (assuming canonical sequence
ordering).
- Since every tree has exactly one backbone,
backbones partition the partial order of trees
disjointly.
- Pre-order (depth-first) traversal is used within
each partition to refine structures.
Backbone Refinement Class (BBRC): all tree
refinements growing from a specific backbone.
[1] Nijssen, S., Kok, J. N.: A Quickstart in
Frequent Structure Mining Can Make a Difference.
In: KDD '04, Proceedings of the Tenth ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining, New York, NY, USA, ACM, 2004,
pages 647-652.
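The backbone definition above can be illustrated with
a minimal Python sketch. It is not the GASTON or
libfminer implementation: it assumes node-labelled
trees without edge labels, takes plain lexicographic
comparison of label sequences as the canonical
ordering, and enumerates all leaf-to-leaf paths, which
is only practical for small trees. The names
`path_between`, `backbone`, `adj` and `labels` are
hypothetical.

```python
from itertools import combinations


def path_between(adj, src, dst):
    """Return the unique node path from src to dst in a tree."""
    stack = [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in adj[node]:
            # In a tree it is enough to never step back to the previous node.
            if len(path) < 2 or nxt != path[-2]:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")


def backbone(adj, labels):
    """Longest path of the tree; ties are broken by the lowest label sequence.

    Assumption: 'lowest sequence' means lexicographically smallest list of
    node labels, read in whichever direction gives the smaller list.
    """
    nodes = list(adj)
    if len(nodes) == 1:
        return [labels[nodes[0]]]
    leaves = [n for n in nodes if len(adj[n]) == 1]
    best = None
    for a, b in combinations(leaves, 2):
        seq = [labels[n] for n in path_between(adj, a, b)]
        seq = min(seq, seq[::-1])   # direction-independent form
        key = (-len(seq), seq)      # longer first, then lower sequence
        if best is None or key < best:
            best = key
    return best[1]


# Toy tree: chain C-C-C-O with an N branching off the second carbon.
labels = {0: "C", 1: "C", 2: "C", 3: "O", 4: "N"}
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
toy_backbone = backbone(adj, labels)
```

In the toy tree there are two longest paths, C-C-C-O
and N-C-C-O; the first has the lower label sequence
and is therefore chosen. The N branch would then be a
refinement growing from that backbone.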
BBRC Example
Backbones in gray
BBRC Properties (1)
Some Properties
- Two types of BBRCs:
  - within a backbone: not disjoint (see figure on
  the left)
  - across backbones: disjoint
- A given backbone spans a maximum search tree: no
node may be added without changing the backbone.
- BBRCs partition the search space structurally
(as opposed to occurrence-based methods, such as
open/closed features).
Figure: search space for two BBRCs within the same
backbone.
BBRC Properties (2)
The Number of BBRCs
- The number of Backbone Refinement Classes is
governed by the (recursive) branches on the
backbone.
BBRC Properties (3)
The Number of BBRCs (unpublished)
BBRC Properties (4)
Summary of Feature Counts
BBRC Implementation
Idea: use paths as candidate backbones. Mine
BBRCs and represent each BBRC by its most
(χ²-)significant member.
- In case of several most significant members, use
the most general one.
- χ² thresholds cannot be used for anti-monotonic
pruning; however, an upper bound for the χ² values
of refinements of a pattern exists [1]
(Statistical Metric Pruning).
Dynamic Upper Bound Pruning: the χ² threshold may
be increased during depth-first traversal, since we
only search for the maximum elements of classes.
[1] Morishita, S., Sese, J.: Traversing Itemset
Lattices with Statistical Metric Pruning. In:
Symposium on Principles of Database Systems,
pages 226-236, 2000.
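The pruning idea on this slide can be sketched in
Python as follows. This is a minimal illustration
under stated assumptions, not the libfminer code: a
pattern is summarised by its support `x` (graphs
containing it) and `y` (active graphs containing it)
in a database of `n` graphs with `m` actives, the
upper bound is the convexity bound of Morishita and
Sese, and the refinement tree, supports and the
function names `chi2`, `chi2_upper_bound` and
`best_in_class` are hypothetical. The starting
threshold of 3.84 corresponds to 95% significance at
one degree of freedom.

```python
def chi2(x, y, n, m):
    """Chi-square of the 2x2 table (pattern present/absent vs. active/inactive)."""
    if x == 0 or x == n:
        return 0.0
    return n * (y * n - x * m) ** 2 / (x * (n - x) * m * (n - m))


def chi2_upper_bound(x, y, n, m):
    """Bound on chi2 of any refinement of a pattern with support (x, y).

    A refinement can only lose occurrences, so its extreme cases are:
    it keeps only the y active occurrences, or only the x - y inactive ones.
    """
    return max(chi2(y, y, n, m), chi2(x - y, 0, n, m))


def best_in_class(root, children, support, n, m, threshold):
    """Depth-first search for the most significant pattern below root.

    The threshold is raised to the best score found so far (dynamic upper
    bound); a branch is skipped when its bound cannot exceed that score.
    """
    best, best_score = None, threshold
    visited = []
    stack = [root]
    while stack:
        pat = stack.pop()
        x, y = support[pat]
        visited.append(pat)
        score = chi2(x, y, n, m)
        # Strict '>' keeps the earlier-visited pattern on ties; parents are
        # visited before their refinements, so the more general one wins.
        if score > best_score:
            best, best_score = pat, score
        if chi2_upper_bound(x, y, n, m) > best_score:
            stack.extend(reversed(children.get(pat, [])))
    return best, best_score, visited


# Hypothetical refinement tree with (x, y) supports; n = 100 graphs, m = 50 active.
support = {"A": (60, 40), "AB": (30, 28), "ABD": (10, 10),
           "AC": (25, 12), "ACE": (20, 10)}
children = {"A": ["AB", "AC"], "AB": ["ABD"], "AC": ["ACE"]}
best, score, visited = best_in_class("A", children, support, 100, 50, 3.84)
```

In this toy run the refinements of "AC" are never
visited: once "AB" has raised the threshold, the
bound for "AC" falls below it. With a static
threshold of 3.84 the same branch would have been
expanded.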
BBRC Experiments (1)
Investigation of BBRCs regarding time efficiency,
feature set sizes and expressiveness
- Class-balanced CPDB datasets:
  - Salmonella Mutagenicity (SM, 388 active / 810
  compounds)
  - Rat Carcinogenicity (RC, 459 active / 1145
  compounds)
  - Mouse Carcinogenicity (MoC, 428 active / 927
  compounds)
  - Multicell Call (MuC, 553 active / 1067
  compounds)
- Significant Trees: all trees that are frequent
and significant.
- BBRC Representatives: the most significant
representatives of the backbone refinement
classes.
BBRC Experiments (2)
Feature Set Sizes
Minimum frequency: 6
BBRC Experiments (3)
Time Efficiency
Minimum frequency: 6
BBRC Experiments (4)
Instance-based predictions
- all: all predictions
- AD: top 80% confidence predictions
- wt.: predictions weighted by confidence
Measures: Accuracy, Sensitivity, Specificity
Legend: Black = Sign. Trees, Dark Gray = BBRC-R.,
Light Gray = Open Trees
Large-Scale Analysis (1)
Large-Scale Analysis
- NCI Yeast Anticancer Drug Screen datasets (April
2002 release):
  1. AC-One (stage 0): 87,264 compounds, 12,068
  active
  2. AC-All (stage 0): 87,264 compounds, 5,777
  active
  3. AC-All (stage 1): 10,924 compounds, 5,433
  active
To the best of the authors' knowledge, datasets 1
and 2 are the largest labelled datasets that have
been considered in correlated graph mining.
Large-Scale Analysis (2)
Effects of Minimum Frequency on Dataset Coverage
AC-One (stage 0) 87,264 comp
Large-Scale Analysis (3)
Feature Count for Balanced Datasets (downsampling)
Max. Trees: the positive border as implied by
minimum frequency and significance constraints [1].
[1] Al Hasan, M., et al.: ORIGAMI: Mining
Representative Orthogonal Graph Patterns. In: ICDM
2007, Seventh IEEE International Conference on
Data Mining, pages 153-162, Oct. 2007.
Large-Scale Analysis (4)
Time Efficiency
Time efficiency (Mining)
- Open Trees: mining times of 4-12 h
Time efficiency (Prediction)
- Open Trees: prediction times of > 60 s,
impractical RAM demand.
BBRC Experiments (5)
Euclidean embedding based on Co-Occurrences and
Entropy [1]
Legend: Active / Inactive compounds, Activating /
Deactivating features
- Differently colored features are nearly perfectly
separated
- Features are well distributed with few clusters
[1] Schulz, H., Kersting, K., Karwath, A.: ILP, the
Blind, and the Elephant: Euclidean Embedding of
Co-Proven Queries. In: Proceedings of the 19th
International Conference on Inductive Logic
Programming (ILP 2009) (forthcoming).
SUMMARY
Summary (1)
Backbone Refinement Class Representatives
- Structurally heterogeneous descriptors,
compression by structural invariant (backbone
constraint)
- Good dataset coverage, robust against increasing
minimum frequencies
- Applicable to large-scale graph databases
through a novel statistical pruning technique
Summary (2)
Backbone Refinement Class Representatives
- Compression of 90% compared to all trees and 31%
compared to open trees
- Time efficiency improved by 85% and 83% versus
no statistical pruning and static upper bound
pruning, respectively.
- Discriminative potential similar to complete set
of trees, but significantly better than open
trees.
Acknowledgements
The authors would like to thank Björn Bringmann
for providing a binary and for friendly cooperation
in dataset testing, and Ulrich Rückert for
providing datasets. The research was (partially)
supported by the EU Seventh Framework Programme
under contract no. Health-F5-2008-200787 (OpenTox).
http://www.opentox.org
C++ implementation: http://www.maunz.de/libfminer-doc