Flowers

Transcript and Presenter's Notes

Title: Flowers

1
05/2009
Large-Scale Graph Mining Using Backbone Refinement Classes
Andreas Maunz (1), Christoph Helma (1, 2), and Stefan Kramer (3)
(1) FDM, Universität Freiburg (D)
(2) in-silico toxicology, Basel (CH)
(3) Technische Universität München (D)
2
BACKBONE REFINEMENT CLASS MINING

Efficient diverse substructure mining from a
large class-labelled graph database

3
BBRC Rationale
Typical substructure frequencies for databases of
small molecules
Trees are the most frequent substructure type, yet they are efficiently enumerable.

However, excessively large result sets are obtained even for high correlation and minimum frequency constraints.

4
BBRC Definitions
GASTON (GrAph, Sequence and Tree ExtractiON) by Nijssen and Kok [1]

Backbone of a tree: the longest path with the lowest sequence (assuming canonical sequence ordering).

Since every tree has exactly one backbone,
backbones partition the partial order of trees
disjointly.

Pre-order (depth-first) traversal is used within
each partition to refine structures.

Backbone Refinement Class (BBRC): all tree refinements growing from a specific backbone.
[1] Nijssen S., Kok J.N.: A Quickstart in Frequent Structure Mining can make a Difference. KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM, 2004, pp. 647-652.
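The backbone definition above can be illustrated with a minimal sketch. It is not GASTON's implementation: it assumes node labels only (no edge labels), takes plain lexicographic order on label sequences as the canonical ordering, and enumerates all node pairs by brute force, which is only sensible for small trees. The names `path_between` and `backbone` are hypothetical.

```python
from itertools import combinations

def path_between(adj, src, dst):
    """Return the unique simple path from src to dst in a tree."""
    stack = [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in adj[node]:
            # In a tree it is enough to never step back to the previous node.
            if len(path) < 2 or nxt != path[-2]:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")

def backbone(adj, labels):
    """Longest path with the smallest label sequence (assumed ordering)."""
    nodes = sorted(adj)
    if len(nodes) == 1:
        return [labels[nodes[0]]]
    best = None
    for u, v in combinations(nodes, 2):
        seq = [labels[n] for n in path_between(adj, u, v)]
        seq = min(seq, seq[::-1])      # a path can be read in both directions
        key = (-len(seq), seq)         # longest first, then smallest sequence
        if best is None or key < best:
            best = key
    return best[1]

# Toy tree: C(0) - C(1) - N(2), with O(3) also attached to node 1.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
labels = {0: "C", 1: "C", 2: "N", 3: "O"}
result = backbone(adj, labels)
```

In the toy tree, all three longest paths have three nodes; under the assumed ordering the sequence C, C, N is the smallest, so it is selected as the backbone.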
5
BBRC Example
Backbones in gray
6
BBRC Properties (1)
Some Properties

Two types of BBRCs:
within a backbone: not disjoint (see figure on the left)
across backbones: disjoint

A given backbone spans a maximum search tree. No
node may be added without changing the backbone.

BBRCs partition the search space structurally
(as opposed to occurrence-based methods, such as
open/closed features).

Search space for two BBRCs within the same
backbone.
7
BBRC Properties (2)
The Number of BBRCs
The number of Backbone Refinement Classes is governed by the (recursive) branches on this backbone.
8
BBRC Properties (3)
The Number of BBRCs (unpublished)
9
BBRC Properties (4)
Summary of Feature Counts
10
BBRC Implementation
Idea: Use paths as candidate backbones. Mine BBRCs and represent each BBRC by the most (χ²-)significant member.

In case of several most significant members, use the most general one.

χ² thresholds cannot be used for anti-monotonic pruning; however, an upper bound for the χ² values of refinements of a pattern exists [1] (Statistical Metric Pruning).

Dynamic Upper Bound Pruning: the χ² threshold may be increased during depth-first traversal, since we only search for the max. elements of classes.
[1] S. Morishita and J. Sese. Traversing Itemset Lattices with Statistical Metric Pruning. In Symposium on Principles of Database Systems, pages 226-236, 2000.
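The pruning idea on this slide can be sketched as follows. This is an illustration under stated assumptions, not the libfminer code: χ² is computed from a plain 2x2 contingency table, the upper bound is the convexity-based bound of Statistical Metric Pruning, and the `Pattern` class, the toy search tree, and the function names are hypothetical. For a pattern occurring in x of n graphs, y of them among the m active ones, any refinement can only lose occurrences, so its χ² is at most max(χ²(y, y), χ²(x - y, 0)).

```python
from dataclasses import dataclass, field

@dataclass
class Pattern:
    name: str
    x: int                      # graphs containing the pattern
    y: int                      # of those, graphs labelled active
    children: list = field(default_factory=list)  # refinements

def chi2(x, y, n, m):
    """Chi-square of the 2x2 table (pattern present/absent vs. active/inactive)."""
    if x in (0, n) or m in (0, n):
        return 0.0
    a, b = y, x - y
    c, d = m - y, (n - m) - (x - y)
    return n * (a * d - b * c) ** 2 / (x * (n - x) * m * (n - m))

def chi2_upper_bound(x, y, n, m):
    """Upper bound on chi2 for every refinement of a pattern with counts (x, y)."""
    return max(chi2(y, y, n, m), chi2(x - y, 0, n, m))

def most_significant(root, n, m, threshold):
    """Depth-first search that raises the threshold whenever a better member is found."""
    best, visited = None, 0
    stack = [root]
    while stack:
        p = stack.pop()
        visited += 1
        score = chi2(p.x, p.y, n, m)
        if score >= threshold and (best is None or score > best[1]):
            best = (p.name, score)   # ties keep the earlier, more general pattern
            threshold = score        # dynamic upper bound pruning
        if chi2_upper_bound(p.x, p.y, n, m) >= threshold:
            stack.extend(reversed(p.children))
    return best, visited

toy = Pattern("root", 40, 30, [
    Pattern("A", 25, 24, [Pattern("A1", 10, 10)]),
    Pattern("B", 12, 6, [Pattern("B1", 5, 5)]),
])
best, visited = most_significant(toy, n=100, m=50, threshold=3.84)
```

In this toy run the threshold rises from 3.84 to the χ² value of pattern A, so the branch below B is cut because its upper bound falls below the raised threshold; with the static threshold it would have been expanded.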
11
BBRC Experiments (1)
Investigation of BBRCs regarding time efficiency,
feature set sizes and expressiveness

Class-Balanced CPDB datasets:
Salmonella Mutagenicity (SM, 388 active / 810 compounds)
Rat Carcinogenicity (RC, 459 active / 1145 compounds)
Mouse Carcinogenicity (MoC, 428 active / 927 compounds)
Multicell Call (MuC, 553 active / 1067 compounds)

Significant Trees: all trees that are frequent and significant.

BBRC Representatives: most significant representatives of the backbone refinement classes.

12
BBRC Experiments (2)
Feature Set Sizes
Minimum frequency: 6
13
BBRC Experiments (3)
Time Efficiency
Minimum frequency: 6
14
BBRC Experiments (4)
Instance-based predictions
all: all predictions
AD: top 80% confidence predictions
wt.: predictions weighted by confidence
Accuracy, Sensitivity, Specificity
Black: Sign. Trees; Dark Gray: BBRC-R.; Light Gray: Open Trees
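The three evaluation settings can be sketched as below. The slide does not state how the confidence weighting or the top-80% cut was computed, so both are assumptions here: each prediction contributes its confidence as a weight, and the cut keeps the most confident fraction of predictions. The function names and the toy data are hypothetical.

```python
import math

def scores(y_true, y_pred, weights=None):
    """Accuracy, sensitivity and specificity, optionally weighted per prediction."""
    if weights is None:
        weights = [1.0] * len(y_true)
    tp = fn = tn = fp = 0.0
    for t, p, w in zip(y_true, y_pred, weights):
        if t == 1 and p == 1:
            tp += w
        elif t == 1:
            fn += w
        elif p == 0:
            tn += w
        else:
            fp += w
    ratio = lambda num, den: num / den if den else math.nan
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }

def top_confident(y_true, y_pred, conf, fraction=0.8):
    """Keep the given fraction of predictions with the highest confidence."""
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    keep = sorted(order[: round(fraction * len(conf))])
    return [y_true[i] for i in keep], [y_pred[i] for i in keep]

# Toy data: 1 = active, 0 = inactive.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
conf = [0.9, 0.6, 0.8, 0.3, 0.7]

all_scores = scores(y_true, y_pred)                          # "all"
ad_scores = scores(*top_confident(y_true, y_pred, conf))     # "AD"
wt_scores = scores(y_true, y_pred, weights=conf)             # "wt."
```

On the toy data, dropping the least confident prediction removes the only false positive, which is why the AD setting scores higher than the unfiltered one.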
15
Large-Scale Analysis (1)
Large Scale Analysis

NCI Yeast Anticancer Drug Screen datasets (April 2002 release):
1. AC-One (stage 0): 87,264 compounds, 12,068 active
2. AC-All (stage 0): 87,264 compounds, 5,777 active
3. AC-All (stage 1): 10,924 compounds, 5,433 active

To the best of the authors' knowledge, 1. and 2. are the largest labelled datasets that have been considered in correlated graph mining.
16
Large-Scale Analysis (2)
Effects of Minimum Frequency on Dataset Coverage
AC-One (stage 0) 87,264 comp
17
Large-Scale Analysis (3)
Feature Count for Balanced datasets (downsampling)
Max. Trees: the positive border as implied by minimum frequency and significance constraints [1].
[1] M. Al Hasan et al. ORIGAMI: Mining Representative Orthogonal Graph Patterns. ICDM 2007, Seventh IEEE International Conference on Data Mining, pages 153-162, Oct. 2007.
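Balancing a dataset by downsampling, as mentioned in the slide title, can be sketched like this. The slide gives no details of the procedure, so this assumes the simplest variant: the larger class is randomly reduced to the size of the smaller one. The function name and the fixed seed are hypothetical choices for reproducibility.

```python
import random

def downsample_balanced(labels, seed=0):
    """Return sorted indices of a class-balanced subset (1 = active, 0 = inactive)."""
    rng = random.Random(seed)
    actives = [i for i, label in enumerate(labels) if label == 1]
    inactives = [i for i, label in enumerate(labels) if label == 0]
    k = min(len(actives), len(inactives))
    # Sample k compounds from each class; the smaller class is kept whole.
    return sorted(rng.sample(actives, k) + rng.sample(inactives, k))

# Toy labels: 3 actives and 7 inactives.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
kept = downsample_balanced(toy_labels)
```

The returned indices can then be used to select the corresponding compounds before mining.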
18
Large-Scale Analysis (4)
Time Efficiency
Time efficiency (Mining)
Open Trees: mining times of 4-12 h
Time efficiency (Prediction)
Open Trees: prediction times of > 60 s, impractical RAM demand.
19
BBRC Experiments (5)
Euclidean embedding based on Co-Occurrences and Entropy [1]
Active / Inactive compounds; Activating / Deactivating features
Differently colored features are nearly perfectly separated
Features are well distributed with few clusters
[1] Hannes Schulz, Kristian Kersting, Andreas Karwath: ILP, the Blind, and the Elephant: Euclidean Embedding of Co-Proven Queries. Proceedings of the 19th International Conference on Inductive Logic Programming (ILP 2009), forthcoming.
20
SUMMARY
21
Summary (1)
Backbone Refinement Class Representatives

Structurally heterogeneous descriptors,
compression by structural invariant (backbone
constraint)

Good dataset coverage, robust against increasing
minimum frequencies

Applicable to large-scale graph databases
through a novel statistical pruning technique

22
Summary (2)
Backbone Refinement Class Representatives

Compression of 90% compared to all trees and 31% compared to open trees

Time efficiency improved by 85% and 83% versus no statistical pruning and static upper bound pruning, respectively.

Discriminative potential similar to complete set
of trees, but significantly better than open
trees.

23
Acknowledgements
The authors would like to thank Björn Bringmann
for providing a binary and friendly cooperation
in dataset testing, and Ulrich Rückert for
providing datasets. The research was (partially)
supported by the EU seventh framework programme
under contract no Health-F5-2008-200787 (OpenTox).
http://www.opentox.org
C++ implementation: http://www.maunz.de/libfminer-doc
