Flowers - PowerPoint PPT Presentation

1
05/2009
Large-Scale Graph Mining Using Backbone Refinement Classes
Andreas Maunz (1), Christoph Helma (1,2), and Stefan Kramer (3)
1) FDM Universität Freiburg (D)
2) in-silico toxicology Basel (CH)
3) Technische Universität München (D)
2
BACKBONE REFINEMENT CLASS MINING
  • Efficient diverse substructure mining from a
    large class-labelled graph database

3
BBRC Rationale
Typical substructure frequencies for databases of
small molecules
Trees are the most frequent substructure type, yet they are
efficiently enumerable. However:
  • Excessively large result sets are obtained even
    for high correlation and minimum frequency
    constraints.

4
BBRC Definitions
GASTON (GrAph, Sequence and Tree ExtractiON) by
Nijssen and Kok [1]
  • Backbone of a tree: the longest path with the lowest
    sequence (assuming canonical sequence ordering).
  • Since every tree has exactly one backbone,
    backbones partition the partial order of trees
    disjointly.
  • Pre-order (depth-first) traversal is used within
    each partition to refine structures.

Backbone Refinement Class (BBRC): all tree
refinements growing from a specific backbone.
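
The backbone definition above can be sketched in a few lines. This is a minimal illustration, not the GASTON implementation: it assumes trees are given as adjacency dictionaries, that only node labels matter, and that "lowest sequence" means lexicographically smallest label sequence (read in whichever direction is smaller). The function names are hypothetical.

```python
from itertools import combinations

def path_between(adj, src, dst):
    """Return the unique simple path from src to dst in a tree."""
    stack = [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in adj[node]:
            # In a tree it is enough to never step back to the previous node.
            if len(path) < 2 or nxt != path[-2]:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")

def backbone(adj, labels):
    """Longest path of the tree; ties broken by lowest label sequence.

    Assumption: canonical ordering = lexicographic order on node labels.
    Brute force over all node pairs, fine for small molecules only.
    """
    nodes = list(adj)
    if len(nodes) == 1:
        return [labels[nodes[0]]]
    best = None
    for u, v in combinations(nodes, 2):
        seq = [labels[n] for n in path_between(adj, u, v)]
        seq = min(seq, seq[::-1])   # direction-independent form
        key = (-len(seq), seq)      # longest first, then lowest sequence
        if best is None or key < best:
            best = key
    return best[1]

# Toy tree: chain C-C-C-O with an N branching off the second carbon.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
labels = {0: "C", 1: "C", 2: "C", 3: "O", 4: "N"}
print(backbone(adj, labels))
```

In the toy tree there are two longest paths (C-C-C-O and N-C-C-O); the first one has the lower label sequence and is therefore taken as the backbone.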
[1] Nijssen S., Kok J.N.: A Quickstart in
Frequent Structure Mining can make a Difference.
KDD '04: Proceedings of the tenth ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining. New York, NY, USA: ACM, 2004,
647–652.
5
BBRC Example
Backbones in gray
6
BBRC Properties (1)
Some Properties
  • Two types of BBRCs:
      • within a backbone: not disjoint (see figure on
        the left)
      • across backbones: disjoint
  • A given backbone spans a maximum search tree. No
    node may be added without changing the backbone.
  • BBRCs partition the search space structurally
    (as opposed to occurrence-based methods, such as
    open/closed features).

Search space for two BBRCs within the same
backbone.
7
BBRC Properties (2)
The Number of BBRCs
The number of Backbone Refinement Classes is
governed by the (recursive) branches on this
backbone.
8
BBRC Properties (3)
The Number of BBRCs (unpublished)
9
BBRC Properties (4)
Summary of Feature Counts
10
BBRC Implementation
Idea: Use paths as candidate backbones. Mine
BBRCs and represent each BBRC by its most
(χ²-)significant member.
  • In case of several most significant members, use
    the most general one.
  • χ² thresholds cannot be used for anti-monotonic
    pruning; however, an upper bound for the χ² values
    of all refinements of a pattern exists [1]
    (Statistical Metric Pruning).
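
As a rough illustration of the bound, the sketch below computes the χ² value of a pattern from its support counts and the convexity-based upper bound commonly attributed to Morishita and Sese. It assumes a two-class database with n graphs, m of them active, and a pattern occurring in x graphs, y of them active; the function names are hypothetical and not taken from the authors' implementation.

```python
def chi2(x, y, n, m):
    """Chi-square value of a pattern on a two-class database.

    x: graphs containing the pattern, y: active graphs among those x,
    n: all graphs, m: all active graphs.
    """
    if x in (0, n) or m in (0, n):
        return 0.0  # degenerate 2x2 table, no association measurable
    return n * (y * n - x * m) ** 2 / (x * (n - x) * m * (n - m))

def chi2_upper_bound(x, y, n, m):
    """Upper bound on chi2 for every refinement of a pattern with support (x, y).

    A refinement can only lose occurrences, so its support (x', y') satisfies
    y' <= y and x' - y' <= x - y. Chi2 is convex, so the maximum over that
    region is attained at a corner: (y, y) or (x - y, 0).
    """
    return max(chi2(y, y, n, m), chi2(x - y, 0, n, m))

n, m = 100, 50   # toy database: 100 graphs, 50 active
x, y = 20, 15    # pattern occurs in 20 graphs, 15 of them active
print(chi2(x, y, n, m))              # 6.25
print(chi2_upper_bound(x, y, n, m))  # roughly 17.65
```

If the bound of a pattern falls below the significance threshold, none of its refinements can reach the threshold, so the whole branch can be skipped.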

Dynamic Upper Bound Pruning: the χ² threshold may be
increased during depth-first traversal, since we
only search for the maximal elements of classes.
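
A minimal sketch of this idea on a hand-built refinement tree follows. It assumes a two-class database (n graphs, m active), patterns annotated with support counts (x graphs, y of them active), and the usual convexity-based χ² bound; the `Pattern` class, the function names and the toy numbers are hypothetical and only illustrate the pruning logic, not the authors' code.

```python
from dataclasses import dataclass, field

@dataclass
class Pattern:
    name: str
    x: int                      # graphs containing the pattern
    y: int                      # active graphs among them
    children: list = field(default_factory=list)

def chi2(x, y, n, m):
    if x in (0, n) or m in (0, n):
        return 0.0
    return n * (y * n - x * m) ** 2 / (x * (n - x) * m * (n - m))

def upper_bound(x, y, n, m):
    # Convexity bound: the best refinement sits at (y, y) or (x - y, 0).
    return max(chi2(y, y, n, m), chi2(x - y, 0, n, m))

def best_in_class(root, n, m, threshold):
    """Depth-first search for the most significant pattern below root."""
    best, bound, visited = None, threshold, []
    stack = [root]
    while stack:
        p = stack.pop()
        visited.append(p.name)
        score = chi2(p.x, p.y, n, m)
        if score > bound:
            best, bound = p, score      # raise the threshold dynamically
        # Expand only if some refinement could still beat the current best.
        if upper_bound(p.x, p.y, n, m) > bound:
            stack.extend(reversed(p.children))
    return best, bound, visited

# Toy refinement tree; supports shrink from parent to child.
root = Pattern("A", 40, 30, [
    Pattern("B", 30, 29, [Pattern("D", 10, 10, [Pattern("E", 5, 5)])]),
    Pattern("C", 12, 6, [Pattern("F", 6, 3)]),
])
best, score, visited = best_in_class(root, n=100, m=50, threshold=3.84)
print(best.name, round(score, 2), visited)
```

With a static threshold of 3.84 the branches below D and C would still be expanded; once B has raised the threshold to its own χ² value, their bounds are too low and E and F are never visited.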
[1] S. Morishita and J. Sese. Traversing Itemset
Lattices with Statistical Metric Pruning. In
Symposium on Principles of Database Systems,
pages 226–236, 2000.
11
BBRC Experiments (1)
Investigation of BBRCs regarding time efficiency,
feature set sizes and expressiveness
  • Class-balanced CPDB datasets:
      • Salmonella Mutagenicity (SM, 388 active / 810
        compounds)
      • Rat Carcinogenicity (RC, 459 active / 1145
        compounds)
      • Mouse Carcinogenicity (MoC, 428 active / 927
        compounds)
      • Multicell Call (MuC, 553 active / 1067
        compounds)
  • Significant Trees: all trees that are frequent
    and significant.
  • BBRC Representatives: the most significant
    representatives of the backbone refinement
    classes.
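
Choosing the representatives can be sketched as a simple selection per class. The record layout below (keys `bbrc`, `pattern`, `chi2`, `size`) is hypothetical, and using the node count as a stand-in for "most general" in case of ties is an assumption made for this sketch.

```python
from collections import defaultdict

def class_representatives(patterns):
    """Pick one representative per backbone refinement class.

    patterns: iterable of dicts with hypothetical keys
      'bbrc' (class id), 'pattern' (description), 'chi2', 'size'.
    """
    by_class = defaultdict(list)
    for p in patterns:
        by_class[p["bbrc"]].append(p)
    # Highest chi2 wins; among ties the smallest pattern is taken as the
    # most general one (assumption of this sketch).
    return {
        cid: min(members, key=lambda p: (-p["chi2"], p["size"]))
        for cid, members in by_class.items()
    }

patterns = [
    {"bbrc": "bbrc_1", "pattern": "C-C-N", "chi2": 9.1, "size": 3},
    {"bbrc": "bbrc_1", "pattern": "C-C(-O)-N", "chi2": 12.4, "size": 4},
    {"bbrc": "bbrc_1", "pattern": "C-C(-O)(-Cl)-N", "chi2": 12.4, "size": 5},
    {"bbrc": "bbrc_2", "pattern": "C-O", "chi2": 4.2, "size": 2},
]
reps = class_representatives(patterns)
for cid, rep in reps.items():
    print(cid, rep["pattern"], rep["chi2"])
```

The feature set then contains one pattern per class instead of every significant tree, which is where the reduction in feature set size comes from.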

12
BBRC Experiments (2)
Feature Set Sizes
Minimum frequency 6
13
BBRC Experiments (3)
Time Efficiency
Minimum frequency 6
14
BBRC Experiments (4)
Instance-based predictions
  • all: all predictions
  • AD: top 80% confidence predictions
  • wt.: predictions weighted by confidence
Accuracy, Sensitivity, Specificity
Black: Sign. Trees; Dark Gray: BBRC-R.; Light Gray:
Open Trees
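
The three evaluation variants can be sketched as follows. How the slides define confidence weighting is not stated, so the sketch assumes that each prediction simply counts with its confidence as weight; the record layout and function name are hypothetical.

```python
def metrics(records, top_fraction=1.0, weighted=False):
    """Accuracy, sensitivity and specificity of binary predictions.

    records: list of (true_label, predicted_label, confidence), labels in {0, 1}.
    top_fraction: keep only this share of the most confident predictions.
    weighted: count each prediction with its confidence (assumption).
    """
    ranked = sorted(records, key=lambda r: r[2], reverse=True)
    kept = ranked[: max(1, round(top_fraction * len(ranked)))]
    tp = fn = tn = fp = 0.0
    for truth, pred, conf in kept:
        w = conf if weighted else 1.0
        if truth == 1 and pred == 1:
            tp += w
        elif truth == 1:
            fn += w
        elif pred == 0:
            tn += w
        else:
            fp += w

    def ratio(a, b):
        return a / b if b else float("nan")

    return (ratio(tp + tn, tp + tn + fp + fn),  # accuracy
            ratio(tp, tp + fn),                 # sensitivity
            ratio(tn, tn + fp))                 # specificity

records = [
    (1, 1, 0.90), (1, 1, 0.80), (0, 0, 0.85), (0, 0, 0.70), (1, 0, 0.60),
    (0, 1, 0.55), (1, 1, 0.50), (0, 0, 0.45), (1, 0, 0.20), (0, 1, 0.10),
]
print("all:", metrics(records))
print("AD :", metrics(records, top_fraction=0.8))
print("wt.:", metrics(records, weighted=True))
```

In the toy data the two least confident predictions are both wrong, so restricting to the top 80% raises all three measures from 0.6 to 0.75.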
15
Large-Scale Analysis (1)
Large Scale Analysis
  • NCI Yeast Anticancer Drug Screen datasets (April
    2002 release):
      1. AC-One (stage 0): 87,264 compounds, 12,068
         active
      2. AC-All (stage 0): 87,264 compounds, 5,777
         active
      3. AC-All (stage 1): 10,924 compounds, 5,433 active

To the best of the authors' knowledge, datasets 1 and 2
are the largest labelled datasets that have been
considered in correlated graph mining.
16
Large-Scale Analysis (2)
Effects of Minimum Frequency on Dataset Coverage
AC-One (stage 0) 87,264 comp
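
One plausible reading of dataset coverage is the share of compounds that contain at least one mined feature. The sketch below uses that reading as an assumption (the slides do not spell out the definition) and shows how coverage shrinks as the minimum frequency rises; the occurrence table is toy data.

```python
def coverage(occurrences, n_compounds, min_freq):
    """Fraction of compounds hit by at least one sufficiently frequent feature.

    occurrences: dict mapping a feature to the set of compound ids it occurs in.
    """
    covered = set()
    for compound_ids in occurrences.values():
        if len(compound_ids) >= min_freq:
            covered |= set(compound_ids)
    return len(covered) / n_compounds

# Toy occurrence table for 8 compounds (ids 0..7).
occurrences = {"f1": {0, 1, 2, 3}, "f2": {3, 4}, "f3": {5}}
for min_freq in (1, 2, 3, 5):
    print(min_freq, coverage(occurrences, 8, min_freq))
```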
17
Large-Scale Analysis (3)
Feature Count for Balanced datasets (downsampling)
Max. Trees: the positive border as implied by
minimum frequency and significance constraints [1].
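
Balancing by downsampling, as mentioned for these datasets, can be sketched as random removal of compounds from the larger class. The exact procedure used for the slides is not stated; the function name, the fixed seed and the toy labels are assumptions of this sketch.

```python
import random

def downsample_balanced(labels, seed=0):
    """Return sorted indices of a class-balanced subset.

    labels: list of 0/1 class labels (1 = active).
    The larger class is randomly reduced to the size of the smaller one.
    """
    rng = random.Random(seed)   # fixed seed for reproducibility
    actives = [i for i, y in enumerate(labels) if y == 1]
    inactives = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(actives), len(inactives))
    return sorted(rng.sample(actives, k) + rng.sample(inactives, k))

labels = [1] * 3 + [0] * 9      # toy data: 3 actives, 9 inactives
kept = downsample_balanced(labels)
print(kept, sum(labels[i] for i in kept))
```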
[1] M. Al Hasan et al. ORIGAMI: Mining
Representative Orthogonal Graph Patterns. ICDM
2007: Seventh IEEE International Conference on
Data Mining, pages 153–162, Oct. 2007.
18
Large-Scale Analysis (4)
Time Efficiency
Time efficiency (Mining)
Open Trees: mining times of 4-12 h
Time efficiency (Prediction)
Open Trees: prediction times of > 60 s, impractical
RAM demand.
19
BBRC Experiments (5)
Euclidean embedding based on Co-Occurrences and
Entropy [1]
Active / Inactive compounds
Activating / Deactivating features
Differently colored features nearly perfectly
separated
Features are well distributed with few clusters
[1] Hannes Schulz, Kristian Kersting, Andreas
Karwath. ILP, the Blind, and the Elephant:
Euclidean Embedding of Co-Proven Queries.
Proceedings of the 19th International Conference
on Inductive Logic Programming (ILP 2009),
forthcoming.
20
SUMMARY
21
Summary (1)
Backbone Refinement Class Representatives
  • Structurally heterogeneous descriptors,
    compression by structural invariant (backbone
    constraint)
  • Good dataset coverage, robust against increasing
    minimum frequencies
  • Applicable to large-scale graph databases
    through a novel statistical pruning technique

22
Summary (2)
Backbone Refinement Class Representatives
  • Compression of 90% compared to all trees and 31%
    compared to open trees
  • Time efficiency improved by 85% and 83% versus
    no statistical pruning and static upper bound
    pruning, respectively.
  • Discriminative potential similar to complete set
    of trees, but significantly better than open
    trees.

23
Acknowledgements
The authors would like to thank Björn Bringmann
for providing a binary and friendly cooperation
in dataset testing, and Ulrich Rückert for
providing datasets. The research was (partially)
supported by the EU seventh framework programme
under contract no Health-F5-2008-200787 (OpenTox).
http://www.opentox.org
C++ implementation: http://www.maunz.de/libfminer-doc