1
Direct Mining of Discriminative and Essential
Frequent Patterns via Model-based Search Tree
How to find good features from semi-structured
raw data for classification
  • Wei Fan, Kun Zhang, Hong Cheng,
  • Jing Gao, Xifeng Yan, Jiawei Han,
  • Philip S. Yu, Olivier Verscheure

2
Feature Construction
  • Most data mining and machine learning models
    assume the following structured data:
  • (x1, x2, ..., xk) → y
  • where the xi's are independent variables
  • and y is the dependent variable.
  • y drawn from a discrete set: classification
  • y drawn from a continuous range: regression
  • When feature vectors are good, differences in
    accuracy among learners are small.
  • Question: where do good features come from?

3
Frequent Pattern-Based Feature Extraction
  • Data not in the pre-defined feature vectors
  • Transactions
  • Biological sequence
  • Graph database

Frequent patterns are good candidates for
discriminative features. So, how do we mine them?
4
Frequent Pattern: Sub-graph
(example borrowed from George Karypis's
presentation)
5
Frequent Pattern Feature Vector Representation
         P1  P2  P3
Data1     1   1   0
Data2     1   0   1
Data3     1   1   0
Data4     0   0   1

Mining these predictive features is an NP-hard
problem: 100 examples can yield up to 10^10
patterns, and most are useless.
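To make the representation concrete, here is a minimal Python sketch of the pattern-to-vector mapping. The transactions and the patterns P1-P3 are invented so that the output reproduces the table above; they are not from the paper.

    # Transactions and patterns are invented to reproduce the table above.
    transactions = [
        {"a", "b", "c"},   # Data1
        {"a", "c", "d"},   # Data2
        {"a", "b", "e"},   # Data3
        {"c", "d", "e"},   # Data4
    ]
    patterns = {"P1": {"a"}, "P2": {"a", "b"}, "P3": {"c", "d"}}

    def to_feature_vector(transaction, patterns):
        # 1 if the example contains the pattern (itemset inclusion), else 0.
        return [int(p <= transaction) for p in patterns.values()]

    for i, t in enumerate(transactions, start=1):
        print(f"Data{i}:", to_feature_vector(t, patterns))
    # Data1: [1, 1, 0]  Data2: [1, 0, 1]  Data3: [1, 1, 0]  Data4: [0, 0, 1]

Each example becomes one row of the binary matrix, so any off-the-shelf classifier can consume the transformed data.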
6
Example
  • 192 examples
  • Support 12 (at least 12 examples contain the
    pattern): 8,600 itemset patterns returned
  • 192 vs. 8,600?
  • Support 4: 92,000 patterns
  • 192 vs. 92,000??
  • Most patterns have no predictive power and cannot
    be used to construct features.
  • Our algorithm:
  • Finds only 20 highly predictive patterns
  • Can construct a decision tree with about 90%
    accuracy

7
Data in bad feature space
  • Discriminative patterns
  • A non-linear combination of single features
  • Increases the expressive and discriminative power
    of the feature space
  • An example

Data is non-linearly separable in (x, y)
8
New Feature Space
  • Solving the problem
  • Mine: itemset F = {x=0, y=0} (association rule F: x=0 → y=0)
  • Transform: append F to each example, giving feature vectors (x, y, F)
[Figure: Mine → Transform pipeline; a table of 0/1 values for x, y, and F]
Data is linearly separable in (x, y, F)
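A small Python illustration of this slide, with assumed XOR-style labels (positive exactly when x = y): no linear rule in (x, y) alone separates the classes, but after appending the mined feature F one does.

    # Assumed XOR-style labels: positive exactly when x == y.
    data = [(0, 0, 1), (1, 1, 1), (0, 1, 0), (1, 0, 0)]   # (x, y, label)

    def pattern_feature(x, y):
        # F fires when the example contains the mined itemset {x=0, y=0}.
        return int(x == 0 and y == 0)

    # No line in the (x, y) plane separates the classes, but in (x, y, F)
    # the single linear rule  x + y + 2*F > 1.5  is always correct.
    for x, y, label in data:
        f = pattern_feature(x, y)
        assert int(x + y + 2 * f > 1.5) == label
    print("linearly separable in (x, y, F)")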
9
Computational Issues
  • A pattern is measured by its frequency or support.
  • E.g., frequent subgraphs with sup ≥ 10, i.e., at
    least 10 examples contain these patterns
  • Ordered enumeration cannot enumerate patterns with
    sup = 10 without first enumerating all patterns
    with sup > 10.
  • NP-hard problem; easily up to 10^10 patterns for a
    realistic problem.
  • Most patterns are non-discriminative.
  • Low-support patterns can have high
    discriminative power. Bad!
  • Random sampling does not work since it is not
    exhaustive.
  • Most patterns are useless, so randomly sampling
    patterns (or blindly enumerating without
    considering frequency) is useless.
  • Small number of examples.
  • If only a subset of the vocabulary is used:
    incomplete search.
  • If the complete vocabulary is used: won't help
    much, but introduces a sample-selection-bias
    problem; in particular, low-support but
    high-information-gain patterns are missed.

10
Conventional Procedure
Two-Step Batch Method (a minimal sketch follows below)
  • Mine frequent patterns (support > min_sup)
  • Select most discriminative patterns
  • Represent data in the feature space using such
    patterns
  • Build classification models.

Feature Construction and Selection
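For contrast with the direct-mining approach that follows, here is a minimal Python sketch of this two-step batch baseline. All names (mine_frequent_itemsets, info_gain, select_top_k) are illustrative, and the Mine step is brute force purely for readability; a real miner would use Apriori or FP-growth.

    from collections import Counter
    from itertools import combinations
    from math import log2

    def entropy(labels):
        if not labels:
            return 0.0
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def info_gain(labels, mask):
        # Gain of splitting on "does the example contain the pattern?".
        inside = [l for l, m in zip(labels, mask) if m]
        outside = [l for l, m in zip(labels, mask) if not m]
        n = len(labels)
        cond = (len(inside) / n) * entropy(inside) \
             + (len(outside) / n) * entropy(outside)
        return entropy(labels) - cond

    def mine_frequent_itemsets(transactions, min_sup, max_size=2):
        # Mine step (brute force for readability): keep every itemset up
        # to max_size whose support reaches min_sup.
        items = sorted(set().union(*transactions))
        frequent = []
        for k in range(1, max_size + 1):
            for cand in combinations(items, k):
                mask = [int(set(cand) <= t) for t in transactions]
                if sum(mask) >= min_sup:
                    frequent.append((frozenset(cand), mask))
        return frequent

    def select_top_k(frequent, labels, k):
        # Select step: rank patterns by InfoGain on the COMPLETE dataset --
        # exactly the weakness criticized on slide 12.
        ranked = sorted(frequent, key=lambda pm: info_gain(labels, pm[1]),
                        reverse=True)
        return [pattern for pattern, _ in ranked[:k]]

The selected patterns are then used to build feature vectors (as on slide 5), and any standard classifier is trained on them.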
11
Two Problems
  • Mine step
  • combinatorial explosion

1. Exponential explosion
2. Patterns not considered if min_support isn't
small enough
12
Two Problems
  • Select step
  • Issue of discriminative power

3. InfoGain evaluated against the complete dataset,
NOT on subsets of examples
4. Correlation among patterns not directly evaluated
on their joint predictability
13
Direct Mining and Selection via Model-based Search
Tree
Feature Miner + Classifier
  • Basic Flow (a code sketch follows below)
  • Divide-and-conquer based frequent pattern mining:
    at each tree node, mine frequent patterns on that
    node's subset of examples only, keep the most
    discriminative one as the node's test, split the
    data, and recurse.
  • Output: a compact set of highly discriminative
    patterns (1, 2, 3, 4, 5, 6, 7, ...).
  • Mined discriminative patterns can have extremely
    low global support, e.g. 10-20/100,000 ≈ 0.02%.
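The following Python sketch illustrates the basic flow, reusing mine_frequent_itemsets() and info_gain() from the two-step sketch above. It shows only the divide-and-conquer recursion; it is not the authors' implementation, and the helper names are assumptions.

    def majority(labels):
        return max(set(labels), key=list(labels).count)

    def build_mbt(examples, labels, min_sup, max_depth=10):
        # Stop when the node is pure or the depth budget is exhausted.
        if max_depth == 0 or len(set(labels)) <= 1:
            return {"leaf": True, "label": majority(labels)}

        # Mine locally: support is counted on THIS node's examples only,
        # so a globally rare pattern can still clear min_sup here.
        candidates = mine_frequent_itemsets(examples, min_sup)
        if not candidates:
            return {"leaf": True, "label": majority(labels)}

        # Keep the single most discriminative pattern as the node's test.
        pattern, mask = max(candidates,
                            key=lambda pm: info_gain(labels, pm[1]))

        yes = [(e, l) for e, l, m in zip(examples, labels, mask) if m]
        no = [(e, l) for e, l, m in zip(examples, labels, mask) if not m]
        if not yes or not no:
            return {"leaf": True, "label": majority(labels)}

        return {
            "leaf": False,
            # The node both tests the pattern (classifier) and contributes
            # it to the compact mined feature set (feature miner).
            "pattern": pattern,
            "yes": build_mbt([e for e, _ in yes], [l for _, l in yes],
                             min_sup, max_depth - 1),
            "no": build_mbt([e for e, _ in no], [l for _, l in no],
                            min_sup, max_depth - 1),
        }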
14
Analyses (I)
  • Scalability (Theorem 1)
  • Upper bound
  • Scale-down ratio to obtain extremely low-support
    patterns
  • Bound on number of returned features (Theorem 2)

15
Analyses (II)
  • Subspace is important for discriminative patterns
  • Original set: no information gain if P1/C1 = P0/C0,
    where
  • C1 and C0: the number of examples belonging to
    class 1 and class 0
  • P1: the number of examples in C1 that contain a
    pattern a
  • P0: the number of examples in C0 that contain the
    same pattern a
  • Subsets could have info gain (see the worked
    example below)
  • Non-overfitting
  • Optimality under exhaustive search
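A worked example of the subspace point, reusing info_gain() from the earlier sketch; the counts are invented so that P1/C1 = P0/C0 = 1/2 on the original set.

    # Counts invented so that pattern "a" occurs at the same rate in both
    # classes on the full set: C1 = C0 = 4 and P1 = P0 = 2.
    has_a = [1, 1, 0, 0, 1, 1, 0, 0]   # does example i contain pattern a?
    label = [1, 1, 1, 1, 0, 0, 0, 0]   # class of example i

    print(info_gain(label, has_a))     # 0.0 -- no gain on the original set

    # On a subset such as one node of the search tree reaches, the same
    # pattern predicts the class perfectly:
    idx = [0, 1, 6, 7]
    print(info_gain([label[i] for i in idx],
                    [has_a[i] for i in idx]))   # 1.0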

16
Experimental Studies: Itemset Mining (I)
  • Scalability Comparison

17
Experimental Studies: Itemset Mining (II)
  • Accuracy of Mined Itemsets

4 wins, 1 loss, with a
much smaller number of patterns
18
Experimental Studies: Itemset Mining (III)
  • Convergence

19
Experimental Studies: Graph Mining (I)
  • 9 NCI anti-cancer screen datasets
  • The PubChem Project. URL: pubchem.ncbi.nlm.nih.gov
  • Active (positive) class: around 1% - 8.3%
  • 2 AIDS anti-viral screen datasets
  • URL: http://dtp.nci.nih.gov
  • H1: CM+CA, about 3.5% positive
  • H2: CA, about 1% positive

20
Experimental Studies: Graph Mining (II)
  • Scalability

21
Experimental Studies: Graph Mining (III)
  • AUC and Accuracy

AUC: 11 wins
Accuracy: 10 wins, 1 loss
22
Experimental Studies: Graph Mining (IV)
  • AUC of MbT and DT-MbT vs. benchmarks

7 wins, 4 losses
23
Summary
  • Model-based Search Tree
  • Integrated feature mining and construction.
  • Dynamic support
  • Can mine extremely low-support patterns
  • Both a feature constructor and a classifier
  • Not limited to one type of frequent pattern:
    plug-and-play
  • Experiment Results
  • Itemset Mining
  • Graph Mining
  • Software and Dataset available from
  • www.cs.columbia.edu/wfan
