Title: Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
1. Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
How to find good features from semi-structured raw data for classification
- Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure
2. Feature Construction
- Most data mining and machine learning models assume structured data: (x1, x2, ..., xk) -> y (see the sketch after this list)
  - where the xi are independent variables and y is the dependent variable
  - y drawn from a discrete set: classification
  - y drawn from a continuous range: regression
- When the feature vectors are good, the differences in accuracy among learners are small.
- Question: where do good features come from?
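To make the notation concrete, here is a minimal sketch of the structured-data assumption; the values and label names are made up for illustration, not from the slides.

```python
# Toy illustration of the structured-data assumption (values are made up).
# Classification: y comes from a discrete set of labels.
classification_data = [
    ((5.1, 3.5, 1.4), "class_A"),   # (x1, x2, x3) -> y
    ((4.9, 3.0, 1.3), "class_B"),
]
# Regression: y is a continuous value.
regression_data = [
    ((5.1, 3.5, 1.4), 0.27),
    ((4.9, 3.0, 1.3), 0.81),
]
```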
3. Frequent Pattern-Based Feature Extraction
- Data not given as pre-defined feature vectors:
  - transactions
  - biological sequences
  - graph databases
- Frequent patterns are good candidates for discriminative features. So, how do we mine them?
4. FP: Sub-graph
(example borrowed from George Karypis' presentation)
5. Frequent Pattern: Feature Vector Representation

        P1  P2  P3
Data1   1   1   0
Data2   1   0   1
Data3   1   1   0
Data4   0   0   1

- Mining these predictive features is an NP-hard problem: 100 examples can yield up to 10^10 patterns, and most of them are useless. (A sketch of the pattern-to-feature mapping follows.)
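A minimal sketch (hypothetical patterns P1-P3, not the paper's code) of how the table above is produced: feature j of an example is 1 iff the example contains pattern Pj.

```python
# Represent transactions in a frequent-pattern feature space:
# feature j is 1 iff the example contains pattern Pj.
P1, P2, P3 = {"a"}, {"a", "b"}, {"c"}          # hypothetical mined patterns
transactions = [
    {"a", "b"},        # Data1 -> (1, 1, 0)
    {"a", "c"},        # Data2 -> (1, 0, 1)
    {"a", "b", "d"},   # Data3 -> (1, 1, 0)
    {"c", "e"},        # Data4 -> (0, 0, 1)
]
patterns = [P1, P2, P3]
feature_vectors = [
    [int(p <= t) for p in patterns]            # subset (containment) test
    for t in transactions
]
print(feature_vectors)  # [[1, 1, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]]
```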
6. Example
- 192 examples
- With support >= 12 (at least 12 examples contain the pattern), itemset mining returns 8,600 patterns: 192 examples vs. 8,600 patterns?
- With support >= 4: 92,000 patterns. 192 vs. 92,000??
- Most patterns have no predictive power and cannot be used to construct features.
- Our algorithm:
  - finds only 20 highly predictive patterns
  - can construct a decision tree with about 90% accuracy
7. Data in a Bad Feature Space
- Discriminative patterns:
  - a non-linear combination of single feature(s)
  - increase the expressive and discriminative power of the feature space
- An example: data that is non-linearly separable in (x, y) (see the next slide)
8. New Feature Space

[Figure: the same data augmented with a third, mined binary feature. Itemset: F = {x=0, y=0}; association rule: F: x=0 => y=0. The pipeline is mine, then transform.]

Data is linearly separable in (x, y, F). (A small sketch of the mine-and-transform step follows.)
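A small sketch of the slide's mine-and-transform idea under assumed XOR-style data: the itemset F = {x=0, y=0} is mined and appended as a binary feature, after which a single linear separator works. The data points and the separating weights below are illustrative assumptions, not values from the paper.

```python
# XOR-style data: not linearly separable in (x, y) alone.
data = [  # (x, y) -> label
    ((0, 0), 0), ((1, 1), 0), ((0, 1), 1), ((1, 0), 1),
]

def mine_feature(x, y):
    # "Mine": the pattern F fires when the itemset {x=0, y=0} is contained.
    return int(x == 0 and y == 0)

# "Transform": append F to every example.
transformed = [((x, y, mine_feature(x, y)), label) for (x, y), label in data]

# One linear separator over (x, y, F); impossible over (x, y) alone.
w, b = (-1.0, -1.0, -2.0), 1.5
for (x, y, f), label in transformed:
    assert (w[0]*x + w[1]*y + w[2]*f + b > 0) == (label == 1)
```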
9. Computational Issues
- Pattern quality is measured by frequency or support, e.g., frequent subgraphs with sup >= 10 (at least 10 examples contain the pattern).
- Ordered enumeration cannot enumerate patterns with sup = 10 without first enumerating all patterns with sup > 10.
- NP-hard problem: easily up to 10^10 patterns for a realistic problem.
- Most patterns are non-discriminative, yet low-support patterns can have high discriminative power. Bad!
- Random sampling does not work, since it is not exhaustive: because most patterns are useless, randomly sampled patterns (or blind enumeration that ignores frequency) are useless too.
- Small number of examples:
  - with only a subset of the vocabulary, the search is incomplete;
  - with the complete vocabulary, it won't help much but introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.
The combinatorial blow-up is easy to see even on a tiny synthetic dataset (sketch below).
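A quick synthetic illustration (random 10-item transactions, not the paper's data) of the enumeration blow-up: as min_sup drops, the number of frequent itemsets grows sharply, which is why exhaustive enumeration at low support is hopeless at realistic scale.

```python
# Count frequent itemsets at decreasing support thresholds on synthetic data.
from itertools import combinations
import random

random.seed(0)
items = list("abcdefghij")                       # 10-item vocabulary
transactions = [set(random.sample(items, 5)) for _ in range(100)]

def count_frequent(min_sup):
    n = 0
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):   # brute-force enumeration
            sup = sum(set(cand) <= t for t in transactions)
            if sup >= min_sup:
                n += 1
    return n

for min_sup in (50, 20, 10, 4):
    print(min_sup, count_frequent(min_sup))      # counts grow as min_sup drops
```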
10. Conventional Procedure: Two-Step Batch Method
- Mine frequent patterns (sup > min_sup).
- Select the most discriminative patterns.
- Represent the data in the feature space using the selected patterns.
- Build classification models.
Feature construction and selection; a minimal end-to-end sketch follows.
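A minimal sketch of the conventional two-step batch method, under toy assumptions (four labeled transactions, brute-force enumeration standing in for a real frequent-pattern miner): mine everything above min_sup, then rank by information gain computed on the complete dataset.

```python
# Two-step batch method: (1) mine frequent patterns, (2) select by InfoGain.
from itertools import combinations
from math import log2

transactions = [({"a", "b"}, 1), ({"a", "c"}, 1), ({"b", "c"}, 0), ({"c", "d"}, 0)]
items = sorted(set().union(*(t for t, _ in transactions)))

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(0), labels.count(1)) if c)

def info_gain(pattern):
    labels = [y for _, y in transactions]
    inside = [y for t, y in transactions if pattern <= t]
    outside = [y for t, y in transactions if not pattern <= t]
    split = sum(len(part) / len(labels) * entropy(part)
                for part in (inside, outside) if part)
    return entropy(labels) - split

# Step 1: mine (brute force stands in for a real frequent-pattern miner).
min_sup = 2
frequent = [set(c) for k in (1, 2) for c in combinations(items, k)
            if sum(set(c) <= t for t, _ in transactions) >= min_sup]

# Step 2: select the top-k patterns by InfoGain, then featurize the data.
top = sorted(frequent, key=info_gain, reverse=True)[:2]
X = [[int(p <= t) for p in top] for t, _ in transactions]
```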
11. Two Problems
- Mine step: combinatorial explosion
  1. exponential explosion of the pattern space;
  2. patterns are not considered if min_support isn't small enough.
12. Two Problems (cont.)
- Select step: issue of discriminative power
  3. InfoGain is evaluated against the complete dataset, NOT on subsets of examples;
  4. correlation among patterns is not directly evaluated on their joint predictability.
13. Direct Mining & Selection via Model-based Search Tree

[Figure: a model-based search tree coupling a feature miner with a classifier. Divide-and-conquer frequent-pattern mining on successively smaller subsets (nodes 1, 2, 3, 4, 5, 6, 7, ...) yields a compact set of highly discriminative mined patterns; patterns found deep in the tree can have extremely low global support (the slide's example works out to 2/10,000 = 0.02%).]

A minimal sketch of this divide-and-conquer scheme follows.
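The following is a reconstruction of the divide-and-conquer idea, not the authors' code: each node mines the single most discriminative frequent pattern on that node's examples only, splits on its presence, and recurses, so a pattern mined deep in the tree can be globally rare yet locally decisive. `best_local_pattern`, `build_mbt`, and the size bounds are assumed names and simplifications.

```python
# Model-based search tree sketch: mine one pattern per node, split, recurse.
from itertools import combinations
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(0), labels.count(1)) if c)

def best_local_pattern(examples, min_sup=2):
    # Mine the highest-InfoGain frequent itemset on THIS node's examples only.
    items = sorted(set().union(*(t for t, _ in examples)))
    best, best_gain = None, 0.0
    for k in (1, 2):                              # bounded pattern size
        for cand in combinations(items, k):
            p = set(cand)
            inside = [y for t, y in examples if p <= t]
            outside = [y for t, y in examples if not p <= t]
            if len(inside) < min_sup:
                continue
            labels = [y for _, y in examples]
            split = sum(len(part) / len(labels) * entropy(part)
                        for part in (inside, outside) if part)
            gain = entropy(labels) - split
            if gain > best_gain:
                best, best_gain = p, gain
    return best

def build_mbt(examples, depth=0, max_depth=4):
    labels = [y for _, y in examples]
    if depth == max_depth or len(set(labels)) <= 1:
        return {"leaf": max(set(labels), key=labels.count)}
    pattern = best_local_pattern(examples)
    if pattern is None:                           # no discriminative pattern
        return {"leaf": max(set(labels), key=labels.count)}
    with_p = [(t, y) for t, y in examples if pattern <= t]
    without_p = [(t, y) for t, y in examples if not pattern <= t]
    return {"pattern": pattern,                   # a mined feature AND a test
            "yes": build_mbt(with_p, depth + 1, max_depth),
            "no": build_mbt(without_p, depth + 1, max_depth)}
```

The tree is thus both the feature constructor (the mined patterns at its internal nodes) and the classifier (its leaf predictions), matching the slide's "Feature Miner + Classifier" coupling.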
14. Analyses (I)
- Scalability (Theorem 1):
  - upper bound;
  - "scale-down" ratio for obtaining extremely low-support patterns.
- Bound on the number of returned features (Theorem 2).
15. Analyses (II)
- Subspaces are important for discriminative patterns.
- On the original set, a pattern a carries no information gain if it is equally frequent in both classes, i.e., if P1/C1 = P0/C0, where
  - C1 and C0 are the numbers of examples belonging to classes 1 and 0,
  - P1 is the number of examples in C1 that contain pattern a,
  - P0 is the number of examples in C0 that contain the same pattern a.
- Subsets can still have information gain (worked check below).
- Non-overfitting.
- Optimality under exhaustive search.
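A worked check of the two claims above, with toy numbers of my choosing: when P1/C1 = P0/C0 the information gain on the full set is exactly zero, while a subset with unequal proportions gives the same pattern positive gain.

```python
# Information gain of a pattern, from the class counts defined on the slide.
from math import log2

def entropy(pos, neg):
    tot = pos + neg
    return -sum((c / tot) * log2(c / tot) for c in (pos, neg) if c)

def info_gain(c1, c0, p1, p0):
    # c1, c0: class sizes; p1, p0: examples per class containing pattern a.
    total = entropy(c1, c0)
    inside = entropy(p1, p0)                 # examples containing a
    outside = entropy(c1 - p1, c0 - p0)      # examples not containing a
    n = c1 + c0
    return total - ((p1 + p0) / n) * inside - ((n - p1 - p0) / n) * outside

# Full set: C1 = C0 = 100, pattern in half of each class -> P1/C1 = P0/C0.
print(info_gain(100, 100, 50, 50))   # 0.0: no gain on the original set
# A subset where the proportions differ -> positive gain.
print(info_gain(40, 60, 30, 10))     # > 0: the same pattern is useful here
```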
16. Experimental Studies: Itemset Mining (I)
17. Experimental Studies: Itemset Mining (II)
- Accuracy of mined itemsets: 4 wins, 1 loss, with a much smaller number of patterns.
18. Experimental Studies: Itemset Mining (III)
19. Experimental Studies: Graph Mining (I)
- 9 NCI anti-cancer screen datasets
  - The PubChem Project, URL: pubchem.ncbi.nlm.nih.gov
  - active (positive) class: around 1%-8.3%
- 2 AIDS anti-viral screen datasets
  - URL: http://dtp.nci.nih.gov
  - H1: CM+CA, about 3.5% positive
  - H2: CA, about 1% positive
20. Experimental Studies: Graph Mining (II)
21. Experimental Studies: Graph Mining (III)
- AUC: 11 wins; 10 wins, 1 loss.
22. Experimental Studies: Graph Mining (IV)
- AUC of MbT and DT(MbT) vs. benchmarks: 7 wins, 4 losses.
23. Summary
- Model-based search tree:
  - integrated feature mining and construction;
  - dynamic support: can mine patterns with extremely small support;
  - both a feature constructor and a classifier;
  - not limited to one type of frequent pattern: plug-and-play.
- Experimental results:
  - itemset mining;
  - graph mining.
- Software and datasets available from www.cs.columbia.edu/~wfan