Mining Compressed FrequentPattern Sets - PowerPoint PPT Presentation

1 / 28

About This Presentation

Title:

Mining Compressed FrequentPattern Sets

Description:

FP-growth method. Depth first search in Pattern-Space ... RPlocal: directly use local method to find representative patterns from raw data ... – PowerPoint PPT presentation

Number of Views:105

Avg rating:3.0/5.0

Slides: 29

Provided by: uiuc1

Category:

more less

Transcript and Presenter's Notes

Title: Mining Compressed FrequentPattern Sets

1
Mining Compressed Frequent-Pattern Sets

Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng
Department of Computer Science
University of Illinois at Urbana-Champaign

2
Outline

Introduction
Problem Statement and Analysis
Discovering Representative Patterns
Performance Study
Discussion and Conclusions

3
Introduction

Frequent Pattern Mining
Minimum Support 2

4
Challenge In Frequent Pattern Mining

Efficiency?
Many scaleable mining algorithms are available
now
Usability?Yes
High minimum support common sense patterns
Low minimum support explosive number of results

5
Existing Compressing Techniques

Lossless compression
Closed frequent patterns
Non-derivable frequent item-sets
...
Lossy approximation
Maximal frequent patterns
Boundary cover sets

6
A Motivating Example

A subset of frequent item-sets in accident
dataset
High-quality compression needs to consider both
expression and support

Expression of P1
Support of P1
7
A Motivating Example

Closed frequent pattern
Report P1,P2,P3,P4,P5
Emphasize too much on support
no compression
Maximal frequent pattern
Only report P3
Only care about the expression
Loss the information of support
A desirable output P2,P3,P4

8
Compressing Frequent Patterns

Our compressing framework
Clustering frequent patterns by pattern
similarity
Pick a representative pattern for each cluster
Key Problems
Need a distance function to measure the
similarity between patterns
The quality of the clustering needs to be
controllable
The representative pattern should be able to
describe both expressions and supports of other
patterns
Efficiency is always desirable

9
Distance Measure

Let P1 and P2 are two closed frequent patterns,
T(P) is the set of raw data which contains P, the
distance between P1 and P2 is
Let T(P1)t1,t2,t3,t4,t5, T(P2)t1,t2,t3,t4,t6
, then D(P1,P2)1-4/61/3
D is a valid distance metric
D characterizes the support, but ignore the
expression

10
Representative Patterns

Incorporate expression into Representative
Pattern
The representative pattern should be able to
express all the other patterns in the same
cluster (i.e., superset)
The representative pattern Pr 38,16,18,12,17
Representative pattern is also good w.r.t.
distance
D(Pr, P1) D(P1, P2), D(Pr, P1) D(P1, P2)
Distance can be computed using support only

11
Clustering Criterion

General clustering approach (i.e., k-means)
Directly apply the distance measure
No guarantee on the quality of the clusters
The representative pattern may not exist in a
cluster
d-clustering
For each pattern P, Find all patterns which can
be expressed by P and their distance to P are
within d (d-cover)
All patterns in the cluster can be represented by
P

12
Intuitions of d-clustering

All Patterns in the cluster are supported by
almost same set of transactions
Distance from any pattern to representative is
bounded by d
Distance between any two patterns is bounded by 2
d
The small difference between transaction sets
could be noise or negligible
Representative Pattern has the most informative
expression

13
Pattern Compressing Problem

Pattern Compression Problem
Find the minimum number of clusters
(representative patterns)
All the frequent patterns are d-covered by at
least one representative pattern
Variation support of representative pattern less
than min_sup?
NP-hardness Reducible from set-covering problem

14
Discovering Representative Patterns

RPglobal
Assume all the frequent patterns are mined
Directly apply greedy set-covering algorithm
Guaranteed bounds w.r.t. optimal solution
RPlocal
Relax the constraints used in RPglobal
Gain in efficiency, lose in bound guarantee
Directly mine from raw data set
RPcombine
Combine above two methods
Trade-off w.r.t. efficiency and performance

15
RPglobal

Algorithm
At each step, find the representative pattern Pr
which d-covers the maximum number of uncovered
patterns
Select Pr as new representative pattern
Mark the corresponding pattern as covered
Continue until all patterns are covered
Bound
Cg (C) is the number of output of RPglobal
(optimal)
F is the set of frequent patterns
Set(P) set of the patterns covered by P

16
RPlocal

RPglobal is expensive
Assume all the frequent pattern are pre-computed
Need to find the globally best representative
pattern at each step
Need to compute the pair-wise distance between
all frequent patterns
Relax the constraints RPlocal
Find a locally good representative pattern each
step
Directly mine from raw data
Do not compute the distance pair-wisely

17
Local Greedy Method

Principle of Local Method
Bound
Cl number of output using local method
T optimal number of patterns covering all probe
patterns
Set(P) set of the patterns covered by P

18
Mine from Raw Data

Beneficial
Without storage of huge intermediate outputs
More efficient pruning methods
Applicable
Utilize the internal relations during mining
FP-growth method
Depth first search in Pattern-Space
A pattern can only be covered by its sons or
patterns visited before

Probe Pattern P
Ps Sons
Visited Patterns covering P
19
Integrate Local Method into FP-Mining

Algorithm
Follow the depth-first search in pattern space
Remember all previously discovered representative
patterns
For each pattern P
Not covered yet
Being Visited in the second time which traversal
back from its sons
Select a representative pattern using local
method (with P as new probe pattern)

20
Avoid Pair-wise Comparisons

Find a good representative pattern (for probe
pattern P)
Strong correlations between Pattern positions,
coverage of uncovered patterns and pattern length
Simple but effective heuristic select the
longest item-sets in Ps sons as a new
representative pattern to cover P
4952 first visit of P, 5043 second visit of P
(between 4952 and 5043 are sons of P)

second time visit of P
First time visit of P
Previous Patterns
Ps Sons
21
Efficient Implementation

Non Closed Pattern
Exist a super pattern with same support
Closed_Index (N bits)
Each bit remembers the consistency of an item
Aggregate the closed_index with pattern
Not closed if at least one out-pattern bit is set

(c,a) 111010
f does not belong to (c,a). Support of (c,a) is
same as support of (f,c,a). (c,a) is not closed
22
Efficient Implementation

Prune non-closed patterns
Non-closed patterns are guaranteed to be covered
Use limited bits to remember subset of items
Majority non-closed patterns are pruned by
closed_index
A few left are pruned by checking the coverage of
representative patterns

23
Experimental Setting

Data
frequent itemset mining dataset repository
(http//mi.cs.helsinki./data/)
Comparing algorithms
FPclose an efficient algorithm to generate all
closed itemsets, winner of FIMI workshop 2003
RPglobal first use FPclose to generate closed
itemsets, then use global greedy method to find
representative patterns
RPlocal directly use local method to find
representative patterns from raw data

24
Performance Study

Number of Representative Patterns

25
Performance Study

Running Time

26
Performance Study

Quality of Representative Patterns

27
Conclusions

Significant reduction of the number of output
Two orders of magnitudes of reduction for d 0.1
Catch both expressions and supports
Easily extendable for compression of sequential,
graph and structure data
RPglobal
theoretical bound
works well on small collection of patterns
RPlocal
much more efficient
Still quite good compression quality

28
Future Work

Using representative patterns for association,
correlation and classification
Compressing frequent patterns over incrementally
updated data (i.e., stream)
Further compressing the representative patterns
by some advanced compression models (i.e.,
pattern profiles)

Write a Comment

User Comments (0)