Title: Mining Compressed FrequentPattern Sets
1Mining Compressed Frequent-Pattern Sets
- Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2Outline
- Introduction
- Problem Statement and Analysis
- Discovering Representative Patterns
- Performance Study
- Discussion and Conclusions
3Introduction
- Frequent Pattern Mining
- Minimum Support 2
4Challenge In Frequent Pattern Mining
- Efficiency?
- Many scaleable mining algorithms are available
now - Usability?Yes
- High minimum support common sense patterns
- Low minimum support explosive number of results
5Existing Compressing Techniques
- Lossless compression
- Closed frequent patterns
- Non-derivable frequent item-sets
- ...
- Lossy approximation
- Maximal frequent patterns
- Boundary cover sets
6A Motivating Example
- A subset of frequent item-sets in accident
dataset - High-quality compression needs to consider both
expression and support
Expression of P1
Support of P1
7A Motivating Example
- Closed frequent pattern
- Report P1,P2,P3,P4,P5
- Emphasize too much on support
- no compression
- Maximal frequent pattern
- Only report P3
- Only care about the expression
- Loss the information of support
- A desirable output P2,P3,P4
8Compressing Frequent Patterns
- Our compressing framework
- Clustering frequent patterns by pattern
similarity - Pick a representative pattern for each cluster
- Key Problems
- Need a distance function to measure the
similarity between patterns - The quality of the clustering needs to be
controllable - The representative pattern should be able to
describe both expressions and supports of other
patterns - Efficiency is always desirable
9Distance Measure
- Let P1 and P2 are two closed frequent patterns,
T(P) is the set of raw data which contains P, the
distance between P1 and P2 is - Let T(P1)t1,t2,t3,t4,t5, T(P2)t1,t2,t3,t4,t6
, then D(P1,P2)1-4/61/3 - D is a valid distance metric
- D characterizes the support, but ignore the
expression
10Representative Patterns
- Incorporate expression into Representative
Pattern - The representative pattern should be able to
express all the other patterns in the same
cluster (i.e., superset) - The representative pattern Pr 38,16,18,12,17
- Representative pattern is also good w.r.t.
distance - D(Pr, P1) D(P1, P2), D(Pr, P1) D(P1, P2)
- Distance can be computed using support only
11Clustering Criterion
- General clustering approach (i.e., k-means)
- Directly apply the distance measure
- No guarantee on the quality of the clusters
- The representative pattern may not exist in a
cluster - d-clustering
- For each pattern P, Find all patterns which can
be expressed by P and their distance to P are
within d (d-cover) - All patterns in the cluster can be represented by
P
12Intuitions of d-clustering
- All Patterns in the cluster are supported by
almost same set of transactions - Distance from any pattern to representative is
bounded by d - Distance between any two patterns is bounded by 2
d - The small difference between transaction sets
could be noise or negligible - Representative Pattern has the most informative
expression
13Pattern Compressing Problem
- Pattern Compression Problem
- Find the minimum number of clusters
(representative patterns) - All the frequent patterns are d-covered by at
least one representative pattern - Variation support of representative pattern less
than min_sup? - NP-hardness Reducible from set-covering problem
14Discovering Representative Patterns
- RPglobal
- Assume all the frequent patterns are mined
- Directly apply greedy set-covering algorithm
- Guaranteed bounds w.r.t. optimal solution
- RPlocal
- Relax the constraints used in RPglobal
- Gain in efficiency, lose in bound guarantee
- Directly mine from raw data set
- RPcombine
- Combine above two methods
- Trade-off w.r.t. efficiency and performance
15RPglobal
- Algorithm
- At each step, find the representative pattern Pr
which d-covers the maximum number of uncovered
patterns - Select Pr as new representative pattern
- Mark the corresponding pattern as covered
- Continue until all patterns are covered
- Bound
- Cg (C) is the number of output of RPglobal
(optimal) -
- F is the set of frequent patterns
- Set(P) set of the patterns covered by P
16RPlocal
- RPglobal is expensive
- Assume all the frequent pattern are pre-computed
- Need to find the globally best representative
pattern at each step - Need to compute the pair-wise distance between
all frequent patterns - Relax the constraints RPlocal
- Find a locally good representative pattern each
step - Directly mine from raw data
- Do not compute the distance pair-wisely
17Local Greedy Method
- Principle of Local Method
- Bound
-
- Cl number of output using local method
- T optimal number of patterns covering all probe
patterns - Set(P) set of the patterns covered by P
18Mine from Raw Data
- Beneficial
- Without storage of huge intermediate outputs
- More efficient pruning methods
- Applicable
- Utilize the internal relations during mining
- FP-growth method
- Depth first search in Pattern-Space
- A pattern can only be covered by its sons or
patterns visited before
Probe Pattern P
Ps Sons
Visited Patterns covering P
19Integrate Local Method into FP-Mining
- Algorithm
- Follow the depth-first search in pattern space
- Remember all previously discovered representative
patterns - For each pattern P
- Not covered yet
- Being Visited in the second time which traversal
back from its sons - Select a representative pattern using local
method (with P as new probe pattern)
20Avoid Pair-wise Comparisons
- Find a good representative pattern (for probe
pattern P) - Strong correlations between Pattern positions,
coverage of uncovered patterns and pattern length - Simple but effective heuristic select the
longest item-sets in Ps sons as a new
representative pattern to cover P - 4952 first visit of P, 5043 second visit of P
(between 4952 and 5043 are sons of P)
second time visit of P
First time visit of P
Previous Patterns
Ps Sons
21Efficient Implementation
- Non Closed Pattern
- Exist a super pattern with same support
- Closed_Index (N bits)
- Each bit remembers the consistency of an item
- Aggregate the closed_index with pattern
- Not closed if at least one out-pattern bit is set
(c,a) 111010
f does not belong to (c,a). Support of (c,a) is
same as support of (f,c,a). (c,a) is not closed
22Efficient Implementation
- Prune non-closed patterns
- Non-closed patterns are guaranteed to be covered
- Use limited bits to remember subset of items
- Majority non-closed patterns are pruned by
closed_index - A few left are pruned by checking the coverage of
representative patterns
23Experimental Setting
- Data
- frequent itemset mining dataset repository
(http//mi.cs.helsinki./data/) - Comparing algorithms
- FPclose an efficient algorithm to generate all
closed itemsets, winner of FIMI workshop 2003 - RPglobal first use FPclose to generate closed
itemsets, then use global greedy method to find
representative patterns - RPlocal directly use local method to find
representative patterns from raw data
24Performance Study
- Number of Representative Patterns
25Performance Study
26Performance Study
- Quality of Representative Patterns
27Conclusions
- Significant reduction of the number of output
- Two orders of magnitudes of reduction for d 0.1
- Catch both expressions and supports
- Easily extendable for compression of sequential,
graph and structure data - RPglobal
- theoretical bound
- works well on small collection of patterns
- RPlocal
- much more efficient
- Still quite good compression quality
28Future Work
- Using representative patterns for association,
correlation and classification - Compressing frequent patterns over incrementally
updated data (i.e., stream) - Further compressing the representative patterns
by some advanced compression models (i.e.,
pattern profiles)