Mining Compressed Frequent-Pattern Sets

1
Mining Compressed Frequent-Pattern Sets
  • Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

2
Outline
  • Introduction
  • Problem Statement and Analysis
  • Discovering Representative Patterns
  • Performance Study
  • Discussion and Conclusions

3
Introduction
  • Frequent Pattern Mining
  • Minimum Support = 2

4
Challenge In Frequent Pattern Mining
  • Efficiency?
  • Many scalable mining algorithms are available
    now
  • Usability? Yes
  • High minimum support: common-sense patterns
  • Low minimum support: explosive number of results

5
Existing Compressing Techniques
  • Lossless compression
  • Closed frequent patterns
  • Non-derivable frequent item-sets
  • ...
  • Lossy approximation
  • Maximal frequent patterns
  • Boundary cover sets

6
A Motivating Example
  • A subset of frequent item-sets in accident
    dataset
  • High-quality compression needs to consider both
    expression and support

(Figure callouts: expression of P1; support of P1)

7
A Motivating Example
  • Closed frequent pattern
  • Reports P1, P2, P3, P4, P5
  • Places too much emphasis on support
  • No compression
  • Maximal frequent pattern
  • Reports only P3
  • Cares only about the expression
  • Loses the support information
  • A desirable output: P2, P3, P4

8
Compressing Frequent Patterns
  • Our compressing framework
  • Clustering frequent patterns by pattern
    similarity
  • Pick a representative pattern for each cluster
  • Key Problems
  • Need a distance function to measure the
    similarity between patterns
  • The quality of the clustering needs to be
    controllable
  • The representative pattern should be able to
    describe both expressions and supports of other
    patterns
  • Efficiency is always desirable

9
Distance Measure
  • Let P1 and P2 be two closed frequent patterns and
    T(P) be the set of raw data (transactions) that
    contain P; the distance between P1 and P2 is
    $D(P_1, P_2) = 1 - \frac{|T(P_1) \cap T(P_2)|}{|T(P_1) \cup T(P_2)|}$
  • Let T(P1) = {t1, t2, t3, t4, t5} and
    T(P2) = {t1, t2, t3, t4, t6}; then
    D(P1, P2) = 1 - 4/6 = 1/3
  • D is a valid distance metric
  • D characterizes the support but ignores the
    expression
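The distance on this slide can be checked numerically. The sketch below assumes the measure is one minus the ratio of shared to combined supporting transactions, which is the reading consistent with the slide's 1 - 4/6 example; the function name `pattern_distance` is illustrative and not taken from the presentation.

```python
from fractions import Fraction


def pattern_distance(t_p1: set, t_p2: set) -> Fraction:
    """Distance between two patterns given their supporting transaction sets.

    Computes 1 - |T(P1) & T(P2)| / |T(P1) | T(P2)| as an exact fraction.
    """
    union = t_p1 | t_p2
    if not union:
        # Two patterns with no supporting transactions are treated as identical
        # (a convention chosen here, not stated on the slide).
        return Fraction(0)
    return 1 - Fraction(len(t_p1 & t_p2), len(union))


# The slide's example: the two sets share 4 of 6 distinct transactions.
t_p1 = {"t1", "t2", "t3", "t4", "t5"}
t_p2 = {"t1", "t2", "t3", "t4", "t6"}
example_distance = pattern_distance(t_p1, t_p2)
print(example_distance)
```

Using `Fraction` keeps the result exact, so the example evaluates to 1/3 rather than a rounded float.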

10
Representative Patterns
  • Incorporate expression into Representative
    Pattern
  • The representative pattern should be able to
    express all the other patterns in the same
    cluster (i.e., superset)
  • The representative pattern Pr = {38, 16, 18, 12, 17}
  • The representative pattern is also good w.r.t.
    distance
  • D(Pr, P1) ≤ D(P1, P2), D(Pr, P2) ≤ D(P1, P2)
  • Distance can be computed using support only
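Why support alone suffices can be seen in one step. The derivation below is a sketch under the slide's requirement that the representative pattern Pr is a superset of the pattern P it represents; it assumes the distance is one minus the overlap ratio of the supporting transaction sets, and sup(P) is shorthand introduced here for |T(P)|.

```latex
P \subseteq P_r
\;\Rightarrow\; T(P_r) \subseteq T(P)
\;\Rightarrow\;
D(P, P_r)
  = 1 - \frac{|T(P) \cap T(P_r)|}{|T(P) \cup T(P_r)|}
  = 1 - \frac{|T(P_r)|}{|T(P)|}
  = 1 - \frac{\mathrm{sup}(P_r)}{\mathrm{sup}(P)}
```

Every transaction containing the larger pattern also contains the smaller one, so the intersection collapses to T(Pr) and the union to T(P); no transaction lists need to be compared.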

11
Clustering Criterion
  • General clustering approach (e.g., k-means)
  • Directly apply the distance measure
  • No guarantee on the quality of the clusters
  • The representative pattern may not exist in a
    cluster
  • d-clustering
  • For each pattern P, find all patterns that can
    be expressed by P and whose distance to P is
    within d (d-cover)
  • All patterns in the cluster can be represented by
    P
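A d-cover test can be written directly from these two conditions. This is a minimal sketch, assuming patterns are item-sets, that "expressed by" means "is a subset of", and that the distance between a pattern and its superset is 1 - sup(superset)/sup(pattern); the function name and the support counts in the example are hypothetical.

```python
def d_covers(rep: frozenset, rep_sup: int,
             pat: frozenset, pat_sup: int, d: float) -> bool:
    """Return True if `rep` d-covers `pat`.

    Two conditions must hold:
      1. expression: pat is a subset of rep;
      2. support: 1 - rep_sup / pat_sup <= d.
    """
    if not pat <= rep:
        return False
    return 1 - rep_sup / pat_sup <= d


# Hypothetical support counts, chosen only for illustration.
rep = frozenset({38, 16, 18, 12, 17})
pat = frozenset({38, 16, 18})
print(d_covers(rep, 95, pat, 100, d=0.1))  # distance 0.05
print(d_covers(rep, 80, pat, 100, d=0.1))  # distance 0.20
```

The subset test runs first because it is cheap and rules out most pairs before any support arithmetic is needed.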

12
Intuitions of d-clustering
  • All Patterns in the cluster are supported by
    almost same set of transactions
  • Distance from any pattern to representative is
    bounded by d
  • Distance between any two patterns is bounded by
    2d
  • The small difference between transaction sets
    could be noise or negligible
  • Representative Pattern has the most informative
    expression
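The 2d bound follows from the triangle inequality, which holds because D is a valid distance metric. As a one-line sketch, for any two patterns P1 and P2 in a cluster with representative Pr:

```latex
D(P_1, P_2) \le D(P_1, P_r) + D(P_r, P_2) \le d + d = 2d
```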

13
Pattern Compressing Problem
  • Pattern Compression Problem
  • Find the minimum number of clusters
    (representative patterns)
  • All the frequent patterns are d-covered by at
    least one representative pattern
  • Variation: may the support of a representative
    pattern be less than min_sup?
  • NP-hardness: reducible from the set-covering problem

14
Discovering Representative Patterns
  • RPglobal
  • Assume all the frequent patterns are mined
  • Directly apply greedy set-covering algorithm
  • Guaranteed bounds w.r.t. optimal solution
  • RPlocal
  • Relax the constraints used in RPglobal
  • Gain in efficiency, lose in bound guarantee
  • Directly mine from raw data set
  • RPcombine
  • Combine above two methods
  • Trade-off w.r.t. efficiency and performance

15
RPglobal
  • Algorithm
  • At each step, find the representative pattern Pr
    which d-covers the maximum number of uncovered
    patterns
  • Select Pr as new representative pattern
  • Mark the patterns it covers as covered
  • Continue until all patterns are covered
  • Bound
  • Cg (C) is the number of representative patterns
    output by RPglobal (by the optimal solution)
  • F is the set of frequent patterns
  • Set(P): the set of patterns covered by P
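The greedy loop described on this slide can be sketched as follows. This is an illustrative version, not the authors' implementation: it assumes the candidate representatives are the frequent patterns themselves, that coverage means "subset with distance 1 - sup(rep)/sup(pattern) at most d", and the name `rp_global` and the example supports are hypothetical.

```python
def rp_global(patterns: dict, d: float) -> list:
    """Greedy set-cover selection of representative patterns.

    `patterns` maps each frequent pattern (a frozenset of items)
    to its support count.
    """
    # Set(P): every pattern that P d-covers (each pattern covers itself).
    cover = {
        p: {q for q in patterns
            if q <= p and 1 - patterns[p] / patterns[q] <= d}
        for p in patterns
    }
    uncovered = set(patterns)
    representatives = []
    while uncovered:
        # Pick the pattern covering the most still-uncovered patterns.
        best = max(cover, key=lambda p: len(cover[p] & uncovered))
        representatives.append(best)
        uncovered -= cover[best]
    return representatives


# Hypothetical patterns and supports, for illustration only.
example_patterns = {
    frozenset("a"): 100,
    frozenset("ab"): 98,
    frozenset("abc"): 95,
    frozenset("d"): 50,
    frozenset("de"): 30,
}
example_reps = rp_global(example_patterns, d=0.1)
print(example_reps)
```

In this example {a, b, c} covers {a} and {a, b}, while {d} and {d, e} are too far apart in support and each remains its own representative. The loop always terminates because every pattern covers itself. The bound formula itself did not survive in this transcript; for reference, the classical guarantee for greedy set cover, which is presumably what the slide states, is $C_g \le C \cdot H(\max_{P \in F} |Set(P)|)$, where $H(n) = 1 + 1/2 + \dots + 1/n$.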

16
RPlocal
  • RPglobal is expensive
  • Assumes all the frequent patterns are pre-computed
  • Need to find the globally best representative
    pattern at each step
  • Need to compute the pair-wise distance between
    all frequent patterns
  • Relax the constraints: RPlocal
  • Find a locally good representative pattern at each
    step
  • Directly mine from raw data
  • Do not compute pairwise distances

17
Local Greedy Method
  • Principle of Local Method
  • Bound
  • Cl: number of representative patterns output by the
    local method
  • T: optimal number of patterns covering all probe
    patterns
  • Set(P): the set of patterns covered by P

18
Mine from Raw Data
  • Beneficial
  • Without storage of huge intermediate outputs
  • More efficient pruning methods
  • Applicable
  • Utilize the internal relations during mining
  • FP-growth method
  • Depth first search in Pattern-Space
  • A pattern can only be covered by its sons or
    patterns visited before

(Figure labels: probe pattern P; P's sons; visited patterns covering P)

19
Integrate Local Method into FP-Mining
  • Algorithm
  • Follow the depth-first search in pattern space
  • Remember all previously discovered representative
    patterns
  • For each pattern P that is
  • Not covered yet
  • Being visited for the second time, when the
    traversal returns from its sons
  • Select a representative pattern using the local
    method (with P as the new probe pattern)

20
Avoid Pair-wise Comparisons
  • Find a good representative pattern (for probe
    pattern P)
  • Strong correlations between Pattern positions,
    coverage of uncovered patterns and pattern length
  • Simple but effective heuristic: select the
    longest item-set among P's sons as a new
    representative pattern to cover P
  • 4952: first visit of P; 5043: second visit of P
    (the patterns between 4952 and 5043 are sons of P)
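The heuristic can be sketched in a few lines. The version below is an assumption-laden illustration: it takes the sons of the probe pattern as already-mined item-sets with supports, filters them to those that actually d-cover the probe (a check added here for safety, not stated on the slide), and falls back to the probe itself when none qualifies. The name `pick_local_representative` and the example data are hypothetical.

```python
def pick_local_representative(probe: frozenset, probe_sup: int,
                              sons: dict, d: float) -> frozenset:
    """Choose the longest son of `probe` that d-covers it.

    `sons` maps each descendant pattern (a superset of the probe,
    visited between its first and second visit) to its support count.
    """
    candidates = [
        son for son, sup in sons.items()
        if probe <= son and 1 - sup / probe_sup <= d
    ]
    if not candidates:
        # Fallback chosen for this sketch: the probe represents itself.
        return probe
    # Longest item-set first; no pairwise distances between sons are computed.
    return max(candidates, key=len)


# Hypothetical probe pattern and sons, for illustration only.
probe = frozenset("a")
sons = {
    frozenset("ab"): 97,
    frozenset("abc"): 93,
    frozenset("abcd"): 60,
}
chosen = pick_local_representative(probe, 100, sons, d=0.1)
print(sorted(chosen))
```

Only one distance per son is evaluated, each against the probe, which is what lets the local method avoid the quadratic pairwise comparison used by the global method.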

(Figure labels: first visit of P; second visit of P; previous patterns; P's sons)

21
Efficient Implementation
  • Non-closed pattern
  • There exists a super-pattern with the same support
  • Closed_Index (N bits)
  • Each bit records whether an item appears
    consistently, i.e., in every supporting transaction
  • Aggregate the closed_index with the pattern
  • Not closed if at least one out-of-pattern bit is set
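The closedness test can be illustrated with a small bitmask sketch. Note the assumptions: the slide aggregates the index during mining, whereas this version recomputes it from raw transactions for clarity; the transactions are the familiar textbook FP-growth example, assumed here rather than taken from the slides; and the function names are hypothetical.

```python
ITEMS = ["f", "c", "a", "b", "m", "p"]  # assumed item order

# Assumed example transactions (a common textbook FP-growth data set).
TRANSACTIONS = [
    {"f", "c", "a", "m", "p"},
    {"f", "c", "a", "b", "m"},
    {"f", "b"},
    {"c", "b", "p"},
    {"f", "c", "a", "m", "p"},
]


def closed_index(pattern: set) -> str:
    """One bit per item: '1' if the item occurs in every transaction
    that contains `pattern`, else '0'."""
    supporting = [t for t in TRANSACTIONS if pattern <= t]
    return "".join(
        "1" if supporting and all(item in t for t in supporting) else "0"
        for item in ITEMS
    )


def is_closed(pattern: set) -> bool:
    """Closed iff no bit outside the pattern is set."""
    bits = closed_index(pattern)
    return not any(
        bit == "1" and item not in pattern
        for item, bit in zip(ITEMS, bits)
    )


print(closed_index({"c", "a"}))  # 111010
print(is_closed({"c", "a"}))     # False: the bit for f is set
print(is_closed({"f", "c", "a", "m"}))
```

With this data and item order, the index for (c,a) comes out as 111010, the same bit string shown in the slide's example: the bits for f and m are set although neither item belongs to (c,a), so a super-pattern with the same support exists.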

(c,a): 111010
f does not belong to (c,a), yet the support of (c,a) is the
same as the support of (f,c,a), so (c,a) is not closed.

22
Efficient Implementation
  • Prune non-closed patterns
  • Non-closed patterns are guaranteed to be covered
  • Use limited bits to remember subset of items
  • The majority of non-closed patterns are pruned by
    closed_index
  • A few left are pruned by checking the coverage of
    representative patterns

23
Experimental Setting
  • Data
  • Frequent Itemset Mining dataset repository
    (http://fimi.cs.helsinki.fi/data/)
  • Comparing algorithms
  • FPclose: an efficient algorithm to generate all
    closed itemsets, winner of FIMI workshop 2003
  • RPglobal: first uses FPclose to generate closed
    itemsets, then uses the global greedy method to find
    representative patterns
  • RPlocal: directly uses the local method to find
    representative patterns from raw data

24
Performance Study
  • Number of Representative Patterns

25
Performance Study
  • Running Time

26
Performance Study
  • Quality of Representative Patterns

27
Conclusions
  • Significant reduction in the number of output patterns
  • Two orders of magnitude of reduction for d = 0.1
  • Captures both expressions and supports
  • Easily extendable for compression of sequential,
    graph and structure data
  • RPglobal
  • theoretical bound
  • works well on small collection of patterns
  • RPlocal
  • much more efficient
  • Still quite good compression quality

28
Future Work
  • Using representative patterns for association,
    correlation and classification
  • Compressing frequent patterns over incrementally
    updated data (e.g., streams)
  • Further compressing the representative patterns
    with more advanced compression models (e.g.,
    pattern profiles)