1
Frequent Pattern based Iterative Projected
Clustering
  • Presented by Yiu Man Lung
  • 11 June, 2003

2
Outline
  • Projected Clustering
  • A Monte Carlo algorithm (DOC)
  • Our projected clustering method
  • μ-growth: finds the best subspace for a fixed p
  • FPC: utilizes μ-growth to get the best cluster
  • MineClus: further refines the clusters
  • Experiments
  • Conclusion

3
Projected Clustering
  • The distance between any two points is almost the same
    in high-dimensional spaces [Beyer et al.]
  • Distance measures are more meaningful in subspaces
  • Irrelevant (noise) attributes exist in real datasets
  • Objective of projected clustering:
  • A set of clusters
  • The set of relevant attributes for each cluster

4
Projected Clustering
  • Two natural projected clusters
  • C1 = {T1, T2}, relevant: {a1, a2, a3}, noise: {a4, a5}
  • C2 = {T3, T4}, relevant: {a3, a4, a5}, noise: {a1, a2}
  • If all attributes are considered,
  • the Manhattan distance (T2, T3) is the smallest (100)
  • and the high-quality clusters that exist in subspaces
    fail to be discovered

5
A Monte Carlo algorithm (DOC)
  • Density-based Optimal projective Clustering
  • w controls the extent of the clusters
  • ∀ i ∈ D (subspace): (max_{q∈C} q_i) − (min_{q∈C} q_i) ≤ w
  • Quality of a cluster C (see the sketch below)
  • μ(a, b) = a · (1/β)^b, where a = |C| and b = |D|
  • b (i.e., |D|) dominates the μ value
  • β ∈ (0, 1) reflects the importance of the subspace
  • Large β favors large clusters with small subspaces
  • Small β favors small clusters with large subspaces
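A minimal numeric sketch of this quality function (Python; the cluster sizes and β values below are illustrative, not from the slides):

    def mu(a, b, beta):
        """Quality of a cluster with a points and b relevant dimensions."""
        return a * (1.0 / beta) ** b

    # Large beta rewards cluster size; small beta rewards subspace size.
    print(mu(100, 2, beta=0.5))    # 400.0   -> the big cluster wins here
    print(mu(10, 4, beta=0.5))     # 160.0
    print(mu(100, 2, beta=0.25))   # 1600.0
    print(mu(10, 4, beta=0.25))    # 2560.0  -> now the larger subspace wins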

6
A Monte Carlo algorithm (DOC)
7
A Monte Carlo algorithm (DOC)
  • Iterative (greedy) clustering approach (sketched below)
  • DOC is called on S − C (the remaining records) to
    discover the next cluster
  • The process continues until no more clusters can be
    discovered
  • The final remaining records are outliers
  • Advantages
  • Able to discover clusters of various sizes
  • Able to discover subspaces of various sizes
  • Disadvantages
  • The number of inner loops is very high (e.g., 2^20)
  • The same extent w is used for all dimensions, which may
    not represent natural clusters well
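A minimal sketch of this greedy outer loop (Python; doc_one_cluster stands in for the DOC inner procedure and is a hypothetical callable, not code from the paper):

    def iterative_clustering(S, doc_one_cluster):
        """Peel off one projected cluster at a time; the leftovers are outliers."""
        clusters = []
        remaining = set(S)
        while True:
            cluster, subspace = doc_one_cluster(remaining)
            if not cluster:
                break                    # no further cluster can be discovered
            clusters.append((cluster, subspace))
            remaining -= cluster         # next call works on S minus found clusters
        return clusters, remaining       # remaining records are the outliers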

8
From Projected Clustering to mining frequent
itemsets
  • Select a random medoid p in S and form the binary table
  • An entry is 1 if the record is bounded by p with respect
    to w on that attribute, otherwise 0
  • Set the minimum support to α·|S| for mining
  • Objective: mine the subspace with the highest μ value
    (see the sketch after the figure captions)

w = 2
(a) Original table
(b) Binary table
(c) Corresponding itemsets
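A minimal sketch of this transformation (Python; the records, medoid, and w below are made-up illustrations, not the table from the slide):

    def to_itemset(record, medoid, w):
        """Attributes on which the record lies within w of the medoid."""
        return {i for i, (r, p) in enumerate(zip(record, medoid)) if abs(r - p) <= w}

    records = [(1, 3, 9), (2, 4, 50), (40, 3, 8)]
    medoid = (1, 4, 10)
    itemsets = [to_itemset(r, medoid, w=2) for r in records]
    print(itemsets)   # [{0, 1, 2}, {0, 1}, {1, 2}]
    # Mining these itemsets with min_sup = alpha*|S| gives candidate subspaces;
    # the candidate with the highest mu value becomes the cluster's subspace.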
9
FP-tree
  • Basis of our μ-growth algorithm
  • Requires two data scans
  • The first scan collects the frequency of each item
  • Itemsets are inserted in decreasing frequency order
  • Each node contains an item ID and a count
  • Paths with common prefixes are compressed
  • A header table links nodes with the same item
  • Frequent patterns are mined by the FP-Growth algorithm
    (an insertion sketch follows)
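A minimal sketch of the node structure and the insertion step (Python; simplified, omitting the first scan and the frequency-based reordering of items):

    class FPNode:
        def __init__(self, item, parent=None):
            self.item, self.count, self.parent = item, 0, parent
            self.children = {}                # item -> FPNode

    def insert(root, itemset, header):
        """Insert one itemset, already sorted by decreasing item frequency."""
        node = root
        for item in itemset:
            child = node.children.get(item)
            if child is None:                 # start a new branch
                child = FPNode(item, parent=node)
                node.children[item] = child
                header.setdefault(item, []).append(child)  # link same-item nodes
            child.count += 1                  # shared prefixes only bump counts
            node = child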

10
Example of FP-Growth
11
Example of FP-Growth
  • Assume min_sup = 4
  • Extract all prefixes of a3
  • Cond. pattern base: {a0a1a2: 2, a0a2: 2}
  • Build the conditional pattern tree for a3
  • Frequent itemsets:
  • {a3: 4, a0a3: 4, a2a3: 4, a0a2a3: 4}
  • Extract all prefixes of a2
  • Cond. pattern base: {a0a1: 2, a0: 2}
  • Build the conditional pattern tree for a2
  • Frequent itemsets: {a2: 4, a0a2: 4}
  • Extract all prefixes of a1
  • Cond. pattern base: {a0: 5}
  • Build the conditional pattern tree for a1
  • Frequent itemsets: {a1: 5, a0a1: 5}
  • Examine the first item a0
  • Frequent itemset: {a0: 10}
  • In total: 4 trees, 9 frequent itemsets

12
Optimization I
  • Only generate patterns from the prefixes of the single
    path (linear vs. exponential)
  • Reason: μ(a, b) ≥ μ(a, b′) if b ⊇ b′
    (a numeric check follows Optimization II)

(a) FP-tree with a single path
(b) The patterns
13
Optimization II
  • Only generate the pattern from the most frequent entry
    (1 pattern vs. one per entry in the table header)
  • Reason: μ(a, b) ≥ μ(a′, b) if a ≥ a′ (see the check below)

(a) FP-tree table header
(b) The patterns
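A minimal numeric check of the two monotonicity facts behind Optimizations I and II (Python; β = 0.25 is an arbitrary illustrative value):

    def mu(a, b, beta=0.25):
        return a * (1.0 / beta) ** b

    # Optimization I: for the same support, more dimensions never hurts, so on
    # a single path only the longest prefix ending at each node is needed.
    assert mu(4, 3) >= mu(4, 2)      # b >= b'  =>  mu(a, b) >= mu(a, b')

    # Optimization II: for the same dimensionality, higher support never hurts,
    # so only the most frequent header entry needs to generate a pattern.
    assert mu(6, 2) >= mu(4, 2)      # a >= a'  =>  mu(a, b) >= mu(a', b)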
14
Some notations
15
The μ-growth algorithm
  • Applies the two optimizations discussed before
  • Growing the conditional tree from the l-th entry:
  • Maximum possible support: table_l.support
  • Maximum possible itemset size: dim(I_cond) + l
  • Prune if μ(table_l.support, dim(I_cond) + l) ≤ μ(I_best)
    (see the sketch below)
  • Search order
  • Affects performance but not results
  • Dimensionality (itemset size) dominates the μ value
  • Mine from the least frequent item to the most frequent
    item
  • Longer patterns are found earlier ⇒ facilitates pruning
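A minimal sketch of this pruning test (Python; the argument names are placeholders for the quantities above, not identifiers from the paper):

    def should_prune(table_support, dim_cond, l, mu_best, beta):
        """Skip entry l if even the most optimistic pattern (maximum support
        and maximum itemset size) cannot beat the best itemset found so far."""
        best_possible = table_support * (1.0 / beta) ** (dim_cond + l)
        return best_possible <= mu_best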

16
(No Transcript)
17
Example of μ-growth
  • Assume min_sup = 4, β = 0.1
  • Examine the first item a0
  • Frequent itemset: {a0: 10}
  • μ(I_best) = 10 · (1/0.1)^1 = 100
  • l = 4, the position of a3 in the header table
  • table_4.support = 4, dim(I_cond) + 4 = 4
  • Pruning condition not satisfied
  • Build the conditional pattern tree for a3
  • Frequent itemsets generated:
  • {a0a3: 4, a0a2a3: 4}
  • μ(I_best) = 4 · (1/0.1)^3 = 4000
  • l = 3, the position of a2 in the header table
  • table_3.support = 4, dim(I_cond) + 3 = 3
  • Pruning condition satisfied
  • l = 2, the position of a1 in the header table
  • table_2.support = 5, dim(I_cond) + 2 = 2
  • Pruning condition satisfied
  • In total: 2 trees, 3 frequent itemsets
    (the arithmetic is checked below)
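Checking the arithmetic of this run (Python; all numbers are taken from the slide above):

    beta = 0.1
    mu = lambda a, b: a * (1.0 / beta) ** b

    assert mu(10, 1) == 100     # mu(I_best) after examining a0 alone
    assert mu(4, 4) > 100       # bound at l = 4 (a3): not pruned, grow the tree
    assert mu(4, 3) == 4000     # new mu(I_best), from a0a2a3 with support 4
    assert mu(4, 3) <= 4000     # bound at l = 3 (a2): pruned
    assert mu(5, 2) <= 4000     # bound at l = 2 (a1): pruned (mu = 500)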

18
Efficiency of μ-growth
  • As shown in the last slide, μ-growth is efficient
  • The pruning power is high when β is low
  • Effective clustering requires 1/(4d) ≤ β < 1/2, as shown
    in the DOC paper
  • The lowest pruning power (when β is near 1/2) is still
    high enough

19
The FPC algorithm
Uses the previous I_best for further pruning
20
Efficiency of the FPC algorithm
  • Utilizes the best μ value found so far
  • Prunes FP-trees of the same or a different p (medoid)
    that could not lead to better results
  • If a good p is found early, a lot of time is saved in
    subsequent iterations

21
The MineClus algorithm
  • Iterative Phase
  • Produce the clusters iteratively
  • Pruning Phase
  • Discard clusters with significantly low μ values
  • Merging Phase
  • Merge clusters following the agglomerative
    paradigm
  • Refinement Phase
  • Assign remaining records to the clusters
  • Handle outliers

22
Iterative Phase
  • Apply FPC to discover a cluster
  • Find the centroid of the cluster and max_dist, the
    maximum distance of a cluster member from the centroid
  • Assign a record in S to the cluster if it is at most
    max_dist from the centroid
  • Remove the assigned records from S
  • Repeat until no more clusters are found
    (see the sketch below)
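A minimal sketch of one pass of this phase (Python; fpc stands in for the FPC call, records are tuples, S is a set of them, and the distance restricted to the relevant dimensions is assumed to be Euclidean):

    import math

    def dist(x, y, dims):
        """Distance over the cluster's relevant dimensions (assumed Euclidean)."""
        return math.sqrt(sum((x[i] - y[i]) ** 2 for i in dims))

    def one_iteration(S, fpc):
        cluster, dims = fpc(S)              # best cluster and subspace from FPC
        if not cluster:
            return None, S                  # nothing found: iterative phase ends
        d = len(next(iter(cluster)))
        centroid = [sum(r[i] for r in cluster) / len(cluster) for i in range(d)]
        max_dist = max(dist(r, centroid, dims) for r in cluster)
        grown = {r for r in S if dist(r, centroid, dims) <= max_dist}
        return (grown, dims), S - grown     # assigned records are removed from S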

23
Pruning Phase
  • Sort the clusters in descending order of μ value
  • Find pos such that μ_pos / μ_{pos+1} ≥ μ_i / μ_{i+1} ∀ i
  • The set of clusters is divided into good clusters
    (i ≤ pos) and bad clusters (i > pos)
  • Discard the bad clusters if there are at least k good
    clusters (see the sketch below)
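A minimal sketch of finding this split point (Python; mus is assumed to be the list of μ values already sorted in descending order):

    def split_point(mus):
        """Cut where the ratio of consecutive mu values is largest."""
        ratios = [mus[i] / mus[i + 1] for i in range(len(mus) - 1)]
        return ratios.index(max(ratios)) + 1   # clusters[:pos] are the good ones

    print(split_point([4000.0, 3500.0, 3200.0, 40.0, 35.0]))   # -> 3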

24
Merging Phase
  • Merge clusters until k clusters remain
  • A cluster may be a sub-cluster of a natural cluster
  • A good merged cluster has low spread and a high μ value
  • Rank candidate merges by increasing spread and by
    decreasing μ value
  • The new (merged) cluster is the one with the highest sum
    of rankings (see the sketch below)
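A minimal sketch of this ranking step (Python; the convention that a larger rank number is better, i.e., rank n for the lowest spread and for the highest μ, is an assumption made here to match "highest sum of rankings"):

    def best_merge(candidates):
        """candidates: list of (pair_id, spread, mu) for every possible merge."""
        score = {c[0]: 0 for c in candidates}
        by_spread = sorted(candidates, key=lambda c: c[1], reverse=True)
        by_mu = sorted(candidates, key=lambda c: c[2])
        for rank, c in enumerate(by_spread, start=1):   # lowest spread gets rank n
            score[c[0]] += rank
        for rank, c in enumerate(by_mu, start=1):        # highest mu gets rank n
            score[c[0]] += rank
        return max(score, key=score.get)    # merge with the highest sum of rankings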

25
Refinement Phase
  • Assign remaining records to the clusters
  • Also handle outliers
  • Apply a method similar to the refinement phase of
    PROCLUS

26
The target number of clusters k
  • Why is k optional?
  • The iterative phase (the most important phase) is
    independent of k
  • k is only used in the pruning phase and the merging phase
  • If the user has no idea of k:
  • set k to a huge value
  • this only skips the pruning and merging phases

27
Experiments
  • Comparisons with PROCLUS and DOC
  • Dependency on parameters
  • Efficiency and scalability
  • Also tested on real datasets from the UCI ML repository

28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
Conclusion
  • Identifies the similarity between mining frequent
    itemsets and discovering the best projected cluster
  • Adapts the FP-growth algorithm to search efficiently for
    the itemset with the highest μ value
  • Extends the cluster definition in DOC to consider more
    appropriate distance and quality measures
  • Evaluates the efficiency and effectiveness of MineClus by
    comparing it with DOC and PROCLUS

32
References
  • C. C. Aggarwal, J. L. Wolf, P. S. Yu, C.
    Procopiuc, and J. S. Park. Fast algorithms for
    projected clustering. 1999 SIGMOD.
  • C. C. Aggarwal and P. S. Yu. Finding generalized
    projected clusters in high dimensional spaces.
    2000 SIGMOD.
  • R. Agrawal, J. Gehrke, D. Gunopulos, and P.
    Raghavan. Automatic subspace clustering of high
    dimensional data for data mining applications.
    1998 SIGMOD.
  • R. Agrawal and R. Srikant. Fast algorithms for
    mining association rules in large databases. 1994
    VLDB.

33
References
  • K. P. Bennett, U. Fayyad, and D. Geiger.
    Density-based indexing for approximate
    nearest-neighbor queries. 1999 SIGKDD.
  • K. S. Beyer, J. Goldstein, R. Ramakrishnan, and
    U. Shaft. When is nearest neighbor meaningful?
    1999 ICDT.
  • C. Blake and C. Merz. UCI repository of machine
    learning databases, 1998.
  • C. Faloutsos and K.-I. Lin. FastMap: A fast
    algorithm for indexing, data-mining and
    visualization of traditional and multimedia
    datasets. 1995 SIGMOD.
  • S. Guha, R. Rastogi, and K. Shim. ROCK: A robust
    clustering algorithm for categorical attributes.
    1999 ICDE.

34
References
  • J. Han and M. Kamber. Data Mining: Concepts and
    Techniques. Morgan Kaufmann, 2001.
  • J. Han, J. Pei, and Y. Yin. Mining frequent
    patterns without candidate generation. 2000
    SIGMOD.
  • C. M. Procopiuc, M. Jones, P. K. Agarwal, and T.
    M. Murali. A Monte Carlo algorithm for fast
    projective clustering. 2002 SIGMOD.
  • T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH:
    An efficient data clustering method for very
    large databases. 1996 SIGMOD.