1
Frequent Pattern based Iterative Projected
Clustering
  • Presented by Yiu Man Lung
  • 11 June, 2003

2
Outline
  • Projected Clustering
  • A Monte Carlo algorithm (DOC)
  • Our projected clustering method
  • μ-growth: finds the best subspace for a fixed p
  • FPC: utilizes μ-growth to get the best cluster
  • MineClus: further refines the clusters
  • Experiments
  • Conclusion

3
Projected Clustering
  • The distance between any two points is almost the same
    in high-dimensional spaces [Beyer et al.]
  • Distance measures are more meaningful in subspaces
  • Irrelevant (noise) attributes exist in real datasets
  • Objective of projected clustering:
  • A set of clusters
  • The set of relevant attributes for each cluster

4
Projected Clustering
  • Two natural projected clusters
  • C1 = {T1, T2}, relevant: {a1, a2, a3}, noise: {a4, a5}
  • C2 = {T3, T4}, relevant: {a3, a4, a5}, noise: {a1, a2}
  • If all attributes are considered,
  • the Manhattan distance (T2, T3) is the smallest (100)
  • and the high-quality clusters that exist in subspaces
    fail to be discovered

5
A Monte Carlo algorithm (DOC)
  • Density-based Optimal projective Clustering
  • w controls the extent of the clusters
  • ∀ i ∈ D (subspace): (max_{q∈C} q_i) − (min_{q∈C} q_i) ≤ w
  • Quality of a cluster C (see the sketch below)
  • μ(a, b) = a · (1/β)^b, where a = |C| and b = |D|
  • b (i.e., |D|) dominates the μ value
  • β ∈ (0, 1) reflects the importance of the subspace
  • Large β favors large clusters with small subspaces
  • Small β favors small clusters with large subspaces
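A minimal numeric sketch of this quality function (Python; the cluster sizes and β values below are illustrative, not from the slides):

    def mu(a, b, beta):
        """Quality of a cluster with a points and b relevant dimensions."""
        return a * (1.0 / beta) ** b

    # Large beta rewards cluster size; small beta rewards subspace size.
    print(mu(100, 2, beta=0.5))    # 400.0   -> the big cluster wins here
    print(mu(10, 4, beta=0.5))     # 160.0
    print(mu(100, 2, beta=0.25))   # 1600.0
    print(mu(10, 4, beta=0.25))    # 2560.0  -> now the larger subspace wins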

6
A Monte Carlo algorithm (DOC)
7
A Monte Carlo algorithm (DOC)
  • Iterative (greedy) clustering approach (sketched below)
  • DOC is called on S − C (the remaining records) to
    discover the next cluster
  • The process continues until no more clusters can be
    discovered
  • The final remaining records are outliers
  • Advantages
  • Able to discover clusters of various sizes
  • Able to discover subspaces of various sizes
  • Disadvantages
  • The number of inner loops is very high (e.g., 2^20)
  • The same extent w is used for all dimensions, which may
    not represent natural clusters well
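A minimal sketch of this greedy outer loop (Python; doc_one_cluster stands in for the DOC inner procedure and is a hypothetical callable, not code from the paper):

    def iterative_clustering(S, doc_one_cluster):
        """Peel off one projected cluster at a time; the leftovers are outliers."""
        clusters = []
        remaining = set(S)
        while True:
            cluster, subspace = doc_one_cluster(remaining)
            if not cluster:
                break                    # no further cluster can be discovered
            clusters.append((cluster, subspace))
            remaining -= cluster         # next call works on S minus found clusters
        return clusters, remaining       # remaining records are the outliers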

8
From Projected Clustering to mining frequent
itemsets
  • Select a random medoid p in S and form the binary table
  • An entry is 1 if the record is bounded by p with respect
    to w on that attribute, otherwise 0
  • Set the minimum support to α·|S| for mining
  • Objective: mine the subspace with the highest μ value
    (see the sketch after the figure captions)

w = 2
(a) Original table
(b) Binary table
(c) Corresponding itemsets
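A minimal sketch of this transformation (Python; the records, medoid, and w below are made-up illustrations, not the table from the slide):

    def to_itemset(record, medoid, w):
        """Attributes on which the record lies within w of the medoid."""
        return {i for i, (r, p) in enumerate(zip(record, medoid)) if abs(r - p) <= w}

    records = [(1, 3, 9), (2, 4, 50), (40, 3, 8)]
    medoid = (1, 4, 10)
    itemsets = [to_itemset(r, medoid, w=2) for r in records]
    print(itemsets)   # [{0, 1, 2}, {0, 1}, {1, 2}]
    # Mining these itemsets with min_sup = alpha*|S| gives candidate subspaces;
    # the candidate with the highest mu value becomes the cluster's subspace.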
9
FP-tree
  • Basis of our μ-growth algorithm
  • Requires two data scans
  • The first scan collects the frequency of each item
  • Itemsets are inserted in decreasing frequency order
  • Each node contains an item ID and a count
  • Paths with common prefixes are compressed
  • A header table links nodes with the same item
  • Frequent patterns are mined by the FP-Growth algorithm
    (an insertion sketch follows)
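A minimal sketch of the node structure and the insertion step (Python; simplified, omitting the first scan and the frequency-based reordering of items):

    class FPNode:
        def __init__(self, item, parent=None):
            self.item, self.count, self.parent = item, 0, parent
            self.children = {}                # item -> FPNode

    def insert(root, itemset, header):
        """Insert one itemset, already sorted by decreasing item frequency."""
        node = root
        for item in itemset:
            child = node.children.get(item)
            if child is None:                 # start a new branch
                child = FPNode(item, parent=node)
                node.children[item] = child
                header.setdefault(item, []).append(child)  # link same-item nodes
            child.count += 1                  # shared prefixes only bump counts
            node = child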

10
Example of FP-Growth
11
Example of FP-Growth
  • Assume min_sup = 4
  • Extract all prefixes of a3
  • Cond. pattern base: {a0a1a2: 2, a0a2: 2}
  • Build the conditional pattern tree for a3
  • Frequent itemsets:
  • {a3: 4, a0a3: 4, a2a3: 4, a0a2a3: 4}
  • Extract all prefixes of a2
  • Cond. pattern base: {a0a1: 2, a0: 2}
  • Build the conditional pattern tree for a2
  • Frequent itemsets: {a2: 4, a0a2: 4}
  • Extract all prefixes of a1
  • Cond. pattern base: {a0: 5}
  • Build the conditional pattern tree for a1
  • Frequent itemsets: {a1: 5, a0a1: 5}
  • Examine the first item a0
  • Frequent itemset: {a0: 10}
  • In total: 4 trees, 9 frequent itemsets

12
Optimization I
  • Only generate patterns from the prefixes of the single
    path (linear vs. exponential)
  • Reason: μ(a, b) ≥ μ(a, b′) if b ⊇ b′
    (a numeric check follows Optimization II)

(a) FP-tree with a single path
(b) The patterns
13
Optimization II
  • Only generate the pattern from the most frequent entry
    (1 pattern vs. one per entry in the table header)
  • Reason: μ(a, b) ≥ μ(a′, b) if a ≥ a′ (see the check below)

(a) FP-tree table header
(b) The patterns
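A minimal numeric check of the two monotonicity facts behind Optimizations I and II (Python; β = 0.25 is an arbitrary illustrative value):

    def mu(a, b, beta=0.25):
        return a * (1.0 / beta) ** b

    # Optimization I: for the same support, more dimensions never hurts, so on
    # a single path only the longest prefix ending at each node is needed.
    assert mu(4, 3) >= mu(4, 2)      # b >= b'  =>  mu(a, b) >= mu(a, b')

    # Optimization II: for the same dimensionality, higher support never hurts,
    # so only the most frequent header entry needs to generate a pattern.
    assert mu(6, 2) >= mu(4, 2)      # a >= a'  =>  mu(a, b) >= mu(a', b)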
14
Some notations
15
The μ-growth algorithm
  • Applies the two optimizations discussed before
  • Growing the conditional tree from the l-th entry:
  • Maximum possible support: table_l.support
  • Maximum possible itemset size: dim(I_cond) + l
  • Prune if μ(table_l.support, dim(I_cond) + l) ≤ μ(I_best)
    (see the sketch below)
  • Search order
  • Affects performance but not results
  • Dimensionality (itemset size) dominates the μ value
  • Mine from the least frequent item to the most frequent
    item
  • Longer patterns are found earlier ⇒ facilitates pruning
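A minimal sketch of this pruning test (Python; the argument names are placeholders for the quantities above, not identifiers from the paper):

    def should_prune(table_support, dim_cond, l, mu_best, beta):
        """Skip entry l if even the most optimistic pattern (maximum support
        and maximum itemset size) cannot beat the best itemset found so far."""
        best_possible = table_support * (1.0 / beta) ** (dim_cond + l)
        return best_possible <= mu_best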

16
(No Transcript)
17
Example of μ-growth
  • Assume min_sup = 4, β = 0.1
  • Examine the first item a0
  • Frequent itemset: {a0: 10}
  • μ(I_best) = 10 · (1/0.1)^1 = 100
  • l = 4, the position of a3 in the header table
  • table_4.support = 4, dim(I_cond) + 4 = 4
  • Pruning condition not satisfied
  • Build the conditional pattern tree for a3
  • Frequent itemsets generated:
  • {a0a3: 4, a0a2a3: 4}
  • μ(I_best) = 4 · (1/0.1)^3 = 4000
  • l = 3, the position of a2 in the header table
  • table_3.support = 4, dim(I_cond) + 3 = 3
  • Pruning condition satisfied
  • l = 2, the position of a1 in the header table
  • table_2.support = 5, dim(I_cond) + 2 = 2
  • Pruning condition satisfied
  • In total: 2 trees, 3 frequent itemsets
    (the arithmetic is checked below)
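Checking the arithmetic of this run (Python; all numbers are taken from the slide above):

    beta = 0.1
    mu = lambda a, b: a * (1.0 / beta) ** b

    assert mu(10, 1) == 100     # mu(I_best) after examining a0 alone
    assert mu(4, 4) > 100       # bound at l = 4 (a3): not pruned, grow the tree
    assert mu(4, 3) == 4000     # new mu(I_best), from a0a2a3 with support 4
    assert mu(4, 3) <= 4000     # bound at l = 3 (a2): pruned
    assert mu(5, 2) <= 4000     # bound at l = 2 (a1): pruned (mu = 500)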

18
Efficiency of μ-growth
  • As shown in the last slide, μ-growth is efficient
  • The pruning power is high when β is low
  • Effective clustering requires 1/(4d) ≤ β < 1/2, as shown
    in the DOC paper
  • The lowest pruning power (when β is near 1/2) is still
    high enough

19
The FPC algorithm
Uses the previous I_best for further pruning
20
Efficiency of the FPC algorithm
  • Utilizes the best μ value found so far
  • Prunes FP-trees of the same or a different p (medoid)
    that could not lead to better results
  • If a good p is found early, a lot of time is saved in
    subsequent iterations

21
The MineClus algorithm
  • Iterative Phase
  • Produce the clusters iteratively
  • Pruning Phase
  • Discard clusters with significantly low μ values
  • Merging Phase
  • Merge clusters following the agglomerative
    paradigm
  • Refinement Phase
  • Assign remaining records to the clusters
  • Handle outliers

22
Iterative Phase
  • Apply FPC to discover a cluster
  • Find the centroid of the cluster and max_dist, the
    maximum distance of a cluster member from the centroid
  • Assign a record in S to the cluster if it is at most
    max_dist from the centroid
  • Remove the assigned records from S
  • Repeat until no more clusters are found
    (see the sketch below)
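A minimal sketch of one pass of this phase (Python; fpc stands in for the FPC call, records are tuples, S is a set of them, and the distance restricted to the relevant dimensions is assumed to be Euclidean):

    import math

    def dist(x, y, dims):
        """Distance over the cluster's relevant dimensions (assumed Euclidean)."""
        return math.sqrt(sum((x[i] - y[i]) ** 2 for i in dims))

    def one_iteration(S, fpc):
        cluster, dims = fpc(S)              # best cluster and subspace from FPC
        if not cluster:
            return None, S                  # nothing found: iterative phase ends
        d = len(next(iter(cluster)))
        centroid = [sum(r[i] for r in cluster) / len(cluster) for i in range(d)]
        max_dist = max(dist(r, centroid, dims) for r in cluster)
        grown = {r for r in S if dist(r, centroid, dims) <= max_dist}
        return (grown, dims), S - grown     # assigned records are removed from S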

23
Pruning Phase
  • Sort the clusters in descending order of μ value
  • Find pos such that μ_pos / μ_{pos+1} ≥ μ_i / μ_{i+1} ∀ i
  • The set of clusters is divided into good clusters
    (i ≤ pos) and bad clusters (i > pos)
  • Discard the bad clusters if there are at least k good
    clusters (see the sketch below)
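A minimal sketch of finding this split point (Python; mus is assumed to be the list of μ values already sorted in descending order):

    def split_point(mus):
        """Cut where the ratio of consecutive mu values is largest."""
        ratios = [mus[i] / mus[i + 1] for i in range(len(mus) - 1)]
        return ratios.index(max(ratios)) + 1   # clusters[:pos] are the good ones

    print(split_point([4000.0, 3500.0, 3200.0, 40.0, 35.0]))   # -> 3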

24
Merging Phase
  • Merge clusters until k clusters remain
  • A cluster may be a sub-cluster of a natural cluster
  • A good merged cluster has low spread and a high μ value
  • Rank candidate merges by increasing spread and by
    decreasing μ value
  • The new (merged) cluster is the one with the highest sum
    of rankings (see the sketch below)
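A minimal sketch of this ranking step (Python; the convention that a larger rank number is better, i.e., rank n for the lowest spread and for the highest μ, is an assumption made here to match "highest sum of rankings"):

    def best_merge(candidates):
        """candidates: list of (pair_id, spread, mu) for every possible merge."""
        score = {c[0]: 0 for c in candidates}
        by_spread = sorted(candidates, key=lambda c: c[1], reverse=True)
        by_mu = sorted(candidates, key=lambda c: c[2])
        for rank, c in enumerate(by_spread, start=1):   # lowest spread gets rank n
            score[c[0]] += rank
        for rank, c in enumerate(by_mu, start=1):        # highest mu gets rank n
            score[c[0]] += rank
        return max(score, key=score.get)    # merge with the highest sum of rankings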

25
Refinement Phase
  • Assign remaining records to the clusters
  • Also handle outliers
  • Apply a method similar to the refinement phase of
    PROCLUS

26
The target number of clusters k
  • Why is k optional?
  • The iterative phase (the most important phase) is
    independent of k
  • k is only used in the pruning phase and the merging phase
  • If the user has no idea of k:
  • set k to a huge value
  • this only skips the pruning and merging phases

27
Experiments
  • Comparisons with PROCLUS and DOC
  • Dependency on parameters
  • Efficiency and scalability
  • Also tested on real datasets from the UCI ML repository

28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
Conclusion
  • Identifies the similarity between mining frequent
    itemsets and discovering the best projected cluster
  • Adapts the FP-growth algorithm to search efficiently for
    the itemset with the highest μ value
  • Extends the cluster definition in DOC to consider more
    appropriate distance and quality measures
  • Evaluates the efficiency and effectiveness of MineClus by
    comparing it with DOC and PROCLUS

32
References
  • C. C. Aggarwal, J. L. Wolf, P. S. Yu, C.
    Procopiuc, and J. S. Park. Fast algorithms for
    projected clustering. 1999 SIGMOD.
  • C. C. Aggarwal and P. S. Yu. Finding generalized
    projected clusters in high dimensional spaces.
    2000 SIGMOD.
  • R. Agrawal, J. Gehrke, D. Gunopulos, and P.
    Raghavan. Automatic subspace clustering of high
    dimensional data for data mining applications.
    1998 SIGMOD.
  • R. Agrawal and R. Srikant. Fast algorithms for
    mining association rules in large databases. 1994
    VLDB.

33
References
  • K. P. Bennett, U. Fayyad, and D. Geiger.
    Density-based indexing for approximate
    nearest-neighbor queries. 1999 SIGKDD.
  • K. S. Beyer, J. Goldstein, R. Ramakrishnan, and
    U. Shaft. When is nearest neighbor meaningful?
    1999 ICDT.
  • C. Blake and C. Merz. UCI repository of machine
    learning databases, 1998.
  • C. Faloutsos and K.-I. Lin. FastMap: A fast
    algorithm for indexing, data-mining and
    visualization of traditional and multimedia
    datasets. 1995 SIGMOD.
  • S. Guha, R. Rastogi, and K. Shim. ROCK: A robust
    clustering algorithm for categorical attributes.
    1999 ICDE.

34
References
  • J. Han and M. Kamber. Data Mining: Concepts and
    Techniques. Morgan Kaufmann, 2001.
  • J. Han, J. Pei, and Y. Yin. Mining frequent
    patterns without candidate generation. 2000
    SIGMOD.
  • C. M. Procopiuc, M. Jones, P. K. Agarwal, and T.
    M. Murali. A Monte Carlo algorithm for fast
    projective clustering. 2002 SIGMOD.
  • T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH:
    An efficient data clustering method for very
    large databases. 1996 SIGMOD.