Title: DB Seminar Series: The Subspace Clustering Problem
1. DB Seminar Series: The Subspace Clustering Problem
- By Kevin Yip
- (17 May 2002)
2. Presentation Outline
- Problem definition
- Different approaches
- Focus: the projective clustering approach
3. Problem Definition: Traditional Clustering
- Traditional clustering problem: to divide data points into disjoint groups such that the value of an objective function is optimized.
- Objective function: to minimize intra-cluster distance and maximize inter-cluster distance.
- Distance function: defined over all dimensions, numeric or categorical.
4. Problem Definition: Traditional Clustering
- Example: clustering points in 2-D space.
- Distance function: Euclidean distance, dist(x, y) = sqrt(sum over i = 1..d of (xi - yi)^2), where d = number of dimensions (2 in this case).
5. Problem Definition: Traditional Clustering
- Example (source: CURE, SIGMOD 1998)
6. Problem Definition: The Distance Function Problem
- Observation: distance measures defined over all dimensions are sometimes inappropriate.
- Example (source: DOC, SIGMOD 2002):
- C1: (x1, x2)
- C2: (x2, x3)
- C3: (x1, x3)
7. Problem Definition: The Distance Function Problem
- As the number of noise dimensions increases, the distance functions become less and less accurate.
- => For each cluster, besides the set of data points, we also need to find the set of related dimensions (bounded attributes).
8. Problem Definition: The Subspace Clustering Problem
- Formal definition: given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized.
- Objective function: usually intra-cluster distance, where each cluster uses its own set of dimensions in the distance calculation.
9. Problem Definition: The Subspace Clustering Problem
- Observation: normal distance functions (Manhattan, Euclidean, etc.) give a smaller value if fewer dimensions are involved.
- => (1) Use a normalized distance function. (2) Also try to maximize the number of dimensions.
- Example (DOC): score(C, D) = |C| * (1/β)^|D|, where |C| = number of points in a cluster, |D| = number of relating attributes, and β is a constant (see the sketch below).
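A minimal sketch of this trade-off, assuming the score form above; the cluster sizes, dimension counts and β = 0.25 are made-up illustration values, not from the paper:

```python
# DOC-style quality score: score(C, D) = |C| * (1/beta)^|D|.
# Larger clusters and more bounded dimensions both raise the score, which
# counteracts the tendency of distances to shrink when fewer dimensions are used.

def doc_score(num_points: int, num_dims: int, beta: float = 0.25) -> float:
    """Score of a cluster with num_points points and num_dims bounded dimensions."""
    return num_points * (1.0 / beta) ** num_dims

# Hypothetical comparison: a small cluster bounded in many dimensions can
# outscore a larger cluster bounded in only one dimension.
print(doc_score(num_points=50, num_dims=3))   # 50 * 4^3 = 3200.0
print(doc_score(num_points=500, num_dims=1))  # 500 * 4   = 2000.0
```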
10. Different Approaches: Overview
- Grid-based dimension selection
- Association rule hypergraph partitioning
- Context-specific Bayesian clustering
- Projective clustering (Focus)
11. Different Approaches: Grid-Based Dimension Selection
- CLIQUE (98), ENCLUS (99), MAFIA (99), etc.
- Basic idea:
- A cluster is a region with high density.
- Divide the domain of each dimension into units.
- For each dimension, find all dense units: units with many points.
- Merge neighboring dense units into clusters.
- After finding all 1-d clusters, find 2-d dense units.
- Repeat with higher dimensions (see the sketch after this list).
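A simplified, non-optimized sketch of the bottom-up step for 1-d and 2-d units; the unit width, density threshold and candidate-generation rule here are assumptions for illustration, not the exact CLIQUE pruning:

```python
from collections import defaultdict
from itertools import combinations

def dense_units_1d(points, width=1.0, threshold=3):
    """Find 1-d dense units: (dimension, unit index) pairs with >= threshold points."""
    counts = defaultdict(int)
    for p in points:
        for dim, value in enumerate(p):
            counts[(dim, int(value // width))] += 1
    return {u for u, c in counts.items() if c >= threshold}

def dense_units_2d(points, dense_1d, width=1.0, threshold=3):
    """Candidate 2-d units combine 1-d dense units (monotonicity: a 2-d unit
    can only be dense if both of its 1-d projections are dense)."""
    counts = defaultdict(int)
    for p in points:
        units = [(dim, int(v // width)) for dim, v in enumerate(p)]
        for u1, u2 in combinations(units, 2):
            if u1 in dense_1d and u2 in dense_1d:
                counts[(u1, u2)] += 1
    return {u for u, c in counts.items() if c >= threshold}

# Toy data: two small dense regions in 2-D.
pts = [(2.1, 4.2), (2.3, 4.4), (2.8, 4.1), (6.5, 1.2), (6.7, 1.4), (6.9, 1.1)]
d1 = dense_units_1d(pts)
d2 = dense_units_2d(pts, d1)
```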
12. Different Approaches: Grid-Based Dimension Selection
- A 2-D dataset for illustration
13. Different Approaches: Grid-Based Dimension Selection
- Divide the domain of each dimension into
sub-units.
14. Different Approaches: Grid-Based Dimension Selection
- Find all dense units: units with many points.
- (Assume density threshold = 3 points.)
15. Different Approaches: Grid-Based Dimension Selection
- Merge neighboring dense units into clusters.
16. Different Approaches: Grid-Based Dimension Selection
- Find 2-d dense units. Merge neighboring dense
units, if any.
17. Different Approaches: Grid-Based Dimension Selection
- Repeat with higher dimensions.
18. Different Approaches: Grid-Based Dimension Selection
- Results: 1-d: <d1: (2,5)>, <d1: (6,8)>, <d2: (1,3)>, <d2: (4,6)>.
- 2-d: <d1,d2: (4,5), (4,5)>, <d1,d2: (7,8), (4,5)>.
19. Different Approaches: Grid-Based Dimension Selection
- Problems with the grid-based dimension selection approach:
- Non-disjoint clusters.
- Exponential dependency on the number of dimensions.
20. Different Approaches: Association Rule Hypergraph Partitioning
- 1997
- Cluster related items (attribute values) using association rules, and cluster related transactions (data points) using the clusters of items.
21. Different Approaches: Association Rule Hypergraph Partitioning
- Procedures:
- Find all frequent itemsets in the dataset.
- Construct a hypergraph with each item as a vertex, and each hyperedge corresponding to a frequent itemset. (If {A, B, C} is a frequent itemset, there is a hyperedge connecting the vertices of A, B, and C.)
22. Different Approaches: Association Rule Hypergraph Partitioning
- Procedures:
- Each hyperedge is assigned a weight equal to a function of the confidences of all the association rules between the connected items. (If there are association rules A => {B,C} (conf. 0.8), {A,B} => C (conf. 0.4), {A,C} => B (conf. 0.6), B => {A,C} (conf. 0.4), {B,C} => A (conf. 0.8) and C => {A,B} (conf. 0.6), then the weight of the hyperedge {A,B,C} can be the average of the confidences, i.e. 0.6; see the sketch below.)
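A rough sketch of this weighting, assuming rule confidences are already available from a rule miner; the numbers reproduce the example above:

```python
# Hyperedge weight = average confidence of the association rules among the
# items of one frequent itemset (one possible choice of weighting function).

def hyperedge_weight(rule_confidences):
    """rule_confidences: confidences of all rules derived from the itemset."""
    return sum(rule_confidences) / len(rule_confidences)

# Rules for itemset {A, B, C} from the example:
# A=>{B,C}: 0.8, {A,B}=>C: 0.4, {A,C}=>B: 0.6, B=>{A,C}: 0.4, {B,C}=>A: 0.8, C=>{A,B}: 0.6
print(hyperedge_weight([0.8, 0.4, 0.6, 0.4, 0.8, 0.6]))  # 0.6
```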
23. Different Approaches: Association Rule Hypergraph Partitioning
- Procedures:
- Use a hypergraph partitioning algorithm (e.g. HMETIS, 97) to divide the hypergraph into k partitions, so that the sum of the weights of the hyperedges that straddle partitions is minimized. Each partition forms a cluster with a different subset of items.
- Assign each transaction to a cluster, based on a scoring function (e.g. percentage of matched items).
24. Different Approaches: Association Rule Hypergraph Partitioning
- Problems with the association rule hypergraph partitioning approach:
- In real datasets, an item can be related to multiple clusters.
- May not be applicable to numeric attributes.
25. Different Approaches: Context-Specific Bayesian Clustering
- Naïve-Bayesian classification: given a training set with classes Ci (i = 1..k), a data point with attribute values x1, x2, ..., xd is classified by
  P(C=Ci | x1, x2, ..., xd) = P(x1, x2, ..., xd | C=Ci) P(C=Ci) / P(x1, x2, ..., xd)
  ∝ P(x1, x2, ..., xd | C=Ci) P(C=Ci)
  = P(x1 | C=Ci) P(x2 | C=Ci) ... P(xd | C=Ci) P(C=Ci)
  (a sketch follows below).
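A small sketch of this classification rule for categorical attributes; the probability tables and class names are placeholders for illustration, not taken from the paper:

```python
# Naive-Bayes classification: pick the class Ci maximizing
# P(x1|Ci) * P(x2|Ci) * ... * P(xd|Ci) * P(Ci).

def classify(x, prior, cond):
    """x: tuple of attribute values; prior[c] = P(C=c);
    cond[c][i][v] = P(X_i = v | C=c). Returns the most probable class."""
    best_class, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for i, v in enumerate(x):
            score *= cond[c][i].get(v, 1e-9)  # tiny probability for unseen values
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy two-class example.
prior = {"C1": 0.5, "C2": 0.5}
cond = {"C1": [{"a": 0.9, "b": 0.1}, {"x": 0.8, "y": 0.2}],
        "C2": [{"a": 0.2, "b": 0.8}, {"x": 0.3, "y": 0.7}]}
print(classify(("a", "y"), prior, cond))  # "C1": 0.5*0.9*0.2 = 0.09 vs 0.5*0.2*0.7 = 0.07
```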
26. Different Approaches: Context-Specific Bayesian Clustering
- A RECOMB 2001 paper.
- Context-specific independence (CSI) model: each attribute Xi depends only on the classes in a set Li.
- E.g. if k = 5 and L1 = {1, 4}, then P(X1 | C=C2) = P(X1 | C=C3) = P(X1 | C=C5) = P(X1 | C=Cdef) (a sketch of this lookup follows below).
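A minimal sketch of how a local structure collapses class-conditional distributions; the distributions, attribute index and Li below are invented for illustration:

```python
# In a CSI model, attribute X_i has its own conditional distribution only for
# the classes listed in its local structure L_i; all other classes share the
# default distribution P(X_i | C_def).

def csi_conditional(i, c, local_structure, tables, default_tables):
    """P(X_i | C=c) under a CSI model.
    local_structure[i]: set of classes for which X_i has its own table."""
    if c in local_structure[i]:
        return tables[(i, c)]
    return default_tables[i]

# Example with k = 5 classes and L1 = {1, 4}: classes 2, 3 and 5 share the default.
local_structure = {1: {1, 4}}
tables = {(1, 1): {"low": 0.9, "high": 0.1}, (1, 4): {"low": 0.3, "high": 0.7}}
default_tables = {1: {"low": 0.5, "high": 0.5}}
assert csi_conditional(1, 2, local_structure, tables, default_tables) == \
       csi_conditional(1, 5, local_structure, tables, default_tables)
```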
27. Different Approaches: Context-Specific Bayesian Clustering
- A CSI model M contains: k, the number of classes; G, the set of attributes that depend on some classes; and Li, the local structures of the attributes.
- Parameters for a CSI model, θM: the probabilities P(C=Ci) and P(Xi | Li(Cj)).
28. Different Approaches: Context-Specific Bayesian Clustering
- Recall P(C=Ci | x1, x2, ..., xd) ∝ P(x1 | C=Ci) P(x2 | C=Ci) ... P(xd | C=Ci) P(C=Ci); in the CSI model, it equals P(X1 | L1(Ci)) P(X2 | L2(Ci)) ... P(Xd | Ld(Ci)) P(C=Ci).
- So, for a dataset (without class labels), if we can guess a CSI model and its parameters, then we can assign each data point to a class => clustering.
29. Different Approaches: Context-Specific Bayesian Clustering
- Searching for the best model and parameters:
- Define a score to rank the current model and parameters (BIC(M, θM) or CS(M, θM)).
- Randomly pick a model and a set of parameters and calculate the score.
- Try modifying the model (e.g. add an attribute to a local structure), and recalculate the score.
- If the score is better, keep it and try modifying a parameter.
30. Different Approaches: Context-Specific Bayesian Clustering
- Repeat until a stopping criterion is reached (e.g. using simulated annealing).
- M1, θM1 -> M2, θM1 -> M2, θM2 -> M3, θM2 -> ... (a sketch of this search loop follows below)
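A rough sketch of such a stochastic local search; the move and score functions are placeholders supplied by the caller, and a simple greedy acceptance rule is used here instead of simulated annealing:

```python
def search(initial_model, initial_params, score, modify_model, modify_params,
           max_steps=1000):
    """Alternately perturb the model structure and the parameters, keeping a
    change whenever it improves the score (a greedy variant of the search)."""
    model, params = initial_model, initial_params
    best = score(model, params)
    for step in range(max_steps):
        if step % 2 == 0:
            candidate = (modify_model(model), params)    # e.g. add an attribute to some Li
        else:
            candidate = (model, modify_params(params))   # e.g. re-estimate a probability
        new_score = score(*candidate)
        if new_score > best:
            model, params = candidate
            best = new_score
    return model, params, best
```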
31. Different Approaches: Context-Specific Bayesian Clustering
- The scoring functions (just to give a taste).
32. Different Approaches: Context-Specific Bayesian Clustering
- Problems with the context-specific Bayesian clustering approach:
- Cluster quality and execution time are not guaranteed.
- Easily gets stuck in a local minimum.
33. Focus: The Projective Clustering Approach
- PROCLUS (99), ORCLUS (00), etc.
- K-medoid partitional clustering.
- Basic idea: use a set of sample points to determine the relating dimensions for each cluster, assign points to the clusters according to the dimension sets, throw away some bad medoids, and repeat.
34. Focus: The Projective Clustering Approach
- Algorithm (3 phases)
- Initialization phase:
- Input k: target number of clusters.
- Input l: average number of dimensions in a cluster.
- Draw A*k samples randomly from the dataset, where A is a constant.
- Use the max-min algorithm to draw B*k points from the sample, where B is a constant < A. Call this set of points M.
35. Focus: The Projective Clustering Approach
- Iterative phase:
- Draw k medoids from M.
- For each medoid mi, calculate the Manhattan distance di (involving all dimensions) to the nearest medoid.
- Find all points in the whole dataset that are within a distance di from mi (see the sketch below).
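A small sketch of this step in plain Python over lists of coordinates; `dataset` and `medoids` are assumed inputs (sequences of equal-length tuples):

```python
def manhattan(p, q):
    """Manhattan distance over all dimensions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def localities(dataset, medoids):
    """For each medoid m_i, return the points within distance d_i of it,
    where d_i is the distance from m_i to its nearest other medoid."""
    result = []
    for i, m in enumerate(medoids):
        d_i = min(manhattan(m, other) for j, other in enumerate(medoids) if j != i)
        result.append([p for p in dataset if manhattan(p, m) <= d_i])
    return result
```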
36. Focus: The Projective Clustering Approach
- Finding the set of surrounding points for a
medoid
37. Focus: The Projective Clustering Approach
- The average distance between the points and the medoid along each dimension will be calculated.
- Among all k*d dimensions, select k*l of them with exceptionally small average distances. An extra restriction is that each medoid must pick at least 2 dimensions.
- Whether the average distance from the medoid along a particular dimension is exceptionally small in a cluster is determined by its standard score: Z(Ci, Dj) = (X(i,j) - mean(Ci)) / S.D.(Ci), where X(i,j) is the average distance of cluster Ci along dimension Dj, and the mean and S.D. are taken over all dimensions of Ci (a sketch of the selection follows below).
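A sketch of the greedy selection by standard score; the way the "at least 2 dimensions per medoid" constraint is enforced here (best 2 per medoid first, then fill up globally) is a simplifying assumption:

```python
from statistics import mean, stdev

def select_dimensions(avg_dists, l):
    """avg_dists[i][j]: average distance from medoid i along dimension j.
    Returns one dimension set per medoid, k*l dimensions in total,
    with at least 2 dimensions per medoid."""
    k = len(avg_dists)
    scored = []  # (z-score, medoid index, dimension index)
    for i, row in enumerate(avg_dists):
        m, s = mean(row), stdev(row)
        scored.extend(((x - m) / s, i, j) for j, x in enumerate(row))
    chosen = [set() for _ in range(k)]
    # Give every medoid its 2 best (smallest z-score) dimensions first...
    for i in range(k):
        for z, _, j in sorted(t for t in scored if t[1] == i)[:2]:
            chosen[i].add(j)
    # ...then fill up to k*l dimensions with the globally smallest remaining z-scores.
    remaining = k * l - 2 * k
    for z, i, j in sorted(scored):
        if remaining <= 0:
            break
        if j not in chosen[i]:
            chosen[i].add(j)
            remaining -= 1
    return chosen
```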
38. Focus: The Projective Clustering Approach
39. Focus: The Projective Clustering Approach
- Example:
- In cluster C1, the average distances from the medoid are 10 along dimension D1, 15 along D2, and 13 along D3. In cluster C2, the average distances are 7, 6 and 12.
- Mean(C1) = (10 + 15 + 13) / 3 = 12.67.
- S.D.(C1) = 2.52.
- Z(C1,D1) = (10 - 12.67) / 2.52 = -1.06.
- Similarly, Z(C1,D2) = 0.93, Z(C1,D3) = 0.13, Z(C2,D1) = -0.41, Z(C2,D2) = -0.73, Z(C2,D3) = 1.14.
- So the order to pick the dimensions will be C1D1 -> C2D2 -> C2D1 -> C1D3 -> C1D2 -> C2D3 (the sketch below reproduces these numbers).
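A tiny sketch reproducing these numbers, assuming the sample standard deviation is used:

```python
from statistics import mean, stdev

# Average distances from the medoid along D1, D2, D3 (values from the example above).
avg = {"C1": [10, 15, 13], "C2": [7, 6, 12]}

scores = []
for c, dists in avg.items():
    m, s = mean(dists), stdev(dists)
    for d, x in enumerate(dists, start=1):
        scores.append((round((x - m) / s, 2), f"{c}D{d}"))

print(sorted(scores))
# [(-1.06, 'C1D1'), (-0.73, 'C2D2'), (-0.41, 'C2D1'), (0.13, 'C1D3'), (0.93, 'C1D2'), (1.14, 'C2D3')]
```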
40. Focus: The Projective Clustering Approach
- Iterative phase (cont'd):
- Now, each medoid has a related set of dimensions. Assign every point in the whole dataset to the medoid closest to it, using a normalized distance function involving only the selected dimensions (see the sketch below).
- Calculate the overall score of the clustering. Record the cluster definitions (relating attributes and assigned points) if the score is the new best one.
- Throw away medoids with too few points. Replace them with some points remaining in M.
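A sketch of the assignment step, using the average per-dimension Manhattan distance over the selected dimensions as the normalized distance; this segmental form is one reasonable reading of the description above:

```python
def segmental_distance(p, medoid, dims):
    """Manhattan distance restricted to dims, normalized by the number of dims."""
    return sum(abs(p[j] - medoid[j]) for j in dims) / len(dims)

def assign_points(dataset, medoids, dim_sets):
    """Assign every point to the medoid with the smallest normalized distance,
    computed over that medoid's own selected dimensions."""
    clusters = [[] for _ in medoids]
    for p in dataset:
        best = min(range(len(medoids)),
                   key=lambda i: segmental_distance(p, medoids[i], dim_sets[i]))
        clusters[best].append(p)
    return clusters
```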
41. Focus: The Projective Clustering Approach
- Refinement phase:
- After determining the best set of medoids, use the assigned points to re-determine the sets of dimensions, and reassign all points.
- If the distance between a point and its medoid is greater than the distance between the medoid and its closest medoid, the point is marked as an outlier.
42. Focus: The Projective Clustering Approach
- Experiment:
- Dataset: synthetic, 100,000 points, 20 dimensions.
- Set 1: 5 clusters, each with 7 dimensions.
- Set 2: 5 clusters, with 2-7 dimensions.
- Machine: 233-MHz IBM RS/6000, 128 MB RAM, running AIX. Dataset stored on a 2 GB SCSI drive.
- Comparison: CLIQUE (grid-based).
43-47. Focus: The Projective Clustering Approach (result figures)
48. Focus: The Projective Clustering Approach
- Scalability (with dataset size)
49. Focus: The Projective Clustering Approach
- Scalability (with average dimensionality)
50. Focus: The Projective Clustering Approach
- Scalability (with space dimensionality)
51. Focus: The Projective Clustering Approach
- Problems with the projective clustering approach:
- Need to know l, the average number of dimensions.
- A cluster with a very small number of selected dimensions will absorb the points of other clusters.
- Using a distance measure over the whole dimension space to select the sets of dimensions may not be accurate, especially when the number of noise attributes is large.
52. Summary
- The subspace clustering problem: given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized.
- Grid-based dimension selection
- Association rule hypergraph partitioning
- Context-specific Bayesian clustering
- Projective clustering
53. References
- Grid-based dimension selection:
- "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications" (SIGMOD 1998)
- "Entropy-based Subspace Clustering for Mining Numerical Data" (SIGKDD 1999)
- "MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets" (Technical Report 9906-010, Northwestern University, 1999)
- Association rule hypergraph partitioning:
- "Clustering Based On Association Rule Hypergraphs" (Clustering Workshop 1997)
54. References (cont'd)
- "Multilevel Hypergraph Partitioning: Application in VLSI Domain" (DAC 1997)
- Context-specific Bayesian clustering:
- "Context-Specific Bayesian Clustering for Gene Expression Data" (RECOMB 2001)
- Projective clustering:
- "Fast Algorithms for Projected Clustering" (SIGMOD 1999)
- "Finding Generalized Projected Clusters in High Dimensional Spaces" (SIGMOD 2000)
- "A Monte Carlo Algorithm for Fast Projective Clustering" (SIGMOD 2002)