Title: DB Seminar Series: The Subspace Clustering Problem
1. DB Seminar Series: The Subspace Clustering Problem
- By Kevin Yip
- (17 May 2002)
2. Presentation Outline
- Problem definition
- Different approaches
- Focus: the projective clustering approach
3. Problem Definition: Traditional Clustering
- Traditional clustering problem: to divide data points into disjoint groups such that the value of an objective function is optimized.
- Objective function: to minimize intra-cluster distance and maximize inter-cluster distance.
- Distance function: defined over all dimensions, numeric or categorical.
4. Problem Definition: Traditional Clustering
- Example: clustering points in 2-D space.
- Distance function: Euclidean distance, dist(x, y) = sqrt(sum over i = 1..d of (xi - yi)^2), where d = number of dimensions (2 in this case).
5. Problem Definition: Traditional Clustering
- Example (source: CURE, SIGMOD 1998)
6. Problem Definition: The Distance Function Problem
- Observation: distance measures defined over all dimensions are sometimes inappropriate.
- Example (source: DOC, SIGMOD 2002):
- C1: (x1, x2)
- C2: (x2, x3)
- C3: (x1, x3)
7. Problem Definition: The Distance Function Problem
- As the number of noise dimensions increases, the distance functions become less and less accurate.
- => For each cluster, besides the set of data points, we also need to find the set of related dimensions (bounded attributes).
8. Problem Definition: The Subspace Clustering Problem
- Formal definition: given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized.
- Objective function: usually intra-cluster distance, where each cluster uses its own set of dimensions in the distance calculation.
9. Problem Definition: The Subspace Clustering Problem
- Observation: normal distance functions (Manhattan, Euclidean, etc.) give a smaller value if fewer dimensions are involved.
- => (1) Use a normalized distance function. (2) Also try to maximize the number of dimensions.
- Example (DOC): score(C, D) = |C| * (1/β)^|D|, where |C| = number of points in a cluster, |D| = number of relating attributes, and β is a constant (see the sketch below).
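A minimal sketch of this trade-off, assuming the score form above; the cluster sizes, dimension counts and β = 0.25 are made-up illustration values, not from the paper:

```python
# DOC-style quality score: score(C, D) = |C| * (1/beta)^|D|.
# Larger clusters and more bounded dimensions both raise the score, which
# counteracts the tendency of distances to shrink when fewer dimensions are used.

def doc_score(num_points: int, num_dims: int, beta: float = 0.25) -> float:
    """Score of a cluster with num_points points and num_dims bounded dimensions."""
    return num_points * (1.0 / beta) ** num_dims

# Hypothetical comparison: a small cluster bounded in many dimensions can
# outscore a larger cluster bounded in only one dimension.
print(doc_score(num_points=50, num_dims=3))   # 50 * 4^3 = 3200.0
print(doc_score(num_points=500, num_dims=1))  # 500 * 4   = 2000.0
```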
10. Different Approaches: Overview
- Grid-based dimension selection
- Association rule hypergraph partitioning
- Context-specific Bayesian clustering
- Projective clustering (Focus)
11. Different Approaches: Grid-Based Dimension Selection
- CLIQUE (98), ENCLUS (99), MAFIA (99), etc.
- Basic idea:
- A cluster is a region with high density.
- Divide the domain of each dimension into units.
- For each dimension, find all dense units: units with many points.
- Merge neighboring dense units into clusters.
- After finding all 1-d clusters, find 2-d dense units.
- Repeat with higher dimensions (see the sketch after this list).
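A simplified, non-optimized sketch of the bottom-up step for 1-d and 2-d units; the unit width, density threshold and candidate-generation rule here are assumptions for illustration, not the exact CLIQUE pruning:

```python
from collections import defaultdict
from itertools import combinations

def dense_units_1d(points, width=1.0, threshold=3):
    """Find 1-d dense units: (dimension, unit index) pairs with >= threshold points."""
    counts = defaultdict(int)
    for p in points:
        for dim, value in enumerate(p):
            counts[(dim, int(value // width))] += 1
    return {u for u, c in counts.items() if c >= threshold}

def dense_units_2d(points, dense_1d, width=1.0, threshold=3):
    """Candidate 2-d units combine 1-d dense units (monotonicity: a 2-d unit
    can only be dense if both of its 1-d projections are dense)."""
    counts = defaultdict(int)
    for p in points:
        units = [(dim, int(v // width)) for dim, v in enumerate(p)]
        for u1, u2 in combinations(units, 2):
            if u1 in dense_1d and u2 in dense_1d:
                counts[(u1, u2)] += 1
    return {u for u, c in counts.items() if c >= threshold}

# Toy data: two small dense regions in 2-D.
pts = [(2.1, 4.2), (2.3, 4.4), (2.8, 4.1), (6.5, 1.2), (6.7, 1.4), (6.9, 1.1)]
d1 = dense_units_1d(pts)
d2 = dense_units_2d(pts, d1)
```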
12. Different Approaches: Grid-Based Dimension Selection
- A 2-D dataset for illustration
13. Different Approaches: Grid-Based Dimension Selection
- Divide the domain of each dimension into
sub-units.
14. Different Approaches: Grid-Based Dimension Selection
- Find all dense units: units with many points.
- (Assume density threshold = 3 points.)
15. Different Approaches: Grid-Based Dimension Selection
- Merge neighboring dense units into clusters.
16. Different Approaches: Grid-Based Dimension Selection
- Find 2-d dense units. Merge neighboring dense
units, if any.
17. Different Approaches: Grid-Based Dimension Selection
- Repeat with higher dimensions.
18. Different Approaches: Grid-Based Dimension Selection
- Results: 1-d: <d1: (2,5)>, <d1: (6,8)>, <d2: (1,3)>, <d2: (4,6)>.
- 2-d: <d1,d2: (4,5), (4,5)>, <d1,d2: (7,8), (4,5)>.
19. Different Approaches: Grid-Based Dimension Selection
- Problems with the grid-based dimension selection approach:
- Non-disjoint clusters.
- Exponential dependency on the number of dimensions.
20. Different Approaches: Association Rule Hypergraph Partitioning
- 1997
- Cluster related items (attribute values) using association rules, and cluster related transactions (data points) using the clusters of items.
21. Different Approaches: Association Rule Hypergraph Partitioning
- Procedures:
- Find all frequent itemsets in the dataset.
- Construct a hypergraph with each item as a vertex, and each hyperedge corresponding to a frequent itemset. (If {A, B, C} is a frequent itemset, there is a hyperedge connecting the vertices of A, B, and C.)
22. Different Approaches: Association Rule Hypergraph Partitioning
- Procedures:
- Each hyperedge is assigned a weight equal to a function of the confidences of all the association rules between the connected items. (If there are association rules A => {B,C} (conf. 0.8), {A,B} => C (conf. 0.4), {A,C} => B (conf. 0.6), B => {A,C} (conf. 0.4), {B,C} => A (conf. 0.8) and C => {A,B} (conf. 0.6), then the weight of the hyperedge {A,B,C} can be the average of the confidences, i.e. 0.6; see the sketch below.)
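A rough sketch of this weighting, assuming rule confidences are already available from a rule miner; the numbers reproduce the example above:

```python
# Hyperedge weight = average confidence of the association rules among the
# items of one frequent itemset (one possible choice of weighting function).

def hyperedge_weight(rule_confidences):
    """rule_confidences: confidences of all rules derived from the itemset."""
    return sum(rule_confidences) / len(rule_confidences)

# Rules for itemset {A, B, C} from the example:
# A=>{B,C}: 0.8, {A,B}=>C: 0.4, {A,C}=>B: 0.6, B=>{A,C}: 0.4, {B,C}=>A: 0.8, C=>{A,B}: 0.6
print(hyperedge_weight([0.8, 0.4, 0.6, 0.4, 0.8, 0.6]))  # 0.6
```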
23. Different Approaches: Association Rule Hypergraph Partitioning
- Procedures:
- Use a hypergraph partitioning algorithm (e.g. HMETIS, 97) to divide the hypergraph into k partitions, so that the sum of the weights of the hyperedges that straddle partitions is minimized. Each partition forms a cluster with a different subset of items.
- Assign each transaction to a cluster, based on a scoring function (e.g. percentage of matched items).
24. Different Approaches: Association Rule Hypergraph Partitioning
- Problems with the association rule hypergraph partitioning approach:
- In real datasets, an item can be related to multiple clusters.
- May not be applicable to numeric attributes.
25. Different Approaches: Context-Specific Bayesian Clustering
- Naïve-Bayesian classification: given a training set with classes Ci (i = 1..k), a data point with attribute values x1, x2, ..., xd is classified by
  P(C=Ci | x1, x2, ..., xd) = P(x1, x2, ..., xd | C=Ci) P(C=Ci) / P(x1, x2, ..., xd)
  ∝ P(x1, x2, ..., xd | C=Ci) P(C=Ci)
  = P(x1 | C=Ci) P(x2 | C=Ci) ... P(xd | C=Ci) P(C=Ci)
  (a sketch follows below).
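A small sketch of this classification rule for categorical attributes; the probability tables and class names are placeholders for illustration, not taken from the paper:

```python
# Naive-Bayes classification: pick the class Ci maximizing
# P(x1|Ci) * P(x2|Ci) * ... * P(xd|Ci) * P(Ci).

def classify(x, prior, cond):
    """x: tuple of attribute values; prior[c] = P(C=c);
    cond[c][i][v] = P(X_i = v | C=c). Returns the most probable class."""
    best_class, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for i, v in enumerate(x):
            score *= cond[c][i].get(v, 1e-9)  # tiny probability for unseen values
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy two-class example.
prior = {"C1": 0.5, "C2": 0.5}
cond = {"C1": [{"a": 0.9, "b": 0.1}, {"x": 0.8, "y": 0.2}],
        "C2": [{"a": 0.2, "b": 0.8}, {"x": 0.3, "y": 0.7}]}
print(classify(("a", "y"), prior, cond))  # "C1": 0.5*0.9*0.2 = 0.09 vs 0.5*0.2*0.7 = 0.07
```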
26. Different Approaches: Context-Specific Bayesian Clustering
- A RECOMB 2001 paper.
- Context-specific independence (CSI) model: each attribute Xi depends only on the classes in a set Li.
- E.g. if k = 5 and L1 = {1, 4}, then P(X1 | C=C2) = P(X1 | C=C3) = P(X1 | C=C5) = P(X1 | C=Cdef) (a sketch of this lookup follows below).
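A minimal sketch of how a local structure collapses class-conditional distributions; the distributions, attribute index and Li below are invented for illustration:

```python
# In a CSI model, attribute X_i has its own conditional distribution only for
# the classes listed in its local structure L_i; all other classes share the
# default distribution P(X_i | C_def).

def csi_conditional(i, c, local_structure, tables, default_tables):
    """P(X_i | C=c) under a CSI model.
    local_structure[i]: set of classes for which X_i has its own table."""
    if c in local_structure[i]:
        return tables[(i, c)]
    return default_tables[i]

# Example with k = 5 classes and L1 = {1, 4}: classes 2, 3 and 5 share the default.
local_structure = {1: {1, 4}}
tables = {(1, 1): {"low": 0.9, "high": 0.1}, (1, 4): {"low": 0.3, "high": 0.7}}
default_tables = {1: {"low": 0.5, "high": 0.5}}
assert csi_conditional(1, 2, local_structure, tables, default_tables) == \
       csi_conditional(1, 5, local_structure, tables, default_tables)
```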
27. Different Approaches: Context-Specific Bayesian Clustering
- A CSI model M contains: k, the number of classes; G, the set of attributes that depend on some classes; and Li, the local structures of the attributes.
- Parameters for a CSI model, θM: the probabilities P(C=Ci) and P(Xi | Li(Cj)).
28. Different Approaches: Context-Specific Bayesian Clustering
- Recall P(C=Ci | x1, x2, ..., xd) ∝ P(x1 | C=Ci) P(x2 | C=Ci) ... P(xd | C=Ci) P(C=Ci); in the CSI model, it equals P(X1 | L1(Ci)) P(X2 | L2(Ci)) ... P(Xd | Ld(Ci)) P(C=Ci).
- So, for a dataset (without class labels), if we can guess a CSI model and its parameters, then we can assign each data point to a class => clustering.
29. Different Approaches: Context-Specific Bayesian Clustering
- Searching for the best model and parameters:
- Define a score to rank the current model and parameters (BIC(M, θM) or CS(M, θM)).
- Randomly pick a model and a set of parameters and calculate the score.
- Try modifying the model (e.g. add an attribute to a local structure), and recalculate the score.
- If the score is better, keep it and try modifying a parameter.
30. Different Approaches: Context-Specific Bayesian Clustering
- Repeat until a stopping criterion is reached (e.g. using simulated annealing).
- M1, θM1 -> M2, θM1 -> M2, θM2 -> M3, θM2 -> ... (a sketch of this search loop follows below)
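A rough sketch of such a stochastic local search; the move and score functions are placeholders supplied by the caller, and a simple greedy acceptance rule is used here instead of simulated annealing:

```python
def search(initial_model, initial_params, score, modify_model, modify_params,
           max_steps=1000):
    """Alternately perturb the model structure and the parameters, keeping a
    change whenever it improves the score (a greedy variant of the search)."""
    model, params = initial_model, initial_params
    best = score(model, params)
    for step in range(max_steps):
        if step % 2 == 0:
            candidate = (modify_model(model), params)    # e.g. add an attribute to some Li
        else:
            candidate = (model, modify_params(params))   # e.g. re-estimate a probability
        new_score = score(*candidate)
        if new_score > best:
            model, params = candidate
            best = new_score
    return model, params, best
```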
31. Different Approaches: Context-Specific Bayesian Clustering
- The scoring functions (just to give a taste).
32. Different Approaches: Context-Specific Bayesian Clustering
- Problems with the context-specific Bayesian clustering approach:
- Cluster quality and execution time are not guaranteed.
- Easily gets stuck in a local minimum.
33. Focus: The Projective Clustering Approach
- PROCLUS (99), ORCLUS (00), etc.
- K-medoid partitional clustering.
- Basic idea: use a set of sample points to determine the relating dimensions for each cluster, assign points to the clusters according to the dimension sets, throw away some bad medoids, and repeat.
34. Focus: The Projective Clustering Approach
- Algorithm (3 phases)
- Initialization phase:
- Input k: target number of clusters.
- Input l: average number of dimensions in a cluster.
- Draw A*k samples randomly from the dataset, where A is a constant.
- Use the max-min algorithm to draw B*k points from the sample, where B is a constant < A. Call this set of points M.
35. Focus: The Projective Clustering Approach
- Iterative phase:
- Draw k medoids from M.
- For each medoid mi, calculate the Manhattan distance di (involving all dimensions) to the nearest medoid.
- Find all points in the whole dataset that are within a distance di from mi (see the sketch below).
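A small sketch of this step in plain Python over lists of coordinates; `dataset` and `medoids` are assumed inputs (sequences of equal-length tuples):

```python
def manhattan(p, q):
    """Manhattan distance over all dimensions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def localities(dataset, medoids):
    """For each medoid m_i, return the points within distance d_i of it,
    where d_i is the distance from m_i to its nearest other medoid."""
    result = []
    for i, m in enumerate(medoids):
        d_i = min(manhattan(m, other) for j, other in enumerate(medoids) if j != i)
        result.append([p for p in dataset if manhattan(p, m) <= d_i])
    return result
```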
36. Focus: The Projective Clustering Approach
- Finding the set of surrounding points for a
medoid
37. Focus: The Projective Clustering Approach
- The average distance between the points and the medoid along each dimension will be calculated.
- Among all k*d dimensions, select k*l of them with exceptionally small average distances. An extra restriction is that each medoid must pick at least 2 dimensions.
- Whether the average distance from the medoid along a particular dimension is exceptionally small in a cluster is determined by its standard score: Z(Ci, Dj) = (X(i,j) - mean(Ci)) / S.D.(Ci), where X(i,j) is the average distance of cluster Ci along dimension Dj, and the mean and S.D. are taken over all dimensions of Ci (a sketch of the selection follows below).
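A sketch of the greedy selection by standard score; the way the "at least 2 dimensions per medoid" constraint is enforced here (best 2 per medoid first, then fill up globally) is a simplifying assumption:

```python
from statistics import mean, stdev

def select_dimensions(avg_dists, l):
    """avg_dists[i][j]: average distance from medoid i along dimension j.
    Returns one dimension set per medoid, k*l dimensions in total,
    with at least 2 dimensions per medoid."""
    k = len(avg_dists)
    scored = []  # (z-score, medoid index, dimension index)
    for i, row in enumerate(avg_dists):
        m, s = mean(row), stdev(row)
        scored.extend(((x - m) / s, i, j) for j, x in enumerate(row))
    chosen = [set() for _ in range(k)]
    # Give every medoid its 2 best (smallest z-score) dimensions first...
    for i in range(k):
        for z, _, j in sorted(t for t in scored if t[1] == i)[:2]:
            chosen[i].add(j)
    # ...then fill up to k*l dimensions with the globally smallest remaining z-scores.
    remaining = k * l - 2 * k
    for z, i, j in sorted(scored):
        if remaining <= 0:
            break
        if j not in chosen[i]:
            chosen[i].add(j)
            remaining -= 1
    return chosen
```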
38. Focus: The Projective Clustering Approach
39. Focus: The Projective Clustering Approach
- Example:
- In cluster C1, the average distances from the medoid are 10 along dimension D1, 15 along D2, and 13 along D3. In cluster C2, the average distances are 7, 6 and 12.
- Mean(C1) = (10 + 15 + 13) / 3 = 12.67.
- S.D.(C1) = 2.52.
- Z(C1,D1) = (10 - 12.67) / 2.52 = -1.06.
- Similarly, Z(C1,D2) = 0.93, Z(C1,D3) = 0.13, Z(C2,D1) = -0.41, Z(C2,D2) = -0.73, Z(C2,D3) = 1.14.
- So the order to pick the dimensions will be C1D1 -> C2D2 -> C2D1 -> C1D3 -> C1D2 -> C2D3 (the sketch below reproduces these numbers).
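A tiny sketch reproducing these numbers, assuming the sample standard deviation is used:

```python
from statistics import mean, stdev

# Average distances from the medoid along D1, D2, D3 (values from the example above).
avg = {"C1": [10, 15, 13], "C2": [7, 6, 12]}

scores = []
for c, dists in avg.items():
    m, s = mean(dists), stdev(dists)
    for d, x in enumerate(dists, start=1):
        scores.append((round((x - m) / s, 2), f"{c}D{d}"))

print(sorted(scores))
# [(-1.06, 'C1D1'), (-0.73, 'C2D2'), (-0.41, 'C2D1'), (0.13, 'C1D3'), (0.93, 'C1D2'), (1.14, 'C2D3')]
```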
40. Focus: The Projective Clustering Approach
- Iterative phase (cont'd):
- Now, each medoid has a related set of dimensions. Assign every point in the whole dataset to the medoid closest to it, using a normalized distance function involving only the selected dimensions (see the sketch below).
- Calculate the overall score of the clustering. Record the cluster definitions (relating attributes and assigned points) if the score is the new best one.
- Throw away medoids with too few points. Replace them with some points remaining in M.
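A sketch of the assignment step, using the average per-dimension Manhattan distance over the selected dimensions as the normalized distance; this segmental form is one reasonable reading of the description above:

```python
def segmental_distance(p, medoid, dims):
    """Manhattan distance restricted to dims, normalized by the number of dims."""
    return sum(abs(p[j] - medoid[j]) for j in dims) / len(dims)

def assign_points(dataset, medoids, dim_sets):
    """Assign every point to the medoid with the smallest normalized distance,
    computed over that medoid's own selected dimensions."""
    clusters = [[] for _ in medoids]
    for p in dataset:
        best = min(range(len(medoids)),
                   key=lambda i: segmental_distance(p, medoids[i], dim_sets[i]))
        clusters[best].append(p)
    return clusters
```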
41. Focus: The Projective Clustering Approach
- Refinement phase:
- After determining the best set of medoids, use the assigned points to re-determine the sets of dimensions, and reassign all points.
- If the distance between a point and its medoid is greater than the distance between the medoid and its closest medoid, the point is marked as an outlier.
42. Focus: The Projective Clustering Approach
- Experiment:
- Dataset: synthetic, 100,000 points, 20 dimensions.
- Set 1: 5 clusters, each with 7 dimensions.
- Set 2: 5 clusters, with 2-7 dimensions.
- Machine: 233-MHz IBM RS/6000, 128 MB RAM, running AIX. Dataset stored on a 2 GB SCSI drive.
- Comparison: CLIQUE (grid-based).
43-47. Focus: The Projective Clustering Approach (result figures)
48. Focus: The Projective Clustering Approach
- Scalability (with dataset size)
49. Focus: The Projective Clustering Approach
- Scalability (with average dimensionality)
50. Focus: The Projective Clustering Approach
- Scalability (with space dimensionality)
51. Focus: The Projective Clustering Approach
- Problems with the projective clustering approach:
- Need to know l, the average number of dimensions.
- A cluster with a very small number of selected dimensions will absorb the points of other clusters.
- Using a distance measure over the whole dimension space to select the sets of dimensions may not be accurate, especially when the number of noise attributes is large.
52. Summary
- The subspace clustering problem: given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized.
- Grid-based dimension selection
- Association rule hypergraph partitioning
- Context-specific Bayesian clustering
- Projective clustering
53. References
- Grid-based dimension selection:
- "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications" (SIGMOD 1998)
- "Entropy-based Subspace Clustering for Mining Numerical Data" (SIGKDD 1999)
- "MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets" (Technical Report 9906-010, Northwestern University, 1999)
- Association rule hypergraph partitioning:
- "Clustering Based On Association Rule Hypergraphs" (Clustering Workshop 1997)
54. References (cont'd)
- "Multilevel Hypergraph Partitioning: Application in VLSI Domain" (DAC 1997)
- Context-specific Bayesian clustering:
- "Context-Specific Bayesian Clustering for Gene Expression Data" (RECOMB 2001)
- Projective clustering:
- "Fast Algorithms for Projected Clustering" (SIGMOD 1999)
- "Finding Generalized Projected Clusters in High Dimensional Spaces" (SIGMOD 2000)
- "A Monte Carlo Algorithm for Fast Projective Clustering" (SIGMOD 2002)