Title: Uncertain Data Clustering: Models, Methods and Applications
1. Uncertain Data Clustering: Models, Methods and Applications
- Anthony K.H. Tung
- Zhenjie Zhang
2. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
3. Clustering
- Unsupervised learning
- Clustering: dividing objects into groups
- Modeling: deriving a probability model, i.e., a probability density function (pdf), that fits the observations
[Figure: an example 2-clustering]
4. Applications of Clustering
- Applications of unsupervised learning
- Gain insight into the data
- Design promotion plans for different clusters of customers
- Optimize search engines
5. Applications of Clustering
- Applications of unsupervised learning
- As a pre-processing tool for other, more complex operations
[Figure: the probability of point A being a skyline point equals the probability of having no point in the shaded region; the density distribution is modeled by a mixture of 3 Gaussian distributions]
6. Applications of Clustering
- In general, if one can do clustering efficiently (in terms of minimizing computation, communication, and energy cost), many other problems can be solved on top of it
[Figure: a layered stack, bottom to top: dynamically changing, distributed data sources; efficient uncertain clustering algorithm (k-means, EM); relatively stable density distribution; range query, kNN query, skyline query]
7. Clustering Methods
- Many unsupervised learning methods exist
- Clustering methods: K-Means clustering, K-Median clustering
- Modeling methods: EM algorithm
- A basic requirement
- The information about each object is exact on every dimension
8. Uncertainty
- Uncertainty is ubiquitous in real systems, for reasons including:
- Limited data collection accuracy (e.g., GPS readings)
9. Uncertainty
- Uncertainty is ubiquitous in real systems, for reasons including:
- The uncertain nature of the objects themselves (e.g., moving animals)
10. Uncertainty
- Uncertainty is ubiquitous in real systems, for reasons including:
- Update cost issues (e.g., distributed databases)
11. Motivation
- A simple extension of the old methods?
- Find a representative location for each object [NKC06]
[Figure: clustering on representative locations; labels: boundary, center of boundary, actual location, approximate center, actual center]
12. Motivation
- Does having more information help?
- If we can pay some cost to retrieve the exact positions of some objects, then the clustering can be more accurate
13. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
14. Categories of Clustering
- Distance-Based Clustering
- K-Center, K-Means, K-Median
- Density-Based Clustering
- DBSCAN, CLIQUE
- Graph-Based Clustering
- Normalized-Cut, Spectral Clustering
- Margin-Based Clustering
- Support Vector Clustering
16. Distance-Based Clustering
- K-Center: choose k centers from the data set, minimizing the maximum distance from any point to its closest center
17. Distance-Based Clustering
- K-Means: choose k centers from the space, minimizing the sum of squared Euclidean distances from the points to their closest centers
18. Distance-Based Clustering
- K-Median: choose k centers from the space, minimizing the sum of some metric distance from the points to their closest centers
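Side by side, the three objectives differ only in how the per-point distances are aggregated (standard formulations consistent with the slides; k-center restricts C to the data set, while k-means and k-median may pick centers anywhere in the space):

```latex
\text{K-Center:}\;  \min_{|C|=k}\, \max_{p \in P}\, \min_{c \in C} d(p, c)
\qquad
\text{K-Means:}\;   \min_{|C|=k}\, \sum_{p \in P}\, \min_{c \in C} \lVert p - c \rVert^2
\qquad
\text{K-Median:}\;  \min_{|C|=k}\, \sum_{p \in P}\, \min_{c \in C} d(p, c)
```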
19. Distance-Based Clustering
- Classic solutions to K-Center
- Greedy incremental: a 2-approximation in O(nk) time [G85]
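A minimal sketch of the greedy incremental (farthest-point) algorithm of [G85]; the function and variable names are assumptions, not taken from the paper:

```python
import numpy as np

def greedy_k_center(points, k, rng=None):
    """Gonzalez's greedy 2-approximation for k-center [G85].
    points: (n, d) array; returns indices of the k chosen centers."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    centers = [int(rng.integers(n))]          # start from an arbitrary point
    # dist[i] = distance from point i to its closest chosen center
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(k - 1):
        far = int(np.argmax(dist))            # farthest point becomes the next center
        centers.append(far)
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return centers
```

Each of the k rounds scans all n points once, giving the O(nk) bound.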
20. Distance-Based Clustering
- Classic solutions to K-Means
- K-Means algorithm: convergence guaranteed only to a local optimum
- PTAS: approximation within any error ε, in time polynomial w.r.t. 1/ε [KSS04]
- Local search heuristic: (9+ε)-approximation [KMN02]
- Greedy incremental seeding: O(log k)-approximate in quadratic time [AV07]
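For comparison, a sketch of the [AV07] seeding rule (k-means++), which samples each new center with probability proportional to the squared distance from the centers already chosen; the implementation details are illustrative:

```python
import numpy as np

def kmeans_pp_seeding(points, k, rng=None):
    """k-means++ seeding [AV07]: O(log k)-approximate in expectation."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    centers = [points[rng.integers(n)]]
    d2 = np.linalg.norm(points - centers[0], axis=1) ** 2
    for _ in range(k - 1):
        probs = d2 / d2.sum()                 # sample proportional to D(x)^2
        idx = int(rng.choice(n, p=probs))
        centers.append(points[idx])
        d2 = np.minimum(d2, np.linalg.norm(points - points[idx], axis=1) ** 2)
    return np.array(centers)
```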
21. Distance-Based Clustering
- Classic solutions to K-Median
- PTAS: the first approximation scheme for k-median in 2-dimensional Euclidean space [ARR98]
- Local search heuristic: 5-approximation [AGK01]
22. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
23. Basic Principles
- Expected case analysis
- Minimize some objective function in expectation
- Convergence analysis
- Does the clustering converge, given more samples?
- Worst case analysis
- Minimize the maximum change of the clustering
24. Expected Case Analysis
- Input
- Points with distribution information
25. Expected Case Analysis
- Output
- Clustering minimizing objective function in
expectation
26. Expected Case Analysis
- Output
- Clustering minimizing the objective function in expectation
[Figure: Exp(Dist(C1, P1)) = Σ_j p_j · Dist(C1, x_j), the distance to each possible location x_j of P1 weighted by its probability p_j; the slide's example uses probabilities 0.3, 0.05, 0.7, 0.05, 0.05]
27. Expected Case Analysis
- K-Center under expected case analysis [CM08]
- A bicriteria algorithm can find O(k ε⁻¹ log² n) centers with (1+ε)-approximate cost in polynomial time
- K-Means and K-Median? [CM08]
- A simple reduction transforms the problem into traditional weighted K-Means and weighted K-Median
28. Expected Case Analysis
[Figure: example of the reduction from uncertain points to weighted points]
29. Expected Case Analysis
- Reduction
- {P1, P2, P3} ⇒ {Q1, Q2, Q3, Q4, Q5, Q6, Q7}
- Given any center set C, the expected K-Means or K-Median costs on these two sets are always the same, because of the linearity of expectation (see the derivation below)
- Any existing weighted k-means or k-median algorithm can be applied now!
- Why not k-center?
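Why the reduction works, in one line (notation mine: uncertain point P_i takes location x_ij with probability p_ij):

```latex
\mathbb{E}\Big[\min_{c \in C} \lVert P_i - c \rVert^2\Big]
  \;=\; \sum_{j} p_{ij}\, \min_{c \in C} \lVert x_{ij} - c \rVert^2
```

Summing over i, the expected k-means cost of the uncertain set equals the ordinary k-means cost of the weighted point set {(x_ij, p_ij)}; the same holds for k-median with distances in place of squared distances. K-center resists this reduction because its objective is a maximum over points, and the expectation of a maximum does not decompose point by point.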
30. Expected Case Analysis
- Advantage
- More robust than the naïve model
- Disadvantage
- Every object must have an explicit distribution representation
- The clustering result can have large variance
31. Convergence Analysis
- Input
- Random points instead of uncertain points
- Assumption: the points are samples from some fixed but unknown distribution
32. Convergence Analysis
- Output
- A stable clustering: the result the algorithm converges to asymptotically as more samples arrive
33. Convergence Analysis
- A general result [BLP06]
- The stability of the clustering does not depend on the algorithm, but on the underlying objective function and the data distribution
- For K-Means clustering [BPS07]
- If the k-means objective has a unique minimizer over the distribution, the clustering converges to it as more samples are generated
34. Convergence Analysis
- Advantage
- Better for parameter selection
- Strong statistical support
- Disadvantage
- Not all distributions are clusterable
- Hard to verify whether the current distribution satisfies the condition
35. A Simple Comparison
[Figure: the three principles placed on an axis measuring the variance of the clustering result, from 0 (convergence analysis) through worst case analysis to unbounded (expected case analysis)]
36. Worst Case Analysis
- Input
- Possible positions of the points
- No distribution information
37. Worst Case Analysis
- Output
- A clustering of the objects, together with the maximum possible change of that clustering
38. Worst Case Analysis
- Advantage
- No distribution information is needed
- More sub-models can be derived
- Disadvantage
- Some detailed information is wasted
39. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
40. Worst Case Models
- Concepts
- Object uncertainty, satisfaction by exact data, clustering uncertainty
- Models
- ZUM, SUM, DUM, RUM
- All models are independent of the underlying clustering method
- K-Means is used as the running example
41. Worst Case Models
- Object uncertainty
- Only a bounding sphere in the space
- No distribution assumption
- General enough to handle complex distributions
42. Worst Case Models
- An exact data set D satisfies an uncertain data set P if every point di in D lies inside the corresponding sphere pi in P (a small check is sketched below)
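A minimal sketch of this satisfaction test, assuming P is represented by sphere centers and radii (the array names are mine):

```python
import numpy as np

def satisfies(D, sphere_centers, radii):
    """True iff exact data set D satisfies uncertain data set P,
    where P is given as bounding spheres (sphere_centers[i], radii[i]).
    D, sphere_centers: (n, d) arrays; radii: (n,) array."""
    return bool(np.all(np.linalg.norm(D - sphere_centers, axis=1) <= radii))
```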
43. Worst Case Models
- Given an exact data set D and two clusterings C1 and C2, there is a mapping f from (C1, C2) to a real value, measuring the difference in their quality
- This mapping usually depends on the objective function of the clustering problem, such as k-means
[Figure: two clusterings C1 and C2 of the same data]
44. Worst Case Models
[Figure: example with k-means cost 20 for C1 and 18 for C2, so f(C1, C2) = 20 − 18 = 2]
45. Worst Case Models
- Clustering uncertainty
- Given an uncertain data set P, a clustering C, and an exact clustering algorithm A, the uncertainty of C is the maximum f(C, C′), where C′ ranges over the clustering results of A on any exact data set D satisfying P (formalized below)
[Figure: A = k-means; C is computed from the uncertain object locations, C′ from the exact locations in a given data set D]
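In symbols (the name U_A is mine; the definition is the one just stated):

```latex
U_A(C, P) \;=\; \max_{D \,\models\, P} \; f\big(C,\, A(D)\big)
```

where D ⊨ P ranges over all exact data sets satisfying P.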
46. Zero Uncertain Model
- Zero Uncertain Model (ZUM) [ZDT06]
- Every object is certain
- The clustering is uncertain due to incomplete computation (e.g., to save space in data streaming)
- Derive some bound on the movement of the centers
48. Static Uncertain Model
- Static Uncertain Model (SUM) [ZDT08a]
- The objects are uncertain instead of exact
- Find the worst case of f(C, C′)
- By no means an easy problem, since the centers are uncertain too
[Figure: boundaries of potential centers; a point near the boundary could belong to the red center]
50. Dissolvable Uncertain Model
- Dissolvable Uncertain Model (DUM)
- The error bound can be large in SUM
- Dissolution: obtain an object's exact location at some cost
- Find the minimal cost to reduce the error to some specified degree
[Figure: dissolving some objects updates the centers and reduces the error]
52. Reverse Uncertain Model
- Reverse Uncertain Model (RUM) [ZYTP08]
- We have exact objects, but want to derive bounds on their movement
- While every object stays within its bound, the clustering remains stable, i.e., the centers do not change much
[Figure: an object stays assigned to the black center as long as its furthest possible distance to the black center is smaller than its nearest possible distance to the red center]
54. K-Means Algorithm
- K-Means algorithm (a minimal sketch follows the list)
- Initialization: randomly choose k centers
- Re-assignment: assign each point to its closest center
- Center update: replace each previous center with the new mean of its cluster
- Convergence: stop when there is no assignment change
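A minimal sketch of these four steps (an assumed NumPy implementation; names are illustrative):

```python
import numpy as np

def k_means(points, k, max_iter=100, rng=None):
    """Lloyd's k-means: random init, re-assignment, center update,
    stop when the assignments no longer change."""
    rng = np.random.default_rng() if rng is None else rng
    centers = points[rng.choice(points.shape[0], k, replace=False)].astype(float)
    assign = None
    for _ in range(max_iter):
        # Re-assignment: each point joins its closest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                          # convergence: no assignment change
        assign = new_assign
        # Center update: replace each center with its cluster mean.
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assign
```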
55. K-Means Algorithm
- Advantages of the K-Means algorithm
- Simple to understand
- Popular in real applications
- Clustering representation for exact k-means
- Cluster centers
- The cost is the sum of squared Euclidean distances from the objects to their cluster centers
- The movement of the cluster centers bounds the change in cost
56. K-Means in ZUM
- K-Means algorithm under ZUM [ZDT06]
- Input: centers of the current clustering
- The problem is to find out how far the cluster centers may move in the following iterations
57. K-Means in ZUM
[Figure: cluster C1; if the centers move by no more than r, the change in the cost of any clustering is bounded within [−r²N, +r²N]]
58. Solution Space
- Given a d-dimensional problem space, we define the solution space as a kd-dimensional space: a single point in it encodes all k centers at once
[Figure: a center set M1 with centers c1, c2 in the problem space, plotted as one point in the solution space]
59. Solution Space
- With every iteration, the center set moves in the solution space
[Figure: successive iterations trace a path M1 → M2 → M3 in the solution space]
60. Maximal Region
- A maximal region is a region in the solution space that bounds the local optimum reachable in future iterations
- A region is a maximal region for center set M if
- It contains M
- Any solution on the boundary of the region has cost at least equal to that of M
61. Maximal Region
[Figure: the cost of center sets in the solution space shown as contour lines, lighter color meaning smaller cost; any solution between M1 and M2 must have smaller cost than M1]
62. Maximal Region
[Figure: a maximal region around M1 and M2; the local optimum must lie inside it]
63. Maximal Region
[Figure: a candidate maximal region in which every center moves by no more than Δ]
64. Maximal Region
[Figure: the center set M1 in the solution space corresponds to the individual centers m1 and m2 in the problem space]
65. Maximal Region
- How to verify a maximal region? (See the inequality sketched below.)
- Step 1: fix the current cluster assignment and move at least one center by Δ; the cost will increase
- Step 2: fix the current cluster centers and reassign the points to their closest centers; the cost will decrease
- If the increase in the first step is larger than the decrease in the second step, the center sets on the boundary are worse than the current one
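The check in one inequality (incr and decr are my shorthand for the step-1 increase and step-2 decrease): for any center set M′ on the boundary of the candidate region,

```latex
\mathrm{cost}(M') \;\ge\; \mathrm{cost}(M) \;+\; \mathrm{incr}(\Delta) \;-\; \mathrm{decr}(\Delta)
```

so incr(Δ) ≥ decr(Δ) guarantees cost(M′) ≥ cost(M), i.e., the region is maximal.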
66. Maximal Region
- How to find such a maximal region?
- If every center is allowed to move by distance Δ, how much will the cost of every cluster increase, assuming every object remains in its cluster?
67. Maximal Region
- How to find such a maximal region?
- If every center is allowed to move by distance Δ, how much can every object improve by switching to another cluster?
- Case 1
68. Maximal Region
- How to find such a maximal region?
- Case 2
69. Maximal Region
- How to find such a maximal region?
- Case 3
70. Maximal Region
- How to find such a maximal region?
- Assume d1 is the distance of pi to its nearest center and d2 is the distance of pi to its second nearest center
- When Δ is smaller than (d2 − d1)/2, pi is in Case 1
- When Δ is between (d2 − d1)/2 and d2, pi is in Case 2
- When Δ is larger than d2, pi is in Case 3
71. Maximal Region
- How to find such a maximal region?
- We can split the range of Δ into 2n+1 segments
- In each segment, if some quadratic equation has a real root, there is at least one valid upper bound inside
72. Maximal Region
- Complexity
- The breakpoints can be sorted in O(n log n) time
- Every quadratic equation can be solved in constant time
- The total complexity is O(n log n)
73. K-Means in SUM
- K-Means in SUM
- Intra-cluster uncertainty: the average of the radii
74. K-Means in SUM
- K-Means in SUM
- Inter-cluster uncertainty
75. K-Means Algorithm
- K-Means in SUM
- Inter-cluster uncertainty: similar to ZUM
76. K-Means Algorithm
- K-Means in SUM
- Inter-cluster uncertainty: similar to ZUM
- The complexity is the same as in ZUM: O(n log n) time
77. K-Means Algorithm
- DUM
- The cost of retrieving the exact position of object i is ci
- NP-hard
- Even NP-hard to approximate within any constant factor! Only heuristics are available!
78. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
79. EM Algorithm in ZUM
- A new result on the EM algorithm under the ZUM model [ZDT08b]
- Problem
- Given a group of points and a Gaussian mixture model
- How far is the real local optimum from the current one?
80. Gaussian Mixture Model
- Basics of GMM
- There are k components following Gaussian distributions
- The probability of observing x from component i
- The probability of observing x (both formulas are spelled out after the next slide)
81. Gaussian Mixture Model
- Find the configuration maximizing the probability of the whole data set
- Equivalent to maximizing the log likelihood
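Written out in standard GMM notation (assumed, since the slides showed the formulas only as images): component i has weight α_i, mean μ_i, and covariance Σ_i, with Σ_i α_i = 1, giving

```latex
\Pr[x \mid i] \;=\; \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad
\Pr[x] \;=\; \sum_{i=1}^{k} \alpha_i\, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad
L(\theta) \;=\; \sum_{j=1}^{n} \log \Pr[x_j]
```

and the configuration θ = {α_i, μ_i, Σ_i} is chosen to maximize L(θ).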
82. Expectation-Maximization
- Initialization
- Randomly select the initial configuration
- E-step
- Calculate the cluster membership probabilities
- M-step
- Update the parameters of each component
- Convergence condition
- No change in the membership probabilities
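A minimal EM sketch for a GMM, assuming NumPy/SciPy; the names and the log-likelihood stopping rule are mine (the slide instead stops when the membership probabilities no longer change):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, max_iter=100, tol=1e-6, rng=None):
    """Fit a k-component Gaussian mixture to X (n x d) with EM."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    alpha = np.full(k, 1.0 / k)                       # mixing weights
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: gamma[j, i] = Pr[component i | x_j]
        dens = np.column_stack([
            alpha[i] * multivariate_normal.pdf(X, mu[i], sigma[i])
            for i in range(k)])
        ll = np.log(dens.sum(axis=1)).sum()           # log likelihood L(theta)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate alpha_i, mu_i, Sigma_i from soft memberships
        nk = gamma.sum(axis=0)
        alpha = nk / n
        mu = (gamma.T @ X) / nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (gamma[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:                        # likelihood has stabilized
            break
        prev_ll = ll
    return alpha, mu, sigma
```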
83. Solution Space
- Solution space [ZDT08b]
- The configuration moves through the solution space with each EM iteration
84. Local Trapping Property
- Local trapping
- There is a path along which every configuration is a better solution than the current one
85. Local Trapping Property
- Local trapping
- There is no way to get out of the region if every boundary configuration is no better than the current one
86. Maximal Region
- Maximal region
- It covers the current configuration
- Every configuration on the boundary is no better than the current one
- The local optimum must be inside the maximal region
- By the local trapping property
87. Maximal Region
- A special class of regions
- Containing all configurations satisfying a given constraint [formula shown only as an image on the slide]
88. Maximal Region
- Verification of a maximal region
- O(n) time, where n is the number of points
- Derivation of an upper bound on the log likelihood of the local optimum
- Constant time
89. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
90. K-Means on Moving Objects
- Server: monitors the k-means clustering of the vehicles in the traffic system
- Clients: vehicles report their recent positions to the central server
91. Basic Solutions
- Update every second
- All clusterings can be captured
- Heavy communication cost on the clients
- Heavy computation cost on the server
92. Basic Solutions
- Update every L seconds
- Less communication and less computation
- Hard to choose L
- Some meaningful clusterings may be skipped
93. K-Means on Moving Objects
- To reduce the computation time and the number of messages, RUM is applied to k-means clustering [ZYTP08]
[Figure: timeline of the clustering; an update happens only at the marked point]
94. K-Means over Moving Objects
- Cost model
- If the radius of an object's uncertain sphere is ri and its speed is si, the maximum update frequency is si/ri
- Optimization problem
- Objective: minimize the sum of si/ri over all objects
- Constraint: the clustering error stays bounded
95. K-Means over Moving Objects
- Intuition on radius computation
- We have a constraint on the clustering uncertainty function over the radii r1, r2, ..., rn
- The Lagrange multiplier method is used to find the optimal radii, minimizing the total update frequency (sketched below)
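A sketch of the setup; the concrete uncertainty function g comes from [ZYTP08] and is left symbolic here:

```latex
\min_{r_1,\dots,r_n}\ \sum_{i=1}^{n} \frac{s_i}{r_i}
\quad \text{s.t.}\quad g(r_1,\dots,r_n) \le \epsilon,
\qquad
\Lambda \;=\; \sum_{i} \frac{s_i}{r_i} + \lambda\,\big(g(r) - \epsilon\big)
```

Setting ∂Λ/∂r_i = 0 gives s_i/r_i² = λ ∂g/∂r_i, i.e., r_i = √(s_i / (λ ∂g/∂r_i)), with λ chosen so the constraint is tight.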
96. K-Means over Moving Objects
- Handling an update
- Step 1: re-compute the cluster centers
- Step 2: send probe requests to objects if necessary
- Step 3: compute new spheres for some objects
97. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
98. Possible Applications
- Change detection over data streams
- Insertion-Deletion Model
- ZUM model?
99. Possible Applications
- Private information publication
- Publish uncertain objects
- Keep the current distribution
- RUM model?
100. Possible Applications
- Clustering over sensor networks
- Energy is a big issue
- Update only when necessary
- RUM model?
101. Possible Extensions
- Can the models be combined with other clustering methods?
102. On Theory
- Open problems
- Can we find good uncertain clustering models for discrete structures, like graphs and trees?
103. On Theory
- Open problems
- Can we combine uncertain clustering models with kernels and support vector machines?
104. On Theory
- Open problems
- Can we find uncertain clustering models that sit between expected case analysis and worst case analysis?
105. Conclusion
- Conclusion
- Uncertainty analyses
- Expected case analysis, convergence analysis, worst case analysis
- Uncertainty models
- ZUM, SUM, DUM, RUM
- Applications
- Monitoring moving objects with minimal communication cost
106. Download
- Link
- www.comp.nus.edu.sg/atung/waim08-tut.ppt
107. References
- [NKC06] W.K. Ngai, B. Kao, C.K. Chui, R. Cheng, M. Chau, K.Y. Yip. Efficient Clustering of Uncertain Data. ICDM 2006
- [ZDT06] Z. Zhang, B.T. Dai, A.K.H. Tung. On the Lower Bound of Local Optimums in K-Means Algorithm. ICDM 2006
- [ZDT08a] Z. Zhang, B.T. Dai, A.K.H. Tung. Robust Clustering over Uncertain Objects. Manuscript
- [ZYTP08] Z. Zhang, Y. Yang, A.K.H. Tung, D. Papadias. Continuous K-Means Monitoring over Moving Objects. TKDE 2008
- [ZDT08b] Z. Zhang, B.T. Dai, A.K.H. Tung. Estimating Local Optimums in EM Algorithm over Gaussian Mixture Model. ICML 2008
108. References
- [CM08] G. Cormode, A. McGregor. Approximation Algorithms for Clustering Uncertain Data. PODS 2008
- [BLP06] S. Ben-David, U. von Luxburg, D. Pal. A Sober Look at Clustering Stability. COLT 2006
- [BPS07] S. Ben-David, D. Pal, H. Simon. Stability of k-Means Clustering. COLT 2007
- [G85] T.F. Gonzalez. Clustering to Minimize the Maximum Intercluster Distance. Theoretical Computer Science, 1985
- [AV07] D. Arthur, S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. SODA 2007
- [KMN02] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, A. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. TPAMI 2002
109. References
- [KSS04] A. Kumar, Y. Sabharwal, S. Sen. A Simple Linear Time (1+ε)-Approximation Algorithm for k-Means Clustering in Any Dimensions. FOCS 2004
- [ARR98] S. Arora, P. Raghavan, S. Rao. Approximation Schemes for Euclidean k-Medians and Related Problems. STOC 1998
- [AGK01] V. Arya, N. Garg, R. Khandekar, K. Munagala, V. Pandit. Local Search Heuristic for k-Median and Facility Location Problems. STOC 2001
110. Questions & Answers