Title: Uncertain Data Clustering: Models, Methods and Applications
1. Uncertain Data Clustering: Models, Methods and Applications
- Anthony K.H. Tung
- Zhenjie Zhang
2. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
3. Clustering
- Unsupervised learning
- Clustering: dividing objects into groups
- Modeling: deriving a probability model, i.e., a probability density function (pdf), that fits the observations
[Figure: an example 2-clustering]
4. Applications of Clustering
- Applications of unsupervised learning
- Gain insight into the data
- Design promotion plans for different clusters of customers
- Optimize search engines
5. Applications of Clustering
- Applications of unsupervised learning
- As a pre-processing tool for other, more complex operations
[Figure: the probability of point A being a skyline point equals the probability of having no point in the shaded region; the density distribution is modeled by a mixture of 3 Gaussian distributions]
6. Applications of Clustering
- In general, if one can do clustering efficiently (in terms of minimizing computation, communication, and energy cost), many other problems can be solved on top of it
[Figure: a layered stack, bottom to top: dynamically changing, distributed data sources; efficient uncertain clustering algorithm (k-means, EM); relatively stable density distribution; range query, kNN query, skyline query]
7. Clustering Methods
- Many unsupervised learning methods exist
- Clustering methods: K-Means clustering, K-Median clustering
- Modeling methods: EM algorithm
- A basic requirement
- The information about each object is exact on every dimension
8. Uncertainty
- Uncertainty is ubiquitous in real systems, for reasons including:
- Limited data collection accuracy (e.g., GPS readings)
9. Uncertainty
- Uncertainty is ubiquitous in real systems, for reasons including:
- The uncertain nature of the objects themselves (e.g., moving animals)
10. Uncertainty
- Uncertainty is ubiquitous in real systems, for reasons including:
- Update cost issues (e.g., distributed databases)
11. Motivation
- A simple extension of the old methods?
- Find a representative location for each object [NKC06]
[Figure: clustering on representative locations; labels: boundary, center of boundary, actual location, approximate center, actual center]
12. Motivation
- Does having more information help?
- If we can pay some cost to retrieve the exact positions of some objects, then the clustering can be more accurate
13. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
14. Categories of Clustering
- Distance-Based Clustering
- K-Center, K-Means, K-Median
- Density-Based Clustering
- DBSCAN, CLIQUE
- Graph-Based Clustering
- Normalized-Cut, Spectral Clustering
- Margin-Based Clustering
- Support Vector Clustering
16. Distance-Based Clustering
- K-Center: choose k centers from the data set, minimizing the maximum distance from any point to its closest center
17. Distance-Based Clustering
- K-Means: choose k centers from the space, minimizing the sum of squared Euclidean distances from the points to their closest centers
18. Distance-Based Clustering
- K-Median: choose k centers from the space, minimizing the sum of some metric distance from the points to their closest centers
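Side by side, the three objectives differ only in how the per-point distances are aggregated (standard formulations consistent with the slides; k-center restricts C to the data set, while k-means and k-median may pick centers anywhere in the space):

```latex
\text{K-Center:}\;  \min_{|C|=k}\, \max_{p \in P}\, \min_{c \in C} d(p, c)
\qquad
\text{K-Means:}\;   \min_{|C|=k}\, \sum_{p \in P}\, \min_{c \in C} \lVert p - c \rVert^2
\qquad
\text{K-Median:}\;  \min_{|C|=k}\, \sum_{p \in P}\, \min_{c \in C} d(p, c)
```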
19. Distance-Based Clustering
- Classic solutions to K-Center
- Greedy incremental: a 2-approximation in O(nk) time [G85]
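A minimal sketch of the greedy incremental (farthest-point) algorithm of [G85]; the function and variable names are assumptions, not taken from the paper:

```python
import numpy as np

def greedy_k_center(points, k, rng=None):
    """Gonzalez's greedy 2-approximation for k-center [G85].
    points: (n, d) array; returns indices of the k chosen centers."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    centers = [int(rng.integers(n))]          # start from an arbitrary point
    # dist[i] = distance from point i to its closest chosen center
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(k - 1):
        far = int(np.argmax(dist))            # farthest point becomes the next center
        centers.append(far)
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return centers
```

Each of the k rounds scans all n points once, giving the O(nk) bound.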
20. Distance-Based Clustering
- Classic solutions to K-Means
- K-Means algorithm: convergence guaranteed only to a local optimum
- PTAS: approximation within any error ε, in time polynomial w.r.t. 1/ε [KSS04]
- Local search heuristic: (9+ε)-approximation [KMN02]
- Greedy incremental seeding: O(log k)-approximate in quadratic time [AV07]
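For comparison, a sketch of the [AV07] seeding rule (k-means++), which samples each new center with probability proportional to the squared distance from the centers already chosen; the implementation details are illustrative:

```python
import numpy as np

def kmeans_pp_seeding(points, k, rng=None):
    """k-means++ seeding [AV07]: O(log k)-approximate in expectation."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    centers = [points[rng.integers(n)]]
    d2 = np.linalg.norm(points - centers[0], axis=1) ** 2
    for _ in range(k - 1):
        probs = d2 / d2.sum()                 # sample proportional to D(x)^2
        idx = int(rng.choice(n, p=probs))
        centers.append(points[idx])
        d2 = np.minimum(d2, np.linalg.norm(points - points[idx], axis=1) ** 2)
    return np.array(centers)
```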
21. Distance-Based Clustering
- Classic solutions to K-Median
- PTAS: the first approximation scheme for k-median in 2-dimensional Euclidean space [ARR98]
- Local search heuristic: 5-approximation [AGK01]
22. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
23. Basic Principles
- Expected case analysis
- Minimize some objective function in expectation
- Convergence analysis
- Does the clustering converge, given more samples?
- Worst case analysis
- Minimize the maximum change of the clustering
24. Expected Case Analysis
- Input
- Points with distribution information
25. Expected Case Analysis
- Output
- Clustering minimizing objective function in
expectation
26. Expected Case Analysis
- Output
- Clustering minimizing the objective function in expectation
[Figure: Exp(Dist(C1, P1)) = Σ_j p_j · Dist(C1, x_j), the distance to each possible location x_j of P1 weighted by its probability p_j; the slide's example uses probabilities 0.3, 0.05, 0.7, 0.05, 0.05]
27. Expected Case Analysis
- K-Center under expected case analysis [CM08]
- A bicriteria algorithm can find O(k ε⁻¹ log² n) centers with (1+ε)-approximate cost in polynomial time
- K-Means and K-Median? [CM08]
- A simple reduction transforms the problem into traditional weighted K-Means and weighted K-Median
28. Expected Case Analysis
[Figure: example of the reduction from uncertain points to weighted points]
29. Expected Case Analysis
- Reduction
- {P1, P2, P3} ⇒ {Q1, Q2, Q3, Q4, Q5, Q6, Q7}
- Given any center set C, the expected K-Means or K-Median costs on these two sets are always the same, because of the linearity of expectation (see the derivation below)
- Any existing weighted k-means or k-median algorithm can be applied now!
- Why not k-center?
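Why the reduction works, in one line (notation mine: uncertain point P_i takes location x_ij with probability p_ij):

```latex
\mathbb{E}\Big[\min_{c \in C} \lVert P_i - c \rVert^2\Big]
  \;=\; \sum_{j} p_{ij}\, \min_{c \in C} \lVert x_{ij} - c \rVert^2
```

Summing over i, the expected k-means cost of the uncertain set equals the ordinary k-means cost of the weighted point set {(x_ij, p_ij)}; the same holds for k-median with distances in place of squared distances. K-center resists this reduction because its objective is a maximum over points, and the expectation of a maximum does not decompose point by point.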
30. Expected Case Analysis
- Advantage
- More robust than the naïve model
- Disadvantage
- Every object must have an explicit distribution representation
- The clustering result can have large variance
31. Convergence Analysis
- Input
- Random points instead of uncertain points
- Assumption: the points are samples from some fixed but unknown distribution
32. Convergence Analysis
- Output
- A stable clustering: the result the algorithm converges to asymptotically as more samples arrive
33. Convergence Analysis
- A general result [BLP06]
- The stability of the clustering does not depend on the algorithm, but on the underlying objective function and the data distribution
- For K-Means clustering [BPS07]
- If the k-means objective has a unique minimizer over the distribution, the clustering converges to it as more samples are generated
34. Convergence Analysis
- Advantage
- Better for parameter selection
- Strong statistical support
- Disadvantage
- Not all distributions are clusterable
- Hard to verify whether the current distribution satisfies the condition
35. A Simple Comparison
[Figure: the three principles placed on an axis measuring the variance of the clustering result, from 0 (convergence analysis) through worst case analysis to unbounded (expected case analysis)]
36. Worst Case Analysis
- Input
- Possible positions of the points
- No distribution information
37. Worst Case Analysis
- Output
- A clustering of the objects, together with the maximum possible change of that clustering
38. Worst Case Analysis
- Advantage
- No distribution information is needed
- More sub-models can be derived
- Disadvantage
- Some detailed information is wasted
39. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
40. Worst Case Models
- Concepts
- Object uncertainty, satisfaction by exact data, clustering uncertainty
- Models
- ZUM, SUM, DUM, RUM
- All models are independent of the underlying clustering method
- K-Means is used as the running example
41. Worst Case Models
- Object uncertainty
- Only a bounding sphere in the space
- No distribution assumption
- General enough to handle complex distributions
42. Worst Case Models
- An exact data set D satisfies an uncertain data set P if every point di in D lies inside the corresponding sphere pi in P (a small check is sketched below)
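A minimal sketch of this satisfaction test, assuming P is represented by sphere centers and radii (the array names are mine):

```python
import numpy as np

def satisfies(D, sphere_centers, radii):
    """True iff exact data set D satisfies uncertain data set P,
    where P is given as bounding spheres (sphere_centers[i], radii[i]).
    D, sphere_centers: (n, d) arrays; radii: (n,) array."""
    return bool(np.all(np.linalg.norm(D - sphere_centers, axis=1) <= radii))
```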
43. Worst Case Models
- Given an exact data set D and two clusterings C1 and C2, there is a mapping f from (C1, C2) to a real value, measuring the difference in their quality
- This mapping usually depends on the objective function of the clustering problem, such as k-means
[Figure: two clusterings C1 and C2 of the same data]
44. Worst Case Models
[Figure: example with k-means cost 20 for C1 and 18 for C2, so f(C1, C2) = 20 − 18 = 2]
45. Worst Case Models
- Clustering uncertainty
- Given an uncertain data set P, a clustering C, and an exact clustering algorithm A, the uncertainty of C is the maximum f(C, C′), where C′ ranges over the clustering results of A on any exact data set D satisfying P (formalized below)
[Figure: A = k-means; C is computed from the uncertain object locations, C′ from the exact locations in a given data set D]
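In symbols (the name U_A is mine; the definition is the one just stated):

```latex
U_A(C, P) \;=\; \max_{D \,\models\, P} \; f\big(C,\, A(D)\big)
```

where D ⊨ P ranges over all exact data sets satisfying P.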
46. Zero Uncertain Model
- Zero Uncertain Model (ZUM) [ZDT06]
- Every object is certain
- The clustering is uncertain due to incomplete computation (e.g., to save space in data streaming)
- Derive some bound on the movement of the centers
48. Static Uncertain Model
- Static Uncertain Model (SUM) [ZDT08a]
- The objects are uncertain instead of exact
- Find the worst case of f(C, C′)
- By no means an easy problem, since the centers are uncertain too
[Figure: boundaries of potential centers; a point near the boundary could belong to the red center]
50. Dissolvable Uncertain Model
- Dissolvable Uncertain Model (DUM)
- The error bound can be large in SUM
- Dissolution: obtain an object's exact location at some cost
- Find the minimal cost to reduce the error to some specified degree
[Figure: dissolving some objects updates the centers and reduces the error]
52. Reverse Uncertain Model
- Reverse Uncertain Model (RUM) [ZYTP08]
- We have exact objects, but want to derive bounds on their movement
- While every object stays within its bound, the clustering remains stable, i.e., the centers do not change much
[Figure: an object stays assigned to the black center as long as its furthest possible distance to the black center is smaller than its nearest possible distance to the red center]
54. K-Means Algorithm
- K-Means algorithm (a minimal sketch follows the list)
- Initialization: randomly choose k centers
- Re-assignment: assign each point to its closest center
- Center update: replace each previous center with the new mean of its cluster
- Convergence: stop when there is no assignment change
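A minimal sketch of these four steps (an assumed NumPy implementation; names are illustrative):

```python
import numpy as np

def k_means(points, k, max_iter=100, rng=None):
    """Lloyd's k-means: random init, re-assignment, center update,
    stop when the assignments no longer change."""
    rng = np.random.default_rng() if rng is None else rng
    centers = points[rng.choice(points.shape[0], k, replace=False)].astype(float)
    assign = None
    for _ in range(max_iter):
        # Re-assignment: each point joins its closest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                          # convergence: no assignment change
        assign = new_assign
        # Center update: replace each center with its cluster mean.
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assign
```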
55. K-Means Algorithm
- Advantages of the K-Means algorithm
- Simple to understand
- Popular in real applications
- Clustering representation for exact k-means
- Cluster centers
- The cost is the sum of squared Euclidean distances from the objects to their cluster centers
- The movement of the cluster centers bounds the change in cost
56. K-Means in ZUM
- K-Means algorithm under ZUM [ZDT06]
- Input: centers of the current clustering
- The problem is to find out how far the cluster centers may move in the following iterations
57. K-Means in ZUM
[Figure: cluster C1; if the centers move by no more than r, the change in the cost of any clustering is bounded within [−r²N, +r²N]]
58. Solution Space
- Given a d-dimensional problem space, we define the solution space as a kd-dimensional space: a single point in it encodes all k centers at once
[Figure: a center set M1 with centers c1, c2 in the problem space, plotted as one point in the solution space]
59. Solution Space
- With every iteration, the center set moves in the solution space
[Figure: successive iterations trace a path M1 → M2 → M3 in the solution space]
60. Maximal Region
- A maximal region is a region in the solution space that bounds the local optimum reachable in future iterations
- A region is a maximal region for center set M if
- It contains M
- Any solution on the boundary of the region has cost at least equal to that of M
61. Maximal Region
[Figure: the cost of center sets in the solution space shown as contour lines, lighter color meaning smaller cost; any solution between M1 and M2 must have smaller cost than M1]
62. Maximal Region
[Figure: a maximal region around M1 and M2; the local optimum must lie inside it]
63. Maximal Region
[Figure: a candidate maximal region in which every center moves by no more than Δ]
64. Maximal Region
[Figure: the center set M1 in the solution space corresponds to the individual centers m1 and m2 in the problem space]
65. Maximal Region
- How to verify a maximal region? (See the inequality sketched below.)
- Step 1: fix the current cluster assignment and move at least one center by Δ; the cost will increase
- Step 2: fix the current cluster centers and reassign the points to their closest centers; the cost will decrease
- If the increase in the first step is larger than the decrease in the second step, the center sets on the boundary are worse than the current one
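The check in one inequality (incr and decr are my shorthand for the step-1 increase and step-2 decrease): for any center set M′ on the boundary of the candidate region,

```latex
\mathrm{cost}(M') \;\ge\; \mathrm{cost}(M) \;+\; \mathrm{incr}(\Delta) \;-\; \mathrm{decr}(\Delta)
```

so incr(Δ) ≥ decr(Δ) guarantees cost(M′) ≥ cost(M), i.e., the region is maximal.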
66. Maximal Region
- How to find such a maximal region?
- If every center is allowed to move by distance Δ, how much will the cost of every cluster increase, assuming every object remains in its cluster?
67. Maximal Region
- How to find such a maximal region?
- If every center is allowed to move by distance Δ, how much can every object improve by switching to another cluster?
- Case 1
68. Maximal Region
- How to find such a maximal region?
- Case 2
69. Maximal Region
- How to find such a maximal region?
- Case 3
70. Maximal Region
- How to find such a maximal region?
- Assume d1 is the distance of pi to its nearest center and d2 is the distance of pi to its second nearest center
- When Δ is smaller than (d2 − d1)/2, pi is in Case 1
- When Δ is between (d2 − d1)/2 and d2, pi is in Case 2
- When Δ is larger than d2, pi is in Case 3
71. Maximal Region
- How to find such a maximal region?
- We can split the range of Δ into 2n+1 segments
- In each segment, if some quadratic equation has a real root, there is at least one valid upper bound inside
72. Maximal Region
- Complexity
- The breakpoints can be sorted in O(n log n) time
- Every quadratic equation can be solved in constant time
- The total complexity is O(n log n)
73. K-Means in SUM
- K-Means in SUM
- Intra-cluster uncertainty: the average of the radii
74. K-Means in SUM
- K-Means in SUM
- Inter-cluster uncertainty
75. K-Means Algorithm
- K-Means in SUM
- Inter-cluster uncertainty: similar to ZUM
76. K-Means Algorithm
- K-Means in SUM
- Inter-cluster uncertainty: similar to ZUM
- The complexity is the same as in ZUM: O(n log n) time
77. K-Means Algorithm
- DUM
- The cost of retrieving the exact position of object i is ci
- NP-hard
- Even NP-hard to approximate within any constant factor! Only heuristics are available!
78. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
79. EM Algorithm in ZUM
- A new result on the EM algorithm under the ZUM model [ZDT08b]
- Problem
- Given a group of points and a Gaussian mixture model
- How far is the real local optimum from the current one?
80. Gaussian Mixture Model
- Basics of GMM
- There are k components following Gaussian distributions
- The probability of observing x from component i
- The probability of observing x (both formulas are spelled out after the next slide)
81. Gaussian Mixture Model
- Find the configuration maximizing the probability of the whole data set
- Equivalent to maximizing the log likelihood
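Written out in standard GMM notation (assumed, since the slides showed the formulas only as images): component i has weight α_i, mean μ_i, and covariance Σ_i, with Σ_i α_i = 1, giving

```latex
\Pr[x \mid i] \;=\; \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad
\Pr[x] \;=\; \sum_{i=1}^{k} \alpha_i\, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad
L(\theta) \;=\; \sum_{j=1}^{n} \log \Pr[x_j]
```

and the configuration θ = {α_i, μ_i, Σ_i} is chosen to maximize L(θ).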
82. Expectation-Maximization
- Initialization
- Randomly select the initial configuration
- E-step
- Calculate the cluster membership probabilities
- M-step
- Update the parameters of each component
- Convergence condition
- No change in the membership probabilities
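A minimal EM sketch for a GMM, assuming NumPy/SciPy; the names and the log-likelihood stopping rule are mine (the slide instead stops when the membership probabilities no longer change):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, max_iter=100, tol=1e-6, rng=None):
    """Fit a k-component Gaussian mixture to X (n x d) with EM."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    alpha = np.full(k, 1.0 / k)                       # mixing weights
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: gamma[j, i] = Pr[component i | x_j]
        dens = np.column_stack([
            alpha[i] * multivariate_normal.pdf(X, mu[i], sigma[i])
            for i in range(k)])
        ll = np.log(dens.sum(axis=1)).sum()           # log likelihood L(theta)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate alpha_i, mu_i, Sigma_i from soft memberships
        nk = gamma.sum(axis=0)
        alpha = nk / n
        mu = (gamma.T @ X) / nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (gamma[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:                        # likelihood has stabilized
            break
        prev_ll = ll
    return alpha, mu, sigma
```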
83. Solution Space
- Solution space [ZDT08b]
- The configuration moves through the solution space with each EM iteration
84. Local Trapping Property
- Local trapping
- There is a path along which every configuration is a better solution than the current one
85. Local Trapping Property
- Local trapping
- There is no way to get out of the region if every boundary configuration is no better than the current one
86. Maximal Region
- Maximal region
- It covers the current configuration
- Every configuration on the boundary is no better than the current one
- The local optimum must be inside the maximal region
- By the local trapping property
87. Maximal Region
- A special class of regions
- Containing all configurations satisfying a given constraint [formula shown only as an image on the slide]
88. Maximal Region
- Verification of a maximal region
- O(n) time, where n is the number of points
- Derivation of an upper bound on the log likelihood of the local optimum
- Constant time
89. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
90. K-Means on Moving Objects
- Server: monitors the k-means clustering of the vehicles in the traffic system
- Clients: vehicles report their recent positions to the central server
91. Basic Solutions
- Update every second
- All clusterings can be captured
- Heavy communication cost on the clients
- Heavy computation cost on the server
92. Basic Solutions
- Update every L seconds
- Less communication and less computation
- Hard to choose L
- Some meaningful clusterings may be skipped
93. K-Means on Moving Objects
- To reduce the computation time and the number of messages, RUM is applied to k-means clustering [ZYTP08]
[Figure: timeline of the clustering; an update happens only at the marked point]
94. K-Means over Moving Objects
- Cost model
- If the radius of an object's uncertain sphere is ri and its speed is si, the maximum update frequency is si/ri
- Optimization problem
- Objective: minimize the sum of si/ri over all objects
- Constraint: the clustering error stays bounded
95. K-Means over Moving Objects
- Intuition on radius computation
- We have a constraint on the clustering uncertainty function over the radii r1, r2, ..., rn
- The Lagrange multiplier method is used to find the optimal radii, minimizing the total update frequency (sketched below)
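A sketch of the setup; the concrete uncertainty function g comes from [ZYTP08] and is left symbolic here:

```latex
\min_{r_1,\dots,r_n}\ \sum_{i=1}^{n} \frac{s_i}{r_i}
\quad \text{s.t.}\quad g(r_1,\dots,r_n) \le \epsilon,
\qquad
\Lambda \;=\; \sum_{i} \frac{s_i}{r_i} + \lambda\,\big(g(r) - \epsilon\big)
```

Setting ∂Λ/∂r_i = 0 gives s_i/r_i² = λ ∂g/∂r_i, i.e., r_i = √(s_i / (λ ∂g/∂r_i)), with λ chosen so the constraint is tight.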
96. K-Means over Moving Objects
- Handling an update
- Step 1: re-compute the cluster centers
- Step 2: send probe requests to objects if necessary
- Step 3: compute new spheres for some objects
97. Outline
- Introduction and Motivation
- Traditional Clustering Techniques
- Uncertain Clustering Principles
- Worst Case Models and K-Means
- Extension to EM Algorithm
- Monitoring Moving Objects
- Future Work and Conclusion
98. Possible Applications
- Change detection over data streams
- Insertion-Deletion Model
- ZUM model?
99. Possible Applications
- Private information publication
- Publish uncertain objects
- Keep the current distribution
- RUM model?
100. Possible Applications
- Clustering over sensor networks
- Energy is a big issue
- Update only when necessary
- RUM model?
101. Possible Extensions
- Can the models be combined with other clustering methods?
102. On Theory
- Open problems
- Can we find good uncertain clustering models for discrete structures, like graphs and trees?
103. On Theory
- Open problems
- Can we combine uncertain clustering models with kernels and support vector machines?
104. On Theory
- Open problems
- Can we find uncertain clustering models that sit between expected case analysis and worst case analysis?
105. Conclusion
- Conclusion
- Uncertainty analyses
- Expected case analysis, convergence analysis, worst case analysis
- Uncertainty models
- ZUM, SUM, DUM, RUM
- Applications
- Monitoring moving objects with minimal communication cost
106. Download
- Link
- www.comp.nus.edu.sg/atung/waim08-tut.ppt
107. References
- [NKC06] W.K. Ngai, B. Kao, C.K. Chui, R. Cheng, M. Chau, K.Y. Yip. Efficient Clustering of Uncertain Data. ICDM 2006
- [ZDT06] Z. Zhang, B.T. Dai, A.K.H. Tung. On the Lower Bound of Local Optimums in K-Means Algorithm. ICDM 2006
- [ZDT08a] Z. Zhang, B.T. Dai, A.K.H. Tung. Robust Clustering over Uncertain Objects. Manuscript
- [ZYTP08] Z. Zhang, Y. Yang, A.K.H. Tung, D. Papadias. Continuous K-Means Monitoring over Moving Objects. TKDE 2008
- [ZDT08b] Z. Zhang, B.T. Dai, A.K.H. Tung. Estimating Local Optimums in EM Algorithm over Gaussian Mixture Model. ICML 2008
108. References
- [CM08] G. Cormode, A. McGregor. Approximation Algorithms for Clustering Uncertain Data. PODS 2008
- [BLP06] S. Ben-David, U. von Luxburg, D. Pal. A Sober Look at Clustering Stability. COLT 2006
- [BPS07] S. Ben-David, D. Pal, H. Simon. Stability of k-Means Clustering. COLT 2007
- [G85] T.F. Gonzalez. Clustering to Minimize the Maximum Intercluster Distance. Theoretical Computer Science, 1985
- [AV07] D. Arthur, S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. SODA 2007
- [KMN02] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, A. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. TPAMI 2002
109. References
- [KSS04] A. Kumar, Y. Sabharwal, S. Sen. A Simple Linear Time (1+ε)-Approximation Algorithm for k-Means Clustering in Any Dimensions. FOCS 2004
- [ARR98] S. Arora, P. Raghavan, S. Rao. Approximation Schemes for Euclidean k-Medians and Related Problems. STOC 1998
- [AGK01] V. Arya, N. Garg, R. Khandekar, K. Munagala, V. Pandit. Local Search Heuristic for k-Median and Facility Location Problems. STOC 2001
110. Questions & Answers