Uncertain Data Clustering: Models, Methods and Applications

1
Uncertain Data Clustering: Models, Methods and Applications
  • Anthony K.H. Tung,
  • Zhenjie Zhang

2
Outline
  • Introduction and Motivation
  • Traditional Clustering Techniques
  • Uncertain Clustering Principles
  • Worst Case Models and K-Means
  • Extension to EM Algorithm
  • Monitoring Moving Objects
  • Future Work and Conclusion

3
Clustering
  • Unsupervised Learning
  • Clustering: dividing objects into groups
  • Modeling: deriving a probability model that fits
    the observations

[Figure: a 2-clustering of the data set and a probability density function (pdf) fitted to it]
4
Applications of Clustering
  • Application of unsupervised learning
  • Gain insight into the data
  • Design promotion plans for different clusters of
    customers
  • Optimize search engines

5
Applications of Clustering
  • Application of unsupervised learning
  • As a pre-processing tool for other, more complex
    operations

[Figure: a density distribution modeled by a mixture of 3 Gaussians; the probability of point A being a skyline point equals the probability that no point falls in the shaded region]
6
Applications of Clustering
  • In general, if one can do clustering in an
    efficient manner (in terms of minimizing
    computation, communication and energy costs), many
    other problems can be solved on top of it

[Figure: a layered pipeline, bottom to top: dynamically changing, distributed data sources → efficient uncertain clustering algorithm (k-means, EM) → relatively stable density distribution → range query, kNN query, skyline query]
7
Clustering Methods
  • There are many unsupervised learning methods
  • Clustering methods: K-Means clustering, K-Median
    clustering
  • Modeling methods: EM algorithm
  • A basic requirement
  • The information about each object must be exact in
    every dimension

8
Uncertainty
  • Uncertainty is ubiquitous in real systems for the
    following reasons
  • Limited data collection accuracy: GPS

9
Uncertainty
  • Uncertainty is ubiquitous in real systems for the
    following reasons
  • Uncertain nature: animals

10
Uncertainty
  • Uncertainty is ubiquitous in real systems for the
    following reasons
  • Update cost issues: distributed databases

11
Motivation
  • A simple extension of old methods?
  • Find a representative location for each object
    [NKC06]

[Figure: an uncertain object's boundary, its actual location, and the center of the boundary; the approximate center computed from representatives differs from the actual center]
12
Motivation
  • Does more information help?
  • If we can pay some cost to retrieve the exact
    positions of some objects, the clustering can be
    made more accurate

13
Outline
  • Introduction and Motivation
  • Traditional Clustering Techniques
  • Uncertain Clustering Principles
  • Worst Case Models and K-Means
  • Extension to EM Algorithm
  • Monitoring Moving Objects
  • Future Work and Conclusion

14
Categories of Clustering
  • Distance-Based Clustering
  • K-Center, K-Means, K-Median
  • Density-Based Clustering
  • DBScan, CLIQUE
  • Graph-Based Clustering
  • Normalized-Cut, Spectral Clustering
  • Margin-Based Clustering
  • Support Vector Clustering

16
Distance-Based Clustering
  • K-Center: choose k centers from the data set,
    minimizing the maximum distance from any point to
    its closest center

17
Distance-Based Clustering
  • K-Means: choose k centers from the space,
    minimizing the sum of squared Euclidean distances
    from each point to its closest center

18
Distance-Based Clustering
  • K-Median: choose k centers from the space,
    minimizing the sum of distances (under some
    metric) from each point to its closest center;
    the three objectives are written out below

19
Distance-Based Clustering
  • Classic solutions to K-Center
  • Greedy incremental: 2-approximation in O(nk) time
    [G85]

20
Distance-Based Clustering
  • Classic solutions to K-Means
  • K-Means algorithm: converges, but only to a local
    optimum
  • PTAS: (1+ε)-approximation for any error ε [KSS04]
  • Local search heuristic: (9+ε)-approximation
    [KMN02]
  • Greedy incremental seeding: O(log k)-approximate
    in expectation [AV07]

21
Distance-Based Clustering
  • Classic solutions to K-Median
  • PTAS: the first approximation scheme for k-median
    in 2-dimensional Euclidean space [ARR98]
  • Local search heuristic: 5-approximation [AGK01]

22
Outline
  • Introduction and Motivation
  • Traditional Clustering Techniques
  • Uncertain Clustering Principles
  • Worst Case Models and K-Means
  • Extension to EM Algorithm
  • Monitoring Moving Objects
  • Future Work and Conclusion

23
Basic Principles
  • Expected case analysis
  • Minimize some objective function in expectation
  • Convergence analysis
  • Does the clustering converge as we collect more
    samples?
  • Worst case analysis
  • Bound the maximum change of the clustering

24
Expected Case Analysis
  • Input
  • Points with distribution information

25
Expected Case Analysis
  • Output
  • Clustering minimizing objective function in
    expectation

26
Expected Case Analysis
  • Output
  • Clustering minimizing objective function in
    expectation

Exp(Dist(C1, P1)) = 0.3 × 0.05 + 0.7 × 0.05 = 0.05
27
Expected Case Analysis
  • K-Center under expected case analysis [CM08]
  • A bicriteria algorithm can find O(k ε⁻¹ log² n)
    centers with (1+ε)-approximate cost in polynomial
    time
  • K-Means and K-Median? [CM08]
  • A simple reduction transforms the problem into
    traditional weighted K-Means and weighted K-Median

28
Expected Case Analysis
  • Reduction Example

29
Expected Case Analysis
  • Reduction
  • {P1, P2, P3} → {Q1, Q2, Q3, Q4, Q5, Q6, Q7}
  • Given any center set C, the K-Means or K-Median
    cost on these two sets is always the same, by the
    linearity of expectation
  • Any existing k-means or k-median algorithm can be
    applied now! (A sketch of the reduction follows.)
  • Why not k-center? Its objective is a maximum, not
    a sum, so linearity of expectation does not apply
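
A minimal sketch of the reduction, assuming each uncertain point is given as a discrete distribution over candidate locations (the representation and names are illustrative, not from [CM08]):

  # Reduction from uncertain points to weighted exact points.
  # Each uncertain point is a list of (location, probability) pairs;
  # the reduction emits one weighted point per candidate location.
  def reduce_to_weighted(uncertain_points):
      weighted = []
      for candidates in uncertain_points:
          for location, prob in candidates:
              weighted.append((location, prob))  # weight = probability
      return weighted

  # P1 is at (0, 0) with probability 0.3 or at (1, 0) with probability 0.7.
  P = [
      [((0.0, 0.0), 0.3), ((1.0, 0.0), 0.7)],
      [((5.0, 5.0), 1.0)],
  ]
  Q = reduce_to_weighted(P)
  # Any weighted k-means or k-median solver can now run on Q: by
  # linearity of expectation, its cost under any center set equals
  # the expected cost on the uncertain input.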

30
Expected Case Analysis
  • Advantage
  • More robust than the naïve model of clustering
    representative points
  • Disadvantage
  • Every object must come with an explicit
    distribution representation
  • The clustering result can have large variance

31
Convergence Analysis
  • Input
  • Random points instead of uncertain points
  • Assumption: some fixed but unknown distribution
    exists

32
Convergence Analysis
  • Output
  • A stable clustering: the result the clustering
    converges to asymptotically as more samples arrive

33
Convergence Analysis
  • A general result [BLP06]
  • The stability of the clustering does not depend
    on the algorithm, but on the underlying objective
    function and the data distribution
  • For K-Means clustering [BPS07]
  • If there is a unique minimizer of k-means over
    the distribution, the clustering converges to it
    as more samples are generated

34
Convergence Analysis
  • Advantage
  • Better for parameter selection
  • With strong statistical support
  • Disadvantage
  • Not all distributions are clusterable
  • Hard to verify whether the underlying distribution
    satisfies the condition

35
A Simple Comparison
[Figure: the three analyses placed on an axis of clustering-result variance, from 0 (convergence analysis) through worst case analysis up to unbounded (expected case analysis)]
36
Worst Case Analysis
  • Input
  • Possible positions of the points
  • No distribution information

37
Worst Case Analysis
  • Output
  • A clustering of the objects, together with the
    maximum possible change of that clustering

38
Worst Case Analysis
  • Advantage
  • No distribution information is needed
  • More sub-models can be derived
  • Disadvantage
  • Some detailed information is wasted

39
Outline
  • Introduction and Motivation
  • Traditional Clustering Techniques
  • Uncertain Clustering Principles
  • Worst Case Models and K-Means
  • Extension to EM Algorithm
  • Monitoring Moving Objects
  • Future Work and Conclusion

40
Worst Case Models
  • Concepts
  • Object Uncertainty, Satisfaction of Exact Data,
    Clustering Uncertainty
  • Models
  • ZUM, SUM, DUM, RUM
  • All models are independent of the underlying
    clustering method
  • K-Means is used as the running example

41
Worst Case Models
  • Object Uncertainty
  • Only a bounding sphere in the space
  • No distribution assumption
  • General enough to handle complex distributions

42
Worst Case Models
  • An exact data set D satisfies an uncertain data
    set P if every point di in D lies inside the
    corresponding sphere pi in P (a concrete check
    appears below)
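
As a concrete check, the satisfaction test can be written as below (a sketch; representing each uncertain object as a (center, radius) pair is an assumption of this example):

  import math

  # D is a list of exact points; P is a list of (center, radius) spheres.
  # D satisfies P iff every point di lies inside its sphere pi.
  def satisfies(D, P):
      return all(math.dist(d, center) <= radius
                 for d, (center, radius) in zip(D, P))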

43
Worst Case Models
  • Given an exact data set D and two clusterings C1
    and C2, there is a mapping f from (C1, C2) to a
    real value, measuring the difference in their
    quality
  • This mapping usually depends on the objective
    function of the clustering problem, such as
    k-means

[Figure: two clusterings C1 and C2 of the same data]
44
Worst Case Models
  • K-Means

Example: the k-means cost of C1 is 20 and the k-means cost of C2 is 18, so f(C1, C2) = 20 - 18 = 2
45
Worst Case Models
  • Clustering uncertainty
  • Given an uncertain data set P, a clustering C,
    and an exact clustering algorithm A, the
    uncertainty of C is the maximum f(C, C'), where C'
    is the clustering produced by A on any exact data
    set D satisfying P (in symbols below)
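
In symbols, writing D ⊨ P for "D satisfies P":

  \[ U(C) = \max_{D \,\models\, P} f\bigl(C, \; A(D)\bigr) \]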

[Figure: A = k-means; C is computed from the uncertain object locations, C' from a given exact data set D satisfying P]
46
Zero Uncertain Model
  • Zero Uncertain Model (ZUM) [ZDT06]
  • Every object is certain
  • The clustering is uncertain due to incomplete
    computation (e.g., to save space in data streaming)
  • Derive some bound on the movement of the centers

47
Applications
48
Static Uncertain Model
  • Static Uncertain Model (SUM) [ZDT08a]
  • There are uncertain objects instead of exact ones
  • Find the worst case of f(C, C')
  • By no means an easy problem, since the centers
    are uncertain too

[Figure: the boundaries of the potential centers; an object near the boundary could belong to the red center instead]
49
Applications
50
Dissolvable Uncertain Model
  • Dissolvable Uncertain Model (DUM)
  • The error bound can be large in SUM
  • Dissolution: retrieve an object's exact location
    at some cost
  • Find the minimal cost that reduces the error to a
    specified degree

[Figure: dissolving some objects yields updated centers and a reduced error]
51
Applications
52
Reverse Uncertain Model
  • Reverse Uncertain Model (RUM) [ZYTP08]
  • We have exact objects, but want to derive bounds
    on their movement
  • While every object stays within its bound, the
    clustering remains stable, i.e., the centers do
    not change much

[Figure: the furthest possible distance to the black center is smaller than the nearest possible distance to the red center, so the object's assignment cannot change]
53
Applications
54
K-Means Algorithm
  • K-Means algorithm (a runnable sketch follows)
  • Initialization: randomly choose k centers
  • Re-assignment: assign each point to its closest
    center
  • Center update: replace each previous center with
    the new center (mean) of its cluster
  • Convergence: stop when there is no assignment
    change
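
A minimal sketch of these four steps on exact points (plain Lloyd's algorithm; points is assumed to be a list of coordinate tuples):

  import random

  def kmeans(points, k, max_iter=100):
      # Initialization: randomly choose k centers from the data set.
      centers = random.sample(points, k)
      assignment = None
      for _ in range(max_iter):
          # Re-assignment: assign every point to its closest center.
          new_assignment = [
              min(range(k),
                  key=lambda j: sum((a - b) ** 2
                                    for a, b in zip(p, centers[j])))
              for p in points
          ]
          # Convergence: stop when no assignment changes.
          if new_assignment == assignment:
              break
          assignment = new_assignment
          # Center update: replace each center with its cluster's mean.
          for j in range(k):
              members = [p for p, c in zip(points, assignment) if c == j]
              if members:
                  centers[j] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
      return centers, assignment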

55
K-Means Algorithm
  • Advantages of the K-Means algorithm
  • Simple to understand
  • Popular in real applications
  • Clustering representation for exact k-means: the
    cluster centers
  • The cost is the sum of squared Euclidean
    distances from the objects to their cluster
    centers
  • The movement of the cluster centers bounds the
    change in cost

56
K-Means in ZUM
  • K-Means algorithm under ZUM [ZDT06]
  • Input: the centers of the current clustering
  • The problem is to find out how far the cluster
    centers may still move in the following iterations

57
K-Means in ZUM
  • K-Means

If every center moves by no more than r, the change in
the cost of any clustering lies within [-r²N, r²N],
where N is the number of points
58
Solution Space
  • Given a d-dimensional problem space, we define
    the solution space as a kd-dimensional space: one
    point there encodes all k centers at once

[Figure: a center set M1 = (c1, c2) in the original space corresponds to a single point in the solution space]
59
Solution Space
  • With every iteration, the center set moves within
    the solution space

[Figure: successive iterations trace a path M1 → M2 → M3 through the solution space]
60
Maximal Region
  • A maximal region is a region in the solution
    space that bounds the local optimum reachable in
    future iterations
  • A region is a maximal region for a center set M
    if
  • It contains M
  • Every solution on the boundary of the region
    costs at least as much as M

61
Maximal Region
[Figure: the cost of center sets in the solution space shown as contour lines, lighter color meaning smaller cost; any solution on the path between M1 and M2 has smaller cost than M1]
62
Maximal Region
[Figure: the maximal region around M1; the local optimum must lie inside it]
63
Maximal Region
[Figure: a ball-shaped maximal region in which every center moves by no more than Δ]
64
Maximal Region
[Figure: the center set M1 = (m1, m2) in the original space, each center allowed to move within its own sphere]
65
Maximal Region
  • How to verify a maximal region? (A numeric sketch
    follows this list.)
  • Step 1: fix the current cluster assignment and
    move at least one center by Δ; the cost will
    increase
  • Step 2: fix the current cluster centers and
    re-assign the points to their closest centers;
    the cost will decrease
  • If the increase in the first step is larger than
    the decrease in the second step, the center sets
    on the boundary are worse than the current one
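
A numeric sketch of the two-step test; the two bounds below are deliberately simple, conservative stand-ins for the tighter bounds derived in [ZDT06]:

  import math

  # Step 1: with assignments fixed and centers at their cluster means,
  # moving one center by exactly delta raises that cluster's cost by
  # (cluster size) * delta**2; the smallest such rise is a lower bound.
  def cost_increase_lower_bound(cluster_sizes, delta):
      return min(cluster_sizes) * delta ** 2

  # Step 2: a point at distance d1 from its nearest center and d2 from
  # its second-nearest one can gain at most
  # (d1 + delta)**2 - max(d2 - delta, 0)**2 by switching clusters once
  # every center has moved by at most delta (a conservative bound).
  def reassignment_gain_upper_bound(points, centers, delta):
      total = 0.0
      for p in points:
          d1, d2 = sorted(math.dist(p, c) for c in centers)[:2]
          total += max(0.0, (d1 + delta) ** 2 - max(d2 - delta, 0.0) ** 2)
      return total

  # The delta-ball is a maximal region if no boundary solution can win.
  def is_maximal_region(points, centers, cluster_sizes, delta):
      return (cost_increase_lower_bound(cluster_sizes, delta)
              > reassignment_gain_upper_bound(points, centers, delta))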

66
Maximal Region
  • How to find such a maximal region?
  • If every center is allowed to move by distance Δ,
    by how much does the cost of every cluster
    increase, assuming every object remains in its
    cluster?
67
Maximal Region
  • How to find such a maximal region?
  • If every center is allowed to move by distance Δ,
    by how much can each object improve by switching
    to another cluster?
  • Case 1

68
Maximal Region
  • How to find such a maximal region?
  • Case 2

69
Maximal Region
  • How to find such a maximal region?
  • Case 3

70
Maximal Region
  • How to find such a maximal region?
  • Let d1 be the distance from pi to its nearest
    center and d2 the distance from pi to its
    second-nearest center
  • When Δ is smaller than (d2 - d1)/2, pi is in Case 1
  • When Δ is between (d2 - d1)/2 and d2, pi is in Case 2
  • When Δ is larger than d2, pi is in Case 3

71
Maximal Region
  • How to find such a maximal region?
  • We can split the range of Δ into 2n + 1 segments
  • In each segment, if some quadratic equation has a
    real root, there is at least one valid upper
    bound inside

72
Maximal Region
  • Complexity
  • The break points can be sorted in O(n log n) time
  • Every quadratic equation can be solved in
    constant time
  • The total complexity is O(n log n)

73
K-Means in SUM
  • K-Means in SUM
  • Intra-cluster uncertainty: the average of the
    radii

74
K-Means in SUM
  • K-Means in SUM
  • Inter-Cluster Uncertainty

75
K-Means Algorithm
  • K-Means in SUM
  • Inter-cluster uncertainty: handled similarly to
    ZUM

76
K-Means Algorithm
  • K-Means in SUM
  • Inter-cluster uncertainty: handled similarly to
    ZUM
  • The complexity is the same as in ZUM, O(n log n)
    time

77
K-Means Algorithm
  • K-Means in DUM
  • The cost of retrieving the exact position of
    object i is ci
  • The problem is NP-hard
  • It is even NP-hard to approximate within any
    constant factor, so only heuristics are available

78
Outline
  • Introduction and Motivation
  • Traditional Clustering Techniques
  • Uncertain Clustering Principles
  • Worst Case Models and K-Means
  • Extension to EM Algorithm
  • Monitoring Moving Objects
  • Future Work and Conclusion

79
EM Algorithm in ZUM
  • A new result for the EM algorithm under the ZUM
    model [ZDT08b]
  • Problem
  • Given a group of points and a Gaussian mixture
    model
  • How far is the real local optimum from the
    current one?

80
Gaussian Mixture Model
  • Basics of GMM
  • There are k components following Gaussian
    distributions
  • The probability of observing x from component i
  • The probability of observing x
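
In standard notation, with αi, μi, Σi the weight, mean and covariance of component i, these two probabilities are:

  \[ P(x \mid i) = \mathcal{N}(x; \mu_i, \Sigma_i) = \frac{\exp\bigl(-\tfrac{1}{2}(x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i)\bigr)}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \]
  \[ P(x) = \sum_{i=1}^{k} \alpha_i \, \mathcal{N}(x; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{k} \alpha_i = 1 \]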

81
Gaussian Mixture Model
  • Find the configuration to maximize the
    probability of the whole data set
  • Equivalent to maximizing the log likelihood
    (written out below)
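
Explicitly, for observations x1, …, xn:

  \[ \log L(\theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i \, \mathcal{N}(x_j; \mu_i, \Sigma_i) \]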

82
Expectation-Maximization
  • Initialization: randomly select the initial
    configuration
  • E-Step: calculate the cluster membership
    probabilities
  • M-Step: update the parameters of each component
  • Convergence condition: no change in the
    membership probabilities (a minimal sketch
    follows)
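
A minimal one-dimensional sketch of these steps (standard EM for a GMM with a fixed iteration budget, for reference; not the bounded variant of [ZDT08b]):

  import math
  import random

  def normal_pdf(x, mu, var):
      return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

  def em_gmm_1d(xs, k, iters=50):
      # Initialization: random means, unit variances, uniform weights.
      mus = random.sample(xs, k)
      vars_ = [1.0] * k
      alphas = [1.0 / k] * k
      for _ in range(iters):
          # E-step: membership probability of each point in each component.
          resp = []
          for x in xs:
              w = [alphas[i] * normal_pdf(x, mus[i], vars_[i]) for i in range(k)]
              s = sum(w)
              resp.append([wi / s for wi in w])
          # M-step: re-estimate weights, means and variances.
          for i in range(k):
              ni = sum(r[i] for r in resp)
              alphas[i] = ni / len(xs)
              mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / ni
              vars_[i] = max(sum(r[i] * (x - mus[i]) ** 2
                                 for r, x in zip(resp, xs)) / ni, 1e-6)
      return alphas, mus, vars_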

83
Solution Space
  • Solution space [ZDT08b]
  • The configuration moves through the solution
    space with each EM iteration

84
Local Trapping Property
  • Local trapping
  • There is a path such that every configuration on
    the path is a better solution than the current one

85
Local Trapping Property
  • Local trapping
  • There is no way to get out of the region if every
    boundary configuration is no better than the
    current one

86
Maximal Region
  • Maximal Region
  • It covers the current configuration
  • Every configuration on the boundary is no better
    than the current one
  • The local optimum must be inside the maximal
    region
  • By the local trapping property

87
Maximal Region
  • A special class of region
  • Containing all configurations satisfying

88
Maximal Region
  • Verifying a maximal region: O(n) time, where n is
    the number of points
  • Deriving an upper bound on the log likelihood of
    the local optimum: constant time

89
Outline
  • Introduction and Motivation
  • Traditional Clustering Techniques
  • Uncertain Clustering Principles
  • Worst Case Models and K-Means
  • Extension to EM Algorithm
  • Monitoring Moving Objects
  • Future Work and Conclusion

90
K-Means on Moving Objects
  • Server: monitors the k-means clustering of the
    vehicles in a traffic system
  • Clients: vehicles report their recent positions
    to the central server

91
Basic Solutions
  • Update every second
  • All clusterings can be captured
  • Heavy communication cost on the clients
  • Heavy computation cost on the server

92
Basic Solutions
  • Update every L seconds
  • Less communication and less computation
  • Hard to choose L
  • Some meaningful clusterings may be missed

93
K-Means on Moving Objects
  • To reduce the computation time and the number of
    messages, RUM is applied to k-means clustering
    [ZYTP08]

94
K-Means over Moving Objects
  • Cost Model
  • If the radius of the uncertain sphere is ri and
    the speed of the object is si, the maximum update
    frequency is si/ri
  • Optimization problem
  • Objective: minimize the sum of si/ri over all
    objects
  • Constraint: the clustering error stays bounded

95
K-Means over Moving Objects
  • Intuition on Radius Computation
  • We have a constraint on the clustering
    uncertainty as a function of the radii
    r1, r2, …, rn
  • The Lagrange multiplier method is used to find
    the optimal radii, minimizing the total update
    frequency (see the formulation below)
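
A reconstruction of the formulation the slides describe, where U(·) denotes the clustering-uncertainty bound and ε the error budget (both symbols are notational assumptions):

  \[ \min_{r_1, \dots, r_n} \; \sum_{i=1}^{n} \frac{s_i}{r_i} \quad \text{s.t.} \quad U(r_1, \dots, r_n) \le \varepsilon \]
  \[ \mathcal{L}(r, \lambda) = \sum_{i=1}^{n} \frac{s_i}{r_i} + \lambda \bigl( U(r_1, \dots, r_n) - \varepsilon \bigr), \qquad \frac{\partial \mathcal{L}}{\partial r_i} = 0 \]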

96
K-Means over Moving Objects
  • Handling an update
  • Step 1: re-compute the cluster centers
  • Step 2: send probe requests to objects if
    necessary
  • Step 3: compute new spheres for some objects

97
Outline
  • Introduction and Motivation
  • Traditional Clustering Techniques
  • Uncertain Clustering Principles
  • Worst Case Models and K-Means
  • Extension to EM Algorithm
  • Monitoring Moving Objects
  • Future Work and Conclusion

98
Possible Applications
  • Change detection over data streams
  • Insertion-Deletion Model
  • ZUM model?

99
Possible Applications
  • Private Information Publication
  • Publish uncertain objects
  • Preserve the current distribution
  • RUM model?

100
Possible Applications
  • Clustering over sensor networks
  • Energy is a big issue
  • Update only when necessary
  • RUM Model?

101
Possible Extension
  • With other clustering methods?

102
On Theory
  • Open Problems
  • Can we find good uncertain clustering models for
    discrete structures, such as graphs and trees?

103
On Theory
  • Open Problems
  • Can we combine uncertain clustering models with
    kernels and support vector machines?

104
On Theory
  • Open Problems
  • Can we find uncertain clustering models that lie
    between expected case analysis and worst case
    analysis?

105
Conclusion
  • Conclusion
  • Uncertain analysis
  • Expected case analysis, convergence analysis,
    worst case analysis
  • Uncertain Models
  • ZUM, SUM, DUM, RUM
  • Uncertain Applications
  • Monitoring moving objects with minimal
    communication cost

106
Download
  • Link
  • www.comp.nus.edu.sg/atung/waim08-tut.ppt

107
References
  • [NKC06] W.K. Ngai, B. Kao, C.K. Chui, R. Cheng,
    M. Chau, K.Y. Yip, Efficient Clustering of
    Uncertain Data, ICDM 2006
  • [ZDT06] Z. Zhang, B.T. Dai, A.K.H. Tung, On the
    Lower Bound of Local Optimums in K-Means
    Algorithm, ICDM 2006
  • [ZDT08a] Z. Zhang, B.T. Dai, A.K.H. Tung, Robust
    Clustering over Uncertain Objects, manuscript
  • [ZYTP08] Z. Zhang, Y. Yang, A.K.H. Tung, D.
    Papadias, Continuous K-Means Monitoring over
    Moving Objects, TKDE 2008
  • [ZDT08b] Z. Zhang, B.T. Dai, A.K.H. Tung,
    Estimating Local Optimums in EM Algorithm over
    Gaussian Mixture Model, ICML 2008

108
References
  • [CM08] G. Cormode, A. McGregor, Approximation
    Algorithms for Clustering Uncertain Data, PODS
    2008
  • [BLP06] S. Ben-David, U. von Luxburg, D. Pal, A
    Sober Look at Clustering Stability, COLT 2006
  • [BPS07] S. Ben-David, D. Pal, H. Simon,
    Stability of k-Means Clustering, COLT 2007
  • [G85] T.F. Gonzalez, Clustering to Minimize the
    Maximum Intercluster Distance, Theoretical
    Computer Science 1985
  • [AV07] D. Arthur, S. Vassilvitskii, k-means++:
    The Advantages of Careful Seeding, SODA 2007
  • [KMN02] T. Kanungo, D. Mount, N. Netanyahu, C.
    Piatko, R. Silverman, A. Wu, An Efficient
    k-Means Clustering Algorithm: Analysis and
    Implementation, TPAMI 2002

109
References
  • [KSS04] A. Kumar, Y. Sabharwal, S. Sen, A Simple
    Linear Time (1+ε)-Approximation Algorithm for
    k-Means Clustering in Any Dimensions, FOCS 2004
  • [ARR98] S. Arora, P. Raghavan, S. Rao,
    Approximation Schemes for Euclidean k-Medians
    and Related Problems, STOC 1998
  • [AGK01] V. Arya, N. Garg, R. Khandekar, K.
    Munagala, V. Pandit, Local Search Heuristics for
    k-Median and Facility Location Problems, STOC
    2001

110
Questions & Answers