Wang Kay Ngai, Ben Kao, Chun Kit Chui, Reynold Cheng, Michael Chau, Kevin Y. Yip
Speaker: Wang Kay Ngai
Data clustering
Data clustering is used to discover any cluster patterns in a data set.
One kind of clustering partitions the data set into groups, called clusters, such that data within the same cluster are closer to each other (based on some distance function, such as the Euclidean distance) than to data from any other cluster.
K-means is a common method that tries to achieve this clustering by ensuring that each data item is closer to the representative of its own cluster than to the representative of any other cluster.
The representative of a cluster is the mean value of all the data in that cluster.
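One K-means iteration can be sketched as follows. This is a minimal illustration, assuming points are coordinate tuples and the distance is Euclidean; the function names `nearest` and `kmeans_step` are hypothetical and not from the talk.

```python
import math

def nearest(point, reps):
    """Index of the representative closest to `point` (Euclidean distance)."""
    return min(range(len(reps)), key=lambda j: math.dist(point, reps[j]))

def kmeans_step(points, reps):
    """One K-means iteration: assign each point, then recompute the means."""
    labels = [nearest(p, reps) for p in points]
    new_reps = []
    for j, old in enumerate(reps):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster keeps its previous representative.
            new_reps.append(old)
            continue
        dim = len(old)
        new_reps.append(tuple(sum(m[d] for m in members) / len(members)
                              for d in range(dim)))
    return labels, new_reps

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, reps = kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)])
```

Repeating `kmeans_step` until the labels stop changing gives the usual K-means loop.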
[Figure: data points and their representatives grouped into three clusters]
Uncertain data
How about clustering of uncertain data?
Sometimes a data value is not known exactly but lies within a certain range, described by a probability density function (pdf). We call such a value uncertain data.
An example of uncertain data is the 2D location reported from a mobile device in a tracking system.
The device may already have moved to a location other than the reported one by the time the data is received.
The exact location of the device is uncertain but is within a circular region determined by the maximum velocity v of the device and the time t elapsed since the location data is sent.
[Figure: uniform pdf f(x,y) over a circular region of radius r = vt centered at the reported location; axes x and y]
This region is called the uncertainty region.
It can have an arbitrary shape and its pdf can also be arbitrary.
[Figure: two uncertainty regions with pdfs f(x,y) over axes x and y; one of them involves a lake]
Clustering of uncertain data
In the mobile-device example, each device in some applications may need to communicate with a server.
The communication cost may depend on the communication distance and could be reduced by clustering the devices.
[Figure: devices grouped into two clusters; each cluster has a leader, and the leaders communicate with the server]
Most of a device's communication is short-distance, with the leader of its cluster.
Only the communication between the leaders and the server is long-distance.
So clustering could reduce the total communication cost of the devices.
UK-means
K-means can be used to cluster uncertain data by using a new distance function, called expected distance, between the data and the cluster representative.
We call this specialized K-means method UK-means.
An expected distance between the data o and the cluster representative c is defined as follows
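The slide's formula is not reproduced in this transcript. A standard form, assuming f_o denotes the pdf of o, R_o its uncertainty region, and D a metric such as the Euclidean distance (this notation is chosen here for illustration and is not taken from the slide), would be:

```latex
ED(o, c) = \int_{x \in R_o} f_o(x)\, D(x, c)\, dx
```

Taking D to be a metric is what allows the triangle inequality to be applied to expected distances.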
For a cluster C of uncertain data, its representative value is assigned the mean value of the centers of mass of all the data in that cluster as follows
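The slide's formula is likewise missing here. Read literally, the sentence corresponds to the following, where the bar notation for the center of mass of o, the pdf f_o, and the uncertainty region R_o are assumed notation rather than the slide's own:

```latex
c = \frac{1}{|C|} \sum_{o \in C} \bar{o},
\qquad
\bar{o} = \int_{x \in R_o} x\, f_o(x)\, dx
```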
Major overhead in UK-means
The clustering process of UK-means consists of iterations.
In each iteration, for each data o, UK-means assigns o to the cluster whose representative has the smallest expected distance to o. We refer to this as a cluster assignment.
In the brute-force approach to the cluster assignment of each data o, UK-means simply computes the expected distance between o and every cluster representative in order to find the smallest one.
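The brute-force assignment can be sketched as below. As an assumption for illustration, an uncertain object is represented by a list of equally weighted sample points, so the expected distance becomes an average; `expected_distance` and `brute_force_assign` are hypothetical names.

```python
import math

def expected_distance(samples, c):
    """Approximate ED(o, c) as the average distance from o's samples to c."""
    return sum(math.dist(x, c) for x in samples) / len(samples)

def brute_force_assign(samples, reps):
    """Compute ED(o, c) for every representative; return (best index, #EDs computed)."""
    eds = [expected_distance(samples, c) for c in reps]
    best = min(range(len(reps)), key=eds.__getitem__)
    return best, len(eds)

o = [(0.0, 0.0), (2.0, 0.0)]   # two equally likely positions of one object
best, n_computed = brute_force_assign(o, [(1.0, 0.0), (5.0, 0.0), (1.0, 8.0)])
```

The second return value makes the cost visible: one expected distance per representative, for every object, in every iteration.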
In some applications where the uncertainty regions or pdfs are arbitrary, computing an expected distance requires an expensive numerical integration.
Because the brute-force approach performs many expected distance computations, these computations become the major overhead of the whole clustering process.
Suppose a lower bound of the expected distance ED(o,c1) between a data o and a cluster representative c1 is larger than an upper bound of ED(o,c2) for another cluster representative c2. Then, without computing ED(o,c1), we know that o cannot be assigned to c1 (that is, c1 is pruned).
[Figure: data o and representatives c1 and c2, with the lower bound of ED(o,c1) larger than the upper bound of ED(o,c2)]
If this condition does not hold for c1, then ED(o,c1), ED(o,c2), or both may need to be computed in the cluster assignment of o.
For each data, most cluster representatives will likely satisfy the condition, since in general each data is much closer to a few of the cluster representatives than to the others.
So expected distance computations can be reduced in the whole clustering process using the upper and lower bounds.
The amount of reduction depends on the tightness of the upper and lower bounds.
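The pruning test can be sketched in a few lines. This assumes the bounds for one object are already available as two lists indexed by cluster; the function name `prune` is hypothetical.

```python
def prune(lower, upper):
    """Return indices of representatives that survive pruning.

    lower[j] and upper[j] bound ED(o, c_j). A representative is pruned when
    its lower bound is larger than the smallest upper bound of any representative.
    """
    best_upper = min(upper)
    return [j for j in range(len(lower)) if lower[j] <= best_upper]

candidates = prune(lower=[1.0, 4.5, 2.5, 7.0], upper=[3.0, 6.0, 5.0, 9.0])
```

Expected distances then only need to be computed for the surviving candidates, and not at all when a single candidate remains. Tighter bounds leave fewer survivors.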
Min-max-dist
The basic approach for computing upper and lower bounds of ED(o,c) is to compute the maximum distance (MaxDist) and minimum distance (MinDist), respectively, between c and any point in the minimum bounding rectangle (MBR) of o's uncertainty region.
The approach is called min-max-dist.
These bounds may bound ED(o,c) very loosely.
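MinDist and MaxDist to an axis-aligned box can be computed per dimension, as in this sketch. It assumes the MBR is given by its lower and upper corners `lo` and `hi`; the function name `min_max_dist` is hypothetical.

```python
import math

def min_max_dist(c, lo, hi):
    """MinDist and MaxDist between point c and the box with corners lo and hi."""
    min_sq = max_sq = 0.0
    for x, a, b in zip(c, lo, hi):
        nearest_gap = max(a - x, 0.0, x - b)         # 0 when x lies inside [a, b]
        farthest_gap = max(abs(x - a), abs(x - b))   # distance to the farther side
        min_sq += nearest_gap ** 2
        max_sq += farthest_gap ** 2
    return math.sqrt(min_sq), math.sqrt(max_sq)

min_d, max_d = min_max_dist((5.0, 0.0), lo=(0.0, 0.0), hi=(2.0, 2.0))
```

Every point of the uncertainty region lies inside the MBR, so its distance to c lies between MinDist and MaxDist, and so does the expected distance. The bounds ignore how the probability mass is spread inside the box, which is why they can be loose.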
[Figure: MinDist and MaxDist from cluster representative c to the MBR of data o]
Upre and Lpre
Using the pdf (and hence the uncertainty region) of the data o, instead of its MBR, in computing the bounds of ED(o,c) may give tighter bounds.
So we propose two approaches, Upre and Lpre, that use the pdf to compute the upper and lower bounds, respectively.
First, some anchor points y are placed near the data o, and the expected distance ED(o,y) is pre-computed for each anchor point y.
[Figure: MBR of data o]
Then, by the triangle inequality, the following inequality gives an upper bound of ED(o,c) for any cluster representative c
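The inequality itself is not preserved in the transcript. Assuming D is a metric and ED(o,c) is the pdf-weighted average of D(x,c) over the points x of o's uncertainty region, the triangle inequality applied pointwise and then averaged gives:

```latex
D(x, c) \le D(x, y) + D(y, c) \quad \text{for every point } x
\;\Longrightarrow\;
ED(o, c) \le ED(o, y) + D(y, c)
```

The term D(y,c) is a plain point-to-point distance, so once ED(o,y) has been pre-computed the bound needs no integration.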
[Figure: anchor point y, point p, and cluster representative c]
Then, in the Upre approach, the minimum of these bounds over all anchor points y is used as the upper bound of ED(o,c).
Similarly, by the triangle inequality, the lower bound of ED(o,c) in the Lpre approach is derived as follows
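The slide's formula is missing from the transcript. One form that follows from applying the triangle inequality in both directions and averaging over o's pdf, offered here as an assumption about what the slide showed, is:

```latex
ED(o, c) \ge \max_{y} \, \bigl| ED(o, y) - D(y, c) \bigr|
```

Taking the maximum over the anchor points y mirrors taking the minimum for the upper bound: each anchor gives a valid bound, and the tightest one is kept.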
The Upre and Lpre approaches reduce expected distance computations in the cluster assignment, at the cost of a few extra expected distance computations before the clustering process starts.
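A minimal sketch of the two steps, pre-computation and bounding, is given below. It assumes an object is represented by equally weighted sample points and that the lower bound takes the absolute-value form implied by the triangle inequality; the function names are hypothetical.

```python
import math

def expected_distance(samples, c):
    """Approximate ED(o, c) as the average distance from o's samples to c."""
    return sum(math.dist(x, c) for x in samples) / len(samples)

def precompute(samples, anchors):
    """ED(o, y) for every anchor point y, computed once before clustering."""
    return [expected_distance(samples, y) for y in anchors]

def upre_lpre(anchors, anchor_eds, c):
    """Lower and upper bounds of ED(o, c) from the pre-computed ED(o, y)."""
    upper = min(ed + math.dist(y, c) for y, ed in zip(anchors, anchor_eds))
    lower = max(abs(ed - math.dist(y, c)) for y, ed in zip(anchors, anchor_eds))
    return lower, upper

o = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -1.0)]
anchors = [(0.0, 0.0), (2.0, 0.0)]
anchor_eds = precompute(o, anchors)      # done once per object
c = (6.0, 0.0)
lower, upper = upre_lpre(anchors, anchor_eds, c)
```

Inside the clustering loop only `upre_lpre` runs, and it uses point-to-point distances alone, so no integration is needed unless pruning fails.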
Ucs and Lcs
As mentioned before, if a cluster representative c1 cannot be pruned when the lower bound of ED(o,c1) is compared to the upper bound of ED(o,c2) for another cluster representative c2, some expected distances may be computed.
Suppose the min-max-dist approach is used for computing the upper and lower bounds, and suppose that, after pruning in the cluster assignment of o in some iteration of the clustering process, ED(o,c) still needs to be computed for some cluster representative c.
Then, in any later iteration, we propose two approaches, Ucs and Lcs, that use the computed expected distance ED(o,c) to compute the upper and lower bounds of ED(o,c'), respectively, where c' denotes the representative of the same cluster in that later iteration.
Note that the values of c and c' can differ because the value of a cluster representative is updated in each iteration.
So the Ucs and Lcs approaches avoid the pre-computations of expected distances required by Upre and Lpre.
But they can be used only after a specific expected distance has been computed.
By the triangle inequality, in the Ucs approach the upper bound of ED(o,c'), where c' is the current value of the representative c, is computed as follows
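The slide's formula is not in the transcript. Writing c for the representative at the time ED(o,c) was computed and c' for the same cluster's representative in the current iteration, the triangle inequality gives:

```latex
ED(o, c') \le ED(o, c) + D(c, c')
```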
Similarly, in the Lcs approach the lower bound of ED(o,c') is computed as follows
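Again the slide's formula is missing. With c the representative for which ED(o,c) was computed and c' its current value, one form consistent with the triangle inequality, stated here as an assumption, is:

```latex
ED(o, c') \ge \bigl| ED(o, c) - D(c, c') \bigr|
```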
These bounds move closer to ED(o,c'), and hence become tighter, as the distance D(c,c') between the earlier and the current representative becomes smaller.
D(c,c') becomes smaller, and eventually zero, in the later iterations of a clustering process.
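The Ucs and Lcs bounds reuse one stored expected distance and one point-to-point distance, as in this sketch. The function name `ucs_lcs` and the absolute-value form of the lower bound are assumptions made for illustration.

```python
import math

def ucs_lcs(ed_old, c_old, c_new):
    """Bounds of ED(o, c_new) from ED(o, c_old) computed in an earlier iteration."""
    shift = math.dist(c_old, c_new)   # how far the representative has moved
    lower = abs(ed_old - shift)
    upper = ed_old + shift
    return lower, upper

lower, upper = ucs_lcs(ed_old=5.0, c_old=(6.0, 0.0), c_new=(6.0, 0.5))
```

When the representative stops moving, `shift` is zero and both bounds equal the stored expected distance, so no recomputation is needed for that pair.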
Experimental results
We want to see how many of the expected distance computations of the brute-force cluster assignment are saved when the upper- and lower-bound approaches are used, and how the saving affects the efficiency of the whole clustering process.
So we conduct experiments on random 2D uncertain data with and without cluster patterns.
Each data has a random uncertainty region with a random pdf.
The pdf is approximated by s sample points for computing expected distances.
The larger s is, the more accurate the computed expected distance is (with a higher computation cost).
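The effect of s can be illustrated with a sample-based estimate. This sketch assumes a uniform pdf over a disc (as in the mobile-device example, not the random pdfs of the experiments) and equally weighted samples; `sample_disc` and `estimate_ed` are hypothetical names.

```python
import math
import random

def sample_disc(rng, center, r):
    """One point drawn uniformly from the disc of radius r around `center`."""
    rho = r * math.sqrt(rng.random())      # sqrt keeps the density uniform in area
    theta = 2.0 * math.pi * rng.random()
    return (center[0] + rho * math.cos(theta), center[1] + rho * math.sin(theta))

def estimate_ed(s, c, center=(0.0, 0.0), r=1.0, seed=0):
    """Approximate ED(o, c) with s equally weighted sample points."""
    rng = random.Random(seed)
    samples = [sample_disc(rng, center, r) for _ in range(s)]
    return sum(math.dist(x, c) for x in samples) / s

# For c at the disc center the exact value is 2r/3, which gives a reference point.
for s in (10, 100, 10000):
    print(s, estimate_ed(s, c=(0.0, 0.0)))
```

Each estimate costs s distance evaluations, which is the accuracy versus cost trade-off described above.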
We vary the parameters s, maximum size of uncertainty region (d), number of objects (n) and number of clusters (K).
The cluster representatives are initialized to be distributed uniformly among the data or to be the centers of mass of the uncertainty regions of randomly selected data.
In the experiments the approaches Upre, Lpre, Ucs and Lcs also use the bounds computed in the min-max-dist approach for the pruning.
For example, with the Upre approach, the smaller (tighter) of its upper bound and the upper bound from the min-max-dist approach is used as the actual upper bound for pruning in a cluster assignment.
K (= 49) expected distances are computed per object per iteration in the brute-force approach.
100% of the K expected distances are computed per object per iteration in the brute-force approach.
The run-time of the clustering process using the proposed approaches is at least 10 times shorter than that using the brute-force approach.
Conclusions
Approaches are proposed for reducing expected distance computations, the major overhead, in the brute-force approach of UK-means.
In most experiments, using both Ucs and Lcs reduces the overhead by a factor of at least 200 (10 times more than min-max-dist alone), which results in at least a 10-fold increase in clustering efficiency.
If the expected distance pre-computations in Upre and Lpre can be discounted in an application, using all the proposed approaches gives the largest overhead reduction, which should yield the largest increase in clustering efficiency.