Efficient Clustering of Uncertain Data

1
Efficient Clustering of Uncertain Data
  • Wang Kay Ngai, Ben Kao, Chun Kit Chui, Reynold
    Cheng, Michael Chau, Kevin Y. Yip

Speaker: Wang Kay Ngai
2
Data clustering
  • Data clustering is used to discover grouping
    patterns in a data set.
  • One kind of clustering partitions the data set
    into groups, called clusters, such that data
    within the same cluster are closer to each other
    (under some distance function, such as the
    Euclidean distance) than to data in any other
    cluster.

3
  • K-means is a common method that achieves this
    partitioning by ensuring that each data item is
    closer to the representative of its own cluster
    than to the representatives of all other clusters
    (a minimal sketch follows the figure below).
  • The representative of a cluster is the mean of
    all the data in that cluster.

[Figure: data points partitioned into three clusters,
each with its own representative]
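For reference, here is a minimal sketch of plain K-means on
certain (point-valued) data; the function and variable names
are ours, not from the paper.

import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on certain (point-valued) data."""
    rng = np.random.default_rng(seed)
    # Initialize representatives with k distinct data points.
    reps = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest representative (Euclidean).
        dists = np.linalg.norm(points[:, None, :] - reps[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each representative to the mean of its cluster;
        # keep the old value if a cluster becomes empty.
        new_reps = np.array([points[labels == j].mean(axis=0)
                             if (labels == j).any() else reps[j]
                             for j in range(k)])
        if np.allclose(new_reps, reps):
            break  # representatives stopped moving: converged
        reps = new_reps
    return labels, reps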
4
Uncertain data
  • How about clustering uncertain data?
  • Sometimes a data value is uncertain but is known
    to lie within a certain region described by a
    probability density function (pdf). We call such
    a value uncertain data.
  • An example of uncertain data is the 2D location
    reported by a mobile device in a tracking system.
  • The device may already have moved to a location
    other than the reported one by the time the data
    is received.

5
  • The exact location of the device is uncertain but
    lies within a circular region determined by the
    maximum velocity v of the device and the time t
    elapsed since the location data was sent.

[Figure: circular uncertainty region of radius r = vt
around the reported location, with a uniform pdf f(x,y)]
6
  • The region is called the uncertainty region.
  • It can have an arbitrary shape, and its pdf can
    also be arbitrary.

[Figure: two uncertainty regions with arbitrary shapes
and pdfs f(x,y); in one, the region excludes a lake]
7
Clustering of uncertain data
  • In the mobile-device example, some applications
    require each device to communicate with a server.
  • The communication cost may depend on the
    communication distance, and it can be reduced by
    clustering the devices.

8
[Figure: devices grouped into clusters; each cluster
has a leader that communicates with the server]
  • Most devices' communications are short-distance,
    with the leaders of their own clusters.
  • Only the communications between the leaders and
    the server are long-distance.
  • So clustering can reduce the total communication
    cost of the devices.

9
UK-means
  • K-means can be used to cluster uncertain data by
    adopting a new distance function, called the
    expected distance, between a data object and a
    cluster representative.
  • We call this specialized K-means method UK-means.

10
  • The expected distance between a data object o and
    a cluster representative c is defined as
    ED(o,c) = ∫ d(x,c) f(x) dx, where the integral is
    taken over o's uncertainty region, f is o's pdf,
    and d is the underlying distance (e.g. Euclidean).
  • For a cluster C of uncertain data, its
    representative is assigned the mean of the
    centers of mass of all the data in that cluster:
    c = (1/|C|) Σ_{o∈C} ∫ x f_o(x) dx (a numerical
    sketch of both quantities follows below).

11
Major overhead in UK-means
  • A clustering process of UK-means consists of
    iterations.
  • In each iteration, for each data object o,
    UK-means assigns o to the cluster whose
    representative has the smallest expected distance
    to o. We refer to this as a cluster assignment.
  • In the brute-force approach, for the cluster
    assignment of each data object o, UK-means simply
    computes the expected distance between o and
    every cluster representative in order to find the
    smallest one.

12
  • In applications where the uncertainty regions or
    pdfs are arbitrary, computing an expected
    distance requires an expensive numerical
    integration.
  • Since the brute-force approach performs many
    expected distance computations, they become the
    major overhead of the whole clustering process.

13
  • Suppose a lower bound of the expected distance
    ED(o,c1) between a data object o and a cluster
    representative c1 is larger than an upper bound
    of ED(o,c2) for another cluster representative
    c2. Then, without computing ED(o,c1), we know
    that o cannot be assigned to c1 (we say c1 is
    pruned).

[Figure: data o with representatives c1 and c2; the
lower bound of ED(o,c1) exceeds the upper bound of
ED(o,c2), so c1 is pruned]
14
  • If this condition does not hold for c1, then
    ED(o,c1), ED(o,c2), or both may need to be
    computed in the cluster assignment of o.
  • For each data object, most cluster
    representatives will likely satisfy the
    condition, since in general each data object is
    much closer to a few of the cluster
    representatives than to the rest.
  • So the upper and lower bounds can reduce the
    number of expected distance computations in the
    whole clustering process (a sketch of a
    bound-based cluster assignment follows this
    list).
  • The amount of reduction depends on the tightness
    of the upper and lower bounds.
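A minimal sketch of such a bound-based cluster assignment.
Here lower, upper and exact stand for any of the bounding
approaches described below plus the exact expected distance
computation; all names are ours.

def assign_cluster(o, reps, lower, upper, exact):
    """Assign o to a cluster, pruning representatives by bounds."""
    # A representative whose lower bound exceeds the smallest
    # upper bound can never give the minimum expected distance.
    threshold = min(upper(o, c) for c in reps)
    candidates = [j for j, c in enumerate(reps)
                  if lower(o, c) <= threshold]
    # Compute exact expected distances only for the survivors.
    return min(candidates, key=lambda j: exact(o, reps[j]))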

15
Min-max-dist
  • The basic approach for computing upper and lower
    bounds of ED(o,c) is to compute the maximum
    distance (MaxDist) and the minimum distance
    (MinDist), respectively, between c and any point
    in the minimum bounding rectangle (MBR) of o's
    uncertainty region (a sketch follows the figure
    below).
  • The approach is called min-max-dist.
  • These bounds may bound ED(o,c) very loosely.

[Figure: MBR of o's uncertainty region, with MinDist
and MaxDist measured from representative c to the MBR]
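A minimal sketch of MinDist and MaxDist against an axis-aligned
MBR given by its lower and upper corners lo and hi; the names
are ours.

import numpy as np

def min_dist(c, lo, hi):
    """Minimum distance from point c to the MBR [lo, hi]."""
    # Clamp c into the box; the distance to the clamped point
    # is the minimum distance to the box.
    return float(np.linalg.norm(c - np.clip(c, lo, hi)))

def max_dist(c, lo, hi):
    """Maximum distance from point c to any point of the MBR."""
    # The farthest point of the box is a corner: pick, per axis,
    # whichever of lo/hi is farther from c.
    farthest = np.where(np.abs(c - lo) > np.abs(c - hi), lo, hi)
    return float(np.linalg.norm(c - farthest))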
16
Upre and Lpre
  • Using the pdf (and hence the uncertainty region)
    of the data object o instead of its MBR when
    computing the bounds of ED(o,c) may give tighter
    bounds.
  • So we propose two approaches, Upre and Lpre, that
    use the pdf to compute the upper and lower
    bounds, respectively.
  • First, some anchor points y are placed near the
    data object o, and the expected distance ED(o,y)
    is pre-computed for each anchor point y.

[Figure: anchor points y placed around the MBR of data o]
17
  • Then, by the triangle inequality, we obtain an
    upper bound of ED(o,c) for any cluster
    representative c: since d(p,c) ≤ d(p,y) + d(y,c)
    for every point p in o's uncertainty region,
    taking expectations gives
    ED(o,c) ≤ ED(o,y) + d(y,c).

[Figure: a point p in o's uncertainty region, an anchor
point y, and a representative c forming a triangle]
18
  • Then, in the Upre approach, the minimum of such
    bounds over all anchor points y is used as the
    upper bound of ED(o,c):
    ED(o,c) ≤ min_y ( ED(o,y) + d(y,c) ).
  • Similarly, by the triangle inequality, the lower
    bound of ED(o,c) in the Lpre approach is derived
    as ED(o,c) ≥ max_y | ED(o,y) - d(y,c) |.
  • The Upre and Lpre approaches try to reduce
    expected distance computations during cluster
    assignment, at the cost of a few extra expected
    distance computations (one per anchor point per
    object) before the clustering process starts (a
    sketch follows this list).
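A minimal sketch of the pre-computed bounds, reusing the
expected_distance helper sketched earlier; all names are ours.

import numpy as np

def precompute_anchor_eds(samples, weights, anchors, expected_distance):
    """Pre-compute ED(o,y) for each anchor point y placed near o."""
    return np.array([expected_distance(samples, weights, y)
                     for y in anchors])

def upre_bound(anchors, anchor_eds, c):
    """Upre upper bound: min over anchors y of ED(o,y) + d(y,c)."""
    d = np.linalg.norm(anchors - c, axis=1)
    return float(np.min(anchor_eds + d))

def lpre_bound(anchors, anchor_eds, c):
    """Lpre lower bound: max over anchors y of |ED(o,y) - d(y,c)|."""
    d = np.linalg.norm(anchors - c, axis=1)
    return float(np.max(np.abs(anchor_eds - d)))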

19
Ucs and Lcs
  • As mentioned before, if a cluster representative
    c1 cannot be pruned when the lower bound of
    ED(o,c1) is compared to the upper bound of
    ED(o,c2) for another cluster representative c2,
    some expected distances may have to be computed.
  • Suppose the min-max-dist approach is used for
    computing the upper and lower bounds.
  • And suppose that, after any pruning in the
    cluster assignment of o in an iteration of the
    clustering process, ED(o,c) still needs to be
    computed for some cluster representative c.

20
  • Then, in any later iteration, we propose two
    approaches, Ucs and Lcs, that use the computed
    expected distance ED(o,c) for computing the upper
    and lower bounds of ED(o,c'), respectively, where
    c' is the updated version of c in that iteration.
  • Note that the values of c and c' can differ,
    because the value of a cluster representative is
    updated in each iteration.
  • So the approaches Ucs and Lcs avoid the
    pre-computations of expected distances required
    by the approaches Upre and Lpre.
  • But they can only be used after the corresponding
    expected distance has been computed.

21
  • By the triangle inequality, in the Ucs approach
    the upper bound of ED(o,c') is computed as
    ED(o,c') ≤ ED(o,c) + D(c,c').
  • Similarly, in the Lcs approach the lower bound of
    ED(o,c') is computed as
    ED(o,c') ≥ ED(o,c) - D(c,c').
  • The values of these bounds approach ED(o,c') and
    hence become tighter as D(c,c') becomes smaller
    (a sketch follows this list).
  • D(c,c') becomes smaller, and eventually zero, in
    the later iterations of a clustering process as
    the representatives converge.
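A minimal sketch of the cross-iteration bounds; ed_o_c is the
expected distance ED(o,c) computed in an earlier iteration, and
the names are ours.

import numpy as np

def ucs_bound(ed_o_c, c_old, c_new):
    """Ucs upper bound on ED(o,c'): ED(o,c) + D(c,c')."""
    return ed_o_c + float(np.linalg.norm(c_new - c_old))

def lcs_bound(ed_o_c, c_old, c_new):
    """Lcs lower bound on ED(o,c'): ED(o,c) - D(c,c'), clipped
    at zero since a distance cannot be negative."""
    return max(0.0, ed_o_c - float(np.linalg.norm(c_new - c_old)))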

22
Experimental results
  • We want to see how many expected distance
    computations in the brute-force approach of the
    cluster assignment are saved when the approaches
    of upper and lower bounds are used.
  • And see how the saving affects the efficiency of
    the whole clustering process.
  • So, we conduct experiments on random 2D uncertain
    data with or without cluster patterns.
  • Each data has a random uncertainty region of a
    random pdf.

23
  • The pdf is approximated by s sample points for
    computing expected distances.
  • The larger s is, the more accurate the computed
    expected distance (and the higher the computation
    cost).
  • We vary the parameters: s, the maximum size of
    the uncertainty regions (d), the number of
    objects (n), and the number of clusters (K).
  • The cluster representatives are initialized
    either uniformly among the data, or as the
    centers of mass of the uncertainty regions of
    randomly selected data objects.

24
  • In the experiments, the approaches Upre, Lpre,
    Ucs and Lcs also use the bounds computed by the
    min-max-dist approach for pruning.
  • For example, in the Upre approach, the minimum
    (tightest) of its own upper bound and the one
    computed by the min-max-dist approach is used as
    the actual upper bound for pruning in a cluster
    assignment.

25
  • K (= 49) expected distances are computed per
    object per iteration by the brute-force approach.

26
  • K (= 49) expected distances are computed per
    object per iteration by the brute-force approach.

27
  • 100% of the K expected distances are computed per
    object per iteration by the brute-force approach.

28
  • The run-time of the clustering process using the
    proposed approaches is at least 10 times shorter
    than that of the brute-force approach.

29
(No Transcript)
30
Conclusions
  • Approaches are proposed for reducing expected
    distance computations, the major overhead of the
    brute-force approach of UK-means.
  • In most experiments, using both Ucs and Lcs
    reduces the overhead by a factor of at least 200
    (10 times more than min-max-dist alone), which
    results in at least a 10-fold increase in
    clustering efficiency.
  • If the expected distance pre-computations of Upre
    and Lpre can be discounted in an application,
    using all the proposed approaches together gives
    the greatest overhead reduction, and should yield
    the greatest increase in clustering efficiency.