Efficient Clustering of Uncertain Data

1
Efficient Clustering of Uncertain Data
  • Wang Kay Ngai, Ben Kao, Chun Kit Chui, Reynold
    Cheng, Michael Chau, Kevin Y. Yip

Speaker: Wang Kay Ngai
2
Data clustering
  • Data clustering is used to discover grouping
    patterns in a data set.
  • One kind of clustering partitions the data set
    into groups, called clusters, such that data
    within the same cluster are closer to each other
    (under some distance function, such as the
    Euclidean distance) than to data in any other
    cluster.

3
  • K-means is a common method that achieves this
    partitioning by ensuring that each data item is
    closer to the representative of its own cluster
    than to the representatives of all other clusters
    (a minimal sketch follows the figure below).
  • The representative of a cluster is the mean of
    all the data in that cluster.

[Figure: data points partitioned into three clusters,
each with its own representative]
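For reference, here is a minimal sketch of plain K-means on
certain (point-valued) data; the function and variable names
are ours, not from the paper.

import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on certain (point-valued) data."""
    rng = np.random.default_rng(seed)
    # Initialize representatives with k distinct data points.
    reps = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest representative (Euclidean).
        dists = np.linalg.norm(points[:, None, :] - reps[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each representative to the mean of its cluster;
        # keep the old value if a cluster becomes empty.
        new_reps = np.array([points[labels == j].mean(axis=0)
                             if (labels == j).any() else reps[j]
                             for j in range(k)])
        if np.allclose(new_reps, reps):
            break  # representatives stopped moving: converged
        reps = new_reps
    return labels, reps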
4
Uncertain data
  • How about clustering uncertain data?
  • Sometimes a data value is uncertain but is known
    to lie within a certain region described by a
    probability density function (pdf). We call such
    a value uncertain data.
  • An example of uncertain data is the 2D location
    reported by a mobile device in a tracking system.
  • The device may already have moved to a location
    other than the reported one by the time the data
    is received.

5
  • The exact location of the device is uncertain but
    lies within a circular region determined by the
    maximum velocity v of the device and the time t
    elapsed since the location data was sent.

[Figure: circular uncertainty region of radius r = vt
around the reported location, with a uniform pdf f(x,y)]
6
  • The region is called the uncertainty region.
  • It can have an arbitrary shape, and its pdf can
    also be arbitrary.

[Figure: two uncertainty regions with arbitrary shapes
and pdfs f(x,y); in one, the region excludes a lake]
7
Clustering of uncertain data
  • In the mobile-device example, some applications
    require each device to communicate with a server.
  • The communication cost may depend on the
    communication distance, and it can be reduced by
    clustering the devices.

8
[Figure: devices grouped into clusters; each cluster
has a leader that communicates with the server]
  • Most devices' communications are short-distance,
    with the leaders of their own clusters.
  • Only the communications between the leaders and
    the server are long-distance.
  • So clustering can reduce the total communication
    cost of the devices.

9
UK-means
  • K-means can be used to cluster uncertain data by
    adopting a new distance function, called the
    expected distance, between a data object and a
    cluster representative.
  • We call this specialized K-means method UK-means.

10
  • The expected distance between a data object o and
    a cluster representative c is defined as
    ED(o,c) = ∫ d(x,c) f(x) dx, where the integral is
    taken over o's uncertainty region, f is o's pdf,
    and d is the underlying distance (e.g. Euclidean).
  • For a cluster C of uncertain data, its
    representative is assigned the mean of the
    centers of mass of all the data in that cluster:
    c = (1/|C|) Σ_{o∈C} ∫ x f_o(x) dx (a numerical
    sketch of both quantities follows below).

11
Major overhead in UK-means
  • A clustering process of UK-means consists of
    iterations.
  • In each iteration, for each data object o,
    UK-means assigns o to the cluster whose
    representative has the smallest expected distance
    to o. We refer to this as a cluster assignment.
  • In the brute-force approach, for the cluster
    assignment of each data object o, UK-means simply
    computes the expected distance between o and
    every cluster representative in order to find the
    smallest one.

12
  • In applications where the uncertainty regions or
    pdfs are arbitrary, computing an expected
    distance requires an expensive numerical
    integration.
  • Since the brute-force approach performs many
    expected distance computations, they become the
    major overhead of the whole clustering process.

13
  • Suppose a lower bound of the expected distance
    ED(o,c1) between a data object o and a cluster
    representative c1 is larger than an upper bound
    of ED(o,c2) for another cluster representative
    c2. Then, without computing ED(o,c1), we know
    that o cannot be assigned to c1 (we say c1 is
    pruned).

[Figure: data o with representatives c1 and c2; the
lower bound of ED(o,c1) exceeds the upper bound of
ED(o,c2), so c1 is pruned]
14
  • If this condition does not hold for c1, then
    ED(o,c1), ED(o,c2), or both may need to be
    computed in the cluster assignment of o.
  • For each data object, most cluster
    representatives will likely satisfy the
    condition, since in general each data object is
    much closer to a few of the cluster
    representatives than to the rest.
  • So the upper and lower bounds can reduce the
    number of expected distance computations in the
    whole clustering process (a sketch of a
    bound-based cluster assignment follows this
    list).
  • The amount of reduction depends on the tightness
    of the upper and lower bounds.
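A minimal sketch of such a bound-based cluster assignment.
Here lower, upper and exact stand for any of the bounding
approaches described below plus the exact expected distance
computation; all names are ours.

def assign_cluster(o, reps, lower, upper, exact):
    """Assign o to a cluster, pruning representatives by bounds."""
    # A representative whose lower bound exceeds the smallest
    # upper bound can never give the minimum expected distance.
    threshold = min(upper(o, c) for c in reps)
    candidates = [j for j, c in enumerate(reps)
                  if lower(o, c) <= threshold]
    # Compute exact expected distances only for the survivors.
    return min(candidates, key=lambda j: exact(o, reps[j]))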

15
Min-max-dist
  • The basic approach for computing upper and lower
    bounds of ED(o,c) is to compute the maximum
    distance (MaxDist) and the minimum distance
    (MinDist), respectively, between c and any point
    in the minimum bounding rectangle (MBR) of o's
    uncertainty region (a sketch follows the figure
    below).
  • The approach is called min-max-dist.
  • These bounds may bound ED(o,c) very loosely.

[Figure: MBR of o's uncertainty region, with MinDist
and MaxDist measured from representative c to the MBR]
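A minimal sketch of MinDist and MaxDist against an axis-aligned
MBR given by its lower and upper corners lo and hi; the names
are ours.

import numpy as np

def min_dist(c, lo, hi):
    """Minimum distance from point c to the MBR [lo, hi]."""
    # Clamp c into the box; the distance to the clamped point
    # is the minimum distance to the box.
    return float(np.linalg.norm(c - np.clip(c, lo, hi)))

def max_dist(c, lo, hi):
    """Maximum distance from point c to any point of the MBR."""
    # The farthest point of the box is a corner: pick, per axis,
    # whichever of lo/hi is farther from c.
    farthest = np.where(np.abs(c - lo) > np.abs(c - hi), lo, hi)
    return float(np.linalg.norm(c - farthest))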
16
Upre and Lpre
  • Using the pdf (and hence the uncertainty region)
    of the data object o instead of its MBR when
    computing the bounds of ED(o,c) may give tighter
    bounds.
  • So we propose two approaches, Upre and Lpre, that
    use the pdf to compute the upper and lower
    bounds, respectively.
  • First, some anchor points y are placed near the
    data object o, and the expected distance ED(o,y)
    is pre-computed for each anchor point y.

[Figure: anchor points y placed around the MBR of data o]
17
  • Then, by the triangle inequality, we obtain an
    upper bound of ED(o,c) for any cluster
    representative c: since d(p,c) ≤ d(p,y) + d(y,c)
    for every point p in o's uncertainty region,
    taking expectations gives
    ED(o,c) ≤ ED(o,y) + d(y,c).

[Figure: a point p in o's uncertainty region, an anchor
point y, and a representative c forming a triangle]
18
  • Then, in the Upre approach, the minimum of such
    bounds over all anchor points y is used as the
    upper bound of ED(o,c):
    ED(o,c) ≤ min_y ( ED(o,y) + d(y,c) ).
  • Similarly, by the triangle inequality, the lower
    bound of ED(o,c) in the Lpre approach is derived
    as ED(o,c) ≥ max_y | ED(o,y) - d(y,c) |.
  • The Upre and Lpre approaches try to reduce
    expected distance computations during cluster
    assignment, at the cost of a few extra expected
    distance computations (one per anchor point per
    object) before the clustering process starts (a
    sketch follows this list).
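A minimal sketch of the pre-computed bounds, reusing the
expected_distance helper sketched earlier; all names are ours.

import numpy as np

def precompute_anchor_eds(samples, weights, anchors, expected_distance):
    """Pre-compute ED(o,y) for each anchor point y placed near o."""
    return np.array([expected_distance(samples, weights, y)
                     for y in anchors])

def upre_bound(anchors, anchor_eds, c):
    """Upre upper bound: min over anchors y of ED(o,y) + d(y,c)."""
    d = np.linalg.norm(anchors - c, axis=1)
    return float(np.min(anchor_eds + d))

def lpre_bound(anchors, anchor_eds, c):
    """Lpre lower bound: max over anchors y of |ED(o,y) - d(y,c)|."""
    d = np.linalg.norm(anchors - c, axis=1)
    return float(np.max(np.abs(anchor_eds - d)))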

19
Ucs and Lcs
  • As mentioned before, if a cluster representative
    c1 cannot be pruned when the lower bound of
    ED(o,c1) is compared to the upper bound of
    ED(o,c2) for another cluster representative c2,
    some expected distances may have to be computed.
  • Suppose the min-max-dist approach is used for
    computing the upper and lower bounds.
  • And suppose that, after any pruning in the
    cluster assignment of o in an iteration of the
    clustering process, ED(o,c) still needs to be
    computed for some cluster representative c.

20
  • Then, in any later iteration, we propose two
    approaches, Ucs and Lcs, that use the computed
    expected distance ED(o,c) for computing the upper
    and lower bounds of ED(o,c'), respectively, where
    c' is the updated version of c in that iteration.
  • Note that the values of c and c' can differ,
    because the value of a cluster representative is
    updated in each iteration.
  • So the approaches Ucs and Lcs avoid the
    pre-computations of expected distances required
    by the approaches Upre and Lpre.
  • But they can only be used after the corresponding
    expected distance has been computed.

21
  • By the triangle inequality, in the Ucs approach
    the upper bound of ED(o,c') is computed as
    ED(o,c') ≤ ED(o,c) + D(c,c').
  • Similarly, in the Lcs approach the lower bound of
    ED(o,c') is computed as
    ED(o,c') ≥ ED(o,c) - D(c,c').
  • The values of these bounds approach ED(o,c') and
    hence become tighter as D(c,c') becomes smaller
    (a sketch follows this list).
  • D(c,c') becomes smaller, and eventually zero, in
    the later iterations of a clustering process as
    the representatives converge.
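A minimal sketch of the cross-iteration bounds; ed_o_c is the
expected distance ED(o,c) computed in an earlier iteration, and
the names are ours.

import numpy as np

def ucs_bound(ed_o_c, c_old, c_new):
    """Ucs upper bound on ED(o,c'): ED(o,c) + D(c,c')."""
    return ed_o_c + float(np.linalg.norm(c_new - c_old))

def lcs_bound(ed_o_c, c_old, c_new):
    """Lcs lower bound on ED(o,c'): ED(o,c) - D(c,c'), clipped
    at zero since a distance cannot be negative."""
    return max(0.0, ed_o_c - float(np.linalg.norm(c_new - c_old)))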

22
Experimental results
  • We want to see how many expected distance
    computations in the brute-force approach of the
    cluster assignment are saved when the approaches
    of upper and lower bounds are used.
  • And see how the saving affects the efficiency of
    the whole clustering process.
  • So, we conduct experiments on random 2D uncertain
    data with or without cluster patterns.
  • Each data has a random uncertainty region of a
    random pdf.

23
  • The pdf is approximated by s sample points for
    computing expected distances.
  • The larger s is, the more accurate the computed
    expected distance (and the higher the computation
    cost).
  • We vary the parameters: s, the maximum size of
    the uncertainty regions (d), the number of
    objects (n), and the number of clusters (K).
  • The cluster representatives are initialized
    either uniformly among the data, or as the
    centers of mass of the uncertainty regions of
    randomly selected data objects.

24
  • In the experiments, the approaches Upre, Lpre,
    Ucs and Lcs also use the bounds computed by the
    min-max-dist approach for pruning.
  • For example, in the Upre approach, the minimum
    (tightest) of its own upper bound and the one
    computed by the min-max-dist approach is used as
    the actual upper bound for pruning in a cluster
    assignment.

25
  • K (= 49) expected distances are computed per
    object per iteration by the brute-force approach.

26
  • K (= 49) expected distances are computed per
    object per iteration by the brute-force approach.

27
  • 100% of the K expected distances are computed per
    object per iteration by the brute-force approach.

28
  • The run-time of the clustering process using the
    proposed approaches is at least 10 times shorter
    than that of the brute-force approach.

29
(No Transcript)
30
Conclusions
  • Approaches are proposed for reducing expected
    distance computations, the major overhead of the
    brute-force approach of UK-means.
  • In most experiments, using both Ucs and Lcs
    reduces the overhead by a factor of at least 200
    (10 times more than min-max-dist alone), which
    results in at least a 10-fold increase in
    clustering efficiency.
  • If the expected distance pre-computations of Upre
    and Lpre can be discounted in an application,
    using all the proposed approaches together gives
    the greatest overhead reduction, and should yield
    the greatest increase in clustering efficiency.