Clustering with k-means: faster, smarter, cheaper
Transcript and Presenter's Notes

Title: Clustering with k-means: faster, smarter, cheaper


1
Clustering with k-means: faster, smarter, cheaper
  • Charles Elkan
  • University of California, San Diego
  • April 24, 2004

2
Acknowledgments
  • Funding from Sun Microsystems, with sponsor Dr.
    Kenny Gross.
  • Advice from colleagues and students, especially
  • Sanjoy Dasgupta (UCSD),
  • Greg Hamerly (Baylor University starting Fall
    04),
  • Doug Turnbull.

3
Clustering is difficult!
Source: Patrick de Smet, University of Ghent
4
The standard k-means algorithm
  • Input: n points, a distance function d(), and the
    number k of clusters to find.
  • 1. Start with k centers
  • 2. Compute d(each point x, each center c)
  • 3. For each x, find its closest center c(x)   [ALLOCATE]
  • 4. If no point has changed owner c(x), stop
  • 5. Set each c ← mean of the points owned by it   [LOCATE]
  • 6. Repeat from step 2
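
As a concrete reference, here is a minimal NumPy sketch of the allocate/locate loop above (function and variable names are illustrative, not from the talk):

    import numpy as np

    def kmeans(X, centers, max_iter=100):
        """Standard k-means: alternate ALLOCATE and LOCATE until no point changes owner."""
        centers = np.asarray(centers, dtype=float).copy()
        owner = None
        for _ in range(max_iter):
            # ALLOCATE: assign each point to its closest center (squared Euclidean distance).
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_owner = d2.argmin(axis=1)
            if owner is not None and np.array_equal(new_owner, owner):
                break                      # no point changed owner: stop
            owner = new_owner
            # LOCATE: move each center to the mean of the points it owns.
            for j in range(len(centers)):
                members = X[owner == j]
                if len(members) > 0:
                    centers[j] = members.mean(axis=0)
        return centers, owner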

5
A typical k-means result
6
Observations
  • Theorem: If d() is Euclidean, then k-means
    converges monotonically to a local minimum of
    the within-class squared distortion
    Σx d(c(x), x)²
  • Many variants, complex history since 1956, over
    100 papers per year currently
  • Iterative, related to expectation-maximization
    (EM)
  • Number of iterations to converge grows slowly
    with n, k, d
  • No accepted method exists to discover k.

7
We want to
  • 1. make the algorithm faster.
  • 2. find lower-cost local minima.
    (Finding the global optimum is NP-hard.)
  • 3. choose the correct k intelligently.
  • With success at (1), we can try more alternatives
    for (2).
  • With success at (2), comparisons for different k
    are less likely to be misleading.

8
Is this clustering better?
9
Or is this better?
10
Standard initialization methods
  • Forgy initialization: choose k points at random
    as starting center locations.
  • Random partitions: divide the data points
    randomly into k subsets.
  • Both these methods are bad.
  • E. Forgy. Cluster analysis of multivariate data:
    efficiency vs. interpretability of
    classifications. Biometrics, 21(3):768, 1965.
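
For reference, both initializations are one-liners; a sketch, assuming a NumPy data matrix X of shape (n, d):

    import numpy as np

    rng = np.random.default_rng(0)

    def forgy_init(X, k):
        # Forgy: pick k distinct data points as the starting center locations.
        return X[rng.choice(len(X), size=k, replace=False)].astype(float)

    def random_partition_init(X, k):
        # Random partitions: assign each point to a random subset, then use the subset means.
        # (Assumes every subset receives at least one point.)
        labels = rng.integers(0, k, size=len(X))
        return np.array([X[labels == j].mean(axis=0) for j in range(k)])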

11
Forgy initialization
12
k-means result
13
Smarter initialization
  • The "furthest-first" algorithm (FF):
  • Pick the first center randomly.
  • Next is the point furthest from the first center.
  • Third is the point furthest from both previous
    centers.
  • In general, the next center is argmax_x min_c d(x,c)
  • D. Hochbaum, D. Shmoys. A best possible heuristic
    for the k-center problem. Mathematics of
    Operations Research, 10(2):180-184, 1985.
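
A sketch of furthest-first traversal with Euclidean distance (names are illustrative):

    import numpy as np

    def furthest_first(X, k, rng=None):
        """Furthest-first traversal: the next center is argmax_x min_c d(x, c)."""
        if rng is None:
            rng = np.random.default_rng(0)
        centers = [X[rng.integers(len(X))]]            # first center chosen at random
        # min_d2[i] = squared distance from X[i] to its nearest chosen center so far
        min_d2 = ((X - centers[0]) ** 2).sum(axis=1)
        for _ in range(k - 1):
            nxt = int(min_d2.argmax())                 # point furthest from all chosen centers
            centers.append(X[nxt])
            min_d2 = np.minimum(min_d2, ((X - X[nxt]) ** 2).sum(axis=1))
        return np.array(centers, dtype=float)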

14
Furthest-first initialization
15
Subset furthest-first (SFF)
  • FF finds outliers, which by definition are not
    good cluster centers!
  • Can we choose points far apart and typical of the
    dataset?
  • Idea: A random sample includes many
    representative points, but few outliers.
  • But: how big should the random sample be?
  • Lemma: Given k equal-size sets and c > 1, with
    high probability ck·log k random points
    intersect each set.
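
Subset furthest-first is then a two-step recipe: sample roughly c·k·log k points, and run FF on the sample. A sketch, reusing the furthest_first function from the previous slide:

    import numpy as np

    def subset_furthest_first(X, k, c=2, rng=None):
        """Run furthest-first on a random sample of about c*k*log(k) points."""
        if rng is None:
            rng = np.random.default_rng(0)
        m = min(len(X), max(k, int(np.ceil(c * k * np.log(max(k, 2))))))
        sample = X[rng.choice(len(X), size=m, replace=False)]
        return furthest_first(sample, k, rng)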

16
Subset furthest-first (c = 2)
17
Comparing initialization methods
(Chart comparing initialization methods. A value of 218 means 218% worse than
the best clustering known. Lower is better.)
18
How to find lower-cost local minima
  • Random restarts, even initialized well, are
    inadequate.
  • The central limit catastrophe: almost all local
    minima are only averagely good.
  • K. D. Boese, A. B. Kahng, S. Muddu. A new
    adaptive multi-start technique for combinatorial
    global optimizations. Operations Research Letters,
    16 (1994) 101-113.
  • The art of designing a local search algorithm:
    defining a neighborhood rich in improving
    candidate moves.

19
Our local search method
  • k-means alternates two guaranteed-improvement
    steps: allocate and locate.
  • Sadly, we know no other guaranteed-improvement
    steps.
  • So, we do non-guaranteed jump operations:
    delete an existing center and create a new center
    at a data point.
  • After each jump, run k-means to convergence
    starting with an allocate step.

20
(Figure: add a center below; remove a center at the left.)
21
Theory versus practice
  • Theorem: Let C be a set of centers such that no
    jump operation improves the value of C. Then C
    is at most 25 times worse than the global
    optimum.
  • T. Kanungo et al. A local search approximation
    algorithm for k-means clustering. ACM Symposium on
    Computational Geometry, 2002.
  • Our aim: find heuristics to identify jump steps
    that are likely to be good.
  • Experiments indicate we can solve problems with
    up to 2000 points and 20 centers optimally.

22
An upper bound ...
  • Lemma 1: The loss from removing center c is at
    most the cost of merging its cluster into that of
    the nearest center.
  • Proof:
  • Suppose b is the center closest to c; let B and C
    be the subsets owned by b and c, with m = |B| and
    n = |C|.
  • If B and C merge, the new center is
    b' = (mb + nc)/(m + n).
  • Because c is the mean of C, for any z:
    Σ_{x in C} d(z,x)² = Σ_{x in C} d(c,x)² + n·d(z,c)².
  • So the loss from the merge is
    n·d(b',c)² + m·d(b',b)².
  • This computation is cheap, so we do it for every
    center.
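
A sketch of this per-center computation, following the proof above (centers is a (k, d) array of current centers and owner holds c(x) for each point; names are illustrative):

    import numpy as np

    def removal_loss_upper_bounds(centers, owner):
        """Upper-bound the distortion increase from deleting each center (merge into nearest b)."""
        k = len(centers)
        cc2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        np.fill_diagonal(cc2, np.inf)
        bounds = np.empty(k)
        for c in range(k):
            b = int(cc2[c].argmin())                  # nearest other center
            n = int((owner == c).sum())               # n = |C|, points owned by c
            m = int((owner == b).sum())               # m = |B|, points owned by b
            merged = (m * centers[b] + n * centers[c]) / max(m + n, 1)   # b' = (mb + nc)/(m + n)
            # Loss if B and C merge around b':  n*d(b',c)^2 + m*d(b',b)^2
            bounds[c] = (n * ((merged - centers[c]) ** 2).sum()
                         + m * ((merged - centers[b]) ** 2).sum())
        return bounds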

23
and a lower bound
  • Suppose we add a new center at point z.
  • Lemma 2: The gain from adding a center at z is at
    least
    Σ_{x : d(x,c(x)) > d(x,z)} [ d(x,c(x))² − d(x,z)² ].
  • This computation is more expensive, so we do it
    for only 2k log k random candidates z.
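
In code, this bound is one pass over the data per candidate z (a sketch; owner_d2 is assumed to hold d(x, c(x))² for every point):

    import numpy as np

    def addition_gain_lower_bound(X, owner_d2, z):
        """Lower bound on the distortion decrease from adding a center at point z (Lemma 2)."""
        dz2 = ((X - z) ** 2).sum(axis=1)
        closer = owner_d2 > dz2                 # points that would defect to the new center z
        return float((owner_d2[closer] - dz2[closer]).sum())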

24
Sometimes a jump should only be a jiggle
  • How to use Lemmas 1 and 2:
  • delete the center with smallest maximum loss,
  • make new center at point with greatest minimum
    gain.
  • This procedure identifies good global
    improvements.
  • Small-scale improvements come from jiggling the
    center of an existing cluster: moving the center
    to a point inside the same cluster.

25
jj-means: the smarter k-means algorithm
  • Run k-means with SFF initialization.
  • Repeat:
  • While there is improvement, do:
  • Try the best jump according to Lemmas 1 and 2.
  • Until there is improvement, do:
  • Try a random jiggle.
  • "Try" means: run k-means to convergence afterwards.
  • Insert random jumps to satisfy the theorem.
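
One way to read this recipe as a loop, reusing the kmeans and subset_furthest_first sketches above; best_jump and random_jiggle are hypothetical helpers that would pick moves using Lemmas 1 and 2 and a random within-cluster point, respectively:

    import numpy as np

    def distortion(X, centers, owner):
        # Within-class squared distortion: sum over x of d(c(x), x)^2.
        return float(((X - centers[owner]) ** 2).sum())

    def jj_means(X, k, n_jiggles=20):
        centers, owner = kmeans(X, subset_furthest_first(X, k))
        best_cost = distortion(X, centers, owner)
        while True:
            # While improvement: keep taking the best-looking jump (Lemmas 1 and 2).
            improved = True
            while improved:
                cand, cand_owner = kmeans(X, best_jump(X, centers, owner))       # hypothetical helper
                cost = distortion(X, cand, cand_owner)
                improved = cost < best_cost
                if improved:
                    centers, owner, best_cost = cand, cand_owner, cost
            # Until improvement: try random jiggles; stop if none of them helps.
            for _ in range(n_jiggles):
                cand, cand_owner = kmeans(X, random_jiggle(X, centers, owner))   # hypothetical helper
                cost = distortion(X, cand, cand_owner)
                if cost < best_cost:
                    centers, owner, best_cost = cand, cand_owner, cost
                    break
            else:
                return centers, owner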

26
Results with 1000 points, 8 dimensions, 10 centers
Conclusion: Running 10x longer is faster and
better than restarting 10x.
27
Goal: Make k-means faster, but with the same answer
  • Allow any black-box d(),
  • any initialization method.
  • In later iterations, little movement of
    centers.
  • Distance calculations use the most time.
  • Geometrically, these are mostly redundant.

Source: D. Pelleg.
28
  • Let x be a point, c(x) its owner, and c a
    different center.
  • If we already know d(x,c) ≥ d(x,c(x)),
  • then computing d(x,c) precisely is not
    necessary.
  • Strategy: Use the triangle inequality d(x,z) ≤
    d(x,y) + d(y,z) to get sufficient conditions for
    d(x,c) ≥ d(x,b).
  • kd-trees are useful up to about 10 dimensions.
  • Distance-based data structures can be better.
  • Our approach is adaptive.

29
  • Lemma 1: Let x be a point, and let b and c be
    centers.
  • If d(b,c) ≥ 2·d(x,b), then d(x,c) ≥ d(x,b).
  • Proof: We know d(b,c) ≤ d(b,x) + d(x,c). So
    d(b,c) − d(x,b) ≤ d(x,c). Now d(b,c) −
    d(x,b) ≥ 2·d(x,b) − d(x,b) = d(x,b). So d(x,b)
    ≤ d(x,c).

30
  • Lemma 2: Let x be a point, and let b and c be
    centers.
  • Then d(x,c) ≥ max{ 0, d(x,b) − d(b,c) }.
  • Proof: We know d(x,b) ≤ d(x,c) + d(b,c),
  • so d(x,c) ≥ d(x,b) − d(b,c).
  • Also d(x,c) ≥ 0.

31
How to use Lemma 1
  • Let c(x) be the owner of point x, and c' another
    center;
  • compute d(x,c') only if
  • d(x,c(x)) > ½·d(c(x),c').
  • If we know an upper bound u(x) ≥ d(x,c(x)),
  • compute d(x,c') and d(x,c(x)) only if
  • u(x) > ½·d(c(x),c').
  • If u(x) ≤ ½·min_{c' ≠ c(x)} d(c(x), c'),
  • eliminate all distance calculations for x.
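
In code, this screening step is a cheap vectorized comparison done before any new point-to-center distance is computed (a sketch; centers, u and owner are assumed arrays holding the current centers, the upper bounds u(x), and the owners c(x)):

    import numpy as np

    # s[c] = 1/2 * min over c' != c of d(c, c').
    cc = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(cc, np.inf)
    s = 0.5 * cc.min(axis=1)
    # Points with u(x) <= s[c(x)] need no distance calculations at all this iteration.
    active = u > s[owner]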

32
How to use Lemma 2
  • Let x be any point and c any center;
  • let c' be the position of c at the previous iteration.
  • Assume the previous lower bound d(x,c') ≥ l'.
  • Then we get a new lower bound for the current
    iteration:
  • d(x,c) ≥ max{ 0, d(x,c') − d(c, c') }
  •        ≥ max{ 0, l' − d(c, c') }.
  • If l' is a good approximation, and the center
    only moves slightly, then we get a good updated
    approximation.
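
The corresponding bound update after the centers move (a sketch; l is the (n, k) array of lower bounds, u the upper bounds, and new_centers the recomputed means):

    import numpy as np

    # shift[c] = d(c, c'), how far each center moved this iteration.
    shift = np.sqrt(((new_centers - centers) ** 2).sum(axis=1))
    l = np.maximum(0.0, l - shift[None, :])    # Lemma 2: lower bounds loosen by the center's move
    u = u + shift[owner]                       # upper bounds loosen by the owner's move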

33
  • Pick initial centers c.
  • For all x and c, compute d(x,c).
  • Initialize lower bounds l(x,c) ← d(x,c).
  • Initialize upper bounds u(x) ← min_c d(x,c).
  • Initialize ownership c(x) ← argmin_c d(x,c).
  • Repeat until convergence:
  • Find all x s.t. u(x) ≤ ½·min_{c' ≠ c(x)}
    d(c(x), c'); these x need no distance calculations.
  • For each remaining x and each c ≠ c(x) s.t.
  •   u(x) > l(x,c) and
  •   u(x) > ½·d(c(x), c):
  •   Compute d(x,c) and d(x,c(x)).
  •   If d(x,c) < d(x,c(x)), then change owner c(x) ← c.
  •   Update l(x,c) ← d(x,c) and u(x) ← d(x,c(x)).
  • For each c, m(c) ← mean of the points owned by c.
  • For each x and c, update l(x,c) ← max{ 0,
    l(x,c) − d(m(c), c) }.
  • For each x, update u(x) ← u(x) + d(c(x), m(c(x))).
  • Update each center c ← m(c).
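
Putting the pieces together, a compact NumPy sketch of the accelerated algorithm listed above (illustrative and untuned; it assumes Euclidean distance and an (n, d) data matrix X):

    import numpy as np

    def elkan_kmeans(X, centers, max_iter=100):
        centers = np.asarray(centers, dtype=float).copy()
        n, k = len(X), len(centers)
        # Initialize all bounds with one full pass of exact distances.
        d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        l = d.copy()                          # lower bounds l(x, c)
        owner = d.argmin(axis=1)              # ownership c(x)
        u = d[np.arange(n), owner]            # upper bounds u(x)
        for _ in range(max_iter):
            cc = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
            np.fill_diagonal(cc, np.inf)
            s = 0.5 * cc.min(axis=1)
            moved = False
            for i in np.flatnonzero(u > s[owner]):      # points not screened out by Lemma 1
                ci, ux = owner[i], u[i]
                tight = False                           # is u[i] currently an exact distance?
                for c in range(k):
                    if c == ci or ux <= l[i, c] or ux <= 0.5 * cc[ci, c]:
                        continue                        # pruned by the bounds
                    if not tight:
                        ux = np.sqrt(((X[i] - centers[ci]) ** 2).sum())
                        u[i] = l[i, ci] = ux
                        tight = True
                        if ux <= l[i, c] or ux <= 0.5 * cc[ci, c]:
                            continue
                    dxc = np.sqrt(((X[i] - centers[c]) ** 2).sum())
                    l[i, c] = dxc
                    if dxc < ux:
                        ci, ux = c, dxc
                        owner[i], u[i] = ci, ux
                        moved = True
            # Recompute the means, then loosen all bounds by how far each center moves.
            new_centers = np.array([X[owner == j].mean(axis=0) if np.any(owner == j)
                                    else centers[j] for j in range(k)])
            shift = np.sqrt(((new_centers - centers) ** 2).sum(axis=1))
            l = np.maximum(0.0, l - shift[None, :])
            u = u + shift[owner]
            centers = new_centers
            if not moved and shift.max() == 0.0:
                break
        return centers, owner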

34
Notes on the new algorithm
  • Empirical issue: which checks to do, and in which
    order.
  • Implement "for each remaining x and c" by looping
    over c, with vectorized code processing all x
    together.
  • Or, sequentially scan x and l(x,c) from disk.
  • The obvious initialization computes O(nk) distances.
    Faster methods give inexact l(x,c) and
    u(x), and hence may do more distance
    calculations later.

36
Experimental observations
  • Natural clusters are found while computing the
    distance between each point and each center less
    than once!
  • We find k = 100 clusters in n = 150,000 covtype
    points with 7,353,400 < nk = 15,000,000 distance
    calculations.
  • Number of distance calculations is o(kc), because
    later iterations compute very few distances.

37
Current limitations
  • Computing distances is no longer the dominant
    cost.
  • Reason: after each iteration, we
  • update nk lower bounds l(x,c)
  • use O(kd) time to recompute k means
  • use O(k2d) time to recompute all inter-center
    distances
  • Moreover, we can approximate distances in
    o(d) time, by considering the largest dimensions
    first.

38
Deeper questions
  • What is the minimum number of distance calculations
    needed?
  • Adversary argument? If some calculations are
    omitted, an opponent can choose their values to
    make any clustering algorithm's output incorrect.
  • Can we extend to clustering with general Bregman
    divergences?
  • Can we extend to soft-assignment clustering? Via
    lower and upper bounds on weights?