Title: Clustering with k-means: faster, smarter, cheaper
1. Clustering with k-means: faster, smarter, cheaper
- Charles Elkan
- University of California, San Diego
- April 24, 2004
2. Acknowledgments
- Funding from Sun Microsystems, with sponsor Dr. Kenny Gross.
- Advice from colleagues and students, especially
  - Sanjoy Dasgupta (UCSD),
  - Greg Hamerly (Baylor University, starting Fall '04),
  - Doug Turnbull.
3. Clustering is difficult!
Source: Patrick de Smet, University of Ghent
4. The standard k-means algorithm
- Input: n points, a distance function d(), and the number k of clusters to find.
- Steps:
  1. Start with k centers.
  2. Compute d(each point x, each center c).
  3. For each x, find the closest center c(x). [ALLOCATE]
  4. If no point has changed owner c(x), stop.
  5. Each c ← mean of the points it owns. [LOCATE]
  6. Repeat from step 2.
(A minimal code sketch of this loop follows below.)
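As a concrete reference, here is a minimal sketch of the loop above in Python/NumPy (Lloyd's algorithm). The function name, the Euclidean distance, and the random Forgy-style start are my own illustrative choices, not the talk's code.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: ALLOCATE / LOCATE until no point changes owner."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # step 1
    owner = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 2-3 (ALLOCATE): d(each point x, each center c), closest center c(x).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_owner = dists.argmin(axis=1)
        if np.array_equal(new_owner, owner):      # step 4: no owner changed, stop
            break
        owner = new_owner
        # Step 5 (LOCATE): each center <- mean of the points it owns.
        for j in range(k):
            if np.any(owner == j):
                centers[j] = X[owner == j].mean(axis=0)
    return centers, owner
```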
5. A typical k-means result
6. Observations
- Theorem: If d() is Euclidean, then k-means converges monotonically to a local minimum of the within-class squared distortion Σ_x d(c(x), x)².
- Many variants; a complex history since 1956; over 100 papers per year currently.
- Iterative, related to expectation-maximization (EM).
- The number of iterations to converge grows slowly with n, k, and d.
- No accepted method exists to discover k.
7. We want to
1. make the algorithm faster;
2. find lower-cost local minima (finding the global optimum is NP-hard);
3. choose the correct k intelligently.
- With success at (1), we can try more alternatives for (2).
- With success at (2), comparisons for different k are less likely to be misleading.
8. Is this clustering better?
9. Or is this better?
10. Standard initialization methods
- Forgy initialization: choose k data points at random as the starting center locations.
- Random partitions: divide the data points randomly into k subsets.
- Both these methods are bad. (A sketch of both follows below.)
- E. Forgy. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21(3):768, 1965.
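For concreteness, a hedged sketch of the two initializations just described; the function names are mine, and random partitions are assumed to be non-empty (n much larger than k).

```python
import numpy as np

def forgy_init(X, k, rng):
    # Forgy: k data points chosen at random become the starting centers.
    return X[rng.choice(len(X), size=k, replace=False)].copy()

def random_partition_init(X, k, rng):
    # Random partitions: assign every point to one of k random subsets,
    # then use each subset's mean as a starting center.
    labels = rng.integers(0, k, size=len(X))
    return np.stack([X[labels == j].mean(axis=0) for j in range(k)])
```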
11. Forgy initialization
12. k-means result
13. Smarter initialization
- The "furthest-first" algorithm (FF):
  - Pick the first center randomly.
  - The next is the point furthest from the first center.
  - The third is the point furthest from both previous centers.
  - In general, the next center is argmax_x min_c d(x, c). (See the sketch below.)
- D. Hochbaum, D. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180-184, 1985.
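A short sketch of FF under the greedy rule above (illustrative code; Euclidean distance assumed):

```python
import numpy as np

def furthest_first(X, k, rng):
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]                # first center: a random point
    min_dist = np.linalg.norm(X - centers[0], axis=1)  # distance to nearest chosen center
    for _ in range(k - 1):
        i = int(min_dist.argmax())                     # furthest from all chosen centers
        centers.append(X[i])
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[i], axis=1))
    return np.stack(centers)
```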
14. Furthest-first initialization
15. Subset furthest-first (SFF)
- FF finds outliers, which by definition are not good cluster centers!
- Can we choose points that are far apart and typical of the dataset?
- Idea: a random sample includes many representative points, but few outliers.
- But: how big should the random sample be?
- Lemma: Given k equal-size sets and c > 1, with high probability ck log k random points intersect each set. (Sketch below.)
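A sketch of SFF guided by the lemma: draw roughly c·k·log k points (c = 2 below, matching the next slide) and run FF on the sample only, so outliers are unlikely to be picked. It reuses the furthest_first sketch above.

```python
import numpy as np

def subset_furthest_first(X, k, rng, c=2):
    # Sample about c*k*log(k) points, then run furthest-first on the sample.
    m = min(len(X), max(k, int(np.ceil(c * k * np.log(max(k, 2))))))
    sample = X[rng.choice(len(X), size=m, replace=False)]
    return furthest_first(sample, k, rng)
```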
16. Subset furthest-first (c = 2)
17. Comparing initialization methods
A value of 2.18 means 2.18× worse than the best clustering known. Lower is better.
18. How to find lower-cost local minima
- Random restarts, even when initialized well, are inadequate.
- The "central limit catastrophe": almost all local minima are only averagely good.
- K. D. Boese, A. B. Kahng, S. Muddu. A new adaptive multi-start technique for combinatorial global optimizations. Operations Research Letters, 16 (1994) 101-113.
- The art of designing a local search algorithm: defining a neighborhood rich in improving candidate moves.
19. Our local search method
- k-means alternates two guaranteed-improvement steps: allocate and locate.
- Sadly, we know no other guaranteed-improvement steps.
- So we make non-guaranteed jump operations: delete an existing center and create a new center at a data point.
- After each jump, run k-means to convergence, starting with an allocate step.
20. (Figure: add a center below; remove a center at left.)
21. Theory versus practice
- Theorem: Let C be a set of centers such that no jump operation improves the value of C. Then C is at most 25 times worse than the global optimum.
- T. Kanungo et al. A local search approximation algorithm for clustering. ACM Symposium on Computational Geometry, 2002.
- Our aim: find heuristics to identify jump steps that are likely to be good.
- Experiments indicate we can solve problems with up to 2000 points and 20 centers optimally.
22. An upper bound ...
- Lemma 1: The loss from removing center c is at most n·d(b′,c)² + m·d(b′,b)², with b, b′, m, n as in the proof.
- Proof:
  - Suppose b is the center closest to c; let B and C be the subsets owned by b and c, with m = |B| and n = |C|.
  - If B and C merge, the new center is b′ = (mb + nc)/(m + n).
  - Because c is the mean of C, for any z, Σ_{x in C} d(z,x)² = Σ_{x in C} d(c,x)² + n·d(z,c)².
  - So the loss from the merge is n·d(b′,c)² + m·d(b′,b)².
- This computation is cheap, so we do it for every center. (Sketch below.)
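An illustrative sketch of the Lemma 1 computation (my own code, assuming both clusters are non-empty): merge the cluster of the deleted center c into its nearest neighbour b and evaluate n·d(b′,c)² + m·d(b′,b)².

```python
import numpy as np

def removal_loss(centers, sizes, j):
    """Upper bound on the distortion increase from deleting center j."""
    c, n = centers[j], sizes[j]
    others = [i for i in range(len(centers)) if i != j]
    i = min(others, key=lambda i: np.linalg.norm(centers[i] - c))  # nearest center b
    b, m = centers[i], sizes[i]
    b_new = (m * b + n * c) / (m + n)          # mean of the merged cluster B u C
    return n * np.linalg.norm(b_new - c) ** 2 + m * np.linalg.norm(b_new - b) ** 2
```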
23. ... and a lower bound
- Suppose we add a new center at point z.
- Lemma 2: The gain from adding a center at z is at least Σ_{x : d(x,c(x)) > d(x,z)} [d(x,c(x))² − d(x,z)²].
- This computation is more expensive, so we do it for only 2k log k random candidates z. (Sketch below.)
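And a sketch of the Lemma 2 bound: every point closer to the candidate z than to its current owner would save at least d(x,c(x))² − d(x,z)². Names are illustrative; dist_to_owner[x] holds d(x, c(x)).

```python
import numpy as np

def addition_gain(X, dist_to_owner, z):
    """Lower bound on the distortion decrease from adding a center at point z."""
    d_z = np.linalg.norm(X - z, axis=1)
    closer = dist_to_owner > d_z                 # points that would switch to z
    return np.sum(dist_to_owner[closer] ** 2 - d_z[closer] ** 2)
```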
24. Sometimes a jump should only be a jiggle
- How to use Lemmas 1 and 2:
  - delete the center with the smallest maximum loss,
  - make a new center at the point with the greatest minimum gain.
- This procedure identifies good global improvements.
- Small-scale improvements come from "jiggling" the center of an existing cluster: moving the center to a point inside the same cluster.
25. jj-means: the smarter k-means algorithm
- Run k-means with SFF initialization.
- Repeat:
  - While improvement, do: try the best jump according to Lemmas 1 and 2.
  - Until improvement, do: try a random jiggle.
- "Try" means: run k-means to convergence afterwards.
- Insert random jumps to satisfy the theorem. (An outline of the jump step in code follows below.)
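An outline of one jump step, reusing the hypothetical removal_loss and addition_gain sketches above. This is only the shape of the procedure on the slide, not the talk's exact code; the returned centers would then be handed to k-means ("try").

```python
import numpy as np

def best_jump(X, centers, owner, rng, n_candidates):
    """Delete the center with smallest maximum loss (Lemma 1); add a center at
    the sampled candidate point with greatest minimum gain (Lemma 2)."""
    sizes = np.bincount(owner, minlength=len(centers))
    dist_to_owner = np.linalg.norm(X - centers[owner], axis=1)
    j = min(range(len(centers)), key=lambda j: removal_loss(centers, sizes, j))
    cand_idx = rng.choice(len(X), size=min(n_candidates, len(X)), replace=False)
    z = max((X[i] for i in cand_idx),
            key=lambda z: addition_gain(X, dist_to_owner, z))
    new_centers = centers.copy()
    new_centers[j] = z                      # the jump: replace center j with z
    return new_centers
```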
26. Results with 1000 points, 8 dimensions, 10 centers
Conclusion: running 10× longer is faster and better than restarting 10×.
27. Goal: make k-means faster, but with the same answer
- Allow any black-box d() and any initialization method.
- In later iterations, there is little movement of centers.
- Distance calculations use the most time.
- Geometrically, these calculations are mostly redundant.
Source: D. Pelleg.
28.
- Let x be a point, c(x) its owner, and c a different center.
- If we already know d(x,c) ≥ d(x,c(x)), then computing d(x,c) precisely is not necessary.
- Strategy: use the triangle inequality d(x,z) ≤ d(x,y) + d(y,z) to get sufficient conditions for d(x,c) ≥ d(x,b).
- kd-trees are useful up to about 10 dimensions.
- Distance-based data structures can be better.
- Our approach is adaptive.
29.
- Lemma 1: Let x be a point, and let b and c be centers. If d(b,c) ≥ 2·d(x,b), then d(x,c) ≥ d(x,b).
- Proof: We know d(b,c) ≤ d(b,x) + d(x,c), so d(b,c) − d(x,b) ≤ d(x,c). Now d(b,c) − d(x,b) ≥ 2·d(x,b) − d(x,b) = d(x,b). So d(x,b) ≤ d(x,c).
(Diagram showing points b, c, x.)
30.
- Lemma 2: Let x be a point, and let b and c be centers. Then d(x,c) ≥ max{0, d(x,b) − d(b,c)}.
- Proof: We know d(x,b) ≤ d(x,c) + d(b,c), so d(x,c) ≥ d(x,b) − d(b,c). Also, d(x,c) ≥ 0.
(Diagram showing points b, c, x.)
31. How to use Lemma 1
- Let c(x) be the owner of point x and c' another center.
- Compute d(x,c') only if d(x,c(x)) > ½·d(c(x),c').
- If we know an upper bound u(x) ≥ d(x,c(x)): compute d(x,c') and d(x,c(x)) only if u(x) > ½·d(c(x),c').
- If u(x) ≤ ½·min_{c' ≠ c(x)} d(c(x),c'), eliminate all distance calculations for x. (Sketch below.)
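A small sketch of these two tests (names are mine; center_dists is assumed to be the k×k matrix of inter-center distances):

```python
import numpy as np

def needs_exact_distance(u_x, gap):
    """gap = d(c(x), c'); only then must d(x, c') (and d(x, c(x))) be computed."""
    return u_x > 0.5 * gap

def can_skip_all(u_x, owner_idx, center_dists):
    """True when u(x) <= 0.5 * min_{c' != c(x)} d(c(x), c'): skip all distances for x."""
    gaps = np.delete(center_dists[owner_idx], owner_idx)
    return u_x <= 0.5 * gaps.min()
```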
32. How to use Lemma 2
- Let x be any point and c any center; let c' be c at the previous iteration.
- Assume a previous lower bound d(x,c') ≥ l'.
- Then we get a new lower bound for the current iteration:
  d(x,c) ≥ max{0, d(x,c') − d(c,c')} ≥ max{0, l' − d(c,c')}.
- If l' is a good approximation and the center only moves slightly, then we get a good updated approximation. (Sketch below.)
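A sketch of the corresponding bookkeeping: when each center drifts from its old to its new position, every stored lower bound stays valid after subtracting that drift. The array shapes are my assumption.

```python
import numpy as np

def update_lower_bounds(l, centers_old, centers_new):
    """l has shape (n_points, k): l[x, c] <- max(0, l[x, c] - d(c_old, c_new))."""
    drift = np.linalg.norm(centers_new - centers_old, axis=1)   # shape (k,)
    return np.maximum(0.0, l - drift[None, :])
```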
33.
- Pick initial centers c.
- For all x and c, compute d(x,c).
- Initialize lower bounds l(x,c) ← d(x,c).
- Initialize upper bounds u(x) ← min_c d(x,c).
- Initialize ownership c(x) ← argmin_c d(x,c).
- Repeat until convergence:
  - Find all x such that u(x) ≤ ½·min_{c' ≠ c(x)} d(c(x),c').
  - For each remaining x and each c ≠ c(x) such that
    - u(x) > l(x,c) and
    - u(x) > ½·d(c(x),c):
    - compute d(x,c) and d(x,c(x));
    - if d(x,c) < d(x,c(x)), then change owner: c(x) ← c;
    - update l(x,c) ← d(x,c) and u(x) ← d(x,c(x)).
  - For each c, m(c) ← mean of the points owned by c.
  - For each x and c, update l(x,c) ← max{0, l(x,c) − d(m(c),c)}.
  - For each x, update u(x) ← u(x) + d(c(x), m(c(x))).
  - Update each center c ← m(c).
(A runnable sketch follows below.)
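Putting the pseudocode together, here is a compact runnable sketch. It is my own illustrative implementation of the slide's algorithm, not the author's code; it computes d(x,c(x)) lazily once per point per iteration, which the slide leaves implicit.

```python
import numpy as np

def accelerated_kmeans(X, init_centers, max_iter=100):
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    n, k = len(X), len(centers)

    # Initialization: compute all n*k distances once, set bounds and ownership.
    l = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # lower bounds
    owner = l.argmin(axis=1)
    u = l[np.arange(n), owner].copy()                                # upper bounds

    for _ in range(max_iter):
        cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
        np.fill_diagonal(cc, np.inf)
        s = 0.5 * cc.min(axis=1)            # 0.5 * min_{c' != c} d(c, c')

        for i in range(n):
            if u[i] <= s[owner[i]]:
                continue                    # Lemma 1 rules out every other center
            tight = False                   # is u[i] the exact distance to c(x)?
            for c in range(k):
                if c == owner[i]:
                    continue
                if u[i] <= l[i, c] or u[i] <= 0.5 * cc[owner[i], c]:
                    continue                # the bounds already rule out center c
                if not tight:               # compute d(x, c(x)) once, lazily
                    u[i] = np.linalg.norm(X[i] - centers[owner[i]])
                    l[i, owner[i]] = u[i]
                    tight = True
                    if u[i] <= l[i, c] or u[i] <= 0.5 * cc[owner[i], c]:
                        continue
                d = np.linalg.norm(X[i] - centers[c])
                l[i, c] = d
                if d < u[i]:                # change owner c(x) <- c
                    owner[i] = c
                    u[i] = d

        # m(c) <- mean of the points owned by c.
        new_centers = centers.copy()
        for c in range(k):
            if np.any(owner == c):
                new_centers[c] = X[owner == c].mean(axis=0)

        # Repair the bounds for the center movement (Lemma 2).
        drift = np.linalg.norm(new_centers - centers, axis=1)
        if not drift.any():                 # centers did not move: converged
            break
        l = np.maximum(0.0, l - drift[None, :])
        u = u + drift[owner]
        centers = new_centers
    return centers, owner
```

On random data this sketch should return the same clustering as the plain Lloyd sketch earlier (up to ties in the argmin), since the bounds only skip calculations whose outcome is already determined.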
34. Notes on the new algorithm
- Empirical issue: which checks to do, and in which order.
- Implement "for each remaining x and c" by looping over c, with vectorized code processing all x together.
- Or, sequentially scan x and l(x,c) from disk.
- The obvious initialization computes O(nk) distances. Faster methods give inaccurate l(x,c) and u(x), and hence may do more distance calculations later.
35. (Figure slide; no transcript.)
36. Experimental observations
- Natural clusters are found while computing the distance between each point and each center less than once!
- We find k = 100 clusters in n = 150,000 covtype points with 7,353,400 < nk = 15,000,000 distance calculations.
- The number of distance calculations is o(nke), where e is the number of iterations, because later iterations compute very few distances.
37. Current limitations
- Computing distances is no longer the dominant cost.
- Reason: after each iteration, we
  - update nk lower bounds l(x,c),
  - use O(kd) time to recompute the k means,
  - use O(k²d) time to recompute all inter-center distances.
- Moreover, we can approximate distances in o(d) time by considering the largest dimensions first. (One possible reading of this is sketched below.)
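The last bullet is sketched below under my own reading of it: a partial sum of squared coordinate differences is already a lower bound on the full squared distance, so examining the largest (e.g. highest-variance) dimensions first lets many comparisons stop after only a few of the d coordinates.

```python
import numpy as np

def exceeds_threshold(x, y, threshold, order):
    """Early-abandon test: True as soon as the partial squared distance over
    the coordinates in `order` exceeds threshold**2."""
    t2 = threshold ** 2
    acc = 0.0
    for j in order:              # e.g. dimensions sorted by variance, largest first
        acc += (x[j] - y[j]) ** 2
        if acc > t2:             # the full distance can only be larger
            return True
    return False                 # full squared distance <= threshold**2
```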
38. Deeper questions
- What is the minimum number of distance calculations needed?
- An adversary argument? If some calculations are omitted, an opponent can choose their values to make any clustering algorithm's output incorrect.
- Can we extend to clustering with general Bregman divergences?
- Can we extend to soft-assignment clustering, via lower and upper bounds on weights?