Title: Analyzing and Improving Local Search: k-means and ICP
1Analyzing and Improving Local Search: k-means and ICP
- David Arthur
- Special University Oral Exam
- Stanford University
2What is this talk about?
- Two popular but poorly understood algorithms
- Fast in practice
- Nobody knows why
- Find (highly) sub-optimal solutions
- The big questions
- What makes these algorithms fast?
- How can the solutions they find be improved?
3What is k-means?
- Divides point-set into k tightly-packed
clusters
(k = 3)
4What is ICP (Iterative Closest Point)?
- Finds a subset of A similar to B, by alternately matching points and re-aligning (a minimal sketch follows below)
(Figure: point sets A and B.)
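For reference, here is a minimal numpy sketch of the standard ICP loop: match each point to its nearest neighbor, then re-fit the best rigid transform via SVD. This is an illustrative point-to-point variant under my own assumptions (function name `icp`, SVD/Kabsch alignment), not necessarily the exact formulation analyzed later in the talk.

```python
# Minimal point-to-point ICP sketch (illustrative; not the exact variant analyzed in the talk).
import numpy as np

def icp(A, B, iters=50):
    """Iteratively align point set B to point set A (both of shape (m, d) / (n, d))."""
    d_dim = B.shape[1]
    R, t = np.eye(d_dim), np.zeros(d_dim)
    for _ in range(iters):
        moved = B @ R.T + t
        # 1. Match: pair each (transformed) point of B with its nearest neighbor in A.
        dists = ((moved[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
        match = A[dists.argmin(axis=1)]
        # 2. Re-align: best rigid transform from B onto the matches (Kabsch / SVD).
        mu_b, mu_m = B.mean(axis=0), match.mean(axis=0)
        U, _, Vt = np.linalg.svd((B - mu_b).T @ (match - mu_m))
        sign = np.sign(np.linalg.det(Vt.T @ U.T))
        D = np.diag([1.0] * (d_dim - 1) + [sign])  # guard against reflections
        R = Vt.T @ D @ U.T
        t = mu_m - R @ mu_b
    return R, t
```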
5Main outline
- Focus on k-means in talk
- Goals
- Understand running time
- Harder than you might think!
- Worst case exponential
- Smoothed polynomial
- Find better clusterings
- k-means++ (a modification of k-means)
- Provably near-optimal
- In practice faster and more accurate than the
competition
6Main outline
- What is k-means?
- k-means worst-case complexity
- k-means smoothed complexity
- k-means++
7Main outline
- What is k-means?
- What exactly is being solved?
- Why this algorithm?
- How does it work?
- k-means worst-case complexity
- k-means smoothed complexity
- k-means++
8The k-means problem
- Input
- An integer k
- A set X of n points in Rd
- Task
- Partition the points into k clusters C1, C2, ..., Ck
- Also choose centers c1, c2, ..., ck for the clusters
(k = 3)
9The k-means problem
- Input
- An integer k
- A set X of n points in Rd
- Task
- Partition the points into k clusters C1, C2, ..., Ck
- Also choose centers c1, c2, ..., ck for the clusters
- Minimize the objective function f = Σ_{x∈X} ||x - c(x)||²
- where c(x) is the center of the cluster containing x
- Similar to variance
(k = 3)
10The k-means problem
- Problem is NP-hard
- Even when k = 2 (Drineas et al., 04)
- Even when d = 2 (Mahajan et al., 09)
- Some (1+ε)-approximation algorithms are known
- Example running times
- O(n + k^{k+2} ε^{-(2d+1)k} log^{k+1}(n) log^{k+1}(1/ε))
- (Har-Peled and Mazumdar, 04)
- O(2^{(k/ε)^{O(1)}} d n)
- (Kumar et al., 04)
- All exponential (or worse) in k
11The k-means problem
- An example real-world data set
- From the UC-Irvine Machine Learning Repository
- Looking to detect malevolent network connections
- n = 494,021
- k = 100
- d = 38
- (1+ε)-approximation algorithms are too slow for this!
12k-means method
- The fast way
- k-means method (Lloyd 82, MacQueen 67)
- By far the most popular clustering algorithm
used in scientific and industrial applications
(Berkhin, 02)
13-31k-means method (animated example run)
- Start with k arbitrary centers ci
- (In practice, chosen at random from the data points)
- Assign points to clusters Ci based on the closest ci
- Set each ci to be the center of mass of Ci
- Repeat the last two steps until stable
Clustering is stable, so k-means terminates! (A minimal code sketch of these steps follows below.)
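As a concrete reference, here is a minimal sketch of the alternating steps just described (assign each point to the closest center, then move each center to its cluster's center of mass), using random data points as the initial centers. The function and variable names are illustrative, not from the talk.

```python
# Minimal sketch of the k-means (Lloyd) method described above; names are illustrative.
import numpy as np

def lloyd_kmeans(X, k, rng=np.random.default_rng(0), max_iter=1000):
    """X: (n, d) array of points. Returns (centers, labels, objective f)."""
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # arbitrary initial centers
    labels = None
    for _ in range(max_iter):
        # Assign each point to the cluster Ci of its closest center ci.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # clustering is stable: k-means terminates
        labels = new_labels
        # Set each ci to be the center of mass of Ci (empty clusters keep their center).
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    f = float(((X - centers[labels]) ** 2).sum())  # objective: sum of squared distances
    return centers, labels, f
```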
32What is known already
- Number of iterations
- Finite!
- Each iteration decreases f
- In practice
- Sub-linear (Duda, 00)
- Worst-case
- Ω(n) lower bound (Har-Peled and Sadri, 04)
- min(O(n^{3kd}), O(k^n)) upper bound (Inaba et al., 00)
- Very large gap!
33What is known already
- Accuracy
- Only finds a local optimum
- Local optimum can be arbitrarily bad
- i.e., f / fOPT unbounded
- Even with high probability in 1 dimension
- Even in natural examples: well-separated Gaussians
34Main outline
- What is k-means?
- k-means worst-case complexity
- Number of iterations can be
- Super-polynomial!
- k-means smoothed complexity
- k-means++
- How slow is the k-means method? (Arthur and
Vassilvitskii, 06)
35Worst-case overview
- Recursive construction
- Start with input X (goes from A to B in T iterations)
- Then modify X
- Add 1 dimension, O(k) points, O(1) clusters
- Old part of input still goes from A to B in T iterations
- New part: resets everything once
(Diagram: the run from A to B taking T iterations, then a reset, then the run from A to B again taking T iterations.)
36Worst-case overview
- Recursive construction
- Repeat m times
- O(m²) points
- O(m) clusters
- 2^m iterations
- The lower bound follows: 2^m iterations on n = O(m²) points, i.e. 2^{Ω(√n)}
37Recursive construction (Overview)
The original input X (data points not shown). Start with an arbitrary input...
38Recursive construction (Overview)
... and add O(1) clusters (G and H) and O(k) points along a new dimension. Note the symmetry!
39Recursive construction (Trace t = 0)
Zoomed in, showing only one side. We trace k-means from here.
40Recursive construction (Trace t = 0 ... T)
The new points are far away, so the new clusters are stable while k-means works on the old points.
41-44Recursive construction (Trace t = T+1): Assigning points to clusters
Choose pi to be a direct lift of the final Ci center. At time T+1, pi is closer to joining Ci than ever before. We can position G so that pi joins Ci at time T+1.
45-46Recursive construction (Trace t = T+1): Recomputing centers
The center of G moves further away. The centers of the Ci remain constant, by symmetry.
47-49Recursive construction (Trace t = T+2): Assigning points to clusters
G's center is far away, so it loses points. Each qi switches to Ci, regardless of qi's position in the base space.
50-51Recursive construction (Trace t = T+2): Recomputing centers
Centers reset to their t = 0 positions. By symmetry, the centers of the Ci are not lifted towards G. Choose the positions of the qi so that the Ci are reset in the base space.
52-54Recursive construction (Trace t = T+3): Assigning points to clusters
H has moved closer to pi and qi, but Ci has not. Position H so that pi and qi switch to H now. This leaves the same state as at t = 1.
55-56Recursive construction (Trace t = T+3): Recomputing centers
Same state as at t = 1.
57-59Recursive construction (Trace t = T+4): Assigning points to clusters
Same state as at t = 2.
60-61Recursive construction (Trace t = T+4): Recomputing centers
Same state as at t = 2. We are done! The new clusters are completely stable, and T - 2 more iterations are needed for the Ci. Total time: 2T + 2 iterations.
62Worst-case complexity summary
- k-means can require 2^{Ω(√n)} iterations
- For random centers, even with high probability
- Even when d = 2 (Vattani, 09)
- (d = 1 is open)
63Worst-case complexity summary
- k-means can require 2^{Ω(√n)} iterations
- For random centers, even with high probability
- Even when d = 2 (Vattani, 09)
- (d = 1 is open)
- ICP
- Can require Ω(n/d)^d iterations
- Similar (but easier) argument
64Main outline
- What is k-means?
- k-means worst-case complexity
- k-means smoothed complexity
- Take an arbitrary input, but randomly perturb it
- Expected number of iterations is polynomial
- Works for any k, d
- k-means++
- k-means has smoothed polynomial complexity
(Arthur, Manthey, and Röglin, 09)
65The problem with worst-case complexity
- What is the problem?
- k-means has bad worst-case complexity
- But is not actually slow in practice
- Need a different model to understand real world
- A simple explanation: average-case analysis
- But real-world data is not random
66The problem with worst-case complexity
- A better explanation
- Smoothed analysis (Spielman and Teng, 01)
- Between average case and worst case
- Perturb each point by a normal distribution with variance σ² (sketched below)
- Show the expected running time is polynomial in n and D/σ
- D = diameter of the point-set
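To make the perturbation model concrete, here is a minimal sketch (assuming numpy; the function name is illustrative): each coordinate of each point receives independent Gaussian noise with variance σ².

```python
# Sketch of the smoothed-analysis input model: each coordinate of each point
# is perturbed by independent Gaussian noise of variance sigma^2.
import numpy as np

def perturb(X, sigma, rng=np.random.default_rng(0)):
    X = np.asarray(X, dtype=float)
    return X + rng.normal(scale=sigma, size=X.shape)
```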
67-69Proof overview
- Recall the potential function f = Σ_{x∈X} ||x - c(x)||²
- X is the set of all data points
- c(x) is the corresponding cluster center
- Bound f
- f ≤ nD² initially
- Will prove f is very likely to drop by ε² each iteration
- Gives: the number of iterations is at most n(D/ε)², as spelled out below
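Written out, the two facts above combine into the iteration bound as follows.

```latex
% Iteration bound from the two facts above (initial value and per-step drop):
% f <= n D^2 initially, and f decreases by at least eps^2 per iteration, so
\[
  \#\text{iterations} \;\le\; \frac{f_{\mathrm{initial}}}{\varepsilon^{2}}
  \;\le\; \frac{n D^{2}}{\varepsilon^{2}}
  \;=\; n\left(\frac{D}{\varepsilon}\right)^{2}.
\]
```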
70The easy approach
- Do union bound over all possible k-means steps
- What defines a step?
- Original clustering A (≤ k^n choices)
- Actually ≤ n^{3kd} choices (Inaba et al., 00)
- Resulting clustering B
- Total number of possible steps: ≤ n^{6kd}
- Probability a fixed step can be bad
- Bounded by the probability that A and B have near-identical f
- Probability ≤ (ε/σ)^d
71The easy approach
- The argument
- P[k-means takes more than n(D/ε)² iterations]
- ≤ P[there exists a possible bad step]
- ≤ (# of possible steps) · P[step is bad]
- ≤ n^{6kd} · (ε/σ)^d
- small... if ε < σ · (1/n)^{O(k)}
- Resulting bound: n^{O(k)} iterations
- Not polynomial! (See the calculation below.)
- (Arthur and Vassilvitskii, 06), (Manthey and Röglin, 09)
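Making the union-bound calculation explicit, with the counts from the bullets above:

```latex
% Union bound over all <= n^{6kd} possible steps, each bad with probability <= (eps/sigma)^d:
\[
  \Pr[\exists\ \text{bad step}]
  \;\le\; n^{6kd}\,\Bigl(\tfrac{\varepsilon}{\sigma}\Bigr)^{d},
\]
% which is small only when eps <= sigma * n^{-O(k)}.  Plugging such an eps into the
% n(D/eps)^2 iteration bound yields roughly n^{O(k)} (D/sigma)^2 iterations:
% polynomial only when k is a constant.
```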
72How can this be improved?
- Union bound is wasteful!
- These two k-means steps can be analyzed together
73How can this be improved?
- Union bound is wasteful!
- These two k-means steps can be analyzed together
If the point is not exactly equidistant between the two centers, the potential drops. This is true for both pictures.
74How can this be improved?
- Union bound is wasteful!
- These two k-means steps can be analyzed together
Other clusters do not matter...
75How can this be improved?
- Union bound is wasteful!
- These two k-means steps can be analyzed together
And for the relevant clusters, only the center
matters, not the exact points.
76How can this be improved?
- A transition blueprint
- Which points switched clusters
- Approximate positions for the relevant centers
- Bonus: most approximate centers are determined by the above!
- Not obvious facts
- m = number of points switching clusters
- # of transition blueprints ≤ (nk²)^m · (D/ε)^{O(m)}
- P[blueprint is bad] ≤ (ε/σ)^m
- (for most blueprints)
77A good approach
- The new argument
- P[k-means takes more than n(D/ε)² iterations]
- ≤ P[there exists a possible bad blueprint]
- ≤ (# of possible blueprints) · P[blueprint is bad]
- ≤ (nk²)^m · (D/ε)^{O(m)} · (ε/σ)^m
- small... if ε < σ · (σ/(nD))^{O(1)}
- Resulting bound: polynomially many iterations!
78Smoothed complexity summary
- Smoothed complexity is polynomial
- Still have work to do: O(n^{26})
- Getting tight exponents in smoothed analysis is hard
- The original theorem for the Simplex algorithm: O(n^{96})!
- ICP
- Also has polynomial smoothed complexity
- With a much easier argument!
79Main outline
- What is k-means?
- k-means worst-case complexity
- k-means smoothed complexity
- k-means++
- What's wrong with k-means?
- What's k-means++?
- O(log k)-competitive with OPT
- Experimental results
- k-means++: The advantages of careful seeding (Arthur and Vassilvitskii, 07)
80What's wrong with k-means?
- Recall
- Only finds a local optimum
- Local optimum can be arbitrarily bad
- i.e., f / fOPT unbounded
- Even with high probability in 1 dimension
- Even in natural examples: well-separated Gaussians
81What's wrong with k-means?
- k-means locally optimizes a clustering
- But can miss the big picture!
If the data set has well separated clusters...
82What's wrong with k-means?
- k-means locally optimizes a clustering
- But can miss the big picture!
... and we use the standard approach (choosing initial centers uniformly at random), it is easy to get two centers in one cluster...
83What's wrong with k-means?
- k-means locally optimizes a clustering
- But can miss the big picture!
... and then k-means gets stuck in a local optimum
84The solution
- Easy way to fix this mistake
- Make centers far apart
85k-means++
- The right way of choosing initial centers
- Choose first center uniformly at random
86k-means++
- The right way of choosing initial centers
- Choose the first center uniformly at random
- Repeat until k centers
- Add a new center
- Choose x0 with probability proportional to D(x0)² (sketched below)
- D(x) = distance between x and the closest existing center
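A minimal sketch of this D² seeding rule (assuming numpy; the function name is illustrative). After seeding, the standard k-means iterations are run from these centers.

```python
# Sketch of the D^2 seeding rule described above; names are illustrative.
import numpy as np

def kmeanspp_seed(X, k, rng=np.random.default_rng(0)):
    """Pick k initial centers: the first uniformly at random, then each new
    center x chosen with probability proportional to D(x)^2, where D(x) is
    the distance from x to the closest center picked so far."""
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]              # 1st center: uniform at random
    d2 = ((X - centers[0]) ** 2).sum(axis=1)          # squared distance to nearest center
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])  # pick proportionally to D(x)^2
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers)
```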
87k-means++
(k = 3)
88k-means++
(k = 3)
Choose the 1st center uniformly at random
89k-means++
(k = 3)
Squared distances to the current center: D² = 1² + 7², D² = 8² + 4², D² = 7² + 3², D² = 2² + 1²
Choose the 2nd center with probability proportional to D²
90k-means++
(k = 3)
Squared distances to the closest chosen center: D² = 1² + 7², D² = 1² + 1², D² = 2² + 1²
Choose the 3rd center with probability proportional to D²
91k-means++
(k = 3)
Now run k-means as normal
92Theoretical guarantee
- Claim
- E[f] ≤ O(log k) · fOPT
- The guarantee holds as soon as the centers are picked
- But the subsequent k-means steps help further in practice
93Proof idea
- Let C1, C2, ..., Ck be the OPT clusters
- Points in Ci contribute
- fOPT(Ci) to the OPT potential
- fkm(Ci) to the k-means potential
- Lemma: If we pick a center from Ci
- E[fkm(Ci)] ≤ 8 · fOPT(Ci)
- Proof: Linear algebra magic
- True for any reasonable probability distribution
94Proof idea
- The real danger is wasting a center on an already covered Ci
- Probability of choosing a covered cluster
- = fcurrent(covered clusters) / fcurrent
- ≤ 8·fOPT / fcurrent
- Cost of choosing one
- If t uncovered clusters are left
- a 1/t fraction of fcurrent is now unfixable
- Cost ≈ fcurrent / t
- Expected cost
- ≈ (fOPT / fcurrent) · (fcurrent / t) = fOPT / t (ignoring constants)
95Proof idea
- Cost over all k steps
- fOPT · (1/1 + 1/2 + ... + 1/k) = fOPT · O(log k) (see below)
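The per-step costs sum to a harmonic series, which is where the log k comes from:

```latex
% Summing the expected cost f_OPT / t over the k seeding steps (t = k, k-1, ..., 1):
\[
  \sum_{t=1}^{k} \frac{f_{\mathrm{OPT}}}{t}
  \;=\; f_{\mathrm{OPT}}\, H_{k}
  \;\le\; f_{\mathrm{OPT}} \,(1 + \ln k)
  \;=\; O(f_{\mathrm{OPT}} \log k).
\]
```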
96k-means++ accuracy improvement
(Chart: improvement factor in f, per data set.)
97k-means++ vs. k-means
- Values above 1 indicate k-means++ is out-performing k-means
98Other algorithms?
- The theory community has proposed other reasonable algorithms too
- Iterative swapping is the best in practice (Kanungo et al., 04)
- Theoretically an O(1)-approximation
- The implementation gives this up to be viable in practice
- Actual guarantees: none
99k-means++ accuracy improvement vs. Iterative Swapping (Kanungo et al., 04)
(Chart: improvement factor in f, per data set.)
100k-means++ vs. Iterative Swapping
- Values above 1 indicate k-means++ is out-performing Iterative Swapping
101k-means++ summary
- Friends don't let friends use vanilla k-means!
- k-means++ has a provable accuracy guarantee
- O(log k)-competitive with OPT
- k-means++ is faster on average
- k-means++ gets better clusterings (almost) always
102Main outline
- Goals
- Understand number of iterations
- Harder than you might think!
- Worst case exponential
- Smoothed polynomial
- Find better clusterings
- k-means++ (a modification of k-means)
- Provably near-optimal
- In practice faster and better than the
competition
103Special thanks!
- My advisor
- Rajeev Motwani
- My defense committee
- Ashish Goel
- Vladlen Koltun
- Serge Plotkin
- Tim Roughgarden
104Special thanks!
- My co-authors
- Bodo Manthey
- Rina Panigrahy
- Heiko Röglin
- Aneesh Sharma
- Sergei Vassilvitskii
- Ying Xu
105Special thanks!
- My other fellow students
- Gagan Aggarwal
- Brian Babcock
- Bahman Bahmani
- Krishnaram Kenthapadi
- Aleksandra Korolova
- Shubha Nabar
- Dilys Thomas
106Special thanks!