Title: Pass-Efficient Algorithms for Clustering
1 Pass-Efficient Algorithms for Clustering
- Dissertation Defense
- Kevin Chang
- Adviser: Ravi Kannan
- Committee: Dana Angluin, Joan Feigenbaum, Petros Drineas (RPI)
2 Overview
- Massive Data Sets in Theoretical Computer Science
- Algorithms for input that is too large to fit in the memory of a computer.
- Clustering Problems
- Learning generative models
- Clustering via combinatorial optimization
- Both massive data set and traditional algorithms
3 Theoretical Abstractions for Massive Data Set Computation
- The input on disk/storage is modeled as a read-only array. Elements may only be accessed through a sequential pass.
- Input elements may be arbitrarily ordered.
- Main memory is modeled as extra space used for intermediate calculations.
- The algorithm is allowed extra time before and after each pass to perform calculations.
- Goal: Minimize memory usage and the number of passes.
4 Models of Computation
- Streaming Model: The algorithm may make a single pass over the data. Space must be o(n).
- Pass-Efficient Model: The algorithm may make a small, constant number of passes. Ideally, space is O(1).
- Other models: sublinear algorithms.
- Pass-Efficient is more flexible than streaming, but not suitable for streaming data that arrives, is processed immediately, and is then forgotten.
5 Main Question: What will multiple passes buy you?
- Is it better, in terms of other resources, to make 3 passes instead of 1 pass?
- Example: Find the mean of an array of integers.
- 1 pass requires O(1) space.
- More passes don't help.
- Example: Find the median. [MP 80]
- A 1-pass algorithm requires Ω(n) space.
- A 2-pass algorithm needs only O(n^(1/2)) space.
- We will study the trade-off between passes and memory.
- Other work for graphs: [FKMSZ 05]
6 Overview of Results
- General framework for Pass-Efficient clustering algorithms. Specific problems:
- A learning problem: a generative model of clustering.
- Sharp trade-off between passes and space.
- Lower bounds show this is nearly tight.
- A combinatorial optimization problem: Facility Location.
- Same sharp trade-off.
- Algorithm for a graph partitioning problem.
- Can be considered a clustering problem.
7 Graph Partitioning: Sum-of-Squares Partition
- Joint work with Alex Yampolskiy and Jim Aspnes.
- Given a graph G = (V, E), remove m nodes or edges to disconnect the graph into components C_i so as to minimize the objective function Σ_i |C_i|^2. (A small sketch of computing this objective follows below.)
The objective function is natural, because large components will have roughly equal size in the optimum partition.
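To make the objective concrete, here is a minimal Python sketch (my own illustration, not code from the thesis): given an adjacency-list graph and a set of removed nodes, it finds the surviving connected components by BFS and returns Σ_i |C_i|^2.

```python
from collections import deque

def sum_of_squares(adj, removed):
    """Sum-of-squares partition objective: delete the nodes in `removed`,
    find the connected components C_i of what remains, return sum_i |C_i|^2."""
    removed = set(removed)
    seen = set(removed)
    total = 0
    for start in adj:
        if start in seen:
            continue
        size, queue = 0, deque([start])   # BFS over surviving nodes
        seen.add(start)
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        total += size * size
    return total

# Path 1-2-3-4-5: removing node 3 leaves {1,2} and {4,5}, so the objective is 2^2 + 2^2 = 8.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(sum_of_squares(path, removed={3}))  # -> 8
```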
8 Graph Partitioning: Our Results
- Definition: (α, β)-bicriteria approximation
- remove α·m nodes or edges to get a partition C_i
- such that Σ_i |C_i|^2 < β·OPT
- OPT is the best sum of squares achievable by removing m nodes or edges.
- Approximation Algorithm for SSP
- Thm: There exists an (O(log^1.5 n), O(1)) approximation algorithm.
- Hardness of Approximation
- Thm: It is NP-Hard to compute a (1.18, 1) approximation.
- By reduction from minimum Vertex Cover.
9 Outline of Algorithm
- Recursively partition the graph by removing sparse node or edge cuts.
- Sparsest cut is an important problem in theory.
- Informally: find a cut that removes a small number of edges to produce two relatively large components.
- Recent activity in this area: [LR 99], [ARV 04].
- The node cut problem reduces to the edge cut problem.
- A similar approach was taken by [KVV 00] for a different clustering problem.
10 Rough Idea of Algorithm
- In the first iteration, partition the graph G into C1 and C2 by removing a sparse node cut.
[Figure: the graph G before any cut is removed]
11 Rough Idea of Algorithm
- In the first iteration, partition the graph G into C1 and C2 by removing a sparse node cut.
[Figure: G split into components C1 and C2]
12 Rough Idea of Algorithm
- Subsequent iterations calculate cuts in all components, and remove the cut that most effectively reduces the objective function.
[Figure: components C1 and C2]
13 Rough Idea of Algorithm
- Continue the procedure until Θ(log^1.5 n)·m nodes have been removed. (A schematic of the greedy loop follows below.)
[Figure: components C1, C2, and C3]
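The greedy loop on slides 12-13 could look roughly like the following sketch. This is only a schematic of my own: `candidate_cuts` and `single_nodes` are placeholder names, the naive single-node candidates stand in for the approximate sparsest node/edge cuts the real algorithm uses, and `sum_of_squares` is assumed to be the function from the slide 7 sketch.

```python
import math

def greedy_partition(adj, m, candidate_cuts):
    """Schematic of the greedy loop: repeatedly remove the candidate cut that
    most reduces the sum-of-squares objective, until roughly
    Theta(log^1.5 n) * m nodes have been removed.  Assumes sum_of_squares()
    from the earlier sketch is in scope."""
    n = len(adj)
    budget = int(math.log(max(n, 2)) ** 1.5) * m
    removed = set()
    while len(removed) < budget:
        cuts = candidate_cuts(adj, removed)
        if not cuts:
            break
        best = min(cuts, key=lambda cut: sum_of_squares(adj, removed | set(cut)))
        removed |= set(best)
    return removed

def single_nodes(adj, removed):
    """Naive candidate generator (each surviving node by itself); the real
    algorithm would generate sparse node/edge cuts instead."""
    return [{u} for u in adj if u not in removed]
```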
14 Generative Models of Clustering
- Assume the input is generated by k different random processes.
- k distributions F1, F2, ..., Fk, each with weight w_i > 0 such that Σ_i w_i = 1.
- A sample is drawn according to the mixture by picking F_i with probability w_i, and then drawing a point according to F_i (see the sampling sketch below).
- Alternatively, if F_i is a density function, then the mixture has density F = Σ_i w_i F_i.
- Much TCS work on learning mixtures of Gaussian distributions in high dimension: [D 99], [AK 01], [VW 02], [KSV 05]
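To make the sampling process concrete, here is a small sketch (my own illustration, using uniform components as on slide 16; not code from the thesis).

```python
import random

def sample_mixture(weights, intervals, n):
    """Draw n points from a mixture of uniform distributions: pick component i
    with probability w_i, then draw a point uniformly from its interval (a_i, b_i)."""
    samples = []
    for _ in range(n):
        i = random.choices(range(len(weights)), weights=weights)[0]
        a, b = intervals[i]
        samples.append(random.uniform(a, b))
    return samples

# Example: three components with weights summing to 1.
X = sample_mixture([0.5, 0.3, 0.2], [(0.0, 2.0), (1.0, 2.0), (4.0, 5.0)], n=10000)
```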
15 Learning Mixtures
- Q: Given samples from the mixture, can we learn the parameters of the k processes?
- A: Not always. The question can be ill-posed.
- Two different mixtures can create the same density.
- Our problem: Can we learn the density F of the mixture in the pass-efficient model?
- What are the trade-offs between space and passes?
- Suppose F is the density of the mixture. The accuracy of our estimate G is measured by the L1 distance ∫ |F - G|.
16 Mixtures of k Uniform Distributions in R
- Assume each F_i is a uniform distribution over some interval (a_i, b_i) in R.
- Then the mixture density looks like a step distribution.
The problem reduces to learning a density function that is known to be piecewise constant with at most 2k-1 steps.
[Figure: a step density; a mixture of three distributions, F1 on (x1, x3), F2 on (x2, x3) and F3 on (x4, x5)]
17 Algorithmic Results
- Pick any integer P > 0 and any 1 > ε > 0. We give a 2P-pass randomized algorithm to learn a mixture of k uniform distributions.
- Error at most ε^P; memory required is O(k^3/ε^2 · P + k/ε).
- The error drops exponentially in the number of passes, while the memory used grows very slowly.
- Error at most ε; memory required is O(k^3/ε^(2/P)).
- If you take 2 passes you need (1/ε)^2, with 4 passes you need (1/ε), and with 8 passes you need (1/ε)^(1/2).
- Memory drops very sharply as P increases, while the error remains constant.
- Succeeds with probability 1 - δ.
18 Step Distribution: First Attempt
- Let X be the data stream of samples from F.
- Break the domain into bins and count the number of points that fall in each bin. Estimate F in each bin (a minimal sketch follows below).
- In order to get accuracy ε^P, you will need to store at least ≈ 1/ε^P counters.
- Too much! We can do much better using a few more passes.
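A minimal sketch of this first attempt (my own code; it assumes the domain is a known interval [lo, hi)): with bin width on the order of ε^P the estimator needs roughly 1/ε^P counters, which is exactly the space cost the slide objects to.

```python
def histogram_estimate(stream, lo, hi, num_bins):
    """One-pass binning estimate: count stream points per bin and estimate the
    density on each bin as count / (n * bin_width).  Accuracy ~ eps^P forces
    num_bins ~ 1/eps^P, i.e. that many counters in memory."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    n = 0
    for x in stream:                               # single sequential pass
        n += 1
        b = min(int((x - lo) / width), num_bins - 1)
        counts[b] += 1
    if n == 0:
        return [0.0] * num_bins
    return [c / (n * width) for c in counts]
```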
19 General Framework
- In one pass, take a small uniform subsample of size s from the data stream.
- Compute a coarse solution on the subsample.
- In a second pass, for places where the coarse solution is reasonable, sharpen it.
- Also in the second pass, find places where the coarse solution is very far off.
- Recurse on these spots.
- Examine these trouble spots more carefully: zoom in!
20 Our Algorithm
- In one pass, draw a sample of size s = Θ(k^2/ε^2) from X.
- Based on the sample, partition the domain into intervals I such that ∫_I F = Θ(ε/2k).
- The number of intervals I will be O(k/ε).
- In one pass, for each I, determine if F is very close to constant on I (call the subroutine Constant). Also count the number of points of X that lie in I.
- If F is constant on I, then |X ∩ I| / (|X| · length(I)) is very close to F on I.
- If F is not constant on I, recurse on I (zoom in on the trouble spot). A simplified sketch follows below.
- Requires |X| = Ω(k^6/ε^(6P)). Space usage: O(k^3/ε^2 · P + k/ε).
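Here is a much simplified, self-contained sketch of the zoom-in idea (my own code, with arbitrary constants): the crude left-half/right-half comparison stands in for the Constant subroutine, and the in-memory list `stream` stands in for the data on disk.

```python
import random

def learn_step_density(stream, k, eps, levels, lo, hi):
    """Simplified zoom-in sketch.  Each level: subsample to split (lo, hi) into
    intervals of roughly equal empirical weight, then on each interval either
    output a constant estimate or recurse if the density looks non-constant.
    Returns a list of ((a, b), density) pieces."""
    data = [x for x in stream if lo <= x < hi]
    if not data:
        return [((lo, hi), 0.0)]

    # "Pass 1": uniform subsample, then intervals of roughly equal sample weight.
    s = min(len(data), max(1, int(4 * k * k / (eps * eps))))
    sample = sorted(random.sample(data, s))
    num_int = max(1, min(s, int(2 * k / eps)))
    cuts = [lo] + [sample[(i * s) // num_int] for i in range(1, num_int)] + [hi]

    pieces = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        if b <= a:
            continue
        inside = [x for x in data if a <= x < b]            # "Pass 2": counting
        left = sum(1 for x in inside if x < (a + b) / 2)     # crude constancy test
        nearly_const = abs(2 * left - len(inside)) <= eps * len(inside) + 1
        if nearly_const or levels <= 1:
            pieces.append(((a, b), len(inside) / (len(stream) * (b - a))))
        else:
            pieces.extend(learn_step_density(stream, k, eps, levels - 1, a, b))
    return pieces
```

For instance, calling learn_step_density(X, k=3, eps=0.1, levels=3, lo=0.0, hi=5.0) on the samples X drawn in the slide 14 sketch returns a piecewise-constant estimate of the mixture density.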
21 The Action
[Figure: from the data stream, take a sample, mark the intervals containing a jump, and recurse on the sub-data stream; repeat this P-2 more times]
22 Why It Works: Bounding the Error
- It is easy to learn F where it is constant. Our estimate is very accurate on these intervals.
- The weight of the bins decreases exponentially at each iteration.
- At the Pth iteration, bins have weight at most ε^P/4k.
- Thus, we can estimate F as 0 on the 2k bins where there is a jump, and incur a total error of at most 2k · ε^P/4k = ε^P/2.
23 Generalizations
- We can generalize our algorithm to learn the following types of distributions (with roughly the same trade-off of space and passes).
- Mixtures of uniform distributions over axis-aligned rectangles in R^2:
- F_i is uniform over a rectangle (a_i, b_i) × (c_i, d_i) ⊆ R^2.
- Mixtures of linear distributions in R:
- The density of F_i is linear over some interval (a_i, b_i) ⊆ R.
- Heuristic: For a mixture of Gaussians, just treat it like a mixture of m linear distributions for some m >> k (the theoretical error will be large if m is not really big). A numerical illustration follows below.
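As a rough numerical check of this heuristic (entirely my own illustration), the sketch below measures the L1 distance between a standard Gaussian and a piecewise-linear interpolant with m pieces on [-5, 5]; the error shrinks quickly as m grows, which is the sense in which a Gaussian looks like a mixture of many linear pieces.

```python
import math

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def piecewise_linear_l1_error(m, lo=-5.0, hi=5.0, grid=100000):
    """L1 distance between the standard Gaussian density and its piecewise-linear
    interpolant on m equal pieces of [lo, hi], by numerical integration
    (the tiny Gaussian mass outside [-5, 5] is ignored)."""
    knots = [lo + (hi - lo) * i / m for i in range(m + 1)]
    vals = [gaussian_pdf(t) for t in knots]

    def interp(x):
        i = min(int((x - lo) / (hi - lo) * m), m - 1)
        t = (x - knots[i]) / (knots[i + 1] - knots[i])
        return (1 - t) * vals[i] + t * vals[i + 1]

    dx = (hi - lo) / grid
    return sum(abs(gaussian_pdf(lo + (j + 0.5) * dx) - interp(lo + (j + 0.5) * dx)) * dx
               for j in range(grid))

print(piecewise_linear_l1_error(8))    # L1 error with 8 pieces
print(piecewise_linear_l1_error(64))   # much smaller with 64 pieces
```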
24 Lower Bounds
- We define a Generalized Learning Problem (GLP), which is a slight generalization of the problem of learning mixtures of distributions.
- Thm: Any P-pass randomized algorithm that solves the GLP must use at least Ω(1/ε^(1/(2P-1))) bits of memory.
- The proof uses r-round communication complexity.
- Thm: There exists a P-pass algorithm that solves the GLP using at most O(1/ε^(4/P)) bits of memory.
- Slight generalization of the algorithm given above.
- Conclusion: the trade-off is nearly tight for the GLP.
25 General Framework for Clustering: Adaptive Sampling in 2P Passes
- Pass 1: draw a small sample S from the data.
- Compute a solution C on the subproblem S. (If S is small, we can do this in memory.)
- Pass 2: determine the points R that are not clustered well by C, and recurse on R.
- If S is representative, then there won't be many points in R.
- If R is small, then we will sample it at a higher rate in subsequent iterations and get a better solution for these outliers. (A schematic of the framework follows below.)
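A generic schematic of this 2P-pass framework (my own sketch; `solve` and `cost` are placeholder callbacks, and the in-memory list and the sort stand in for streaming passes):

```python
import random

def adaptive_cluster(points, sample_size, rounds, solve, cost):
    """Schematic of the adaptive-sampling framework: in each round, sample
    (pass 1), solve on the sample in memory, then keep the worst-served points
    as outliers and recurse on them (pass 2)."""
    remaining = list(points)
    solutions = []
    for r in range(rounds):
        s = min(sample_size, len(remaining))
        if s == 0:
            break
        sample = random.sample(remaining, s)       # pass 1: uniform sample
        solution = solve(sample)                   # solved in memory
        solutions.append(solution)
        if r == rounds - 1:
            break
        # pass 2 (conceptually): find the points served worst by this solution.
        remaining.sort(key=lambda x: cost(x, solution), reverse=True)
        remaining = remaining[:len(remaining) // 4]   # illustrative outlier fraction
    return solutions

# Toy instantiation in R: centers = the sample itself, cost = distance to nearest center.
pts = [random.random() for _ in range(1000)]
sols = adaptive_cluster(pts, sample_size=20, rounds=3,
                        solve=lambda sample: list(sample),
                        cost=lambda x, centers: min(abs(x - c) for c in centers))
```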
26 Facility Location
- Input: a set X of points, with distances defined by a function d and a facility cost f_i for each point.
- Problem: Find a set F of facilities that minimizes Σ_{i ∈ F} f_i + Σ_{i ∈ X} d(i, F), where d(i, F) = min_{j ∈ F} d(i, j). (The cost computation below checks the example.)
[Figure: four points with facility costs f1 = 2, f2 = 40, f3 = 8, f4 = 25 and distances d(1,2) = 5, d(1,3) = 2, d(2,3) = 6, d(1,4) = 5, d(3,4) = 3. The cost of the solution that builds facilities at 1 and 3 is 2 + 8 + 5 + 3 = 18.]
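The cost on this slide can be checked with a few lines of Python (my own illustration; the dictionaries just encode the numbers above):

```python
def fl_cost(facilities, facility_cost, dist, points):
    """Facility location objective: opening costs of the chosen facilities
    plus each point's distance to its nearest open facility."""
    def distance(i, j):
        if i == j:
            return 0
        a, b = min(i, j), max(i, j)
        return dist[a][b]
    open_cost = sum(facility_cost[f] for f in facilities)
    service_cost = sum(min(distance(i, f) for f in facilities) for i in points)
    return open_cost + service_cost

f = {1: 2, 2: 40, 3: 8, 4: 25}
d = {1: {2: 5, 3: 2, 4: 5}, 2: {3: 6}, 3: {4: 3}}
print(fl_cost({1, 3}, f, d, points=[1, 2, 3, 4]))   # 2 + 8 + 5 + 3 = 18
```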
27 Quick Facts About Facility Location
- A natural operations research problem.
- NP-Hard to solve optimally.
- Many approximation algorithms (not streaming or massive data set algorithms).
- The best known approximation ratio is 1.52.
- Achieving a factor of 1.463 is hard.
28 Other NP-Hard Clustering Problems in Combinatorial Optimization
- k-center: Find a set of centers C, |C| ≤ k, that minimizes max_{i ∈ X} d(i, C).
- One-pass approximation algorithm that requires O(k) space [CCFM 97]
- k-median: Find a set of centers C, |C| ≤ k, that minimizes Σ_{i ∈ X} d(i, C).
- One-pass approximation algorithm that requires O(k log^2 n) space [COP 03]
- No pass-efficient algorithm for general FL!
- One pass for very restricted inputs [Indyk 04]
29 Memory Issues for Facility Location
- Note that some instances of FL will require Ω(n) facilities to get any reasonable approximation.
- Thm: Any P-pass randomized algorithm for approximately computing the optimum FL cost requires Ω(n/P) bits of memory.
- How can we give reasonable guarantees that the space usage is small?
- Parameterize bounds on memory usage by the number of facilities opened.
30 Algorithmic Results
- A 3P-pass algorithm for solving FL, using at most O(k · n^(2/P)) bits of extra memory, where k is the number of facilities output by the algorithm.
- Approximation ratio is O(P): if OPT is the cost of the optimum solution, then the algorithm will output a solution with cost at most γ · 36 · P · OPT.
- Requires a black-box approximation algorithm for FL with approximation ratio γ.
- Best so far: γ = 1.52.
- Surprising fact: same trade-off, very different problem!
31 Simplifying the Problem
- A large number of facilities complicates matters. The algorithmic cure involves technical details that obscure the main point we would like to make.
- An easier problem that demonstrates our framework in action is k-Facility Location.
- k-FL: Find a set of at most k facilities F (|F| ≤ k) that minimizes Σ_{i ∈ F} f_i + Σ_{i ∈ X} d(i, F).
- We present a 3P-pass algorithm that uses O(k · n^(1/P)) bits of memory. Approximation ratio: O(P).
32 Idea of Our Algorithm
- Our algorithm will make P triples of passes, for 3P passes in total.
- In the first two passes of each triple, we take a uniform sample of X and compute an FL solution on the sample.
- Intuitively, this will be a good solution for most of the input.
- In the third pass, we identify bad points, which are far away from our facilities.
- In subsequent passes, we recursively compute a solution for the bad points by restricting our algorithm to those points.
33 Algorithm, Part 1: Clustering a Sample
- In one pass, draw a sample S1 of s = O(k^((P-1)/P) · n^(1/P)) nodes from the data stream.
- In a second pass, compute an O(1)-approximation to FL on S1, but with facilities drawn from all of X and with distances in S1 scaled up by n/s. Call the set of facilities F1.
34 How Good is the Solution F1?
- We want to show that the cost of F1 on X is small.
- It is easy to show that Σ_{i ∈ F1} f_i < O(1) · OPT.
- Best scenario: facilities that service the sample S1 well will service all points of X well, i.e. Σ_{j ∈ F1} f_j + Σ_{j ∈ X} d(j, F1) = O(1) · OPT.
- Unfortunately, this is not true in general.
35 Bounding the Cost (similar to [Indyk 99])
- Fix an optimum solution on X: a set of facilities F, |F| ≤ k, that partitions X into k clusters C1, ..., Ck.
- If a cluster Ci is large (i.e., |Ci| > n/s), then a sample S1 of X of size s will contain many points of Ci.
- We can prove that a good clustering of S1 will be good for points in the large clusters Ci of X.
- It will cost at most O(1) · OPT to service the large clusters.
- We can prove that the number of bad points is at most O(k^(1/P) · n^(1-1/P)).
36 Algorithm, Part 2: Recursing on the Outliers
- In a third pass, identify the O(k^(1/P) · n^(1-1/P)) points that are farthest away from the facilities in F1. This can be done in O(n^(1/P)) space using random sampling.
- Assign all other points to facilities in F1.
- The cost of this is at most O(1) · OPT.
- Recurse on the outliers.
- In subsequent iterations, we will draw a sample of size s from the outliers; thus, points will be sampled at a much higher frequency.
37 The General Recursive Step
- For the mth iteration: let R_m be the points remaining to be clustered.
- Take a sample S of size s from R_m, and compute an FL solution F_m.
- Identify the O(k^((m+1)/P) · n^(1-(m+1)/P)) points that are farthest away from F_m; call this set R_{m+1}.
- We can prove that F_m will service R_m \ R_{m+1} with cost O(1) · OPT.
- Recurse on the points in R_{m+1}.
38 Putting It All Together
- Let the final set of facilities be F = ∪_i F_i.
- After P iterations, all points will be assigned to a facility.
- The total cost of the solution will be O(P) · OPT.
- Note that possibly |F| > k. We can boil this down to k facilities that still give O(P) · OPT.
- Hint: use a known k-median algorithm.
- The algorithm generalizes to k-median, but requires more passes and space than [COP 03].
39 References
- J. Aspnes, K. Chang, A. Yampolskiy. Inoculation strategies for victims of viruses and the sum-of-squares partition problem. To appear in Journal of Computer and System Sciences. Preliminary version appeared in SODA 2005.
- K. Chang, R. Kannan. Pass-Efficient Algorithms for Clustering. SODA 2006.
- K. Chang. Pass-Efficient Algorithms for Facility Location. Yale TR 1337.
- S. Arora, K. Chang. Approximation Schemes for Low Degree MST and Red-Blue Separation Problems. Algorithmica 40(3):189-210, 2004. Preliminary version appeared in ICALP 2003. (Not in thesis)
40 Future Directions
- Consider problems from data mining, abstract (and possibly simplify) them, and see what algorithmic insights TCS can provide.
- Possible extensions of this thesis work:
- Design a pass-efficient algorithm for learning mixtures of Gaussian distributions in high dimension. Give rigorous guarantees about accuracy.
- Design pass-efficient algorithms for other combinatorial optimization problems. One possibility: find a set of centers C, |C| ≤ k, to minimize Σ_i d(i, C)^2. (A generalization of the k-means objective; the Euclidean case is already solved.)
- Many problems are out there. Much more to be done!
41 Thanks for listening!