Pass-Efficient Algorithms for Clustering - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Pass-Efficient Algorithms for Clustering


1
Pass-Efficient Algorithms for Clustering
  • Dissertation Defense
  • Kevin Chang
  • Adviser: Ravi Kannan

Committee: Dana Angluin, Joan Feigenbaum, Petros Drineas (RPI)
2
Overview
  • Massive Data Sets in Theoretical Computer
    Science
  • Algorithms for input that is too large to fit in the memory of a computer.
  • Clustering Problems
  • Learning generative models
  • Clustering via combinatorial optimization
  • Both massive data set and traditional algorithms

3
Theoretical Abstractions for Massive Data Set Computation
  • The input on disk/storage is modeled as a
    read-only array. Elements may only be accessed
    through a sequential pass.
  • Input elements may be arbitrarily ordered.
  • Main memory is modeled as extra space used for
    intermediate calculations.
  • Algorithm is allowed extra time before and after
    each pass to perform calculations.
  • Goal: Minimize memory usage and the number of passes.

4
Models of Computation
  • Streaming Model: the algorithm may make a single pass over the data. Space must be o(n).
  • Pass-Efficient Model: the algorithm may make a small, constant number of passes. Ideally, space is O(1).
  • Other models: sublinear algorithms.
  • The pass-efficient model is more flexible than streaming, but it is not suitable for streaming data that arrives, is processed immediately, and is then forgotten.

5
Main Question: What will multiple passes buy you?
  • Is it better, in terms of other resources, to make 3 passes instead of 1 pass?
  • Example: Find the mean of an array of integers (see the sketch after this list).
  • 1 pass requires O(1) space.
  • More passes don't help.
  • Example: Find the median. [MP 80]
  • A 1-pass algorithm requires Ω(n) space.
  • A 2-pass algorithm needs only O(n^(1/2)) space.
  • We will study the trade-off between passes and memory.
  • Other work, for graphs: [FKMSZ 05]
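As an aside, a minimal Python sketch of the one-pass, O(1)-space mean computation mentioned above (illustrative only; the function name is ours, not from the slides):

    def streaming_mean(stream):
        # One sequential pass, O(1) extra memory: a running total and a count.
        total, count = 0.0, 0
        for x in stream:
            total += x
            count += 1
        return total / count if count else 0.0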

6
Overview of Results
  • A general framework for pass-efficient clustering algorithms, applied to specific problems:
  • A learning problem: a generative model of clustering.
  • Sharp trade-off between passes and space.
  • Lower bounds show this is nearly tight.
  • A combinatorial optimization problem: Facility Location.
  • The same sharp trade-off.
  • An algorithm for a graph partitioning problem.
  • It can be considered a clustering problem.

7
Graph Partitioning: Sum-of-Squares Partition
  • Joint with Alex Yampolskiy, Jim Aspnes.
  • Given a graph G = (V, E), remove m nodes or edges to disconnect the graph into components Ci, in order to minimize the objective function Σ_i |Ci|^2.

The objective function is natural, because large
components will have roughly equal size in the
optimum partition.
8
Graph Partitioning: Our Results
  • Definition ((α, β)-bicriterion approximation):
  • remove αm nodes or edges to get a partition {Ci}
  • such that Σ_i |Ci|^2 < β·OPT.
  • OPT is the best sum of squares achievable by removing m nodes or edges.
  • Approximation algorithm for SSP:
  • Thm: There exists an (O(log^1.5 n), O(1))-approximation algorithm.
  • Hardness of approximation:
  • Thm: It is NP-hard to compute a (1.18, 1)-approximation.
  • By reduction from minimum Vertex Cover.

9
Outline of Algorithm
  • Recursively partition the graph by removing sparse node or edge cuts (a sketch follows this list).
  • Sparsest cut is an important problem in theory.
  • Informally: find a cut that removes a small number of edges to obtain two relatively large components.
  • Recent activity in this area: [LR 99], [ARV 04].
  • The node cut problem reduces to the edge cut problem.
  • A similar approach was taken by [KVV 00] for a different clustering problem.
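A minimal Python sketch of this outline, assuming a hypothetical black-box routine sparse_node_cut(G, component) that splits a component along a sparse node cut (that routine, and all names here, are ours, not from the slides):

    import math

    def sum_of_squares(components):
        # Objective value: sum of |C_i|^2 over the current components.
        return sum(len(c) ** 2 for c in components)

    def recursive_partition(G, m, sparse_node_cut):
        # G: adjacency structure whose keys are the nodes (len(G) = n).
        n = len(G)
        budget = int(math.log(max(n, 2)) ** 1.5 * m)   # ~ Theta(log^1.5 n) * m removals
        components = [set(G)]
        removed = set()
        while len(removed) < budget:
            best = None
            for i, comp in enumerate(components):
                if len(comp) <= 1:
                    continue
                # Candidate cut for this component.
                cut, part_a, part_b = sparse_node_cut(G, comp)
                trial = components[:i] + components[i + 1:] + [part_a, part_b]
                score = sum_of_squares(trial)
                if best is None or score < best[0]:
                    best = (score, set(cut), i, [part_a, part_b])
            if best is None:
                break
            # Apply the cut that most reduces the objective.
            _, cut, i, parts = best
            removed |= cut
            components = components[:i] + components[i + 1:] + parts
        return components, removed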

10
Rough Idea of Algorithm
  • In the first iteration, partition the graph G into C1, C2 by removing a sparse node cut.

[Figure: the graph G before any cut]
11
Rough Idea of Algorithm
  • In the first iteration, partition the graph G into C1, C2 by removing a sparse node cut.

[Figure: G split into components C1 and C2 by the removed cut]
12
Rough Idea of Algorithm
  • Subsequent iterations calculate cuts in all
    components, and remove the cut that most
    effectively reduces the objective function.

[Figure: candidate cuts computed within C1 and C2]
13
Rough Idea of Algorithm
  • Continue the procedure until Θ(log^1.5 n)·m nodes have been removed.

[Figure: components C1, C2, C3 after further cuts]
14
Generative Models of Clustering
  • Assume the input is generated by k different random processes.
  • k distributions F1, F2, ..., Fk, each with weight wi > 0 such that Σ_i wi = 1.
  • A sample is drawn from the mixture by picking Fi with probability wi and then drawing a point according to Fi (see the sketch after this list).
  • Alternatively, if each Fi is a density function, then the mixture has density F = Σ_i wi Fi.
  • Much TCS work on learning mixtures of Gaussian distributions in high dimension: [D 99], [AK 01], [VW 02], [KSV 05].
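A tiny illustrative Python snippet of the sampling procedure just described (the names and the usage example are ours; the component samplers stand in for the Fi):

    import random

    def draw_from_mixture(weights, samplers):
        # Pick component i with probability w_i, then draw a point from F_i.
        r, acc = random.random(), 0.0
        for w, draw in zip(weights, samplers):
            acc += w
            if r <= acc:
                return draw()
        return samplers[-1]()   # guard against floating-point round-off

    # Example usage (made-up parameters): an equal-weight mixture of
    # uniforms on (0, 1) and (2, 5).
    x = draw_from_mixture([0.5, 0.5],
                          [lambda: random.uniform(0, 1),
                           lambda: random.uniform(2, 5)])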

15
Learning Mixtures
  • Q: Given samples from the mixture, can we learn the parameters of the k processes?
  • A: Not always. The question can be ill-posed.
  • Two different mixtures can produce the same density.
  • Our problem: Can we learn the density F of the mixture in the pass-efficient model?
  • What are the trade-offs between space and passes?
  • Suppose F is the density of the mixture. The accuracy of our estimate G is measured by the L1 distance ∫ |F - G|.

16
Mixtures of k uniform distributions in R
  • Assume each Fi is a uniform distribution over some interval (ai, bi) in R.
  • Then the mixture density looks like a step distribution:

The problem reduces to learning a density
function that is known to be piecewise constant
with at most 2k-1 steps.
[Figure: a step density over points x1 < x2 < x3 < x4 < x5, for a mixture of three distributions: F1 on (x1, x3), F2 on (x2, x3), and F3 on (x4, x5)]
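As a small illustration (ours, not from the slides), such a step density can be evaluated directly from an interval/weight description of the mixture:

    def step_density(mix, x):
        # mix is a list of (a_i, b_i, w_i) triples; the mixture density at x is
        # the sum of w_i / (b_i - a_i) over the intervals that contain x.
        return sum(w / (b - a) for (a, b, w) in mix if a <= x < b)

    # Loosely matching the figure (the weights are assumed, not given there):
    # mix = [(x1, x3, 1/3), (x2, x3, 1/3), (x4, x5, 1/3)]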
17
Algorithmic Results
  • Pick any integer P > 0 and any 1 > ε > 0. We give a 2P-pass randomized algorithm to learn a mixture of k uniform distributions.
  • Error at most ε^P, with memory O(k^3/ε^2 + Pk/ε).
  • The error drops exponentially in the number of passes, while the memory used grows very slowly.
  • Alternatively: error at most ε, with memory O(k^3/ε^(2/P)).
  • With 2 passes you need (1/ε)^2 memory, with 4 passes (1/ε), with 8 passes (1/ε)^(1/2).
  • Memory drops very sharply as P increases, while the error remains constant.
  • Success probability 1 - δ.

18
Step distribution: First attempt
  • Let X be the data stream of samples from F.
  • Break the domain into bins and count the number of points that fall in each bin. Estimate F on each bin (see the sketch after this list).
  • To get accuracy ε^P, you need to store at least ≈ 1/ε^P counters.
  • Too much! We can do much better using a few more passes.
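A minimal sketch of this naive one-pass binning estimator (Python; equal-width bins and the function name are our choices):

    def histogram_estimate(stream, lo, hi, bins):
        # One pass: count points per bin, then estimate F as constant on each
        # bin. Accuracy eps^P would force roughly 1/eps^P bins (counters).
        counts = [0] * bins
        n = 0
        width = (hi - lo) / bins
        for x in stream:
            n += 1
            j = int((x - lo) / width)
            if 0 <= j < bins:
                counts[j] += 1
        return [c / (n * width) if n else 0.0 for c in counts]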

19
General Framework
  • In one pass, take a small uniform subsample of size s from the data stream.
  • Compute a coarse solution on the subsample.
  • In a second pass, sharpen the coarse solution wherever it is reasonable.
  • Also in the second pass, find the places where the coarse solution is very far off.
  • Recurse on these spots.
  • Examine these trouble spots more carefully: zoom in!

20
Our algorithm
  • In one pass, draw a sample of size s = Θ(k^2/ε^2) from X.
  • Based on the sample, partition the domain into intervals I such that ∫_I F = Θ(ε/2k).
  • The number of intervals I will be O(k/ε).
  • In one pass, for each I, determine whether F is very close to constant on I (call the subroutine Constant). Also count the number of points of X that lie in I (see the sketch after this list).
  • If F is constant on I, then |X ∩ I| / (|X| · length(I)) is very close to F on I.
  • If F is not constant on I, recurse on I (zoom in on the trouble spot).
  • Requires |X| = Ω(k^6/ε^(6P)). Space usage: O(k^3/ε^2 + Pk/ε).
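A hedged Python sketch of one refinement round of this scheme. The is_constant argument stands in for the slides' Constant subroutine, whose actual test is not reproduced here; interval handling is simplified (half-open intervals, no recursion bookkeeping):

    def split_into_intervals(sample, k, eps):
        # Pass 1: from a sample of size ~ k^2/eps^2, cut the domain into
        # O(k/eps) intervals that each hold roughly an eps/(2k) fraction of it.
        sample = sorted(sample)
        step = max(1, int(len(sample) * eps / (2 * k)))
        cuts = sample[::step]
        if cuts[-1] != sample[-1]:
            cuts.append(sample[-1])
        return list(zip(cuts[:-1], cuts[1:]))

    def refine(stream, intervals, is_constant):
        # Pass 2: count stream points per interval. Where F looks constant,
        # output the empirical density; otherwise mark a trouble spot to
        # recurse on (at a correspondingly higher sampling rate).
        counts = [0] * len(intervals)
        total = 0
        for x in stream:
            total += 1
            for j, (a, b) in enumerate(intervals):
                if a <= x < b:
                    counts[j] += 1
                    break
        estimates, trouble = {}, []
        for j, (a, b) in enumerate(intervals):
            if is_constant(a, b):
                density = counts[j] / (total * (b - a)) if total and b > a else 0.0
                estimates[(a, b)] = density
            else:
                trouble.append((a, b))
        return estimates, trouble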

21
The Action
[Diagram: data stream → sample → mark intervals with a jump → sub-data stream; repeat this P - 2 more times]
22
Why it works: bounding the error
  • It is easy to learn F where it is constant; our estimate is very accurate on these intervals.
  • The weight of the bins decreases exponentially at each iteration.
  • At the Pth iteration, bins have weight at most ε^P/(4k).
  • Thus we can estimate F as 0 on the 2k bins where there is a jump, and incur a total error of at most 2k · ε^P/(4k) = ε^P/2.

23
Generalizations
  • Our algorithm generalizes to learn the following types of distributions (with roughly the same trade-off of space and passes):
  • Mixtures of uniform distributions over axis-aligned rectangles in R^2.
  • Fi is uniform over a rectangle (ai, bi) × (ci, di) ⊆ R^2.
  • Mixtures of linear distributions in R.
  • The density of Fi is linear over some interval (ai, bi) ⊆ R.
  • Heuristic: for a mixture of Gaussians, just treat it as a mixture of m linear distributions for some m >> k (the theoretical error bound is large unless m is really big).

24
Lower bounds
  • We define a Generalized Learning Problem (GLP), a slight generalization of the problem of learning mixtures of distributions.
  • Thm: Any P-pass randomized algorithm that solves the GLP must use at least Ω(1/ε^(1/(2P-1))) bits of memory.
  • The proof uses r-round communication complexity.
  • Thm: There exists a P-pass algorithm that solves the GLP using at most O(1/ε^(4/P)) bits of memory.
  • A slight generalization of the algorithm given above.
  • Conclusion: the trade-off is nearly tight for the GLP.

25
General framework for clustering: adaptive sampling in 2P passes
  • Pass 1: draw a small sample S from the data.
  • Compute a solution C on the subproblem S. (If S is small, this can be done in memory.)
  • Pass 2: determine the points R that are not clustered well by C, and recurse on R.
  • If S is representative, then there won't be many points in R.
  • If R is small, then we will sample it at a higher rate in subsequent iterations and get a better solution for these outliers.

26
Facility Location
Input: a set X of points, with distances given by a function d and a facility cost fi for each point.
Problem: Find a set F of facilities that minimizes Σ_{i ∈ F} fi + Σ_{i ∈ X} d(i, F), where d(i, F) = min_{j ∈ F} d(i, j).
[Example: four points with facility costs f1 = 2, f2 = 40, f3 = 8, f4 = 25 and distances d(1,2) = 5, d(1,3) = 2, d(2,3) = 6, d(1,4) = 5, d(3,4) = 3. The cost of the solution that builds facilities at 1 and 3 is 2 + 8 + 5 + 3 = 18. A small code sketch follows.]
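A small Python sketch of this objective, reproducing the slide's example (the function name is ours):

    def fl_cost(points, f, d, F):
        # Facility location objective: opening costs plus, for every point,
        # the distance to its nearest open facility (0 if the point is in F).
        open_cost = sum(f[i] for i in F)
        service_cost = sum(min(0 if i == j else d[min(i, j), max(i, j)] for j in F)
                           for i in points)
        return open_cost + service_cost

    points = [1, 2, 3, 4]
    f = {1: 2, 2: 40, 3: 8, 4: 25}
    d = {(1, 2): 5, (1, 3): 2, (2, 3): 6, (1, 4): 5, (3, 4): 3}
    print(fl_cost(points, f, d, {1, 3}))   # 2 + 8 + 5 + 3 = 18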
27
Quick facts about facility location
  • Natural operations research problem.
  • NP-Hard to solve optimally.
  • Many approximation algorithms (though not in the streaming or massive data set setting).
  • The best known approximation ratio is 1.52.
  • Achieving a factor better than 1.463 is hard.

28
Other NP-Hard clustering problems in
Combinatorial Optimization
  • k-center: Find a set of centers C, |C| ≤ k, that minimizes max_{i ∈ X} d(i, C).
  • One-pass approximation algorithm that requires O(k) space. [CCFM 97]
  • k-median: Find a set of centers C, |C| ≤ k, that minimizes Σ_{i ∈ X} d(i, C).
  • One-pass approximation algorithm that requires O(k log^2 n) space. [COP 03]
  • No pass-efficient algorithm for general FL!
  • One pass for very restricted inputs. [Indyk 04]

29
Memory issues for facility location
  • Note that some instances of FL require Ω(n) facilities to achieve any reasonable approximation.
  • Thm: Any P-pass randomized algorithm for approximately computing the optimum FL cost requires Ω(n/P) bits of memory.
  • How can we give reasonable guarantees that the space usage is small?
  • Parameterize the memory bound by the number of facilities opened.

30
Algorithmic Results
  • A 3P-pass algorithm for solving FL, using at most O(k · n^(2/P)) bits of extra memory, where k is the number of facilities output by the algorithm.
  • The approximation ratio is O(P): if OPT is the cost of the optimum solution, then the algorithm outputs a solution of cost at most γ · 36 · P · OPT.
  • Requires a black-box approximation algorithm for FL with approximation ratio γ.
  • Best so far: γ = 1.52.
  • Surprising fact: the same trade-off, for a very different problem!

31
Simplifying the Problem
  • A large number of facilities complicates matters. The algorithmic cure involves technical details that obscure the main point we would like to make.
  • An easier problem that demonstrates our framework in action is k-Facility Location.
  • k-FL: Find a set of at most k facilities F (|F| ≤ k) that minimizes Σ_{i ∈ F} fi + Σ_{i ∈ X} d(i, F).
  • We present a 3P-pass algorithm that uses O(k · n^(1/P)) bits of memory, with approximation ratio O(P).

32
Idea of our algorithm
  • Our algorithm makes P triples of passes, 3P passes in total.
  • In the first two passes of each triple, we take a uniform sample of X and compute a FL solution on the sample.
  • Intuitively, this will be a good solution for most of the input.
  • In the third pass, we identify the "bad" points that are far away from our facilities.
  • In subsequent passes, we recursively compute a solution for the bad points by restricting our algorithm to those points.

33
Algorithm, Part 1: Clustering a Sample
  • In one pass, draw a sample S1 of s = O(k^((P-1)/P) · n^(1/P)) nodes from the data stream.
  • In a second pass, compute an O(1)-approximation to FL on S1, but with facilities drawn from all of X and with distances within S1 scaled up by n/s. Call the resulting set of facilities F1.

34
How good is the solution F1?
  • We want to show that the cost of F1 on all of X is small.
  • It is easy to show that Σ_{i ∈ F1} fi ≤ O(1) · OPT.
  • Best scenario: facilities that service the sample S1 well also service all points of X well, i.e. Σ_{j ∈ F1} fj + Σ_{j ∈ X} d(j, F1) ≤ O(1) · OPT.
  • Unfortunately, this is not true in general.

35
Bounding the cost, similar to [Indyk 99]
  • Fix an optimum solution on X: a set of facilities F*, |F*| ≤ k, that partitions X into k clusters C1, ..., Ck.
  • If a cluster Ci is large (i.e., |Ci| > n/s), then a sample S1 of X of size s will contain many points of Ci.
  • We can prove that a good clustering of S1 is good for the points in the large clusters Ci of X.
  • It will cost at most O(1) · OPT to service the large clusters.
  • We can prove that the number of bad points is at most O(k^(1/P) · n^(1-1/P)).

36
Algorithm, part 2: Recursing on the outliers
  • In a third pass, identify the O(k^(1/P) · n^(1-1/P)) points that are farthest away from the facilities in F1. This can be done in O(n^(1/P)) space using random sampling.
  • Assign all other points to facilities in F1.
  • The cost of this is at most O(1) · OPT.
  • Recurse on the outliers.
  • In subsequent iterations we draw a sample of size s from the outliers; thus these points are sampled at a much higher rate.

37
The General Recursive Step
  • For the mth iteration: let Rm be the points remaining to be clustered.
  • Take a sample S of size s from Rm, and compute a FL solution Fm (see the sketch after this list).
  • Identify the O(k^((m+1)/P) · n^(1-(m+1)/P)) points that are farthest away from Fm; call this set Rm+1.
  • We can prove that Fm services Rm \ Rm+1 with cost O(1) · OPT.
  • Recurse on the points in Rm+1.
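An in-memory Python sketch of this recursion (the real algorithm performs each round with a triple of passes over the stream). The black-box solver fl_approx and its scale keyword are hypothetical stand-ins for the O(1)-approximation used in the first two passes of each triple:

    import random

    def k_fl_sketch(X, f, d, k, P, fl_approx):
        # X: list of points; f: facility costs; d(i, j): distance function.
        n = len(X)
        s = int(k ** ((P - 1) / P) * n ** (1.0 / P)) + 1   # sample size per round
        R = list(X)                                        # points left to cluster
        facilities = set()
        for m in range(1, P + 1):
            S = random.sample(R, min(s, len(R)))
            # Solve FL on the sample, facilities allowed from all of X,
            # sample distances scaled up by |R| / |S|.
            F_m = fl_approx(S, X, f, d, scale=len(R) / max(len(S), 1))
            facilities |= set(F_m)
            if m == P or not F_m:
                break
            # Keep the points farthest from F_m as the outlier set R_{m+1}.
            num_out = int(k ** ((m + 1.0) / P) * n ** (1 - (m + 1.0) / P)) + 1
            R.sort(key=lambda i: min(d(i, j) for j in F_m), reverse=True)
            R = R[:min(num_out, len(R))]
        return facilities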

38
Putting it all together
  • Let the final set of facilities be F = ∪_m Fm.
  • After P iterations, every point has been assigned to a facility.
  • The total cost of the solution is O(P) · OPT.
  • Note that possibly |F| > k. We can boil F down to k facilities that still give O(P) · OPT.
  • Hint: use a known k-median algorithm.
  • The algorithm generalizes to k-median, but requires more passes and space than [COP 03].

39
References
  • J. Aspnes, K. Chang, A. Yampolskiy. Inoculation
    strategies for victims of viruses and
    sum-of-squares partition problem. To appear in
    Journal of Computer and System Sciences.
    Preliminary version appeared in SODA 2005.
  • K. Chang, R. Kannan. Pass-Efficient Algorithms
    for Clustering. SODA 2006.
  • K. Chang. Pass-Efficient Algorithms for Facility
    Location. Yale TR 1337.
  • S. Arora, K. Chang. Approximation Schemes for
    low degree MST and Red-blue Separation problem.
    Algorithmica 40(3):189-210, 2004. Preliminary
    version appeared in ICALP 2003. (Not in thesis)

40
Future Directions
  • Consider problems from data mining, abstract (and
    possibly simplify) them, and see what algorithmic
    insights TCS can provide.
  • Possible extensions of this thesis work
  • Design a pass-efficient algorithm for learning
    mixtures of Gaussian distributions in high
    dimension. Give rigorous guarantees about
    accuracy.
  • Design pass-efficient algorithms for other combinatorial optimization problems. One possibility: find a set of centers C, |C| ≤ k, to minimize Σ_i d(i, C)^2. (A generalization of the k-means objective; the Euclidean case has already been solved.)
  • Many problems out there. Much more to be done!

41
Thanks for listening!