Pass-Efficient Algorithms for Clustering - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Pass-Efficient Algorithms for Clustering


1
Pass-Efficient Algorithms for Clustering
  • Dissertation Defense
  • Kevin Chang
  • Adviser: Ravi Kannan

Committee: Dana Angluin, Joan Feigenbaum, Petros Drineas (RPI)
2
Overview
  • Massive Data Sets in Theoretical Computer
    Science
  • Algorithms for input that is too large to fit in the memory of a computer.
  • Clustering Problems
  • Learning generative models
  • Clustering via combinatorial optimization
  • Both massive data set and traditional algorithms

3
Theoretical Abstractions for Massive Data Set Computation
  • The input on disk/storage is modeled as a
    read-only array. Elements may only be accessed
    through a sequential pass.
  • Input elements may be arbitrarily ordered.
  • Main memory is modeled as extra space used for
    intermediate calculations.
  • Algorithm is allowed extra time before and after
    each pass to perform calculations.
  • Goal: Minimize memory usage and the number of passes.

4
Models of Computation
  • Streaming Model: the algorithm may make a single pass over the data. Space must be o(n).
  • Pass-Efficient Model: the algorithm may make a small, constant number of passes. Ideally, space is O(1).
  • Other models: sublinear algorithms.
  • The pass-efficient model is more flexible than streaming, but it is not suitable for streaming data that arrives, is processed immediately, and is then forgotten.

5
Main Question: What will multiple passes buy you?
  • Is it better, in terms of other resources, to make 3 passes instead of 1 pass?
  • Example: Find the mean of an array of integers (see the sketch after this list).
  • 1 pass requires O(1) space.
  • More passes don't help.
  • Example: Find the median. [MP 80]
  • A 1-pass algorithm requires Ω(n) space.
  • A 2-pass algorithm needs only O(n^(1/2)) space.
  • We will study the trade-off between passes and memory.
  • Other work, for graphs: [FKMSZ 05]
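As an aside, a minimal Python sketch of the one-pass, O(1)-space mean computation mentioned above (illustrative only; the function name is ours, not from the slides):

    def streaming_mean(stream):
        # One sequential pass, O(1) extra memory: a running total and a count.
        total, count = 0.0, 0
        for x in stream:
            total += x
            count += 1
        return total / count if count else 0.0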

6
Overview of Results
  • A general framework for pass-efficient clustering algorithms, applied to specific problems:
  • A learning problem: a generative model of clustering.
  • Sharp trade-off between passes and space.
  • Lower bounds show this is nearly tight.
  • A combinatorial optimization problem: Facility Location.
  • The same sharp trade-off.
  • An algorithm for a graph partitioning problem.
  • It can be considered a clustering problem.

7
Graph Partitioning: Sum-of-Squares Partition
  • Joint with Alex Yampolskiy, Jim Aspnes.
  • Given a graph G = (V, E), remove m nodes or edges to disconnect the graph into components Ci, in order to minimize the objective function Σ_i |Ci|^2.

The objective function is natural, because large
components will have roughly equal size in the
optimum partition.
8
Graph Partitioning: Our Results
  • Definition ((α, β)-bicriterion approximation):
  • remove αm nodes or edges to get a partition {Ci}
  • such that Σ_i |Ci|^2 < β·OPT.
  • OPT is the best sum of squares achievable by removing m nodes or edges.
  • Approximation algorithm for SSP:
  • Thm: There exists an (O(log^1.5 n), O(1))-approximation algorithm.
  • Hardness of approximation:
  • Thm: It is NP-hard to compute a (1.18, 1)-approximation.
  • By reduction from minimum Vertex Cover.

9
Outline of Algorithm
  • Recursively partition the graph by removing sparse node or edge cuts (a sketch follows this list).
  • Sparsest cut is an important problem in theory.
  • Informally: find a cut that removes a small number of edges to obtain two relatively large components.
  • Recent activity in this area: [LR 99], [ARV 04].
  • The node cut problem reduces to the edge cut problem.
  • A similar approach was taken by [KVV 00] for a different clustering problem.
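A minimal Python sketch of this outline, assuming a hypothetical black-box routine sparse_node_cut(G, component) that splits a component along a sparse node cut (that routine, and all names here, are ours, not from the slides):

    import math

    def sum_of_squares(components):
        # Objective value: sum of |C_i|^2 over the current components.
        return sum(len(c) ** 2 for c in components)

    def recursive_partition(G, m, sparse_node_cut):
        # G: adjacency structure whose keys are the nodes (len(G) = n).
        n = len(G)
        budget = int(math.log(max(n, 2)) ** 1.5 * m)   # ~ Theta(log^1.5 n) * m removals
        components = [set(G)]
        removed = set()
        while len(removed) < budget:
            best = None
            for i, comp in enumerate(components):
                if len(comp) <= 1:
                    continue
                # Candidate cut for this component.
                cut, part_a, part_b = sparse_node_cut(G, comp)
                trial = components[:i] + components[i + 1:] + [part_a, part_b]
                score = sum_of_squares(trial)
                if best is None or score < best[0]:
                    best = (score, set(cut), i, [part_a, part_b])
            if best is None:
                break
            # Apply the cut that most reduces the objective.
            _, cut, i, parts = best
            removed |= cut
            components = components[:i] + components[i + 1:] + parts
        return components, removed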

10
Rough Idea of Algorithm
  • In the first iteration, partition the graph G into C1, C2 by removing a sparse node cut.

[Figure: the graph G before any cut]
11
Rough Idea of Algorithm
  • In the first iteration, partition the graph G into C1, C2 by removing a sparse node cut.

[Figure: G split into components C1 and C2 by the removed cut]
12
Rough Idea of Algorithm
  • Subsequent iterations calculate cuts in all
    components, and remove the cut that most
    effectively reduces the objective function.

[Figure: candidate cuts computed within C1 and C2]
13
Rough Idea of Algorithm
  • Continue the procedure until Θ(log^1.5 n)·m nodes have been removed.

[Figure: components C1, C2, C3 after further cuts]
14
Generative Models of Clustering
  • Assume the input is generated by k different random processes.
  • k distributions F1, F2, ..., Fk, each with weight wi > 0 such that Σ_i wi = 1.
  • A sample is drawn from the mixture by picking Fi with probability wi and then drawing a point according to Fi (see the sketch after this list).
  • Alternatively, if each Fi is a density function, then the mixture has density F = Σ_i wi Fi.
  • Much TCS work on learning mixtures of Gaussian distributions in high dimension: [D 99], [AK 01], [VW 02], [KSV 05].
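A tiny illustrative Python snippet of the sampling procedure just described (the names and the usage example are ours; the component samplers stand in for the Fi):

    import random

    def draw_from_mixture(weights, samplers):
        # Pick component i with probability w_i, then draw a point from F_i.
        r, acc = random.random(), 0.0
        for w, draw in zip(weights, samplers):
            acc += w
            if r <= acc:
                return draw()
        return samplers[-1]()   # guard against floating-point round-off

    # Example usage (made-up parameters): an equal-weight mixture of
    # uniforms on (0, 1) and (2, 5).
    x = draw_from_mixture([0.5, 0.5],
                          [lambda: random.uniform(0, 1),
                           lambda: random.uniform(2, 5)])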

15
Learning Mixtures
  • Q: Given samples from the mixture, can we learn the parameters of the k processes?
  • A: Not always. The question can be ill-posed.
  • Two different mixtures can produce the same density.
  • Our problem: Can we learn the density F of the mixture in the pass-efficient model?
  • What are the trade-offs between space and passes?
  • Suppose F is the density of the mixture. The accuracy of our estimate G is measured by the L1 distance ∫ |F - G|.

16
Mixtures of k uniform distributions in R
  • Assume each Fi is a uniform distribution over some interval (ai, bi) in R.
  • Then the mixture density looks like a step distribution:

The problem reduces to learning a density
function that is known to be piecewise constant
with at most 2k-1 steps.
[Figure: a step density over points x1 < x2 < x3 < x4 < x5, for a mixture of three distributions: F1 on (x1, x3), F2 on (x2, x3), and F3 on (x4, x5)]
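As a small illustration (ours, not from the slides), such a step density can be evaluated directly from an interval/weight description of the mixture:

    def step_density(mix, x):
        # mix is a list of (a_i, b_i, w_i) triples; the mixture density at x is
        # the sum of w_i / (b_i - a_i) over the intervals that contain x.
        return sum(w / (b - a) for (a, b, w) in mix if a <= x < b)

    # Loosely matching the figure (the weights are assumed, not given there):
    # mix = [(x1, x3, 1/3), (x2, x3, 1/3), (x4, x5, 1/3)]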
17
Algorithmic Results
  • Pick any integer P > 0 and any 1 > ε > 0. We give a 2P-pass randomized algorithm to learn a mixture of k uniform distributions.
  • Error at most ε^P, with memory O(k^3/ε^2 + Pk/ε).
  • The error drops exponentially in the number of passes, while the memory used grows very slowly.
  • Alternatively: error at most ε, with memory O(k^3/ε^(2/P)).
  • With 2 passes you need (1/ε)^2 memory, with 4 passes (1/ε), with 8 passes (1/ε)^(1/2).
  • Memory drops very sharply as P increases, while the error remains constant.
  • Success probability 1 - δ.

18
Step distribution: First attempt
  • Let X be the data stream of samples from F.
  • Break the domain into bins and count the number of points that fall in each bin. Estimate F on each bin (see the sketch after this list).
  • To get accuracy ε^P, you need to store at least ≈ 1/ε^P counters.
  • Too much! We can do much better using a few more passes.
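A minimal sketch of this naive one-pass binning estimator (Python; equal-width bins and the function name are our choices):

    def histogram_estimate(stream, lo, hi, bins):
        # One pass: count points per bin, then estimate F as constant on each
        # bin. Accuracy eps^P would force roughly 1/eps^P bins (counters).
        counts = [0] * bins
        n = 0
        width = (hi - lo) / bins
        for x in stream:
            n += 1
            j = int((x - lo) / width)
            if 0 <= j < bins:
                counts[j] += 1
        return [c / (n * width) if n else 0.0 for c in counts]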

19
General Framework
  • In one pass, take a small uniform subsample of size s from the data stream.
  • Compute a coarse solution on the subsample.
  • In a second pass, sharpen the coarse solution wherever it is reasonable.
  • Also in the second pass, find the places where the coarse solution is very far off.
  • Recurse on these spots.
  • Examine these trouble spots more carefully: zoom in!

20
Our algorithm
  • In one pass, draw a sample of size s = Θ(k^2/ε^2) from X.
  • Based on the sample, partition the domain into intervals I such that ∫_I F = Θ(ε/2k).
  • The number of intervals I will be O(k/ε).
  • In one pass, for each I, determine whether F is very close to constant on I (call the subroutine Constant). Also count the number of points of X that lie in I (see the sketch after this list).
  • If F is constant on I, then |X ∩ I| / (|X| · length(I)) is very close to F on I.
  • If F is not constant on I, recurse on I (zoom in on the trouble spot).
  • Requires |X| = Ω(k^6/ε^(6P)). Space usage: O(k^3/ε^2 + Pk/ε).
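A hedged Python sketch of one refinement round of this scheme. The is_constant argument stands in for the slides' Constant subroutine, whose actual test is not reproduced here; interval handling is simplified (half-open intervals, no recursion bookkeeping):

    def split_into_intervals(sample, k, eps):
        # Pass 1: from a sample of size ~ k^2/eps^2, cut the domain into
        # O(k/eps) intervals that each hold roughly an eps/(2k) fraction of it.
        sample = sorted(sample)
        step = max(1, int(len(sample) * eps / (2 * k)))
        cuts = sample[::step]
        if cuts[-1] != sample[-1]:
            cuts.append(sample[-1])
        return list(zip(cuts[:-1], cuts[1:]))

    def refine(stream, intervals, is_constant):
        # Pass 2: count stream points per interval. Where F looks constant,
        # output the empirical density; otherwise mark a trouble spot to
        # recurse on (at a correspondingly higher sampling rate).
        counts = [0] * len(intervals)
        total = 0
        for x in stream:
            total += 1
            for j, (a, b) in enumerate(intervals):
                if a <= x < b:
                    counts[j] += 1
                    break
        estimates, trouble = {}, []
        for j, (a, b) in enumerate(intervals):
            if is_constant(a, b):
                density = counts[j] / (total * (b - a)) if total and b > a else 0.0
                estimates[(a, b)] = density
            else:
                trouble.append((a, b))
        return estimates, trouble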

21
The Action
[Diagram: data stream → sample → mark intervals with a jump → sub-data stream; repeat this P - 2 more times]
22
Why it works: bounding the error
  • It is easy to learn F where it is constant; our estimate is very accurate on these intervals.
  • The weight of the bins decreases exponentially at each iteration.
  • At the Pth iteration, bins have weight at most ε^P/(4k).
  • Thus we can estimate F as 0 on the 2k bins where there is a jump, and incur a total error of at most 2k · ε^P/(4k) = ε^P/2.

23
Generalizations
  • Our algorithm generalizes to learn the following types of distributions (with roughly the same trade-off of space and passes):
  • Mixtures of uniform distributions over axis-aligned rectangles in R^2.
  • Fi is uniform over a rectangle (ai, bi) × (ci, di) ⊆ R^2.
  • Mixtures of linear distributions in R.
  • The density of Fi is linear over some interval (ai, bi) ⊆ R.
  • Heuristic: for a mixture of Gaussians, just treat it as a mixture of m linear distributions for some m >> k (the theoretical error bound is large unless m is really big).

24
Lower bounds
  • We define a Generalized Learning Problem (GLP), a slight generalization of the problem of learning mixtures of distributions.
  • Thm: Any P-pass randomized algorithm that solves the GLP must use at least Ω(1/ε^(1/(2P-1))) bits of memory.
  • The proof uses r-round communication complexity.
  • Thm: There exists a P-pass algorithm that solves the GLP using at most O(1/ε^(4/P)) bits of memory.
  • A slight generalization of the algorithm given above.
  • Conclusion: the trade-off is nearly tight for the GLP.

25
General framework for clustering: adaptive sampling in 2P passes
  • Pass 1: draw a small sample S from the data.
  • Compute a solution C on the subproblem S. (If S is small, this can be done in memory.)
  • Pass 2: determine the points R that are not clustered well by C, and recurse on R.
  • If S is representative, then there won't be many points in R.
  • If R is small, then we will sample it at a higher rate in subsequent iterations and get a better solution for these outliers.

26
Facility Location
Input: a set X of points, with distances given by a function d and a facility cost fi for each point.
Problem: Find a set F of facilities that minimizes Σ_{i ∈ F} fi + Σ_{i ∈ X} d(i, F), where d(i, F) = min_{j ∈ F} d(i, j).
[Example: four points with facility costs f1 = 2, f2 = 40, f3 = 8, f4 = 25 and distances d(1,2) = 5, d(1,3) = 2, d(2,3) = 6, d(1,4) = 5, d(3,4) = 3. The cost of the solution that builds facilities at 1 and 3 is 2 + 8 + 5 + 3 = 18. A small code sketch follows.]
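A small Python sketch of this objective, reproducing the slide's example (the function name is ours):

    def fl_cost(points, f, d, F):
        # Facility location objective: opening costs plus, for every point,
        # the distance to its nearest open facility (0 if the point is in F).
        open_cost = sum(f[i] for i in F)
        service_cost = sum(min(0 if i == j else d[min(i, j), max(i, j)] for j in F)
                           for i in points)
        return open_cost + service_cost

    points = [1, 2, 3, 4]
    f = {1: 2, 2: 40, 3: 8, 4: 25}
    d = {(1, 2): 5, (1, 3): 2, (2, 3): 6, (1, 4): 5, (3, 4): 3}
    print(fl_cost(points, f, d, {1, 3}))   # 2 + 8 + 5 + 3 = 18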
27
Quick facts about facility location
  • Natural operations research problem.
  • NP-Hard to solve optimally.
  • Many approximation algorithms (though not in the streaming or massive data set setting).
  • The best known approximation ratio is 1.52.
  • Achieving a factor better than 1.463 is hard.

28
Other NP-Hard clustering problems in
Combinatorial Optimization
  • k-center: Find a set of centers C, |C| ≤ k, that minimizes max_{i ∈ X} d(i, C).
  • One-pass approximation algorithm that requires O(k) space. [CCFM 97]
  • k-median: Find a set of centers C, |C| ≤ k, that minimizes Σ_{i ∈ X} d(i, C).
  • One-pass approximation algorithm that requires O(k log^2 n) space. [COP 03]
  • No pass-efficient algorithm for general FL!
  • One pass for very restricted inputs. [Indyk 04]

29
Memory issues for facility location
  • Note that some instances of FL require Ω(n) facilities to achieve any reasonable approximation.
  • Thm: Any P-pass randomized algorithm for approximately computing the optimum FL cost requires Ω(n/P) bits of memory.
  • How can we give reasonable guarantees that the space usage is small?
  • Parameterize the memory bound by the number of facilities opened.

30
Algorithmic Results
  • A 3P-pass algorithm for solving FL, using at most O(k · n^(2/P)) bits of extra memory, where k is the number of facilities output by the algorithm.
  • The approximation ratio is O(P): if OPT is the cost of the optimum solution, then the algorithm outputs a solution of cost at most γ · 36 · P · OPT.
  • Requires a black-box approximation algorithm for FL with approximation ratio γ.
  • Best so far: γ = 1.52.
  • Surprising fact: the same trade-off, for a very different problem!

31
Simplifying the Problem
  • A large number of facilities complicates matters. The algorithmic cure involves technical details that obscure the main point we would like to make.
  • An easier problem that demonstrates our framework in action is k-Facility Location.
  • k-FL: Find a set of at most k facilities F (|F| ≤ k) that minimizes Σ_{i ∈ F} fi + Σ_{i ∈ X} d(i, F).
  • We present a 3P-pass algorithm that uses O(k · n^(1/P)) bits of memory, with approximation ratio O(P).

32
Idea of our algorithm
  • Our algorithm makes P triples of passes, 3P passes in total.
  • In the first two passes of each triple, we take a uniform sample of X and compute a FL solution on the sample.
  • Intuitively, this will be a good solution for most of the input.
  • In the third pass, we identify the "bad" points that are far away from our facilities.
  • In subsequent passes, we recursively compute a solution for the bad points by restricting our algorithm to those points.

33
Algorithm, Part 1: Clustering a Sample
  • In one pass, draw a sample S1 of s = O(k^((P-1)/P) · n^(1/P)) nodes from the data stream.
  • In a second pass, compute an O(1)-approximation to FL on S1, but with facilities drawn from all of X and with distances within S1 scaled up by n/s. Call the resulting set of facilities F1.

34
How good is the solution F1?
  • We want to show that the cost of F1 on all of X is small.
  • It is easy to show that Σ_{i ∈ F1} fi ≤ O(1) · OPT.
  • Best scenario: facilities that service the sample S1 well also service all points of X well, i.e. Σ_{j ∈ F1} fj + Σ_{j ∈ X} d(j, F1) ≤ O(1) · OPT.
  • Unfortunately, this is not true in general.

35
Bounding the cost, similar to [Indyk 99]
  • Fix an optimum solution on X: a set of facilities F*, |F*| ≤ k, that partitions X into k clusters C1, ..., Ck.
  • If a cluster Ci is large (i.e., |Ci| > n/s), then a sample S1 of X of size s will contain many points of Ci.
  • We can prove that a good clustering of S1 is good for the points in the large clusters Ci of X.
  • It will cost at most O(1) · OPT to service the large clusters.
  • We can prove that the number of bad points is at most O(k^(1/P) · n^(1-1/P)).

36
Algorithm, part 2: Recursing on the outliers
  • In a third pass, identify the O(k^(1/P) · n^(1-1/P)) points that are farthest away from the facilities in F1. This can be done in O(n^(1/P)) space using random sampling.
  • Assign all other points to facilities in F1.
  • The cost of this is at most O(1) · OPT.
  • Recurse on the outliers.
  • In subsequent iterations we draw a sample of size s from the outliers; thus these points are sampled at a much higher rate.

37
The General Recursive Step
  • For the mth iteration: let Rm be the points remaining to be clustered.
  • Take a sample S of size s from Rm, and compute a FL solution Fm (see the sketch after this list).
  • Identify the O(k^((m+1)/P) · n^(1-(m+1)/P)) points that are farthest away from Fm; call this set Rm+1.
  • We can prove that Fm services Rm \ Rm+1 with cost O(1) · OPT.
  • Recurse on the points in Rm+1.
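An in-memory Python sketch of this recursion (the real algorithm performs each round with a triple of passes over the stream). The black-box solver fl_approx and its scale keyword are hypothetical stand-ins for the O(1)-approximation used in the first two passes of each triple:

    import random

    def k_fl_sketch(X, f, d, k, P, fl_approx):
        # X: list of points; f: facility costs; d(i, j): distance function.
        n = len(X)
        s = int(k ** ((P - 1) / P) * n ** (1.0 / P)) + 1   # sample size per round
        R = list(X)                                        # points left to cluster
        facilities = set()
        for m in range(1, P + 1):
            S = random.sample(R, min(s, len(R)))
            # Solve FL on the sample, facilities allowed from all of X,
            # sample distances scaled up by |R| / |S|.
            F_m = fl_approx(S, X, f, d, scale=len(R) / max(len(S), 1))
            facilities |= set(F_m)
            if m == P or not F_m:
                break
            # Keep the points farthest from F_m as the outlier set R_{m+1}.
            num_out = int(k ** ((m + 1.0) / P) * n ** (1 - (m + 1.0) / P)) + 1
            R.sort(key=lambda i: min(d(i, j) for j in F_m), reverse=True)
            R = R[:min(num_out, len(R))]
        return facilities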

38
Putting it all together
  • Let the final set of facilities be F = ∪_m Fm.
  • After P iterations, every point has been assigned to a facility.
  • The total cost of the solution is O(P) · OPT.
  • Note that possibly |F| > k. We can boil F down to k facilities that still give O(P) · OPT.
  • Hint: use a known k-median algorithm.
  • The algorithm generalizes to k-median, but requires more passes and space than [COP 03].

39
References
  • J. Aspnes, K. Chang, A. Yampolskiy. Inoculation
    strategies for victims of viruses and
    sum-of-squares partition problem. To appear in
    Journal of Computer and System Sciences.
    Preliminary version appeared in SODA 2005.
  • K. Chang, R. Kannan. Pass-Efficient Algorithms
    for Clustering. SODA 2006.
  • K. Chang. Pass-Efficient Algorithms for Facility
    Location. Yale TR 1337.
  • S. Arora, K. Chang. Approximation Schemes for
    low degree MST and Red-blue Separation problem.
    Algorithmica 40(3):189-210, 2004. Preliminary
    version appeared in ICALP 2003. (Not in thesis)

40
Future Directions
  • Consider problems from data mining, abstract (and
    possibly simplify) them, and see what algorithmic
    insights TCS can provide.
  • Possible extensions of this thesis work
  • Design a pass-efficient algorithm for learning
    mixtures of Gaussian distributions in high
    dimension. Give rigorous guarantees about
    accuracy.
  • Design pass-efficient algorithms for other combinatorial optimization problems. One possibility: find a set of centers C, |C| ≤ k, to minimize Σ_i d(i, C)^2. (A generalization of the k-means objective; the Euclidean case has already been solved.)
  • Many problems out there. Much more to be done!

41
Thanks for listening!