1
Algorithmic Techniques for Clustering in the
Streaming Data Model
Moses Charikar
  • Princeton University

2
Clustering
  • Given a very large collection of objects
  • Objects could be web pages, news stories, images,
    customer profiles, etc.
  • Objective: cluster the objects
  • Disjoint partition into clusters
  • Similar/related objects in the same cluster
  • Dissimilar objects in different clusters

3
Formalizing similarity
  • Intuitive notion of objects being similar or
    related
  • Formalized by similarity/distance function
    between objects
  • Similarity is transitive (sort of)
  • Distance function satisfies triangle inequality

4
What is a good clustering?
  • Intuitively, clusters should be tightly knit
  • Optimization viewpoint of clustering
  • Quality of clustering measured by an objective
    function
  • Lower objective function ⇒ higher quality
  • Goal: find a clustering that minimizes the
    objective function

5
Clustering objective functions
  • Typically, associate each cluster with cluster
    center (representative)
  • Goal: partition into k clusters
  • Equivalently, find k centers and assign points to
    centers
  • Clustering is good if points are close to cluster
    centers
  • Common clustering objectives measure distances of
    points to cluster centers

6
Clustering objective functions
  • Maximum cluster radius (k-center)
  • Sum of distances of points to cluster centers
    (k-median)
  • Sum of cluster radii (k-sumradii)
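In symbols (a formalization the slides leave implicit), with point
set P, metric d, and center set S with |S| = k:

    k-center:   min_S max_{p ∈ P} min_{c ∈ S} d(p, c)
    k-median:   min_S Σ_{p ∈ P} min_{c ∈ S} d(p, c)
    k-sumradii: min_S Σ_{c ∈ S} r_c, where r_c is the radius of c's cluster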

7
Complexity of Optimization
  • Resulting optimization problems are NP-hard to
    solve exactly
  • Unreasonable to expect that we can find optimum
    solution
  • Settle for heuristic algorithms
  • Heuristics with provable guarantees

8
Approximation algorithms
  • Algorithm with the guarantee that:
  • Given an instance with optimum solution of value
    OPT
  • The algorithm produces a solution of value at most
    α·OPT
  • We say that the algorithm has approximation ratio α
  • Goal: design approximation algorithms with small
    approximation ratio

9
Offline vs. Streaming
  • Offline model
  • Find good clustering solution in polynomial time
  • Arbitrary access to data
  • Streaming model
  • Produce an implicit description of clusters (i.e.
    cluster centers + additional info) in one pass,
    using a small amount of space

10
Input representation
  • Measure space requirement in terms of number of
    objects stored
  • What if the objects themselves are large?
  • Schemes to represent objects compactly
  • Similarity/distance of objects can be estimated
    from their compact representations

11
Talk outline
  • Streaming algorithms for clustering
  • K-center
  • K-median
  • Clustering formulations with outliers
  • Compact representations for estimating
    similarity/distance

12
K-center
  • Given collection of points
  • Pick k cluster centers
  • Assign each point to closest center
  • Minimize maximum point-center distance
  • Offline: 2-approximation [Hochbaum, Shmoys;
    Dyer, Frieze; Gonzalez]

13
Offline algorithm
  • Suppose the optimal radius is OPT
  • Process points sequentially
  • Maintain a set of centers S (initially S = {first
    point})
  • Consider the next point p
  • If p is within distance 2·OPT of some center in S,
    add it to the corresponding cluster
  • Else, add p as a new center in S
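A minimal Python sketch of this offline step, assuming the guess
`opt` for the optimal radius and a metric `dist` are given (names are
illustrative, not from the slides):

    def greedy_cover(points, opt, dist):
        centers = []
        for p in points:
            # Assign p to an existing center if one is within 2*opt;
            # otherwise p opens a new center.
            if all(dist(p, c) > 2 * opt for c in centers):
                centers.append(p)
        return centers

If opt ≥ the optimal radius, the analysis on the next slide shows
len(centers) ≤ k and every point lies within 2·opt of some center.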

14
Analysis
  • Assuming we know OPT
  • Guarantee on solution cost
  • Radius of each cluster is at most 2·OPT
  • Guarantee on number of centers
  • Distance between points in S is > 2·OPT
  • Every point in S must be in a distinct cluster in
    the optimal solution
  • S can have at most k points

15
Streaming algorithm
  • Start with a very low guess for OPT
  • Run the offline algorithm
  • If we get > k centers, the guess was too low
  • Increase the guess, merge clusters
  • Algorithm runs in phases
  • r_i = guess used in phase i
  • r_{i+1} = 2 r_i

16
Phase transitions
  • End of phase i
  • k+1 points with pairwise distance > 2 r_i
  • Each cluster has radius < 4 r_i
  • Beginning of phase i+1
  • r_{i+1} = 2 r_i
  • Pick an arbitrary center c, merge clusters whose
    centers lie within 2 r_{i+1} of c (repeat)
  • New point p
  • Add to a cluster if within 2 r_{i+1} of its center
  • Else, add p to the set of centers (create new cluster)
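A sketch of the resulting streaming algorithm (helper names and the
sequential merge loop are mine; the slides describe the same doubling
rule):

    def stream_k_center(stream, k, dist, r0):
        r = r0                       # current guess r_i for OPT
        centers = []
        for p in stream:
            if all(dist(p, c) > 2 * r for c in centers):
                centers.append(p)    # p starts a new cluster
            while len(centers) > k:  # guess too low: new phase
                r *= 2               # r_{i+1} = 2 r_i
                merged = []
                for c in centers:    # keep a center only if it is far
                    if all(dist(c, m) > 2 * r for m in merged):
                        merged.append(c)   # from every kept center
                centers = merged
        return centers, r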

17
[Figure: a merged cluster; its center lies within 2 r_{i+1} of c, and
each old cluster has radius at most 4 r_i]
Radius of new clusters ≤ 2 r_{i+1} + 4 r_i = 4 r_{i+1}
18
Approximation guarantee
  • Clusters in phase i+1 have radius < 4 r_{i+1}
  • OPT > r_i
  • Approximation ratio: 4 r_{i+1} / r_i = 8
  • Note: storage required is k points
  • Ratio can be improved
  • More sophisticated algorithm
  • Randomization
  • [C., Chekuri, Feder, Motwani]

19
k-median
  • Given collection of points
  • Pick k cluster centers
  • Assign each point to closest center
  • Minimize sum of point-center distances
  • Offline: (3 + ε)-approximation [Arya et al.]
  • LP rounding, primal-dual, local search

20
Previous streaming algorithm
  • [Guha, Mishra, Motwani, O'Callaghan]
  • Storage n^ε, approximation ratio 2^O(1/ε)
  • Apply an offline algorithm to cluster blocks of n^ε
    points
  • Clustering proceeds in levels
  • Centers for level i form the input for level i+1

21
[Figure: levels of clustering; each level's k centers form the input
to the next level]
22
New approach
  • [C., O'Callaghan, Panigrahy]
  • Idea: mimic the k-center approach
  • Suppose we knew OPT
  • Can we maintain a solution with k centers and cost
    O(OPT) in streaming fashion?

23
Facility location
  • Given a collection of points, facility cost f
  • Find a subset S of centers
  • Assign each point to its closest center
  • Cost = (sum of point-center distances) + f · |S|
  • Contrast with k-median
  • (sort of) a Lagrangian relaxation
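In formula form (my notation; the slide leaves the objective implicit):

    cost(S) = Σ_{p ∈ P} min_{c ∈ S} d(p, c) + f · |S|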

24
Using facility location for k-median
  • Given a k-median instance with optimal value OPT
  • Produce a facility location instance by setting
    facility cost f = OPT/k
  • Optimal facility location cost ≤ 2·OPT
  • Given a β-approximation algorithm for facility
    location
  • Facility location solution of cost ≤ 2β·OPT
  • Interpret it as a k-median solution
  • Cost ≤ 2β·OPT, centers ≤ 2β·k
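Spelling out the arithmetic (β stands for the facility location
approximation ratio; the slide's symbol was lost in extraction):

    optimal FL cost ≤ OPT + k·f = OPT + k·(OPT/k) = 2·OPT
    cost(S) ≤ β · (optimal FL cost) ≤ 2β·OPT
    f·|S| ≤ cost(S)  ⇒  |S| ≤ 2β·OPT / (OPT/k) = 2β·k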

25
Online algorithm for facility location
  • [Meyerson]
  • f = facility cost
  • For each point p
  • δ = distance of p to the closest existing center
  • Open a center at p with probability min(δ/f, 1);
    otherwise assign p to its closest center
  • Theorem: expected cost of the solution is O(log n) · OPT
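A direct transcription in Python (dist and the point representation
are assumptions of this sketch):

    import random

    def online_facility_location(stream, f, dist):
        centers, cost = [], 0.0
        for p in stream:
            d = min((dist(p, c) for c in centers), default=float("inf"))
            if random.random() < min(d / f, 1.0):
                centers.append(p)   # open a facility at p, paying f
                cost += f
            else:
                cost += d           # assign p to its nearest open facility
        return centers, cost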

26
Using the online algorithm
  • Suppose we have a lower bound L on OPT
  • We set f = L / (k (1 + log n))
  • Run the online facility location algorithm
    (Online-Fac-Locn)
  • Lemma:
  • Expected number of centers produced
    ≤ k (1 + log n)(1 + 4·OPT/L)
  • Expected cost ≤ L + 4·OPT
  • Procedure to check if OPT is much larger than L

27
Updating the lower bound
  • With probability at least ½, Online-Fac-Locn
    produces a solution with
  • Cost ≤ 4 (L + 4·OPT)
  • Centers ≤ 4 k (1 + log n)(1 + 4·OPT/L)
  • Run O(log n) invocations of this in parallel
  • An invocation fails if its cost exceeds the bound, or
    its number of centers exceeds the bound O(k log n)
  • If all invocations fail, update the lower bound L

28
Changing phases
  • Increase the lower bound L by a constant factor
  • Pick the solution produced by the invocation that
    finished last
  • Feed the (weighted) centers as input to the next phase
  • Finally: O(k log n) centers with cost O(OPT)
  • Run an offline algorithm on the weighted centers to
    get k centers with cost O(OPT)
  • Note: storage is O(k log² n) points
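A sketch of one phase, reusing online_facility_location from the
earlier sketch; it runs the invocations sequentially rather than in
parallel, and the failure thresholds plug OPT ≈ L into the slide's
bounds (that substitution is my assumption, not the paper's code):

    import math

    def one_phase(points, k, L, dist):
        n = len(points)
        f = L / (k * (1 + math.log(n)))
        cost_bound = 4 * (L + 4 * L)                        # 4(L + 4·OPT), OPT ≈ L
        center_bound = 4 * k * (1 + math.log(n)) * (1 + 4)  # 4k(1+log n)(1+4·OPT/L)
        for _ in range(int(math.log(n)) + 1):               # O(log n) invocations
            centers, cost = online_facility_location(points, f, dist)
            if cost <= cost_bound and len(centers) <= center_bound:
                return centers
        return None   # all invocations failed: OPT >> L, raise L and retry

A driver starts from a trivial lower bound L, multiplies it by a
constant whenever one_phase returns None, and feeds each phase's
weighted centers into the next phase.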

29
Clustering with outliers
  • Can exclude an ε fraction of the points
  • Find a solution optimizing the clustering objective
    on the remaining (1 − ε) fraction of the point set
  • Offline: [C., Khuller, Mount, Narasimhan]
  • Streaming: [C., O'Callaghan, Panigrahy]

30
Outliers analysis ideas
  • Algorithm: sample the data set and apply an offline
    clustering algorithm to the sample
  • Analysis: show that the sample is representative of
    the data set, i.e.
  • If a particular solution excludes an ε fraction of
    points in the sample
  • Then the solution, scaled up to the entire data set,
    does not exclude much more than an ε fraction of points

31
Compact sketches for estimating similarity
  • Collection of objects, e.g. mathematical
    representation of documents, images.
  • Implicit similarity/distance function.
  • Want to estimate similarity without looking at
    entire objects.
  • Compute compact sketches of objects so that
    similarity/distance can be estimated from them.

32
Some known techniques
  • Dimension reduction techniques, random
    projections [Johnson, Lindenstrauss]
  • Preserves l2 distances between vectors
  • Stable distributions [Indyk]
  • Preserves l1 distances between vectors

33
Similarity Preserving Hash Functions
  • Similarity function sim(x,y) ∈ [0,1] (distance
    function 1 − sim(x,y))
  • Family of hash functions F with a probability
    distribution such that
    Pr_{h ∈ F} [ h(x) = h(y) ] = sim(x,y)
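In use, one draws t independent functions from F; the fraction of
agreeing hash values is then an unbiased estimator of sim(x,y). A
minimal Python sketch (names are mine):

    def estimate_similarity(sketch_x, sketch_y):
        # Each coordinate pair agrees with probability exactly sim(x,y),
        # so the agreement rate estimates the similarity.
        return sum(a == b for a, b in zip(sketch_x, sketch_y)) / len(sketch_x)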

34
Applications
  • Compact representation scheme for estimating
    similarity
  • Approximate nearest neighbor search
    [Indyk, Motwani] [Kushilevitz, Ostrovsky, Rabani]

35
Document similarity
  • Eliminating near duplicate documents from a
    search engine index
  • Can measure similarity of documents by looking at
    document text
  • Too much data!
  • > 2 billion documents, say avg. size 4K
  • Replace documents by short sketches (say 8
    bytes), so that similarity can be estimated

36
Estimating Set Similarity
  • [Broder, Manasse, Glassman, Zweig]
  • [Broder, C., Frieze, Mitzenmacher]
  • Collection of subsets; set similarity is
    sim(A,B) = |A ∩ B| / |A ∪ B|  (Jaccard)

37
Minwise Independent Permutations
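The slide body is a figure; the standard construction it illustrates
hashes a set A to min π(A) for a random permutation π of the
universe, so that Pr[min π(A) = min π(B)] = sim(A,B). A small Python
sketch (random.shuffle stands in for a min-wise independent
permutation):

    import random

    def minhash_sketcher(universe, num_hashes):
        perms = []
        for _ in range(num_hashes):
            order = list(universe)
            random.shuffle(order)        # random permutation of the universe
            perms.append({x: r for r, x in enumerate(order)})
        # Sketch of a set = its minimum rank under each permutation.
        return lambda s: tuple(min(perm[x] for x in s) for perm in perms)

The Jaccard similarity of two sets is then estimated as the fraction
of coordinates where their sketches agree.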
38
Existence and Construction of Similarity
Preserving Hash Functions
[C. '02]
  • When do such hash functions exist?
  • Necessary condition: the distance function
    1 − sim(x,y) must be embeddable in the Hamming cube
  • Insight: techniques used in approximation
    algorithms yield similarity preserving hash
    functions

39
Random Hyperplane Rounding based Hash Functions
  • Collection of vectors
  • Pick a random hyperplane through the origin (with
    normal vector r); hash a vector u to h_r(u) = sign(r · u)
  • [Goemans, Williamson]

40
Geometrically speaking
41
More Geometry
Sketches of vectors are bit sequences; the angle between two vectors
is estimated from the Hamming distance of their bit sequences, since
Pr[h_r(u) ≠ h_r(v)] = θ(u,v) / π.
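A minimal sketch of this scheme in Python (numpy-based; the helper
names and the choice of 256 bits are mine, not the slides'):

    import numpy as np

    # Random-hyperplane sketch: each row of R is the normal of a random
    # hyperplane through the origin; u is hashed to the sign pattern of
    # its dot products. For any u, v: Pr[bits differ] = angle(u,v) / pi.
    def sketch(u, R):
        return np.dot(R, u) >= 0            # one bit per hyperplane

    def estimate_angle(bits_u, bits_v):
        return np.pi * np.count_nonzero(bits_u != bits_v) / len(bits_u)

    # Usage: R = np.random.default_rng(0).standard_normal((256, dim))
    # theta = estimate_angle(sketch(u, R), sketch(v, R))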
42
Other results
  • Similarity preserving hash function for
    approximating the Earth Mover Metric, used in
    graphics and vision

43
Earth Mover Distance (EMD)
[Figure: point sets P and Q with the matching that realizes EMD(P,Q)]
44
Relaxation of requirements
  • Estimate a distance measure, not a similarity
    measure in [0,1]
  • Allow hash functions to map objects to points in a
    metric space, and measure E[d(h(P), h(Q))]
    (previously d(x,y) = 1 if x ≠ y)
  • Estimator will approximate EMD

E[ d(h(P), h(Q)) ] ≤ O(log n · log log n) · EMD(P,Q)