Title: Algorithmic Techniques for Clustering in the Streaming Data Model Moses Charikar
2. Clustering
- Given a very large collection of objects
- Objects could be web pages, news stories, images, customer profiles, etc.
- Objective: cluster the objects
- Disjoint partition into clusters
- Similar/related objects in the same cluster
- Dissimilar objects in different clusters
3. Formalizing similarity
- Intuitive notion of objects being similar or related
- Formalized by a similarity/distance function between objects
- Similarity is transitive (sort of)
- Distance function satisfies the triangle inequality
4. What is a good clustering?
- Intuitively, clusters should be tightly knit
- Optimization viewpoint of clustering
- Quality of a clustering is measured by an objective function
- Lower objective function → higher quality
- Goal: find a clustering that minimizes the objective function
5. Clustering objective functions
- Typically, associate each cluster with a cluster center (representative)
- Goal: partition into k clusters
- Equivalently, find k centers and assign points to centers
- A clustering is good if points are close to their cluster centers
- Common clustering objectives measure distances of points to cluster centers
6. Clustering objective functions
- Maximum cluster radius (k-center)
- Sum of distances of points to cluster centers (k-median)
- Sum of cluster radii (k-sumradii)
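The three objectives aggregate the same point-to-center distances in different ways. A minimal sketch (not from the talk) that computes each objective for a fixed set of centers, assuming Euclidean distance for illustration; all names here are illustrative:

import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_dists(points, centers):
    # Distance from each point to its closest center.
    return [min(dist(p, c) for c in centers) for p in points]

def k_center_cost(points, centers):
    return max(nearest_dists(points, centers))   # maximum cluster radius

def k_median_cost(points, centers):
    return sum(nearest_dists(points, centers))   # sum of point-center distances

def k_sumradii_cost(points, centers):
    # Radius of each center's cluster, then summed over clusters.
    radii = [0.0] * len(centers)
    for p in points:
        i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        radii[i] = max(radii[i], dist(p, centers[i]))
    return sum(radii)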
7. Complexity of Optimization
- The resulting optimization problems are NP-hard to solve exactly
- Unreasonable to expect that we can find the optimum solution
- Settle for heuristic algorithms
- Heuristics with provable guarantees
8. Approximation algorithms
- Algorithm with the guarantee that:
- Given an instance with optimum solution of value OPT
- The algorithm produces a solution of value at most α·OPT
- We say that the algorithm has approximation ratio α
- Goal: design approximation algorithms with small approximation ratio
9. Offline vs. Streaming
- Offline model
- Find a good clustering solution in polynomial time
- Arbitrary access to the data
- Streaming model
- Produce an implicit description of the clusters (i.e., cluster centers + additional info) in one pass, using a small amount of space
10. Input representation
- Measure the space requirement in terms of the number of objects stored
- What if the objects themselves are large?
- Schemes to represent objects compactly
- Similarity/distance of objects can be estimated from their compact representations
11. Talk outline
- Streaming algorithms for clustering
- k-center
- k-median
- Clustering formulations with outliers
- Compact representations for estimating similarity/distance
12. k-center
- Given a collection of points
- Pick k cluster centers
- Assign each point to its closest center
- Minimize the maximum point-center distance
- Offline: 2-approximation [Hochbaum, Shmoys; Dyer, Frieze; Gonzalez]
13. Offline algorithm
- Suppose the optimal radius is OPT
- Process points sequentially
- Maintain a set of centers S (initially S = {first point})
- Consider the next point p
- If p is within distance 2·OPT of some center in S, add it to the corresponding cluster
- Else, add p as a new center in S
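A minimal sketch of this procedure (assuming the optimal radius `opt` is known and `dist` is any metric; the names are illustrative):

def offline_k_center(points, opt, dist):
    centers = []
    clusters = {}                                    # center index -> its points
    for p in points:
        near = [i for i, c in enumerate(centers) if dist(p, c) <= 2 * opt]
        if near:
            clusters[near[0]].append(p)              # join an existing cluster
        else:
            centers.append(p)                        # p becomes a new center
            clusters[len(centers) - 1] = [p]
    return centers, clusters

If the returned list has more than k centers, the guess `opt` was too low (see the analysis on the next slide).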
14. Analysis
- Assuming we know OPT
- Guarantee on solution cost:
- The radius of each cluster is at most 2·OPT
- Guarantee on the number of centers:
- The distance between points in S is > 2·OPT
- Every point in S must lie in a distinct cluster of the optimal solution
- So S can have at most k points
15. Streaming algorithm
- Start with a very low guess for OPT
- Run the offline algorithm
- If we get > k centers, the guess was too low
- Increase the guess, merge clusters
- The algorithm runs in phases
- r_i = guess used in phase i
- r_{i+1} = 2·r_i
16. Phase transitions
- End of phase i:
- k+1 points with pairwise distance > 2·r_i
- Each cluster has radius < 4·r_i
- Beginning of phase i+1:
- r_{i+1} = 2·r_i
- Pick an arbitrary center c, merge clusters whose centers are within 2·r_{i+1} of c (repeat)
- New point p:
- Add to a cluster if within 2·r_{i+1} of its center
- Else, add p to the set of centers (create a new cluster)
17. [Figure: a cluster of radius 4·r_i merged into a center within distance 2·r_{i+1}]
Radius of new clusters ≤ 2·r_{i+1} + 4·r_i = 4·r_{i+1}
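A hedged sketch of the full streaming loop, assuming `r0` is a small initial guess (e.g., half the distance between the first two distinct points) and `dist` is a metric; only the centers are stored:

def stream_k_center(stream, k, dist, r0):
    r = r0
    centers = []
    for p in stream:
        if centers and min(dist(p, c) for c in centers) <= 2 * r:
            continue                           # p joins an existing cluster
        centers.append(p)                      # p opens a new cluster
        while len(centers) > k:                # phase transition: double the guess
            r *= 2
            merged = []
            for c in centers:                  # greedy merge: keep a center only if
                if all(dist(c, m) > 2 * r for m in merged):
                    merged.append(c)           # it is > 2·r from every kept center
            centers = merged
    return centers, r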
18. Approximation guarantee
- Clusters in phase i+1 have radius < 4·r_{i+1}
- OPT > r_i (phase i ended with k+1 points at pairwise distance > 2·r_i, so some optimal cluster must contain two of them)
- Approximation ratio: 4·r_{i+1}/r_i = 8
- Note: the storage required is O(k)
- The ratio can be improved:
- More sophisticated algorithm
- Randomization
- [Charikar, Chekuri, Feder, Motwani]
19. k-median
- Given a collection of points
- Pick k cluster centers
- Assign each point to its closest center
- Minimize the sum of point-center distances
- Offline: (3+ε)-approximation [Arya et al.]
- LP rounding, primal-dual, local search
20. Previous streaming algorithm
- [Guha, Mishra, Motwani, O'Callaghan]
- Storage n^ε, approximation ratio 2^O(1/ε)
- Apply an offline algorithm to cluster blocks of n^ε points
- Clustering proceeds in levels
- Centers from level i form the input for level i+1
21. [Figure: each level reduces blocks of points to k centers, which are fed to the next level]
22. New approach
- [Charikar, O'Callaghan, Panigrahy]
- Idea: mimic the k-center approach
- Suppose we knew OPT
- Can we maintain a solution with k centers and cost O(OPT) in streaming fashion?
23. Facility location
- Given a collection of points, and a facility cost f
- Find a subset S of centers
- Assign each point to its closest center
- Cost = sum of point-center distances + f·|S|
- Contrast with k-median
- (sort of) a Lagrangian relaxation
24. Using facility location for k-median
- Given a k-median instance with optimal value OPT
- Produce a facility location instance by setting the facility cost f = OPT/k
- Optimal facility location cost ≤ 2·OPT (the optimal k-median solution pays OPT in distances plus k·f = OPT in facility costs)
- Given an α-approximation algorithm for facility location:
- Facility location solution of cost ≤ 2α·OPT
- Interpret it as a k-median solution
- Cost ≤ 2α·OPT, centers ≤ 2α·k (since each open facility costs f = OPT/k)
25. Online algorithm for facility location
- [Meyerson]
- f = facility cost
- For each point p:
- δ = distance of p to the closest center
- Open a center at p with probability min(δ/f, 1)
- Theorem: the expected cost of the solution is O(log n)·OPT
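A minimal sketch of this online rule (assuming a metric `dist`; the first point always opens a facility, since its distance to the empty center set is infinite):

import random

def online_facility_location(stream, f, dist):
    centers, cost = [], 0.0
    for p in stream:
        delta = min((dist(p, c) for c in centers), default=float("inf"))
        if random.random() < min(delta / f, 1.0):
            centers.append(p)      # open a new facility at p
            cost += f
        else:
            cost += delta          # assign p to its closest open facility
    return centers, cost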
26. Using the online algorithm
- Suppose we have a lower bound L on OPT
- Set f = L / (k(1+log n))
- Run the online facility location algorithm (Online-Fac-Locn)
- Lemma:
- Expected number of centers produced ≤ k(1+log n)(1+4·OPT/L)
- Expected cost ≤ L + 4·OPT
- Procedure to check whether OPT is much larger than L
27. Updating the lower bound
- With probability at least ½, Online-Fac-Locn produces a solution with
- Cost ≤ 4(L + 4·OPT)
- Centers ≤ 4k(1+log n)(1+4·OPT/L)
- Run O(log n) invocations of this in parallel
- An invocation fails if its cost exceeds the bound, or its number of centers exceeds the bound O(k log n)
- If all invocations fail, update the lower bound L
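A hedged sketch of one phase of this check, run over the stream. The budget constants below are illustrative placeholders, not the exact ones from the analysis; in the actual algorithm the stream is never replayed, and when all copies fail the centers collected so far are carried into the next phase as weighted points:

import math, random

def run_phase(stream, k, L, n, dist, copies):
    f = L / (k * (1 + math.log2(n)))
    states = [{"centers": [], "cost": 0.0, "alive": True} for _ in range(copies)]
    cost_budget = 20 * L                           # illustrative constant
    center_budget = 20 * k * (1 + math.log2(n))    # ~O(k log n), illustrative
    for p in stream:
        for s in states:
            if not s["alive"]:
                continue
            delta = min((dist(p, c) for c in s["centers"]), default=float("inf"))
            if random.random() < min(delta / f, 1.0):
                s["centers"].append(p)
                s["cost"] += f
            else:
                s["cost"] += delta
            if s["cost"] > cost_budget or len(s["centers"]) > center_budget:
                s["alive"] = False                 # this invocation has failed
    return [s for s in states if s["alive"]]       # empty => L too low, raise it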
28. Changing phases
- Increase the lower bound to β·L (for some constant β > 1)
- Pick the solution produced by the invocation that finished last
- Feed the (weighted) centers as input to the next phase
- Finally: O(k log n) centers with cost O(OPT)
- Run an offline algorithm on the weighted centers to get k centers with cost O(OPT)
- Note: the storage is O(k log² n) points
29. Clustering with outliers
- Can exclude an ε fraction of the points
- Find a solution that optimizes the clustering objective on the remaining (1−ε) fraction of the point set
- Offline: [Charikar, Khuller, Mount, Narasimhan]
- Streaming: [Charikar, O'Callaghan, Panigrahy]
30. Outliers: analysis ideas
- Algorithm: sample the data set and apply an offline clustering algorithm to the sample
- Analysis: show that the sample is representative of the data set, i.e.
- If a particular solution excludes an ε fraction of the points in the sample,
- the solution scaled up to the entire data set does not exclude much more than an ε fraction of the points
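A minimal sketch of the sampling step, using reservoir sampling to keep a uniform sample in one pass; `cluster_with_outliers` is a hypothetical offline routine standing in for the offline algorithm cited above:

import random

def streaming_outlier_clustering(stream, k, eps, sample_size, cluster_with_outliers):
    sample = []
    for i, p in enumerate(stream):       # reservoir sampling (Algorithm R):
        if len(sample) < sample_size:    # keeps a uniform sample in small space
            sample.append(p)
        else:
            j = random.randint(0, i)
            if j < sample_size:
                sample[j] = p
    # Cluster the sample, allowing an eps fraction of it to be excluded;
    # the analysis argues the sample is representative of the full stream.
    return cluster_with_outliers(sample, k, eps)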
31. Compact sketches for estimating similarity
- Collection of objects, e.g., mathematical representations of documents, images
- Implicit similarity/distance function
- Want to estimate similarity without looking at the entire objects
- Compute compact sketches of the objects so that similarity/distance can be estimated from them
32. Some known techniques
- Dimension reduction techniques, random projections [Johnson, Lindenstrauss]
- Preserve l2 distances between vectors
- Stable distributions [Indyk]
- Preserve l1 distances between vectors
33. Similarity Preserving Hash Functions
- Similarity function sim(x,y) ∈ [0,1] (distance function: 1 − sim(x,y))
- Family of hash functions F with a probability distribution such that Pr_{h∈F}[h(x) = h(y)] = sim(x,y)
34. Applications
- Compact representation scheme for estimating similarity
- Approximate nearest neighbor search [Indyk, Motwani; Kushilevitz, Ostrovsky, Rabani]
35. Document similarity
- Eliminating near-duplicate documents from a search engine index
- Can measure the similarity of documents by looking at the document text
- Too much data!
- > 2 billion documents, say of average size 4K
- Replace documents by short sketches (say 8 bytes), so that similarity can be estimated from the sketches
36. Estimating Set Similarity
- [Broder, Manasse, Glassman, Zweig]
- [Broder, Charikar, Frieze, Mitzenmacher]
- Collection of subsets; set similarity sim(A,B) = |A∩B| / |A∪B|
37. Minwise Independent Permutations
Pick a random permutation π of the universe; then Pr[min π(A) = min π(B)] = |A∩B| / |A∪B|, so the minimum element under π is a similarity-preserving hash of a set.
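A minimal min-hash sketch (not from the slides), simulating random permutations with random linear hash functions modulo a prime, which is a standard approximation:

import random

P = 2**31 - 1                                 # a prime acting as the universe size

def minhash_sketch(s, seeds):
    # One sketch coordinate per (a, b) seed: the minimum of a random
    # linear hash over the set's elements.
    return [min((a * x + b) % P for x in s) for a, b in seeds]

def estimate_jaccard(sk1, sk2):
    # The fraction of agreeing coordinates estimates |A∩B| / |A∪B|.
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

# Usage:
# seeds = [(random.randrange(1, P), random.randrange(P)) for _ in range(100)]
# sk_a, sk_b = minhash_sketch(A, seeds), minhash_sketch(B, seeds)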
38. Existence and Construction of Similarity Preserving Hash Functions [Charikar '02]
- When do such hash functions exist?
- Necessary condition: the distance function 1 − sim(x,y) must be embeddable in the Hamming cube
- Insight: techniques used in approximation algorithms yield similarity-preserving hash functions
39. Random Hyperplane Rounding Based Hash Functions
- Collection of vectors
- Pick a random hyperplane through the origin (with normal vector r); hash each vector u to the side of the hyperplane it falls on, i.e., h_r(u) = sign(r·u)
- [Goemans, Williamson]
40. Geometrically speaking
[Figure: two vectors and a random hyperplane; the hyperplane separates them with probability proportional to the angle between them]
41. More geometry
Sketches of vectors are bit sequences; the angle between two vectors is estimated from the Hamming distance of their bit sequences.
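A minimal sketch of this construction (not from the slides); each bit is the sign of a dot product with a random Gaussian normal vector:

import math, random

def simhash(v, normals):
    # One bit per random hyperplane: which side of the hyperplane v lies on.
    return [1 if sum(r_i * v_i for r_i, v_i in zip(r, v)) >= 0 else 0
            for r in normals]

def estimate_angle(bits1, bits2):
    # Pr[bits differ] = angle / pi, so the Hamming distance estimates the angle.
    frac = sum(b1 != b2 for b1, b2 in zip(bits1, bits2)) / len(bits1)
    return frac * math.pi

# Usage (d-dimensional vectors, 64-bit sketches):
# normals = [[random.gauss(0, 1) for _ in range(d)] for _ in range(64)]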
42. Other results
- Similarity-preserving hash function for approximating the Earth Mover Metric, used in graphics and vision
43. Earth Mover Distance (EMD)
[Figure: point sets P and Q, with EMD(P,Q) given by the minimum-cost matching between them]
44. Relaxation of requirements
- Estimate a distance measure, not a similarity measure in [0,1]
- Allow hash functions to map objects to points in a metric space, and measure E[d(h(P), h(Q))] (previously d(x,y) = 1 if x ≠ y)
- The estimator approximates EMD:
E[d(h(P), h(Q))] ≤ O(log n · log log n) · EMD(P,Q)