Title: Clustering Data Streams
1. Clustering Data Streams
- A presentation by George Toderici
2. Talk Outline
- Goals of the paper
- Notation reminder
- Clustering With Little Memory
- Data Stream Model
- Clustering with Data Streams
- Lower Bounds and Deterministic Algorithms
- Conclusion
3. Goals of the paper
- Since the k-median problem is NP-hard, this paper attempts to create an approximation algorithm under the following constraints:
  - Minimize memory usage
  - Minimize CPU usage
  - Work both on general metric spaces and on the special case of Euclidean space
5. Notation Reminder
- O(g(n)): running time is upper bounded by g(n)
- Ω(g(n)): running time is lower bounded by g(n)
- o(g(n)): running time is asymptotically negligible relative to g(n)
- Õ(g(n)) (the not commonly used "Soft-Oh"): memory usage is upper bounded by g(n), ignoring polylogarithmic factors
6. Paper-specific Notation
- c_ij is the distance between points i and j
- d_i is the number of points associated with median i
- NOTE: Do not confuse c and d. Presumably the distance is denoted c_ij because a distance can be treated as a cost; it would have been more intuitive to call it d, from the word "distance".
8. Clustering with little memory
- Algorithm SmallSpace(S):
  1. Divide S into l disjoint pieces X_1, ..., X_l
  2. For each X_i, find O(k) centers in it, and assign each point to its closest center
  3. Let X' be the set of O(lk) centers obtained, where each center is weighted by the number of points assigned to it
  4. Cluster X' to find k centers
- (A Python sketch of this procedure follows below)
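To make the steps concrete, here is a minimal Python sketch of SmallSpace. Everything in it is our illustration: the names dist, cost, and small_space are hypothetical, and k_median is a toy random-swap local search standing in for the constant-factor k-median subroutine the paper assumes, not the paper's actual algorithm.

```python
import random

def dist(a, b):
    # Euclidean distance; the paper's guarantees only require a metric.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cost(points, weights, medians):
    # Weighted k-median cost: each point pays its distance to the closest median.
    return sum(w * min(dist(p, m) for m in medians)
               for p, w in zip(points, weights))

def k_median(points, weights, k, iters=200):
    # Toy random-swap local search; a stand-in for the constant-factor
    # approximation subroutine the paper assumes, NOT the paper's algorithm.
    medians = random.sample(points, k)  # assumes len(points) >= k
    best = cost(points, weights, medians)
    for _ in range(iters):
        cand = medians[:]
        cand[random.randrange(k)] = random.choice(points)
        c = cost(points, weights, cand)
        if c < best:
            medians, best = cand, c
    return medians

def small_space(S, k, l):
    # Step 1: divide S into l disjoint pieces (each must hold >= k points).
    pieces = [S[i::l] for i in range(l)]
    centers, center_weights = [], []
    for piece in pieces:
        # Step 2: find O(k) centers per piece (here exactly k).
        meds = k_median(piece, [1] * len(piece), k)
        # Step 3: weight each center by the number of points assigned to it.
        counts = [0] * len(meds)
        for p in piece:
            counts[min(range(len(meds)), key=lambda j: dist(p, meds[j]))] += 1
        centers.extend(meds)
        center_weights.extend(counts)
    # Step 4: cluster the weighted set X' down to k medians.
    return k_median(centers, center_weights, k)
```

For example, small_space(points, k=5, l=10) reduces ten pieces to 50 weighted centers, then clusters those down to 5 medians.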
9. SmallSpace (2)
[Figure: diagram of SmallSpace reducing each piece to k weighted centers, which are then clustered down to k medians]
10. SmallSpace analysis
- Since we are interested in using as little memory as possible, l must be chosen so that both each partition of S and the weighted set X' fit in main memory. However, no such l may exist if S is very large.
- We will use this algorithm as a starting point and improve it until it satisfies all the requirements.
11. Theorem 1
- Given an instance of the k-median problem with a solution of cost C in which the medians need not belong to the set of input points, there exists a solution of cost at most 2C in which all the medians do belong to the set of input points (only the metric-space assumption is required).
12. Theorem 1 Proof
- Consider the figure: let m be the optimal (unconstrained) median, and let point 4 be the input point closest to m
- For any other point i in the data, the distance from i to point 4 is bounded by c_i4 ≤ c_im + c_m4 (triangle inequality), and c_m4 ≤ c_im since 4 is the point closest to m
- Therefore, the cost of using the closest input point as the median is at most two times the cost of the median clustering with no constraints (worst case); the summation is worked below
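Summing the triangle-inequality bound over all points i, with m the unconstrained optimal median, 4 its closest input point, and C the unconstrained cost as in the figure:

```latex
\sum_i c_{i4} \;\le\; \sum_i \big( c_{im} + c_{m4} \big)
\;\le\; \sum_i 2\,c_{im} \;=\; 2C
```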
13. Theorem 2
- Consider a set of n points partitioned into disjoint sets X_1, ..., X_l. The sum of the optimum solution values for the k-median problem on the l sets of points is at most twice the cost of the optimum k-median solution for all n points.
14. Theorem 2 Proof
- This is Theorem 1 applied within each of the l sets
- Apply Theorem 1 l times: inside each X_j, restricting medians to X_j's own points at most doubles the cost that the unconstrained global optimum incurs there, so the total is at most twice the cost of the solution that is allowed to use medians which are not part of the data (in symbols below)
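A compact restatement, writing C_j for the optimum k-median cost on X_j and cost_j(OPT) for the cost the global optimum OPT incurs on the points of X_j (our notation):

```latex
C_j \;\le\; 2\,\mathrm{cost}_j(\mathrm{OPT})
\quad\Longrightarrow\quad
\sum_{j=1}^{l} C_j \;\le\; 2 \sum_{j=1}^{l} \mathrm{cost}_j(\mathrm{OPT}) \;=\; 2C
```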
15. Theorem 3 (SmallSpace Step 2)
- If the sum of the costs of the l optimum k-median solutions for X_1, ..., X_l is C', and C is the cost of the optimum k-median solution on the entire set S, then there exists a solution of cost at most 2(C + C') to the new weighted instance X'.
16. Theorem 3 Proof (1)
- Let i be a point in X' (a median obtained by SmallSpace)
- Let ψ(i) be the point to which i is assigned in the optimum continuous solution of the weighted instance, and let d_i be the number of points assigned to i
- Then the cost of X' is Σ_{i ∈ X'} d_i · c_{i ψ(i)}
17. Theorem 3 Proof (2)
- Let i be a point in the set S, and let i'(i) be the median in X' to which it was assigned by SmallSpace
- Then the cost of X' can be written as Σ_{i ∈ S} c_{i'(i) ψ(i'(i))}, since each median of X' contributes once per point assigned to it
- Let ρ(i) be the median assigned to i in the optimal continuous solution on S
18. Theorem 3 Proof (3)
- Because ψ is optimal for X', the cost is no more than Σ_{i ∈ S} c_{i'(i) ρ(i)} ≤ Σ_{i ∈ S} (c_{i i'(i)} + c_{i ρ(i)}) by the triangle inequality
- The last sum evaluates to C' + C in the continuous case, or 2(C' + C) in the metric-space case where medians must be input points (Theorem 1)
- Reminder: the sum of the costs of the l optimum k-median solutions for X_1, ..., X_l is C', and C is the cost of the optimum k-median solution on the entire set S
- (The full chain of inequalities is assembled below)
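The three proof slides chain together as follows (ψ, ρ and the prime marks are our reconstruction of glyphs lost from the slides); Theorem 1 then doubles the bound to 2(C + C') when medians must be input points:

```latex
\mathrm{cost}(X') \;=\; \sum_{i \in X'} d_i\, c_{i\,\psi(i)}
\;\le\; \sum_{i \in S} c_{\,i'(i)\,\rho(i)}
\;\le\; \sum_{i \in S} \big( c_{\,i\,i'(i)} + c_{\,i\,\rho(i)} \big)
\;=\; C' + C
```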
19. Theorem 4 (SmallSpace Steps 2 and 4)
- Modify Step 2 to use an (a, b)-bicriteria approximation algorithm, which outputs at most ak medians with cost at most b times the optimal k-median solution
- Modify Step 4 to run a c-approximation algorithm
- Theorem 4: the modified SmallSpace then has an approximation factor of 2c(1 + 2b) + 2b (not proven here)
20. SmallerSpace
- Algorithm SmallerSpace(S, i):
  1. Divide S into l disjoint pieces X_1, ..., X_l
  2. For each X_i, find O(k) centers in it, and assign each point to its closest center
  3. Let X' be the O(lk) centers obtained in (2), where each center is weighted by the number of points assigned to it
  4. Call SmallerSpace(X', i-1)
- (A recursive Python sketch follows below)
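A minimal recursive sketch under the same assumptions, reusing dist and k_median from the SmallSpace sketch above; the base case and the piece layout are our choices, not the paper's.

```python
def smaller_space(S, weights, k, l, i):
    # Base case of the recursion: cluster whatever remains down to k medians.
    if i == 0 or len(S) <= l * k:
        return k_median(S, weights, k)
    # Steps 1-3: the same divide-and-weight reduction as SmallSpace,
    # except the inputs may already carry weights from deeper levels.
    centers, center_weights = [], []
    for j in range(l):
        piece, piece_w = S[j::l], weights[j::l]
        meds = k_median(piece, piece_w, k)
        counts = [0] * len(meds)
        for p, w in zip(piece, piece_w):
            counts[min(range(len(meds)), key=lambda t: dist(p, meds[t]))] += w
        centers.extend(meds)
        center_weights.extend(counts)
    # Step 4: recurse on the O(lk) weighted centers.
    return smaller_space(centers, center_weights, k, l, i - 1)
```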
21. SmallerSpace (2)
- A small factor is lost in the approximation with each level of divide and conquer
- In general, if memory is n^ε, we need 1/ε levels, for an approximation factor of 2^O(1/ε)
- If n = 10^12 and M = 10^6, the regular 2-level algorithm suffices
- If n = 10^12 and M = 10^3, we need 4 levels, for an approximation factor of about 2^4
22. SmallerSpace Analysis
- Theorem 5: for a constant i, SmallerSpace(S, i) gives a constant-factor approximation to the k-median problem
- Proof: the approximation factor at level j satisfies A_j = 2A_{j-1}(2b + 1) + 2b (Theorems 2 and 4), which solves to A_j = O(c · (2(2b + 1))^j); this is O(1) if j is constant (the recurrence is unrolled below)
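Unrolling the reconstructed recurrence (the additive 2b term only changes constants):

```latex
A_j \;=\; 2(2b+1)\,A_{j-1} + 2b
\quad\Longrightarrow\quad
A_j \;=\; O\!\big( c \cdot (2(2b+1))^{\,j} \big),
\qquad O(1) \text{ for constant } j
```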
23. SmallerSpace Analysis (2)
- Since all intermediate medians X' must be stored in memory, the number of subsets l that we partition S into is limited
- In fact, we need lk < M (where M is the memory size), and such an l may not exist
25. Data Stream Model
- A data stream is an ordered set of points x_1, ..., x_i, ..., x_n
- Algorithm performance is measured by the number of passes over the data under the constraints of available memory
- Usually the number of points is extremely large, so it is impossible to fit all of them in memory
- Usually, once a point has been read, it is very expensive to read it again; most algorithms assume the data will not be available for a second pass
26. Data Stream Algorithm
1. Input the first m points; use a bicriteria algorithm to reduce these to O(k) (e.g., 2k) points, weighting each intermediate median by the number of points assigned to it (depending on the algorithm used, this takes O(m^2) or O(mk) time)
2. Repeat step (1) until we have seen m^2/(2k) of the original data points
3. Cluster the resulting m first-level medians into 2k second-level medians
27. Data Stream Algorithm (2)
4. In general, maintain at most m level-i medians, and on seeing m of them, generate 2k level-(i+1) medians, the weight of each new median being the sum of the weights of the intermediate medians assigned to it
5. When we have seen all the data points, or whenever we decide to stop, cluster all intermediate medians into k final medians
- (A Python sketch of this hierarchy follows below)
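A sketch of the level bookkeeping, again reusing dist and k_median from the SmallSpace sketch; it assumes m ≥ 2k and a stream of at least k points, and the buffer layout and names are ours.

```python
def stream_k_median(stream, k, m):
    # buffers[i] holds the (point, weight) pairs currently at level i.
    buffers = [[]]

    def reduce_level(i):
        # Collapse a full buffer of m weighted points into 2k weighted
        # medians at the next level up (requires m >= 2k).
        pts = [p for p, _ in buffers[i]]
        wts = [w for _, w in buffers[i]]
        meds = k_median(pts, wts, 2 * k)
        counts = [0.0] * len(meds)
        for p, w in zip(pts, wts):
            counts[min(range(len(meds)), key=lambda j: dist(p, meds[j]))] += w
        buffers[i] = []
        if i + 1 == len(buffers):
            buffers.append([])
        buffers[i + 1].extend(zip(meds, counts))

    for x in stream:                      # one pass over the data
        buffers[0].append((x, 1))
        for i in range(len(buffers)):     # cascade any full levels upward
            if len(buffers[i]) >= m:
                reduce_level(i)

    # Final step: cluster every surviving intermediate median into k medians.
    remaining = [pw for level in buffers for pw in level]
    pts = [p for p, _ in remaining]
    wts = [w for _, w in remaining]
    return k_median(pts, wts, k)
```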
28. Data Stream Algorithm (3)
[Figure: the median hierarchy; at each of levels 2, 3, ..., i, batches of m medians are reduced to 2k medians at the next level, and a final clustering produces the k output medians]
29. Data Stream Algorithm Analysis
- The algorithm requires O(log(n/m) / log(m/k)) levels
- If k is much smaller than m, and m = O(n^ε) for ε < 1, this gives:
  - O(n^ε) space
  - O(n^(1+ε)) run time
  - up to an O(2^(1/ε)) approximation factor (a constant-factor approximation; the level count is derived below)
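Why the level count is O(log(n/m) / log(m/k)): each level shrinks the data by a factor of roughly m/2k, so the number of level-i medians is about n(2k/m)^i, and this drops below m once:

```latex
n \left( \tfrac{2k}{m} \right)^{i} \;\le\; m
\quad\Longleftrightarrow\quad
i \;\ge\; \frac{\log(n/m)}{\log\!\big(m/(2k)\big)}
```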
31. Randomized Algorithm
- Draw a sample of size s = (nk)^(1/2)
- Find k medians from these s points using a primal-dual algorithm
- Assign each of the original points to its closest median
- Collect the n/s points with the largest assignment distance
- Find k medians from among these n/s points
- At this point we have 2k medians (a Python sketch follows below)
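A sketch of the five steps; the paper's primal-dual subroutine is replaced by the toy k_median from the SmallSpace sketch (an assumption for illustration), and dist is reused from there as well.

```python
import heapq
import random

def randomized_2k_medians(points, k):
    n = len(points)
    s = max(k, int((n * k) ** 0.5))       # sample size s = (nk)^(1/2)
    sample = random.sample(points, min(s, n))
    # Step 2: k medians of the sample (stand-in for the primal-dual algorithm).
    meds1 = k_median(sample, [1] * len(sample), k)
    # Steps 3-4: the ~n/s points farthest from their closest sample median.
    worst = heapq.nlargest(max(k, n // s), points,
                           key=lambda p: min(dist(p, m) for m in meds1))
    # Step 5: k more medians from the badly served points.
    meds2 = k_median(worst, [1] * len(worst), k)
    return meds1 + meds2                   # 2k medians in total
```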
32. Randomized Algorithm Analysis
- The algorithm gives an O(1) approximation with 2k medians, with constant probability
- O(log n) passes are needed for a high-probability result
- Space can be improved to O((nk)^(1/2))
33. Full Algorithm
1. Input the first O(M/k) points, then use the randomized algorithm to find 2k intermediate median points
2. Use a local search algorithm to cluster O(M) intermediate median points of level i into 2k medians of level i+1
3. Use the primal-dual algorithm to cluster the final O(k) medians into k medians
34. Full Algorithm (2)
- The full algorithm is still one pass (we call the randomized algorithm only once per input block)
- Step 1 applies the randomized algorithm independently to each block of O(M/k) points
- Step 2 is O(nk), which dominates
- Therefore, the final running time is O(nk)
36. Lower Bounds
- Consider a clustering instance where the distance between two points is 0 if they belong to the same cluster and 1 otherwise
- An algorithm is not constant-factor unless it discovers a clustering of cost 0
- Finding such a clustering is equivalent to the following: in a complete k-partite graph G, for some k, find the k-partition of the vertices of G into independent sets
- The best algorithm for this requires Ω(nk) queries, which therefore lower bounds any constant-factor clustering algorithm
37. Deterministic Algorithms: A1
1. Partition the n original points into p_1 subsets
2. Apply the primal-dual algorithm to each subset (quadratic in the subset size, so O((n/p_1)^2) each)
3. Apply it again to the p_1·k weighted points obtained in (2) to get the final k medians
38. A1 Details
- If we choose the number of subsets p_1 = (n/k)^(2/3), we have:
  - O(n^(4/3) k^(2/3)) runtime and space
  - a 4c^2 + 4c approximation factor by Theorem 4, where c is the approximation factor of the primal-dual algorithm (derived below)
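Where the 4c^2 + 4c comes from: the primal-dual subroutine is a (1, c)-bicriteria algorithm (it outputs exactly k medians at cost c times optimal), so Theorem 4 with b = c gives:

```latex
2c(1+2b) + 2b \,\big|_{b=c} \;=\; 2c + 4c^{2} + 2c \;=\; 4c^{2} + 4c
```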
39. Deterministic Algorithms: A2
1. Split the dataset into p_2 partitions
2. Apply A1 to each of them
3. Apply A1 to all the intermediate medians obtained in (2)
40. A2 Details
- If we choose the number of subsets p_2 = (n/k)^(4/5) in order to minimize the running time, we have:
  - O(n^(16/15) k^(14/15)) runtime and space
- We can see a trend!
41. Deterministic Algorithm
- Create algorithm A_i that calls A_{i-1} on p_i partitions
- Extrapolating the A1 and A2 bounds, the complexity in both time and space of this algorithm is O(n^(1 + 1/(4^i - 1)) k^(1 - 1/(4^i - 1)))
42. Deterministic Algorithm (2)
- The approximation factor grows with i, however
- We can set i = Θ(log log log n) in order to bring the exponent of n in the running time down to 1
43. Deterministic Algorithm (3)
- This gives an algorithm running in n^(1+o(1)) space and time
45. Conclusion
- We have presented a variety of algorithms optimized to address the problem of clustering in systems where the amount of data is huge
- All the algorithms presented are approximations to the k-median problem
46. References
- Eric W. Weisstein. "Complete k-Partite Graph." From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/Completek-PartiteGraph.html
- http://theory.stanford.edu/~nmishra/CS361-2002/lecture9-nina.ppt