Maintaining Variance and k-Medians over Data Stream Windows

  • Paper by Brian Babcock, Mayur Datar, Rajeev
    Motwani and Liadan O'Callaghan.

Presentation by Anat Rapoport, December 2003.
Characteristics of the data stream
  • Data elements arrive continually
  • Only the most recent N elements are used when
    answering queries
  • Single linear scan algorithm (can only have one
    look at each element)
  • Store only a summary of the data seen thus far

  • Two important and related problems
  • Variance
  • K-median clustering

Problem 1 (Variance)
  • Given a stream of numbers, maintain at every
    instant the variance of the last N values:
    $V = \sum_{i=1}^{N} (x_i - \mu)^2$
  • where $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ denotes the mean of the
    last N values

Problem 1 (Variance)
  • We cannot buffer the entire sliding window in
    memory
  • So we cannot compute the variance exactly at
    every instant
  • We will solve this problem approximately.
  • We use $O(\frac{1}{\epsilon^2} \log NR^2)$ memory and provide an
    estimate with relative error of at most $\epsilon$
  • The time required per new element is amortized
    O(1)
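
For contrast, the exact computation that the slides rule out can be sketched by buffering the whole window. This is only a baseline for checking approximate answers, not part of the paper's algorithm. The function name is an illustrative choice, and the variance is assumed to be the sum of squared deviations without dividing by N, which is consistent with the bound V ≤ NR² used later in the talk.

```python
from collections import deque


def exact_window_variance(stream, window):
    """Yield the sum of squared deviations over the last `window` values.

    Uses O(window) memory, which is exactly what the streaming setting forbids.
    """
    buf = deque(maxlen=window)  # the oldest value is dropped automatically
    for x in stream:
        buf.append(x)
        mean = sum(buf) / len(buf)
        yield sum((v - mean) ** 2 for v in buf)


values = list(exact_window_variance([1, 2, 3, 4, 5], window=3))
print(values)
```

Each step costs time and memory proportional to the window size, which is the cost the approximate algorithm is designed to avoid.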

Extend to k-median
  • Given a multiset X of objects in a metric space M
    with distance function l, the k-median problem is
    to pick k points $c_1, \ldots, c_k \in M$ so as to minimize
    $\sum_{x \in X} l(x, C(x))$
  • where C(x) is the closest of $c_1, \ldots, c_k$ to x.
  • If $C(x) = c_i$ then x is said to be assigned to $c_i$,
    and $l(x, c_i)$ is called the assignment distance of x
  • The objective function is the sum of the
    assignment distances.
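
A minimal sketch of the objective, assuming points on the real line with absolute difference as the distance function. Both the helper names and the sample data are illustrative, not taken from the paper.

```python
def kmedian_cost(points, centers, dist):
    """Sum of assignment distances: each point pays the distance to its closest center."""
    return sum(min(dist(x, c) for c in centers) for x in points)


def abs_dist(a, b):
    """Distance on the real line (one possible metric)."""
    return abs(a - b)


X = [0, 1, 2, 10, 11, 12]
two_centers = kmedian_cost(X, [1, 11], abs_dist)
one_center = kmedian_cost(X, [6], abs_dist)
print(two_centers, one_center)
```

With two well-placed centers every point is within distance 1 of its center, while a single center in the middle pays the full gap between the two groups.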

Problem 2 (SWKM)
  • Given a stream of points from a metric space M
    with distance function l, window size N, and
    parameter k, maintain at every instant t a median
    set $c_1, \ldots, c_k \in M$ minimizing
    $\sum_{x \in X_t} l(x, C(x))$
  • where $X_t$ is the multiset of the N most recent
    points at time t

Exponential Histogram
  • From last week: "Maintaining simple statistics
    over sliding windows"
  • The exponential histogram estimates a class of
    aggregate functions over sliding windows
  • Their result applies to any function f satisfying
    the following properties for all multisets X, Y

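Where EH goes wrong
  • EH can estimate any function f defined over
    windows which satisfies
  • Positive: $f(X) \ge 0$
  • Polynomially bounded: $f(X) \le \mathrm{poly}(|X|)$
  • Composable: $f(X \cup Y) \ge f(X) + f(Y)$
  • Weakly additive: $f(X \cup Y) \le C_f (f(X) + f(Y))$,
  • where $C_f \ge 1$ is a constant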

Weakly Additive condition not valid for
variance, k-medians
Failure of Weak Additivity
Variance of each bucket is small
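
A small numeric illustration of the failure, using made-up data: two buckets that each have zero variance, whose union has a large one. The variance here is the sum of squared deviations (an assumption about the normalization), and the helper name is illustrative.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


X = [0.0] * 50     # bucket of identical values: variance 0
Y = [100.0] * 50   # another bucket of identical values: variance 0

v_x, v_y, v_xy = sum_sq_dev(X), sum_sq_dev(Y), sum_sq_dev(X + Y)
print(v_x, v_y, v_xy)
```

Since f(X) + f(Y) is 0 while f(X ∪ Y) is positive, no constant C_f can satisfy the weak additivity condition for this pair.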
The idea
  • Summarize intervals of the data stream using
    composable synopses
  • For efficient memory use, adjacent intervals are
    combined when this doesn't increase the error
  • The synopsis of the last interval in the sliding
    window is inaccurate, because some of its points have expired
  • We will find a way to estimate this interval

  • A timestamp corresponds to the position of an active data
    element in the current window
  • We do not make explicit updates
  • We use a wraparound counter of log N bits
  • The timestamp can be extracted by comparison with the
    counter value of the current arrival
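
The wraparound counter can be sketched as follows. The class and method names are hypothetical, and the modulus of 2N is an assumption made here so that an element that has just expired is still distinguishable from an active one; the slides only state that about log N bits suffice.

```python
class WrapCounter:
    """Arrival counter that wraps around instead of growing without bound."""

    def __init__(self, window):
        self.window = window
        self.modulus = 2 * window  # assumption: any modulus larger than N works
        self.now = 0

    def tick(self):
        """Advance on each arrival and return the stamp to store with the element."""
        self.now = (self.now + 1) % self.modulus
        return self.now

    def timestamp(self, stored):
        """Position in the window (1 = most recent) of an element stamped `stored`."""
        return (self.now - stored) % self.modulus + 1

    def expired(self, stored):
        return self.timestamp(stored) > self.window


counter = WrapCounter(window=4)
stamps = [counter.tick() for _ in range(11)]  # the counter wraps once along the way
print(counter.timestamp(stamps[-1]), counter.expired(stamps[-5]))
```

Stored stamps are never rewritten. The position is recovered only when needed, which is what "no explicit updates" amounts to. The comparison stays valid as long as elements are discarded before they are 2N arrivals old.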

  • We store the data elements in the buckets of the
    histogram
  • Every bucket stores the synopsis structure for a
    contiguous set of elements
  • The partition is based on arrival time
  • The bucket also has a timestamp, that of the most
    recent data element in it
  • When the timestamp reaches N+1 we drop the bucket

  • Buckets are numbered $B_1, \ldots, B_m$
  • $B_1$ is the most recent
  • $B_m$ is the oldest
  • $t_1, \ldots, t_m$ denote the bucket timestamps
  • All buckets but $B_m$ have only active data elements

Maintaining variance over sliding windows
  • We would like to estimate the variance with
    relative error of at most $\epsilon$
  • Maintain for each bucket $B_i$, besides its
    timestamp $t_i$, also
  • number of elements $n_i$
  • mean $\mu_i$
  • variance $V_i$

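  • Define another set of buckets $B_{1*}, \ldots, B_{m*}$ that
    represent the suffixes of the data stream:
    $B_{i*}$ holds all the points that arrived after bucket $B_i$
  • The bucket $B_{m*}$ represents all the points that
    arrived after the oldest non-expired bucket
  • The statistics for these buckets are computed
    temporarily, when they are needed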

data structure exponential histogram
[Figure: the buckets of the exponential histogram laid out over a window of size N, with the most recent end marked.]

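Combination rule
  • In the algorithm we will need to combine adjacent
    buckets
  • Consider two buckets $B_i$ and $B_j$ that get combined
    to form a new bucket $B_{i,j}$
  • The statistics for $B_{i,j}$ are
    $n_{i,j} = n_i + n_j$
    $\mu_{i,j} = \frac{\mu_i n_i + \mu_j n_j}{n_{i,j}}$
    $V_{i,j} = V_i + V_j + \frac{n_i n_j}{n_{i,j}} (\mu_i - \mu_j)^2$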
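
The rule for merging two buckets adds the counts, takes the weighted average of the means, and adds to the two variances a correction term proportional to the squared difference of the means. A minimal sketch follows, checked against a direct computation on the concatenated data. Representing a bucket as an `(n, mean, V)` tuple is an illustrative choice, and V is the sum of squared deviations.

```python
def stats(values):
    """Return (count, mean, sum of squared deviations) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return n, mean, sum((v - mean) ** 2 for v in values)


def combine(a, b):
    """Merge the statistics of two disjoint buckets without looking at the data."""
    (n_i, mu_i, v_i), (n_j, mu_j, v_j) = a, b
    n = n_i + n_j
    mu = (mu_i * n_i + mu_j * n_j) / n
    v = v_i + v_j + n_i * n_j / n * (mu_i - mu_j) ** 2
    return n, mu, v


left, right = [1.0, 2.0, 3.0], [10.0, 14.0]
merged = combine(stats(left), stats(right))
direct = stats(left + right)
print(merged, direct)
```

Because the correction term is never negative, the variance of a union is at least the sum of the two variances, a fact used later when counting buckets.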

Lemma 1
  • The bucket combination procedure correctly
    computes $n_{i,j}$, $\mu_{i,j}$, $V_{i,j}$ for the new bucket

  • Note that $n_{i,j}$ and $\mu_{i,j}$ are correctly computed from
    the definitions of count and average
  • Define $\delta_i = \mu_i - \mu_{i,j}$ and $\delta_j = \mu_j - \mu_{i,j}$
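
The slide that carried the rest of the proof has no transcript. One way to finish the argument, offered as a reconstruction rather than the presenter's own derivation, writes $\delta_i = \mu_i - \mu_{i,j}$ and $\delta_j = \mu_j - \mu_{i,j}$ and takes $V$ to be the sum of squared deviations:

```latex
\begin{align*}
V_{i,j} &= \sum_{x \in B_i} (x - \mu_{i,j})^2 + \sum_{x \in B_j} (x - \mu_{i,j})^2 \\
        &= \sum_{x \in B_i} \bigl((x - \mu_i) + \delta_i\bigr)^2
         + \sum_{x \in B_j} \bigl((x - \mu_j) + \delta_j\bigr)^2 \\
        &= V_i + n_i \delta_i^2 + V_j + n_j \delta_j^2
           \qquad \text{since } \sum_{x \in B_i} (x - \mu_i) = 0
           \text{ and } \sum_{x \in B_j} (x - \mu_j) = 0 . \\
\intertext{Substituting $\delta_i = \frac{n_j}{n_{i,j}} (\mu_i - \mu_j)$ and
$\delta_j = -\frac{n_i}{n_{i,j}} (\mu_i - \mu_j)$ gives}
n_i \delta_i^2 + n_j \delta_j^2
        &= \frac{n_i n_j^2 + n_j n_i^2}{n_{i,j}^2} (\mu_i - \mu_j)^2
         = \frac{n_i n_j}{n_{i,j}} (\mu_i - \mu_j)^2 , \\
V_{i,j} &= V_i + V_j + \frac{n_i n_j}{n_{i,j}} (\mu_i - \mu_j)^2 .
\end{align*}
```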

Main Solution Idea
  • More careful estimation of last buckets
  • Decompose variance into two parts
  • Internal variance within bucket
  • External variance between buckets

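Estimation of the variance over the current
active window
  • Let $B_{\tilde{m}}$ refer to the non-expired portion of the
    bucket $B_m$ (the set of active elements)
  • The estimates for $n_{\tilde{m}}$, $\mu_{\tilde{m}}$, $V_{\tilde{m}}$ are
  • $n_{\tilde{m}}^{EST} = N + 1 - t_m$ (exact)
  • $\mu_{\tilde{m}}^{EST} = \mu_m$
  • $V_{\tilde{m}}^{EST} = V_m / 2$
  • The statistics for $B_{\tilde{m},m*}$, the combination of $B_{\tilde{m}}$
    and $B_{m*}$, are sufficient for
    computing the variance at time t.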

Estimation of the variance over the current
active window
  • The estimate for $B_{\tilde{m}}$ can be found in O(1) time if
    we keep statistics for $B_m$
  • The error is due to the error in the estimated
    statistics for $B_{\tilde{m}}$
  • Theorem: the relative error is at most $\epsilon$, provided
    $V_m \le \frac{\epsilon^2}{9} V_{m*}$
  • Aim: maintain $V_m \le \frac{\epsilon^2}{9} V_{m*}$ using as few
    buckets as possible

Algorithm sketch
  • for every new element
  • insert the new element
  • into an existing bucket or into a new bucket
  • if $B_m$'s timestamp > N, delete it
  • if there are two adjacent buckets with small
    combined variance, combine them into one bucket

Algorithm 1 (insert $x_t$)
  • 1. If $x_t = \mu_1$ then insert $x_t$ into $B_1$ by
    incrementing $n_1$ by 1. Otherwise, create a new
    bucket for $x_t$. The new bucket becomes $B_1$ with
    $V_1 = 0$, $\mu_1 = x_t$, $n_1 = 1$. An old bucket $B_i$ becomes
    $B_{i+1}$.
  • 2. If $B_m$'s timestamp > N, delete the bucket.
    Bucket $B_{m-1}$ becomes the new oldest bucket.
    Maintain the statistics of $B_{(m-1)*}$ (instead of
    $B_{m*}$), which can be computed using the previously
    maintained statistics for $B_{m*}$ and $B_{m-1}$
    (the combination rule also works in reverse, for deleting a bucket).

Algorithm 1 (insert $x_t$)
  • 3. Let $k = 9/\epsilon^2$ and let $V_{i,i-1}$ be the variance of the
    combination of buckets $B_i$ and $B_{i-1}$. While there
    exists an index i > 2 such that $k V_{i,i-1} \le V_{(i-1)*}$, find
    the smallest such i and combine the buckets according
    to the combination rule. The statistics for $B_{i*}$
    can be computed incrementally from the statistics
    for $B_{(i-1)*}$ and $B_{i-1}$.
  • 4. Output the estimated variance at time t
    according to the estimation procedure: $V_{\tilde{m},m*}$
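
The four steps can be put together in a compact sketch. This is an illustration under stated assumptions, not the authors' implementation: the class and field names are made up, suffix statistics are recomputed during each sweep rather than maintained, each bucket stores the arrival index of its newest element instead of a wraparound counter, and a bucket with no expired elements is used exactly (a case the slides do not spell out).

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    last: int    # arrival index of the most recent element in the bucket
    n: int       # number of elements
    mean: float
    var: float   # sum of squared deviations from the bucket mean


def combine(a, b):
    """Combination rule; the result carries the more recent arrival index."""
    n = a.n + b.n
    mean = (a.n * a.mean + b.n * b.mean) / n
    var = a.var + b.var + a.n * b.n / n * (a.mean - b.mean) ** 2
    return Bucket(max(a.last, b.last), n, mean, var)


class SlidingVariance:
    def __init__(self, window, eps):
        self.N = window
        self.k = 9.0 / eps ** 2
        self.t = 0           # number of arrivals so far
        self.buckets = []    # buckets[0] is B_1, the most recent

    def insert(self, x):
        self.t += 1
        b = self.buckets
        # Step 1: extend B_1 if x equals its mean, otherwise open a new bucket.
        if b and b[0].mean == x:
            b[0].n += 1
            b[0].last = self.t
        else:
            b.insert(0, Bucket(self.t, 1, float(x), 0.0))
        # Step 2: drop B_m once its most recent element has left the window.
        if self.t - b[-1].last + 1 > self.N:
            b.pop()
        # Step 3: merge B_i and B_(i-1) while k * V_(i,i-1) <= V_((i-1)*).
        merged = True
        while merged:
            merged = False
            suffix = None    # running statistics of B_((i-1)*)
            for j in range(2, len(b)):   # b[j] is B_i with i = j + 1 > 2
                suffix = b[j - 2] if suffix is None else combine(suffix, b[j - 2])
                cand = combine(b[j], b[j - 1])
                if self.k * cand.var <= suffix.var:
                    b[j - 1:j + 1] = [cand]
                    merged = True
                    break

    def estimate(self):
        """Step 4: combine the estimate for the active part of B_m with B_(m*)."""
        oldest = self.buckets[-1]
        active = self.N - self.t + oldest.last    # N + 1 - t_m
        if active < oldest.n:                     # part of B_m has expired
            est = Bucket(oldest.last, active, oldest.mean, oldest.var / 2)
        else:                                     # B_m is fully active: exact
            est = oldest
        for other in self.buckets[:-1]:
            est = combine(est, other)
        return est.var


sv = SlidingVariance(window=4, eps=0.5)
for x in [1.0, 1.0, 2.0, 3.0, 5.0]:
    sv.insert(x)
print(sv.estimate())
```

The sweep here runs after every insertion and costs time proportional to the number of buckets.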

  • Invariant 1: For every bucket $B_i$, $\frac{9}{\epsilon^2} V_i \le V_{i*}$
  • Ensures that the relative error is at most $\epsilon$
  • Invariant 2: For each i > 1, for every bucket $B_i$,
    $\frac{9}{\epsilon^2} V_{i,i-1} > V_{(i-1)*}$
  • This invariant ensures that the total number of
    buckets is small: $O(\frac{1}{\epsilon^2} \log NR^2)$
  • Each bucket requires constant space

Lemma 2
  • The number of buckets maintained at any point in
    time by an algorithm that preserves Invariant 2 is
  • $O(\frac{1}{\epsilon^2} \log NR^2)$
  • where R is an upper bound on the absolute value
    of the data elements.

Proof sketch
  • From the combination rule, the variance of the
    union of two buckets is no less than the sum of
    the individual variances.
  • For an algorithm that preserves Invariant 2, the
    variance of the suffix bucket $B_{i*}$ doubles after
    every $O(1/\epsilon^2)$ buckets.
  • The total number of buckets is therefore no more than
    $O(\frac{1}{\epsilon^2} \log V)$, where V is the variance of the last N
    points. V is no more than $NR^2$, which gives $O(\frac{1}{\epsilon^2} \log NR^2)$

Running time improvement
  • The algorithm requires $O(\frac{1}{\epsilon^2} \log NR^2)$ time per
    new element.
  • Most of the time is spent in step 3, where we make the
    sweep to combine buckets.
  • The time is proportional to the size of the
    histogram, $O(\frac{1}{\epsilon^2} \log NR^2)$.
  • The trick: skip step 3 until we have seen
    $\Theta(\frac{1}{\epsilon^2} \log NR^2)$ data points.
  • This ensures that the running time of the algorithm is
    amortized O(1).
  • This may violate Invariant 2 temporarily, but we
    restore it every $\Theta(\frac{1}{\epsilon^2} \log NR^2)$ data points, when
    we execute step 3.

Variance algorithm summary
  • $O(\frac{1}{\epsilon^2} \log NR^2)$ time per new element, or amortized O(1)
    when step 3 is batched
  • $O(\frac{1}{\epsilon^2} \log NR^2)$ memory
  • relative error of at most $\epsilon$

Clustering on sliding windows
Clustering Data Streams
  • Based on k-median problem
  • Data stream points from metric space.
  • Find k clusters in the stream such that the sum
    of distances from data points to their closest
    center is minimized

Clustering Data Streams
  • Constant factor approximation algorithms
  • A simple two step algorithm
  • Step 1: For each set $S_i$ of $M = n^\tau$ points,
    find O(k) centers in $S_1, \ldots, S_M$
  • Local clustering: assign each
    point in $S_i$ to its closest center
  • Step 2: Let S' be the centers for $S_1, \ldots, S_M$,
    with each center weighted by the
    number of points assigned to it.
  • Cluster S' to find k centers
  • The solution cost is < 2 · optimal solution cost
  • $\tau < 0.5$ is a parameter which trades off the space bound
    against the approximation factor of $2^{O(1/\tau)}$
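
The two steps can be sketched on the real line. The local clustering routine below is a brute-force stand-in chosen only to keep the example short and exact on tiny inputs; it is not the constant-factor approximation algorithm the slides refer to, and it returns exactly k centers per chunk where the slides allow O(k). All names and data are illustrative.

```python
from itertools import combinations


def cost(points, centers):
    """Weighted k-median cost on the real line; points are (value, weight) pairs."""
    return sum(w * min(abs(p - c) for c in centers) for p, w in points)


def brute_force_kmedian(points, k):
    """Exact weighted k-median with centers drawn from the input (tiny inputs only)."""
    candidates = sorted({p for p, _ in points})
    k = min(k, len(candidates))
    return min(combinations(candidates, k), key=lambda cs: cost(points, cs))


def two_step_kmedian(data, k, chunk_size):
    weighted_centers = []
    # Step 1: cluster each chunk and weight each center by its assigned points.
    for start in range(0, len(data), chunk_size):
        chunk = [(p, 1) for p in data[start:start + chunk_size]]
        centers = brute_force_kmedian(chunk, k)
        weights = dict.fromkeys(centers, 0)
        for p, w in chunk:
            nearest = min(centers, key=lambda c: abs(p - c))
            weights[nearest] += w
        weighted_centers.extend(weights.items())
    # Step 2: cluster the weighted centers themselves.
    return brute_force_kmedian(weighted_centers, k), weighted_centers


data = [0, 1, 2, 100, 101, 102, 1, 0, 2, 101, 100, 102]
final, intermediate = two_step_kmedian(data, k=2, chunk_size=6)
print(final, intermediate)
```

Only the weighted centers survive from each chunk, so the memory needed at any time is one chunk plus the centers collected so far.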

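One-pass algorithm first phase
One-pass algorithm second phase
Restate the algorithm
[Figure: the input data stream is read in batches of $N^\tau$ points (level-0 medians). For each batch, find O(k) medians, store them with their weights, and discard the $N^\tau$ points; these are the level-1 medians. Once there are $N^\tau$ level-1 medians with associated weights, find O(k) medians to obtain the level-2 medians. Repeat $1/\tau$ times.]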
The idea
  • In general, whenever there are $N^\tau$ medians at
    level i they are clustered to form level-(i+1)
    medians

[Figure labels, top to bottom: level-(i+1) medians, level-i medians, data points.]
data structure exponential histogram
  • each bucket consists of a collection of data
    points or intermediate medians.

Point representation
  • Each point is represented by a triple
  • (p(x), w(x), c(x)).
  • p(x) - identifier of x (coordinate)
  • w(x) - weight of x, the number of points it
    represents
  • c(x) - cost of x. An estimate of the sum of costs
    l(x,y) of all the leaves y in the tree of which x is
    the root.
  • $w(x) = \sum_i w(y_i)$ over the medians $y_1, \ldots, y_i$ assigned to x
  • $c(x) = \sum_y \left( c(y) + w(y) \cdot l(x,y) \right)$, for all y assigned
    to x
  • If x is a level-0 median, w(x) = 1 and c(x) = 0
  • Thus, c(x) is an overestimate of the true cost
    of x
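
A short sketch of how the triple for a higher-level median might be built from the medians assigned to it, and of why c(x) overestimates the true cost. The `Median` class, the use of points on the real line, and the sample values are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Median:
    p: float        # identifier (here a coordinate on the real line)
    w: int = 1      # weight: number of original points represented
    c: float = 0.0  # cost estimate; 0 for a level-0 median


def make_median(p, assigned):
    """Build the triple for a median at position p from the medians assigned to it."""
    w = sum(y.w for y in assigned)
    c = sum(y.c + y.w * abs(p - y.p) for y in assigned)
    return Median(p, w, c)


leaves_a = [Median(0.0), Median(1.0)]
leaves_b = [Median(10.0), Median(12.0)]
a = make_median(0.0, leaves_a)     # level-1 median for the first group
b = make_median(12.0, leaves_b)    # level-1 median for the second group
root = make_median(0.0, [a, b])    # level-2 median over both groups

true_cost = sum(abs(root.p - y.p) for y in leaves_a + leaves_b)
print(root, true_cost)
```

The estimate charges every leaf under b for the full distance from b to the root, so by the triangle inequality it can only exceed the true sum of leaf distances.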

Bucket cost function
  • We maintain medians at intermediate levels
  • Whenever there are $N^\tau$ medians at the same level
    we cluster them into O(k) medians at the next
    higher level
  • Each bucket can be split into $1/\tau$ groups
    $R^0, \ldots, R^{1/\tau - 1}$,
  • where each $R^j$ contains medians at level j
  • Each group contains at most $N^\tau$ medians

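Bucket cost function
  • The bucket's cost function is an estimate of the cost
    of clustering the points represented by the
    bucket
  • Consider bucket $B_i$. Let $S_i = \bigcup_j R_i^j$ be
    the set of medians in the bucket
  • Cost function for $B_i$:
    $f(B_i) = \sum_{x \in S_i} \left( c(x) + w(x) \cdot l(x, C(x)) \right)$
  • where $C(x) \in \{c_1, \ldots, c_k\}$ is the
    median closest to x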

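  • Let $B_i$ and $B_j$ be two adjacent buckets that need
    to be combined to form $B_{i,j}$
  • Let $R_i^0, \ldots, R_i^{1/\tau - 1}$ and $R_j^0, \ldots, R_j^{1/\tau - 1}$
    be the groups of medians from the two buckets.
  • Set $R_{i,j}^0 = R_i^0 \cup R_j^0$. If $|R_{i,j}^0| > N^\tau$ then cluster the points from $R_{i,j}^0$
    and set it to be empty.
  • $C^0$: the set of O(k) medians obtained by clustering $R_{i,j}^0$
  • Set $R_{i,j}^1 = R_i^1 \cup R_j^1 \cup C^0$, and so on. After
    at most $1/\tau$ unions we get $B_{i,j}$
  • Now we compute the new bucket's cost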

Answer a query
  • Consider buckets $B_1, \ldots, B_{m-1}$
  • Each contains at most $\frac{1}{\tau} N^\tau$ medians, so with
    $O(\frac{1}{\tau} \log N)$ buckets they contain $O(\frac{1}{\tau^2} N^\tau \log N)$ medians in total
  • Cluster them to produce k medians
  • Cluster bucket $B_m$ to get k additional medians
  • Present the 2k medians as the answer

Algorithm: insert $x_t$
  • If the number of level-0 medians in $B_1$ is less than k, add
    the point $x_t$ as a level-0 median in bucket $B_1$.
    Otherwise create a new bucket $B_1$ to contain $x_t$
    and renumber the existing buckets.
  • If bucket $B_m$'s timestamp > N, delete it;
    $B_{m-1}$ now becomes the last bucket.
  • Make a sweep over the buckets from most recent to
    least recent: while there exists an index i > 2
    such that $f(B_{i,i-1}) \le 2 f(B_{(i-1)*})$, find the smallest
    such i and combine buckets $B_i$ and $B_{i-1}$ using the
    combination procedure described above.

  • Invariant 3. For every bucket $B_i$,
  • $f(B_i) \le 2 f(B_{i*})$
  • Ensures a solution with 2k medians whose cost is
    within a multiplicative factor of $2^{O(1/\tau)}$ of the
    cost of the optimal k-median solution.
  • Invariant 4. For every bucket $B_i$ (i > 1),
  • $f(B_{i,i-1}) > 2 f(B_{(i-1)*})$
  • Ensures that the number of buckets stays small
  • We assume that the cost is bounded by poly(N)
  • This gives $O(\frac{1}{\tau} \log N)$ buckets in the article

Running time improvement
  • After each element arrives we check whether Invariant
    3 holds.
  • To reduce the running time, we can execute bucket
    combination only after some number of points have
    accumulated in bucket $B_1$; only after it fills do we
    check the invariant.
  • We assume that the algorithm is not called after
    each new entry. Instead, it maintains enough
    statistics to produce an answer when a query
    arrives

Producing exactly k clusters
  • With each median, we estimate within a constant
    factor the number of active data points that are
    assigned to it.
  • We don't cluster $B_m$ and $B_{m*}$ separately but
    cluster the medians from all the buckets
    together. However, the weights of the medians from $B_m$
    are adjusted so that they reflect only the active
    data points.

  • The goal of such algorithms is to maintain
    statistics or information for the last N
    entries of a stream that grows in real time.
  • The variance algorithm uses $O(\frac{1}{\epsilon^2} \log NR^2)$ memory
    and maintains an estimate of the variance with
    relative error of at most $\epsilon$ and amortized O(1)
    time per new element
  • The k-median algorithm provides a $2^{O(1/\tau)}$
    approximation for $\tau < 0.5$. It uses $O(\frac{1}{\tau} \log N)$
    memory and requires O(1) amortized time per new
    element
  • More questions/comments can be sent to