Sampling From a Moving Window Over Streaming Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Sampling From a Moving Window Over Streaming Data
  • Brian Babcock, Mayur Datar, Rajeev Motwani

Stanford University
Speaker
2
Continuous Data Streams
  • Data streams arise in a number of applications
  • IP packets in a network
  • Call records (telecom)
  • Cash register data (retail sales)
  • Sensor networks
  • Large volumes of data
  • Online processing
  • Data is read once and discarded
  • Memory is limited

3
Why Moving Windows?
  • Timeliness matters
  • Old/obsolete data is not useful
  • Scalability matters
  • Querying the entire history may be impractical
  • Solution: restrict queries to a window of recent
    data
  • As new data arrives, old data expires
  • Addresses timeliness and scalability

4
Two Types of Windows
  • Sequence-Based
  • The most recent n elements from the data stream
  • Assumes a (possibly implicit) sequence number for
    each element
  • Timestamp-Based
  • All elements from the data stream in the last m
    units of time (e.g. last 1 week)
  • Assumes a (possibly implicit) arrival timestamp
    for each element
  • Sequence-based is the focus for most of the talk

5
Sampling From a Data Stream
  • Inputs
  • Sample size k
  • Window size n >> k (alternatively, time duration
    m)
  • Stream of data elements that arrive online
  • Output
  • k elements chosen uniformly at random from the
    last n elements (alternatively, from all elements
    that have arrived in the last m time units)
  • Goal
  • maintain a data structure that can produce the
    desired output at any time upon request

6
A Simple, Unsatisfying Approach
  • Choose a random subset X = {x1, ..., xk},
    X ⊆ {0, 1, ..., n-1} (sketched below)
  • The sample always consists of the non-expired
    elements whose indexes are equal to x1, ..., xk
    (modulo n)
  • Only uses O(k) memory
  • Technically produces a uniform random sample of
    each window, but unsatisfying because the sample
    is highly periodic
  • Unsuitable for many real applications,
    particularly those with periodicity in the data
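
A minimal sketch of this periodic approach in Python (class and
variable names are illustrative):

  import random

  class PeriodicSample:
      """Keep the elements whose index modulo n lands in a fixed random set X."""

      def __init__(self, k, n):
          self.n = n
          self.offsets = set(random.sample(range(n), k))  # X = {x1, ..., xk}
          self.current = {}  # offset -> latest (non-expired) element there

      def add(self, index, element):
          offset = index % self.n
          if offset in self.offsets:
              # Overwriting discards the element from the previous period,
              # so only O(k) elements are ever stored.
              self.current[offset] = element

      def sample(self):
          return list(self.current.values())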

7
Another Simple Approach: Oversample
  • As each element arrives, remember it with
    probability p = (ck/n) log n; otherwise discard
    it (sketched below)
  • Discard elements when they expire
  • When asked to produce a sample, choose k elements
    at random from the set in memory
  • Expected memory usage of O(k log n)
  • Uses O(k log n) memory whp
  • The algorithm can fail if fewer than k elements
    from a window are remembered; however, whp this
    will not happen
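
A sketch of the oversampling approach (the constant c is
illustrative):

  import math
  import random
  from collections import deque

  class Oversample:
      def __init__(self, k, n, c=2.0):
          self.k, self.n = k, n
          self.p = min(1.0, c * k * math.log(n) / n)  # retention probability
          self.kept = deque()                         # (index, element), oldest first

      def add(self, index, element):
          if random.random() < self.p:
              self.kept.append((index, element))
          while self.kept and self.kept[0][0] <= index - self.n:
              self.kept.popleft()  # expired elements leave the window

      def sample(self):
          # Fails if fewer than k elements were retained; whp this does not happen.
          return random.sample([e for _, e in self.kept], self.k)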

8
Reservoir Sampling
  • Classic online algorithm due to Vitter (1985)
  • Maintains a fixed-size uniform random sample
  • Size of the data stream need not be known in
    advance
  • Data structure reservoir of k data elements
  • As the ith data element arrives
  • Add it to the reservoir with probability p = k/i,
    discarding a randomly chosen data element from
    the reservoir to make room (see the sketch below)
  • Otherwise (with probability 1-p) discard it
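
A sketch of the reservoir step (Vitter's Algorithm R):

  import random

  def reservoir_sample(stream, k):
      """Maintain a uniform random sample of k elements from a stream."""
      reservoir = []
      for i, element in enumerate(stream, start=1):
          if i <= k:
              reservoir.append(element)  # fill the reservoir first
          elif random.random() < k / i:
              # With probability k/i, a uniformly chosen element is replaced.
              reservoir[random.randrange(k)] = element
      return reservoir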

9
Why It Doesn't Work With Moving Windows
  • Suppose an element in the reservoir expires
  • Need to replace it with a randomly-chosen element
    from the current window
  • However, in the data stream model we have no
    access to past data
  • Could store the entire window but this would
    require O(n) memory

10
Chain-Sample
  • Include the ith new element in the sample with
    probability 1/min(i,n) (see the sketch below)
  • As each element is added to the sample, choose
    the index of the element that will replace it
    when it expires
  • When the ith element expires, the window will be
    (i+1 ... i+n), so choose the index uniformly at
    random from this range
  • Once the element with that index arrives, store
    it and choose the index that will replace it in
    turn, building a chain of potential
    replacements
  • When an element is chosen to be discarded from
    the sample, discard its chain as well
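
A sketch of Chain-Sample for k = 1 (for k > 1, run k independent
copies); the data-structure details are illustrative:

  import random
  from collections import deque

  class ChainSample:
      """Chain-sample with k = 1 over a window of the last n elements."""

      def __init__(self, n):
          self.n = n
          self.i = 0              # index of the most recent element
          self.chain = deque()    # (index, element): the sample, then stored replacements
          self.next_index = None  # awaited index of the next link in the chain

      def add(self, element):
          self.i += 1
          i = self.i
          if random.random() < 1.0 / min(i, self.n):
              # The new element becomes the sample; the old chain is discarded.
              self.chain = deque([(i, element)])
              self.next_index = random.randint(i + 1, i + self.n)
          elif self.next_index == i:
              # The awaited replacement arrived: store it and pick its own replacement.
              self.chain.append((i, element))
              self.next_index = random.randint(i + 1, i + self.n)
          if self.chain and self.chain[0][0] <= i - self.n:
              # The sample expired: the next link in the chain takes over.
              self.chain.popleft()

      def sample(self):
          return self.chain[0][1] if self.chain else None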

11
Example
Data stream: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
12
Memory Usage of Chain-Sample
  • Let T(x) denote the expected length of the chain
    from the element with index i when the most
    recent index is i + x
  • T(x) = 1 + (1/n) Σ_{j<x} T(j), with T(0) = 1,
    which gives T(x) ≤ e^(x/n) (derivation sketched
    below)
  • The expected length of each chain is at most
    T(n) ≤ e ≈ 2.718
  • Expected memory usage is O(k)
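
A sketch of the derivation behind this bound, assuming each
replacement index is uniform over the next n positions:

  T(0) = 1, \qquad T(x) = 1 + \frac{1}{n}\sum_{j=0}^{x-1} T(j)

  T(x) \le 1 + \frac{1}{n}\sum_{j=0}^{x-1} e^{j/n}
       = 1 + \frac{1}{n}\cdot\frac{e^{x/n}-1}{e^{1/n}-1}
       \le e^{x/n}

  (by induction on x, using e^{1/n} - 1 \ge 1/n), hence T(n) \le e.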

13
Memory Usage of Chain-Sample
  • A chain consists of hops with lengths in 1..n
  • A chain of length ≥ j can be represented by a
    partition of n into j ordered integer parts
  • j-1 hops with sum less than n, plus a remainder
  • Each such partition has probability n^(-j)
  • The number of such partitions is C(n, j) < (ne/j)^j
  • The probability of any such partition is small:
    O(n^(-c)) when j = O(k log n) (see the bound below)
  • Uses O(k log n) memory whp
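
The union bound implicit in this slide, written out as a sketch:

  \Pr[\text{chain length} \ge j]
      \le \binom{n}{j}\, n^{-j}
      < \left(\frac{ne}{j}\right)^{j} n^{-j}
      = \left(\frac{e}{j}\right)^{j}

which is polynomially small in n once j is a sufficiently large
multiple of log n; summing over the k chains gives the O(k log n)
whp bound.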

14
Comparison of Algorithms
Algorithm      Expected memory   High-probability memory
Periodic       O(k)              O(k)
Oversample     O(k log n)        O(k log n)
Chain-Sample   O(k)              O(k log n)
  • Chain-sample is preferable to oversampling
  • Better expected memory usage: O(k) vs. O(k log n)
  • Same high-probability memory bound of O(k log n)
  • No chance of failure due to sample size shrinking
    below k

15
Timestamp-Based Windows
  • Window at time t consists of all elements whose
    arrival timestamp is at least t - m
  • The number of elements in the window is not known
    in advance and may vary over time
  • None of the previous algorithms will work
  • All require windows with a constant, known number
    of elements

16
Priority-Sample
  • We describe priority-sample for k = 1 (sketched
    below)
  • Assign each element a randomly-chosen priority
  • The element with the highest priority is the
    sample
  • An element is ineligible if there is another
    element with a later timestamp and a higher
    priority (it can never again be the maximum of
    any window, so it can safely be discarded)
  • Only store eligible, non-expired elements
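
A sketch of Priority-Sample for k = 1; it assumes elements arrive in
timestamp order, and the names are illustrative:

  import random
  from collections import deque

  class PrioritySample:
      """Priority-sample with k = 1 over a window of the last m time units."""

      def __init__(self, m):
          self.m = m
          self.spine = deque()  # (timestamp, priority, element), priorities decreasing

      def add(self, element, timestamp):
          priority = random.random()
          # Older elements with lower priority become ineligible: drop them.
          while self.spine and self.spine[-1][1] <= priority:
              self.spine.pop()
          self.spine.append((timestamp, priority, element))

      def sample(self, now):
          # Drop expired elements; the front is the highest-priority survivor.
          while self.spine and self.spine[0][0] < now - self.m:
              self.spine.popleft()
          return self.spine[0][2] if self.spine else None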

17
Memory Usage of Priority-Sample
  • Imagine that the elements were stored in a
    treap: totally ordered by arrival timestamp and
    heap-ordered by priority
  • The eligible elements would represent the right
    spine of the treap
  • We only store the eligible elements
  • Therefore expected memory usage is O(log n), or
    O(k log n) for samples of size k
  • O(k log n) is also an upper bound (whp)

18
Conclusion
  • Our contributions
  • Introduced the problem of maintaining a sample
    over a moving window from a data stream
  • Developed the Chain-Sample algorithm for this
    problem with sequence-based windows
  • Developed the Priority-Sample algorithm for this
    problem with timestamp-based windows
  • Future work
  • What else can be computed in sublinear space over
    moving windows on data streams?
  • For example: the next talk!