Sampling From a Moving Window Over Streaming Data presentation

1
Sampling From a Moving Window Over Streaming Data

Stanford University
Speaker
2
Continuous Data Streams

3
Why Moving Windows?

4
Two Types of Windows

Sequence-Based
The most recent n elements from the data stream
Assumes a (possibly implicit) sequence number for
each element
Timestamp-Based
All elements from the data stream in the last m
units of time (e.g. last 1 week)
Assumes a (possibly implicit) arrival timestamp
for each element
Sequence-based is the focus for most of the talk
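
As a rough illustration of the two window definitions above, the sketch below materializes each window in full. The function names are hypothetical, timestamps are assumed to arrive in non-decreasing order, and storing the whole window is exactly the O(n) cost the sampling algorithms in this talk try to avoid.

```python
from collections import deque


def sequence_window(stream, n):
    """Yield the most recent n elements after each arrival."""
    window = deque(maxlen=n)  # the deque drops the oldest element itself
    for item in stream:
        window.append(item)
        yield list(window)


def timestamp_window(stream, m):
    """Yield all elements that arrived in the last m time units.

    `stream` yields (timestamp, item) pairs with non-decreasing timestamps.
    """
    window = deque()
    for t, item in stream:
        window.append((t, item))
        # An element is in the window while its timestamp is at least t - m.
        while window[0][0] < t - m:
            window.popleft()
        yield [x for _, x in window]
```

Note that the sequence-based window has a fixed size once n elements have arrived, while the timestamp-based window can grow or shrink with the arrival rate.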

5
Sampling From a Data Stream

Inputs
Sample size k
Window size n ≫ k (alternatively, time duration
m)
Stream of data elements that arrive online
Output
k elements chosen uniformly at random from the
last n elements (alternatively, from all elements
that have arrived in the last m time units)
Goal
maintain a data structure that can produce the
desired output at any time upon request

6
A Simple, Unsatisfying Approach

Choose a random subset X = {x1, ..., xk},
X ⊆ {0, 1, ..., n-1}
The sample always consists of the non-expired
elements whose indexes are equal to x1, ..., xk
(modulo n)
Only uses O(k) memory
Technically produces a uniform random sample of
each window, but unsatisfying because the sample
is highly periodic
Unsuitable for many real applications,
particularly those with periodicity in the data
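
The periodic approach above can be sketched as follows. The class and method names are hypothetical, indexes are assumed to start at 0, and the sample holds fewer than k elements until the first n elements have arrived.

```python
import random


class PeriodicSampler:
    """Keep the newest element at each of k fixed residues modulo n."""

    def __init__(self, n, k, rng=None):
        rng = rng or random.Random()
        self.n = n
        # The random subset X of {0, ..., n-1}, chosen once up front.
        self.residues = set(rng.sample(range(n), k))
        self.slots = {}  # residue -> most recent element with that residue
        self.count = 0   # index of the next element to arrive

    def add(self, item):
        r = self.count % self.n
        if r in self.residues:
            # Overwrites the element n positions back, which has just expired.
            self.slots[r] = item
        self.count += 1

    def sample(self):
        return list(self.slots.values())
```

Because the chosen residues never change, every sampled position is replaced exactly n arrivals later, which is the periodicity the slide warns about.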

7
Another Simple Approach: Oversample

As each element arrives, remember it with
probability p = (ck/n) log n; otherwise discard it
Discard elements when they expire
When asked to produce a sample, choose k elements
at random from the set in memory
Expected memory usage of O(k log n)
Uses O(k log n) memory with high probability (whp)
The algorithm can fail if fewer than k elements
from a window are remembered; however, whp this
will not happen
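
A minimal sketch of the oversampling idea, under a few assumptions: the names are hypothetical, the constant c is left as a tunable parameter, the natural logarithm is used, and p is capped at 1. The failure case is surfaced as an exception rather than hidden.

```python
import math
import random
from collections import deque


class Oversampler:
    """Remember each element with probability p = (c*k/n) * log n."""

    def __init__(self, n, k, c=2.0, rng=None):
        self.n, self.k = n, k
        self.p = min(1.0, c * k * math.log(n) / n)
        self.rng = rng or random.Random()
        self.kept = deque()  # (index, item) pairs, oldest first
        self.count = 0       # index of the next element to arrive

    def add(self, item):
        # The window after this arrival covers indexes count-n+1 .. count.
        while self.kept and self.kept[0][0] <= self.count - self.n:
            self.kept.popleft()
        if self.rng.random() < self.p:
            self.kept.append((self.count, item))
        self.count += 1

    def sample(self):
        if len(self.kept) < self.k:
            raise RuntimeError("fewer than k elements remembered")
        chosen = self.rng.sample(list(self.kept), self.k)
        return [item for _, item in chosen]
```

A larger c lowers the chance of the failure case at the price of proportionally more memory.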

8
Reservoir Sampling

Classic online algorithm due to Vitter (1985)
Maintains a fixed-size uniform random sample
Size of the data stream need not be known in
advance
Data structure: reservoir of k data elements
As the ith data element arrives:
Add it to the reservoir with probability p = k/i,
discarding a randomly chosen data element from
the reservoir to make room
Otherwise (with probability 1-p) discard it
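
The steps above correspond to the basic form of reservoir sampling, sketched here with a hypothetical function name. For the first k elements the probability k/i is at least 1, so they are always kept and nothing is evicted.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k elements from a stream."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # p = k/i >= 1: the reservoir is still filling up.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Evict a uniformly chosen resident to make room.
            reservoir[rng.randrange(k)] = item
    return reservoir
```

The function never needs the length of the stream, only the running count i.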

9
Why It Doesn't Work With Moving Windows

10
Chain-Sample

Include each new element in the sample with
probability 1/min(i,n)
As each element is added to the sample, choose
the index of the element that will replace it
when it expires
When the ith element expires, the window will be
(i+1, ..., i+n), so choose the index uniformly at
random from this range
Once the element with that index arrives, store
it and choose the index that will replace it in
turn, building a chain of potential
replacements
When an element is chosen to be discarded from
the sample, discard its chain as well
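
The chain mechanism can be sketched for a sample of size one. This is an illustrative reading of the steps above, not a reference implementation: the class name is hypothetical and indexes are assumed to start at 1. A sample of size k could be obtained by running k independent copies, which is an assumption here and would give a sample with replacement.

```python
import random
from collections import deque


class ChainSample:
    """Single-element chain-sample over the last n elements."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.i = 0                # index of the most recent element
        self.chain = deque()      # (index, item); chain[0] is the sample
        self.next_index = None    # index awaited as the tail's replacement

    def add(self, item):
        self.i += 1
        i, n = self.i, self.n
        if self.rng.random() < 1.0 / min(i, n):
            # The new element becomes the sample; drop the old chain.
            self.chain.clear()
            self.chain.append((i, item))
            self.next_index = self.rng.randint(i + 1, i + n)
        elif self.next_index == i:
            # This is the pre-chosen replacement for the chain's tail.
            self.chain.append((i, item))
            self.next_index = self.rng.randint(i + 1, i + n)
        # The head expires once it falls out of the window (i-n+1 .. i);
        # its replacement has already arrived by then.
        if self.chain and self.chain[0][0] <= i - n:
            self.chain.popleft()

    def sample(self):
        return self.chain[0][1]
```

Only the chain is stored, so memory depends on the chain length rather than on n.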

11
Example
3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
12
Memory Usage of Chain-Sample

Let T(x) denote the expected length of the chain
from the element with index i when the most
recent index is i+x
T(x)
The expected length of each chain is less than
T(n) ≤ e ≈ 2.718
Expected memory usage is O(k)
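
One recurrence consistent with the definition of T(x) above is worked out below. It is a reconstruction under stated assumptions (the replacement index is uniform on i+1, ..., i+n, and the chain length counts element i itself), not necessarily the exact form shown on the slide.

```latex
% Assumptions: replacement offset j is uniform on {1, ..., n};
% the chain length includes the element with index i.
\begin{align*}
T(x) &= 0 && \text{for } x < 0, \\
T(x) &= 1 + \frac{1}{n} \sum_{j=1}^{x} T(x - j) && \text{for } 0 \le x \le n, \\
T(x) - T(x-1) &= \frac{1}{n}\, T(x-1) && \text{for } 1 \le x \le n, \\
T(x) &= \Bigl(1 + \frac{1}{n}\Bigr)^{x}, \\
T(n) &= \Bigl(1 + \frac{1}{n}\Bigr)^{n} < e \approx 2.718 .
\end{align*}
```

Under these assumptions each chain has constant expected length, which matches the O(k) expected memory claim.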

13
Memory Usage of Chain-Sample

Chain consists of hops with lengths in 1, ..., n
Chain of length ≥ j can be represented by a
partition of n into j ordered integer parts:
j-1 hops with sum less than n, plus a remainder
Each such partition has probability n^(-j)
Number of such partitions is (n choose j) < (ne/j)^j
Probability of any such partition is small:
O(n^(-c)) when j = O(k log n)
Uses O(k log n) memory whp

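
The counting argument on this slide can be combined into a single tail bound. The worked steps below are a sketch for one chain; extending it to k chains by a union bound assumes the sample is built from k independent chains.

```latex
% Union bound over partitions, using the count and probability stated above.
\begin{align*}
\Pr[\text{chain length} \ge j]
  &\le \binom{n}{j}\, n^{-j}
   < \Bigl(\frac{ne}{j}\Bigr)^{j} n^{-j}
   = \Bigl(\frac{e}{j}\Bigr)^{j}, \\
\Bigl(\frac{e}{j}\Bigr)^{j}
  &\le e^{-j} \le n^{-c}
  && \text{whenever } j \ge \max(e^{2},\, c \ln n).
\end{align*}
% With k independent chains (an assumption), all chains have length
% O(log n) simultaneously with probability at least 1 - k n^{-c},
% giving O(k log n) total memory.
```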
14
Comparison of Algorithms
Algorithm      Expected memory   High-probability memory
Periodic       O(k)              O(k)
Oversample     O(k log n)        O(k log n)
Chain-Sample   O(k)              O(k log n)

15
Timestamp-Based Windows

Window at time t consists of all elements whose
arrival timestamp is at least t' = t-m
The number of elements in the window is not known
in advance and may vary over time
None of the previous algorithms will work
All require windows with a constant, known number
of elements

16
Priority-Sample

We describe priority-sample for k = 1
Assign each element a randomly-chosen priority
The element with the highest priority is the
sample
An element is ineligible if there is another
element with a later timestamp and higher
priority
Only store eligible, non-expired elements
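
A sketch of priority-sample for k = 1, following the rules above. The names are hypothetical, priorities are assumed to be uniform draws from [0, 1), and timestamps are assumed to arrive in non-decreasing order.

```python
import random
from collections import deque


class PrioritySample:
    """Single-element sample over the last m time units."""

    def __init__(self, m, rng=None):
        self.m = m
        self.rng = rng or random.Random()
        # (timestamp, priority, item): timestamps increase and priorities
        # decrease from front to back, so the front is the current sample.
        self.eligible = deque()

    def add(self, item, timestamp):
        priority = self.rng.random()
        # Stored elements with lower priority now have a later,
        # higher-priority element behind them, so they become ineligible.
        while self.eligible and self.eligible[-1][1] < priority:
            self.eligible.pop()
        self.eligible.append((timestamp, priority, item))
        self._expire(timestamp)

    def _expire(self, now):
        # An element is in the window while its timestamp is at least now - m.
        while self.eligible and self.eligible[0][0] < now - self.m:
            self.eligible.popleft()

    def sample(self, now):
        self._expire(now)
        return self.eligible[0][2] if self.eligible else None
```

When the front element expires, the next stored element is the highest-priority survivor, so no rescan of the window is needed.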

17
Memory Usage of Priority-Sample

Imagine that the elements were stored in a
treap totally ordered by arrival timestamp and
heap-ordered by priority
The eligible elements would represent the right
spine of the treap
We only store the eligible elements
Therefore expected memory usage is O(log n), or
O(k log n) for samples of size k
O(k log n) is also an upper bound (whp)
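
The O(log n) expectation can also be checked directly, without the treap picture. The calculation below assumes the window currently holds n elements and that priorities are independent, identically distributed, and distinct.

```latex
% The i-th most recent element is eligible exactly when it has the
% highest priority among the i most recent elements.
\begin{align*}
\Pr[\text{$i$-th most recent element is eligible}] &= \frac{1}{i}, \\
\mathbb{E}[\text{number of eligible elements}]
  &= \sum_{i=1}^{n} \frac{1}{i} = H_n \le 1 + \ln n = O(\log n).
\end{align*}
```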

18
Conclusion

Our contributions
Introduced the problem of maintaining a sample
over a moving window from a data stream
Developed the Chain-Sample algorithm for this
problem with sequence-based windows
Developed the Priority-Sample algorithm for this
problem with timestamp-based windows
Future work
What else can be computed in sublinear space over
moving windows on data streams?
For example: the next talk!
