Title: Sampling From a Moving Window Over Streaming Data
1Sampling From a Moving Window Over Streaming Data
- Brian Babcock, Mayur Datar, Rajeev Motwani
Stanford University
Speaker
2Continuous Data Streams
- Data streams arise in a number of applications
- IP packets in a network
- Call records (telecom)
- Cash register data (retail sales)
- Sensor networks
- Large volumes of data
- Online processing
- Data is read once and discarded
- Memory is limited
3Why Moving Windows?
- Timeliness matters
- Old/obsolete data is not useful
- Scalability matters
- Querying the entire history may be impractical
- Solution: restrict queries to a window of recent data
- As new data arrives, old data expires
- Addresses timeliness and scalability
4Two Types of Windows
- Sequence-Based
- The most recent n elements from the data stream
- Assumes a (possibly implicit) sequence number for each element
- Timestamp-Based
- All elements from the data stream in the last m units of time (e.g., the last week)
- Assumes a (possibly implicit) arrival timestamp for each element
- Sequence-based windows are the focus for most of the talk
5Sampling From a Data Stream
- Inputs
- Sample size k
- Window size n >> k (alternatively, time duration m)
- Stream of data elements that arrive online
- Output
- k elements chosen uniformly at random from the last n elements (alternatively, from all elements that have arrived in the last m time units)
- Goal
- Maintain a data structure that can produce the desired output at any time upon request
6A Simple, Unsatisfying Approach
- Choose a random subset X = {x1, ..., xk}, X ⊆ {0, 1, ..., n-1}
- The sample always consists of the non-expired elements whose indexes are equal to x1, ..., xk (modulo n)
- Only uses O(k) memory
- Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic
- Unsuitable for many real applications, particularly those with periodicity in the data
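The periodic approach above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the class name `PeriodicSampler` and its methods are hypothetical, and elements are assumed to carry implicit 0-based sequence numbers.

```python
import random


class PeriodicSampler:
    """Hypothetical sketch: keep the elements whose index mod n lies in X."""

    def __init__(self, n, k, seed=None):
        rng = random.Random(seed)
        self.n = n
        # X: k distinct residues chosen once, uniformly, from {0, ..., n-1}
        self.residues = set(rng.sample(range(n), k))
        self.slots = {}  # residue -> (index, value) of the latest match
        self.count = 0   # number of elements seen so far

    def add(self, value):
        r = self.count % self.n
        if r in self.residues:
            # Overwrites the previous element with this residue,
            # which expires at exactly this moment.
            self.slots[r] = (self.count, value)
        self.count += 1

    def sample(self):
        # Each stored entry is the most recent element with its residue,
        # so it always lies inside the current window of n elements.
        return [value for _, value in self.slots.values()]
```

The sketch makes the periodicity visible: the same k positions (modulo n) are sampled in every window, which is why the slide calls the approach unsatisfying.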
7Another Simple Approach: Oversample
- As each element arrives, remember it with probability p = (ck/n) log n; otherwise discard it
- Discard elements when they expire
- When asked to produce a sample, choose k elements at random from the set in memory
- Expected memory usage of O(k log n)
- Uses O(k log n) memory with high probability (whp)
- The algorithm can fail if fewer than k elements from a window are remembered; however, whp this will not happen
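A rough sketch of oversampling for sequence-based windows follows. The class name `OversampleSampler`, the default constant `c=2.0`, and the use of the natural logarithm are assumptions for illustration; the slide does not fix them.

```python
import math
import random
from collections import deque


class OversampleSampler:
    """Hypothetical sketch of oversampling over the last n elements."""

    def __init__(self, n, k, c=2.0, seed=None):
        self.n, self.k = n, k
        self.rng = random.Random(seed)
        # Retention probability p = (c*k/n) * log n, capped at 1 (c is assumed)
        self.p = min(1.0, c * k * math.log(n) / n)
        self.kept = deque()  # (index, value) pairs, oldest first
        self.count = 0       # number of elements seen so far

    def add(self, value):
        if self.rng.random() < self.p:
            self.kept.append((self.count, value))
        self.count += 1
        # The window now covers indexes count-n .. count-1; drop older ones
        while self.kept and self.kept[0][0] < self.count - self.n:
            self.kept.popleft()

    def sample(self):
        if len(self.kept) < self.k:
            # The failure case: too few window elements were remembered
            raise RuntimeError("fewer than k elements retained")
        return self.rng.sample([value for _, value in self.kept], self.k)
```

The explicit `RuntimeError` corresponds to the failure mode on the slide: the sample can come up short, although this is unlikely for a suitably large constant c.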
8Reservoir Sampling
- Classic online algorithm due to Vitter (1985)
- Maintains a fixed-size uniform random sample
- Size of the data stream need not be known in advance
- Data structure: reservoir of k data elements
- As the ith data element arrives
- Add it to the reservoir with probability p = k/i, discarding a randomly chosen data element from the reservoir to make room
- Otherwise (with probability 1-p) discard it
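The reservoir update rule above can be written as a short function. This is a minimal sketch of the basic replacement rule only; the function name `reservoir_sample` is chosen here for illustration, and the first k elements simply fill the reservoir (where k/i >= 1).

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of k items from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k elements are always kept (probability k/i >= 1)
            reservoir.append(item)
        elif rng.random() < k / i:
            # Keep element i with probability k/i, evicting a random slot
            reservoir[rng.randrange(k)] = item
        # Otherwise the element is discarded
    return reservoir
```

For example, `reservoir_sample(range(1000), 10)` returns ten distinct values from the stream without ever storing more than ten elements.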
9Why It Doesn't Work With Moving Windows
- Suppose an element in the reservoir expires
- Need to replace it with a randomly-chosen element from the current window
- However, in the data stream model we have no access to past data
- Could store the entire window, but this would require O(n) memory
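For contrast, the store-everything baseline is easy to write down. This is a sketch under the assumption of a sequence-based window; the class name `FullWindowSampler` is hypothetical. It is correct but uses O(n) memory, which is exactly the cost the talk sets out to avoid.

```python
import random
from collections import deque


class FullWindowSampler:
    """Hypothetical baseline: buffer the whole window, sample on demand."""

    def __init__(self, n, k, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        # A bounded deque drops the oldest element automatically
        self.window = deque(maxlen=n)

    def add(self, value):
        self.window.append(value)

    def sample(self):
        # Uniform sample without replacement from the buffered window
        size = min(self.k, len(self.window))
        return self.rng.sample(list(self.window), size)
```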
10Chain-Sample
- Include each new element in the sample with probability 1/min(i,n)
- As each element is added to the sample, choose the index of the element that will replace it when it expires
- When the ith element expires, the window will be (i+1, ..., i+n), so choose the index from this range
- Once the element with that index arrives, store it and choose the index that will replace it in turn, building a chain of potential replacements
- When an element is chosen to be discarded from the sample, discard its chain as well
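The steps above can be sketched for a single sample (k = 1) as follows. This is an illustrative reading of the slide, not the authors' implementation; the class name `ChainSample` and its fields are hypothetical, and indexes are 1-based as on the slide.

```python
import random
from collections import deque


class ChainSample:
    """Hypothetical sketch: one uniform sample over the last n elements."""

    def __init__(self, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)
        self.chain = deque()    # (index, value); chain[0] is the sample
        self.next_index = None  # index of the replacement yet to arrive
        self.i = 0              # 1-based index of the latest element

    def _pick_successor(self):
        # The replacement for element i is uniform over i+1 .. i+n
        self.next_index = self.rng.randint(self.i + 1, self.i + self.n)

    def add(self, value):
        self.i += 1
        if self.rng.random() < 1.0 / min(self.i, self.n):
            # The new element becomes the sample; drop the old chain
            self.chain.clear()
            self.chain.append((self.i, value))
            self._pick_successor()
        elif self.i == self.next_index:
            # The awaited replacement arrived: extend the chain
            self.chain.append((self.i, value))
            self._pick_successor()
        # Expire the head once it falls outside the window i-n+1 .. i
        if self.chain and self.chain[0][0] <= self.i - self.n:
            self.chain.popleft()

    def sample(self):
        return self.chain[0][1]
```

One way to obtain a sample of size k, assumed here rather than stated on the slide, is to run k independent `ChainSample` instances; that yields a sample with replacement.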
11Example
3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
12Memory Usage of Chain-Sample
- Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i+x
- T(x): [recurrence not preserved in the source]
- The expected length of each chain is less than T(n) ≤ e ≈ 2.718
- Expected memory usage is O(k)
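The slide's own formula for T(x) did not survive extraction. As a hedged reconstruction, here is one recurrence consistent with the definition above, assuming the chain length counts the element itself and that each successor index is uniform over the n positions following its predecessor:

```latex
T(0) = 1, \qquad
T(x) = 1 + \frac{1}{n}\sum_{j=1}^{x} T(x-j) \quad (1 \le x \le n)

T(x) - T(x-1) = \frac{1}{n}\,T(x-1)
\;\Longrightarrow\;
T(x) = \left(1 + \frac{1}{n}\right)^{x}

T(n) = \left(1 + \frac{1}{n}\right)^{n} < e \approx 2.718
```

Under these assumptions each chain has constant expected length, so k chains use O(k) expected memory, in agreement with the bound stated on the slide.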
13Memory Usage of Chain-Sample
- Chain consists of hops with lengths 1, ..., n
- Chain of length ≥ j can be represented by a partition of n into j ordered integer parts
- j-1 hops with sum less than n, plus a remainder
- Each such partition has probability n^(-j)
- Number of such partitions is (n choose j) < (ne/j)^j
- Probability of any such partition is small: O(n^(-c)) when j = O(k log n)
- Uses O(k log n) memory whp
14Comparison of Algorithms
Algorithm      Expected memory   High-probability memory bound
Periodic       O(k)              O(k)
Oversample     O(k log n)        O(k log n)
Chain-Sample   O(k)              O(k log n)
- Chain-sample is preferable to oversampling
- Better expected memory usage: O(k) vs. O(k log n)
- Same high-probability memory bound of O(k log n)
- No chance of failure due to sample size shrinking below k
15Timestamp-Based Windows
- Window at time t consists of all elements whose arrival timestamp is at least t' = t-m
- The number of elements in the window is not known in advance and may vary over time
- None of the previous algorithms will work
- All require windows with a constant, known number of elements
16Priority-Sample
- We describe priority-sample for k = 1
- Assign each element a randomly-chosen priority
- The element with the highest priority is the sample
- An element is ineligible if there is another element with a later timestamp and higher priority
- Only store eligible, non-expired elements
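A minimal sketch of priority-sample for k = 1 is shown below. The class name `PrioritySample` is hypothetical, priorities are assumed to be drawn uniformly from [0, 1), and timestamps are assumed to arrive in non-decreasing order.

```python
import random
from collections import deque


class PrioritySample:
    """Hypothetical sketch: one sample over the last m time units."""

    def __init__(self, m, seed=None):
        self.m = m
        self.rng = random.Random(seed)
        # (timestamp, priority, value); timestamps increase and
        # priorities decrease from the front of the deque to the back
        self.eligible = deque()

    def add(self, timestamp, value):
        priority = self.rng.random()  # assumed: uniform in [0, 1)
        # Older elements with lower priority become ineligible
        while self.eligible and self.eligible[-1][1] < priority:
            self.eligible.pop()
        self.eligible.append((timestamp, priority, value))

    def sample(self, now):
        # Drop elements whose timestamp is earlier than now - m
        while self.eligible and self.eligible[0][0] < now - self.m:
            self.eligible.popleft()
        # The front holds the highest priority among non-expired elements
        return self.eligible[0][2] if self.eligible else None
```

Because the stored priorities are decreasing, the front of the deque is always the highest-priority element in the window, and when it expires the next eligible element takes over without any access to past data.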
17Memory Usage of Priority-Sample
- Imagine that the elements were stored in a treap: totally ordered by arrival timestamp and heap-ordered by priority
- The eligible elements would represent the right spine of the treap
- We only store the eligible elements
- Therefore expected memory usage is O(log n), or O(k log n) for samples of size k
- O(k log n) is also an upper bound (whp)
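The O(log n) expectation can also be checked directly, without the treap picture. Assuming the window currently holds n elements with independent, distinct random priorities, the i-th most recent element is eligible exactly when it has the highest priority among the i most recent elements, which happens with probability 1/i:

```latex
\mathbb{E}[\#\,\text{eligible elements}]
  = \sum_{i=1}^{n} \Pr[\text{$i$-th most recent element is eligible}]
  = \sum_{i=1}^{n} \frac{1}{i}
  = H_n \le 1 + \ln n
```

Here H_n denotes the n-th harmonic number; the notation is introduced for this sketch and does not appear on the slide.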
18Conclusion
- Our contributions
- Introduced the problem of maintaining a sample over a moving window from a data stream
- Developed the Chain-Sample algorithm for this problem with sequence-based windows
- Developed the Priority-Sample algorithm for this problem with timestamp-based windows
- Future work
- What else can be computed in sublinear space over moving windows on data streams?
- For example: the next talk!