Title: Adaptive Monitoring of Bursty Data Streams
1. Adaptive Monitoring of Bursty Data Streams
- Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani
2. Monitoring Data Streams
- Lots of data arrives as continuous data streams
- Network traffic, web clickstreams, financial data feeds, sensor data, etc.
- We could load it into a database and query it
- But processing streaming data has advantages
- Timeliness
- Detect interesting events in real time
- Take appropriate action immediately
- Performance
- Avoid use of (slow) secondary storage
- Can process higher volumes of data more cheaply
3. Network Traffic Monitoring
- Security (e.g. intrusion detection)
- Network performance troubleshooting
- Traffic management (e.g. routing policy)
[Diagram: traffic monitoring points deployed across the Internet]
4. Data Streams are Bursty
- Data stream arrival rates are often
- Fast
- Irregular
- Examples
- Network traffic (IP, telephony, etc.)
- E-mail messages
- Web page access patterns
- Peak rate much higher than average rate
- 1-2 orders of magnitude
- Impractical to provision system for peak rate
5. Bursts Create Backlogs
- Arrival rate temporarily exceeds throughput
- Queues of unprocessed elements build up
- Two options when memory fills up
- Page to disk
- Slows system, lowers throughput
- Admission control (i.e. drop packets)
- Data is lost, answer quality suffers
- Neither option is very appealing
6. Two Approaches to Bursts
- Minimize memory usage
- Reduce memory used to buffer data backlog → avoid running out of memory
- Schedule query operators so as to release memory quickly during bursts
- Sometimes this is not enough
- Shed load intelligently to minimize inaccuracy
- Use approximate query answering techniques
- Some queries are harder than others to approximate
- Give hard queries more data and easy queries less
7. Outline
- Problem Formalization
- Intuition Behind the Solution
- Chain Scheduling Algorithm
- Near-Optimality of Chain Scheduling
- Experimental Results
8. Problem Formalization
- Inputs
- Data flow path(s) consisting of sequences of operators
- For each operator we know
- Execution time (per block)
- Selectivity
[Diagram: two data flow paths; each stream passes through a sequence of operators, where operator i has execution time t_i and selectivity s_i]
9. Progress Charts
[Chart: a progress chart plots block size against processing time; a block starts at (0, 1), Opt1 takes it to (1, 0.5), Opt2 to (4, 0.25), and Opt3 to (6, 0)]
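To make the progress-chart construction concrete, here is a minimal Python sketch; the operator times and selectivities are read off the chart above, and the function name is ours, not from the talk:

    # Build a progress chart from a pipeline of (execution time, selectivity)
    # pairs. Block size starts at 1 and shrinks by each operator's selectivity.
    def progress_chart(ops):
        points = [(0.0, 1.0)]
        t, size = 0.0, 1.0
        for time, sel in ops:
            t += time        # time spent running this operator on the block
            size *= sel      # fraction of tuples that survive the operator
            points.append((t, size))
        return points

    # Opt1 = (1, 0.5), Opt2 = (3, 0.5), Opt3 = (2, 0) reproduce the chart:
    print(progress_chart([(1, 0.5), (3, 0.5), (2, 0.0)]))
    # [(0.0, 1.0), (1.0, 0.5), (4.0, 0.25), (6.0, 0.0)]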
10. Problem Formalization
- Inputs
- Data flow path(s) consisting of sequences of operators
- For each operator we know
- Execution time (per block)
- Selectivity
- At each time step
- Blocks of tuples may arrive at initial input queue(s)
- Scheduler selects one block of tuples
- Selected block moves one step on its progress chart
- Objective
- Minimize peak memory usage (sum of queue sizes); a simulator sketch follows below
[Diagram: the same two data flow paths as on slide 8, with per-operator times t_i and selectivities s_i]
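The scheduling model can be captured in a few lines. A minimal simulator sketch, simplifying so that one scheduling step runs one operator to completion on one block; all names are illustrative:

    # Simulate the model: at each step, blocks may arrive, the policy picks an
    # operator with a nonempty input queue, and one block advances one operator.
    def peak_memory(ops, arrivals, policy, steps):
        queues = [[] for _ in ops]   # queues[i]: block sizes waiting at op i
        peak = 0.0
        for step in range(steps):
            for _ in range(arrivals.get(step, 0)):
                queues[0].append(1.0)            # new blocks arrive at size 1
            i = policy(queues)
            if i is not None and queues[i]:
                out = queues[i].pop(0) * ops[i][1]   # apply selectivity s_i
                if i + 1 < len(ops) and out > 0:
                    queues[i + 1].append(out)        # move to the next queue
            peak = max(peak, sum(sum(q) for q in queues))
        return peak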
11. Main Solution Idea
- Fast, selective operators release memory quickly
- Therefore, to minimize memory
- Give preference to fast, selective operators
- Postpone slow, unselective operators
- Greedy algorithm (see the sketch after this list)
- Operator priority = selectivity per unit time (s_i/t_i)
- Always schedule the highest-priority available operator
- Greedy doesn't quite work
- A good operator that follows a bad operator rarely runs
- The bad operator doesn't get scheduled
- Therefore there is no input available for the good operator
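A greedy policy for the simulator sketched earlier. Here priority is implemented as the rate at which an operator releases memory, (1 − s_i)/t_i, which is our reading of the slide's shorthand "selectivity per unit time":

    # Greedy: among operators with waiting input, run the one that frees
    # memory fastest. An operator turns a block of size m into m*s in time t,
    # releasing memory at rate (1 - s) / t per unit of input.
    def make_greedy_policy(ops):
        rate = [(1.0 - s) / t for (t, s) in ops]
        def policy(queues):
            ready = [i for i, q in enumerate(queues) if q]
            return max(ready, key=lambda i: rate[i]) if ready else None
        return policy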
12. Bad Example for Greedy
[Chart: a progress chart (block size vs. time) for Opt1–Opt3 in which a slow, unselective operator precedes a fast, selective one; annotation: tuples build up in the queue ahead of the bad operator]
13. Chain Scheduling Algorithm
[Chart: the progress chart for Opt1–Opt3 (block size vs. time) with its lower envelope drawn beneath it]
14. Chain Scheduling Algorithm
- Calculate lower envelope
- Priority = slope of lower envelope segment
- Always schedule highest-priority available operator (see the sketch below)
- Break ties using operator order in pipeline
- Favor later operators
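A minimal sketch of Chain's setup, assuming the progress_chart representation from earlier (function names are ours). The lower envelope is found by repeatedly taking the steepest descent to any later vertex, and every operator inherits its segment's steepness as its priority:

    # Compute the lower envelope of a progress chart and derive priorities.
    def lower_envelope(points):
        segs, i = [], 0
        while i < len(points) - 1:
            t0, y0 = points[i]
            # steepest (most negative) slope from vertex i to a later vertex
            j = min(range(i + 1, len(points)),
                    key=lambda k: (points[k][1] - y0) / (points[k][0] - t0))
            segs.append((i, j))
            i = j
        return segs

    def chain_priorities(points):
        prio = [0.0] * (len(points) - 1)
        for i, j in lower_envelope(points):
            (t0, y0), (t1, y1) = points[i], points[j]
            for op in range(i, j):                 # ops under this segment
                prio[op] = (y0 - y1) / (t1 - t0)   # steeper = higher priority
        return prio

    def make_chain_policy(points):
        prio = chain_priorities(points)
        def policy(queues):
            ready = [i for i, q in enumerate(queues) if q]
            # break priority ties in favor of later operators (higher index)
            return max(ready, key=lambda i: (prio[i], i)) if ready else None
        return policy

On the example chart, the first segment (Opt1) gets priority 0.5 and the envelope segment spanning Opt2 and Opt3 gets priority 0.1, matching the figure on the previous slide.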
15. FIFO Example
[Chart: the example progress chart, vertices (0, 1), (1, 0.5), (4, 0.25), (6, 0), under FIFO scheduling: each arriving block runs through Opt1–Opt3 to completion in arrival order]
16. Chain Example
[Chart: the same progress chart, vertices (0, 1), (1, 0.5), (4, 0.25), (6, 0), with its lower envelope; Chain schedules blocks using the envelope segments' slopes as priorities]
17. Memory Usage
18. Chain is Near-Optimal
Theorem: Given a system with k queries, all operator selectivities ≤ 1, let C(t) = the number of blocks of memory used by Chain at time t. At every time t, any algorithm must use at least C(t) − k memory.
- Memory usage within small constant of optimal algorithm that knows the future
- Proof sketch
- Greedy scheduling is optimal for convex progress charts
- Best operators are immediately available
- Lower envelope is convex
- Lower envelope closely approximates actual progress chart
- Details on next slide
19. Lemma: Lower Envelope is Close to Actual Progress Chart
- At most one block in the middle of each lower envelope segment
- Due to tie-breaking rule
- (Lower envelope + 1) gives upper bound on actual memory usage
- Additive error of 1 block per query
20. Performance Comparison
[Chart: memory usage over time for the compared schedulers; annotation: spike in memory due to burst]
21. Outline
- Operator Scheduling
- Load Shedding
- Motivation for Load Shedding
- Problem Formalization
- Load Shedding Algorithm
- Experimental Results
22. Why Load Shedding?
- Data rate during the burst can be too fast for the system to keep up
- Chain Scheduling helps to minimize memory usage, but CPU may be the bottleneck
- Timely, approximate answers are often more useful than delayed, exact answers
- Solution: When there is too much data to handle, process as much as possible and drop the rest
- Goal: Minimize inaccuracy in answers while keeping up with the data
23. Related Approaches
- Our focus: sliding window aggregation queries
- Goal is minimizing inaccuracy in answers
- Previous work considered related questions
- Maximize output rate from sliding window joins [Kang, Naughton, and Viglas, ICDE '03]
- Maximize a quality-of-service function for selection queries [Tatbul, Çetintemel, Zdonik, Cherniack, Stonebraker, VLDB '03]
24. Problem Setting
[Diagram: three sliding-window aggregation queries Q1, Q2, Q3, each an aggregate (Σ) over a tree of operators; the leaves are input streams S1 and S2 and a relation R]
25. Inputs to the Problem
[Diagram: the same query plans, annotated with the inputs to the problem: stream rates and per-operator execution times and selectivities]
26. Load Shedding via Random Drops
[Diagram: a stream with rate r feeding a chain of operators with (time, selectivity) pairs (t1, s1), (t2, s2), (t3, s3); a random-drop operator with sampling rate p can be inserted into the pipeline]
- Load = r·t1 + r·s1·t2 + r·s1·s2·t3
- Need: Load ≤ 1
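A small sketch of this load computation and of the sampling rate it implies when the shedder sits at the stream input (a simplification of the general placement discussed later; names are ours):

    # Load = r*t1 + r*s1*t2 + r*s1*s2*t3 + ... for ops = [(t1, s1), ...].
    def load(rate, ops):
        total, frac = 0.0, 1.0
        for t, s in ops:
            total += rate * frac * t   # tuples reaching this operator cost t
            frac *= s                  # selectivity thins the downstream flow
        return total

    # A drop operator at the input with sampling rate p scales every term,
    # so keeping up (p * Load <= 1) requires p <= 1 / Load.
    def max_sampling_rate(rate, ops):
        return min(1.0, 1.0 / load(rate, ops))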
27. Problem Statement
- Relative error is the metric of choice: |Estimate − Actual| / Actual
- Goal: Minimize the maximum relative error across queries, subject to Load ≤ 1
- Want low error with high probability
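In symbols (our restatement, with $\varepsilon_i$ the relative error of query $Q_i$ and $p$ the vector of sampling rates):

\[
\min_{p} \ \max_{i} \ \varepsilon_i(p) \quad \text{subject to} \quad \mathrm{Load}(p) \le 1 .
\]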
28. Quantifying Effects of Load Shedding
[Diagram: a query path containing two load shedders with sampling rates p1 and p2 ahead of the aggregate; the answer is scaled by 1/(p1·p2)]
- Product of sampling rates determines answer quality
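Why the 1/(p1·p2) scaling keeps a sampled SUM unbiased, in a toy sketch (illustrative code, not the system's implementation):

    import random

    # Two independent shedders with rates p1 and p2 keep a tuple with
    # probability p1*p2, so dividing by p1*p2 makes the estimate unbiased.
    def sampled_sum(values, p1, p2):
        kept = [v for v in values if random.random() < p1 * p2]
        return sum(kept) / (p1 * p2)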
29. Relating Load Shedding and Error
- Equation derived from Hoeffding bound (see the sketch below)
- Constant C_i depends on
- Variance of aggregated attribute
- Sliding window size
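The slide's equation was lost in extraction; a Hoeffding-style bound of the kind described would relate sampling rate and relative error roughly as follows (our reconstruction, not the slide's exact formula):

\[
\varepsilon_i \;\lesssim\; \frac{C_i}{\sqrt{p_i}}
\qquad \Longleftrightarrow \qquad
p_i \;\gtrsim\; \frac{C_i^2}{\varepsilon_i^2},
\]

with $C_i$ determined by the variance of the aggregated attribute and the sliding window size.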
30. Choosing Target Sampling Rates
31. Calculate Ratio of Sampling Rates
- Minimize maximum relative error ⇒ equal relative error across queries
- Express all sampling rates in terms of a common variable λ (sketch below)
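Given the Hoeffding-style relation above, equal error across queries fixes the ratios of the sampling rates; a two-line sketch, assuming p_i ∝ C_i² with λ as the common scale chosen later so that Load ≤ 1:

    # Equal relative error: C_i / sqrt(p_i) constant  =>  p_i proportional
    # to C_i**2. lam is the shared scale factor tuned to satisfy Load <= 1.
    def target_rates(C, lam):
        return [lam * c * c for c in C]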
32. Placing Load Shedders
[Diagram: two queries share a prefix of operators; one query's target sampling rate is 0.8λ, the other's is 0.6λ. The shared prefix gets a shedder with rate 0.8λ; the branch needing 0.6λ gets a second shedder with rate 0.75 = 0.6λ / 0.8λ, so the product of rates along each path equals that query's target]
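A sketch of this placement rule on an operator tree (the tree encoding and function names are our assumptions): each node's effective target is the largest target below it, and the local shedder rate is that target divided by what upstream shedders already provide.

    # Place shedders so the product of rates along each root-to-query path
    # equals that query's target sampling rate.
    def effective_target(node):
        if not node['children']:
            return node['target']        # a query sink carries its target
        return max(effective_target(c) for c in node['children'])

    def place_shedders(node, upstream=1.0, path='root'):
        local = effective_target(node) / upstream
        if local < 1.0:
            print(f'{path}: shedder with sampling rate {local:.3g}')
        for i, child in enumerate(node['children']):
            place_shedders(child, upstream * local, f'{path}/{i}')

    lam = 0.5   # illustrative value of the common scale factor
    tree = {'target': None, 'children': [
        {'target': 0.8 * lam, 'children': []},   # query with target 0.8λ
        {'target': 0.6 * lam, 'children': []},   # query with target 0.6λ
    ]}
    place_shedders(tree)
    # root: shedder with sampling rate 0.4
    # root/1: shedder with sampling rate 0.75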
33. Experimental Results
34. Experimental Results
35. Conclusion
- Fluctuating data stream arrival rates create challenges
- Temporary system overload during bursts
- Chain scheduling helps minimize memory usage
- Main idea: give priority to fast, selective operators
- Careful load shedding preserves answer quality
- Relate target sampling rates for all queries
- Place random drop operators based on target sampling rates
- Adjust sampling rates to achieve desired load
36. Thanks for Listening!
- http://www-db.stanford.edu/stream