Title: Adaptive Monitoring of Bursty Data Streams
1. Adaptive Monitoring of Bursty Data Streams
- Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani
2. Monitoring Data Streams
- Lots of data arrives as continuous data streams
- Network traffic, web clickstreams, financial data feeds, sensor data, etc.
- We could load it into a database and query it
- But processing streaming data has advantages
- Timeliness
- Detect interesting events in real time
- Take appropriate action immediately
- Performance
- Avoid use of (slow) secondary storage
- Can process higher volumes of data more cheaply
3. Network Traffic Monitoring
- Security (e.g. intrusion detection)
- Network performance troubleshooting
- Traffic management (e.g. routing policy)
[Diagram: traffic monitoring points deployed across the Internet]
4. Data Streams are Bursty
- Data stream arrival rates are often
- Fast
- Irregular
- Examples
- Network traffic (IP, telephony, etc.)
- E-mail messages
- Web page access patterns
- Peak rate much higher than average rate
- 1-2 orders of magnitude
- Impractical to provision system for peak rate
5. Bursts Create Backlogs
- Arrival rate temporarily exceeds throughput
- Queues of unprocessed elements build up
- Two options when memory fills up
- Page to disk
- Slows system, lowers throughput
- Admission control (i.e. drop packets)
- Data is lost, answer quality suffers
- Neither option is very appealing
6. Two Approaches to Bursts
- Minimize memory usage
- Reduce memory used to buffer data backlog → avoid running out of memory
- Schedule query operators so as to release memory quickly during bursts
- Sometimes this is not enough
- Shed load intelligently to minimize inaccuracy
- Use approximate query answering techniques
- Some queries are harder than others to approximate
- Give hard queries more data and easy queries less
7. Outline
- Problem Formalization
- Intuition Behind the Solution
- Chain Scheduling Algorithm
- Near-Optimality of Chain Scheduling
- Experimental Results
8. Problem Formalization
- Inputs
- Data flow path(s) consisting of sequences of operators
- For each operator we know
- Execution time (per block)
- Selectivity
[Diagram: two data flow paths; each stream passes through a sequence of operators, where operator i has execution time t_i and selectivity s_i]
9. Progress Charts
[Chart: a progress chart plots block size against processing time; a block starts at (0, 1), Opt1 takes it to (1, 0.5), Opt2 to (4, 0.25), and Opt3 to (6, 0)]
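To make the progress-chart construction concrete, here is a minimal Python sketch; the operator times and selectivities are read off the chart above, and the function name is ours, not from the talk:

    # Build a progress chart from a pipeline of (execution time, selectivity)
    # pairs. Block size starts at 1 and shrinks by each operator's selectivity.
    def progress_chart(ops):
        points = [(0.0, 1.0)]
        t, size = 0.0, 1.0
        for time, sel in ops:
            t += time        # time spent running this operator on the block
            size *= sel      # fraction of tuples that survive the operator
            points.append((t, size))
        return points

    # Opt1 = (1, 0.5), Opt2 = (3, 0.5), Opt3 = (2, 0) reproduce the chart:
    print(progress_chart([(1, 0.5), (3, 0.5), (2, 0.0)]))
    # [(0.0, 1.0), (1.0, 0.5), (4.0, 0.25), (6.0, 0.0)]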
10. Problem Formalization
- Inputs
- Data flow path(s) consisting of sequences of operators
- For each operator we know
- Execution time (per block)
- Selectivity
- At each time step
- Blocks of tuples may arrive at initial input queue(s)
- Scheduler selects one block of tuples
- Selected block moves one step on its progress chart
- Objective
- Minimize peak memory usage (sum of queue sizes); a simulator sketch follows below
[Diagram: the same two data flow paths as on slide 8, with per-operator times t_i and selectivities s_i]
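The scheduling model can be captured in a few lines. A minimal simulator sketch, simplifying so that one scheduling step runs one operator to completion on one block; all names are illustrative:

    # Simulate the model: at each step, blocks may arrive, the policy picks an
    # operator with a nonempty input queue, and one block advances one operator.
    def peak_memory(ops, arrivals, policy, steps):
        queues = [[] for _ in ops]   # queues[i]: block sizes waiting at op i
        peak = 0.0
        for step in range(steps):
            for _ in range(arrivals.get(step, 0)):
                queues[0].append(1.0)            # new blocks arrive at size 1
            i = policy(queues)
            if i is not None and queues[i]:
                out = queues[i].pop(0) * ops[i][1]   # apply selectivity s_i
                if i + 1 < len(ops) and out > 0:
                    queues[i + 1].append(out)        # move to the next queue
            peak = max(peak, sum(sum(q) for q in queues))
        return peak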
11. Main Solution Idea
- Fast, selective operators release memory quickly
- Therefore, to minimize memory
- Give preference to fast, selective operators
- Postpone slow, unselective operators
- Greedy algorithm (see the sketch after this list)
- Operator priority = selectivity per unit time (s_i/t_i)
- Always schedule the highest-priority available operator
- Greedy doesn't quite work
- A good operator that follows a bad operator rarely runs
- The bad operator doesn't get scheduled
- Therefore there is no input available for the good operator
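A greedy policy for the simulator sketched earlier. Here priority is implemented as the rate at which an operator releases memory, (1 − s_i)/t_i, which is our reading of the slide's shorthand "selectivity per unit time":

    # Greedy: among operators with waiting input, run the one that frees
    # memory fastest. An operator turns a block of size m into m*s in time t,
    # releasing memory at rate (1 - s) / t per unit of input.
    def make_greedy_policy(ops):
        rate = [(1.0 - s) / t for (t, s) in ops]
        def policy(queues):
            ready = [i for i, q in enumerate(queues) if q]
            return max(ready, key=lambda i: rate[i]) if ready else None
        return policy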
12. Bad Example for Greedy
[Chart: a progress chart (block size vs. time) for Opt1–Opt3 in which a slow, unselective operator precedes a fast, selective one; annotation: tuples build up in the queue ahead of the bad operator]
13. Chain Scheduling Algorithm
[Chart: the progress chart for Opt1–Opt3 (block size vs. time) with its lower envelope drawn beneath it]
14. Chain Scheduling Algorithm
- Calculate lower envelope
- Priority = slope of lower envelope segment
- Always schedule highest-priority available operator (see the sketch below)
- Break ties using operator order in pipeline
- Favor later operators
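A minimal sketch of Chain's setup, assuming the progress_chart representation from earlier (function names are ours). The lower envelope is found by repeatedly taking the steepest descent to any later vertex, and every operator inherits its segment's steepness as its priority:

    # Compute the lower envelope of a progress chart and derive priorities.
    def lower_envelope(points):
        segs, i = [], 0
        while i < len(points) - 1:
            t0, y0 = points[i]
            # steepest (most negative) slope from vertex i to a later vertex
            j = min(range(i + 1, len(points)),
                    key=lambda k: (points[k][1] - y0) / (points[k][0] - t0))
            segs.append((i, j))
            i = j
        return segs

    def chain_priorities(points):
        prio = [0.0] * (len(points) - 1)
        for i, j in lower_envelope(points):
            (t0, y0), (t1, y1) = points[i], points[j]
            for op in range(i, j):                 # ops under this segment
                prio[op] = (y0 - y1) / (t1 - t0)   # steeper = higher priority
        return prio

    def make_chain_policy(points):
        prio = chain_priorities(points)
        def policy(queues):
            ready = [i for i, q in enumerate(queues) if q]
            # break priority ties in favor of later operators (higher index)
            return max(ready, key=lambda i: (prio[i], i)) if ready else None
        return policy

On the example chart, the first segment (Opt1) gets priority 0.5 and the envelope segment spanning Opt2 and Opt3 gets priority 0.1, matching the figure on the previous slide.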
15. FIFO Example
[Chart: the example progress chart, vertices (0, 1), (1, 0.5), (4, 0.25), (6, 0), under FIFO scheduling: each arriving block runs through Opt1–Opt3 to completion in arrival order]
16. Chain Example
[Chart: the same progress chart, vertices (0, 1), (1, 0.5), (4, 0.25), (6, 0), with its lower envelope; Chain schedules blocks using the envelope segments' slopes as priorities]
17. Memory Usage
18. Chain is Near-Optimal
Theorem: Given a system with k queries, all operator selectivities ≤ 1, let C(t) = the number of blocks of memory used by Chain at time t. At every time t, any algorithm must use at least C(t) − k memory.
- Memory usage within small constant of optimal algorithm that knows the future
- Proof sketch
- Greedy scheduling is optimal for convex progress charts
- Best operators are immediately available
- Lower envelope is convex
- Lower envelope closely approximates actual progress chart
- Details on next slide
19. Lemma: Lower Envelope is Close to Actual Progress Chart
- At most one block in the middle of each lower envelope segment
- Due to tie-breaking rule
- (Lower envelope + 1) gives upper bound on actual memory usage
- Additive error of 1 block per query
20. Performance Comparison
[Chart: memory usage over time for the compared schedulers; annotation: spike in memory due to burst]
21. Outline
- Operator Scheduling
- Load Shedding
- Motivation for Load Shedding
- Problem Formalization
- Load Shedding Algorithm
- Experimental Results
22. Why Load Shedding?
- Data rate during the burst can be too fast for the system to keep up
- Chain Scheduling helps to minimize memory usage, but CPU may be the bottleneck
- Timely, approximate answers are often more useful than delayed, exact answers
- Solution: When there is too much data to handle, process as much as possible and drop the rest
- Goal: Minimize inaccuracy in answers while keeping up with the data
23. Related Approaches
- Our focus: sliding window aggregation queries
- Goal is minimizing inaccuracy in answers
- Previous work considered related questions
- Maximize output rate from sliding window joins [Kang, Naughton, and Viglas, ICDE '03]
- Maximize a quality-of-service function for selection queries [Tatbul, Çetintemel, Zdonik, Cherniack, Stonebraker, VLDB '03]
24. Problem Setting
[Diagram: three sliding-window aggregation queries Q1, Q2, Q3, each an aggregate (Σ) over a tree of operators; the leaves are input streams S1 and S2 and a relation R]
25. Inputs to the Problem
[Diagram: the same query plans, annotated with the inputs to the problem: stream rates and per-operator execution times and selectivities]
26. Load Shedding via Random Drops
[Diagram: a stream with rate r feeding a chain of operators with (time, selectivity) pairs (t1, s1), (t2, s2), (t3, s3); a random-drop operator with sampling rate p can be inserted into the pipeline]
- Load = r·t1 + r·s1·t2 + r·s1·s2·t3
- Need: Load ≤ 1
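A small sketch of this load computation and of the sampling rate it implies when the shedder sits at the stream input (a simplification of the general placement discussed later; names are ours):

    # Load = r*t1 + r*s1*t2 + r*s1*s2*t3 + ... for ops = [(t1, s1), ...].
    def load(rate, ops):
        total, frac = 0.0, 1.0
        for t, s in ops:
            total += rate * frac * t   # tuples reaching this operator cost t
            frac *= s                  # selectivity thins the downstream flow
        return total

    # A drop operator at the input with sampling rate p scales every term,
    # so keeping up (p * Load <= 1) requires p <= 1 / Load.
    def max_sampling_rate(rate, ops):
        return min(1.0, 1.0 / load(rate, ops))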
27. Problem Statement
- Relative error is the metric of choice: |Estimate − Actual| / Actual
- Goal: Minimize the maximum relative error across queries, subject to Load ≤ 1
- Want low error with high probability
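In symbols (our restatement, with $\varepsilon_i$ the relative error of query $Q_i$ and $p$ the vector of sampling rates):

\[
\min_{p} \ \max_{i} \ \varepsilon_i(p) \quad \text{subject to} \quad \mathrm{Load}(p) \le 1 .
\]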
28. Quantifying Effects of Load Shedding
[Diagram: a query path containing two load shedders with sampling rates p1 and p2 ahead of the aggregate; the answer is scaled by 1/(p1·p2)]
- Product of sampling rates determines answer quality
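Why the 1/(p1·p2) scaling keeps a sampled SUM unbiased, in a toy sketch (illustrative code, not the system's implementation):

    import random

    # Two independent shedders with rates p1 and p2 keep a tuple with
    # probability p1*p2, so dividing by p1*p2 makes the estimate unbiased.
    def sampled_sum(values, p1, p2):
        kept = [v for v in values if random.random() < p1 * p2]
        return sum(kept) / (p1 * p2)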
29. Relating Load Shedding and Error
- Equation derived from Hoeffding bound (see the sketch below)
- Constant C_i depends on
- Variance of aggregated attribute
- Sliding window size
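The slide's equation was lost in extraction; a Hoeffding-style bound of the kind described would relate sampling rate and relative error roughly as follows (our reconstruction, not the slide's exact formula):

\[
\varepsilon_i \;\lesssim\; \frac{C_i}{\sqrt{p_i}}
\qquad \Longleftrightarrow \qquad
p_i \;\gtrsim\; \frac{C_i^2}{\varepsilon_i^2},
\]

with $C_i$ determined by the variance of the aggregated attribute and the sliding window size.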
30. Choosing Target Sampling Rates
31. Calculate Ratio of Sampling Rates
- Minimize maximum relative error ⇒ equal relative error across queries
- Express all sampling rates in terms of a common variable λ (sketch below)
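Given the Hoeffding-style relation above, equal error across queries fixes the ratios of the sampling rates; a two-line sketch, assuming p_i ∝ C_i² with λ as the common scale chosen later so that Load ≤ 1:

    # Equal relative error: C_i / sqrt(p_i) constant  =>  p_i proportional
    # to C_i**2. lam is the shared scale factor tuned to satisfy Load <= 1.
    def target_rates(C, lam):
        return [lam * c * c for c in C]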
32. Placing Load Shedders
[Diagram: two queries share a prefix of operators; one query's target sampling rate is 0.8λ, the other's is 0.6λ. The shared prefix gets a shedder with rate 0.8λ; the branch needing 0.6λ gets a second shedder with rate 0.75 = 0.6λ / 0.8λ, so the product of rates along each path equals that query's target]
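A sketch of this placement rule on an operator tree (the tree encoding and function names are our assumptions): each node's effective target is the largest target below it, and the local shedder rate is that target divided by what upstream shedders already provide.

    # Place shedders so the product of rates along each root-to-query path
    # equals that query's target sampling rate.
    def effective_target(node):
        if not node['children']:
            return node['target']        # a query sink carries its target
        return max(effective_target(c) for c in node['children'])

    def place_shedders(node, upstream=1.0, path='root'):
        local = effective_target(node) / upstream
        if local < 1.0:
            print(f'{path}: shedder with sampling rate {local:.3g}')
        for i, child in enumerate(node['children']):
            place_shedders(child, upstream * local, f'{path}/{i}')

    lam = 0.5   # illustrative value of the common scale factor
    tree = {'target': None, 'children': [
        {'target': 0.8 * lam, 'children': []},   # query with target 0.8λ
        {'target': 0.6 * lam, 'children': []},   # query with target 0.6λ
    ]}
    place_shedders(tree)
    # root: shedder with sampling rate 0.4
    # root/1: shedder with sampling rate 0.75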
33. Experimental Results
34. Experimental Results
35. Conclusion
- Fluctuating data stream arrival rates create challenges
- Temporary system overload during bursts
- Chain scheduling helps minimize memory usage
- Main idea: give priority to fast, selective operators
- Careful load shedding preserves answer quality
- Relate target sampling rates for all queries
- Place random drop operators based on target sampling rates
- Adjust sampling rates to achieve desired load
36. Thanks for Listening!
- http://www-db.stanford.edu/stream