1
Adaptive Monitoring of Bursty Data Streams
  • Brian Babcock, Shivnath Babu, Mayur Datar, and
    Rajeev Motwani

2
Monitoring Data Streams
  • Lots of data arrives as continuous data streams
  • Network traffic, web clickstreams, financial data
    feeds, sensor data, etc.
  • We could load it into a database and query it
  • But processing streaming data has advantages
  • Timeliness
  • Detect interesting events in real time
  • Take appropriate action immediately
  • Performance
  • Avoid use of (slow) secondary storage
  • Can process higher volumes of data more cheaply

3
Network Traffic Monitoring
  • Security (e.g. intrusion detection)
  • Network performance troubleshooting
  • Traffic management (e.g. routing policy)

Internet
4
Data Streams are Bursty
  • Data stream arrival rates are often
  • Fast
  • Irregular
  • Examples
  • Network traffic (IP, telephony, etc.)
  • E-mail messages
  • Web page access patterns
  • Peak rate much higher than average rate
  • 1-2 orders of magnitude
  • Impractical to provision system for peak rate

5
Bursts Create Backlogs
  • Arrival rate temporarily exceeds throughput
  • Queues of unprocessed elements build up
  • Two options when memory fills up
  • Page to disk
  • Slows system, lowers throughput
  • Admission control (i.e. drop packets)
  • Data is lost, answer quality suffers
  • Neither option is very appealing.

6
Two Approaches to Bursts
  • Minimize memory usage
  • Reduce memory used to buffer the data backlog →
    avoid running out of memory
  • Schedule query operators so as to release memory
    quickly during bursts
  • Sometimes this is not enough
  • Shed load intelligently to minimize inaccuracy
  • Use approximate query answering techniques
  • Some queries are harder than others to
    approximate
  • Give hard queries more data and easy queries less

7
Outline
  • Operator Scheduling
  • Problem Formalization
  • Intuition Behind the Solution
  • Chain Scheduling Algorithm
  • Near-Optimality of Chain Scheduling
  • Experimental Results
  • Load Shedding

8
Problem Formalization
  • Inputs
  • Data flow path(s) consisting of sequences of
    operators
  • For each operator we know
  • Execution time (per block)
  • Selectivity

[Diagram: two stream pipelines; each operator annotated with its time (t1..t4) and selectivity (s1..s4)]
9
Progress charts
[Progress chart, block size vs. time: Opt1 takes a block from (0,1) to (1,0.5), Opt2 from (1,0.5) to (4,0.25), Opt3 from (4,0.25) to (6,0)]
10
Problem Formalization
  • Inputs
  • Data flow path(s) consisting of sequences of
    operators
  • For each operator we know
  • Execution time (per block)
  • Selectivity
  • At each time step
  • Blocks of tuples may arrive at initial input
    queue(s)
  • Scheduler selects one block of tuples
  • Selected block moves one step on its progress
    chart
  • Objective
  • Minimize peak memory usage (sum of queue sizes)

[Diagram as on slide 8: two stream pipelines with operator times t1..t4 and selectivities s1..s4]
11
Main Solution Idea
  • Fast, selective operators release memory quickly
  • Therefore, to minimize memory
  • Give preference to fast, selective operators
  • Postpone slow, unselective operators
  • Greedy algorithm
  • Operator priority = selectivity per unit time
    (si/ti)
  • Always schedule the highest-priority available
    operator
  • Greedy doesn't quite work
  • A good operator that follows a bad operator
    rarely runs
  • The bad operator doesn't get scheduled
  • Therefore there is no input available for the
    good operator

12
Bad Example for Greedy
[Progress chart (Opt1, Opt2, Opt3); annotation: tuples build up here, at the input of the slow operator]
13
Chain Scheduling Algorithm
[Progress chart (Opt1, Opt2, Opt3) with the lower envelope drawn in]
14
Chain Scheduling Algorithm
  • Calculate lower envelope
  • Priority = slope of lower envelope segment
  • Always schedule highest-priority available
    operator
  • Break ties using operator order in pipeline
  • Favor later operators
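The envelope computation described above can be sketched in a few lines. The function below is a generic illustration, not the paper's implementation; the point list follows the running example's progress chart ((0,1), (1,0.5), (4,0.25), (6,0)).

```python
def lower_envelope(points):
    """Lower envelope of a progress chart.

    points: [(cumulative_time, block_size), ...] for one operator path,
    starting at (0, initial_size) and ending when the block exits.
    Returns one (start_index, end_index, priority) tuple per envelope
    segment, where priority is the memory-release rate (the negated
    slope of the segment)."""
    segments, i = [], 0
    while i < len(points) - 1:
        # Walk to the later point reached by the steepest downward slope.
        best_j, best_slope = i + 1, None
        for j in range(i + 1, len(points)):
            slope = (points[j][1] - points[i][1]) / (points[j][0] - points[i][0])
            if best_slope is None or slope < best_slope:
                best_slope, best_j = slope, j
        segments.append((i, best_j, -best_slope))
        i = best_j
    return segments

# Running example: Opt1 gets its own segment (priority 0.5), while Opt2
# and Opt3 fall under a single envelope segment (priority 0.1) and are
# therefore scheduled together, ties favoring the later operator.
chart = [(0, 1.0), (1, 0.5), (4, 0.25), (6, 0.0)]
segments = lower_envelope(chart)
```

On the example chart this yields segments (0,1) with priority 0.5 and (1,3) with priority 0.1, matching the envelope drawn on the Chain slides.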

15
FIFO Example
[Progress chart with points (0,1), (1,0.5), (4,0.25), (6,0); FIFO processes blocks in arrival order]
16
Chain Example
[Same progress chart with its lower envelope; Chain schedules by envelope-segment priority]
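The FIFO and Chain examples can be replayed in code. The simulator below is a hypothetical sketch of the model on these slides (each tick, new blocks may arrive and the scheduler advances one block by one unit of work; memory is the sum of current block sizes), using the running example's operators and an assumed burst of four blocks arriving at ticks 0 through 3.

```python
SIZES = [1.0, 0.5, 0.25, 0.0]   # block size after completing 0..3 operators
TIMES = [1, 3, 2]               # processing time per block for Opt1..Opt3
PRIORITY = [0.5, 0.1, 0.1]      # from the lower envelope segments

def simulate(policy, arrivals, horizon):
    """Peak memory under one scheduling policy ('fifo' or 'chain')."""
    blocks = []          # each block: [operator_index, work_done, arrival_order]
    peak, order = 0.0, 0
    for t in range(horizon):
        for a in arrivals:
            if a == t:                               # burst arrivals
                blocks.append([0, 0, order])
                order += 1
        if blocks:
            if policy == 'fifo':
                b = min(blocks, key=lambda blk: blk[2])  # oldest block first
            else:
                # Chain: highest envelope priority; ties favor the
                # operator later in the pipeline.
                b = max(blocks, key=lambda blk: (PRIORITY[blk[0]], blk[0]))
            b[1] += 1
            if b[1] == TIMES[b[0]]:                  # operator finished
                b[0], b[1] = b[0] + 1, 0
                if b[0] == len(TIMES):               # block leaves the system
                    blocks.remove(b)
        peak = max(peak, sum(SIZES[blk[0]] for blk in blocks))
    return peak
```

On this burst, Chain's peak memory is 2.0 blocks versus 3.25 for FIFO: FIFO lets full-size blocks pile up at the input while Chain runs each new block through the fast, selective Opt1 first.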
17
Memory Usage
18
Chain is Near-Optimal
Theorem: Given a system with k queries, all
operator selectivities ≤ 1. Let C(t) = number of
blocks of memory used by Chain at time t. At
every time t, any algorithm must use ≥ C(t) - k
memory.
  • Memory usage within small constant of optimal
    algorithm that knows the future
  • Proof sketch
  • Greedy scheduling is optimal for convex progress
    charts
  • Best operators are immediately available
  • Lower envelope is convex
  • Lower envelope closely approximates actual
    progress chart
  • Details on next slide

19
Lemma: Lower Envelope is Close to Actual Progress Chart
  • At most one block in the middle of each lower
    envelope segment
  • Due to tie-breaking rule
  • (Lower envelope + 1) gives an upper bound on
    actual memory usage
  • Additive error of 1 block per query

20
Performance Comparison
[Chart: memory usage over time; annotation: spike in memory due to burst]
21
Outline
  • Operator Scheduling
  • Load Shedding
  • Motivation for Load Shedding
  • Problem Formalization
  • Load Shedding Algorithm
  • Experimental Results

22
Why Load Shedding?
  • Data rate during the burst can be too fast for
    the system to keep up
  • Chain Scheduling helps to minimize memory usage,
    but CPU may be the bottleneck
  • Timely, approximate answers are often more useful
    than delayed, exact answers
  • Solution: When there is too much data to handle,
    process as much as possible and drop the rest
  • Goal: Minimize inaccuracy in answers while
    keeping up with the data

23
Related Approaches
  • Our focus: sliding window aggregation queries
  • Goal is minimizing inaccuracy in answers
  • Previous work considered related questions
  • Maximize output rate from sliding window
    joins [Kang, Naughton, and Viglas, ICDE '03]
  • Maximize quality-of-service function for
    selection queries [Tatbul, Cetintemel, Zdonik,
    Cherniack, Stonebraker, VLDB '03]

24
Problem Setting
[Query plan diagram: three sliding-window aggregation queries Q1, Q2, Q3 over streams S1, S2 and a relation R]
25
Inputs to the Problem
[The same query plan diagram, annotated with the problem inputs]
26
Load Shedding via Random Drops
[Diagram: stream with rate r; operators with (time, selectivity) pairs (t1, s1), (t2, s2), (t3, s3); random-drop operator with sampling rate p]
Load = r·t1 + r·s1·t2 + r·s1·s2·t3
Need: Load ≤ 1
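The load expression on this slide (Load = r·t1 + r·s1·t2 + r·s1·s2·t3) generalizes to any pipeline: each operator sees the stream thinned by all upstream selectivities. A small sketch (function and parameter names are mine, not from the paper):

```python
def load(rate, ops):
    """CPU load of one pipeline as a fraction of capacity.

    rate: stream arrival rate (blocks per unit time)
    ops:  [(time_per_block, selectivity), ...] in pipeline order.
    Load = r*t1 + r*s1*t2 + r*s1*s2*t3 + ...
    Inserting a random-drop operator with sampling rate p is modeled as
    an extra operator with selectivity p and (near) zero cost."""
    passing, total = 1.0, 0.0
    for t, s in ops:
        total += rate * passing * t   # fraction of stream reaching this op
        passing *= s
    return total
```

For example, load(2.0, [(0.2, 0.5), (0.3, 0.5)]) = 2·0.2 + 2·0.5·0.3 = 0.7, and keeping up with the stream requires Load ≤ 1.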
27
Problem Statement
  • Relative error is the metric of choice:
    |Estimate - Actual| / Actual
  • Goal: Minimize the maximum relative error across
    queries, subject to Load ≤ 1
  • Want low error with high probability

28
Quantifying Effects of Load Shedding
Scale answer by 1/(p1·p2)
[Diagram: one query's pipeline with two sampling operators, rates p1 and p2]
  • Product of sampling rates determines answer
    quality

29
Relating Load Shedding and Error
  • Equation derived from Hoeffding bound
  • Constant Ci depends on
  • Variance of aggregated attribute
  • Sliding window size

30
Choosing Target Sampling Rates
31
Calculate Ratio of Sampling Rates
  • Minimize maximum relative error → equal relative
    error across queries
  • Express all sampling rates in terms of a common
    variable λ
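Under the error model ε_i ≈ √(C_i/p_i) from the previous slide, equalizing relative error forces p_i ∝ C_i, and if load is linear in the target rates the common scale λ comes straight from Load = 1. A sketch with hypothetical inputs (the paper's actual procedure may differ):

```python
def target_rates(C, load_per_rate):
    """Equal-error target sampling rates, saturating Load = 1.

    C:             per-query constants (variance / window dependent);
                   equal relative error  =>  p_i = C_i * lam.
    load_per_rate: a_i, CPU load contributed per unit of sampling rate
                   for query i, so Load(lam) = sum(a_i * C_i) * lam.
    Rates are capped at 1.0 (no query can be sampled above 100%)."""
    lam = 1.0 / sum(a * c for a, c in zip(load_per_rate, C))
    return [min(1.0, c * lam) for c in C]
```

For C = [0.8, 0.6] with unit load coefficients, λ = 1/1.4 and the rates come out in the 0.8 : 0.6 ratio with total load exactly 1.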

32
Placing Load Shedders
[Diagram: pipeline with two load shedders; cumulative targets 0.8λ (upstream) and 0.6λ (downstream)]
First shedder: Sampling Rate = 0.8λ
Second shedder: Sampling Rate = 0.75 = 0.6λ / 0.8λ
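The division on this slide (0.75 = 0.6λ/0.8λ) generalizes: each shedder's own drop rate is its point's cumulative target divided by the product of the rates already applied upstream. A sketch (function name and interface are mine):

```python
def shedder_rates(cumulative_targets, lam):
    """Per-shedder sampling rates along one pipeline path.

    cumulative_targets: desired *cumulative* sampling rates at successive
    points, as multiples of the common variable lam (outermost first),
    e.g. [0.8, 0.6] for targets 0.8*lam then 0.6*lam.
    Each shedder drops just enough that the product of all rates applied
    so far equals that point's target."""
    rates, upstream = [], 1.0
    for target in cumulative_targets:
        r = target * lam / upstream   # remaining thinning still needed
        rates.append(r)
        upstream *= r
    return rates
```

With λ = 1 this reproduces the slide: the first shedder samples at 0.8 and the second at 0.6/0.8 = 0.75.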
33
Experimental Results
34
Experimental Results
35
Conclusion
  • Fluctuating data stream arrival rates create
    challenges
  • Temporary system overload during bursts
  • Chain scheduling helps minimize memory usage
  • Main idea: give priority to fast, selective
    operators
  • Careful load shedding preserves answer quality
  • Relate target sampling rates for all queries
  • Place random drop operators based on target
    sampling rates
  • Adjust sampling rates to achieve desired load

36
Thanks for Listening!
  • http://www-db.stanford.edu/stream