Operator Scheduling and Approximate Answers in Data Stream Processing

Title: Operator Scheduling and Approximate Answers in Data Stream Processing


1
Operator Scheduling and Approximate Answers in
Data Stream Processing
  • Shariq Rizvi
  • rizvi_at_cse.iitb.ac.in

2
Why the Data Stream Model?
  • Continuously arriving data
  • Stock prices
  • Network monitors
  • Log-records or click-streams
  • Continuous Queries
  • What was the highest closing price for ATT
    shares in the last week?
  • Need extensions to relational model

3
Challenges
  • Multiple, continuous, unbounded, rapid data
    streams
  • Continuous, timely answers to long-running
    queries: what are the semantics?
  • Formal query language
  • Support for streams and relations
  • Limited resources: memory, CPU, network
    bandwidth

4
Motwani et al., CIDR 2003
  • CQL: an extension to SQL
  • Select Count(*)
  • From Requests S [Range 1 Day Preceding] sample(10)
  • Where S.Domain = 'stanford.edu'
  • The result of a query Q at time t is obtained
    by taking all relations at time t, all streams up
    to time t converted to relations by their window
    specifications, and applying conventional
    relational semantics. The result is a stream if
    the outermost operator is Istream or Dstream,
    otherwise it remains as a relation.
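
The semantics quoted above can be illustrated with a small Python sketch: the stream is first converted to a relation by its window at time t, and an ordinary filter and count are then applied. The tuple layout, function names, and sample data are assumptions made for illustration, and the sample(10) clause of the query is not modeled.

```python
from datetime import datetime, timedelta

def window_relation(stream, t, range_):
    """Relation at time t: tuples whose timestamp lies in (t - range_, t]."""
    return [row for ts, row in stream if t - range_ < ts <= t]

def count_requests(stream, t, domain="stanford.edu", range_=timedelta(days=1)):
    """Count(*) over the windowed relation, filtered on Domain."""
    relation = window_relation(stream, t, range_)
    return sum(1 for row in relation if row["Domain"] == domain)

# Hypothetical request stream: (timestamp, tuple) pairs in arrival order.
t0 = datetime(2003, 1, 1, 12, 0)
requests = [
    (t0, {"Domain": "stanford.edu"}),
    (t0 + timedelta(hours=6), {"Domain": "iitb.ac.in"}),
    (t0 + timedelta(hours=20), {"Domain": "stanford.edu"}),
    (t0 + timedelta(hours=30), {"Domain": "stanford.edu"}),
]

now = t0 + timedelta(hours=30)
result = count_requests(requests, now)  # only the last two requests fall in the window
```

Re-evaluating count_requests as t advances gives the continuous answer; a real system would maintain the count incrementally rather than rescanning the window.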

5
Streams and Relations
[Diagram labels: Range 1 Day Preceding (stream to relation); Insert Stream, Delete Stream (relation to stream)]
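
A minimal sketch of the relation-to-stream direction, assuming the usual reading of the two operators: the insert stream at an instant holds the tuples that are in the relation now but were not at the previous instant, and the delete stream holds the opposite difference. The function names and the snapshot data are hypothetical.

```python
from collections import Counter

def istream(prev, curr):
    """Tuples present now but not at the previous instant (bag difference)."""
    return sorted((Counter(curr) - Counter(prev)).elements())

def dstream(prev, curr):
    """Tuples present at the previous instant but not now (bag difference)."""
    return sorted((Counter(prev) - Counter(curr)).elements())

# Hypothetical snapshots of a relation at four consecutive instants.
snapshots = [[], ["a"], ["a", "b"], ["b"]]

pairs = list(zip(snapshots, snapshots[1:]))
inserts = [istream(p, c) for p, c in pairs]
deletes = [dstream(p, c) for p, c in pairs]
```
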
6
Query Processing
[Diagram labels: Queue, Synopsis]
7
Issues
  • Exploiting constraints on streams
  • Orders join Fulfillments
  • clustered-arrival, ordered-arrival
  • Scheduling of Operators

[Diagram labels: O1, O2]
8
Approximations
  • Static
  • Window reduction
  • Sampling rate reduction
  • Dynamic
  • Synopsis compression
  • Load shedding
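
As a rough illustration of trading accuracy for resources, the sketch below estimates a count by keeping each tuple with a fixed probability and scaling the sampled count back up. This is a generic sampling estimator, not a method taken from the slides; the function name, its parameters, and the input data are assumptions.

```python
import random

def sampled_count(stream, predicate, rate, rng):
    """Estimate how many tuples satisfy `predicate` by keeping each tuple
    with probability `rate` and scaling the sampled count by 1 / rate."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    kept = sum(1 for x in stream if rng.random() < rate and predicate(x))
    return kept / rate

def is_even(x):
    return x % 2 == 0

# Hypothetical stream of 1000 integers, half of which satisfy the predicate.
exact = sampled_count(range(1000), is_even, 1.0, random.Random(0))
approx = sampled_count(range(1000), is_even, 0.1, random.Random(0))
```

With rate 1.0 every tuple is kept and the answer is exact; lower rates do less work per tuple at the cost of a noisier estimate.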

9
Chain: Operator Scheduling for Memory
Minimization (SIGMOD 2003)
  • Near-optimal memory usage for single-stream
    queries (SPJ)
  • Does well for sliding-window joins over multiple
    streams
  • Key Concepts
  • Operator Path of a tuple
  • Selectivity of an operator
  • Progress Chart of tuple size

10
A Progress Chart
  • (t,s)

11
Another Progress Chart
  • Lower Envelope
  • The concept
  • A property

12
Scheduling Single Stream Queries
  • n distinct operator paths
  • P1, ..., Pn: progress charts
  • Lower envelopes of P1, ..., Pn
  • Chain: At any time instant, consider all tuples
    that are currently in the system. Of these, schedule
    for a single time unit the tuple that lies on the
    segment with the steepest slope in its lower envelope
    simulation. If there are multiple such tuples, select
    the tuple which has the earliest arrival time.
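
The rule above can be sketched in Python under some simplifying assumptions: each operator path is a list of (time cost, selectivity) pairs, a tuple starts at size 1, a final selectivity of 0 stands for the tuple leaving the system, and a tuple's position is the index of the chart point it has reached. All names and numbers here are hypothetical, and the sketch only picks the next tuple; it is not the full scheduler.

```python
def progress_chart(operators):
    """Points (t, s): cumulative processing time and remaining tuple size."""
    t, s = 0.0, 1.0
    points = [(t, s)]
    for cost, selectivity in operators:
        t += cost
        s *= selectivity
        points.append((t, s))
    return points

def lower_envelope(points):
    """Segments (start, end, slope): from each start point, jump to the later
    point reachable with the steepest (most negative) slope."""
    segments, i = [], 0
    while i < len(points) - 1:
        t0, s0 = points[i]
        def slope_to(j):
            return (points[j][1] - s0) / (points[j][0] - t0)
        # Ties on slope go to the farthest point.
        j = min(range(i + 1, len(points)), key=lambda j: (slope_to(j), -j))
        segments.append((i, j, slope_to(j)))
        i = j
    return segments

def segment_slope(envelope, position):
    """Slope of the envelope segment that covers a chart position."""
    for start, end, slope in envelope:
        if start <= position < end:
            return slope
    raise ValueError("tuple has already left the system")

def chain_pick(tuples, envelopes):
    """tuples: (arrival_time, path_id, position). Steepest slope first,
    earliest arrival breaks ties."""
    return min(tuples, key=lambda x: (segment_slope(envelopes[x[1]], x[2]), x[0]))

# Two hypothetical operator paths.
paths = {"A": [(1, 0.75), (1, 0.0)], "B": [(1, 0.25), (1, 0.0)]}
envelopes = {name: lower_envelope(progress_chart(ops)) for name, ops in paths.items()}

waiting = [(0, "A", 0), (1, "B", 0), (2, "B", 1)]
chosen = chain_pick(waiting, envelopes)
```

Here the tuple at the start of path B is chosen because its envelope segment falls fastest, which frees the most memory per unit of processing time.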

13
Scheduling Multiple Stream Queries
  • Sliding-Window Joins
  • Two input streams to a join operator
  • How to define the (t,s) value?
  • t, s ?R ?s tR ts SW(S) SW(R)
  • Apply single stream strategy
  • Additional blocking condition

14
Performance: Single Stream, Two Operators
15
Approximate Join Processing over Streams
(SIGMOD 2003)
  • Sliding-window join of two streams
  • Load shedding: k-truncated join
  • Static relation version
  • Modeled as a bipartite graph
  • Maximize edges after dropping k nodes
  • Stream version: distribution of join attribute
  • Optimal dynamic programming solution
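
To make the static version concrete, the sketch below treats each tuple as a node, joins tuples with equal attribute values, and tries every way of dropping k nodes to find the largest remaining join. This brute-force search is only an illustration for tiny inputs and is not the dynamic programming solution mentioned above; the function names and the sample values are assumptions.

```python
from collections import Counter
from itertools import combinations

def join_size(r, s):
    """Number of edges in the bipartite join graph (equal attribute values)."""
    count_r, count_s = Counter(r), Counter(s)
    return sum(n * count_s[v] for v, n in count_r.items())

def best_k_truncation(r, s, k):
    """Try every set of k dropped tuples; return (size, r_kept, s_kept)."""
    nodes = [("R", i) for i in range(len(r))] + [("S", j) for j in range(len(s))]
    best = None
    for dropped in combinations(nodes, k):
        gone = set(dropped)
        r_kept = [v for i, v in enumerate(r) if ("R", i) not in gone]
        s_kept = [v for j, v in enumerate(s) if ("S", j) not in gone]
        size = join_size(r_kept, s_kept)
        if best is None or size > best[0]:
            best = (size, r_kept, s_kept)
    return best

# Hypothetical join-attribute values for two small relations.
R = [1, 1, 1, 3, 2]
S = [2, 3, 1, 1, 3]

full = join_size(R, S)
size, r_kept, s_kept = best_k_truncation(R, S, 2)
```

In this example the full join has 9 result tuples, and the best choice for k = 2 drops the two tuples with value 2, losing a single result.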

16
Offline Algorithm
R: 1, 1, 1, 3, 2
S: 2, 3, 1, 1, 3
17
Offline Vs. Online Algorithm
  • Offline: knowledge of the future
  • Online
  • Assumes arrival probabilities
  • Maximizes expected output size

18
Currently
  • Streaming Queries over Streaming Data (VLDB 2002)
  • Continuously Adaptive Query Processing (Eddies)
  • Approximate Frequency Counts over Data Streams
    (VLDB 2002)

19
Conclusions
  • The data stream model is relevant
  • Issues in DSM
  • Operator scheduling
  • Approximate join processing
  • Blatant promises

20
  • Thank You!