Title: Operator Scheduling and Approximate Answers in Data Stream Processing
1. Operator Scheduling and Approximate Answers in Data Stream Processing
- Shariq Rizvi
- rizvi_at_cse.iitb.ac.in
2. Why the Data Stream Model?
- Continuously arriving data
- Stock prices
- Network monitors
- Log-records or click-streams
- Continuous Queries
- What was the highest closing price for AT&T shares in the last 1 week?
- Need extensions to relational model
3. Challenges
- Multiple, continuous, unbounded, rapid data streams
- Continuous, timely answers to long-running queries: semantics?
- Formal query language
- Support for streams and relations
- Limited resources: memory, CPU, network bandwidth
4. Motwani et al., CIDR 2003
- CQL: extension to SQL
- Select Count(*)
- From Requests S [Range 1 Day Preceding] sample(10)
- Where S.Domain = 'stanford.edu'
- The result of a query Q at time t is obtained
by taking all relations at time t, all streams up
to time t converted to relations by their window
specifications, and applying conventional
relational semantics. The result is a stream if
the outermost operator is Istream or Dstream,
otherwise it remains as a relation.
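
The semantics quoted above can be illustrated with a minimal sketch. It assumes a time-based range window that keeps tuples with timestamps in (t - width, t], models a relation as a multiset, and treats Istream and Dstream as the multiset differences between the relation at consecutive time instants. The function names, the boundary convention, and the sample data are assumptions made for illustration, not part of CQL itself.

```python
from collections import Counter

def range_window(stream, t, width):
    """Convert a stream to a relation at time t.

    stream is a list of (timestamp, value) pairs. The relation is the
    multiset of values whose timestamp lies in (t - width, t].
    The half-open boundary is an assumption of this sketch.
    """
    return Counter(v for ts, v in stream if t - width < ts <= t)

def istream(prev_rel, cur_rel):
    """Tuples present in the current relation but not in the previous one."""
    return cur_rel - prev_rel  # Counter subtraction keeps only positive counts

def dstream(prev_rel, cur_rel):
    """Tuples present in the previous relation but not in the current one."""
    return prev_rel - cur_rel

# Hypothetical request stream: (timestamp, domain)
requests = [(1, "stanford.edu"), (2, "mit.edu"), (3, "stanford.edu"), (5, "stanford.edu")]

rel3 = range_window(requests, 3, 3)
rel4 = range_window(requests, 4, 3)
rel5 = range_window(requests, 5, 3)

print(rel3["stanford.edu"])   # count of stanford.edu requests in the window at t=3
print(dstream(rel3, rel4))    # tuples that expired between t=3 and t=4
print(istream(rel4, rel5))    # tuples that entered between t=4 and t=5
```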
5. Streams and Relations
[Figure: streams and relations. Labels: Range 1 Day Preceding; Insert Stream; Delete Stream]
6. Query Processing
[Figure: query plan. Labels: Queue; Synopsis]
7. Issues
- Exploiting constraints on streams
- Orders join Fulfillments
- clustered-arrival, ordered-arrival
- Scheduling of Operators
[Figure labels: O1, O2]
8. Approximations
- Static
- Window reduction
- Sampling rate reduction
- Dynamic
- Synopsis compression
- Load shedding
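
As one concrete illustration of the approximations listed above, the following sketch shows random load shedding: each tuple is kept with a fixed probability, and a count over the surviving tuples is scaled back up. The scaling step is the standard estimator for counts under independent sampling and is not stated on the slide. The function names and parameters are hypothetical.

```python
import random

def shed_load(stream, keep_rate, rng):
    """Keep each tuple independently with probability keep_rate."""
    return [tup for tup in stream if rng.random() < keep_rate]

def estimate_count(kept, keep_rate):
    """Scale the surviving count back up to estimate the full count."""
    return len(kept) / keep_rate

rng = random.Random(0)                      # seeded only to make runs repeatable
kept = shed_load(list(range(1000)), 0.1, rng)
print(len(kept), estimate_count(kept, 0.1))
```

Lowering `keep_rate` reduces the processing load at the cost of a noisier estimate.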
9. Chain: Operator Scheduling for Memory Minimization (SIGMOD 2003)
- Near-optimal memory usage: single-stream queries (SPJ)
- Does well for sliding-window joins: multiple streams
- Key Concepts
- Operator Path of a tuple
- Selectivity of an operator
- Progress Chart of tuple size
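
A progress chart can be sketched in a few lines under the following assumptions: a tuple enters its operator path with size 1, each operator takes a fixed time per tuple, and each operator multiplies the expected tuple size by its selectivity. The chart is then the list of (cumulative time, remaining size) points. The function name and the sample numbers are hypothetical.

```python
def build_progress_chart(operator_path):
    """Build the (time, size) points of a progress chart.

    operator_path is a list of (time_per_tuple, selectivity) pairs,
    given in the order the tuple visits the operators.
    """
    chart = [(0.0, 1.0)]  # a fresh tuple: no processing yet, full size
    for cost, selectivity in operator_path:
        t, s = chart[-1]
        chart.append((t + cost, s * selectivity))
    return chart

# Hypothetical two-operator path: the first operator keeps 20% of the
# tuples, the second one outputs nothing further downstream.
print(build_progress_chart([(1.0, 0.2), (1.0, 0.0)]))
```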
10. A Progress Chart
11. Another Progress Chart
- Lower Envelope
- The concept
- A property
12. Scheduling Single Stream Queries
- n distinct operator paths
- P1, ..., Pn: Progress Charts
- P1, ..., Pn: Lower Envelopes
- Chain: At any time instant, consider all tuples that are currently in the system. Of these, schedule for a single time unit the tuple that lies on the segment with the steepest slope in its lower envelope simulation. If there are multiple such tuples, select the tuple which has the earliest arrival time.
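
The Chain rule stated above can be sketched as follows. This is an illustrative reading, not the authors' implementation. It assumes each progress chart is a list of (time, size) points with strictly increasing time, that the lower envelope is built greedily by jumping from each point to the later point reached by the steepest descent, and that ties in that choice go to the earliest such point. The names `lower_envelope`, `envelope_slope`, and `chain_pick`, and the tuple layout (arrival_time, chart_id, position), are hypothetical.

```python
def lower_envelope(chart):
    """Return the lower-envelope segments as (start_index, end_index, slope)."""
    segments = []
    i, last = 0, len(chart) - 1
    while i < last:
        t_i, s_i = chart[i]
        # Among all later points, pick the one reached by the steepest descent.
        # min() returns the first minimum, so ties go to the earliest point.
        best_j = min(
            range(i + 1, last + 1),
            key=lambda j: (chart[j][1] - s_i) / (chart[j][0] - t_i),
        )
        slope = (chart[best_j][1] - s_i) / (chart[best_j][0] - t_i)
        segments.append((i, best_j, slope))
        i = best_j
    return segments

def envelope_slope(segments, position):
    """Slope of the envelope segment covering the operator point `position`."""
    for start, end, slope in segments:
        if start <= position < end:
            return slope
    raise ValueError("tuple has already left the system")

def chain_pick(tuples, charts):
    """Pick the tuple to run for the next time unit.

    The most negative slope is the steepest descent; ties are broken
    by the earliest arrival time.
    """
    envelopes = {cid: lower_envelope(chart) for cid, chart in charts.items()}
    return min(
        tuples,
        key=lambda tup: (envelope_slope(envelopes[tup[1]], tup[2]), tup[0]),
    )

# Hypothetical charts for two operator paths.
charts = {
    "P1": [(0, 1.0), (1, 0.9), (2, 0.1), (3, 0.0)],
    "P2": [(0, 1.0), (1, 0.2), (2, 0.0)],
}
waiting = [(10, "P1", 0), (12, "P2", 1), (11, "P1", 1)]
print(chain_pick(waiting, charts))
```

In the P1 chart the first operator alone barely shrinks the tuple, but the first two operators together shrink it sharply, so the envelope merges them into one segment and Chain treats a tuple waiting at either of them as equally urgent.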
13. Scheduling Multiple Stream Queries
- Sliding-Window Joins
- Two input streams to a join operator
- How to define the (t,s) value?
- (t, s) in terms of: ?R, ?S, tR, tS, SW(S), SW(R)
- Apply single stream strategy
- Additional blocking condition
14. Performance: Single stream, Two operators
15. Approximate Join Processing over Streams (SIGMOD 2003)
- Sliding-window join of two streams
- Load-shedding: k-truncated join
- Static relation version
- Modeled as a bipartite graph
- Maximize edges after dropping k nodes
- Stream version: distribution of join attribute
- Optimal dynamic programming solution
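
The static relation version can be made concrete with a brute-force sketch. Each tuple is a node, each pair of matching tuples (one from R, one from S) is an edge, and the join size is the number of edges. The code tries every way of dropping k nodes and keeps the best. It assumes that the k dropped tuples may come from either relation, and it is only meant for tiny inputs; it is not the dynamic programming solution mentioned above. The input values reuse the two sequences shown on the Offline Algorithm slide, treated here as static relations purely for illustration.

```python
from collections import Counter
from itertools import combinations

def join_size(r_vals, s_vals):
    """Number of edges in the bipartite join graph (equi-join output size)."""
    r_cnt, s_cnt = Counter(r_vals), Counter(s_vals)
    return sum(r_cnt[v] * s_cnt[v] for v in r_cnt)

def best_truncated_join(r_vals, s_vals, k):
    """Exhaustively drop k tuples and return (best_join_size, dropped_nodes)."""
    nodes = [("R", i) for i in range(len(r_vals))]
    nodes += [("S", j) for j in range(len(s_vals))]
    best_size, best_drop = -1, None
    for drop in combinations(nodes, k):
        dropped = set(drop)
        r_kept = [v for i, v in enumerate(r_vals) if ("R", i) not in dropped]
        s_kept = [v for j, v in enumerate(s_vals) if ("S", j) not in dropped]
        size = join_size(r_kept, s_kept)
        if size > best_size:
            best_size, best_drop = size, drop
    return best_size, best_drop

R = [1, 1, 1, 3, 2]
S = [2, 3, 1, 1, 3]
print(join_size(R, S))
for k in (1, 2, 3):
    print(k, best_truncated_join(R, S, k))
```

The exhaustive search grows combinatorially with k, which is why it serves only as a reference for checking a smarter method on small cases.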
16. Offline Algorithm
R: 1, 1, 1, 3, 2
S: 2, 3, 1, 1, 3
17. Offline vs. Online Algorithm
- Offline: knowledge of future
- Online
- Assumes arrival probabilities
- Maximizes expected output size
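
One way to picture the online setting is an eviction rule driven by arrival probabilities. The sketch below is an illustration under stated assumptions, not the algorithm from the paper: each tuple in one stream's window has a value and a number of remaining time slots before it expires, the other stream delivers one tuple per slot, and each arriving tuple carries value v with a known probability. The expected number of future matches is then remaining slots times that probability, and the tuple with the smallest expectation is dropped. All names and numbers are hypothetical.

```python
def expected_future_matches(value, remaining_slots, partner_probs):
    """Expected join results this tuple will still produce before expiring."""
    return remaining_slots * partner_probs.get(value, 0.0)

def choose_victim(window, partner_probs):
    """Index of the window tuple with the least expected future output.

    window is a list of (value, remaining_slots) pairs.
    """
    return min(
        range(len(window)),
        key=lambda i: expected_future_matches(
            window[i][0], window[i][1], partner_probs
        ),
    )

window = [(1, 3), (2, 5), (3, 1)]          # (value, remaining slots)
partner_probs = {1: 0.5, 2: 0.1, 3: 0.4}   # assumed arrival probabilities
print(choose_victim(window, partner_probs))
```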
18. Currently
- Streaming Queries over Streaming Data (VLDB 2002)
- Continuously Adaptive Query Processing: Eddies
- Approximate Frequency Counts over Data Streams (VLDB 2002)
19. Conclusions
- Data stream model relevant
- Issues in DSM
- Operator scheduling
- Approximate join processing
- Blatant promises