Stream Data Management System Prototypes - PowerPoint PPT Presentation

Slides: 49

Provided by: sia78

Learn more at: http://oak.cs.ucla.edu

1
Stream Data Management System Prototypes

Ying Sheng, Richard Sia
June 1, 2004
Professor Carlo Zaniolo
CS 240B
Spring 2004

2
Outline

Motivation of DSMS
Aurora (Brown, Brandeis, MIT)
Model
Operator Scheduling
Storage/Memory Management
QoS issue
STREAM (Stanford)
System Architecture
Query Language
Query Plans and Execution
Performance Issues
Approximation Techniques
STREAM Interface
Conclusion

3
Motivation

HADP → DAHP (human-active, DBMS-passive → DBMS-active, human-passive)
Continuous data and static queries
Monitoring using sensors
Military
Traffic
Environment
Financial analysis
Object tracking

4
Aurora
5
Aurora Model

General Purpose DSMS
Continuous stream data arrives
Flow through a set of operators
Output to application or materialized

6
Aurora Model

Components
Storage manager
Scheduler
Load Shedder
Router
QoS Monitor
GUI

7
Aurora Model

3 kinds of queries supported
Continuous
View
Ad-Hoc Query

8
Aurora Model

8 primitive operators (Box)
Windowed
Slide
Tumble
Latch
Resample
Non-windowed
Filter
Map
GroupBy
Join

9
Aurora Operator Optimization

Each operator associated with
Selectivity s(b), sel(b)
Computation time c(b), cost(b)
General Optimization Techniques
Pushing projection upstream
Combining boxes
Reordering boxes

10
Aurora Operator Optimization

Case 1: cost of a → b
c(a) + s(a) c(b)
Case 2: cost of b → a
c(b) + s(b) c(a)
Criterion for switching box positions
c(a) + s(a) c(b) > c(b) + s(b) c(a)

[Figure: the two box orders, a → b and b → a]
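
The switching criterion compares the expected per-tuple cost of the two box orders: the first box processes every tuple, and the second box processes only the fraction that the first one passes on (its selectivity). A minimal Python sketch of that comparison follows; the function names and the numbers are made up for illustration and are not part of Aurora.

```python
def pipeline_cost(c_first, s_first, c_second):
    """Expected per-tuple cost of running two boxes in sequence.

    The first box sees every tuple; the second sees only the fraction
    s_first that the first box lets through.
    """
    return c_first + s_first * c_second


def should_swap(c_a, s_a, c_b, s_b):
    """True when the order b -> a is strictly cheaper than a -> b."""
    return pipeline_cost(c_a, s_a, c_b) > pipeline_cost(c_b, s_b, c_a)


# Hypothetical numbers: a is expensive and unselective, b is cheap and selective.
cost_ab = pipeline_cost(c_first=4.0, s_first=0.9, c_second=1.0)  # order a -> b
cost_ba = pipeline_cost(c_first=1.0, s_first=0.1, c_second=4.0)  # order b -> a
swap = should_swap(c_a=4.0, s_a=0.9, c_b=1.0, s_b=0.1)
```

With these numbers the selective, cheap box b belongs first, so the criterion says to swap.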
11
Aurora Operator Scheduling

Scheduling by OS
One thread per box; scheduling is delegated to the OS
Easier to program
Aurora Scheduler
Single thread for the scheduler
The scheduler picks the box with the highest priority
and calls it to consume tuples from its queue
Allows finer control of resources
Scalable
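
The single-threaded design can be sketched as a loop that repeatedly selects the highest-priority box with pending input and lets it drain its queue. The Box class, the static priorities, and the drain-everything policy below are assumptions made for illustration; they are not Aurora's actual interfaces.

```python
from collections import deque


class Box:
    """A hypothetical operator box with an input queue and a static priority."""

    def __init__(self, name, priority, fn, downstream=None):
        self.name = name
        self.priority = priority
        self.fn = fn                  # maps one tuple to a list of output tuples
        self.downstream = downstream  # next Box, or None for the application
        self.queue = deque()


def run_scheduler(boxes, results):
    """Single-threaded loop: run the highest-priority box that has input."""
    while True:
        ready = [b for b in boxes if b.queue]
        if not ready:
            return
        box = max(ready, key=lambda b: b.priority)
        # The chosen box consumes everything currently in its queue.
        while box.queue:
            for out in box.fn(box.queue.popleft()):
                if box.downstream is None:
                    results.append(out)
                else:
                    box.downstream.queue.append(out)


# Usage: a filter box feeding a map box.
double = Box("map", priority=1, fn=lambda t: [t * 2])
keep_even = Box("filter", priority=2,
                fn=lambda t: [t] if t % 2 == 0 else [], downstream=double)
keep_even.queue.extend([1, 2, 3, 4])
results = []
run_scheduler([keep_even, double], results)
```

Because one thread makes every decision, the policy for picking the next box can be changed in a single place.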

12
Aurora Operator Scheduling
13
Aurora Operator Scheduling

Problem: which box to execute next?
Min-Cost (MC)
Reduce computation cost
Min-Latency (ML)
Return result as soon as possible
Min-Memory (MM)
Reduce memory usage of queue

14
Aurora Operator Scheduling

Example

[Figure: example network of boxes b1 to b6 between the input streams (upstream) and the application (downstream)]
15
Aurora Operator Scheduling

Min-Cost
Objective: avoid the overhead of calling boxes
Min-Latency
Prefer boxes that can produce output tuples
in the shortest time
Min-Memory
Prefer boxes that consume more
tuples with less computation time
Similar to Chain operator scheduling
More at: Operator Scheduling in a Data Stream
Manager, VLDB 2003

16
Aurora Storage/Memory Management

Manage the queue in front of each box
2 boxes sharing the same queue
Windowed operators
The initial queue size is 128 KB
Queues are managed as circular queues
On overflow, the queue size is doubled; when usage drops, it is halved
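
A circular queue that doubles on overflow and shrinks again when mostly empty can be sketched as follows. Capacity is counted in tuples here rather than bytes, and the shrink threshold (one quarter full) and the minimum capacity are assumptions; the slides give only the 128 KB initial size and the doubling rule.

```python
class CircularQueue:
    """Ring buffer that doubles when full and halves when mostly empty."""

    def __init__(self, capacity=4, min_capacity=4):
        self.buf = [None] * capacity
        self.head = 0      # index of the oldest element
        self.size = 0      # number of live elements
        self.min_capacity = min_capacity

    def _resize(self, new_capacity):
        # Copy live elements, oldest first, into a fresh buffer.
        items = [self.buf[(self.head + i) % len(self.buf)]
                 for i in range(self.size)]
        self.buf = items + [None] * (new_capacity - self.size)
        self.head = 0

    def enqueue(self, item):
        if self.size == len(self.buf):
            self._resize(2 * len(self.buf))      # overflow: double
        self.buf[(self.head + self.size) % len(self.buf)] = item
        self.size += 1

    def dequeue(self):
        if self.size == 0:
            raise IndexError("dequeue from empty queue")
        item = self.buf[self.head]
        self.buf[self.head] = None
        self.head = (self.head + 1) % len(self.buf)
        self.size -= 1
        cap = len(self.buf)
        if cap > self.min_capacity and self.size <= cap // 4:
            self._resize(max(self.min_capacity, cap // 2))  # underuse: halve
        return item
```

Shrinking at one quarter rather than one half avoids resizing back and forth when the load hovers around a boundary.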

17
Aurora Storage/Memory Management

Swap in/out between memory / disk based on
priority of boxes using it
Work with Operator Scheduler to exchange box
priority and buffer-state information
Connection Point Management
A B-tree indexed on timestamp is built to support
random access of tuples by ad-hoc query

18
Aurora Storage/Memory Management
19
Aurora QoS Issue

Different queries/applications have different QoS
requirements
Stock market monitoring
Average temperature of a set of sensors
QoS Graph

20
Latency-based QoS Graph
[Figure: latency-based QoS graph, QoS versus time, with a critical point marked; labels: b, D(b), latency(b), cost(D(b)), eol(b), est(b)]
21
Aurora QoS-driven Scheduling

Assign priority to each box based on
priority(b) = (utility(b), est(b))
utility(b) = gradient(eol(b))
How fast the QoS is degrading by the time the tuple
leaves the system if we process it now.
est(b)
How soon it will exhibit another performance
degradation if we don't process it now.
Performance
200 queries/applications, each with 5 boxes
Round robin: 0.43
QoS-driven scheduling: 0.85
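
One possible reading of this priority rule, sketched in Python: utility is the steepness of the QoS graph at the box's expected output latency (eol), and est is the time left before the next critical point. The piecewise-linear graph, the eol values, and the ordering rule (higher utility first, then smaller est) are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical piecewise-linear latency QoS graph: (latency, QoS) critical points.
QOS_GRAPH = [(0.0, 1.0), (2.0, 1.0), (6.0, 0.0)]


def gradient(points, latency):
    """Magnitude of the QoS slope on the segment containing `latency`."""
    xs = [x for x, _ in points]
    i = bisect_right(xs, latency) - 1
    if i < 0 or i >= len(points) - 1:
        return 0.0  # treated as flat outside the graph
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return abs((y1 - y0) / (x1 - x0))


def slack(points, latency):
    """Time until the next critical point after `latency` (infinite if none)."""
    for x, _ in points:
        if x > latency:
            return x - latency
    return float("inf")


def priority_key(eol):
    # Sort key: higher utility first, then smaller slack (more urgent) first.
    return (-gradient(QOS_GRAPH, eol), slack(QOS_GRAPH, eol))


# Expected output latency per box; the values are made up.
eol_by_box = {"b1": 1.0, "b2": 3.0, "b3": 5.5}
order = sorted(eol_by_box, key=lambda b: priority_key(eol_by_box[b]))
```

Here b1 sits on the flat part of the graph, so delaying it costs nothing yet, while b3 is on the falling part and closest to the next critical point.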

22
Aurora Current Status

Main components of a DSMS are introduced
Operator scheduler
Memory/storage management
QoS concept in stressed environments
Load shedding
Implemented in C++, with Java-based GUI
Depends on a few external software packages/libraries
More?
Distributed architecture: Aurora*
Fault tolerance or disaster recovery?

23
STREAM
24
STREAM Introduction

General-purpose prototype DSMS
Supports data streams and stored relations
Declarative language for registering continuous
queries
Flexible query plans and execution strategies
Aggressive sharing of state and computation among
queries

25
STREAM Introduction

Designed to cope with
Stream rates that may be high, variable, bursty
Continuous query loads that may be high, volatile
Primary coping techniques
Graceful approximation as necessary
Careful resource allocation and use
Continuous self-monitoring and reoptimization

26
STREAM System Architecture
DSMS
Scratch Store
Stored Relations
27
STREAM Query Language

Continuous Query Language (CQL)
Extends SQL with
Streams as new data type
Stream: unbounded bag of pairs <tuple, timestamp>
Relation: time-varying bag of tuples
Continuous instead of one-time semantics
Three classes of operators
Relation-to-relation
Stream-to-relation
Relation-to-stream

28
STREAM CQL Operators

Relation-to-relation
SQL constructs
Stream-to-relation
Tuple-based sliding window: [Rows N], [Rows Unbounded]
Time-based sliding window: [Range T], [Now]
Partitioned sliding window: [Partition By A1, ..., Ak Rows N]
Relation-to-stream
Istream: insert stream
Dstream: delete stream
Rstream: relation stream
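
The stream-to-relation and relation-to-stream operators can be illustrated with a small Python model. A stream is represented here as a timestamp-ordered list of (tuple, timestamp) pairs and a relation as a list of tuples at one instant; the function names, the inclusive window boundaries, and the sample data are assumptions made for this sketch, not STREAM's implementation.

```python
from collections import Counter


def rows_window(stream, n, now):
    """Tuple-based window: the n most recent tuples with timestamp <= now."""
    seen = [t for t, ts in stream if ts <= now]
    return seen[-n:]


def range_window(stream, width, now):
    """Time-based window: tuples with timestamp in [now - width, now]."""
    return [t for t, ts in stream if now - width <= ts <= now]


def istream(prev_relation, cur_relation):
    """Insert stream: tuples in the current relation but not the previous one."""
    return list((Counter(cur_relation) - Counter(prev_relation)).elements())


def dstream(prev_relation, cur_relation):
    """Delete stream: tuples in the previous relation but not the current one."""
    return list((Counter(prev_relation) - Counter(cur_relation)).elements())


def rstream(cur_relation):
    """Relation stream: every tuple of the current relation."""
    return list(cur_relation)


stream = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
before = rows_window(stream, n=2, now=3)
after = rows_window(stream, n=2, now=4)
inserted = istream(before, after)
deleted = dstream(before, after)
```

Counter subtraction gives bag (multiset) difference, which matches the bag semantics of streams and relations.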

29
STREAM Example Query 1

Two example streams
Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day
by clerk "Sue" for customer "Joe"
Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Sue"
And O.customer = "Joe"

30
STREAM Example Query 2

Using a 10% sample of the Fulfillments stream,
take the 5 most recent fulfillments for each
clerk and return the maximum cost
Select F.clerk, Max(O.cost)
From Orders O, Fulfillments F [Partition By clerk
Rows 5] 10% Sample
Where O.orderID = F.orderID
Group By F.clerk

31
STREAM Simplified Query 2

Result is a relation, updated as stream elements
arrive
Select F.clerk, Max(O.cost)
From O, F [Rows 100]
Where O.orderID = F.orderID
Group By F.clerk

32
STREAM Simplified Query 2

Result is streamed: emits a <clerk, max> stream
element whenever max changes for a clerk (or for a new
clerk)
Select Istream(F.clerk, Max(O.cost))
From O, F [Rows 100]
Where O.orderID = F.orderID
Group By F.clerk

33
STREAM Example Query 3

Relation: CurPrice(stock, price)
Average price over the last day for each stock
Select stock, Avg(price)
From Istream(CurPrice) [Range 1 Day]
Group By stock
Istream provides the history of CurPrice
Window on history (back to relation), group and
aggregate

34
STREAM Query plans and Execution

When a continuous query is registered, generate a
query plan
New plan merged with existing plans
Users can also create and manipulate plans directly
Plans composed of three main components
Operators
Flag: insertion (+), deletion (-)
Elements: <tuple, timestamp, flag> triples
Streams: only + elements
Relations: both + and - elements
Queues
Enforce nondecreasing timestamps (heartbeats)
Mechanisms for buffering tuples
States (Synopses)
Global scheduler for plan execution

35
STREAM States

States (Synopses)
Summarize elements seen so far (exact or
approximate) for operators requiring history
To implement windows
Example synopsis join
Sliding-window join
Approximation of full join

36
STREAM Simple Query Plan
Select * From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
37
STREAM Performance Issues

Synopsis Sharing
Eliminate data redundancy
Exploiting Constraints
Selectively discard data to reduce state
Operator Scheduling
Reduce queue sizes

38
STREAM Synopsis Sharing

Eliminate redundancy by
replacing the nearly identical synopses with
lightweight stubs
using a single store to hold the actual tuples
The store tracks the progress of each stub and presents
the appropriate view to each stub.
The store contains the union of the tuples needed by its
stubs
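
The stub-and-store idea can be sketched for the simple case of several tuple-based windows over the same stream. The class names are hypothetical, and the sketch assumes all stubs read in lockstep; the real store also has to track stubs that have progressed to different points.

```python
from collections import deque


class SharedStore:
    """Holds the actual tuples once, sized for the widest window among its stubs."""

    def __init__(self):
        self.tuples = deque()
        self.stubs = []

    def stub(self, rows):
        """Register a lightweight view over the last `rows` tuples."""
        s = Stub(self, rows)
        self.stubs.append(s)
        return s

    def insert(self, item):
        self.tuples.append(item)
        # Keep only what the widest stub still needs (the union of all views).
        limit = max((s.rows for s in self.stubs), default=0)
        while len(self.tuples) > limit:
            self.tuples.popleft()


class Stub:
    """Lightweight synopsis: stores no tuples, only a window size."""

    def __init__(self, store, rows):
        self.store = store
        self.rows = rows

    def view(self):
        return list(self.store.tuples)[-self.rows:]


store = SharedStore()
wide = store.stub(rows=5)     # e.g. a window of 5 rows for one query
narrow = store.stub(rows=2)   # e.g. a window of 2 rows for another query
for value in range(8):
    store.insert(value)
```

Both windows are served from one copy of the data: the store keeps five tuples, and the narrow stub sees only the last two.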

39
STREAM Synopsis Sharing
Select * From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
Select A, Max(B) From S1 [Rows 200]
Group By A
40
STREAM Exploiting Constraints

Specify an adherence parameter k to capture how
closely a given stream or set of streams adheres
to a constraint of that type
Referential integrity k-constraint
Ordered-arrival k-constraint
Clustered-arrival k-constraint
Query execution plans reduce or eliminate state
based on k-constraints
If a constraint is violated, the result is approximate
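
As an illustration of how a k-constraint bounds state, consider an ordered-arrival constraint read as follows (an assumption of this sketch): every value that arrives at least k + 1 positions after a value s is greater than or equal to s. Under that reading, a buffer of at most k + 1 values is enough to restore sorted order; the function below is a sketch, not STREAM's operator.

```python
import heapq


def reorder(stream, k):
    """Emit values in nondecreasing order using a buffer of at most k + 1 values."""
    heap = []
    for value in stream:
        heapq.heappush(heap, value)
        if len(heap) > k:
            # The buffer is full: its minimum cannot be undercut by any later
            # arrival, provided the stream obeys the k-constraint.
            yield heapq.heappop(heap)
    while heap:
        yield heapq.heappop(heap)


adheres = list(reorder([2, 1, 4, 3, 6, 5], k=1))   # stream satisfies k = 1
violates = list(reorder([3, 1, 2, 0], k=1))        # 0 arrives too late
```

When the stream adheres to the constraint the output is fully sorted; when it does not, the output is only approximately sorted, which mirrors the approximate results mentioned above.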

41
STREAM Operator Scheduling

Goal: minimize total queue size for
unpredictable, bursty stream arrival patterns
Chain Scheduling Algorithm
1. Mark the first operator in the plan as the
current operator.
2. Find the block of consecutive operators starting
at the current operator that maximizes the
reduction in total queue size per unit time.
3. Mark the first operator following this block as
the current operator and repeat Step 2 until
all operators have been assigned to chains.
Chains are scheduled according to the greedy
algorithm, but within a chain, execution proceeds
in FIFO order.
Proven to be within a constant factor of any
clairvoyant strategy (i.e., the optimal
strategy based on knowledge of future input) for
some queries
Empirical results: large savings over naive
strategies for many queries
But minimizing queue sizes is at odds with
minimizing latency
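
The chain-construction steps above can be sketched for a single linear pipeline. Each operator is described by a processing time per tuple and a selectivity; both parameters, the function name, and the example values are assumptions of this sketch. Only the partition into chains is shown, not the run-time scheduler that then favors the steepest chain with pending input.

```python
def build_chains(operators):
    """Partition a pipeline into chains.

    `operators` is a list of (time_per_tuple, selectivity) pairs in plan order,
    with every time_per_tuple > 0. Returns (first_index, last_index, slope)
    triples, where slope is the reduction in queued tuple size per unit of
    processing time achieved by running the whole chain.
    """
    # Cumulative processing time and remaining tuple size after each operator.
    times, sizes = [0.0], [1.0]
    for t, s in operators:
        times.append(times[-1] + t)
        sizes.append(sizes[-1] * s)

    chains = []
    cur = 0  # Step 1: the first operator is the current operator.
    while cur < len(operators):
        # Step 2: pick the block starting at `cur` with the steepest reduction.
        best_end, best_slope = None, None
        for end in range(cur + 1, len(times)):
            slope = (sizes[cur] - sizes[end]) / (times[end] - times[cur])
            if best_slope is None or slope > best_slope:
                best_end, best_slope = end, slope
        chains.append((cur, best_end - 1, best_slope))
        cur = best_end  # Step 3: continue after the block.
    return chains


# Hypothetical pipeline: a pass-through box, a selective filter, a mild filter.
chains = build_chains([(1.0, 1.0), (1.0, 0.1), (1.0, 0.9)])
```

The first two operators form one chain because running the pass-through box alone frees no memory; only together with the selective filter does the work pay off.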

42
STREAM Approximation

CPU-Limited Approximation
Insufficient CPU time to process each stream
element due to the high data arrival rate.
load-shedding
sampling operators
Approximate by probabilistically dropping
elements before they are processed
Memory-Limited Approximation
The total state required for all registered
queries exceeds available memory.
The system selectively shrinks or discards
synopses.
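
CPU-limited approximation by probabilistic dropping can be sketched as a sampling operator placed in front of a query. The function names, the 10% rate, and the scaling of a count aggregate are illustrative assumptions.

```python
import random


def sample_operator(stream, keep_probability, rng=None):
    """Yield each element independently with probability `keep_probability`."""
    if rng is None:
        rng = random.Random()
    for element in stream:
        if rng.random() < keep_probability:
            yield element


def scaled_count(sampled_count, keep_probability):
    """Estimate the count over the full stream from the sampled count."""
    return sampled_count / keep_probability


rng = random.Random(0)  # fixed seed so the run is repeatable
kept = list(sample_operator(range(10_000), keep_probability=0.1, rng=rng))
estimate = scaled_count(len(kept), 0.1)
```

Downstream operators then process roughly a tenth of the input, and aggregates such as counts or sums are scaled back up to approximate the answer over the full stream.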

43
STREAM Query Interface

View the structure of query plans and their
component entities.
View the detailed properties of each entity.
Dynamically adjust entity properties.
View monitoring graphs that display time-varying
entity properties plotted dynamically against
time.
Queue sizes, throughput, overall memory usage,
and join selectivity.

44
STREAM Query Plan Monitoring
45
STREAM Current Status

Version 1.0 up and running
Includes a new monitoring and adaptive query
processing infrastructure: StreaMon
Executor runs query plans to produce results.
Profiler collects and maintains statistics about
stream and plan characteristics.
Reoptimizer ensures that the plans and memory
structures are the most efficient for current
characteristics.
Web demo available at http://shark.stanford.edu:8080/
Future Directions
Distributed Stream Processing
Crash Recovery
Improved Approximation
Classification of Applications

46
Conclusion

Ideal DSMS
Well defined and flexible query language
User-friendly interface
Scalable
Operator scheduling
Storage management
Synopsis sharing
Approximation
Quality assurance
Fault tolerant

47
References

R. Motwani et al., Query Processing,
Approximation, and Resource Management in a Data
Stream Management System, in Proceedings of the
1st CIDR Conference, 2003.
S. Madden et al., Continuously Adaptive
Continuous Queries over Streams, in Proceedings
of the SIGMOD Conference, 2002.
D. Carney et al., Monitoring Streams - A New
Class of Data Management Applications, in
Proceedings of the VLDB Conference, 2002.
D. Carney et al., Operator Scheduling in a Data
Stream Manager, in Proceedings of the VLDB
Conference, 2003.
Stanford STREAM Project Website:
http://www-db.stanford.edu/stream/index.html
Aurora Project Website:
http://www.cs.brown.edu/research/aurora