Stream Data Management System Prototypes

Provided by: sia78
Learn more at: http://oak.cs.ucla.edu

1
Stream Data Management System Prototypes
  • Ying Sheng, Richard Sia
  • June 1, 2004
  • Professor Carlo Zaniolo
  • CS 240B
  • Spring 2004

2
Outline
  • Motivation of DSMS
  • Aurora (Brown, Brandeis, MIT)
  • Model
  • Operator Scheduling
  • Storage/Memory Management
  • QoS issue
  • STREAM (Stanford)
  • System Architecture
  • Query Language
  • Query Plans and Execution
  • Performance Issues
  • Approximation Techniques
  • STREAM Interface
  • Conclusion

3
Motivation
  • HADP → DAHP (human-active, DBMS-passive to
    DBMS-active, human-passive)
  • Continuous data and static queries
  • Monitoring using sensors
  • Military
  • Traffic
  • Environment
  • Financial analysis
  • Object tracking

4
Aurora
5
Aurora Model
  • General Purpose DSMS
  • Continuous stream data arrives
  • Flows through a set of operators
  • Output goes to an application or is materialized

6
Aurora Model
  • Components
  • Storage manager
  • Scheduler
  • Load Shedder
  • Router
  • QoS Monitor
  • GUI

7
Aurora Model
  • 3 kinds of queries supported
  • Continuous
  • View
  • Ad-Hoc Query

8
Aurora Model
  • 8 primitive operators (boxes)
  • Windowed: Slide, Tumble, Latch, Resample
  • Non-windowed: Filter, Map, GroupBy, Join

9
Aurora Operator Optimization
  • Each operator (box) b is associated with
  • Selectivity: s(b), sel(b)
  • Computation time: c(b), cost(b)
  • General Optimization Techniques
  • Pushing projection upstream
  • Combining boxes
  • Reordering boxes

10
Aurora Operator Optimization
  • Case 1: cost of a → b
  • c(a) + s(a)c(b)
  • Case 2: cost of b → a
  • c(b) + s(b)c(a)
  • Criterion for switching box positions
  • c(a) + s(a)c(b) > c(b) + s(b)c(a)

[Figure: the two box orderings, a → b and b → a]
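
The reordering criterion on this slide can be checked with a few lines of arithmetic. The sketch below assumes the two boxes commute (for example, two independent filters), so that swapping them does not change the result. The function names and the cost and selectivity numbers are hypothetical and chosen only for illustration.

```python
def pipeline_cost(c_first, s_first, c_second):
    """Expected per-tuple cost of running one box and then another.

    Every tuple pays c_first; only the fraction s_first that survives
    the first box goes on to pay c_second.
    """
    return c_first + s_first * c_second


def should_swap(c_a, s_a, c_b, s_b):
    """True when the order b -> a is strictly cheaper than a -> b."""
    return pipeline_cost(c_a, s_a, c_b) > pipeline_cost(c_b, s_b, c_a)


# Hypothetical numbers: a is an expensive, unselective box;
# b is a cheap filter that drops 90% of its input.
c_a, s_a = 10.0, 0.9
c_b, s_b = 1.0, 0.1

cost_ab = pipeline_cost(c_a, s_a, c_b)  # c(a) + s(a)c(b)
cost_ba = pipeline_cost(c_b, s_b, c_a)  # c(b) + s(b)c(a)
print(cost_ab, cost_ba, should_swap(c_a, s_a, c_b, s_b))
```

With these numbers the cheap, selective filter belongs first, because it removes most tuples before the expensive box sees them.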
11
Aurora Operator Scheduling
  • Scheduling by OS
  • One thread per box; scheduling is left to the OS
  • Easier to program
  • Aurora Scheduler
  • Single thread for the scheduler
  • The scheduler picks the box with the highest
    priority and calls it to consume tuples from its
    queue
  • Allows finer control of resources
  • Scalable

12
Aurora Operator Scheduling
13
Aurora Operator Scheduling
  • Problem: which box to execute next?
  • Min-Cost (MC)
  • Reduce computation cost
  • Min-Latency (ML)
  • Return result as soon as possible
  • Min-Memory (MM)
  • Reduce memory usage of queue

14
Aurora Operator Scheduling
  • Example

[Figure: example network of boxes b1 to b6 between the input streams and the application, with the downstream direction marked]
15
Aurora Operator Scheduling
  • Min-Cost
  • Objective: avoid the overhead of calling boxes
  • Min-Latency
  • Prefer the box that can deliver tuples to the
    output in the shortest time
  • Min-Memory
  • Give preference to the box that consumes the most
    tuples for the least computation time
  • Similar to Chain operator scheduling
  • More at: Operator Scheduling in a Data Stream
    Manager, VLDB 2003

16
Aurora Storage/Memory Management
  • Manage the queue in front of each box
  • 2 boxes sharing the same queue
  • windowed operator
  • The initial queue size is 128 KB
  • Queues are managed as a circular queue
  • On overflow the queue size is doubled; when the
    queue is underused it is halved
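
A minimal sketch of a circular queue that grows and shrinks as described above. It is not Aurora's implementation: capacity is counted in tuples rather than bytes, the class and method names are hypothetical, and the shrink rule (halve when at most a quarter full) is an assumption, since the slide only says "vice versa".

```python
class CircularQueue:
    """Ring buffer that doubles on overflow and halves when underused."""

    def __init__(self, capacity=4):
        self._min_capacity = capacity
        self._buf = [None] * capacity
        self._head = 0   # index of the oldest element
        self._size = 0   # number of stored elements

    def capacity(self):
        return len(self._buf)

    def __len__(self):
        return self._size

    def _resize(self, new_capacity):
        # Copy the live elements, oldest first, into a fresh buffer.
        old = self._buf
        live = [old[(self._head + i) % len(old)] for i in range(self._size)]
        self._buf = live + [None] * (new_capacity - self._size)
        self._head = 0

    def enqueue(self, item):
        if self._size == len(self._buf):
            self._resize(2 * len(self._buf))  # overflow: double
        tail = (self._head + self._size) % len(self._buf)
        self._buf[tail] = item
        self._size += 1

    def dequeue(self):
        if self._size == 0:
            raise IndexError("dequeue from empty queue")
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        # Assumed shrink rule: halve when at most a quarter full,
        # but never go below the initial capacity.
        cap = len(self._buf)
        if cap > self._min_capacity and self._size <= cap // 4:
            self._resize(cap // 2)
        return item


q = CircularQueue(capacity=4)
for i in range(5):
    q.enqueue(i)          # the fifth enqueue triggers a doubling
print(q.capacity(), len(q))
```

Shrinking at a quarter full rather than half full avoids resizing back and forth when the queue length hovers around a boundary.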

17
Aurora Storage/Memory Management
  • Swap in/out between memory / disk based on
    priority of boxes using it
  • Work with Operator Scheduler to exchange box
    priority and buffer-state information
  • Connection Point Management
  • A B-tree indexed on timestamp is built to support
    random access of tuples by ad-hoc query

18
Aurora Storage/Memory Management
19
Aurora QoS Issue
  • Different queries/applications have different QoS
    requirements
  • Stock market monitoring
  • Average temperature of a set of sensors
  • QoS Graph

20
Latency-based QoS Graph
[Figure: QoS plotted against latency (time), with a critical point marked; annotations show est(b), eol(b), latency(b) and cost(D(b)) for a box b and D(b)]
21
Aurora QoS-driven Scheduling
  • Assign a priority to each box based on
  • priority(b) = (utility(b), est(b))
  • utility(b) = gradient(eol(b))
  • How fast the QoS is degrading at the time the
    tuple would leave the system if we process it now
  • est(b)
  • How soon it will exhibit another performance
    degradation if we don't process it now
  • Performance
  • 200 queries/applications, each with 5 boxes
  • Round robin: 0.43
  • QoS-driven scheduling: 0.85
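
One plausible reading of the priority pair above is a lexicographic order: prefer the box with the steepest QoS gradient at its expected output latency, and break ties by the smallest est(b). The sketch below assumes exactly that, plus a piecewise-linear QoS graph. The function names, the graph, and the box values are hypothetical and not taken from Aurora.

```python
from bisect import bisect_right


def qos_gradient(graph, latency):
    """Magnitude of the slope of a piecewise-linear QoS graph at `latency`.

    `graph` is a list of (latency, qos) points sorted by latency.
    Outside the listed points the graph is treated as flat.
    """
    xs = [x for x, _ in graph]
    i = bisect_right(xs, latency)
    if i == 0 or i == len(graph):
        return 0.0
    (x0, y0), (x1, y1) = graph[i - 1], graph[i]
    return abs((y1 - y0) / (x1 - x0))


def schedule_order(boxes, graph):
    """Order box names by (utility descending, est ascending).

    `boxes` maps a box name to its (eol, est) pair.
    """
    def key(name):
        eol, est = boxes[name]
        return (-qos_gradient(graph, eol), est)

    return sorted(boxes, key=key)


# Hypothetical QoS graph: full QoS up to latency 2 (the critical
# point), then a linear drop to zero at latency 6.
graph = [(0.0, 1.0), (2.0, 1.0), (6.0, 0.0)]
boxes = {"b1": (1.0, 5.0), "b2": (3.0, 2.0), "b3": (4.0, 0.5)}
print(schedule_order(boxes, graph))  # ['b3', 'b2', 'b1']
```

Here b1 is still on the flat part of the graph, so delaying it costs nothing yet; b2 and b3 have equal utility, and b3 goes first because its est is smaller.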

22
Aurora Current Status
  • Main components of a DSMS are introduced
  • Operator scheduler
  • Memory/storage management
  • QoS concept in a stressed environment
  • Load shedding
  • Implemented in C++, with Java-based GUI
  • Depends on a few external software libraries
  • More?
  • Distributed architecture: Aurora*
  • Fault tolerance or disaster recovery?

23
STREAM
24
STREAM Introduction
  • General-purpose prototype DSMS
  • Supports data streams and stored relations
  • Declarative language for registering continuous
    queries
  • Flexible query plans and execution strategies
  • Aggressive sharing of state and computation among
    queries

25
STREAM Introduction
  • Designed to cope with
  • Stream rates that may be high, variable, bursty
  • Continuous query loads that may be high, volatile
  • Primary coping techniques
  • Graceful approximation as necessary
  • Careful resource allocation and use
  • Continuous self-monitoring and reoptimization

26
STREAM System Architecture
[Figure: STREAM system architecture, showing the DSMS with its scratch store and stored relations]
27
STREAM Query Language
  • Continuous Query Language (CQL)
  • Extends SQL with
  • Streams as a new data type
  • Stream: unbounded bag of pairs <tuple, timestamp>
  • Relation: time-varying bag of tuples
  • Continuous instead of one-time semantics
  • Three classes of operators
  • Relation-to-relation
  • Stream-to-relation
  • Relation-to-stream

28
STREAM CQL Operators
  • Relation-to-relation
  • SQL constructs
  • Stream-to-relation
  • Tuple-based sliding window: [Rows N], [Rows
    Unbounded]
  • Time-based sliding window: [Range T], [Now]
  • Partitioned sliding window: [Partition By A1, ...,
    Ak Rows N]
  • Relation-to-stream
  • Istream: insert stream
  • Dstream: delete stream
  • Rstream: relation stream
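
The operator classes above can be illustrated on a tiny example. The sketch assumes the usual CQL reading: a [Rows N] window yields the relation holding the last N stream tuples, Istream emits tuples present now but not at the previous instant, Dstream emits tuples that just left, and Rstream emits the whole current relation. Bags are modelled with collections.Counter; the function names and the sample stream are hypothetical.

```python
from collections import Counter


def rows_window(stream, n, t):
    """Relation at time t for a [Rows n] window (n >= 1).

    `stream` is a list of (tuple, timestamp) pairs in timestamp order.
    """
    seen = [tup for tup, ts in stream if ts <= t]
    return Counter(seen[-n:])


def istream(prev, curr):
    """Tuples in the relation now that were not there before."""
    return curr - prev   # bag difference keeps positive counts only


def dstream(prev, curr):
    """Tuples that were in the relation before and are gone now."""
    return prev - curr


def rstream(curr):
    """The entire current relation."""
    return Counter(curr)


stream = [("a", 1), ("b", 2), ("c", 3)]
prev = rows_window(stream, 2, 2)   # {a, b}
curr = rows_window(stream, 2, 3)   # {b, c}
print(istream(prev, curr), dstream(prev, curr), rstream(curr))
```

When "c" arrives at time 3, the window slides: Istream reports "c", Dstream reports "a", and Rstream reports both "b" and "c".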

29
STREAM Example Query 1
  • Two example streams
  • Orders (orderID, customer, cost)
  • Fulfillments (orderID, clerk)
  • Total cost of orders fulfilled over the last day
    by clerk "Sue" for customer "Joe"
  • Select Sum(O.cost)
  • From Orders O, Fulfillments F [Range 1 Day]
  • Where O.orderID = F.orderID And F.clerk = "Sue"
    And O.customer = "Joe"

30
STREAM Example Query 2
  • Using a 10% sample of the Fulfillments stream,
    take the 5 most recent fulfillments for each
    clerk and return the maximum cost
  • Select F.clerk, Max(O.cost)
  • From Orders O, Fulfillments F [Partition By clerk
    Rows 5] 10% Sample
  • Where O.orderID = F.orderID
  • Group By F.clerk

31
STREAM Simplified Query 2
  • Result is a relation, updated as stream elements
    arrive
  • Select F.clerk, Max(O.cost)
  • From O, F [Rows 100]
  • Where O.orderID = F.orderID
  • Group By F.clerk

32
STREAM Simplified Query 2
  • Result is streamed: emits a <clerk, max> stream
    element whenever max changes for a clerk (or for
    a new clerk)
  • Select Istream(F.clerk, Max(O.cost))
  • From O, F [Rows 100]
  • Where O.orderID = F.orderID
  • Group By F.clerk

33
STREAM Example Query 3
  • Relation: CurPrice(stock, price)
  • Average price over the last day for each stock
  • Select stock, Avg(price)
  • From Istream(CurPrice) [Range 1 Day]
  • Group By stock
  • Istream provides the history of CurPrice
  • Window on the history (back to a relation), then
    group and aggregate

34
STREAM Query plans and Execution
  • When a continuous query is registered, a query
    plan is generated
  • New plan is merged with existing plans
  • Users can also create and manipulate plans
    directly
  • Plans are composed of three main components
  • Operators
  • Flag: insertion (+), deletion (-)
  • Elements: tuple-timestamp-flag triples
  • Streams: only + elements
  • Relations: both + and - elements
  • Queues
  • Enforce nondecreasing timestamps (heartbeats)
  • Mechanisms for buffering tuples
  • States (Synopses)
  • Global scheduler for plan execution

35
STREAM States
  • States (Synopses)
  • Summarize elements seen so far (exact or
    approximate) for operators requiring history
  • To implement windows
  • Example: synopsis join
  • Sliding-window join
  • Approximation of full join

36
STREAM Simple Query Plan
Select * From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
37
STREAM Performance Issues
  • Synopsis Sharing
  • Eliminate data redundancy
  • Exploiting Constraints
  • Selectively discard data to reduce state
  • Operator Scheduling
  • Reduce queue sizes

38
STREAM Synopsis Sharing
  • Eliminate redundancy by
  • replacing nearly identical synopses with
    lightweight stubs
  • using a single store to hold the actual tuples
  • The store tracks the progress of each stub and
    presents the appropriate view to each stub
  • The store contains the union of the contents of
    its corresponding stubs
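
The store-and-stub idea can be sketched with two tuple-based windows over the same stream. This is a simplified illustration, not STREAM's implementation: it assumes all stubs read at the same pace (so per-stub progress tracking is omitted), and the class and method names are hypothetical.

```python
from collections import deque


class SharedStore:
    """Holds the actual tuples once; stubs expose windowed views."""

    def __init__(self):
        self._tuples = deque()
        self._stubs = []

    def stub(self, rows):
        """Register a lightweight view over the last `rows` tuples."""
        if rows < 1:
            raise ValueError("rows must be at least 1")
        s = WindowStub(self, rows)
        self._stubs.append(s)
        return s

    def insert(self, tup):
        self._tuples.append(tup)
        # Keep only what the widest stub still needs, which is the
        # union of all stub views for nested windows.
        widest = max((s.rows for s in self._stubs), default=0)
        while len(self._tuples) > widest:
            self._tuples.popleft()

    def view(self, rows):
        return list(self._tuples)[-rows:]

    def __len__(self):
        return len(self._tuples)


class WindowStub:
    """Stores no tuples itself; delegates to the shared store."""

    def __init__(self, store, rows):
        self._store = store
        self.rows = rows

    def contents(self):
        return self._store.view(self.rows)


store = SharedStore()
wide = store.stub(rows=5)     # stands in for S1 [Rows 1000]
narrow = store.stub(rows=2)   # stands in for S1 [Rows 200]
for x in range(8):
    store.insert(x)
print(wide.contents(), narrow.contents(), len(store))
```

Both windows are served from a single copy of the data, so memory use is that of the widest window rather than the sum of the two.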

39
STREAM Synopsis Sharing
Select * From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
Select A, Max(B) From S1 [Rows 200] Group By A
40
STREAM Exploiting Constraints
  • Specify an adherence parameter k to capture how
    closely a given stream or set of streams adheres
    to a constraint of that type
  • Referential integrity k-constraint
  • Ordered-arrival k-constraint
  • Clustered-arrival k-constraint
  • Query execution plans reduce or eliminate state
    based on k-constraints
  • If a constraint is violated, the result is
    approximate

41
STREAM Operator Scheduling
  • Goal: minimize total queue size for
    unpredictable, bursty stream arrival patterns
  • Chain Scheduling Algorithm
  • Step 1: mark the first operator in the plan as
    the current operator
  • Step 2: find the block of consecutive operators
    starting at the current operator that maximizes
    the reduction in total queue size per unit time
  • Step 3: mark the first operator following this
    block as the current operator and repeat Step 2
    until all operators have been assigned to chains
  • Chains are scheduled according to the greedy
    algorithm, but within a chain, execution proceeds
    in FIFO order
  • Proven to be within a constant factor of any
    clairvoyant strategy (i.e., the optimal strategy
    based on knowledge of future input) for some
    queries
  • Empirical results: large savings over naive
    strategies for many queries
  • But minimizing queue sizes is at odds with
    minimizing latency
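
The chain-construction steps above can be sketched for a single linear plan. The sketch assumes each operator is described by a per-tuple processing time (greater than zero) and a selectivity, and measures "reduction in queue size per unit time" as the drop in the surviving fraction of a tuple divided by the time spent. The function name and the operator numbers are hypothetical.

```python
def build_chains(ops):
    """Partition a linear operator plan into chains.

    `ops` is a list of (time, selectivity) pairs in plan order.
    Returns (start, end) index pairs with `end` exclusive.
    """
    # After the first i operators, one input tuple has cost times[i]
    # and has shrunk to the fraction sizes[i].
    times, sizes = [0.0], [1.0]
    for t, s in ops:
        times.append(times[-1] + t)
        sizes.append(sizes[-1] * s)

    chains = []
    cur = 0                      # Step 1: start at the first operator
    while cur < len(ops):
        best_end, best_slope = None, None
        # Step 2: find the block with the steepest size drop per unit time.
        for end in range(cur + 1, len(ops) + 1):
            slope = (sizes[cur] - sizes[end]) / (times[end] - times[cur])
            if best_slope is None or slope >= best_slope:
                best_end, best_slope = end, slope   # ties: longer block
        chains.append((cur, best_end))
        cur = best_end           # Step 3: continue after the block
    return chains


# Hypothetical plan: a weak filter, a strong filter, then a slow operator.
ops = [(1.0, 0.9), (1.0, 0.1), (2.0, 0.5)]
print(build_chains(ops))  # [(0, 2), (2, 3)]
```

The first two operators form one chain because running them together shrinks a tuple to 9% of its size in 2 time units, a better rate than the first operator alone. A greedy scheduler would then favour the chain with the steeper drop whenever it has input waiting.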

42
STREAM Approximation
  • CPU-Limited Approximation
  • Insufficient CPU time to process each stream
    element due to the high data arrival rate.
  • Load shedding
  • Sampling operators
  • Approximate by probabilistically dropping
    elements before they are processed
  • Memory-Limited Approximation
  • The total state required for all registered
    queries exceeds available memory.
  • The system selectively shrinks or discards
    synopses.
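
A sampling operator for CPU-limited approximation can be sketched as a probabilistic filter placed in front of the query. The rule used here for picking the keep probability (capacity divided by arrival rate, capped at 1) is an assumption for illustration, and the function names are hypothetical.

```python
import random


def keep_probability(arrival_rate, capacity):
    """Fraction of elements the system can afford to process.

    Both arguments are in elements per second. Assumed rule: keep
    everything when under capacity, else keep capacity / arrival_rate.
    """
    if arrival_rate <= 0:
        return 1.0
    return min(1.0, capacity / arrival_rate)


def shed_load(stream, p, rng=None):
    """Yield each element with probability p, dropping the rest."""
    rng = rng or random.Random()
    for element in stream:
        if rng.random() < p:
            yield element


p = keep_probability(arrival_rate=10_000, capacity=1_000)
kept = list(shed_load(range(1000), p, random.Random(0)))
print(p, len(kept))
```

Because elements are dropped before they are processed, aggregates computed downstream are approximate; a sum or count can be scaled by 1/p to estimate the true value.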

43
STREAM Query Interface
  • View the structure of query plans and their
    component entities.
  • View the detailed properties of each entity.
  • Dynamically adjust entity properties.
  • View monitoring graphs that display time-varying
    entity properties plotted dynamically against
    time.
  • Queue sizes, throughput, overall memory usage,
    and join selectivity.

44
STREAM Query Plan Monitoring
45
STREAM Current Status
  • Version 1.0 up and running
  • Includes a new monitoring and adaptive query
    processing infrastructure: StreaMon
  • Executor: runs query plans to produce results
  • Profiler: collects and maintains statistics about
    stream and plan characteristics
  • Reoptimizer: ensures that the plans and memory
    structures are the most efficient for current
    characteristics
  • Web demo available at
    http://shark.stanford.edu:8080/
  • Future Directions
  • Distributed Stream Processing
  • Crash Recovery
  • Improved Approximation
  • Classification of Applications

46
Conclusion
  • Ideal DSMS
  • Well defined and flexible query language
  • User-friendly interface
  • Scalable
  • Operator scheduling
  • Storage management
  • Synopsis sharing
  • Approximation
  • Quality assurance
  • Fault tolerant

47
References
  • R. Motwani et al., Query Processing,
    Approximation, and Resource Management in a Data
    Stream Management System, in Proceedings of the
    1st CIDR Conference, 2003.
  • S. Madden et al., Continuously Adaptive
    Continuous Queries over Streams, in Proceedings
    of the SIGMOD Conference, 2002.
  • D. Carney et al., Monitoring Streams - A New
    Class of Data Management Applications, in
    Proceedings of the VLDB Conference, 2002.
  • D. Carney et al., Operator Scheduling in a Data
    Stream Manager, in Proceedings of the VLDB
    Conference, 2003.
  • Stanford STREAM Project Website:
    http://www-db.stanford.edu/stream/index.html
  • Aurora Project Website:
    http://www.cs.brown.edu/research/aurora

48
End