Title: Sangeetha Seshadri
1Data Stream Processing: An Overview
CS 4440 Lecture 6
- Sangeetha Seshadri
- sangeeta_at_cc.gatech.edu
2Agenda
- Data Streams
- What are they?
- Why now? Applications..
- DSMS Architecture Issues
- Query Processing
3Data Streams: What and Where?
- Continuous, unbounded, rapid, time-varying streams of data elements (tuples).
- Occur in a variety of modern applications
- Network monitoring and traffic engineering
- Sensor networks, RFID tags
- Telecom call records
- Financial applications
- Web logs and click-streams
- Manufacturing processes
- DSMS: Data Stream Management System
4 DBMS versus DSMS
DBMS
- Persistent relations
- One-time queries
- Random access
- Access plan determined by query processor and physical DB design
DSMS
- Transient streams (and persistent relations)
- Continuous queries
- Sequential access
- Unpredictable data characteristics and arrival patterns
5Continuous Queries
- One-time queries: Run once to completion over the current data set.
- Continuous queries: Issued once and then continuously evaluated over the data.
- Examples
- Notify me when the temperature drops below X
- Tell me when the price of stock Y > 300
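A minimal Python sketch of the difference: the query is registered once as a generator and then evaluated on every arriving tuple, rather than run once over a stored table. The `(symbol, price)` tuple layout and the sample quotes are assumptions made for illustration only.

```python
from typing import Callable, Iterable, Iterator, Tuple

Quote = Tuple[str, float]  # (symbol, price): a hypothetical tuple layout


def continuous_query(stream: Iterable[Quote],
                     predicate: Callable[[Quote], bool]) -> Iterator[Quote]:
    """Evaluate `predicate` on each arriving tuple and emit matches immediately."""
    for item in stream:      # the stream may be unbounded; it is never materialized
        if predicate(item):
            yield item       # the answer is itself a stream


# "Tell me when the price of stock Y > 300", issued once over made-up quotes.
quotes = [("Y", 295.0), ("Z", 310.0), ("Y", 301.5), ("Y", 299.0), ("Y", 320.0)]
alerts = list(continuous_query(quotes, lambda q: q[0] == "Y" and q[1] > 300))
```

Because `continuous_query` is a generator, the same code works unchanged when `quotes` is replaced by a source that never ends.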
6The (Simplified) Big Picture
DSMS
Scratch Store
Stored Relations
7(Simplified) Network Monitoring
Intrusion Warnings
Online Performance Metrics
Register Monitoring Queries
DSMS
Network measurements, Packet traces
Scratch Store
Lookup Tables
8Triggers?
- Recall triggers in traditional DBMSs?
- Why not use triggers to process continuous
queries over data streams?
9Making Things Concrete
[Figure: caller BOB, callee ALICE, and a DSMS receiving the two streams below]
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end
10Query 1 (self-join)
- Find all outgoing calls longer than 2 minutes
  SELECT O1.call_ID, O1.caller
  FROM Outgoing O1, Outgoing O2
  WHERE (O2.time - O1.time > 2
  AND O1.call_ID = O2.call_ID
  AND O1.event = start
  AND O2.event = end)
- Result requires unbounded storage
- Can provide result as data stream
- Can output after 2 min, without seeing end
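The last point can be sketched in Python: a call is reported as soon as any later tuple shows that more than 2 minutes have passed since its start, so the matching end tuple is not needed. This is only an illustration, assuming tuples arrive ordered by time and that time is measured in minutes; the function name and sample events are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time in minutes, event)


def long_calls(outgoing: Iterable[Event],
               limit: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Emit (call_ID, caller) once a call is known to exceed `limit` minutes."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start time)
    for call_id, caller, time, event in outgoing:  # assumed ordered by time
        if event == "start":
            open_calls[call_id] = (caller, time)
        # Any call still open that started more than `limit` ago already qualifies.
        overdue = [c for c, (_, start) in open_calls.items() if time - start > limit]
        for c in overdue:
            yield c, open_calls.pop(c)[0]
        if event == "end":
            open_calls.pop(call_id, None)  # short call, or one already reported


events = [
    (1, "bob", 0.0, "start"), (2, "carol", 0.5, "start"), (2, "carol", 1.5, "end"),
    (3, "dave", 2.5, "start"), (1, "bob", 5.0, "end"), (3, "dave", 5.5, "end"),
]
result = list(long_calls(events))
```

Here call 1 is reported when the tuple at time 2.5 arrives, well before its end tuple at time 5.0. State is kept only for calls that are open and not yet reported.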
11Query 2 (join)
- Pair up callers and callees
  SELECT O.caller, I.callee
  FROM Outgoing O, Incoming I
  WHERE O.call_ID = I.call_ID
- Can still provide result as data stream
- Requires unbounded temporary storage
- unless streams are near-synchronized
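One way to see where the temporary storage comes from is a symmetric hash join, sketched below in Python. The sketch assumes the two streams are interleaved into one tagged sequence and that each call_ID appears at most once per stream (for example, start events only), so state can be dropped after a match; both are simplifications, not part of the original query.

```python
from typing import Dict, Iterable, Iterator, Tuple

Tagged = Tuple[str, int, str]  # (stream tag "O" or "I", call_ID, caller or callee)


def pair_up(merged: Iterable[Tagged]) -> Iterator[Tuple[str, str]]:
    """Symmetric hash join on call_ID; yields (caller, callee) pairs."""
    waiting_out: Dict[int, str] = {}  # unmatched Outgoing tuples: call_ID -> caller
    waiting_in: Dict[int, str] = {}   # unmatched Incoming tuples: call_ID -> callee
    for tag, call_id, party in merged:
        if tag == "O":
            if call_id in waiting_in:
                yield party, waiting_in.pop(call_id)   # match found, free the state
            else:
                waiting_out[call_id] = party           # wait for the Incoming side
        else:
            if call_id in waiting_out:
                yield waiting_out.pop(call_id), party
            else:
                waiting_in[call_id] = party


merged = [("O", 1, "bob"), ("I", 1, "alice"), ("I", 2, "erin"),
          ("O", 3, "dave"), ("O", 2, "carol")]
pairs = list(pair_up(merged))
```

If matching tuples arrive close together (near-synchronized streams), the two dictionaries stay small. A tuple whose partner never arrives, such as call 3 above, stays in memory indefinitely, which is the unbounded-storage case.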
12Query 3 (group-by aggregation)
- Total connection time for each caller
  SELECT O1.caller, sum(O2.time - O1.time)
  FROM Outgoing O1, Outgoing O2
  WHERE (O1.call_ID = O2.call_ID
  AND O1.event = start
  AND O2.event = end)
  GROUP BY O1.caller
- Cannot provide result in (append-only) stream
- Output updates?
- Provide current value on demand?
- Memory?
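The "output updates" option can be sketched in Python as follows: each completed call produces a new (caller, total) pair that supersedes the previous value for that caller. The function name and sample events are hypothetical, and tuples are assumed to arrive ordered by time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit (caller, running total) each time one of the caller's calls ends."""
    started: Dict[int, Tuple[str, float]] = {}      # open calls: call_ID -> (caller, start)
    totals: Dict[str, float] = defaultdict(float)   # one running sum per caller
    for call_id, caller, time, event in outgoing:
        if event == "start":
            started[call_id] = (caller, time)
        elif event == "end" and call_id in started:
            who, start = started.pop(call_id)
            totals[who] += time - start
            yield who, totals[who]   # an update, not an append: replaces the old value


events = [
    (1, "bob", 0.0, "start"), (2, "carol", 1.0, "start"), (1, "bob", 3.0, "end"),
    (3, "bob", 4.0, "start"), (2, "carol", 6.0, "end"), (3, "bob", 9.0, "end"),
]
updates = list(connection_time_updates(events))
```

The `totals` dictionary grows with the number of distinct callers, which is the memory question raised above. Serving the current value on demand would mean reading `totals` instead of emitting every update.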
13DSMS Architecture Issues
- Data streams and stored relations: Architectural differences.
- Declarative language for registering continuous queries
- Flexible query plans and execution strategies
- Centralized? Distributed?
15DSMS Issues
- Relation: Tuple Set or Sequence?
- Updates: Modifications or Appends?
- Query Answer: Exact or Approximate?
- Query Evaluation: One or Multiple Passes?
- Query Plan: Fixed or Adaptive?
16Architectural Issues
DBMS
- Resource (memory, disk, per-tuple computation) rich
- Extremely sophisticated query processing, analysis
- Useful to audit query results of data stream systems.
- Query Evaluation: Arbitrary
- Query Plan: Fixed
DSMS
- Resource (memory, per-tuple computation) limited
- Reasonably complex, near real time, query processing
- Useful to identify what data to populate in database
- Query Evaluation: One pass
- Query Plan: Adaptive
N. Koudas, D. Srivastava (2003), AT&T Labs-Research
19STREAM System Challenges
- Must cope with
- Stream rates that may be high, variable, bursty
- Stream data that may be unpredictable, variable
- Continuous query loads that may be high, variable
- Overload: need to use resources very carefully.
- Changing conditions: adaptive strategy.
20Query Model
User/Application
DSMS
21Agenda
- Data Streams
- What are they?
- Why now? Applications..
- DSMS Architecture Issues
- Query Processing
- Language
- Operators
- Optimization
- Multi-Query Optimization
22Stream Query Language
- SQL extension
- Queries reference/produce relations or streams
- Examples: GSQL (Gigascope), CQL (STREAM)
[Figure: a stream query language takes a stream or finite relation as input and produces a stream or finite relation]
23Example: Continuous Query Language (CQL)
- Start with SQL
- Then add
- Streams as new data type
- Continuous instead of one-time semantics
- Windows on streams (derived from SQL-99)
- Sampling on streams (basic)
24Impact of Limited Memory
- Continuous streams grow unboundedly
- Queries may require unbounded memory
- One solution: Approximate query evaluation
25Approximate Query Evaluation
- Why?
- Handling load: streams coming too fast
- Avoid unbounded storage and computation
- Ad hoc queries need approximate history
- How? Sliding windows, synopses, samples, load shedding
- Major Issues?
- Metric for set-valued queries
- Composition of approximate operators
- How is it understood/controlled by user?
- Integrate into query language
- Query planning and interaction with resource allocation
- Accuracy-efficiency-storage tradeoff and global metric
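As one concrete instance of the "samples" technique, the Python sketch below keeps a fixed-size uniform random sample of a stream using reservoir sampling (Algorithm R). The slides do not prescribe a particular sampling method, so this choice is an assumption made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of at most k items using O(k) memory."""
    rng = random.Random(seed)
    sample: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randrange(n)     # uniform integer in [0, n)
            if j < k:
                sample[j] = item     # the new item is kept with probability k/n
    return sample


# Approximate the mean of a long stream from a 100-item synopsis.
synopsis = reservoir_sample(range(100_000), k=100, seed=42)
approx_mean = sum(synopsis) / len(synopsis)
```

Memory use depends only on `k`, not on the stream length. The price is that answers computed from the synopsis, such as `approx_mean`, are estimates.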
26Windows
- Mechanism for extracting a finite relation from an infinite stream
- Various window proposals for restricting operator scope.
- Windows based on ordering attribute (e.g. time)
- Windows based on tuple counts
- Windows based on explicit markers (e.g. punctuations)
- Variants (e.g., partitioning tuples in a window)
[Figure: window specifications turn a stream into finite relations, which are manipulated using SQL and then "streamified" back into a stream]
27Windows
[Figure: timeline from start time to current time with points t1 to t5, contrasting a sliding window with a tumbling window]
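The two window types can be sketched in Python for time-based windows. The `(timestamp, payload)` layout, the half-open interval conventions, and the function names are assumptions for illustration; tuples are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional, Tuple

Tup = Tuple[float, str]  # (timestamp, payload)


def tumbling_windows(stream: Iterable[Tup], width: float) -> Iterator[List[Tup]]:
    """Non-overlapping windows [k*width, (k+1)*width); empty windows are skipped."""
    current: List[Tup] = []
    index: Optional[int] = None
    for ts, value in stream:
        k = int(ts // width)
        if index is not None and k != index:
            yield current            # the previous window is complete
            current = []
        index = k
        current.append((ts, value))
    if current:
        yield current                # flush the last window (finite input only)


def sliding_windows(stream: Iterable[Tup], width: float) -> Iterator[List[Tup]]:
    """After each arrival, the tuples with timestamp in (ts - width, ts]."""
    window: Deque[Tup] = deque()
    for ts, value in stream:
        window.append((ts, value))
        while window[0][0] <= ts - width:
            window.popleft()         # expire tuples that fell out of the window
        yield list(window)


stream = [(0.0, "a"), (1.0, "b"), (2.5, "c"), (4.0, "d"), (4.5, "e")]
tumbling = [[v for _, v in w] for w in tumbling_windows(stream, 2.0)]
sliding = [[v for _, v in w] for w in sliding_windows(stream, 2.0)]
```

In the tumbling case each tuple belongs to exactly one window. In the sliding case the window moves with every arrival, so consecutive windows overlap.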
28Query Operators
- Selections - Where clause
- Projections - Select clause
- Joins - From clause
- Group-by (Aggregations) - Group-by clause
29Query Operators
- Selections and projections on streams: straightforward
- Local per-element operators
- Projection may need to include ordering attribute.
- Joins: Problematic
- May need to join tuples that are arbitrarily far apart.
- Equijoin on stream ordering attributes may be tractable.
- Majority of the work focuses on joins using windows.
30Blocking Operators
- Blocking
- No output until entire input seen
- Streams: input never ends
- Simple Aggregates: output update stream
- Set Output (sort, group-by)
- Root could maintain output data structure
- Intermediate nodes try non-blocking analogs
- Join
- Apply sliding-window restrictions
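A sliding-window restriction on a join can be sketched in Python as below: each side keeps only the tuples from the last `width` time units, so state stays bounded and output is produced incrementally. The tagged input format and the linear probe of the opposite window are simplifications assumed for this illustration; a real engine would typically index each window by join key.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Tagged = Tuple[str, float, int, str]  # (stream "R" or "S", timestamp, join key, payload)


def window_join(merged: Iterable[Tagged], width: float) -> Iterator[Tuple[str, str]]:
    """Join R and S tuples with equal keys whose timestamps differ by at most width."""
    windows: Dict[str, Deque[Tuple[float, int, str]]] = {"R": deque(), "S": deque()}
    for tag, ts, key, payload in merged:          # assumed ordered by timestamp
        other = "S" if tag == "R" else "R"
        for w in windows.values():
            while w and w[0][0] < ts - width:
                w.popleft()                       # expire tuples outside the window
        for _, other_key, other_payload in windows[other]:
            if other_key == key:                  # probe the opposite window
                yield (payload, other_payload) if tag == "R" else (other_payload, payload)
        windows[tag].append((ts, key, payload))


merged = [("R", 0.0, 1, "r1"), ("S", 1.0, 1, "s1"), ("R", 2.0, 2, "r2"),
          ("S", 5.0, 2, "s2"), ("S", 5.5, 1, "s3"), ("R", 6.0, 1, "r3")]
joined = list(window_join(merged, width=2.0))
```

The pair (r2, s2) shares a key but is 3 time units apart, so the window restriction drops it. That loss is what is traded for a non-blocking, bounded-memory join.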
31Optimization in DSMS
- Traditionally, table-based cardinalities are used in the query optimizer.
- Goal of query optimizer: Minimize the size of intermediate results.
- Problematic in a streaming environment: All streams are unbounded (infinite size)!
- Need novel optimization objectives that are relevant when the input sources are streams.
32Query Optimization in DSMS
- Novel notions of optimization
- Stream rate based, e.g. NiagaraCQ
- Resource-based, e.g. STREAM
- QoS based, e.g. Aurora
- Continuous adaptive optimization
- Possibility that objectives cannot be met
- Resource constraints
- Bursty arrivals under limited processing capabilities.
33Stream Projects
- Amazon/Cougar (Cornell): sensors
- Aurora (Brown/MIT): sensor monitoring, dataflow
- Hancock (AT&T): telecom streams
- Niagara (OGI/Wisconsin): Internet XML databases
- OpenCQ (Georgia): triggers, incr. view maintenance
- STREAM (Stanford): general-purpose DSMS
- Tapestry (Xerox): pub/sub content-based filtering
- Telegraph (Berkeley): adaptive engine for sensors
- Tribeca (Bellcore): network monitoring
34Optimizing Multiple Distributed Stream Queries Using Hierarchical Network Partitions
- Sangeetha Seshadri
- Jointly with Vibhore Kumar, Brian F. Cooper, Ling Liu and Karsten Schwan
- College of Computing
- Georgia Tech
- Yahoo! Research
- IPDPS 2007
- March 29th, 2007
35Talk Outline
- Motivation
- Challenges
- Our Approach
- Experimental Results
- Future Work
36Distributed Data Stream Systems
Can low-capacity flights be cancelled?
Flight information
What is the status of my flight?
Weather
Web sources
Centralized DB
Local Weather
Travel Agent
37Motivation
- Lots of data produced in lots of places
- Examples: operational information systems, scientific collaborations, web traffic data, financial applications
- Centralized processing does not scale
38Challenges
- Choosing efficient deployments.
- Fast and efficient initial deployments.
- Utilize reuse opportunities.
- Handling dynamic nature of system.
- Queries arrive or leave.
- Nodes join (recover) or leave (fail).
- Network conditions change.
- Data conditions (e.g. rate) change.
39Approach Outline
40Query Planning
[Figure: alternative join orders for SELECT * FROM A ⋈ B ⋈ C delivered to a sink: (A ⋈ B) ⋈ C built on A ⋈ B, or (B ⋈ C) ⋈ A built on B ⋈ C]
41Query Deployment
[Figure: operators A ⋈ B and (A ⋈ B) ⋈ C placed on a network of nodes N1 to N5, with sources A, B, C and sinks Sink1 to Sink5]
42An Illustrative Example..
SELECT * FROM A ⋈ C
SELECT * FROM A ⋈ B ⋈ C
43Why an integrated approach?
- Integrated approach decreases cost by > 50%
- Setup: 64-node network, 100 queries over 5 stream sources each. Y-axis represents communication costs.
44Problem
- Massive Search Space.
- Example: 5 stream sources, 64 nodes
- Approximately 2,880,000,000 plans considered.
- Lemma 1
- Our Solution
- Trade some optimality for smaller search space
45Solution
- Organize the nodes into a virtual Network Hierarchy.
- Operator reuse through Stream Advertisements
- Two approximation based algorithms
- Top-Down
- Bottom-Up
46Optimization Metric
- Minimize network usage
- Network usage: total amount of data in transit at any point in time.
- Encapsulates both bandwidth and latency of links.
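The slide gives no formula for this metric. One common formalization, assumed here, sums data rate times link latency over every link a stream crosses, which yields the amount of data in flight. The Python sketch below compares two hypothetical placements of a join A ⋈ B under that assumption; all node names, rates, and latencies are made up.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def network_usage(flows: List[Tuple[List[str], float]],
                  latency: Dict[Link, float]) -> float:
    """Sum of rate * latency over every link on every flow's path.

    flows: (path as a list of nodes, data rate in KB/s); latency in seconds.
    The result is the amount of data (KB) in transit at any instant.
    """
    total = 0.0
    for path, rate in flows:
        for hop in zip(path, path[1:]):   # consecutive node pairs along the path
            total += rate * latency[hop]
    return total


latency = {("A", "N1"): 0.010, ("B", "N1"): 0.030, ("N1", "Sink"): 0.020,
           ("A", "N2"): 0.030, ("B", "N2"): 0.010, ("N2", "Sink"): 0.020}

# Stream A is fast (100 KB/s), stream B is slow (10 KB/s), the join output is 5 KB/s.
join_at_n1 = network_usage(
    [(["A", "N1"], 100.0), (["B", "N1"], 10.0), (["N1", "Sink"], 5.0)], latency)
join_at_n2 = network_usage(
    [(["A", "N2"], 100.0), (["B", "N2"], 10.0), (["N2", "Sink"], 5.0)], latency)
```

With these numbers, placing the join near the high-rate source A gives the lower usage. This illustrates why operator placement and join ordering interact.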
47Network Hierarchy
- Cluster network nodes based on cost.
- User defined parameter maxcs
Coordinator Nodes
48Stream Advertisements for Reuse
[Figure: coordinator nodes advertise the streams available in their clusters, e.g. one advertises A, C and A ⋈ C while another advertises B]
49Optimization Algorithms
Top-Down
Bottom-Up
50Planning algorithms
[Figure: top-down planning of A ⋈ B ⋈ C ⋈ D over sources A, B, C, D, with the query split into sub-queries such as A ⋈ B and C ⋈ D]
51Top-Down Algorithm Features
- Reduced search space
- Search space reduced by a factor β.
- (h: height of hierarchy, N: network size, K: number of sources).
- User defined parameter maxcs allows tuning the trade-off between search space and sub-optimality.
- Operators re-used when beneficial through stream advertisements.
52Planning algorithms
[Figure: bottom-up planning of A ⋈ B ⋈ C ⋈ D over sources A, B, C, D, with the sub-query A ⋈ B deployed first]
53Bottom-Up Algorithm Features
- Reduced search space.
- Deploys only sub-queries within current cluster.
- Analytical bounds: Search space reduced by factor β.
- Operators re-used when beneficial.
- But, may choose sub-optimal join-orders.
54Experiments
- Simulation and prototype based experiments.
- 128-node network: Used GT-ITM internetwork topology generator.
- Uniformly random workload generator: 10 sources, 100 queries, 2-5 join operators, random sink placements.
55Cost with Bottom-Up Algorithm
56Comparison with existing approaches
57Comparison of Search Space
58Future Work
- We have built a prototype based on IFLOW, a distributed data stream system built at Georgia Tech.
- Aggregations
- Modifying existing deployments at runtime
- Relaxing filter conditions
- Modifying join ordering at runtime.
59Related Work
- Distributed query optimization
- Distributed INGRES, R*, SDD-1
- Stream data processing engines
- Centralized - STREAM, Aurora, TelegraphCQ
- Distributed - Borealis, Flux
60Conclusion
- Integrated approach to query optimization
- Hierarchical clustering of network and stream advertisements.
- Approximation based algorithms
- Top-Down
- Bottom-Up
- Design Highlights
- Trade some optimality for smaller search space.
- Decrease search space while offering bounds on the sub-optimality.
61For further information
- http://www.cc.gatech.edu/sangeeta
- Contact sangeeta_at_cc.gatech.edu
Thank You!
62Deployment Times
63Example
- Simple use-case for pushing down selections
- Query 1
  SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
  FROM FLIGHTS, CARRIER_CODES
  WHERE FLIGHTS.Departing = 'ATLANTA'
  AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
  AND FLIGHTS.Departure_terminal = 'TERMINAL SOUTH'
- Query 2
  SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
  FROM FLIGHTS, CARRIER_CODES
  WHERE FLIGHTS.Departing = 'ATLANTA'
  AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
  AND FLIGHTS.Departure_terminal = 'TERMINAL NORTH'
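The two queries differ only in the terminal predicate. One way to exploit this, sketched below in Python, is to evaluate the shared selection (Departing = 'ATLANTA') and the join with CARRIER_CODES once, then apply the remaining terminal selection per query. This reading of the use-case, the dictionary-based tuples, and the sample carrier names are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]


def shared_atlanta_join(flights: Iterable[Row],
                        carrier_codes: Dict[str, str]) -> Iterator[Row]:
    """Common sub-query: select Atlanta departures first, then join on carrier code."""
    for f in flights:
        # The selection is applied before the join, so fewer tuples reach the lookup.
        if f["Departing"] == "ATLANTA" and f["Carrier_Code"] in carrier_codes:
            yield {"Number": f["Number"], "Status": f["Status"],
                   "Name": carrier_codes[f["Carrier_Code"]],
                   "Departure_terminal": f["Departure_terminal"]}


def route_by_terminal(shared: Iterable[Row],
                      terminals: List[str]) -> Dict[str, List[Tuple[str, str, str]]]:
    """Apply each query's remaining terminal predicate to the shared stream."""
    results: Dict[str, List[Tuple[str, str, str]]] = {t: [] for t in terminals}
    for row in shared:
        terminal = row["Departure_terminal"]
        if terminal in results:
            results[terminal].append((row["Number"], row["Status"], row["Name"]))
    return results


carriers = {"XA": "Example Air", "XB": "Sample Jet"}   # made-up lookup table
flights = [
    {"Number": "101", "Status": "ON TIME", "Departing": "ATLANTA",
     "Carrier_Code": "XA", "Departure_terminal": "TERMINAL SOUTH"},
    {"Number": "202", "Status": "DELAYED", "Departing": "BOSTON",
     "Carrier_Code": "XA", "Departure_terminal": "TERMINAL SOUTH"},
    {"Number": "303", "Status": "BOARDING", "Departing": "ATLANTA",
     "Carrier_Code": "XB", "Departure_terminal": "TERMINAL NORTH"},
]
answers = route_by_terminal(shared_atlanta_join(flights, carriers),
                            ["TERMINAL SOUTH", "TERMINAL NORTH"])
```

Query 1 reads `answers["TERMINAL SOUTH"]` and Query 2 reads `answers["TERMINAL NORTH"]`, so the join work is done once for both.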
64The Big Picture
- Large number of possibilities
- System Model
- Stream processing systems (SQL-style queries)
- Pub-sub systems
- Runtime annotators (keyword-based queries).
- Trade-offs: Cost with
- Search space
- Reliability
- Availability.
- Adaptivity
- Admission Control
- Moving operators
- Dropping data
- Migrating plans.
65Real Enterprise Workload
- Delta Airlines Operational information system
- Q1 (15%): Terminal Overhead Display (Lifetime 12 hours)
- Q2 (80%): Gate Agent Query (Lifetime 2 hours)
- Q3 (5%): Ad-hoc flight status monitoring queries (Lifetime 6 hours)
66Real Enterprise Workload
67Backups
68Data Model
- Append-only
- Call records
- Updates
- Stock tickers
- Deletes
- Transactional data
- Meta-Data
- Control signals, punctuations
- System Internals: probably need all of the above
69Aurora/STREAM Overview
[Figure: input streams feed query plans with running, ready, and waiting operators and their synopses, producing output streams; applications register continuous queries, users issue continuous and ad-hoc queries, historical storage is attached, and an administrator monitors query execution and adjusts run-time parameters]
70Sliding Window Approximation
[Figure: sliding window over the bit stream 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1 0]
- Why?
- Approximation technique for bounded memory
- Natural in applications (emphasizes recent data)
- Well-specified and deterministic semantics
- Issues
- Extend relational algebra, SQL, query optimization
- Algorithmic work
- Timestamps?
71Adaptivity (Telegraph)
[Figure: input streams R, S, T pass through grouped filters (R.A) and (S.B) and STeMs for the join R x S x T, routed by an EDDY to the output queues]
- Runtime Adaptivity
- Multi-query Optimization
- Framework implements arbitrary schemes
72Query-Split Scheme (Niagara)
[Figure: query plan in which scans of Quotes.XML and a constant table (Symbol, Const.Value) feed a join, a split operator writes file i and file j, and scans of those files drive trig.Act.i and trig.Act.j]
- Aggregate subscription for efficiency
- Split: evaluate trigger only when file updated
- Triggers multi-query optimization
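73Shared Predicates: Niagara, Telegraph
Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9
[Figure: shared predicate index on R.A that groups the constants by operator: > (1, 7, 11), < (3, 5), = (6, 8), ≠ (9); the tuple A = 8 is probed against it]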
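A shared predicate index of this kind can be sketched in Python by grouping the constants per operator and keeping the range constants sorted, so one probe per tuple answers all registered predicates on the attribute. The class and method names are hypothetical, and the sketch is not taken from the Niagara or Telegraph code.

```python
import bisect
from typing import List, Set


class GroupedFilter:
    """Index of single-attribute predicates 'A op c', grouped by operator."""

    def __init__(self) -> None:
        self.gt: List[int] = []   # sorted constants c of predicates A > c
        self.lt: List[int] = []   # sorted constants c of predicates A < c
        self.eq: Set[int] = set()
        self.ne: Set[int] = set()

    def add(self, op: str, c: int) -> None:
        if op == ">":
            bisect.insort(self.gt, c)
        elif op == "<":
            bisect.insort(self.lt, c)
        elif op == "=":
            self.eq.add(c)
        elif op == "!=":
            self.ne.add(c)
        else:
            raise ValueError(f"unsupported operator: {op}")

    def matches(self, a: int) -> List[str]:
        """Return every registered predicate that the value `a` satisfies."""
        # A > c holds for all constants strictly below a: a prefix of the sorted list.
        hits = [f"A > {c}" for c in self.gt[:bisect.bisect_left(self.gt, a)]]
        # A < c holds for all constants strictly above a: a suffix of the sorted list.
        hits += [f"A < {c}" for c in self.lt[bisect.bisect_right(self.lt, a):]]
        if a in self.eq:
            hits.append(f"A = {a}")
        hits += [f"A != {c}" for c in sorted(self.ne) if c != a]
        return hits


index = GroupedFilter()
for op, c in [(">", 1), (">", 7), (">", 11), ("<", 3), ("<", 5),
              ("=", 6), ("=", 8), ("!=", 9)]:
    index.add(op, c)
satisfied = index.matches(8)   # probe with the tuple A = 8
```

For A = 8 the probe returns the predicates A > 1, A > 7, A = 8 and A != 9. The binary searches avoid evaluating each range predicate separately, which is the point of sharing predicates across many queries.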