Title: Data Streams and Continuous Query Systems
1Data Streams and Continuous Query Systems
- CS 240B Professor Zaniolo
- Eric Sytwu
- Joseph Joswig
2Outline
- Review of Data Streams
- NiagaraCQ
- TelegraphCQ
- Conclusion
- Bibliography
3Data Sets VS Data Streams
- Data Sets
- Infrequently changing data
- Ex. Employee personnel table, contact database, library system
- Data Streams
- Data arriving continuously
- Ex. Stock streamer, sensor networks, weather monitoring system
4Traditional Database Query
- In a traditional query, the query engine returns
a subset of the data that is currently in the
system.
[Figure: the End User / Application sends a Query to the Query Processor, which evaluates it over Static Data Sets and returns Results.]
5Continuous Queries
- Continuous queries are persistent queries that
allow users to get new results as new information
enters the system.
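The contrast with a one-shot query can be sketched in a few lines of Python. This is only an illustration, not how any of the systems below are implemented: the class `ContinuousQueryEngine`, its methods, and the sample rows are all hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


class ContinuousQueryEngine:
    """Toy engine (hypothetical): registered queries persist and see every new row."""

    def __init__(self) -> None:
        self._queries: List[Tuple[Callable[[Row], bool], Callable[[Row], None]]] = []

    def register(self, predicate: Callable[[Row], bool],
                 on_result: Callable[[Row], None]) -> None:
        # The query is stored once and stays active for all future data.
        self._queries.append((predicate, on_result))

    def push(self, row: Row) -> None:
        # Each arriving row is checked against every standing query.
        for predicate, on_result in self._queries:
            if predicate(row):
                on_result(row)


engine = ContinuousQueryEngine()
msft_alerts: List[Row] = []
engine.register(lambda r: r["symbol"] == "MSFT" and r["price"] > 50,
                msft_alerts.append)

# Made-up rows arriving over time; the user never re-issues the query.
for row in [
    {"symbol": "MSFT", "price": 48},
    {"symbol": "MSFT", "price": 52},
    {"symbol": "INTC", "price": 60},
]:
    engine.push(row)
```

After the three rows arrive, `msft_alerts` holds only the MSFT row priced above 50.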
6NiagaraCQ
7Goal
- Allow users to obtain new results from a database without having to issue the same query repeatedly.
- Develop a system that allows a large number of users to register continuous queries using a high-level language like XML-QL.
8What's wrong with previous continuous querying systems?
- Previous group optimization efforts focused on finding an optimal plan for a small number of queries.
- Computationally too expensive to handle a large number of queries
- Not designed for the web, which is constantly changing
9Benefits of NiagaraCQ
- Based on group optimization
- Grouped queries can share computation
- Common execution plans of grouped queries reside
in memory, saving on I/O costs compared to
executing each query separately.
10How do we get the benefits?
- Incremental group optimization
- Groups are created for existing queries according to their signatures, which represent similar structures among the queries.
- Each individual query in a query group shares the results from the execution of the group plan.
- When a new query is submitted, the group optimizer considers existing groups as potential optimization choices and merges the new query into an existing group whose signature matches.
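A minimal sketch of signature-based incremental grouping, assuming a signature is just the (source, path, operator) triple of a selection predicate with the constant dropped. The classes `SelectionQuery` and `IncrementalGrouper`, and the `Quotes.Quote.Price` path, are hypothetical names for illustration and are not taken from NiagaraCQ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Signature = Tuple[str, str, str]


@dataclass(frozen=True)
class SelectionQuery:
    query_id: str
    source: str      # e.g. "quotes.xml"
    path: str        # e.g. "Quotes.Quote.Symbol"
    op: str          # "=", "<" or ">"
    constant: Any

    def signature(self) -> Signature:
        # Assumption: the signature keeps the structure and drops the constant.
        return (self.source, self.path, self.op)


class IncrementalGrouper:
    def __init__(self) -> None:
        self.groups: Dict[Signature, List[SelectionQuery]] = {}

    def add(self, query: SelectionQuery) -> bool:
        """Add one query; return True if it joined an existing group."""
        sig = query.signature()
        joined_existing = sig in self.groups
        # Existing groups are never rebuilt; the new query is merged in or
        # starts a new group of its own.
        self.groups.setdefault(sig, []).append(query)
        return joined_existing


grouper = IncrementalGrouper()
joined = [
    grouper.add(SelectionQuery("q_i", "quotes.xml", "Quotes.Quote.Symbol", "=", "INTC")),
    grouper.add(SelectionQuery("q_j", "quotes.xml", "Quotes.Quote.Symbol", "=", "MSFT")),
    grouper.add(SelectionQuery("q_k", "quotes.xml", "Quotes.Quote.Price", ">", 50)),
]
```

Here `q_j` lands in the group opened by `q_i` because only their constants differ, while `q_k` starts a second group.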
11Example
- XML-QL query
- Expression signature: Quotes.Quote.Symbol in quotes.xml = constant
12Query Plan
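[Figure: two separate query plans over Quotes.xml. Plan 1: File Scan (Quotes.xml), then Select Symbol = "INTC", then Trigger Action I. Plan 2: File Scan (Quotes.xml), then Select Symbol = "MSFT", then Trigger Action J.]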
13Group Plan
[Figure: group plan. File Scan (Quotes.xml) and File Scan (Constant Table) feed a Join on Symbol = Constant value; a Split operator then routes the join output to Trigger Action I and Trigger Action J.]
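The group plan can be sketched as a join between the scanned quotes and a constant table, followed by a split. This is a simplified illustration under stated assumptions: the function `run_group_plan`, the dictionary layout of the constant table, and the sample quotes are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def run_group_plan(quotes: List[dict],
                   constant_table: Dict[str, List[str]]) -> Dict[str, List[dict]]:
    """Evaluate all grouped selection queries in one pass over the quotes."""
    outputs: Dict[str, List[dict]] = defaultdict(list)
    for quote in quotes:                                        # file scan
        destinations = constant_table.get(quote["Symbol"], [])  # join on Symbol = constant
        for query_id in destinations:                           # split to each query
            outputs[query_id].append(quote)
    return dict(outputs)


# Assumed layout: constant value -> ids of the queries that asked for it.
constant_table = {"INTC": ["query_i"], "MSFT": ["query_j"]}
quotes = [
    {"Symbol": "INTC", "Price": 30},
    {"Symbol": "MSFT", "Price": 52},
    {"Symbol": "IBM", "Price": 80},
]
routed = run_group_plan(quotes, constant_table)
```

The quotes are scanned once regardless of how many queries are in the group; adding a query only adds a row to the constant table.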
14Materialized Intermediate Files
- Query split with intermediate files
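[Figure: query split with intermediate files. The Split operator writes its output to File_i and File_j; a separate File Scan over each intermediate file feeds Trigger_Act_i and Trigger_Act_j.]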
15General Selection Predicates
- Attribute op Constant
- Attribute: path expression without wildcards
- Op: =, <, >
16Join Operators
- A join signature in our approach contains the names of the two data sources and the predicate for the join. Join queries with the same join signature are grouped together.
17Processing Continuous Queries
- 1. The Continuous Query Manager (CQM) adds continuous queries with file and timer information so that the Event Detector (ED) can monitor events.
- 2. ED asks the Data Manager (DM) to monitor changes to files.
- 3. When a timer event happens, ED asks DM for the last modified time of files.
- 4. DM informs ED of changes to push-based data sources.
- 5. If file changes and timer events are satisfied, ED provides CQM with a list of firing CQs.
- 6. CQM invokes the Query Engine (QE) to execute firing CQs.
- 7. The file scan operator calls DM to retrieve selected documents.
- 8. DM returns only the data changes between the last fire time and the current fire time.
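The last step, returning only the changes between two fire times, can be sketched as follows. The classes `DataManager` and `TimerQuery`, the integer timestamps, and the document names are hypothetical and only mimic the roles described above.

```python
from typing import List, Tuple


class DataManager:
    """Toy stand-in for DM: remembers when each document changed."""

    def __init__(self) -> None:
        self._changes: List[Tuple[int, str]] = []

    def record_change(self, timestamp: int, document: str) -> None:
        self._changes.append((timestamp, document))

    def changes_between(self, last_fire: int, current_fire: int) -> List[str]:
        # Only changes after the previous firing and up to the current one.
        return [doc for ts, doc in self._changes if last_fire < ts <= current_fire]


class TimerQuery:
    """Toy timer-based continuous query that tracks its own last fire time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.last_fire = 0

    def fire(self, dm: DataManager, now: int) -> List[str]:
        delta = dm.changes_between(self.last_fire, now)
        self.last_fire = now
        return delta


dm = DataManager()
cq = TimerQuery("msft_watch")
dm.record_change(5, "quote_a")
dm.record_change(8, "quote_b")
first = cq.fire(dm, now=10)
dm.record_change(12, "quote_c")
second = cq.fire(dm, now=20)
third = cq.fire(dm, now=30)
```

Each firing sees only what changed since the previous one, so the third firing returns nothing.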
18Experimental results
- Performed on a Sun Ultra 6000 with 1GB RAM running JDK1.2 on Solaris 2.6
19Experimental Results
20Analysis of NiagaraCQ
- Pros
- Scalable to a large number of queries and users
- Works with both change-based and timer-based continuous queries
- Better performance: less I/O required to execute queries
- Cons
- No dynamic re-grouping; eventually, groups become sub-optimal
- Assumes queries have common structure, which is not always the case
- Incremental grouping works only for select and join as of now. Eventually, aggregation may be included.
21Niagara in Review
- The goal was to develop an Internet-scale continuous query system using group optimization, based on the assumption that many continuous queries on the Internet will have some similarities.
- Proposed a novel incremental grouping methodology
- Supports both timer-based and change-based queries.
22TelegraphCQ
23TelegraphCQ Design Overview
- Focus: continuously adaptive query processing of high-volume and highly variable data streams.
- Large scale
- Deeply networked nature
- Unpredictability of the environment
- Need for close user interaction
- Data constantly moving and changing
24TelegraphCQ Restrictions
- Data is pushed to the query processor
- Data arrival rate can be high and bursty
- On-the-fly processing: data can be stored, but real-time one-pass analysis is important
- Ordering of data is of significant importance.
25Design Goals
- Scheduling and resource management for groups of queries
- Support for out-of-core (non-main-memory) data
- Variable adaptivity
- Dynamic QoS support
- Parallel cluster-based processing and distributed computation
26TelegraphCQ
- Complete redesign and re-implementation of the Telegraph system, with a focus on support for shared, continuous query processing over query and data streams.
- The name distinguishes it from the Telegraph project's broader focus on adaptive dataflow in general and emphasizes the challenges addressed in the new implementation.
27Telegraph Module Types
- Ingress and Caching
- Interface with external data sources
- TeSS: HTML/XML screen scraper
- TelNape: interfaces with popular P2P networks
- Local caching to hide network delays
- Query Processing
- Pipelined, non-blocking versions of standard relational operators such as joins, selections, projections, grouping and aggregation, and duplicate elimination.
- State Modules (SteMs)
- Adaptive Routing
- Ability to re-optimize the plan on a continuous basis while a query is running.
- Eddies
- Flux (Fault-tolerant, Load-balancing eXchange): opaque dataflow module that handles buffering and reordering of streams
28Eddies
- Role: continuously route tuples among a set of other modules according to a routing policy
- Intercept tuples and choose the order in which they travel between modules
- An Eddy can shut down each module when the end of all of its input streams is reached and the modules have completed current processing.
- Not designed as a general-purpose scheduler; no enforcement of resource management policies
- Multiple eddies run as parallel threads on queries with disjoint sets of tables and streams.
29Adaptive Processing w/ Eddies & SteMs
- SteM: a temporary repository of tuples, essentially corresponding to half of a traditional join operator.
- It stores homogeneous tuples (i.e., tuples spanning the same set of tables) formed during query processing.
- Supports insert (build), search (probe), and optionally delete (eviction) operations.
- Two kinds of tuples can be routed to a SteM.
- When a tuple t in T (a build tuple) is routed to SteM_T, t is added to the set of tuples in SteM_T.
- When a tuple p not in T (a probe tuple) is routed to SteM_T, SteM_T returns concatenated matches for it to the Eddy. These concatenated matches are the tuples in p join SteM_T that satisfy all query predicates that can be evaluated on the columns in p and T.
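[Figure: an Eddy joining streams S and T through SteM_S and SteM_T. S tuples are built into SteM_S and probe SteM_T; T tuples are built into SteM_T and probe SteM_S; ST matches are returned to the Eddy.]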
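The build and probe operations can be sketched as a symmetric hash join driven by a very simple router. This is a toy under stated assumptions: the class `SteM`, the function `eddy_join`, and the equality-on-one-key predicate are hypothetical, and the routing order here is fixed, whereas a real Eddy chooses it adaptively.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


class SteM:
    """Toy state module: stores tuples of one source, indexed by the join key."""

    def __init__(self, key: str) -> None:
        self.key = key
        self._index: Dict[Any, List[Row]] = defaultdict(list)

    def build(self, t: Row) -> None:
        # Insert: remember the tuple for future probes.
        self._index[t[self.key]].append(t)

    def probe(self, p: Row) -> List[Row]:
        # Search: return stored tuples whose key equals the probe tuple's key.
        return list(self._index.get(p[self.key], []))


def eddy_join(arrivals: List[Tuple[str, Row]], key: str) -> List[Tuple[Row, Row]]:
    """Route each arriving tuple: build into its own SteM, probe the other one."""
    stem_s, stem_t = SteM(key), SteM(key)
    matches: List[Tuple[Row, Row]] = []
    for source, tup in arrivals:
        if source == "S":
            stem_s.build(tup)
            matches.extend((tup, m) for m in stem_t.probe(tup))
        else:
            stem_t.build(tup)
            matches.extend((m, tup) for m in stem_s.probe(tup))
    return matches


# Made-up interleaved arrivals from two streams.
arrivals = [
    ("S", {"k": 1, "a": "s1"}),
    ("T", {"k": 1, "b": "t1"}),
    ("T", {"k": 2, "b": "t2"}),
    ("S", {"k": 2, "a": "s2"}),
]
st_matches = eddy_join(arrivals, key="k")
```

Because every tuple is built before it probes the opposite SteM, each matching pair is produced exactly once, by whichever tuple arrives later.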
30Fjords
- Inter-module communications API
- Form the links between modules
- Support a mixture of push (streaming) and pull (static) operations for query plans
- Allow modules to ignore the specifics of the data source.
- Support non-blocking dequeue operations
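A non-blocking dequeue between two modules can be sketched with Python's standard `queue` module. The names `FjordQueue` and `run_select_once` are hypothetical and do not reflect the actual Fjords API; using `None` to signal "nothing available" is a simplification that prevents `None` itself from being sent.

```python
import queue
from typing import Any, Callable, Optional


class FjordQueue:
    """Toy link between modules with a dequeue that never blocks."""

    def __init__(self) -> None:
        self._q: "queue.Queue[Any]" = queue.Queue()

    def enqueue(self, item: Any) -> None:
        self._q.put(item)

    def dequeue(self) -> Optional[Any]:
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None  # nothing available right now; the caller keeps control


def run_select_once(inq: FjordQueue, outq: FjordQueue,
                    predicate: Callable[[Any], bool]) -> int:
    """Process whatever input is available, then return instead of waiting."""
    consumed = 0
    while True:
        item = inq.dequeue()
        if item is None:
            return consumed
        consumed += 1
        if predicate(item):
            outq.enqueue(item)


inq, outq = FjordQueue(), FjordQueue()
idle = run_select_once(inq, outq, lambda price: price > 50)   # empty input
for price in (48, 52, 60):                                    # pushed by a source
    inq.enqueue(price)
busy = run_select_once(inq, outq, lambda price: price > 50)
passed = [outq.dequeue(), outq.dequeue(), outq.dequeue()]
```

The selection module does not care whether its input was pushed by a stream or pulled from a static source; it only sees the queue.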
31System Specifications
- Built on the PostgreSQL platform
- Process-per-connection model
- Coded in C/C++
32Example Landmark Query
- The input windows of these queries have a fixed beginning point in the timeline and a forward-moving endpoint.
- Example: Select all the days after the hundredth trading day on which the closing price of MSFT has been greater than 50. Keep this query standing in the system for a thousand trading days.
- SELECT closingPrice, timestamp
- FROM ClosingStockPrices
- WHERE stockSymbol = 'MSFT'
- AND closingPrice > 50.00
- for (t = 101; t <= 1100; t++) {
-   WindowIs(ClosingStockPrices, 101, t)
- }
Sample stream (stockSymbol, timestamp, closingPrice):
MSFT 101 60
MSFT 102 48
MSFT 103 52
MSFT 104 60
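The landmark window can be simulated over the four sample rows. This sketch only mimics the window semantics in plain Python and says nothing about how TelegraphCQ evaluates the query; `window_is` and `landmark_query` are hypothetical helper names, and the loop is cut off at day 104 because that is where the sample data ends.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, float]  # (stockSymbol, timestamp, closingPrice)

STREAM: List[Row] = [
    ("MSFT", 101, 60.0),
    ("MSFT", 102, 48.0),
    ("MSFT", 103, 52.0),
    ("MSFT", 104, 60.0),
]


def window_is(stream: List[Row], start: int, end: int) -> List[Row]:
    """Rows whose timestamp falls in the inclusive window [start, end]."""
    return [row for row in stream if start <= row[1] <= end]


def landmark_query(stream: List[Row], start: int,
                   last_t: int) -> Dict[int, List[Tuple[float, int]]]:
    """For each t, evaluate the query over the window [start, t]."""
    results: Dict[int, List[Tuple[float, int]]] = {}
    for t in range(start, last_t + 1):
        window = window_is(stream, start, t)  # fixed start, moving end
        results[t] = [
            (price, ts)
            for symbol, ts, price in window
            if symbol == "MSFT" and price > 50.00
        ]
    return results


results = landmark_query(STREAM, start=101, last_t=104)
```

The result set only grows as t advances: day 102 (price 48) never qualifies, while days 101, 103 and 104 accumulate.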
33Example Sliding Window Query
- The input windows of these queries have forward-moving beginning and end points.
- Example: On every third trading day starting today, calculate the average closing price of MSFT for the three most recent trading days. Keep the query standing for fifty trading days.
- SELECT AVG(closingPrice)
- FROM ClosingStockPrices
- WHERE stockSymbol = 'MSFT'
- for (t = ST; t < ST + 50; t += 3) {
-   WindowIs(ClosingStockPrices, t - 2, t)
- }
Sample stream (stockSymbol, timestamp, closingPrice):
MSFT 101 60
MSFT 102 48
MSFT 103 52
MSFT 104 56
MSFT 105 55
MSFT 106 58
MSFT 107 52
MSFT 108 60
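The sliding window can be simulated over the eight sample rows. The start time ST = 103 is an assumption chosen so that the first window covers days 101 to 103, and the loop stops at day 108 where the sample data ends; `window_is` and `sliding_average` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, float]  # (stockSymbol, timestamp, closingPrice)

STREAM: List[Row] = [
    ("MSFT", 101, 60.0), ("MSFT", 102, 48.0), ("MSFT", 103, 52.0),
    ("MSFT", 104, 56.0), ("MSFT", 105, 55.0), ("MSFT", 106, 58.0),
    ("MSFT", 107, 52.0), ("MSFT", 108, 60.0),
]


def window_is(stream: List[Row], start: int, end: int) -> List[Row]:
    """Rows whose timestamp falls in the inclusive window [start, end]."""
    return [row for row in stream if start <= row[1] <= end]


def sliding_average(stream: List[Row], start_t: int, last_t: int,
                    step: int = 3, width: int = 3) -> Dict[int, float]:
    """Every `step` days, average MSFT over the `width` most recent days."""
    averages: Dict[int, float] = {}
    for t in range(start_t, last_t + 1, step):
        window = window_is(stream, t - (width - 1), t)  # both ends move forward
        prices = [price for symbol, _, price in window if symbol == "MSFT"]
        if prices:
            averages[t] = sum(prices) / len(prices)
    return averages


averages = sliding_average(STREAM, start_t=103, last_t=108)
```

With these assumptions the query fires on days 103 and 106, averaging days 101 to 103 and days 104 to 106 respectively.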
34Example Temporal Band Join Query
- These queries join tuples in one stream with tuples in another based on timestamp.
- Example: For the five most recent trading days starting today, select all stocks that closed higher than MSFT on a given day. Keep the query standing for twenty trading days.
- SELECT c2.*
- FROM ClosingStockPrices AS c1,
- ClosingStockPrices AS c2
- WHERE c1.stockSymbol = 'MSFT' AND
- c2.stockSymbol != 'MSFT' AND
- c2.closingPrice > c1.closingPrice AND
- c2.timestamp = c1.timestamp
- for (t = ST; t < ST + 20; t++) {
-   WindowIs(c1, t - 4, t)
-   WindowIs(c2, t - 4, t)
- }
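A single evaluation of the band join can be sketched with a nested loop over the two windows. The rows for INTC and IBM and all prices are made up for illustration, and `window_is` and `band_join` are hypothetical helper names; the sketch shows the predicate logic only, not an efficient join.

```python
from typing import List, Tuple

Row = Tuple[str, int, float]  # (stockSymbol, timestamp, closingPrice)

# Made-up closing prices for two trading days.
STREAM: List[Row] = [
    ("MSFT", 1, 50.0), ("INTC", 1, 55.0), ("IBM", 1, 45.0),
    ("MSFT", 2, 60.0), ("INTC", 2, 58.0), ("IBM", 2, 65.0),
]


def window_is(stream: List[Row], start: int, end: int) -> List[Row]:
    """Rows whose timestamp falls in the inclusive window [start, end]."""
    return [row for row in stream if start <= row[1] <= end]


def band_join(stream: List[Row], t: int, width: int = 5) -> List[Row]:
    """Stocks that closed higher than MSFT on the same day, within the window."""
    c1 = window_is(stream, t - (width - 1), t)
    c2 = window_is(stream, t - (width - 1), t)
    return [
        row2
        for row1 in c1
        for row2 in c2
        if row1[0] == "MSFT"
        and row2[0] != "MSFT"
        and row2[2] > row1[2]   # c2.closingPrice > c1.closingPrice
        and row2[1] == row1[1]  # c2.timestamp = c1.timestamp
    ]


higher_than_msft = band_join(STREAM, t=2)
```

On day 1 only INTC closed above MSFT, and on day 2 only IBM did, so those two rows are returned.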
35Pros and Cons of System
- Pros
- Focus on extreme adaptability
- New code is multithreaded to help boost system parallelism and enhance performance, particularly in multiprocessor scenarios.
- Cons
- Code not fully multithreaded (existing PostgreSQL code)
- Queries separated into classes for processing based on disjoint footprints.
- Still in early development stages
- Issues still need to be solved
- No extensive performance analysis
36TelegraphCQ Future Work
- Egress Modules
- Include fault tolerance in delivery of results, e.g., in mobile networks
- Improved interface with overlay networks
- Cluster and Distributed Implementations
- Extension of Flux module
- Integration with TAG system
37Conclusion and Review
- NiagaraCQ
- NiagaraCQ is a system that achieves scalability with a general strategy of incremental group optimization.
- TelegraphCQ
- TelegraphCQ is a system that combines prior work in Fjords, Eddies, and PSoup in order to query streaming data on large scales.
- Other Data Streaming solutions
- Aurora
- STREAM
- StreamMill
38Thank You for your time!
39Bibliography
- J. Chen, D. DeWitt, F. Tian, Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proc. of the ACM SIGMOD Conf. on Management of Data, 2000.
- Xiaoning Wang. NiagaraCQ presentation. www.cs.wpi.edu/cs561/s03/talks/niagara-cq.ppt
- Chandrasekaran, et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. UC Berkeley. CIDR Conference, 2003.