Data Streams and Continuous Query Systems

1
Data Streams and Continuous Query Systems

CS 240B Professor Zaniolo
Eric Sytwu
Joseph Joswig

2
Outline

Review of Data Streams
NiagaraCQ
TelegraphCQ
Conclusion
Bibliography

3
Data Sets vs. Data Streams

Data Sets
Infrequently changing data
Ex. Employee personnel table, contact database, library system
Data Streams
Data arriving continuously
Ex. Stock streamer, sensor networks, weather monitoring system

4
Traditional Database Query

In a traditional query, the query engine returns
a subset of the data that is currently in the
system.

[Diagram: the End User / Application sends a Query to the Query Processor, which evaluates it over Static Data Sets and returns Results.]

5
Continuous Queries

Continuous queries are persistent queries that
allow users to get new results as new information
enters the system.
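
A minimal sketch of the idea, assuming a toy in-memory engine: a continuous query is registered once and then fires for every matching row that arrives later. The names `ContinuousQuery`, `StreamEngine`, `register`, and `push` are hypothetical and do not come from NiagaraCQ or TelegraphCQ.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ContinuousQuery:
    """A persistent query: a predicate plus a callback fired for each new match."""

    def __init__(self, predicate: Callable[[Row], bool],
                 on_result: Callable[[Row], None]) -> None:
        self.predicate = predicate
        self.on_result = on_result


class StreamEngine:
    """Hypothetical engine that keeps standing queries and checks them on arrival."""

    def __init__(self) -> None:
        self.queries: List[ContinuousQuery] = []

    def register(self, query: ContinuousQuery) -> None:
        # The query is stored once and stays active for all future data.
        self.queries.append(query)

    def push(self, row: Row) -> None:
        # Each newly arrived row is evaluated against every standing query.
        for query in self.queries:
            if query.predicate(row):
                query.on_result(row)


results: List[Row] = []
engine = StreamEngine()
engine.register(ContinuousQuery(lambda r: r["symbol"] == "MSFT", results.append))
engine.push({"symbol": "MSFT", "price": 60})
engine.push({"symbol": "INTC", "price": 30})
engine.push({"symbol": "MSFT", "price": 48})
```

The user never re-issues the query; `results` grows as matching rows enter the system.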

6
NiagaraCQ

7
Goal

Allow users to obtain new results from a database
without having to issue the same query
repeatedly.
Develop a system that will allow a large number
of users to be able to register continuous
queries using a high level language like XML-QL

8
What's wrong with previous continuous query systems?

Previous group optimization efforts focused on finding an optimal plan for a small number of queries.
Computationally too expensive to handle a large number of queries
Not designed for the web, which is constantly changing

9
Benefits of NiagaraCQ

Based on group optimization
Grouped queries can share computation
Common execution plans of grouped queries reside
in memory, saving on I/O costs compared to
executing each query separately.

10
How do we get the benefits?

Incremental group optimization
Groups are created for existing queries according
to their signatures, which represent similar
structures among the queries.
Each individual query in a query group shares the
results from the execution of the group plan
When a new query is submitted, the group optimizer considers existing groups as potential optimization choices. The new query is merged into an existing group whose signature it matches.
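
A rough sketch of incremental grouping by signature, assuming a toy predicate syntax of the form `path op constant`. The parser, the `GroupOptimizer` class, and the `Quotes.Quote.Price` query are illustrative assumptions, not NiagaraCQ's actual code.

```python
import re
from typing import Dict, List, Tuple

# Assumed toy predicate syntax: <path> <op> <constant>, e.g. Quotes.Quote.Symbol = "INTC"
PREDICATE = re.compile(r'^\s*([\w.]+)\s*(=|<|>)\s*(.+?)\s*$')

Signature = Tuple[str, str]  # (path, operator); the constant is abstracted away


def parse(predicate: str) -> Tuple[str, str, str]:
    """Split a predicate into path, operator, and constant."""
    match = PREDICATE.match(predicate)
    if match is None:
        raise ValueError(f"unsupported predicate: {predicate!r}")
    path, op, const = match.groups()
    return path, op, const.strip('"')


class GroupOptimizer:
    """Hypothetical optimizer: one group per signature, extended incrementally."""

    def __init__(self) -> None:
        # signature -> constant table rows of (constant, query id)
        self.groups: Dict[Signature, List[Tuple[str, str]]] = {}

    def add_query(self, query_id: str, predicate: str) -> Signature:
        path, op, const = parse(predicate)
        sig = (path, op)
        # Reuse the matching group if it exists; otherwise start a new one.
        self.groups.setdefault(sig, []).append((const, query_id))
        return sig


optimizer = GroupOptimizer()
optimizer.add_query("q1", 'Quotes.Quote.Symbol = "INTC"')
optimizer.add_query("q2", 'Quotes.Quote.Symbol = "MSFT"')
optimizer.add_query("q3", "Quotes.Quote.Price > 50")
```

Queries `q1` and `q2` differ only in their constant, so they land in the same group; `q3` has a different signature and gets its own group. Existing groups are never rebuilt when a query arrives.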

11
Example

[Figure: an XML-QL query and its expression signature, Quotes.Quote.Symbol in quotes.xml = constant]

12
Query Plan

Query plans (one per query, read left to right):
Quotes.xml -> File Scan -> Select Symbol = INTC -> Trigger Action I
Quotes.xml -> File Scan -> Select Symbol = MSFT -> Trigger Action J

13
Group Plan

Group plan (read left to right):
File Scan (Quotes.xml) + File Scan (Constant Table) -> Join (Symbol = Constant value) -> Split -> Trigger Action I, Trigger Action J
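
The group plan can be sketched as a join between the scanned quotes and the constant table, followed by a split that hands each match to the right trigger action. This is a simplified in-memory illustration; the function name `run_group_plan` and the sample quotes (including the IBM row) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Quote = Dict[str, object]

# Constant table: one row per query in the group (constant value, destination).
constant_table: List[Tuple[str, str]] = [
    ("INTC", "Trigger Action I"),
    ("MSFT", "Trigger Action J"),
]


def run_group_plan(quotes: List[Quote],
                   constants: List[Tuple[str, str]]) -> Dict[str, List[Quote]]:
    """Join quotes with the constant table on Symbol, then split by destination."""
    # Build side of the join: index the constant table by constant value.
    by_constant: Dict[str, List[str]] = defaultdict(list)
    for value, destination in constants:
        by_constant[value].append(destination)

    # Probe side plus split: one scan of the quotes serves every query in the group.
    outputs: Dict[str, List[Quote]] = defaultdict(list)
    for quote in quotes:
        for destination in by_constant.get(quote["Symbol"], []):
            outputs[destination].append(quote)
    return dict(outputs)


quotes = [
    {"Symbol": "INTC", "Price": 30},
    {"Symbol": "MSFT", "Price": 60},
    {"Symbol": "IBM", "Price": 90},
]
outputs = run_group_plan(quotes, constant_table)
```

Adding a query to the group only adds a row to the constant table; the scan and join are shared.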
14
Materialized Intermediate Files

Query split with intermediate files (read left to right):
Split -> File_i -> File Scan -> Trigger_Act_i
...
Split -> File_j -> File Scan -> Trigger_Act_j

15
General Selection Predicates

Attribute op Constant
Attribute: path expression without wildcards
Op: =, <, >

16
Join Operators

A join signature in our approach contains the names of the two data sources and the predicate for the join. Queries with the same join signature are grouped together.

17
Processing Continuous Queries

1. The Continuous Query Manager (CQM) adds continuous queries with file and timer information to enable the Event Detector (ED) to monitor events.
2. ED asks the Data Manager (DM) to monitor changes to files.
3. When a timer event happens, ED asks DM for the last modified time of files.
4. DM informs ED of changes to push-based data sources.
5. If file changes and timer events are satisfied, ED provides CQM with a list of firing CQs.
6. CQM invokes the Query Engine (QE) to execute firing CQs.
7. The file scan operator calls DM to retrieve selected documents.
8. DM only returns data changes between the last fire time and the current fire time.
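
A simplified sketch of steps 3, 5, and 8 for a timer-based query, assuming integer timestamps and an in-memory change log. The class and method names are hypothetical, and the CQM and QE roles are collapsed into the return value of `on_timer`.

```python
from typing import Dict, List, Tuple


class DataManager:
    """Toy DM: keeps a timestamped change log per file (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._log: Dict[str, List[Tuple[int, str]]] = {}

    def record_change(self, file: str, time: int, change: str) -> None:
        self._log.setdefault(file, []).append((time, change))

    def last_modified(self, file: str) -> int:
        return max((t for t, _ in self._log.get(file, [])), default=-1)

    def changes_between(self, file: str, last_fire: int, now: int) -> List[str]:
        # Step 8: only changes after the last fire time and up to now.
        return [c for t, c in self._log.get(file, []) if last_fire < t <= now]


class EventDetector:
    """Toy ED: on each timer event, decides which registered CQs fire."""

    def __init__(self, dm: DataManager) -> None:
        self.dm = dm
        self.file_of: Dict[str, str] = {}
        self.last_fire: Dict[str, int] = {}

    def register(self, query_id: str, file: str, now: int = 0) -> None:
        self.file_of[query_id] = file
        self.last_fire[query_id] = now

    def on_timer(self, now: int) -> Dict[str, List[str]]:
        fired: Dict[str, List[str]] = {}
        for query_id, file in self.file_of.items():
            last = self.last_fire[query_id]
            # Step 3: ask DM for the last modified time of the file.
            if self.dm.last_modified(file) > last:
                delta = self.dm.changes_between(file, last, now)
                if delta:
                    # Step 5: this CQ is on the firing list, with its delta.
                    fired[query_id] = delta
                    self.last_fire[query_id] = now
        return fired


dm = DataManager()
ed = EventDetector(dm)
ed.register("cq1", "quotes.xml")
dm.record_change("quotes.xml", 5, "INTC=30")
first = ed.on_timer(10)
second = ed.on_timer(20)
dm.record_change("quotes.xml", 25, "INTC=31")
third = ed.on_timer(30)
```

The second timer event fires nothing because the file has not changed since the first fire; the third returns only the new change.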

18
Experimental results

Performed on a Sun Ultra 6000 with 1 GB RAM running JDK 1.2 on Solaris 2.6

19
Experimental Results
20
Analysis of NiagaraCQ

Pros
Scalable to a large number of queries and users
Works with both change-based and timer-based continuous queries
Better performance: less I/O is required to execute queries.
Cons
No dynamic regrouping; over time, groups become suboptimal
Assumes queries have a common structure, which is not always the case
Incremental grouping works only for select and join as of now. Eventually, aggregation may be included.

21
Niagara in Review

The goal was to develop an Internet-scale continuous query system using group optimization, based on the assumption that many continuous queries on the Internet will have some similarities.
Proposed a novel incremental grouping methodology
Supports both timer-based and change-based queries.

22
TelegraphCQ
23
TelegraphCQ Design Overview

Focus: continuously adaptive query processing of high-volume, highly variable data streams.
Large scale
Deeply networked nature
Unpredictability of the environment
Need for close user interaction
Data constantly moving and changing

24
TelegraphCQ Restrictions

Data is pushed to the query processor
Data arrival rate can be high and bursty
On-the-fly processing: data can be stored, but real-time, one-pass analysis is important
Ordering of data is of significant importance.

25
Design Goals

scheduling and resource management for groups of
queries
support for out-of-core (non main memory) data
variable adaptivity
dynamic QoS support
parallel cluster-based processing and distributed
computation.

26
TelegraphCQ

Complete redesign and re-implementation of the Telegraph system, with a focus on support for shared, continuous query processing over query and data streams.
The name distinguishes it from the Telegraph project's broader focus on adaptive dataflow in general, and emphasizes the challenges we are addressing in our new implementation.

27
Telegraph Module Types

Ingress and Caching
Interface with external data sources
TeSS: HTML/XML screen scraper
TelNape: interfaces with popular P2P networks
Local caching to hide network delays
Query Processing
Pipelined, non-blocking versions of standard relational operators such as joins, selections, projections, grouping and aggregation, and duplicate elimination
State Modules (SteMs)
Adaptive Routing
Ability to re-optimize the plan on a continuous basis while a query is running
Eddies
Flux (Fault-tolerant, Load-balancing eXchange)
Opaque dataflow module; handles buffering and reordering of streams

28
Eddies

Role: continuously route tuples among a set of other modules according to a routing policy
Intercept tuples and choose the order in which they travel between modules
An Eddy can shut down each module when the end of all of its input streams is reached and the modules have completed current processing.
Not designed as a general-purpose scheduler; no enforcement of resource management policies
Multiple eddies run as parallel threads on queries with disjoint sets of tables and streams.
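
A toy illustration of per-tuple routing, assuming the modules are simple filters and using a greedy policy (send each tuple to the pending filter with the lowest observed pass rate). This greedy policy is a stand-in chosen for brevity, not the routing policy used by Telegraph's eddies; all names are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Operator:
    """A filter module that records how many tuples it has seen and passed."""

    def __init__(self, name: str, predicate: Callable[[Row], bool]) -> None:
        self.name = name
        self.predicate = predicate
        self.seen = 0
        self.passed = 0

    def pass_rate(self) -> float:
        return self.passed / self.seen if self.seen else 0.0

    def apply(self, row: Row) -> bool:
        self.seen += 1
        ok = self.predicate(row)
        self.passed += int(ok)
        return ok


class Eddy:
    """Routes each tuple through all operators, choosing the order per tuple."""

    def __init__(self, operators: List[Operator]) -> None:
        self.operators = operators

    def route(self, row: Row) -> bool:
        pending = list(self.operators)
        while pending:
            # Greedy policy (assumption): try the most selective operator first.
            op = min(pending, key=lambda o: o.pass_rate())
            pending.remove(op)
            if not op.apply(row):
                return False  # tuple dropped; remaining operators are skipped
        return True


price_op = Operator("price > 50", lambda r: r["price"] > 50)
symbol_op = Operator("symbol = MSFT", lambda r: r["symbol"] == "MSFT")
eddy = Eddy([price_op, symbol_op])

rows = [
    {"symbol": "MSFT", "price": 60},
    {"symbol": "INTC", "price": 70},
    {"symbol": "IBM", "price": 80},
    {"symbol": "MSFT", "price": 40},
]
output = [row for row in rows if eddy.route(row)]
```

Because the order is chosen per tuple from running statistics, the plan adapts while the query runs: once the symbol filter proves more selective, it is tried first and the price filter sees fewer tuples.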

29
Adaptive Processing w/ Eddies & SteMs

SteM: a temporary repository of tuples, essentially corresponding to half of a traditional join operator.
It stores homogeneous tuples (i.e., tuples spanning the same set of tables) formed during query processing.
Supports insert (build), search (probe), and optionally delete (eviction) operations.
Two kinds of tuples can be routed to a SteM:
When a tuple t in T (a build tuple) is routed to SteM_T, t is added to the set of tuples in SteM_T.
When a tuple p not in T (a probe tuple) is routed to SteM_T, SteM_T returns concatenated matches for it to the Eddy. These concatenated matches are the tuples in p join SteM_T that satisfy all query predicates that can be evaluated on the columns in p and T.

[Figure: an Eddy routing streams S and T through SteM_S and SteM_T. Edge labels: S build, T build, S probe, T probe, ST matches.]
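
The build and probe operations can be sketched as a symmetric hash join over two streams S and T, with one SteM per stream. This is a minimal illustration assuming an equality join on a single key; the `SteM` class here and the `symmetric_join` driver (which plays the Eddy's role) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]


class SteM:
    """Half of a join: stores tuples of one source, indexed on the join key."""

    def __init__(self, key: str) -> None:
        self.key = key
        self._index: Dict[object, List[Row]] = defaultdict(list)

    def build(self, row: Row) -> None:
        self._index[row[self.key]].append(row)

    def probe(self, row: Row) -> List[Row]:
        # Return stored tuples whose key matches the probe tuple.
        return list(self._index.get(row[self.key], []))

    def evict(self, row: Row) -> None:
        bucket = self._index.get(row[self.key], [])
        if row in bucket:
            bucket.remove(row)


def symmetric_join(arrivals: List[Tuple[str, Row]], key: str) -> List[Tuple[Row, Row]]:
    """Each arriving tuple is built into its own SteM, then probes the other one."""
    stems = {"S": SteM(key), "T": SteM(key)}
    matches: List[Tuple[Row, Row]] = []
    for source, row in arrivals:
        other = "T" if source == "S" else "S"
        stems[source].build(row)
        for stored in stems[other].probe(row):
            # Normalize output as (S tuple, T tuple).
            matches.append((row, stored) if source == "S" else (stored, row))
    return matches


arrivals = [
    ("S", {"k": 1, "s": "a"}),
    ("T", {"k": 1, "t": "x"}),
    ("T", {"k": 2, "t": "y"}),
    ("S", {"k": 2, "s": "b"}),
]
matches = symmetric_join(arrivals, "k")
```

Because every tuple is built before it probes, each matching pair is produced exactly once regardless of arrival order.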
30
Fjords

Inter Module Communications API
Form the links between modules
Supports a mixture of Push (streaming) and Pull
(static) operations for query plans
Allows modules to ignore the specifics of the
data source.
Supports non-Blocking dequeue operations
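
A small sketch of a queue that hides whether its data is pushed or pulled, with a non-blocking dequeue. It uses Python's standard `queue` module; the class name `FjordQueue` and its methods are assumptions for illustration, not the Fjords API.

```python
import queue
from typing import Iterator, Optional


class FjordQueue:
    """Link between two modules; the consumer cannot tell push from pull."""

    def __init__(self, pull_source: Optional[Iterator] = None) -> None:
        self._buffer: queue.Queue = queue.Queue()
        self._pull_source = pull_source  # static source read on demand, if any

    def enqueue(self, item: object) -> None:
        # Push side: a streaming producer hands over data as it arrives.
        self._buffer.put(item)

    def dequeue(self) -> Optional[object]:
        """Non-blocking: return an item, or None if nothing is available now.

        This sketch uses None as the 'no data' signal, so None cannot be a
        payload.
        """
        try:
            return self._buffer.get_nowait()
        except queue.Empty:
            pass
        if self._pull_source is not None:
            # Pull side: fetch the next item from the static source on demand.
            return next(self._pull_source, None)
        return None


push_queue = FjordQueue()
empty_read = push_queue.dequeue()
push_queue.enqueue("tuple-a")
pushed_read = push_queue.dequeue()

pull_queue = FjordQueue(pull_source=iter([1, 2]))
pulled = [pull_queue.dequeue(), pull_queue.dequeue(), pull_queue.dequeue()]
```

A consumer that gets None can yield control and work on another input instead of stalling on a slow stream.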

31
System Specifications

Built on the PostgreSQL platform
Process-per-connection model
Coded in C/C++

32
Example: Landmark Query

The input windows of these queries have a fixed beginning point in the timeline and a forward-moving endpoint.
Example: Select all the days after the hundredth trading day on which the closing price of MSFT has been greater than 50. Keep this query standing in the system for a thousand trading days.
SELECT closingPrice, timestamp
FROM ClosingStockPrices
WHERE stockSymbol = 'MSFT'
AND closingPrice > 50.00
for (t = 101; t <= 1100; t++) {
WindowIs(ClosingStockPrices, 101, t)
}

Sample data (stockSymbol, timestamp, closingPrice):
MSFT 101 60
MSFT 102 48
MSFT 103 52
MSFT 104 60
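
To make the window semantics concrete, here is a sketch that evaluates the landmark query over the four sample rows in plain Python. It recomputes each window from scratch for clarity; the helper names are hypothetical and this is not how TelegraphCQ executes the query.

```python
from typing import Dict, List, Tuple

# (stockSymbol, timestamp, closingPrice), taken from the slide's sample data
prices: List[Tuple[str, int, float]] = [
    ("MSFT", 101, 60),
    ("MSFT", 102, 48),
    ("MSFT", 103, 52),
    ("MSFT", 104, 60),
]


def window_is(stream: List[Tuple[str, int, float]], start: int, end: int):
    """Rows whose timestamp falls in the inclusive window [start, end]."""
    return [row for row in stream if start <= row[1] <= end]


def landmark_query(stream, start: int, last: int) -> Dict[int, List[Tuple[float, int]]]:
    """For each t, the window begins at the fixed landmark and ends at t."""
    results = {}
    for t in range(start, last + 1):
        window = window_is(stream, start, t)
        results[t] = [
            (price, ts)
            for symbol, ts, price in window
            if symbol == "MSFT" and price > 50.00
        ]
    return results


landmark_results = landmark_query(prices, start=101, last=104)
```

The result set only grows over time: the window's left edge stays at day 101 while the right edge advances.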
33
Example: Sliding Window Query

The input windows of these queries have forward-moving beginning and end points.
Example: On every third trading day starting today, calculate the average closing price of MSFT for the three most recent trading days. Keep the query standing for fifty trading days.
SELECT AVG(closingPrice)
FROM ClosingStockPrices
WHERE stockSymbol = 'MSFT'
for (t = ST; t < ST + 50; t += 3) {
WindowIs(ClosingStockPrices, t - 2, t)
}

Sample data (stockSymbol, timestamp, closingPrice):
MSFT 101 60
MSFT 102 48
MSFT 103 52
MSFT 104 56
MSFT 105 55
MSFT 106 58
MSFT 107 52
MSFT 108 60
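
The same kind of sketch for the sliding window query, using the eight sample rows. The start time ST = 103 and the shortened horizon of 6 days are assumptions chosen so that every window is fully covered by the sample data.

```python
from typing import Dict, List, Tuple

# (stockSymbol, timestamp, closingPrice), taken from the slide's sample data
prices: List[Tuple[str, int, float]] = [
    ("MSFT", 101, 60), ("MSFT", 102, 48), ("MSFT", 103, 52), ("MSFT", 104, 56),
    ("MSFT", 105, 55), ("MSFT", 106, 58), ("MSFT", 107, 52), ("MSFT", 108, 60),
]


def sliding_query(stream, st: int, horizon: int = 50,
                  step: int = 3, width: int = 3) -> Dict[int, float]:
    """Average MSFT close over [t - (width - 1), t] for t = st, st + step, ..."""
    results = {}
    for t in range(st, st + horizon, step):
        window = [
            price
            for symbol, ts, price in stream
            if symbol == "MSFT" and t - (width - 1) <= ts <= t
        ]
        if window:  # skip windows that contain no data
            results[t] = sum(window) / len(window)
    return results


# Assumed ST = 103 and horizon = 6, so t takes the values 103 and 106.
averages = sliding_query(prices, st=103, horizon=6)
```

Both window edges move: the window for t = 103 covers days 101 to 103, and the window for t = 106 covers days 104 to 106.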
34
Example: Temporal Band Join Query

These queries join tuples in one stream with tuples in another based on timestamp.
Example: For the five most recent trading days starting today, select all stocks that closed higher than MSFT on a given day. Keep the query standing for twenty trading days.
SELECT c2.*
FROM ClosingStockPrices as c1,
ClosingStockPrices as c2
WHERE c1.stockSymbol = 'MSFT' and
c2.stockSymbol != 'MSFT' and
c2.closingPrice > c1.closingPrice and
c2.timestamp = c1.timestamp
for (t = ST; t < ST + 20; t++) {
WindowIs(c1, t - 4, t)
WindowIs(c2, t - 4, t)
}
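
A sketch of one iteration of the band join for a single value of t. The sample rows are invented for illustration (the slide gives none for this query), and the code assumes at most one MSFT row per timestamp.

```python
from typing import List, Tuple

Row = Tuple[str, int, float]  # (stockSymbol, timestamp, closingPrice)

# Hypothetical sample data, not from the slides.
closing_prices: List[Row] = [
    ("MSFT", 1, 50), ("INTC", 1, 55), ("IBM", 1, 45),
    ("MSFT", 2, 60), ("INTC", 2, 58), ("IBM", 2, 61),
    ("INTC", 9, 99),
]


def band_join(stream: List[Row], t: int, width: int = 5) -> List[Row]:
    """Stocks that closed higher than MSFT on the same day, within [t - 4, t]."""
    window = [row for row in stream if t - (width - 1) <= row[1] <= t]
    # c1 side: MSFT closing price per timestamp inside the window.
    msft_close = {ts: price for symbol, ts, price in window if symbol == "MSFT"}
    # c2 side: non-MSFT rows joined on equal timestamp with a higher close.
    return [
        (symbol, ts, price)
        for symbol, ts, price in window
        if symbol != "MSFT" and ts in msft_close and price > msft_close[ts]
    ]


winners = band_join(closing_prices, t=5)
```

The row for day 9 falls outside the window for t = 5, so it is ignored even though its price is the highest.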

35
Pros and Cons of System

Pros
Focus on extreme adaptability
New code is multithreaded to boost system parallelism and enhance performance, particularly in multiprocessor scenarios.
Cons
Code is not fully multithreaded (existing PostgreSQL code)
Queries are separated into classes for processing based on disjoint footprints.
Still in early development stages
Issues still need to be solved
No extensive performance analysis

36
TelegraphCQ Future Work

Egress Modules
Include fault tolerance in delivery of results, i.e., in mobile networks
Improved interface with overlay networks
Cluster and Distributed Implementations
Extension of the Flux module
Integration with the TAG system

37
Conclusion and Review

NiagaraCQ
NiagaraCQ is a system that establishes
scalability with a general strategy of
incremental group optimization.
TelegraphCQ
TelegraphCQ is a system that combines prior work in Fjords, Eddies, and PSoup in order to query streaming data at large scale.
Other Data Streaming solutions
Aurora
STREAM
StreamMill

38
Thank You for your time!
39
Bibliography

J. Chen, D. DeWitt, F. Tian, Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proc. of the ACM SIGMOD Conf. on Management of Data, 2000.
Xiaoning Wang. NiagaraCQ presentation. www.cs.wpi.edu/~cs561/s03/talks/niagara-cq.ppt
Chandrasekaran, et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. UC Berkeley. CIDR Conference, 2003.
