Data Streams and Continuous Query Systems - PowerPoint PPT Presentation

Provided by: billj4
Learn more at: http://web.cs.ucla.edu

1
Data Streams and Continuous Query Systems
  • CS 240B Professor Zaniolo
  • Eric Sytwu
  • Joseph Joswig

2
Outline
  1. Review of Data Streams
  2. NiagaraCQ
  3. TelegraphCQ
  4. Conclusion
  5. Bibliography

3
Data Sets vs. Data Streams
  • Data Sets
  • Infrequently changing data
  • Ex. Employee personnel table, contact database,
    library system
  • Data Streams
  • Data arriving continuously
  • Ex. Stock streamer, sensor networks, weather
    monitoring system

4
Traditional Database Query
  • In a traditional query, the query engine returns
    a subset of the data that is currently in the
    system.

[Diagram: the End User / Application sends a Query to the Query Processor, which reads Static Data Sets and returns Results]
5
Continuous Queries
  • Continuous queries are persistent queries that
    allow users to get new results as new information
    enters the system.

6
NiagaraCQ
7
Goal
  • Allow users to obtain new results from a database
    without having to issue the same query
    repeatedly.
  • Develop a system that allows a large number of
    users to register continuous queries using a
    high-level language such as XML-QL.

8
What's wrong with previous continuous query
systems?
  • Previous group optimization efforts focused on
    finding an optimal plan for a small number of
    queries.
  • Computationally too expensive to handle a large
    number of queries
  • Not designed for the web, which is constantly
    changing

9
Benefits of NiagaraCQ
  • Based on group optimization
  • Grouped queries can share computation
  • Common execution plans of grouped queries reside
    in memory, saving on I/O costs compared to
    executing each query separately.

10
How do we get the benefits?
  • Incremental group optimization
  • Groups are created for existing queries according
    to their signatures, which represent similar
    structures among the queries.
  • Each individual query in a query group shares the
    results from the execution of the group plan
  • When a new query is submitted, the group
    optimizer considers existing groups as potential
    optimization choices; the new query is merged
    into an existing group whose signature matches.
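
The incremental grouping idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NiagaraCQ's implementation: the Signature, Group, and GroupOptimizer names are hypothetical, a signature is reduced to (data source, attribute path, operator), and news.xml is an invented second data source.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Signature:
    """Hypothetical expression signature: everything except the constant."""
    source: str
    path: str
    op: str


@dataclass
class Group:
    signature: Signature
    # Maps each constant value to the ids of the queries that use it.
    constants: dict = field(default_factory=dict)


class GroupOptimizer:
    """Hypothetical incremental optimizer: existing groups are never rebuilt."""

    def __init__(self):
        self.groups = {}

    def register(self, query_id, source, path, op, constant):
        sig = Signature(source, path, op)
        group = self.groups.get(sig)
        if group is None:
            # No existing group matches, so the query starts a new group.
            group = Group(sig)
            self.groups[sig] = group
        # Otherwise the query is merged into the matching group.
        group.constants.setdefault(constant, []).append(query_id)
        return group


opt = GroupOptimizer()
opt.register("q1", "quotes.xml", "Quotes.Quote.Symbol", "=", "INTC")
opt.register("q2", "quotes.xml", "Quotes.Quote.Symbol", "=", "MSFT")
opt.register("q3", "news.xml", "News.Item.Topic", "=", "earnings")  # invented source
print(len(opt.groups))
```

Here q1 and q2 differ only in their constant, so they land in the same group, while q3 starts a group of its own.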

11
Example
  • XML-QL query (shown on the original slide; not
    preserved in this transcript)
  • Expression signature:
    Quotes.Quote.Symbol in quotes.xml = constant
12
Query Plan
  • Query plan

[Diagram: two separate plans. Trigger Action I <- Select Symbol=INTC <- File Scan <- Quotes.xml; Trigger Action J <- Select Symbol=MSFT <- File Scan <- Quotes.xml]
13
Group Plan
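  • Group plan

[Diagram: one shared plan. File Scan of Quotes.xml and File Scan of the Constant Table feed a Join on Symbol = Constant value; a Split operator then routes the join results to Trigger Action I and Trigger Action J]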
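
A rough Python sketch of the group plan follows. It assumes the constant table maps each constant to the trigger actions that should receive matching tuples; the prices and the IBM row are invented sample data, and run_group_plan is a hypothetical name, not part of NiagaraCQ.

```python
from collections import defaultdict

# Invented sample tuples standing in for the contents of quotes.xml.
quotes = [
    {"Symbol": "INTC", "Price": 21.5},
    {"Symbol": "MSFT", "Price": 52.0},
    {"Symbol": "IBM", "Price": 80.0},
]

# Constant table: one row per registered constant, with its destinations.
constant_table = {
    "INTC": ["Trigger Action I"],
    "MSFT": ["Trigger Action J"],
}


def run_group_plan(tuples, table):
    """Scan the data once, join on Symbol = constant, then split by destination."""
    outputs = defaultdict(list)
    for tup in tuples:                                  # single shared file scan
        for dest in table.get(tup["Symbol"], []):       # join with constant table
            outputs[dest].append(tup)                   # split to each query
    return dict(outputs)


results = run_group_plan(quotes, constant_table)
print(results)
```

Adding a query with a new constant only adds a row to the constant table; the shared scan and join stay unchanged.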
14
Materialized Intermediate Files
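  • Query split with intermediate files

[Diagram: the Split operator writes intermediate files File_i and File_j; each file is read by its own File Scan, which feeds Trigger_Act_i and Trigger_Act_j respectively]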
15
General Selection Predicates
  • Predicate format: Attribute op Constant
  • Attribute: a path expression without wildcards
  • Op: =, <, >

16
Join Operators
  • A join signature in our approach contains the
    names of the two data sources and the predicate
    for the join. Join queries with the same join
    signature are grouped together.

17
Processing Continuous Queries
  • 1. The Continuous Query Manager (CQM) adds continuous
    queries with file and timer information to enable the
    Event Detector (ED) to monitor events.
  • 2. ED asks the Data Manager (DM) to monitor changes to
    files.
  • 3. When a timer event happens, ED asks DM for the last
    modified time of files.
  • 4. DM informs ED of changes to push-based data sources.
  • 5. If file changes and timer events are satisfied, ED
    provides CQM with a list of firing CQs.
  • 6. CQM invokes the Query Engine (QE) to execute firing
    CQs.
  • 7. The file scan operator calls DM to retrieve selected
    documents.
  • 8. DM returns only the data changes between the last fire
    time and the current fire time.
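
The last two steps (the query sees only the changes since it last fired) can be sketched as below. The DataManager and ContinuousQuery classes, their methods, and the integer timestamps are all hypothetical simplifications for illustration; the real system tracks changes to XML files.

```python
class DataManager:
    """Hypothetical DM: keeps a timestamped change log per file."""

    def __init__(self):
        self._log = {}  # file name -> list of (timestamp, record)

    def append(self, file_name, timestamp, record):
        self._log.setdefault(file_name, []).append((timestamp, record))

    def changes_between(self, file_name, last_fire, current_fire):
        # Only records that arrived after the last fire time are returned.
        return [rec for ts, rec in self._log.get(file_name, [])
                if last_fire < ts <= current_fire]


class ContinuousQuery:
    """Hypothetical CQ: a selection predicate plus its last fire time."""

    def __init__(self, file_name, predicate):
        self.file_name = file_name
        self.predicate = predicate
        self.last_fire = 0

    def fire(self, dm, now):
        delta = dm.changes_between(self.file_name, self.last_fire, now)
        self.last_fire = now
        return [rec for rec in delta if self.predicate(rec)]


dm = DataManager()
dm.append("quotes.xml", 1, {"Symbol": "MSFT", "Price": 51.0})
dm.append("quotes.xml", 2, {"Symbol": "INTC", "Price": 20.0})
cq = ContinuousQuery("quotes.xml", lambda r: r["Symbol"] == "MSFT")

first = cq.fire(dm, now=3)
dm.append("quotes.xml", 5, {"Symbol": "MSFT", "Price": 53.0})
second = cq.fire(dm, now=6)
print(first, second)
```

The second firing does not re-report the record delivered by the first one, which is the point of evaluating the query over the delta only.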

18
Experimental results
  • Performed on a Sun Ultra 6000 with 1 GB RAM running
    JDK 1.2 on Solaris 2.6

19
Experimental Results
20
Analysis of NiagaraCQ
  • Pros
  • Scalable to large number of queries, users
  • Works with both change and timer based continuous
    queries
  • Better performance, less I/O required to execute
    queries.
  • Cons
  • No dynamic re-grouping, so groups eventually
    become sub-optimal
  • Assumes queries have a common structure, which is
    not always the case
  • Incremental grouping currently works only for
    select and join. Aggregation may be included
    eventually.

21
Niagara in Review
  • The goal was to develop an Internet-scale
    continuous query system using group optimization
    based on the assumption that many continuous
    queries on the Internet will have some
    similarities.
  • Proposed a novel incremental grouping methodology
  • Supports both timer-based and change-based
    queries.

22
TelegraphCQ
23
TelegraphCQ Design Overview
  • Focus: continuously adaptive query processing of
    high-volume, highly variable data streams.
  • Large scale
  • Deeply networked nature
  • Unpredictability of the environment
  • Need for close user interaction
  • Data constantly moving and changing

24
TelegraphCQ Restrictions
  • Data is pushed to the query processor
  • Data arrival rate can be high and bursty
  • On-the-fly processing: data can be stored, but
    real-time, one-pass analysis is important
  • Ordering of data is of significant importance.

25
Design Goals
  • scheduling and resource management for groups of
    queries
  • support for out-of-core (non main memory) data
  • variable adaptivity
  • dynamic QoS support
  • parallel cluster-based processing and distributed
    computation.

26
TelegraphCQ
  • Complete redesign and re-implementation of the
    Telegraph system, with a focus on support for
    shared, continuous query processing over query
    and data streams.
  • The name distinguishes it from the Telegraph
    project's broader focus on adaptive dataflow in
    general and emphasizes the challenges addressed
    in the new implementation.

27
Telegraph Module Types
  • Ingress and Caching
  • Interface with external data sources
  • TeSS: HTML/XML screen scraper
  • TelNape: interfaces with popular P2P networks
  • Local caching to hide network delays
  • Query Processing
  • Pipelined, non-blocking versions of standard
    relational operators such as joins, selections,
    projections, grouping and aggregation, and
    duplicate elimination
  • State Modules (SteMs)
  • Adaptive Routing
  • Ability to re-optimize the plan on a continuous
    basis while a query is running
  • Eddies
  • Flux (Fault-tolerant, Load-balancing eXchange):
    opaque dataflow module that handles buffering and
    reordering of streams

28
Eddies
  • Role: continuously route tuples among a set of
    other modules according to a routing policy
  • Intercept tuples and choose the order that they
    travel between modules
  • Eddy can shut down each module when the end of
    all of its input streams is reached and the
    modules have completed current processing.
  • Not designed as general purpose scheduler, no
    enforcement of resource management policies
  • Multiple eddies run as parallel threads on
    queries with disjoint sets of tables and streams.
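
To make the routing role concrete, here is a toy Python sketch. It assumes the modules are simple selection filters and uses a greedy policy (visit the module with the lowest observed pass rate first). That policy and the FilterModule and Eddy classes are illustrative assumptions, not the routing policy TelegraphCQ uses.

```python
class FilterModule:
    """Hypothetical module: a selection predicate that tracks its own pass rate."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.seen = 0
        self.passed = 0

    def process(self, tup):
        self.seen += 1
        ok = self.predicate(tup)
        if ok:
            self.passed += 1
        return ok

    def pass_rate(self):
        # Modules with no history are treated as passing everything.
        return self.passed / self.seen if self.seen else 1.0


class Eddy:
    """Routes each tuple through all modules, re-deciding the order per tuple."""

    def __init__(self, modules):
        self.modules = list(modules)

    def route(self, tup):
        # Greedy policy: try the module most likely to drop the tuple first.
        for module in sorted(self.modules, key=lambda m: m.pass_rate()):
            if not module.process(tup):
                return False
        return True


even = FilterModule("even", lambda x: x % 2 == 0)
small = FilterModule("small", lambda x: x < 10)
eddy = Eddy([even, small])

output = [x for x in range(100) if eddy.route(x)]
print(output, even.seen, small.seen)
```

As the stream progresses, the eddy learns that the "small" filter drops most tuples and starts routing to it first, so the "even" filter ends up doing less work without the plan ever being recompiled.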

29
Adaptive Processing with Eddies and SteMs
  • SteM - temporary repository of tuples,
    essentially corresponding to half of a
    traditional join operator.
  • It stores homogeneous tuples (i.e., tuples
    spanning the same set of tables) formed during
    query processing.
  • Supports insert (build), search (probe), and
    optionally delete (eviction) operations.
  • Two kinds of tuples can be routed to a SteM.
  • When a tuple t in T (a build tuple) is routed to
    SteM_T, t is added to the set of tuples in SteM_T.
  • When a tuple p not in T (a probe tuple) is routed to
    SteM_T, SteM_T returns concatenated matches for it
    to the Eddy. These concatenated matches are the
    tuples in p join SteM_T that satisfy all query
    predicates that can be evaluated on the columns
    in p and T.

[Diagram: streams S and T enter the Eddy, which routes build and probe tuples to SteM_S and SteM_T and receives the S-T matches]
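
The build and probe operations described above can be sketched as a pair of hash indexes. This is a minimal illustration that assumes a single equality join column; the SteM class, its method names, and the s_id/t_id/k columns are hypothetical.

```python
class SteM:
    """Hypothetical SteM: half of a join, indexed on one join column."""

    def __init__(self, key):
        self.key = key
        self._index = {}

    def build(self, tup):
        # Insert: add the tuple to the set stored in this SteM.
        self._index.setdefault(tup[self.key], []).append(tup)

    def probe(self, tup, probe_key):
        # Search: return the probe tuple concatenated with each match.
        return [{**tup, **match} for match in self._index.get(tup[probe_key], [])]

    def evict(self, tup):
        # Optional delete, e.g. when a tuple falls out of a window.
        bucket = self._index.get(tup[self.key], [])
        if tup in bucket:
            bucket.remove(tup)


stem_s = SteM(key="k")
stem_t = SteM(key="k")

# Interleaved arrivals from streams S and T (invented sample tuples).
arrivals = [
    ("S", {"s_id": "s1", "k": 1}),
    ("T", {"t_id": "t1", "k": 1}),
    ("T", {"t_id": "t2", "k": 2}),
    ("S", {"s_id": "s2", "k": 2}),
]

matches = []
for stream, tup in arrivals:
    own, other = (stem_s, stem_t) if stream == "S" else (stem_t, stem_s)
    own.build(tup)                          # build into the tuple's own SteM
    matches.extend(other.probe(tup, "k"))   # probe the other stream's SteM

pairs = {(m["s_id"], m["t_id"]) for m in matches}
print(pairs)
```

Because each arriving tuple is built into one SteM and probed against the other, every matching pair is produced exactly once regardless of arrival order, which is what lets the eddy reorder work freely.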
30
Fjords
  • Inter-module communication API
  • Forms the links between modules
  • Supports a mixture of push (streaming) and pull
    (static) operations in query plans
  • Allows modules to ignore the specifics of the
    data source.
  • Supports non-blocking dequeue operations
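
A small Python sketch of the idea: both queue types expose the same get() call, so a consuming module does not need to know whether its input is pushed or pulled. The PushQueue and PullQueue classes and the drain helper are hypothetical names for illustration, not the Fjords API.

```python
from collections import deque


class PushQueue:
    """Producer pushes items in; get() never blocks and returns None when empty."""

    def __init__(self):
        self._buf = deque()

    def put(self, item):
        self._buf.append(item)

    def get(self):
        return self._buf.popleft() if self._buf else None


class PullQueue:
    """get() asks the underlying (static) source for its next item."""

    def __init__(self, source):
        self._it = iter(source)

    def get(self):
        return next(self._it, None)


def drain(queue):
    """A consumer that works with either queue type."""
    items = []
    while (item := queue.get()) is not None:
        items.append(item)
    return items


pushed = PushQueue()
pushed.put("sensor reading 1")
pushed.put("sensor reading 2")
pulled = PullQueue(["table row 1", "table row 2"])

from_push = drain(pushed)
from_pull = drain(pulled)
print(from_push, from_pull)
```

With a push queue, a None result means "nothing has arrived yet", so the consumer can move on to other work instead of waiting.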

31
System Specifications
  • Built on the PostgreSQL platform
  • Process-per-connection model
  • Coded in C/C++

32
Example Landmark Query
  • The input windows of these queries have a fixed
    beginning point in the timeline, and a forward
    moving endpoint.
  • Example: Select all the days after the hundredth
    trading day on which the closing price of MSFT
    has been greater than 50. Keep this query
    standing in the system for a thousand trading
    days.
  • SELECT closingPrice, timestamp
  • FROM ClosingStockPrices
  • WHERE stockSymbol = 'MSFT'
  • AND closingPrice > 50.00
  • for (t = 101; t <= 1100; t++)
  • WindowIs(ClosingStockPrices, 101, t)

stockSymbol  timestamp  closingPrice
MSFT         101        60
MSFT         102        48
MSFT         103        52
MSFT         104        60
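
The landmark window can be simulated in plain Python over the four sample rows from the slide. This is only a sketch of the window semantics, not TelegraphCQ code; landmark_query is a hypothetical helper, and the loop stops at day 104 because that is where the sample data ends.

```python
# (stockSymbol, timestamp, closingPrice) rows from the slide.
rows = [
    ("MSFT", 101, 60.0),
    ("MSFT", 102, 48.0),
    ("MSFT", 103, 52.0),
    ("MSFT", 104, 60.0),
]


def landmark_query(stream, start, end):
    """For each t, evaluate the query over the window [start, t]."""
    results = {}
    for t in range(start, end + 1):
        # Fixed beginning point, forward-moving end point.
        window = [r for r in stream if start <= r[1] <= t]
        results[t] = [(price, ts) for sym, ts, price in window
                      if sym == "MSFT" and price > 50.00]
    return results


results = landmark_query(rows, 101, 104)
for t, answer in results.items():
    print(t, answer)
```

The window only grows, so the answer at each step contains the previous answer plus any newly qualifying days.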
33
Example Sliding Window Query
  • The input windows of these queries have forward
    moving beginning and end points.
  • Example: On every third trading day starting
    today, calculate the average closing price of
    MSFT for the three most recent trading days. Keep
    the query standing for fifty trading days.
  • SELECT AVG(closingPrice)
  • FROM ClosingStockPrices
  • WHERE stockSymbol = 'MSFT'
  • for (t = ST; t < ST + 50; t += 3)
  • WindowIs(ClosingStockPrices, t - 2, t)

stockSymbol  timestamp  closingPrice
MSFT         101        60
MSFT         102        48
MSFT         103        52
MSFT         104        56
MSFT         105        55
MSFT         106        58
MSFT         107        52
MSFT         108        60
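
A plain-Python simulation of the sliding window over the sample rows from the slide. It assumes ST = 103 so that the first window (days 101 to 103) is fully covered by the data, and it runs for 6 days instead of 50 because the sample ends at day 108; sliding_window_avg is a hypothetical helper.

```python
# (stockSymbol, timestamp, closingPrice) rows from the slide.
rows = [
    ("MSFT", 101, 60.0), ("MSFT", 102, 48.0), ("MSFT", 103, 52.0),
    ("MSFT", 104, 56.0), ("MSFT", 105, 55.0), ("MSFT", 106, 58.0),
    ("MSFT", 107, 52.0), ("MSFT", 108, 60.0),
]


def sliding_window_avg(stream, st, periods, step=3, width=3):
    """Every `step` days, average MSFT closing prices over [t - (width - 1), t]."""
    results = {}
    for t in range(st, st + periods, step):
        prices = [price for sym, ts, price in stream
                  if sym == "MSFT" and t - (width - 1) <= ts <= t]
        if prices:
            results[t] = sum(prices) / len(prices)
    return results


averages = sliding_window_avg(rows, st=103, periods=6)
for t, avg in averages.items():
    print(t, round(avg, 2))
```

Both ends of the window advance by three days per step, so consecutive windows here do not overlap.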
34
Example Temporal Band Join Query
  • These queries join tuples in one stream with
    tuples in another based on timestamp.
  • Example: For the five most recent trading days
    starting today, select all stocks that closed
    higher than MSFT on a given day. Keep the query
    standing for twenty trading days.
  • SELECT c2.*
  • FROM ClosingStockPrices as c1,
  • ClosingStockPrices as c2
  • WHERE c1.stockSymbol = 'MSFT' and
  • c2.stockSymbol != 'MSFT' and
  • c2.closingPrice > c1.closingPrice and
  • c2.timestamp = c1.timestamp
  • for (t = ST; t < ST + 20; t++)
  • WindowIs(c1, t - 4, t)
  • WindowIs(c2, t - 4, t)
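
The join can be simulated in plain Python for a single value of t. The rows below are invented sample data (the slide shows none for this query), band_join is a hypothetical helper, and the nested loop is chosen for clarity rather than efficiency.

```python
# Invented (stockSymbol, timestamp, closingPrice) rows for illustration.
rows = [
    ("MSFT", 1, 50.0), ("IBM", 1, 55.0), ("INTC", 1, 40.0),
    ("MSFT", 2, 52.0), ("IBM", 2, 51.0), ("INTC", 2, 60.0),
]


def band_join(stream, t, width=5):
    """Self-join over the window [t - (width - 1), t] on equal timestamps."""
    window = [r for r in stream if t - (width - 1) <= r[1] <= t]
    c1 = [r for r in window if r[0] == "MSFT"]
    c2 = [r for r in window if r[0] != "MSFT"]
    # Keep c2 rows that closed higher than MSFT on the same day.
    return [b for a in c1 for b in c2 if b[1] == a[1] and b[2] > a[2]]


matches = band_join(rows, t=2)
print(matches)
```

IBM beats MSFT on day 1 and INTC beats it on day 2, so those two rows are returned for the window ending at t = 2.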

35
Pros and Cons of System
  • Pros
  • Focus on extreme adaptability
  • New code is multithreaded to help boost system
    parallelism and enhance performance particularly
    in multiprocessor scenarios.
  • Cons
  • Code is not fully multithreaded (the existing
    PostgreSQL code is not)
  • Queries separated into classes for processing
    based on disjoint footprints.
  • Still in early development stages
  • Issues still need to be solved
  • no extensive performance analysis

36
TelegraphCQ Future Work
  • Egress Modules
  • Include fault tolerance in delivery of results,
    e.g., in mobile networks
  • Improved interface with overlay networks
  • Cluster and Distributed Implementations
  • Extension of the Flux module
  • Integration with TAG system

37
Conclusion and Review
  • NiagaraCQ
  • NiagaraCQ is a system that establishes
    scalability with a general strategy of
    incremental group optimization.
  • TelegraphCQ
  • TelegraphCQ is a system that combines prior work
    in Fjords, Eddies, and PSoup in order to query
    streaming data on large scales
  • Other Data Streaming solutions
  • Aurora
  • STREAM
  • StreamMill

38
Thank You for your time!
39
Bibliography
  • J. Chen, D. DeWitt, F. Tian, Y. Wang. NiagaraCQ: A
    Scalable Continuous Query System for Internet
    Databases. In Proc. of the ACM SIGMOD Conf. on
    Management of Data, 2000.
  • Xiaoning Wang. NiagaraCQ presentation.
    www.cs.wpi.edu/cs561/s03/talks/niagara-cq.ppt
  • Chandrasekaran, et al. TelegraphCQ: Continuous
    Dataflow Processing for an Uncertain World. UC
    Berkeley. CIDR Conference, 2003.