Eddies: Continuously Adaptive Query processing - PowerPoint PPT Presentation

About This Presentation
Title:

Eddies: Continuously Adaptive Query processing

Description:

Eddies: Continuously Adaptive Query processing R. Avnur, J.M. Hellerstein UCB ACM Sigmod 2000 Problem Statement Context: large federated and shared-nothing databases ... – PowerPoint PPT presentation

Number of Views:123
Avg rating:3.0/5.0
Slides: 60
Provided by: kers151
Category:

less

Transcript and Presenter's Notes

Title: Eddies: Continuously Adaptive Query processing


1
Eddies Continuously Adaptive Query processing
  • R. Avnur, J.M. Hellerstein
  • UCB
  • ACM Sigmod 2000

2
Problem Statement
  • Context large federated and shared-nothing
    databases
  • Problem assumptions made at query optimization
    rarely hold during execution
  • Hypothesis do away with traditional optimizers,
    solve it thru adaptation
  • Focus scheduling in a tuple-based pipeline query
    execution model

3
Problem Statement Refinement
  • Large scale systems are unpredictable, because
  • Hardware and workload complexity,
  • bursty servers networks, heterogenity, hardware
    characteristics
  • Data complexity,
  • Federated database often come without proper
    statistical summaries
  • User Interface Complexity
  • Online aggregation may involve user control

4
Research Laboratory setting
  • Telegraph, a system designed to query all data
    available online
  • River, a low level distributed record management
    system for shared-nothing databases
  • Eddies, a scheduler for dispatching work over
    operators in a query graph

5
The Idea
  • Relational algebra operators consume a stream
    from multiple sources to produce a new stream
  • A priori you dont now how selective- how fast-
    tuples are consumed/produced
  • You have to adapt continuously and learn this
    information on the fly
  • Adapt the order of processing based on these
    lessons

6
The Idea
next
next
next
next
next
next
7
The Idea
  • Standard method derive a spanning tree over the
    query graph
  • Pre-optimize a query plan to determine operator
    pairs and their algorithm, e.g. to exploit access
    paths
  • Re-optimization a query pipeline on the fly
    requires careful state management, coupled with
  • Synchronization barriers
  • Operators have widely differing arrival rates for
    their operands
  • This limits concurrency, e.g. merge-join
    algorithm
  • Moments of symmetry
  • Algorithm provides option to exchange the role of
    the operands without too much complications
  • E.g switching the role of R and S in a
    nested-loop join

8
Nested-loop
R
s
9
Join and sorting
  • Index-joins are asymmetric, you can not easily
    change their role
  • Combine index-join operands as a unit in the
    process
  • Sorting requires look-ahead
  • Merge-joins are combined into unit
  • Ripple joins
  • Break the space into smaller pieces and solve the
    join operation for each piece individually
  • The piece crossings are moments of symmetry

10
The Idea
next
next
next
next
11
Rivers and Eddies
  • Eddies are tuple routers that distribute arriving
    tuples to interested operators
  • What are efficient scheduling policies?
  • Fixed strategy? Random ? Learning?
  • Static Eddies
  • Delivery of tuples to operators can be hardwired
    in the Eddie to reflect a traditional query
    execution plan
  • Naïve Eddie
  • Operators are delivered tuples based on a
    priority queue
  • Intermediate results get highest priority to
    avoid buffer congestion

12
Observations for selections
  • Extended priority queue for the operators
  • Receiving a tuple leads to a credit increment
  • Returning a tuple leads to a credit decrement
  • Priority is determined by weighted lottery
  • Naïve Eddies exhibit back pressure in the tuple
    flow production is limited by the rate of
    consumption at the output
  • Lottery Eddies approach the cost of optimal
    ordering, without a need to a priory determine
    the order
  • Lottery Eddies outperform heuristics
  • Hash-use first, or Index-use first, Naive

13
Observations
  • The dynamics during a run can be controlled by a
    learning scheme
  • Split the processing in steps (windows) to
    re-adjust the weight during tuple delivery
  • Initial delays can not be handled efficiently
  • Research challenges
  • Better learning algorithms to adjust flow
  • Aggressive adjustments
  • Remove pre-optimization
  • Balance hostile parallel environment
  • Deploy eddies to control degree of partitioning
    (and replication)

14
Database streams You only get one chance to look
  • Prof. Dr. Martin Kersten
  • CWI
  • Amsterdam
  • March 2003

15
Database research topic list
  • Indexing, Access methods, data structures
  • Query/transaction processing and optimization
  • Distributed, heterogeneous, mobile databases
  • View maintenance/materialisation
  • Mining data, text, and web
  • Semi-structured data, metadata and XML
  • Temporal, Spatial, Scientific, Statistical,
    Biological DB
  • Data warehousing and OLAP
  • Middleware, Workflow and Security

HOT XML, Semantic Web, P2P, Streams, Biological
16
Outline
  • Introduction to Data Streaming Management System
    (DSMS)
  • A reference architecture for a DSMS
  • Grouping thousands of user queries
  • Merging and abstraction of streams
  • Conclusions

17
The tranquil database scene
  • Traditional DBMS data stored in finite,
    persistent data sets, SQL-based applications to
    manage and access it

OLTP-web application
Ad-hoc reporting
Data entry application
RDBMS
18
The tranquil database scene
  • The user community grows and MANY wants
    up-to-the-second (aggregate) information from the
    database

OLTP-web application
Ad-hoc reporting
Data entry application
RDBMS
19
The tranquil database scene
  • Database entry is taken over by a remote device
    which issues a high-volume of update transactions

OLTP-web application
Ad-hoc reporting
Data entry application
Dataentry application
RDBMS
20
The tranquil database scene
  • Database entry is taken over by MANY remote
    devices which issues a high-volume of update
    transactions

OLTP-web application
Adhoc reporting
Dataentry application
Dataentry application
RDBMS
21
The tranquil database scene
  • Database solutions can not carry the weight

OLTP-web application
Adhoc reporting
Dataentry application
Dataentry application
RDBMS
22
Application domains
  • Personalized financial tickers
  • Personalized information delivery
  • Personalized environment control
  • Business to business middelware
  • Web-services application based on XML exchange
  • Monitoring the real-world environment (pollution,
    traffic)
  • Monitoring the data flow in an ISP
  • Monitoring web-traffic behaviour
  • Monitoring the load on a telecom switch
  • Monitoring external news-feeds

23
Application vision
  • Re-define the role of a DBMS in the complete
    application support line
  • It manages a persistent store
  • It handles and coordinates updates
  • It supports ad-hoc querying
  • Application servers carry the load
  • J2EE, JBOS, Websphere,BEA,.
  • Or partly redesign the DBMS

24
Application domains
  • Personalized financial tickers
  • Personalized information delivery
  • Personalized environment control
  • Business to business middelware
  • Web-services application based on XML exchange
  • Monitoring the real-world environment (pollution,
    traffic)
  • Monitoring the data flow in an ISP
  • Monitoring web-traffic behaviour
  • Monitoring the load on a telecom switch
  • Monitoring external news-feeds

25
Application domains
  • Personalized
  • Personalized
  • Personalized

  • middelware
  • on XML
    exchange
  • Monitoring
  • Monitoring
  • Monitoring
  • Monitoring
  • Monitoring

QUERYING
WEB SERVICES
STREAM UPDATE
26
Continuous queries
  • Continous query the user observes the changes
    made to the database through a query
  • Query registration once
  • Continously up-to-date answers.

Continuous queries
RDBMS
27
Data Streams
  • Data streams
  • The database is in constant bulk load mode
  • The update rate is often non-uniform
  • The entries are time-stamped
  • The source could be web-service, sensor, wrapped
    source

Dataentry application
DSMS
28
DSMS
  • Data Stream Management Systems (DSMS) support
  • high volume update streams and real-time response
  • to ad-hoc complex queries.
  • What can be salvaged from the DBMS core
    technology ?
  • What should be re-designed from scratch ?

Dataentry application
DSMS
29
DBMS versus DSMS
  • Persistent relations
  • Transaction oriented
  • One-time queries
  • Precise query answering
  • Access plan determines physical database design
  • Transient streams
  • Query orientation
  • Continuous queries
  • Best-effort query answering
  • Unpredictable data characteristics

30
Old technology to rescue?
  • Many stream based applications are low-volume
    with simple queries
  • Thus we can live with automatic query refresh
  • Triggers are available for notification of
    changes
  • They are hooked up to simple changes to the
    datastore
  • There is no technology to merge/optimize trigger
    groups

31
Outline of remainder
  • Query processing over multiple streams
  • Organizing hundreds of ad-hoc queries
  • Sensor-network based querying

32
A stream application
  • Widom Consider a network traffic system for an
    ISP
  • with customer link and backbone link and two
    streams
  • keeping track of the IP traffic

33
A stream application
  • Widom Consider a network traffic system for an
    ISP
  • with customer link and backbone link and two
    streams
  • keeping track of the IP traffic
  • TPc(saddr, daddr, id, length, timestamp)
  • TPb(saddr, daddr, id, length, timestamp)

PTc
PTb
34
A stream application
  • Q1 Compute the load on the backbone link averaged
    over one minute period and notify the operator
    when the load exceeds a threshold T
  • Select notifyoperator(sum(length))
  • From PTb
  • Group By getminute(timestamp)
  • Having sum(length) gtT
  • With low stream flow it could be handled with a
    DBMS trigger,
  • Otherwise sample the stream to get an approximate
    answer

35
A stream application
  • Q2 Find the fraction of traffic on the backbone
    link coming from the customer network to check
    cause of congestion.
  • ( Select count()
  • From PTc as C, PTb as B
  • Where C.saddr B.saddr and C.daddrB.daddr
  • and C.idB.id ) /
  • ( Select count() From PTb)
  • Both streams might require an unbounded resource
    to perform the join, which could be avoided with
    an approximate answer and synopsis

36
A stream application
  • Q3 Monitor the 5 source-to-destination pairs in
    terms of traffic on the backbone.
  • With Load As (Select saddr, daddr,sum(length) as
    traffic
  • From PTb Group By saddr,daddr)
  • Select saddr, daddr, traffic
  • From Load as l1
  • Where (Select count() From Load as l2
  • Where l2.traffic ltl1.traffic) gt
  • (Select 0.95count() From Load)
  • Order By Traffic
  • This query contains blocking operators

37
STREAM architecture
Answer
38
  • Q1 Compute the load on the backbone link averaged
    over one minute period and notify the operator
    when the load exceeds a threshold T
  • Select notifyoperator(sum(length))
  • From PTb
  • Group By getminute(timestamp)
  • Having sum(length) gtT

The answer store area simply needs an integer
39
  • Q2 Find the fraction of traffic on the backbone
    link coming from the customer network to check
    cause of congestion.
  • ( Select count()
  • From PTc as C, PTb as B
  • Where C.saddr B.saddr and C.daddrB.daddr
  • and C.idB.id ) /
  • ( Select count() From PTb)

The scratch area should maintain part of the two
streams to implement the join. Or a complete
list of saddr and daddr.
40
Joining two tables
RelA
Nested loop join
RelB
41
Joining two tables
RelA
Nested loop join
RelB
42
Joining two tables
RelA
Nested loop join
RelB
43
Joining two stream
..
PTa
Nested loop join
PTb
..
An unbounded store would be required
44
Joining two stream
..
PTa
merge join
PTb
..
If the streams are ordered a simple merge join is
possible With limited resource requirements
45
Joining two stream
window
..
PTa
histogram
Join synopsis
histogram
PTb
..
A statistical summary could provide an
approximate answer
46
  • Q3 Monitor the 5 source-to-destination pairs in
    terms of traffic on the backbone.
  • With Load As (Select saddr, daddr,sum(length) as
    traffic
  • From PTb Group By saddr,daddr)
  • Select saddr, daddr, traffic
  • From Load as l1
  • Where (Select count() From Load as l2
  • Where l2.traffic ltl1.traffic) gt
  • (Select 0.95count() From Load)
  • Order By Traffic
  • The scratch area should maintain part of the two
    streams to
  • implement the join.

47
Finance
  • DeWitt Consider a financial feed where
    thousands of clients can register arbitrary
    complex continues queries.
  • XML stream querying

XML
48
Finance
  • Q5 Notify me whenever the price of KPN stock
    drops below 6 euro
  • Select notifyUser(name, price)
  • From ticker t1
  • Where t1.name KPN and t1.price lt 6

49
Finance
  • Q5 Notify me whenever the price of KPN stock
    drops by 5 over the last hour
  • Select notifyUser(name, price)
  • From ticker t1,t2
  • Where t1.name KPN and t2.name t1.name
  • and getminutes(t1.timestamp-t2.timestamp) lt60
  • and t1.price lt 0.95 t2.price

50
Finance
  • Q6 Notify me whenever the price of KPN stock
    drops by 5 over the last hour and T-mobile
    remains constant
  • Select notifyUser(name, price)
  • From ticker t1,t2, t3,t4
  • Where t1.name KPN and t2.name t1.name
  • and getminutes(t1.timestamp-t2.timestamp) lt60
  • and t1.price lt 0.95 t2.price
  • and t1.timestampt3.timestamp and
    t2.timestampt4.timestamp
  • and t3.name T-Mobile and t4.name t3.name
  • and getminutes(t3.timestamp-t4.timestamp) lt60
  • and t3.price t4.price

51
Query signatures
  • Traditional SQL applications already use the
    notion of parameterised queries, I.e. some
    constants are replaced by a program variable.
  • Subsequent calls use the same query evaluation
    plan
  • In a DSMS we should recognize such queries as
    quick as possible
  • Organize similar queries into a group
  • Decompose complex queries into smaller queries
  • Manage the amount of intermediate store

52
Finance
  • Queries can be organized in groups using a
    signature and evaluation can be replaced by
    single multi-user request.

Select notifyUser(name, price) From ticker
t1 Where t1.name KPN and t1.price lt 6
Client Name Threshold Price
192.871.12.1 KPN 6
192.777.021 ING 12

53
Finance
  • Queries can be organized in groups using a
    signature and evalution can be replaced by single
    multi-user request.

Select notifyUser(c.client, t1.name,
t1.price) From ticker t1, clients c Where t1.name
c.name and t1.price lt c.price
Client Name Threshold Price
192.871.12.1 KPN 6
192.777.021 ING 12

54
Finance
  • Timer-based queries call for a stream window with
    incremental evaluation
  • Multiple requests can be organized by time-table
    and event detection methods provided by database
    triggers.
  • Select notifyUser(name, price)
  • From ticker t1,t2
  • Where t1.name KPN and t2.name t1.name
  • and getminutes(t1.timestamp-t2.timestamp) lt60
  • and t1.price lt 0.95 t2.price

55
Finance
  • Complex queries can be broken down into
    independent components
  • Select notifyUser(name, price)
  • From ticker t1,t2, t3,t4
  • Where t1.name KPN and t2.name t1.name
  • and getminutes(t1.timestamp-t2.timestamp) lt60
  • and t1.price lt 0.95 t2.price
  • and t1.timestampt3.timestamp and
    t2.timestampt4.timestamp
  • and t3.name T-Mobile and t4.name t3.name
  • and getminutes(t3.timestamp-t4.timestamp) lt60
  • and t3.price t4.price

56
Finance
  • Intermediate results should be materialized. Can
    be integrated in tradition query evaluation
    schemes




t1.timestampt3.timestamp and t2.timestampt4.time
stamp






57
Sensor networks
  • Madden Sensor networks are composed of
    thousands of small devices, interconnected
    through radio links. This network can be queried.
  • Sensors have limited energy
  • Sensors have limited reachability
  • Sensors can be crushed

58
Aggregate Queries Over Ad-Hoc Wireless Sensor
Networks
59
Sensor networks
  • Q7 Give me the traffic density on the A1 for the
    last hour
  • Select avg(t.car)
  • From traffic t
  • Where t.segment in (Select segment From roads
  • Where name A1)
  • Group By gethour(t.timestamp)

60
Sensor networks
  • The sensors should organize themselves into a P2P
    infrastructure
  • An aggregate query is broadcasted through the
    network
  • Each Mote calculates a partial answer and sent it
    to its peers
  • Peers aggregate the information to produce the
    final answer.
  • Problems
  • The energy to broadcast some information is high
  • Tuples and partial results may be dropped

61
Conclusions and outlook
  • Data stream management technology require changes
    in our expectation of a DBMS functionality
  • Queries not necessarily provide a precise answer
  • Queries continue as long as we are interested in
    their approximate result
  • The persistent store not necessarily contains a
    consistent and timeless view on the state of the
    database.

62
Conclusions and outlook
  • Datastream management technology capitalizes upon
    proven DBMS technology
  • DSMS provide a basis for ambient home settings,
    sensor networks, and globe spanning information
    systems
  • It is realistic to expect that some of the
    properties to support efficient datastream
    management will become part of the major products
  • Multi query optimization techniques should be
    added.

63
Literature
  • NiagaraCQ A Scalable Contious Query System for
    Internet Databases, J. Chen, D.J. deWitt, F.
    Tian, Y. Wang, Wisconsin Univ.
  • Streaming Queries over Streaming Data , Sirish
    Chandrasekaran, Michael J. Franklin, Univ
    Berkeley
  • Continous Queries over Data Streams, S.Babu, J.
    Widom, Stanford University
Write a Comment
User Comments (0)
About PowerShow.com