Eddies: Continuously Adaptive Query processing

About This Presentation

Title:

Eddies: Continuously Adaptive Query processing

Description:

Eddies: Continuously Adaptive Query processing R. Avnur, J.M. Hellerstein UCB ACM Sigmod 2000 Problem Statement Context: large federated and shared-nothing databases ... – PowerPoint PPT presentation

Number of Views:123

Avg rating:3.0/5.0

Slides: 60

Provided by: kers151

Category:

more less

Transcript and Presenter's Notes

Title: Eddies: Continuously Adaptive Query processing

1
Eddies Continuously Adaptive Query processing

R. Avnur, J.M. Hellerstein
UCB
ACM Sigmod 2000

2
Problem Statement

Context large federated and shared-nothing
databases
Problem assumptions made at query optimization
rarely hold during execution
Hypothesis do away with traditional optimizers,
solve it thru adaptation
Focus scheduling in a tuple-based pipeline query
execution model

3
Problem Statement Refinement

Large scale systems are unpredictable, because
Hardware and workload complexity,
bursty servers networks, heterogenity, hardware
characteristics
Data complexity,
Federated database often come without proper
statistical summaries
User Interface Complexity
Online aggregation may involve user control

4
Research Laboratory setting

Telegraph, a system designed to query all data
available online
River, a low level distributed record management
system for shared-nothing databases
Eddies, a scheduler for dispatching work over
operators in a query graph

5
The Idea

Relational algebra operators consume a stream
from multiple sources to produce a new stream
A priori you dont now how selective- how fast-
tuples are consumed/produced
You have to adapt continuously and learn this
information on the fly
Adapt the order of processing based on these
lessons

6
The Idea
next
next
next
next
next
next
7
The Idea

Standard method derive a spanning tree over the
query graph
Pre-optimize a query plan to determine operator
pairs and their algorithm, e.g. to exploit access
paths
Re-optimization a query pipeline on the fly
requires careful state management, coupled with
Synchronization barriers
Operators have widely differing arrival rates for
their operands
This limits concurrency, e.g. merge-join
algorithm
Moments of symmetry
Algorithm provides option to exchange the role of
the operands without too much complications
E.g switching the role of R and S in a
nested-loop join

8
Nested-loop
R
s
9
Join and sorting

Index-joins are asymmetric, you can not easily
change their role
Combine index-join operands as a unit in the
process
Sorting requires look-ahead
Merge-joins are combined into unit
Ripple joins
Break the space into smaller pieces and solve the
join operation for each piece individually
The piece crossings are moments of symmetry

10
The Idea
next
next
next
next
11
Rivers and Eddies

Eddies are tuple routers that distribute arriving
tuples to interested operators
What are efficient scheduling policies?
Fixed strategy? Random ? Learning?
Static Eddies
Delivery of tuples to operators can be hardwired
in the Eddie to reflect a traditional query
execution plan
Naïve Eddie
Operators are delivered tuples based on a
priority queue
Intermediate results get highest priority to
avoid buffer congestion

12
Observations for selections

Extended priority queue for the operators
Receiving a tuple leads to a credit increment
Returning a tuple leads to a credit decrement
Priority is determined by weighted lottery
Naïve Eddies exhibit back pressure in the tuple
flow production is limited by the rate of
consumption at the output
Lottery Eddies approach the cost of optimal
ordering, without a need to a priory determine
the order
Lottery Eddies outperform heuristics
Hash-use first, or Index-use first, Naive

13
Observations

The dynamics during a run can be controlled by a
learning scheme
Split the processing in steps (windows) to
re-adjust the weight during tuple delivery
Initial delays can not be handled efficiently
Research challenges
Better learning algorithms to adjust flow
Aggressive adjustments
Remove pre-optimization
Balance hostile parallel environment
Deploy eddies to control degree of partitioning
(and replication)

14
Database streams You only get one chance to look

Prof. Dr. Martin Kersten
CWI
Amsterdam
March 2003

15
Database research topic list

Indexing, Access methods, data structures
Query/transaction processing and optimization
Distributed, heterogeneous, mobile databases
View maintenance/materialisation
Mining data, text, and web
Semi-structured data, metadata and XML
Temporal, Spatial, Scientific, Statistical,
Biological DB
Data warehousing and OLAP
Middleware, Workflow and Security

HOT XML, Semantic Web, P2P, Streams, Biological
16
Outline

Introduction to Data Streaming Management System
(DSMS)
A reference architecture for a DSMS
Grouping thousands of user queries
Merging and abstraction of streams
Conclusions

17
The tranquil database scene

Traditional DBMS data stored in finite,
persistent data sets, SQL-based applications to
manage and access it

OLTP-web application
Ad-hoc reporting
Data entry application
RDBMS
18
The tranquil database scene

The user community grows and MANY wants
up-to-the-second (aggregate) information from the
database

OLTP-web application
Ad-hoc reporting
Data entry application
RDBMS
19
The tranquil database scene

Database entry is taken over by a remote device
which issues a high-volume of update transactions

OLTP-web application
Ad-hoc reporting
Data entry application
Dataentry application
RDBMS
20
The tranquil database scene

Database entry is taken over by MANY remote
devices which issues a high-volume of update
transactions

OLTP-web application
Adhoc reporting
Dataentry application
Dataentry application
RDBMS
21
The tranquil database scene

Database solutions can not carry the weight

OLTP-web application
Adhoc reporting
Dataentry application
Dataentry application
RDBMS
22
Application domains

Personalized financial tickers
Personalized information delivery
Personalized environment control
Business to business middelware
Web-services application based on XML exchange
Monitoring the real-world environment (pollution,
traffic)
Monitoring the data flow in an ISP
Monitoring web-traffic behaviour
Monitoring the load on a telecom switch
Monitoring external news-feeds

23
Application vision

Re-define the role of a DBMS in the complete
application support line
It manages a persistent store
It handles and coordinates updates
It supports ad-hoc querying
Application servers carry the load
J2EE, JBOS, Websphere,BEA,.
Or partly redesign the DBMS

24
Application domains

Personalized financial tickers
Personalized information delivery
Personalized environment control
Business to business middelware
Web-services application based on XML exchange
Monitoring the real-world environment (pollution,
traffic)
Monitoring the data flow in an ISP
Monitoring web-traffic behaviour
Monitoring the load on a telecom switch
Monitoring external news-feeds

25
Application domains

Personalized
Personalized
Personalized
middelware
on XML
exchange
Monitoring
Monitoring
Monitoring
Monitoring
Monitoring

QUERYING
WEB SERVICES
STREAM UPDATE
26
Continuous queries

Continous query the user observes the changes
made to the database through a query
Query registration once
Continously up-to-date answers.

Continuous queries
RDBMS
27
Data Streams

Data streams
The database is in constant bulk load mode
The update rate is often non-uniform
The entries are time-stamped
The source could be web-service, sensor, wrapped
source

Dataentry application
DSMS
28
DSMS

Data Stream Management Systems (DSMS) support
high volume update streams and real-time response
to ad-hoc complex queries.
What can be salvaged from the DBMS core
technology ?
What should be re-designed from scratch ?

Dataentry application
DSMS
29
DBMS versus DSMS

Persistent relations
Transaction oriented
One-time queries
Precise query answering
Access plan determines physical database design

Transient streams
Query orientation
Continuous queries
Best-effort query answering
Unpredictable data characteristics

30
Old technology to rescue?

Many stream based applications are low-volume
with simple queries
Thus we can live with automatic query refresh
Triggers are available for notification of
changes
They are hooked up to simple changes to the
datastore
There is no technology to merge/optimize trigger
groups

31
Outline of remainder

Query processing over multiple streams
Organizing hundreds of ad-hoc queries
Sensor-network based querying

32
A stream application

Widom Consider a network traffic system for an
ISP
with customer link and backbone link and two
streams
keeping track of the IP traffic

33
A stream application

Widom Consider a network traffic system for an
ISP
with customer link and backbone link and two
streams
keeping track of the IP traffic
TPc(saddr, daddr, id, length, timestamp)
TPb(saddr, daddr, id, length, timestamp)

PTc
PTb
34
A stream application

Q1 Compute the load on the backbone link averaged
over one minute period and notify the operator
when the load exceeds a threshold T
Select notifyoperator(sum(length))
From PTb
Group By getminute(timestamp)
Having sum(length) gtT
With low stream flow it could be handled with a
DBMS trigger,
Otherwise sample the stream to get an approximate
answer

35
A stream application

Q2 Find the fraction of traffic on the backbone
link coming from the customer network to check
cause of congestion.
( Select count()
From PTc as C, PTb as B
Where C.saddr B.saddr and C.daddrB.daddr
and C.idB.id ) /
( Select count() From PTb)
Both streams might require an unbounded resource
to perform the join, which could be avoided with
an approximate answer and synopsis

36
A stream application

Q3 Monitor the 5 source-to-destination pairs in
terms of traffic on the backbone.
With Load As (Select saddr, daddr,sum(length) as
traffic
From PTb Group By saddr,daddr)
Select saddr, daddr, traffic
From Load as l1
Where (Select count() From Load as l2
Where l2.traffic ltl1.traffic) gt
(Select 0.95count() From Load)
Order By Traffic
This query contains blocking operators

37
STREAM architecture
Answer
38

Q1 Compute the load on the backbone link averaged
over one minute period and notify the operator
when the load exceeds a threshold T
Select notifyoperator(sum(length))
From PTb
Group By getminute(timestamp)
Having sum(length) gtT

The answer store area simply needs an integer
39

Q2 Find the fraction of traffic on the backbone
link coming from the customer network to check
cause of congestion.
( Select count()
From PTc as C, PTb as B
Where C.saddr B.saddr and C.daddrB.daddr
and C.idB.id ) /
( Select count() From PTb)

The scratch area should maintain part of the two
streams to implement the join. Or a complete
list of saddr and daddr.
40
Joining two tables
RelA
Nested loop join
RelB
41
Joining two tables
RelA
Nested loop join
RelB
42
Joining two tables
RelA
Nested loop join
RelB
43
Joining two stream
..
PTa
Nested loop join
PTb
..
An unbounded store would be required
44
Joining two stream
..
PTa
merge join
PTb
..
If the streams are ordered a simple merge join is
possible With limited resource requirements
45
Joining two stream
window
..
PTa
histogram
Join synopsis
histogram
PTb
..
A statistical summary could provide an
approximate answer
46

Q3 Monitor the 5 source-to-destination pairs in
terms of traffic on the backbone.
With Load As (Select saddr, daddr,sum(length) as
traffic
From PTb Group By saddr,daddr)
Select saddr, daddr, traffic
From Load as l1
Where (Select count() From Load as l2
Where l2.traffic ltl1.traffic) gt
(Select 0.95count() From Load)
Order By Traffic
The scratch area should maintain part of the two
streams to
implement the join.

47
Finance

DeWitt Consider a financial feed where
thousands of clients can register arbitrary
complex continues queries.
XML stream querying

XML
48
Finance

Q5 Notify me whenever the price of KPN stock
drops below 6 euro
Select notifyUser(name, price)
From ticker t1
Where t1.name KPN and t1.price lt 6

49
Finance

Q5 Notify me whenever the price of KPN stock
drops by 5 over the last hour
Select notifyUser(name, price)
From ticker t1,t2
Where t1.name KPN and t2.name t1.name
and getminutes(t1.timestamp-t2.timestamp) lt60
and t1.price lt 0.95 t2.price

50
Finance

Q6 Notify me whenever the price of KPN stock
drops by 5 over the last hour and T-mobile
remains constant
Select notifyUser(name, price)
From ticker t1,t2, t3,t4
Where t1.name KPN and t2.name t1.name
and getminutes(t1.timestamp-t2.timestamp) lt60
and t1.price lt 0.95 t2.price
and t1.timestampt3.timestamp and
t2.timestampt4.timestamp
and t3.name T-Mobile and t4.name t3.name
and getminutes(t3.timestamp-t4.timestamp) lt60
and t3.price t4.price

51
Query signatures

Traditional SQL applications already use the
notion of parameterised queries, I.e. some
constants are replaced by a program variable.
Subsequent calls use the same query evaluation
plan
In a DSMS we should recognize such queries as
quick as possible
Organize similar queries into a group
Decompose complex queries into smaller queries
Manage the amount of intermediate store

52
Finance

Queries can be organized in groups using a
signature and evaluation can be replaced by
single multi-user request.

Select notifyUser(name, price) From ticker
t1 Where t1.name KPN and t1.price lt 6
Client Name Threshold Price
192.871.12.1 KPN 6
192.777.021 ING 12

53
Finance

Queries can be organized in groups using a
signature and evalution can be replaced by single
multi-user request.

Select notifyUser(c.client, t1.name,
t1.price) From ticker t1, clients c Where t1.name
c.name and t1.price lt c.price
Client Name Threshold Price
192.871.12.1 KPN 6
192.777.021 ING 12

54
Finance

Timer-based queries call for a stream window with
incremental evaluation
Multiple requests can be organized by time-table
and event detection methods provided by database
triggers.
Select notifyUser(name, price)
From ticker t1,t2
Where t1.name KPN and t2.name t1.name
and getminutes(t1.timestamp-t2.timestamp) lt60
and t1.price lt 0.95 t2.price

55
Finance

Complex queries can be broken down into
independent components
Select notifyUser(name, price)
From ticker t1,t2, t3,t4
Where t1.name KPN and t2.name t1.name
and getminutes(t1.timestamp-t2.timestamp) lt60
and t1.price lt 0.95 t2.price
and t1.timestampt3.timestamp and
t2.timestampt4.timestamp
and t3.name T-Mobile and t4.name t3.name
and getminutes(t3.timestamp-t4.timestamp) lt60
and t3.price t4.price

56
Finance

Intermediate results should be materialized. Can
be integrated in tradition query evaluation
schemes

t1.timestampt3.timestamp and t2.timestampt4.time
stamp

57
Sensor networks

Madden Sensor networks are composed of
thousands of small devices, interconnected
through radio links. This network can be queried.
Sensors have limited energy
Sensors have limited reachability
Sensors can be crushed

58
Aggregate Queries Over Ad-Hoc Wireless Sensor
Networks
59
Sensor networks

Q7 Give me the traffic density on the A1 for the
last hour
Select avg(t.car)
From traffic t
Where t.segment in (Select segment From roads
Where name A1)
Group By gethour(t.timestamp)

60
Sensor networks

The sensors should organize themselves into a P2P
infrastructure
An aggregate query is broadcasted through the
network
Each Mote calculates a partial answer and sent it
to its peers
Peers aggregate the information to produce the
final answer.
Problems
The energy to broadcast some information is high
Tuples and partial results may be dropped

61
Conclusions and outlook

Data stream management technology require changes
in our expectation of a DBMS functionality
Queries not necessarily provide a precise answer
Queries continue as long as we are interested in
their approximate result
The persistent store not necessarily contains a
consistent and timeless view on the state of the
database.

62
Conclusions and outlook

Datastream management technology capitalizes upon
proven DBMS technology
DSMS provide a basis for ambient home settings,
sensor networks, and globe spanning information
systems
It is realistic to expect that some of the
properties to support efficient datastream
management will become part of the major products
Multi query optimization techniques should be
added.

63
Literature

NiagaraCQ A Scalable Contious Query System for
Internet Databases, J. Chen, D.J. deWitt, F.
Tian, Y. Wang, Wisconsin Univ.
Streaming Queries over Streaming Data , Sirish
Chandrasekaran, Michael J. Franklin, Univ
Berkeley
Continous Queries over Data Streams, S.Babu, J.
Widom, Stanford University

Write a Comment

User Comments (0)