Title: Eddies: Continuously Adaptive Query processing
 1Eddies Continuously Adaptive Query processing 
- R. Avnur, J.M. Hellerstein 
- UCB 
- ACM Sigmod 2000 
2Problem Statement
- Context large federated and shared-nothing 
 databases
- Problem assumptions made at query optimization 
 rarely hold during execution
- Hypothesis do away with traditional optimizers, 
 solve it thru adaptation
- Focus scheduling in a tuple-based pipeline query 
 execution model
3Problem Statement Refinement
- Large scale systems are unpredictable, because 
- Hardware and workload complexity, 
- bursty servers  networks, heterogenity, hardware 
 characteristics
- Data complexity, 
- Federated database often come without proper 
 statistical summaries
- User Interface Complexity 
- Online aggregation may involve user control
4Research Laboratory setting
- Telegraph, a system designed to query all data 
 available online
- River, a low level distributed record management 
 system for shared-nothing databases
- Eddies, a scheduler for dispatching work over 
 operators in a query graph
5The Idea
- Relational algebra operators consume a stream 
 from multiple sources to produce a new stream
- A priori you dont now how selective- how fast- 
 tuples are consumed/produced
- You have to adapt continuously and learn this 
 information on the fly
- Adapt the order of processing based on these 
 lessons
6The Idea
next
next
next
next
next
next 
 7The Idea
- Standard method derive a spanning tree over the 
 query graph
- Pre-optimize a query plan to determine operator 
 pairs and their algorithm, e.g. to exploit access
 paths
- Re-optimization a query pipeline on the fly 
 requires careful state management, coupled with
- Synchronization barriers 
- Operators have widely differing arrival rates for 
 their operands
- This limits concurrency, e.g. merge-join 
 algorithm
- Moments of symmetry 
- Algorithm provides option to exchange the role of 
 the operands without too much complications
- E.g switching the role of R and S in a 
 nested-loop join
8Nested-loop
R
s 
 9Join and sorting
- Index-joins are asymmetric, you can not easily 
 change their role
- Combine index-join  operands as a unit in the 
 process
- Sorting requires look-ahead 
- Merge-joins are combined into unit 
- Ripple joins 
- Break the space into smaller pieces and solve the 
 join operation for each piece individually
- The piece crossings are moments of symmetry 
10The Idea
next
next
next
next 
 11Rivers and Eddies
- Eddies are tuple routers that distribute arriving 
 tuples to interested operators
- What are efficient scheduling policies? 
- Fixed strategy? Random ? Learning? 
- Static Eddies 
- Delivery of tuples to operators can be hardwired 
 in the Eddie to reflect a traditional query
 execution plan
- Naïve Eddie 
- Operators are delivered tuples based on a 
 priority queue
- Intermediate results get highest priority to 
 avoid buffer congestion
12Observations for selections
- Extended priority queue for the operators 
- Receiving a tuple leads to a credit increment 
- Returning a tuple leads to a credit decrement 
- Priority is determined by weighted lottery 
- Naïve Eddies exhibit back pressure in the tuple 
 flow production is limited by the rate of
 consumption at the output
- Lottery Eddies approach the cost of optimal 
 ordering, without a need to a priory determine
 the order
- Lottery Eddies outperform heuristics 
- Hash-use first, or Index-use first, Naive
13Observations
- The dynamics during a run can be controlled by a 
 learning scheme
- Split the processing in steps (windows) to 
 re-adjust the weight during tuple delivery
- Initial delays can not be handled efficiently 
- Research challenges 
- Better learning algorithms to adjust flow 
- Aggressive adjustments 
- Remove pre-optimization 
- Balance hostile parallel environment 
- Deploy eddies to control degree of partitioning 
 (and replication)
14Database streams You only get one chance to look
- Prof. Dr. Martin Kersten 
- CWI 
- Amsterdam 
- March 2003
15Database research topic list
- Indexing, Access methods, data structures 
- Query/transaction processing and optimization 
- Distributed, heterogeneous, mobile databases 
- View maintenance/materialisation 
- Mining data, text, and web 
- Semi-structured data, metadata and XML 
- Temporal, Spatial, Scientific, Statistical, 
 Biological DB
- Data warehousing and OLAP 
- Middleware, Workflow and Security
HOT XML, Semantic Web, P2P, Streams, Biological 
 16Outline
- Introduction to Data Streaming Management System 
 (DSMS)
- A reference architecture for a DSMS 
- Grouping thousands of user queries 
- Merging and abstraction of streams 
- Conclusions
17The tranquil database scene
- Traditional DBMS  data stored in finite, 
 persistent data sets, SQL-based applications to
 manage and access it
OLTP-web application
Ad-hoc reporting
Data entry application
RDBMS 
 18The tranquil database scene
- The user community grows and MANY wants 
 up-to-the-second (aggregate) information from the
 database
OLTP-web application
Ad-hoc reporting
Data entry application
RDBMS 
 19The tranquil database scene
- Database entry is taken over by a remote device 
 which issues a high-volume of update transactions
OLTP-web application
Ad-hoc reporting
Data entry application
Dataentry application
RDBMS 
 20The tranquil database scene
- Database entry is taken over by MANY remote 
 devices which issues a high-volume of update
 transactions
OLTP-web application
Adhoc reporting
Dataentry application
Dataentry application
RDBMS 
 21The tranquil database scene
- Database solutions can not carry the weight 
OLTP-web application
Adhoc reporting
Dataentry application
Dataentry application
RDBMS 
 22Application domains
- Personalized financial tickers 
- Personalized information delivery 
- Personalized environment control 
- Business to business middelware 
- Web-services application based on XML exchange 
- Monitoring the real-world environment (pollution, 
 traffic)
- Monitoring the data flow in an ISP 
- Monitoring web-traffic behaviour 
- Monitoring the load on a telecom switch 
- Monitoring external news-feeds
23Application vision
- Re-define the role of a DBMS in the complete 
 application support line
- It manages a persistent store 
- It handles and coordinates updates 
- It supports ad-hoc querying 
- Application servers carry the load 
- J2EE, JBOS, Websphere,BEA,. 
- Or partly redesign the DBMS
24Application domains
- Personalized financial tickers 
- Personalized information delivery 
- Personalized environment control 
- Business to business middelware 
- Web-services application based on XML exchange 
- Monitoring the real-world environment (pollution, 
 traffic)
- Monitoring the data flow in an ISP 
- Monitoring web-traffic behaviour 
- Monitoring the load on a telecom switch 
- Monitoring external news-feeds
25Application domains
- Personalized 
- Personalized 
- Personalized 
-  
 middelware
-  on XML 
 exchange
- Monitoring 
- Monitoring 
- Monitoring 
- Monitoring 
- Monitoring
QUERYING
WEB SERVICES
STREAM UPDATE 
 26Continuous queries
- Continous query  the user observes the changes 
 made to the database through a query
- Query registration once 
- Continously up-to-date answers.
Continuous queries
RDBMS 
 27Data Streams
- Data streams 
- The database is in constant bulk load mode 
- The update rate is often non-uniform 
- The entries are time-stamped 
- The source could be web-service, sensor, wrapped 
 source
Dataentry application
DSMS 
 28DSMS
- Data Stream Management Systems (DSMS) support 
- high volume update streams and real-time response 
 
- to ad-hoc complex queries. 
- What can be salvaged from the DBMS core 
 technology ?
- What should be re-designed from scratch ?
Dataentry application
DSMS 
 29DBMS versus DSMS
- Persistent relations 
- Transaction oriented 
- One-time queries 
- Precise query answering 
- Access plan determines physical database design 
- Transient streams 
- Query orientation 
- Continuous queries 
- Best-effort query answering 
- Unpredictable data characteristics 
30Old technology to rescue?
- Many stream based applications are low-volume 
 with simple queries
- Thus we can live with automatic query refresh 
- Triggers are available for notification of 
 changes
- They are hooked up to simple changes to the 
 datastore
- There is no technology to merge/optimize trigger 
 groups
31Outline of remainder
- Query processing over multiple streams 
- Organizing hundreds of ad-hoc queries 
- Sensor-network based querying
32A stream application
- Widom Consider a network traffic system for an 
 ISP
-  with customer link and backbone link and two 
 streams
-  keeping track of the IP traffic 
33A stream application
- Widom Consider a network traffic system for an 
 ISP
-  with customer link and backbone link and two 
 streams
-  keeping track of the IP traffic 
-  TPc(saddr, daddr, id, length, timestamp) 
-  TPb(saddr, daddr, id, length, timestamp)
PTc
PTb 
 34A stream application
- Q1 Compute the load on the backbone link averaged 
 over one minute period and notify the operator
 when the load exceeds a threshold T
- Select notifyoperator(sum(length)) 
- From PTb 
- Group By getminute(timestamp) 
- Having sum(length) gtT 
- With low stream flow it could be handled with a 
 DBMS trigger,
- Otherwise sample the stream to get an approximate 
 answer
35A stream application
- Q2 Find the fraction of traffic on the backbone 
 link coming from the customer network to check
 cause of congestion.
-  ( Select count() 
-  From PTc as C, PTb as B 
-  Where C.saddr  B.saddr and C.daddrB.daddr 
-  and C.idB.id ) / 
- ( Select count() From PTb) 
- Both streams might require an unbounded resource 
 to perform the join, which could be avoided with
 an approximate answer and synopsis
36A stream application
- Q3 Monitor the 5 source-to-destination pairs in 
 terms of traffic on the backbone.
- With Load As (Select saddr, daddr,sum(length) as 
 traffic
-  From PTb Group By saddr,daddr) 
- Select saddr, daddr, traffic 
- From Load as l1 
- Where (Select count() From Load as l2 
-  Where l2.traffic ltl1.traffic) gt 
-  (Select 0.95count() From Load) 
- Order By Traffic 
-  
- This query contains blocking operators
37STREAM architecture
Answer 
 38- Q1 Compute the load on the backbone link averaged 
 over one minute period and notify the operator
 when the load exceeds a threshold T
- Select notifyoperator(sum(length)) 
- From PTb 
- Group By getminute(timestamp) 
- Having sum(length) gtT 
The answer store area simply needs an integer 
 39- Q2 Find the fraction of traffic on the backbone 
 link coming from the customer network to check
 cause of congestion.
-  ( Select count() 
-  From PTc as C, PTb as B 
-  Where C.saddr  B.saddr and C.daddrB.daddr 
-  and C.idB.id ) / 
- ( Select count() From PTb) 
The scratch area should maintain part of the two 
streams to implement the join. Or a complete 
list of saddr and daddr. 
 40Joining two tables
RelA
Nested loop join
RelB 
 41Joining two tables
RelA
Nested loop join
RelB 
 42Joining two tables
RelA
Nested loop join
RelB 
 43Joining two stream
..
PTa
Nested loop join
PTb
..
An unbounded store would be required 
 44Joining two stream
..
PTa
merge join
PTb
..
If the streams are ordered a simple merge join is 
possible With limited resource requirements 
 45Joining two stream
window
..
PTa
histogram
Join synopsis
histogram
PTb
..
A statistical summary could provide an 
approximate answer 
 46- Q3 Monitor the 5 source-to-destination pairs in 
 terms of traffic on the backbone.
- With Load As (Select saddr, daddr,sum(length) as 
 traffic
-  From PTb Group By saddr,daddr) 
- Select saddr, daddr, traffic 
- From Load as l1 
- Where (Select count() From Load as l2 
-  Where l2.traffic ltl1.traffic) gt 
-  (Select 0.95count() From Load) 
- Order By Traffic 
-  
- The scratch area should maintain part of the two 
 streams to
- implement the join.
47Finance
- DeWitt Consider a financial feed where 
 thousands of clients can register arbitrary
 complex continues queries.
- XML stream querying
XML 
 48Finance
- Q5 Notify me whenever the price of KPN stock 
 drops below 6 euro
- Select notifyUser(name, price) 
- From ticker t1 
- Where t1.name  KPN and t1.price lt 6
49Finance
- Q5 Notify me whenever the price of KPN stock 
 drops by 5 over the last hour
- Select notifyUser(name, price) 
- From ticker t1,t2 
- Where t1.name  KPN and t2.name t1.name 
-  and getminutes(t1.timestamp-t2.timestamp) lt60 
-  and t1.price lt 0.95  t2.price
50Finance
- Q6 Notify me whenever the price of KPN stock 
 drops by 5 over the last hour and T-mobile
 remains constant
- Select notifyUser(name, price) 
- From ticker t1,t2, t3,t4 
- Where t1.name  KPN and t2.name t1.name 
-  and getminutes(t1.timestamp-t2.timestamp) lt60 
-  and t1.price lt 0.95  t2.price 
-  and t1.timestampt3.timestamp and 
 t2.timestampt4.timestamp
-  and t3.name  T-Mobile and t4.name t3.name 
-  and getminutes(t3.timestamp-t4.timestamp) lt60 
-  and t3.price  t4.price
51Query signatures
- Traditional SQL applications already use the 
 notion of parameterised queries, I.e. some
 constants are replaced by a program variable.
- Subsequent calls use the same query evaluation 
 plan
- In a DSMS we should recognize such queries as 
 quick as possible
- Organize similar queries into a group 
- Decompose complex queries into smaller queries 
- Manage the amount of intermediate store
52Finance
- Queries can be organized in groups using a 
 signature and evaluation can be replaced by
 single multi-user request.
Select notifyUser(name, price) From ticker 
t1 Where t1.name  KPN and t1.price lt 6
Client Name Threshold Price
192.871.12.1 KPN 6 
192.777.021 ING 12  
 53Finance
- Queries can be organized in groups using a 
 signature and evalution can be replaced by single
 multi-user request.
Select notifyUser(c.client, t1.name, 
t1.price) From ticker t1, clients c Where t1.name 
 c.name and t1.price lt c.price
Client Name Threshold Price
192.871.12.1 KPN 6 
192.777.021 ING 12  
 54Finance
- Timer-based queries call for a stream window with 
 incremental evaluation
- Multiple requests can be organized by time-table 
 and event detection methods provided by database
 triggers.
- Select notifyUser(name, price) 
- From ticker t1,t2 
- Where t1.name  KPN and t2.name t1.name 
-  and getminutes(t1.timestamp-t2.timestamp) lt60 
-  and t1.price lt 0.95  t2.price
55Finance
- Complex queries can be broken down into 
 independent components
- Select notifyUser(name, price) 
- From ticker t1,t2, t3,t4 
- Where t1.name  KPN and t2.name t1.name 
-  and getminutes(t1.timestamp-t2.timestamp) lt60 
-  and t1.price lt 0.95  t2.price 
-  and t1.timestampt3.timestamp and 
 t2.timestampt4.timestamp
- and t3.name  T-Mobile and t4.name t3.name 
-  and getminutes(t3.timestamp-t4.timestamp) lt60 
-  and t3.price  t4.price
56Finance
- Intermediate results should be materialized. Can 
 be integrated in tradition query evaluation
 schemes
t1.timestampt3.timestamp and t2.timestampt4.time
stamp 
 57Sensor networks
- Madden Sensor networks are composed of 
 thousands of small devices, interconnected
 through radio links. This network can be queried.
- Sensors have limited energy 
- Sensors have limited reachability 
- Sensors can be crushed
58Aggregate Queries Over Ad-Hoc Wireless Sensor 
Networks  
 59Sensor networks
- Q7 Give me the traffic density on the A1 for the 
 last hour
- Select avg(t.car) 
- From traffic t 
- Where t.segment in (Select segment From roads 
-  Where name  A1) 
- Group By gethour(t.timestamp)
60Sensor networks
- The sensors should organize themselves into a P2P 
 infrastructure
- An aggregate query is broadcasted through the 
 network
- Each Mote calculates a partial answer and sent it 
 to its peers
- Peers aggregate the information to produce the 
 final answer.
- Problems 
- The energy to broadcast some information is high 
- Tuples and partial results may be dropped 
61Conclusions and outlook
- Data stream management technology require changes 
 in our expectation of a DBMS functionality
- Queries not necessarily provide a precise answer 
- Queries continue as long as we are interested in 
 their approximate result
- The persistent store not necessarily contains a 
 consistent and timeless view on the state of the
 database.
62Conclusions and outlook
- Datastream management technology capitalizes upon 
 proven DBMS technology
- DSMS provide a basis for ambient home settings, 
 sensor networks, and globe spanning information
 systems
- It is realistic to expect that some of the 
 properties to support efficient datastream
 management will become part of the major products
- Multi query optimization techniques should be 
 added.
63Literature
- NiagaraCQ A Scalable Contious Query System for 
 Internet Databases, J. Chen, D.J. deWitt, F.
 Tian, Y. Wang, Wisconsin Univ.
- Streaming Queries over Streaming Data , Sirish 
 Chandrasekaran, Michael J. Franklin, Univ
 Berkeley
- Continous Queries over Data Streams, S.Babu, J. 
 Widom, Stanford University