Title: Characterizing Memory Requirements for Queries over Continuous Data Streams
1Characterizing Memory Requirements for Queries
over Continuous Data Streams
- Arvind Arasu, Brian Babcock, Shivnath Babu, Jon
McAlister, Jennifer Widom
Stanford University
Speaker
2Continuous Data Streams
- Network traffic data
- Transaction logs
- Call records, Web logs, ...
- Financial data
- Sensor networks
- Scientific data
- Astronomy, Biology, ...
3A DBMS for Data Streams?
- Lots of existing work in data streams
- Mostly special-purpose applications
- We're building a general-purpose data stream management system (DSMS)
http://www-db.stanford.edu/stream/
4RDBMS
DSMS
5Query Execution Model
1. Client registers query
2. Tuples arrive on streams S and T...
...are read and discarded...
...and answers are returned to the client
Limited-size scratch space (memory) available in the DSMS
6Our Problem
Given a data stream query, determine how much
memory is required to evaluate it.
7Queries We Consider
- SPJ Queries: $\pi_L(\sigma_P(S_1 \times S_2 \times \cdots \times S_n))$
- Projection is either duplicate-preserving or duplicate-eliminating
- Selection predicates are conjunctions of
- Si.A Op Sj.B -or- Si.A Op k
- Op $\in \{>, \geq, =, <, \leq\}$
- All attributes are integers
8An example with no joins
SELECT cust_id FROM orders WHERE amt > 5
- Requires no scratch memory
- Each tuple is independent
- Tuples in the answer are streamed away
SELECT DISTINCT cust_id FROM orders WHERE amt > 5
- Requires unbounded memory
- All cust_ids must be remembered
SELECT DISTINCT cust_id FROM orders WHERE amt > 5 AND cust_id > 1000 AND cust_id < 9999
- Requires bounded memory
- Remember cust_ids from 1000-9999
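The three variants on this slide can be sketched as one-pass Python generators to make the memory behaviour concrete. This is only an illustration, not the DSMS implementation: the tuple layout `(cust_id, amt)`, the function names, and the use of a Python `set` as scratch space are assumptions, and the range bounds are treated as strict comparisons.

```python
from typing import Iterable, Iterator, Tuple

Order = Tuple[int, int]  # hypothetical tuple layout: (cust_id, amt)


def select_plain(orders: Iterable[Order]) -> Iterator[int]:
    """SELECT cust_id ... WHERE amt > 5: no state survives between tuples."""
    for cust_id, amt in orders:
        if amt > 5:
            yield cust_id


def select_distinct(orders: Iterable[Order]) -> Iterator[int]:
    """SELECT DISTINCT: every emitted cust_id must be remembered, so the
    scratch space grows with the number of distinct ids in the input."""
    seen = set()
    for cust_id, amt in orders:
        if amt > 5 and cust_id not in seen:
            seen.add(cust_id)
            yield cust_id


def select_distinct_ranged(orders: Iterable[Order],
                           lo: int = 1000, hi: int = 9999) -> Iterator[int]:
    """SELECT DISTINCT with lo < cust_id < hi: `seen` can never hold more
    than hi - lo - 1 ids, whatever the length of the stream."""
    seen = set()
    for cust_id, amt in orders:
        if amt > 5 and lo < cust_id < hi and cust_id not in seen:
            seen.add(cust_id)
            yield cust_id
```

Only the size of `seen` differs between the last two functions: it is unbounded in the first case and capped by the range predicate in the second.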
9An example with an equijoin
SELECT R.prod_id FROM orders O, returns R WHERE
O.order_num = R.order_num AND R.prod_id > 100 AND R.prod_id < 199
AND O.order_num > 1000 AND O.order_num < 1103
10An example with an inequality
SELECT O.prod_id FROM orders O, inventory I WHERE O.amt > I.qty AND O.prod_id > 100 AND O.prod_id < 300
SELECT DISTINCT O.prod_id FROM orders O, inventory I WHERE O.amt > I.qty AND O.prod_id > 100 AND O.prod_id < 300
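For the DISTINCT form of this query, one possible small synopsis keeps the largest O.amt seen per in-range prod_id plus the smallest I.qty seen so far. The sketch below is an illustration under assumptions, not the strategy from the paper: the event encoding (`("O", prod_id, amt)` and `("I", qty)`), the function name, and the choice to emit a prod_id as soon as a matching pair exists are all hypothetical, and the comparisons are read as strict.

```python
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple


def distinct_prod_ids(events: Iterable[Tuple],
                      lo: int = 100, hi: int = 300) -> Iterator[int]:
    """Emit each prod_id with lo < prod_id < hi once, as soon as some
    order amount exceeds some inventory quantity."""
    max_amt: Dict[int, int] = {}    # prod_id -> largest O.amt seen so far
    min_qty: Optional[int] = None   # smallest I.qty seen so far
    emitted: Set[int] = set()       # prod_ids already in the answer

    for ev in events:
        if ev[0] == "O":
            _, prod_id, amt = ev
            if not (lo < prod_id < hi) or prod_id in emitted:
                continue
            if prod_id not in max_amt or amt > max_amt[prod_id]:
                max_amt[prod_id] = amt
            if min_qty is not None and max_amt[prod_id] > min_qty:
                emitted.add(prod_id)
                del max_amt[prod_id]
                yield prod_id
        else:
            _, qty = ev
            if min_qty is None or qty < min_qty:
                min_qty = qty
                # A lower minimum may unlock prod_ids that were waiting.
                ready = sorted(p for p, a in max_amt.items() if a > min_qty)
                for p in ready:
                    emitted.add(p)
                    del max_amt[p]
                    yield p
```

`max_amt` and `emitted` together never hold more than hi - lo - 1 keys, so the state stays bounded no matter how long the streams run. The same trick does not carry over to the duplicate-preserving form, where every matching pair contributes its own answer tuple.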
11Locally Totally Ordered Queries
- LTO Queries: SPJ queries with additional predicates applied
- For each stream, stipulate a total order for all attributes in the stream and all constants
- Only allow tuples whose attribute values follow that ordering
- All SPJ queries can be written as a union of LTO queries
12Example of an LTO query
Stream S (A, B)
Stream T (C, D)
SELECT S.A, T.C FROM S, T WHERE S.B > 12
SELECT S.A, T.C FROM S, T WHERE S.B > 12 AND S.A = S.B AND T.D < T.C AND T.C < 12
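To see why a union of locally ordered pieces can cover every tuple, the sketch below enumerates the candidate orderings for a single stream and checks which one a given tuple follows. It assumes that a "total order" here allows ties (two attributes may be equal), and it treats the constant as one more named item. The helper names `weak_orderings` and `satisfies` are hypothetical.

```python
from typing import Dict, Iterator, List

# Blocks are listed in increasing order; names inside a block are equal.
Ordering = List[List[str]]


def weak_orderings(items: List[str]) -> Iterator[Ordering]:
    """Yield every way to arrange `items` in increasing order with ties."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for ordering in weak_orderings(rest):
        # `first` ties with an existing block ...
        for i in range(len(ordering)):
            yield ordering[:i] + [ordering[i] + [first]] + ordering[i + 1:]
        # ... or forms a new block at any position.
        for i in range(len(ordering) + 1):
            yield ordering[:i] + [[first]] + ordering[i:]


def satisfies(ordering: Ordering, values: Dict[str, int]) -> bool:
    """True if `values` are equal within blocks and increase across blocks."""
    levels = [{values[name] for name in block} for block in ordering]
    if any(len(level) != 1 for level in levels):
        return False
    keys = [min(level) for level in levels]
    return all(a < b for a, b in zip(keys, keys[1:]))
```

For stream S (A, B) and the constant 12 there are 13 such orderings, and any concrete tuple follows exactly one of them. That is what lets the original query be split into pieces that each fix one ordering per stream.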
13MinRef and MaxRef
- For each stream S in the query
- $\mathrm{MinRef}(S) = \{\, S.A \mid S.A < T.B \text{ is a necessary inequality in the predicate} \,\}$
14Bounded-Memory Conditions
- 1. All attributes in the projection list must be bounded.
- 2. All attributes participating in equijoins must be bounded.
- 3. In each stream S, $|\mathrm{MinRef}(S)| + |\mathrm{MaxRef}(S)|$
- $= 0$, for SELECT
- $\leq 1$, for SELECT DISTINCT
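The three conditions can be turned into a small checker once the relevant sets are known. The sketch below assumes that the bounded attributes and the MinRef/MaxRef sets have already been computed for the query, and it reads condition 3 as the sum of the two set sizes being 0 for SELECT and at most 1 for SELECT DISTINCT. The function name and argument layout are hypothetical.

```python
from typing import Dict, Set


def is_bounded_memory(projection: Set[str],
                      equijoin_attrs: Set[str],
                      bounded: Set[str],
                      min_ref: Dict[str, Set[str]],
                      max_ref: Dict[str, Set[str]],
                      distinct: bool) -> bool:
    """Check the three bounded-memory conditions on precomputed inputs."""
    # 1. Every projected attribute must be bounded.
    if not projection <= bounded:
        return False
    # 2. Every attribute used in an equijoin must be bounded.
    if not equijoin_attrs <= bounded:
        return False
    # 3. Per stream: |MinRef(S)| + |MaxRef(S)| is 0 (SELECT) or <= 1 (DISTINCT).
    limit = 1 if distinct else 0
    streams = set(min_ref) | set(max_ref)
    return all(
        len(min_ref.get(s, set())) + len(max_ref.get(s, set())) <= limit
        for s in streams
    )
```

Computing MinRef and MaxRef, and deciding which attributes are bounded, is the harder part and is deliberately left outside this sketch.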
15An unbounded example
SELECT DISTINCT T.E FROM S, T WHERE T.E = 10
AND S.A < T.C AND S.B < T.D
16Conclusion
- We consider SPJ queries over data streams
- We identify which queries can and cannot be evaluated using bounded memory
- For queries that can, we provide an execution strategy based on synopses.
- For queries that cannot, we provide examples of bad input streams.
Full paper at http://www-db.stanford.edu/???
E-mail: babcock_at_cs.stanford.edu