Title: New England Database Society NEDS
1- New England Database Society (NEDS)
- Friday, September 24, 2004
- Volen 101, Brandeis University
Sponsored by Sun Microsystems
2The Design of Stream Mill A Data Stream
Management System of Many Uses
- Carlo Zaniolo, C S Department, UCLA
- andY.Bai, Y. Law, R. Luo, H. Thakkar, H.
Wang (IBM)
3Outline
- Design Objectives for Data Stream Management
System (DSMS) - Language Issues
- The Expressive Stream Language ESL
- The Stream Mill System
- Conclusion and discussion
4DSMS Research Projects
- Aurora (Brandeis/Brown/MIT) http//www.cs.brown.ed
u/research/aurora/ - Cougar (Cornell) http//www.cs.cornell.edu/databas
e/cougar/ - Telegraph (Berkeley)- http//telegraph.cs.berkeley
.edu - STREAM (Stanford) http//www-db.stanford.edu/stre
am - Niagara (OGI/Wisconsin)-http//www.cs.wisc.edu/nia
gara/ - OpenCQ (Georgia Tech) http//disl.cc.gatech.edu/
CQ - Tapestry (Xerox) electronic documents stream
filtering - Hancock (ATT) http//www.research.att.com/kfishe
r/hancock/ - Cape (WPI) http//davis.wpi.edu/dsrg/CAPE/home.htm
l - Tribeca (Bellcore) network monitoring
- Stream Mill (UCLA) http//wis.cs.ucla.edu/stream
-mill - And others
5CQLs for DSMS
- Most of DSMS projects use SQL for continuous
queriesfor good reasons, since - Many applications span data streams and DB tables
- A CQL based on SQL will be easier to learn use
- Moreover the fewer the differences the better!
- But DBMS were designed for persistent data and
transient queries---not for persistent queries on
transient data - Adaptation of SQL and its enabling technology
presents difficult research challenges - No Home advantage only success will mute
competition from other CS fields.
6Outline
- Design Objectives for Data Stream Management
System (DSMS) - Language Issues
- The Expressive Stream Language ESL
- The Stream Mill System
- Conclusion and discussion
7Rest of the Talk
- Language Issues
- Designing an Expressive Stream Language
- The Stream Mill System
- The status.
8Language Problems
- Most DSMS projects use SQL queries spanning
both data streams and DBs will be easier. But - Even for persistent data, SQL is far from
perfect.Important application areas poorly
supported include - Data Mining, and we need to mine data streams,
- Sequence queries, and data streams are infinite
time series! - Major new problems for SQL on data stream
applications. - After all, it was designed for persistent data on
secondary store, not for streaming data
9Blocking Operators
- A blocking query operator is one that is unable
to produce the first tuple of the output until it
has seen the entire input Babcock et al.
PODS02 - But continuous queries cannot wait for the end of
the stream must return results while the data is
streaming in. Blocking operators cannot be used. - Only non-blocking (nb) queries and operators can
be used on data streams (i.e. those that return
their results before they have detected the end
of the input). - Current DBMSs make heavy usage of blocking
computations - For operators that are intrinsically blocking
e.g., SQL aggregates, - And for those that are not e.g., sort based
implementation of joins and group by - We only need to be concerned with 1 find a
characterization for blocking nonblocking
independent of implementation.
10Partial Ordering
- Let S t1, ¼, tn be a sequence and 0 k
n. - Then t1, ¼, tk is said to be the presequence
of S, of length k, denoted by Sk. - If, for some k, LSk, we say that L is a
presequence of S, written L ? S - ? Defines a Partial Order reflexive,
antisymmetric and transitive. - ? generalizes to the subset notion when order and
duplicates are immaterial - The empty sequence, , is a subsequence of
every other sequence.
11Operators on Sequences
- S G G(S).
- Gj (S) denotes the cumulative output produced up
to the j-th input tuple included. - S is a sequence of length n. Then G is said to
be - Blocking when Gj(S) for j lt n, and Gn(S)
G(S) - Nonblocking when Gj(S) G(Sj), for every j n.
- Examples SQL-2 aggregates are blocking
- Selection is nonblocking. Ditto for continuous
count (that returns, for each new tuple, the
count so far Gj(S) 1,2, , j independent on
whether jn or jltn)
Operators can be viewed as incremental
transducers
12Characterization of NonBlocking (nb)
- Theorem Queries can be expressed via nonblocking
computations iff they are monotonic. - A query language L can express a given set of
functions on its input (DB, sequences, data
streams) - To avoid blocking queries, only the monotonic
functions expressible by L should be allowed on
data streams. But are ALL of them expressible
using the nb-operators of L ? - L contains nb-operators and blocking operators
only the former can be used on data streams---so,
can those express all the monotonic queries
expressible in L ? - Definition L is said to be nb-complete when it
can express all the monotonic queries expressible
in L using only its nb-operators
13E-Bay Example
- Auctions a stream of bids on an item.
- bidStream(Item, BidValue, Time)
- Items for which sum of bids is gt 100K
- SELECT Item FROM bidStream GROUP BY
Item HAVING SUM(BidValue) gt 100000 - This is a monotonic query. Thus it can be
expressed in a language containing suitable
query operators, but not in SQL-2. SQL-2 is not
nb-complete thus it is ill-suited for continuous
queries on data streams.
14Relational Algebra
- NonMonotonic (i.e. blocking) RA operators set
difference and division - We are left with select, project, join, and
union. Can these express all FO monotonic
queries? - Some interesting temporal queries coalesce and
until - They are expressible in RA (by double negation)
- They are monotonic
- They cannot be expressed in nb-RA.
- Theorem RA and SQL are not nb-complete.
- SQL faces two problems (i) the exclusion of
EXCEPT/NOT EXISTS, and
(ii) the exclusion of aggregates.
15Embedding SQL Queries in a PL
- In DB applications, SQL can be embedded in a PL
(Java, C) where the PL accesses the tuples
returned by SQL using a Get Next of Cursor
statement. - Operations that could not be expressed in SQL can
then be expressed in the PL - an effective remedy for the lack of expressive
power of SQL - But cursors is a pull-based mechanism and
cannot be used on data streams the DSMS cannot
hold tuples until the PL request them. - The DSMS can only deliver its output to the PL as
a streamThis is OK to drive a GUI. But who is
doing most of the work who is the DSMS? - Contrast this to DBMS who are useful even with a
weak QL.
16Reviewing the Situation
- SQLs lack of expressive power is a major problem
for database-centric applications. - These problems are significantly more serious for
data streams since - Only monotonic queries can be used,
- Actually, not all the monotonic queries since SQL
is not nb-complete, - These problems cannot be solved by using PLs with
embedded SQL statements on streams - Effectiveness of DSMS will be impaired--unless
significant improvements can be made.
17Outline
- Design Objectives for Data Stream Management
System (DSMS) - Language Issues
- The Expressive Stream Language ESL
- The Stream Mill System
- Conclusion and discussion
18ESL the Stream Mill System
- We want to support efficiently
- Continuous queries ad-hoc queries
- Data stream mining complex applications
- Sequence queries (SEQIN, SQL-TS, )
- Streaming XML data
- These require a quantum leap in language power
- ESL (Expressive Stream Language) tries to do that
with minimal extensions, by using - SQL2003 constructsConcrete views, Table
Functions, OLAP Functionsand introducing - A new extensibility mechanism for SQL.
19Extensions by Procedural Code
- Embedding PL functions in SQL
- Scalar User-Defined Functions more useful with
DBMS blobs than with DSMS tuples, - Aggregate User-Defined Functions good for both
DBMS and DSMS - Almost in the SQL standards in Informix,
Oracle, Aurora - A stream-oriented computational model, based on
three states - e.g. for AVG
- Initialize the count to one and the sum to the
input value - Iterate increment both count and sum
- Terminate return the ration of sum/count
20UDAs in ESL
- In ESL user-defined Aggregates (UDAs) can be
defined directly in SQL, rather than in a PL - Native extensibility in SQL
- No impedance mismatch
- Access to DB tables from UDAs
- Data independence and optimization
- Good ease of use and performance
- Ultimate level of expressive power
- Turing completeness powerful than any QL on DB
tables - nb-completeness powerful than any CQL on data
streams
21The UDA AVG in SQL
- AGGREGATE avg(Next Int) Real
- TABLE state(tsum Int, cnt Int)
- INITIALIZE
- INSERT INTO state VALUES (Next, 1)
-
- ITERATE
- UPDATE state
- SET tsumtsumNext, cntcnt1
-
- TERMINATE
- INSERT INTO RETURN
- SELECT tsum/cnt FROM state
-
-
- INSERT INTO RETURN in TERMINATE ? a blocking
UDA
22NonBlocking UDA AVG of last 200 Values
- AGGREGATE myavg(Next Int) Real
- TABLE state(tsum Int, cnt Int)
- INITIALIZE
- INSERT INTO state VALUES (Next, 1)
-
- ITERATE
- UPDATE state SET tsumtsumNext, cntcnt1
- INSERT INTO RETURN
- SELECT tsum/cnt FROM state
- WHERE cnt 200 0
- UPDATE state SET tsumNext, cnt1
- WHERE cnt 200 1
-
- TERMINATE
-
- Empty TERMINATE denotes a non-blocking UDA
- This is called a tumbling window---various kinds
of windows used with data streams
23Data Intensive Applications UDAs
- On DBs, complex applications concisely, with
good performance (ATLAS). - Data Mining
- Classifiers 18 lines of codes, 14 overhead
- APriori 40 lines of codes, 32 overhead
24UDAs in ESL
- UDAs extended to support
- SQL2003 OLAP functions aggregates on windows
- logical physical windows,
- slides, and tumbles
- Finding equential patterns in time-series (SEQ,
SEQIN, SQL-TS, SQL/LPP, ..) - New application of non-window UDAs
- Data Stream Mining,
- Approximate aggregates,
- sketches, histograms, etc.
25SQL2003 OLAP FunctionsAggregates on Windows
CREATE STREAM ClosedAuction (/auction closings
/ itemID, /id of
the item in this auction./ buyerID
/buyer of this item./) Final price real
/final price of the item /,
Current_time) order by source
Auctions
- For each seller, show the average selling price
over the last 10 items sold (physical window)
CREATE STREAM LastTenAvg SELECT
sellerID, AVG(price) OVER(PARTITION BY sellerID
ROWS 9 PRECEDING), Current_time FROM
ClosedPrice
26Slides and Tumbles
- Every two minutes, show the average selling
price over the last 10 minutes (logical window)
CREATE STREAM LastTenAvg SELECT
sellerID, AVG(price) OVER(RANGE 10 MINUTE
PRECEDING
SLIDE 2 MINUTE), Current_time FROM ClosedPrice
Here the window is W10 and the slide is S2.
Tumble When S W
27Window UDAs vs. Base UDAs
- Aggregates with windows are very useful can we
support them on arbitrary UDAs or just on
built-in aggregates (Oracle)? - Clear semantics and optimization rules
needed--integrate - UDAsSQL or PL-defined, algebraic or not
- window (logical physical), slice, tumbles,
etc. - System role and user role in optimization
- Must integrate the two kinds of UDAs
- Base UDAs
- called as traditional SQL-2 aggregates, with
- optional GROUP BY
- Window UDAs
- called with SQL2003 OVER clause
- logical or physical windows
- optional PARTITION BY and SLIDE clauses
28Window UDAs Physical Optimization
- The Stream Mill System provides efficient support
for - Management of new expiring tuples in buffer
- Main memory intelligent paging into disk
- Window sharing
- Users can access the buffer as the table called
inwindow - Events caused by tuple expiration
29Logical Optimization for window UDAs
- The Base UDA can be used for UNLIMITED
PRECEDING Tumbles (S W) - For the other cases, the base UDA is used as a
default - but this is not efficient! We should instead use
a delta computation based on the expiring tuples - ESL allows users to specify this delta
computation UDAs.
WINDOW AGGREGATE avg(Next Real) Real
TABLE inwindow(wvalue Real) / by the sytem/
INITIALIZE ITERATE
INSERT INTO RETURN
SELECT avg(wvalue) FROM inwindow
30Window UDA with Expire
- For each expired tuple decrease the count by one
and the sum by the expired valueworks for
logical physical windows
- WINDOW AGGREGATE avg(Next Real) Real TABLE
state(tsum Int, cnt Real) TABLE
inwindow(wnext Real) - INITIALIZE INSERT INTO state
VALUES (Next, 1) - ITERATE UPDATE state SET tsumtsumNext,
cntcnt1 INSERT INTO RETURN SELECT
tsum/cnt FROM state - EXPIRE /if there are expired tuples, take the
oldest / UPDATE state SET cnt
cnt-1, tsum tsum oldest(wnext)
31MAX
- System maintains inwindow
- Remove dominated (less older) values
- The oldest is always the max.
WINDOW AGGREGATE max (Next Real) Real TABLE
inwindow(wnext real) INITIALIZE
/system adds new tuples to inwindow/
ITERATE DELETE FROM inwindow WHERE Next
wnext INSERT INTO RETURN
VALUE (oldest(wnext))
EXPIRE /expired tuples
removed automatically/
32Searching for Pattern in Sequences
- CREATE STREAM Sessions(SessNo, ClickTime, PageNo,
PageType) ORDER BY ClickTime
SELECT Y.PageNo, Z.ClickTimeFROM Sessions AS
(X, Y, Z) PARTITION BY SessNO WHERE
X.PageTypebannerAND Y.PageTypeproductAND
Z.PageTypeorder
Ad page then adescription page, then a
fill-the-orderfor purchase page
SELECT SessNo, count(A)FROM Sessions AS (A,
B)PARTITION BY SessNO WHERE A.PageType ltgt
product AND B.PageType descriptionAND
count(A) lt 20
Less than 20 clicks before a product description
page
Much easier to support with UDA than self joins
33State-Based Reasoning
- CREATE STREAM Sessions(SessNo, ClickTime, PageNo,
PageType) ORDER BY ClickTime - A sequence of three clicks (1) ad banner page,
(2) the product description page , and then (3)
the order. - AGGREGATE pattern(Next Char) Char
- TABLE state(sno Int)
- INITIALIZE INSERT INTO state VALUES(0)
- UPDATE state SET sno 1 WHERE
Nextbanner' - ITERATE UPDATE state SET sno 0
- WHERE NOT(sno 1 AND Next banner')
- AND NOT(sno 2 AND Next product')
- AND Next ltgt 'order
- UPDATE state SET sno 1
- WHERE Nextbanner'
- UPDATE state SET sno sno1
- WHERE (sno 1 AND Next product')
- OR (sno 2 AND Next order')
- INSERT INTO RETURN
- SELECT 'pattern123' FROM state
- WHERE sno 3
34More on ESL
- Reliance on SQL2003 constructs, e.g., table
functions and concrete views - Full support for UDAs with all window
combinationseffective on UDAs written in SQL,
PLs, and even built-ins - Support for continuous queries and ad hoc
queries, under a simple and unified semantics - Turing completeness ESL can express all possible
queries - nb-completenessESL all possible nonblocking
queries by using its nonblocking operators
(window UDAs base UDA swithout TERMINATE) - Effective on a broad range of data-intensive
applications data/stream mining, approximate
queries, sequential patterns (streaming XML data
not supported) - ESL is making a strong case for the DB-oriented
approach to data streams.
35Outline
- Design Objectives for Data Stream Management
System (DSMS) - Language Issues
- The Expressive Stream Language ESL
- The Stream Mill System
- Conclusion and discussion
36Stream Mill Architecture
Client
Streamed Results
Server
Input streams
DSMS
Memory Mgr
DB Mgr ATLAS
37Stream Mill Architecture
- Client A Java-Based GUI to Create, Upload
Launch queries---and display results - Server
- Query Manager
- Compiles queries into dynamic libraries
- Builds Execution graph
- Execution Manager
- Monitors and performance optimize execution
- Partitions graph into components, and
- Prioritizes components for QOS
- Window and Buffer Managers
- User manager
38Execution Graph Components
Out
- Query Operators SELECT FROM statement
selection, projection, joins-with-DB-tables,UDAs - FIFO processing of tuples, modified for
selectivity gt1 - Prioritized round robin of input buffers
Out
Q1
Q3
Q2
Q4
Stream2
Stream3
Stream1
39Optimization Issues
- Optimize Response time--maximize production
- Optimize Memory--maximize consumption
- Avoid starvation---fairness issues.
- Fast adaptation based on
- Continuous monitoring,
- Re-partitioning of execution graph into
components, - Dynamic re-assignment of priorities.
- Dynamic Execution Tables that support the above,
and addition/deletion of queries with low
overhead.
40Conclusion
- Language Technology
- ESL a very powerful language for data stream and
DB applications - Simple semantics and unified syntax conforming
to SQL2003 standards - Strong case for the DB-oriented approach to data
streams - System Technology
- Some performance-oriented techniques
well-developede.g., buffer management for
windows - For others work is still in progressstay tuned
for latest news - Stream Mill is up and running http//wis.cs.ucla.
edu/stream-mill
41 42Relational Algebra (RA)
- Set difference can produce monotonic queries
Intersection R1 Ç R2 R1 - (R1 - R2) - Are these still expressible without set diff?
- Intersection can be expressed as a joins
productselect - But interval coalescing and Until queries are
monotonic queries that can be expressed in RA but
not in nb-RA. - Example Temporal domain isomorfic to nonnegative
integers.Intervals closed to the left but open to
the right - p(0, 3). 0,1, and 2 are in p but 3 is not
- p(2, 4). 3 is not a hole because is
covered by this - p(4, 5). 5 is a hole because not covered
by any other interval - p(6, 8).
43Coalesce p (cp) p Until q
- p(0, 3). p(2, 4). p(4, 5). p(6, 8).
- cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8).
cp(0, 4). cp(2, 5). cp(0,5). - cp contains intervals from the start point of
any p interval to the endpoint of any p interval
unless the endpoint of an interval in between is
a hole. - cp(I1, J2) p(I1, J1), p(I2, J2), J1 lt J2,
Øhole(I1, J2). - hole(I1, J2) p(I1, J1), p(I2, J2), p(_,K), J1
K, K lt I2, Øcep(K). - cep(K) p(_, K), p(I, J), I K, K lt J.
- q(5,_) holds if cp has an interval that starts at
0 contains 5pUntilq(yes) q(0, J). - pUntilq(yes) cp(0, I), q(J, _), I ³ J .
44References
- 1ATLaS user manual. http//wis.cs.ucla.edu/atlas
. - 2SQL/LPP A Time Series Extension of SQL Based
on Limited Patience Patterns, volume 1677 of
Lecture Notes in Computer Science. Springer,
1999. - 4A. Arasu, S. Babu, and J. Widom. An abstract
semantics and concrete language for continuous
queries over streams and relations. Technical
report, Stanford University, 2002. - 5B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom. Models and issues in data stream
systems. In PODS, 2002. - 9D. Carney, U. Cetintemel, M. Cherniack, C.
Convey, S. Lee, G. Seidman, M. Stonebraker, N.
Tatbul, and S. Zdonik. Monitoring streams - a new
class of data management applications. In VLDB,
Hong Kong, China, 2002. - 10J. Celko. SQL for Smarties, chapter Advanced
SQL Programming. Morgan Kaufmann, 1995. - 11S. Chandrasekaran and M. Franklin. Streaming
queries over streaming data. In VLDB, 2002. - 12J. Chen, D. J. DeWitt, F. Tian, and Y. Wang.
NiagaraCQ A scalable continuous query system for
internet databases. In SIGMOD, pages 379-390, May
2000. - 13C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk,
and O. Spatscheck. Gigascope A stream database
for network applications. In SIGMOD Conference,
pages 647-651. ACM Press, 2003. - 14Lukasz Golab and M. Tamer Özsu. Issues in
data stream management. ACM SIGMOD Record,
32(2)5-14, 2003. - 15J. M. Hellerstein, P. J. Haas, and H. J.
Wang. Online aggregation. In SIGMOD, 1997. - 17 Chang R. Luo, Haixun Wang, and Carlo
Zaniolo. Designing an expressive query language
for data streams. Submitted for publication - 18Chang R. Luo, Haixun Wang, and Carlo Zaniolo.
Efficient support for window aggregates on data
streams. In preparation
45References (Cont.)
- 18 Yan-Nei Law, Haixun Wang, Carlo Zaniolo
Query Languages and Data Models for Database
Sequences and Data Streams. VLDB 2004 492-503 - 19 Sam Madden, Mehul A. Shah, Joseph M.
Hellerstein, and Vijayshankar Raman. Continuously
adaptive continuous queries over streams. In
SIGMOD, pages 49-61, 2002. - 20R. Motwani, J. Widom, A. Arasu, B. Babcock,
M. Datar S. Babu, G. Manku, C. Olston, J.
Rosenstein, and R. Varma. Query processing,
approximation, and resource management in a data
stream management system. In First CIDR 2003
Conference, Asilomar, CA, 2003. - 21R. Ramakrishnan, D. Donjerkovic, A.
Ranganathan, K. Beyer, and M. Krishnaprasad.
SRQL Sorted relational query language, 1998. - 23Reza Sadri, Carlo Zaniolo, and Amir M.
Zarkesh andJafar Adibi. A sequential pattern
query language for supporting instant data
minining for e-services. In VLDB, pages 653-656,
2001. - 24Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and
Jafar Adibi. Optimization of sequence queries in
database systems. In PODS, Santa Barbara, CA, May
2001. - 25P. Seshadri. Predator A resource for
database research. SIGMOD Record, 27(1)16-20,
1998. - 26P. Seshadri, M. Livny, and R. Ramakrishnan.
SEQ A model for sequence databases. In ICDE,
pages 232-239, Taipei, Taiwan, March 1995. - 27Praveen Seshadri, Miron Livny, and Raghu
Ramakrishnan. Sequence query processing. In ACM
SIGMOD 1994, pages 430-441. ACM Press, 1994. - 28M. Sullivan. Tribeca A stream database
manager for network traffic analysis. In VLDB,
1996. - 29D. Terry, D. Goldberg, D. Nichols, and B.
Oki. Continuous queries over append-only
databases. In SIGMOD, pages 321-330, 6 1992. - 30Peter A. Tucker, David Maier, Tim Sheard, and
Leonidas Fegaras. Exploiting punctuation
semantics in continuous data streams. IEEE Trans.
Knowl. Data Eng, 15(3)555-568, 2003. - 31Haixun Wang and Carlo Zaniolo. ATLaS a
native extension of SQL for data minining. In
Proceedings of Third SIAM Int. Conference on Data
MIning, pages 130-141, 2003.