1Models and Issues in Data Stream Systems
2Abstract
- The need for, and research issues arising from, a new model of data processing.
- Review past work relevant to data stream systems and current projects in that area.
- Explore topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
3Outline
- The Data Stream Model
- Review of Data Stream Projects
- Queries of Data Streams
- Stanford's Proposal for DSMS
- Algorithmic Issues
4The Data Stream Model
- Data streams differ from the conventional stored relation model:
  - Data elements in the stream arrive online.
  - The system has no control over the order in which data elements arrive to be processed.
  - Data streams are potentially unbounded in size.
  - Once an element from a data stream has been processed, it is discarded or archived. It cannot be retrieved easily unless it is stored in memory, which is small relative to the size of the data streams.
- Operating in the data stream model does not preclude the use of data in conventional stored relations.
5Queries
- One-time queries and continuous queries
- One-time queries
  - Evaluated once over a point-in-time snapshot of the data set
- Continuous queries
  - Evaluated continuously as data streams continue to arrive
  - Answers may be stored and updated as new data arrives, or may produce data streams themselves
6Queries
- Predefined and ad hoc queries
- Predefined
  - Supplied to the data stream management system (DSMS) before any relevant data has arrived
  - Usually continuous queries
  - Scheduled one-time queries are possible
- Ad hoc
  - Can be either one-time or continuous queries
  - Complicate the design of a DSMS: they are not known in advance for purposes of query optimization, and answering them correctly may require referencing data that has already arrived on the data streams and may already have been discarded
7Motivating Examples
- Web-based financial search engine that evaluates queries over real-time streaming financial data such as stock tickers and news feeds.
- Modern security applications.
  - Provide an integrated security platform offering services such as firewall support and intrusion detection over multi-gigabit network packet streams.
  - Need to perform complex stream processing, including URL filtering based on table lookups and correlation across multiple network traffic flows.
- Large web sites (e.g., Yahoo) monitor web logs online to enable applications such as personalization, performance monitoring, and load balancing.
- Sensor monitoring
- Network traffic management
8Review of Data Stream Projects
- Tapestry system
  - Continuous queries used for content-based filtering over an append-only database of email and bulletin board messages
  - A restricted subset of SQL is used as the query language in order to provide guarantees about efficient evaluation and append-only results
9Review of Data Stream Projects
- Alert system
  - Mechanism for implementing event-condition-action style triggers in a conventional SQL database
  - Uses continuous queries defined over special append-only active tables
- XFilter content-based filtering system
  - Efficient filtering of XML documents based on user profiles expressed as continuous queries in the XPath language
10Review of Data Stream Projects
- Xyleme
  - Similar to XFilter (content-based filtering system)
  - Enables high throughput with a restricted query language
- Tribeca stream database manager
  - Restricted querying capability over network packet streams
- Tangram stream query processing system
  - Uses stream processing techniques to analyze large quantities of stored data
11Review of Data Stream Projects
- OpenCQ
  - Supports continuous queries for monitoring persistent data sets spread over a wide-area network.
  - Uses a query processing algorithm based on incremental view maintenance.
- NiagaraCQ
  - Supports continuous queries for monitoring persistent data sets spread over a wide-area network.
  - Addresses scalability in the number of queries by proposing techniques for grouping continuous queries for efficient evaluation.
  - Discusses the problem of supporting blocking operators in query plans over data streams.
  - Viglas and Naughton proposed rate-based optimization for queries over data streams (based on stream-arrival and data-processing rates)
12Review of Data Stream Projects
- Chronicle data model
  - Append-only ordered sequences of tuples (chronicles), a form of data streams
  - Defined a restricted view definition language and algebra (chronicle algebra) that operates over chronicles together with traditional relations.
  - Focus was to ensure that views defined in chronicle algebra could be maintained incrementally without storing any of the chronicles.
- Seshadri, Livny, and Ramakrishnan proposed an algebra and a declarative query language for querying ordered relations
- Related work in this area includes work on temporal and time-series databases, where the ordering of tuples implied by time can be used in querying, indexing, and query optimization.
13Review of Data Stream Projects
- Materialized views relate to continuous queries
  - Materialized views are effectively queries that need to be reevaluated or incrementally updated whenever the base data changes
- Important work in this area
  - Self-maintenance: ensuring that enough data has been saved to maintain a view even when the base data is unavailable
  - Data expiration: determining when certain base data can be discarded without compromising the ability to maintain a view
- Differences: continuous queries may
  - Stream rather than store results
  - Deal with append-only input data
  - Provide approximate rather than exact answers
  - Use a processing strategy that adapts as characteristics of the data streams change
14Review of Data Stream Projects
- Telegraph project
  - Uses an adaptive query engine to process queries efficiently in volatile and unpredictable environments.
  - Query execution strategies over data streams generated by sensors
  - Processing techniques for multiple continuous queries.
- Tukwila system
  - Supports adaptive query processing, in order to perform dynamic data integration over autonomous data sources
15Review of Data Stream Projects
- Aurora project
  - New data processing system targeted towards stream monitoring applications
  - Consists of a large network of triggers
  - Each trigger is a data-flow graph in which each node is one of seven built-in operators
  - For each stream monitoring application using the system, an application administrator creates and adds one or more triggers to the trigger network
  - Performs compile-time and run-time optimization of the trigger network
  - Detects resource overload and performs load shedding based on application-specific measures of quality of service (QoS)
16Queries of Data Streams
- Unbounded Memory Requirements
- Approximate Query Answering
- Sliding Windows
- Batch Processing, Sampling, and Synopses
- Blocking Operators
- Queries Referencing Past Data
17Unbounded Memory Requirements
- Since data streams are potentially unbounded in size, the amount of storage required to compute an exact answer to a query may grow without bound
- External memory algorithms for handling data sets larger than main memory cannot be used:
  - They do not support continuous queries
  - They are too slow for real-time response
- With new data constantly arriving even as old data is being processed, the amount of computation time per data element must be low
- We are interested in algorithms that can confine themselves to main memory without accessing disk
18Approximate Query Answering
- Since we are limited to a bounded amount of memory, it may not be possible to produce exact answers
- High-quality approximate answers can be an acceptable solution
- Techniques for data reduction and synopsis construction
  - Sketches
  - Random sampling
  - Histograms
  - Wavelets
19Sliding Windows
- Evaluate the query over a sliding window of recent data from the streams
- Attractive properties
  - Well-defined and easily understood
  - Deterministic, so there is no danger that bad random choices will produce a bad approximation
  - Emphasizes recent data, which in many real-world applications is more important than old data
[Figure: stream timeline with labels Past Data, Recent Data (Window), and Future Data]
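The window idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the window is count-based (the last `size` elements), and the class name `SlidingWindowAverage` is hypothetical, not part of any system discussed in these slides.

```python
from collections import deque


class SlidingWindowAverage:
    """Average over the most recent `size` stream elements (hypothetical helper)."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, value):
        # Add the new element and evict the oldest one once the window is full.
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()

    def average(self):
        # Average of whatever is currently inside the window.
        return self.total / len(self.window) if self.window else None


w = SlidingWindowAverage(size=3)
for x in [10, 20, 30, 40]:
    w.update(x)
print(w.average())  # the window now holds 20, 30, 40
```

Memory use is bounded by the window size, and the result is deterministic for a given arrival order.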
20Sliding Windows
- Research issues
  - How do we define timestamps over streams to facilitate the use of windows?
  - How do we implement sliding window queries?
  - What is their impact on query optimization?
  - If the window is too big to fit in main memory, how can we give approximate answers using only the available memory?
21Sliding Windows
- Differences between sequence and temporal DBs and the stream computation model
- Temporal DB
  - Concerned with the full history of each data value over time
  - A stream system is concerned with processing new data elements on the fly
- Sequence DB
  - Attempts to produce query plans that allow for stream access: a single scan of the input data is sufficient to evaluate the plan, and the amount of memory required for plan evaluation is constant, independent of the data.
  - Assumes that the DB system has control over which sequence to process tuples from next (e.g., when merging multiple sequences), which cannot be assumed in a stream system
22Batch Processing, Sampling, and Synopses
- Give up on processing every data element as it arrives
- Resort to sampling or batch processing techniques to speed up query execution
- Framework
  - The query is answered using data structures that can be maintained incrementally
  - The data structure supports two operations
    - update(tuple) updates the data structure as each new data element arrives
    - computeAnswer() produces new or updated results for the query
  - The best case is that both operations are fast relative to the arrival rate of elements in the data streams, so no special techniques are needed
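As a rough illustration of the two-operation framework, the sketch below maintains a running mean. The names `RunningMean`, `compute_answer`, and `run_continuous_query` are assumptions made for this example; it shows the best case, where both operations take constant time.

```python
class RunningMean:
    """Incrementally maintained structure with the two operations from the framework."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # O(1) work per arriving element.
        self.count += 1
        self.total += value

    def compute_answer(self):
        # Also O(1), so the answer can be refreshed after every element.
        return self.total / self.count if self.count else None


def run_continuous_query(stream, structure):
    """Yield an updated answer after each element (hypothetical driver loop)."""
    for element in stream:
        structure.update(element)
        yield structure.compute_answer()


answers = list(run_continuous_query([4, 8, 6], RunningMean()))
print(answers)
```

When one of the two operations is slow, the driver loop is where batching (calling `compute_answer` less often) or sampling (skipping some `update` calls) would be introduced.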
23Batch Processing
- update(tuple) is fast but computeAnswer() is slow
- Data elements are buffered as they arrive
- The answer to the query is computed periodically as time permits
- Does not introduce any uncertainty about the accuracy of the answer; it sacrifices timeliness instead.
- Works well when data streams are bursty
24Sampling
- computeAnswer() is fast, but update(tuple) is slow
- Some tuples are skipped altogether, so the query is evaluated over a sample of the data stream rather than over the entire data stream.
- Confidence bounds can sometimes be given on the degree of error introduced by the sampling process
- For many situations, including most queries involving joins, sampling cannot give reliable approximation guarantees
25Synopsis Data Structures
- computeAnswer() is fast, and update(tuple) is fast
- Used for queries where no exact data structure with the desired properties exists
- An approximate data structure maintains a small synopsis or sketch of the data rather than an exact representation, so computation per data element is low.
26Blocking Operators
- A blocking operator is a query operator that is unable to produce the first tuple of its output until it has seen its entire input (e.g., sorting, or aggregation operators like SUM)
- Since streams may be infinite, a blocking operator using a stream as one of its inputs will never see its entire input and will never produce output
- Blocking operators at the root of a tree of query operators are more tractable than those at interior nodes, whose results feed other operators.
  - An aggregation operator at the root produces a single value or a small number of values, and updates to the answer can be streamed out as they are produced
  - When the answer is larger, as in a sort, it is more practical to maintain a data structure with the up-to-date answer than to retransmit the entire answer
- Results produced by blocking operators may continue to change over time, so operators consuming those results cannot make reliable decisions based on results at an intermediate stage of query execution
27Blocking Operators
- We can handle blocking operators at interior nodes of the query tree by replacing them with non-blocking analogs
- The juggle operator is a non-blocking version of sort. It aims to locally reorder a data stream so that tuples that come earlier in the desired sort order are produced before tuples that come later in the sort order, although some tuples may be delivered out of order
28Blocking Operators
- Tucker et al. suggested augmenting data streams with assertions about what can and cannot appear in the remainder of the data stream
- Assertions (punctuations) are interleaved with the data elements in the stream
- Example with the assertion "for all future tuples, daynumber ≥ 10"
  - An aggregation operator grouping by daynumber could stream out its answers for all daynumber < 10
  - A join operator could discard all its saved state relating to previously seen tuples in the joining stream with daynumber < 10
[Figure: stream carrying the assertion daynumber ≥ 10, with regions labeled daynumber ≥ 10 and daynumber < 10]
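The aggregation case in the punctuation example can be sketched as follows. The stream encoding (tagged Python tuples) and the function name `aggregate_with_punctuations` are assumptions for illustration only, not the representation used by Tucker et al.

```python
def aggregate_with_punctuations(stream):
    """Sum `value` per daynumber, emitting a group once a punctuation closes it.

    Stream items are ("tuple", daynumber, value) or ("punct", bound), where a
    punctuation asserts that all future tuples have daynumber >= bound.
    This encoding is an assumption made for the sketch.
    """
    open_groups = {}
    for item in stream:
        if item[0] == "tuple":
            _, day, value = item
            open_groups[day] = open_groups.get(day, 0) + value
        else:
            _, bound = item
            # Groups below the bound can never change again: emit and free them.
            for day in sorted(d for d in open_groups if d < bound):
                yield day, open_groups.pop(day)


stream = [
    ("tuple", 8, 5), ("tuple", 9, 1), ("tuple", 8, 2),
    ("punct", 10),
    ("tuple", 10, 7), ("tuple", 12, 3),
]
results = list(aggregate_with_punctuations(stream))
print(results)
```

Groups for daynumber 10 and 12 stay open because no later punctuation closes them; without punctuations the operator would block indefinitely on an unbounded stream.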
29Queries Referencing Past Data
- Ad hoc queries that are issued after some data has already been discarded may be impossible to answer accurately
- One solution is to allow only ad hoc queries that reference future data, which may be acceptable in some applications
- Another solution is to maintain summaries of data streams (synopses or aggregates) that can approximate answers to future ad hoc queries
- The problem is similar to problems in physical DB design such as the selection of indexes and materialized views. In traditional DB design, however, we can still get the right answer at higher cost if no index is present; in the stream model, if no summary structure is present, we cannot get the answer
30Stanford's Proposal for DSMS
- STREAM (Stanford Stream Data Manager)
- Query Language for a DSMS
- Timestamps in Streams
- Query Processing Architecture of a DSMS
31Query Language for a DSMS
- Modified version of SQL
  - Allows the FROM clause to refer to streams as well as relations
  - Allows an optional window specification to be provided after a stream that appears in a query's FROM clause
- A sliding window requires an ordering of data stream elements, using an implicit timestamp attached to each data element
- Example: compute the average call length, considering only the ten most recent long-distance calls placed by each customer
SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customer_id
              ROWS 10 PRECEDING
              WHERE S.type = 'Long Distance']
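To make the semantics of the example query concrete, here is a hedged Python emulation. The tuple layout `(customer_id, call_type, minutes)` and the function name are assumptions; the sketch evaluates the answer once over a finite prefix of the stream, whereas a DSMS would keep it up to date continuously.

```python
from collections import defaultdict, deque


def avg_recent_long_distance(calls, rows=10):
    """Average minutes over each customer's `rows` most recent long-distance calls.

    `calls` is an iterable of (customer_id, call_type, minutes) in arrival order;
    this layout is an assumption, not the schema of the original Calls stream.
    """
    windows = defaultdict(lambda: deque(maxlen=rows))
    for customer_id, call_type, minutes in calls:
        if call_type == "Long Distance":          # filter applied before windowing
            windows[customer_id].append(minutes)  # per-customer window of `rows` rows
    selected = [m for w in windows.values() for m in w]
    return sum(selected) / len(selected) if selected else None


calls = [
    ("a", "Long Distance", 10),
    ("a", "Local", 99),
    ("a", "Long Distance", 20),
    ("b", "Long Distance", 30),
]
print(avg_recent_long_distance(calls, rows=1))
print(avg_recent_long_distance(calls, rows=10))
```

With `rows=1` only the latest long-distance call per customer counts (20 and 30); with `rows=10` all three long-distance calls are included.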
32Timestamps in Streams
- Timestamps are ambiguous for streams derived from multiple streams (e.g., by a join)
- The previous example uses implicit timestamps, in which the system adds a special field to each incoming tuple
- An explicit timestamp is a data attribute used as a timestamp.
  - Used when each tuple corresponds to a real-world event at a particular time that is important to the meaning of the tuple
  - The drawback is that tuples may not arrive in the same order as their timestamps: tuples with later timestamps may come before tuples with earlier timestamps, which makes sliding window computation difficult
  - If the input stream is almost sorted, however, we can fix the order with a little buffering.
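One way to picture the "little buffering" for an almost-sorted stream is a bounded min-heap. This is a minimal sketch, assuming each tuple is a `(timestamp, payload)` pair and that no tuple arrives more than `slack` positions after its place in sorted order; `reorder` and `slack` are names introduced here.

```python
import heapq


def reorder(stream, slack):
    """Emit (timestamp, payload) pairs in timestamp order using a small buffer.

    Correct whenever no tuple arrives more than `slack` positions later than
    its position in sorted order; `slack` is a tuning parameter assumed here.
    """
    heap = []
    for seq, (ts, payload) in enumerate(stream):
        # seq breaks timestamp ties so payloads are never compared.
        heapq.heappush(heap, (ts, seq, payload))
        if len(heap) > slack:
            ts_out, _, payload_out = heapq.heappop(heap)
            yield ts_out, payload_out
    # Input finished: flush whatever is still buffered, in order.
    while heap:
        ts_out, _, payload_out = heapq.heappop(heap)
        yield ts_out, payload_out


arrivals = [(1, "a"), (3, "c"), (2, "b"), (4, "d")]
ordered = list(reorder(arrivals, slack=2))
print(ordered)
```

A larger `slack` tolerates more disorder at the cost of extra memory and output delay.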
33Timestamps in Streams
- Methods of assigning timestamps to the output of binary operators
- Provide no guarantee about the output order of tuples from a join operator.
  - Assume that tuples that arrive earlier are likely to pass through the join earlier.
  - Each tuple produced by the join operator is assigned an implicit timestamp set to the time at which the join operator produced it
  - Flexible in implementation
  - But makes it impossible to impose precisely defined, deterministic sliding window semantics on the results of subqueries
34Timestamps in Streams
- The user specifies as part of the query what timestamp is to be assigned to tuples resulting from a join of multiple streams
- The order in which streams are listed in the FROM clause of the query represents a prioritization of the streams
- Implementation can be difficult (e.g., if output is to be sorted by timestamp, the join operator needs to buffer output until it can be determined that future input tuples will not disrupt the ordering of output tuples)
SELECT *
FROM S1 [ROWS 1000 PRECEDING],
     S2 [ROWS 100 PRECEDING]
WHERE S1.A = S2.B
Each output tuple takes the timestamp of its S1 tuple
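A simplified sketch of the windowed join above, in which every result carries the timestamp of its S1 tuple. The interleaved arrival encoding and the name `window_join` are assumptions; results are emitted in production order, so sorting them by S1 timestamp would still require the buffering the slide mentions.

```python
from collections import deque


def window_join(arrivals, rows1=1000, rows2=100):
    """Join S1 and S2 on S1.A = S2.B over row-based sliding windows.

    `arrivals` yields ("S1" or "S2", timestamp, value) in arrival order; this
    interleaved encoding is an assumption. Each result is
    (s1_timestamp, joined_value, s2_timestamp).
    """
    w1, w2 = deque(maxlen=rows1), deque(maxlen=rows2)
    for source, ts, value in arrivals:
        if source == "S1":
            w1.append((ts, value))
            for ts2, b in w2:          # probe the current S2 window
                if value == b:
                    yield ts, value, ts2
        else:
            w2.append((ts, value))
            for ts1, a in w1:          # probe the current S1 window
                if a == value:
                    yield ts1, value, ts


arrivals = [
    ("S1", 1, "x"), ("S2", 2, "x"), ("S2", 3, "y"),
    ("S1", 4, "y"), ("S1", 5, "x"),
]
joined = list(window_join(arrivals, rows1=2, rows2=1))
print(joined)
```

The last S1 tuple finds no partner because the matching S2 tuple has already slid out of the one-row S2 window.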
35Timestamps in Streams
36Query Processing Architecture
- Query execution plans consist of operators connected by queues
- Operators are scheduled for execution by a central scheduler
- During execution, an operator reads data from its input queues, updates its synopsis structure, and writes results to its output queues
- The period of execution of an operator is determined dynamically by the scheduler, and the operator returns control to the scheduler once the period expires
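The operators-and-queues architecture can be illustrated with a toy scheduler. Everything here is a simplification chosen for the sketch: the `Operator` class, the fixed `quantum` (the slide describes a dynamically chosen period), the round-robin policy, and the omission of synopsis structures are all assumptions.

```python
from collections import deque


class Operator:
    """A plan operator with one input queue and one output queue (sketch)."""

    def __init__(self, fn, inbox, outbox):
        self.fn = fn          # maps one tuple to a list of output tuples
        self.inbox = inbox
        self.outbox = outbox

    def run(self, quantum):
        # Process at most `quantum` tuples, then hand control back.
        for _ in range(quantum):
            if not self.inbox:
                break
            self.outbox.extend(self.fn(self.inbox.popleft()))


def schedule(operators, quantum=2):
    """Round-robin scheduler: keeps going while any operator has input."""
    while any(op.inbox for op in operators):
        for op in operators:
            op.run(quantum)


source, middle, sink = deque(range(1, 7)), deque(), deque()
plan = [
    Operator(lambda x: [x] if x % 2 == 0 else [], source, middle),  # selection
    Operator(lambda x: [x * 10], middle, sink),                     # projection
]
schedule(plan)
print(list(sink))
```

A real scheduler would also weigh queue lengths, memory, and arrival rates when choosing which operator to run and for how long.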
37Query Processing Architecture
- To handle fluctuations in stream data characteristics, operators are adaptive (primarily with respect to memory)
- Trading accuracy for memory
  - An operator maximizes the accuracy of its output based on the size of its available memory
  - It handles dynamic changes in the size of its available memory
- Example: for a sliding window join, the larger the window, the better the approximation
38Query Processing Architecture
- Issues in memory management
  - How do different query operators produce approximate answers under limited memory?
  - How do approximate results behave when operators are composed in query plans?
  - How can the DSMS allocate memory to operators to maximize the accuracy of the answer?
  - How can the DSMS reallocate memory among operators under changing conditions?
  - Given a query, how does the query optimizer find the plan that, with the best memory allocation, minimizes approximation? Should plans be modified when conditions change?
  - Since synopses can be shared among query plans, how do we optimally handle a set of queries, which may be weighted by importance?
39Query Processing Architecture
- Issues in scheduling
  - The scheduler needs to provide rate synchronization within operators and in pipelined operators in query plans
  - Time-varying arrival rates of data streams and time-varying output rates of operators complicate matters
  - Need to take into account
    - Memory allocation across operators
    - Management of buffers for incoming streams
    - Availability of synopses on disk (instead of in memory)
    - Performance requirements of individual queries
40Algorithmic Issues
- Random Samples
- Sketching Techniques
- Histograms
- Sliding Windows
- Negative Results
- Miscellaneous algorithms
41Random Samples
- Used as a summary structure in many scenarios where a small sample is expected to capture the essential characteristics of the data set
- Easiest form of summarization
- Other synopses can be built from the sample itself
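A standard way to maintain a random sample over a stream of unknown length is reservoir sampling. The slide does not name a specific method, so this is a swapped-in illustration; the function name and the optional `rng` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k elements from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k elements fill the reservoir.
            sample.append(item)
        else:
            # Element i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
print(len(sample))
```

Only `k` elements are stored at any time, and each element is examined once, which matches the one-pass, bounded-memory constraints of the stream model.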
42Sketching Techniques
- Build a summary of the data stream using a small amount of memory
- Makes it possible to estimate the answer to certain queries (like distance queries) over the data set
- Frequency moments of a sequence S in which value i occurs m_i times: F_k is the sum over i of m_i^k
  - F0 is the number of distinct values in S
  - F1 is the length of S
  - F2 is the self-join size
  - F∞ is the multiplicity of the most frequent item
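To pin down what the frequency moments measure, the sketch below computes them exactly from a small sequence. Note the assumption: this is not a sketching algorithm, since it stores a full table of counts; a sketch would approximate the same quantities in far less memory. The function name and dictionary keys are hypothetical.

```python
from collections import Counter


def frequency_moments(stream):
    """Exact F0, F1, F2 and F-infinity; a sketch would only approximate these."""
    counts = Counter(stream)
    multiplicities = list(counts.values())
    return {
        "F0": len(multiplicities),                 # number of distinct values
        "F1": sum(multiplicities),                 # length of the sequence
        "F2": sum(m * m for m in multiplicities),  # self-join size
        "Finf": max(multiplicities, default=0),    # highest multiplicity
    }


moments = frequency_moments(["a", "b", "a", "c", "a", "b"])
print(moments)
```

Here "a" occurs 3 times, "b" twice, and "c" once, so F2 = 9 + 4 + 1.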
43Histograms
- V-Optimal Histograms approximate the distribution of a set of values by a piecewise-constant function so as to minimize the sum of squared error.
- Equi-Depth Histograms partition the domain into buckets such that the number of values falling into each bucket is uniform across all buckets. They maintain quantiles of the underlying data distribution as the bucket boundaries.
- End-Biased Histograms maintain exact counts of items that occur with frequency above a threshold, and approximate the other counts by a uniform distribution.
44Wavelets
- Used to provide a summary representation of data
- Wavelet coefficients are projections of the given signal onto an orthogonal set of basis vectors
- The choice of basis vectors determines the type of wavelets
- Haar wavelets are often used in databases for their ease of computation
- The signal reconstructed from the top few wavelet coefficients best approximates the original signal in terms of the L2 norm
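The Haar case can be sketched directly. This is an offline illustration under stated assumptions: the signal length is a power of two, the orthonormal scaling by 1/sqrt(2) is used so that keeping the largest coefficients minimizes L2 error, and all function names are introduced here. Maintaining top coefficients over a live stream is a harder problem.

```python
import math


def haar_transform(signal):
    """Orthonormal Haar transform of a signal whose length is a power of two."""
    coeffs = [float(x) for x in signal]
    n = len(coeffs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = averages + details
        n = half
    return coeffs


def inverse_haar(coeffs):
    """Invert haar_transform."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        out[:2 * n] = merged
        n *= 2
    return out


def top_k_approximation(signal, k):
    """Zero all but the k largest-magnitude coefficients, then reconstruct."""
    coeffs = haar_transform(signal)
    keep = set(sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))[:k])
    return inverse_haar([c if i in keep else 0.0 for i, c in enumerate(coeffs)])


approx = top_k_approximation([4, 4, 4, 4], k=1)
print(approx)
```

A constant signal is captured by a single coefficient, so one retained coefficient reconstructs it up to floating-point rounding.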
45Sliding Windows
- Prevent stale data from influencing analysis and statistics
- Serve as a tool for approximation in the face of bounded memory
- Open problems
  - Clustering
  - Maintaining top wavelet coefficients
  - Maintaining statistics like variance
  - Computing correlated aggregates
46Negative Results
- Emerging set of negative results on the space-time requirements of algorithms that operate in the stream model
- Henzinger, Raghavan, and Rajagopalan provided space lower bounds for concrete problems in the stream model, derived from results in communication complexity
- Alon, Matias, and Szegedy provided almost tight lower bounds for computing the frequency moments.
- A general lower-bound technique for sampling-based algorithms was presented by Bar-Yossef et al.
47Other Algorithms
- Data mining: decision trees are another form of synopsis used for prediction
- Multiple streams: computing simple functions in a distributed environment
- Reduction of streams: list-efficient algorithms, instead of being presented one data item at a time, are implicitly presented with a list of data items in a succinct form
- Property testing: programs that make one pass over the data and, using small space, verify whether the data satisfies a certain property
- Measuring sortedness: useful in determining the choice of a sort algorithm for the underlying data
48Conclusion
- Some existing techniques can be adapted to the proposed model
- Exact answers to data stream queries are often not possible
- Many ongoing projects deal with streams
49References
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. In Proc. ACM SIGMOD/PODS 2002, June 3-5, 2002, Madison, Wisconsin.
50End of Presentation