... over real-time streaming financial data such as stoc presentation

1
Models and Issues in Data Stream Systems

EVREN AYORAK
2001700537

2
Abstract

The need for and research issues arising from a
new model of data processing.
Review past work relevant to data stream systems
and current projects in that area.
Explore topics in stream query languages, new
requirements and challenges in query processing,
and algorithmic issues.

3
Outline

The Data Stream Model
Review of Data Stream Projects
Queries of Data Streams
Stanford's Proposal for DSMS
Algorithmic Issues

4
The Data Stream Model

Data streams differ from the conventional stored
relation model
Data elements in the stream arrive online
System has no control over the order in which data
elements arrive to be processed
Data streams are potentially unbounded in size
Once an element from a data stream has been
processed, it is discarded or archived. It cannot
be retrieved easily unless it is stored in
memory, which is small relative to the size of
data streams
Operating in data stream model does not preclude
use of data in conventional stored relations.

5
Queries

One-time queries and Continuous queries
One-time queries
Evaluated once over a point-in-time snapshot of
data set
Continuous queries
Evaluated continuously as data streams continue
to arrive
Answers may be stored and updated as new data
arrives, or may be produced as data streams themselves

6
Queries

Predefined and Ad hoc queries
Predefined
Supplied to data stream management system before
any relevant data has arrived
Usually continuous queries
Scheduled one-time queries possible
Ad hoc
Can be either one-time or continuous queries
Complicate design of data stream management
system (DSMS): they are not known in advance for
purposes of query optimization, and answering
them correctly may require referencing data that
has already arrived on data streams and may
already have been discarded

7
Motivating Examples

Web-based financial search engine that evaluates
queries over real-time streaming financial data
such as stock tickers and news feeds.
Modern security applications.
Provide an integrated security platform offering
services such as firewall support and intrusion
detection over multi-gigabit network packet
streams.
Need to perform complex stream processing,
including URL filtering based on table lookups
and correlation across multiple network traffic
flows.
Large web sites monitor web logs online to enable
applications such as personalization, performance
monitoring, and load balancing (e.g., Yahoo).
Sensor monitoring
Network traffic management

8
Review of Data Stream Projects

Tapestry System
Continuous queries used for content-based
filtering over an append-only database of email
and bulletin board messages
Restricted subset of SQL used as query language
in order to provide guarantees about efficient
evaluation and append-only results

9
Review of Data Stream Projects

Alert system
Mechanism for implementing event-condition-action
style triggers in conventional SQL database
Used continuous queries defined over special
append-only active tables
XFilter content-based filtering system
Efficient filtering of XML documents based on
user profiles as continuous queries in XPath
language

10
Review of Data Stream Projects

Xyleme
Similar to XFilter (content-based filtering
system)
Enables high throughput with restricted query
language
Tribeca stream database manager
Restricted querying capability over network
packet streams
Tangram stream query processing system
Used stream processing techniques to analyze
large quantities of stored data

11
Review of Data Stream Projects

OpenCQ
Support continuous queries for monitoring
persistent data sets spread over wide-area
network.
Uses query processing algorithm based on
incremental view maintenance.
NiagaraCQ
Support continuous queries for monitoring
persistent data sets spread over wide-area
network.
Addresses scalability in number of queries by
proposing techniques for grouping continuous
queries for efficient evaluation.
Problem of supporting blocking operators in query
plans over data streams discussed
Viglas and Naughton proposed rate-based
optimization for queries over data streams (based
on stream-arrival and data-processing rates)

12
Review of Data Stream Projects

Chronicle data model
Append-only ordered sequences of tuples
(chronicles), a form of data streams
Defined restricted view definition language and
algebra (chronicle algebra) that operates over
chronicles together with traditional relations.
Focus was to ensure that views defined in
chronicle algebra could be maintained
incrementally without storing any of the
chronicles.
Seshadri, Livny, and Ramakrishnan proposed an
algebra and a declarative query language for
querying ordered relations
Related work in this area includes work on
temporal and time-series databases, where the
ordering of tuples that can be implied by time
can be used in querying, indexing, and query
optimization.

13
Review of Data Stream Projects

Materialized views relate to continuous queries
Materialized views are really queries that need
to be reevaluated or incrementally updated
whenever the base data changes
Important work in this area
Self-maintenance: Ensuring that enough data has
been saved to maintain a view even when the base
data is unavailable
Data expiration: Determining when certain base
data can be discarded without compromising the
ability to maintain a view
Differences: continuous queries may
Stream results rather than store them
Deal with append-only input data
Provide approximate rather than exact answers
Processing strategy may adapt as characteristics
of data streams change

14
Review of Data Stream Projects

Telegraph project
Uses adaptive query engine to process queries
efficiently in volatile and unpredictable
environments.
Query execution strategies over data streams
generated by sensors
Processing techniques for multiple continuous
queries.
Tukwila system
Supports adaptive query processing, in order to
perform dynamic data integration over autonomous
data sources

15
Review of Data Stream Projects

Aurora Project
New data processing system targeted towards
stream monitoring applications
Consists of large network of triggers
Each trigger is data-flow graph with each node
being one among seven built-in operators
For each stream monitoring application using
system, an application administrator creates and
adds one or more triggers into trigger network
Performs compile-time optimization and run-time
optimization of trigger network
Detects resource overload and performs load
shedding based on application-specific measures
of QoS

16
Queries of Data Streams

Unbounded Memory Requirements
Approximate Query Answering
Sliding Windows
Batch Processing, Sampling, and Synopses
Blocking Operators
Queries Referencing Past Data

17
Unbounded Memory Requirements

Since data streams are potentially unbounded in
size, amount of storage required to compute exact
answer to the query may grow without bound
External memory algorithms for handling data sets
larger than main memory are not well suited.
They do not support continuous queries
They are too slow for real-time response
With new data constantly arriving even as old
data is being processed, amount of computation
time per data element must be low
Interested in algorithms that are able to confine
themselves to main memory without accessing disk

18
Approximate Query Answering

Since we are limited to a bounded amount of memory,
it may not be possible to produce exact answers
High-quality approximate answers can be an
acceptable solution
Techniques for data reduction and synopsis
construction
Sketches
Random sampling
Histograms
Wavelets

19
Sliding Windows

Evaluate query over sliding window of recent data
from streams
Attractive Properties
Well-defined and understood
Deterministic so there is no danger that bad
random choices will produce bad approximation
Emphasizes recent data, which in many real-world
applications is more important than old data

[Figure: stream timeline showing past data, a window over recent data, and future data]
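
A minimal sketch of a count-based sliding window, assuming the query is an average over the last `size` elements. The class name `SlidingWindowAverage` and its methods are illustrative and do not come from the presentation.

```python
from collections import deque


class SlidingWindowAverage:
    """Average over the most recent `size` stream elements (hypothetical helper)."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, value):
        """Add a new element and evict the oldest one once the window is full."""
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()

    def answer(self):
        """Return the average of the current window, or None if it is empty."""
        if not self.window:
            return None
        return self.total / len(self.window)


w = SlidingWindowAverage(size=3)
for x in [10, 20, 30, 40]:
    w.update(x)
# The window now holds only the three most recent elements: 20, 30, 40.
```

The result is deterministic: the same input order always gives the same answer, which is the property the slide contrasts with randomized approximation.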
20
Sliding Windows

Research Issues
How do we define timestamps over streams to
facilitate use of windows?
How do we implement sliding window queries
efficiently?
What is their impact on query optimization?
If window is too big to fit in main memory, how
can we give approximate answers using only
available memory?

21
Sliding Windows

Differences between sequence and temporal DBs and
the stream computation model
Temporal DB
Concerned with full history of each data value
over time
Stream system concerned with processing new data
elements on-the-fly
Sequence DB
Attempt to produce query plans that allow for
stream access.
A single scan of input data is sufficient to
evaluate plan and amount of memory required for
plan evaluation is constant, independent of data.
Assumes that DB system has control over which
sequence to process tuples from next (e.g.,
merging multiple sequences, which cannot be
assumed in stream system)

22
Batch Processing, Sampling, and Synopses

Don't process every data element as it arrives
Resort to a sampling or batch processing technique
to speed up query execution
Framework
Query answered using data structures that can be
maintained incrementally
Data structure supports two operations
update(tuple) updates data structure as each new
data element arrives
computeAnswer() produces new or updated results
to query
Best case scenario is that both operations are
fast relative to arrival rate of elements in data
streams, so no special techniques are needed
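
The two-operation framework above can be sketched as follows, assuming a simple AVG query whose state is a count and a sum. The class name `IncrementalAverage` and the Python-style name `compute_answer` (for computeAnswer()) are illustrative.

```python
class IncrementalAverage:
    """Maintains AVG(value) incrementally; names are illustrative."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, tup):
        """Fold one arriving tuple (key, value) into the running state."""
        _key, value = tup
        self.count += 1
        self.total += value

    def compute_answer(self):
        """Produce the current answer from the maintained state in O(1)."""
        return self.total / self.count if self.count else None


query = IncrementalAverage()
answers = []
for tup in [("a", 4), ("b", 8), ("c", 6)]:
    query.update(tup)
    answers.append(query.compute_answer())
```

Here both operations take constant time, which is the best case described above.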

23
Batch Processing

update(tuple) is fast but computeAnswer() is slow
Data elements buffered as they arrive
Answer to query is computed periodically as time
permits
Does not cause any uncertainty about accuracy of
answer, sacrificing timeliness instead.
Good when data streams are bursty
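
A minimal sketch of batch processing, assuming an exact median as the slow query and a recomputation every `batch_size` arrivals (a stand-in for "as time permits"). The class `BatchedMedian` is hypothetical, and for simplicity it keeps all data in memory, which a real stream system could not do.

```python
class BatchedMedian:
    """Buffers arrivals cheaply and recomputes an exact median per batch."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.data = []          # everything folded into the answer so far
        self.buffer = []        # arrivals since the last recomputation
        self.last_answer = None

    def update(self, value):
        """Fast path: just buffer the element."""
        self.buffer.append(value)
        if len(self.buffer) >= self.batch_size:
            self.compute_answer()

    def compute_answer(self):
        """Slow path: merge the buffer and recompute the exact median."""
        self.data.extend(self.buffer)
        self.buffer.clear()
        self.data.sort()
        n = len(self.data)
        if n == 0:
            return None
        mid = n // 2
        if n % 2:
            self.last_answer = self.data[mid]
        else:
            self.last_answer = (self.data[mid - 1] + self.data[mid]) / 2
        return self.last_answer


bm = BatchedMedian(batch_size=2)
history = []
for x in [5, 1, 9, 3, 7]:
    bm.update(x)
    history.append(bm.last_answer)
```

Each reported answer is exact for the data it covers but may lag behind the newest arrivals, which is the timeliness trade-off noted above.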

24
Sampling

computeAnswer() fast, but update(tuple) slow
Some tuples skipped altogether so query is
evaluated over sample of data stream rather than
over entire data stream.
Give confidence bounds on degree of error
introduced by sampling process
In many situations, including most queries
involving joins, sampling cannot give reliable
approximation guarantees
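
A minimal sketch of sampling for a SUM query, assuming each element is kept independently with probability `rate` and the sampled total is scaled up by `1 / rate`. The function name `sampled_sum` and the seeding scheme are illustrative; confidence bounds are not computed here.

```python
import random


def sampled_sum(stream, rate, seed=0):
    """Estimate SUM(stream) by processing each element with probability `rate`."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)
    total = 0.0
    for value in stream:
        if rng.random() < rate:  # skipped tuples are never examined again
            total += value
    # Scale the sampled total so the estimate is unbiased.
    return total / rate


exact = sampled_sum(range(1, 101), rate=1.0)   # keeps every element
estimate = sampled_sum(range(1, 101), rate=0.5)
```

With `rate=1.0` nothing is skipped and the result is the exact sum; lower rates reduce work per element at the cost of error.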

25
Synopsis Data Structures

computeAnswer() fast, and update(tuple) fast
Used for queries where no exact data structure
with desired properties exists
Approximate data structure that maintains small
synopsis or sketch of data rather than exact
representation, so computation per data element
is low.

26
Blocking Operators

Query operator that is unable to produce the
first tuple of its output until it has seen its
entire input. (e.g., sorting, aggregation
operators like SUM)
Since streams may be infinite, a blocking
operator using a stream as one of its inputs will
never see entire input and will never produce
output
Operators that are root of tree of query
operators are more tractable than operators that
are interior nodes in tree, producing results
that feed to other operators.
Aggregation operator at root produces a single
value or small number of values and updates to
answer can be streamed out as they are produced
When answer is larger like in a sort, it is more
practical to maintain a data structure with
up-to-date answer rather than retransmitting an
entire answer
Results produced by blocking operators may
continue to change over time, so operators
consuming those results cannot make reliable
decisions based on results at intermediate stage
of query execution
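
As a small illustration of an aggregation at the root of a query tree, the sketch below streams out an updated SUM after every arrival instead of waiting for the end of the input. The generator name `streaming_sum` is hypothetical.

```python
def streaming_sum(stream):
    """Yield the running SUM after each element instead of blocking."""
    total = 0
    for value in stream:
        total += value
        yield total


updates = list(streaming_sum([3, 4, 5]))
```

Every emitted value is provisional, which is why a downstream operator cannot treat an intermediate result as final.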

27
Blocking Operators

We can handle operators as interior nodes in
query tree by replacing them with non-blocking
analogs
juggle operator is a non-blocking version of
sort. It aims to locally reorder a data stream so
that tuples that come earlier in desired sort
order are produced before tuples that come later
in sort order, although some tuples may be
delivered out of order

28
Blocking Operators

Tucker et al. suggested augmenting data streams
with assertions about what can and cannot appear
in remainder of data stream
Assertions (punctuations) interleaved with data
elements in stream
Example with assertion: for all future tuples,
daynumber ≥ 10
Aggregation operator that was grouping by
daynumber could stream out its answers for all
daynumbers < 10
Join operator could discard all its saved state
relating to previously-seen tuples in joining
stream with daynumber < 10

[Figure: data stream split by the assertion daynumber ≥ 10 into tuples with daynumber < 10 and tuples with daynumber ≥ 10]
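
A minimal sketch of how an aggregation could use such assertions, assuming the stream interleaves `("tuple", daynumber, amount)` items with `("punct", bound)` items that promise all later tuples have daynumber at least `bound`. This encoding and the function name are illustrative, not the notation of Tucker et al.

```python
def grouped_sum_with_punctuations(stream):
    """Yield (daynumber, total) as soon as a punctuation closes the group."""
    open_groups = {}
    for kind, *payload in stream:
        if kind == "tuple":
            daynumber, amount = payload
            open_groups[daynumber] = open_groups.get(daynumber, 0) + amount
        elif kind == "punct":
            (bound,) = payload  # asserts: all future tuples have daynumber >= bound
            for day in sorted(d for d in open_groups if d < bound):
                # The group can no longer change, so emit it and free its state.
                yield day, open_groups.pop(day)
        else:
            raise ValueError(f"unknown stream item kind: {kind!r}")


stream = [
    ("tuple", 9, 5), ("tuple", 8, 1), ("tuple", 9, 2),
    ("punct", 10),
    ("tuple", 10, 4), ("tuple", 11, 6),
    ("punct", 11),
]
results = list(grouped_sum_with_punctuations(stream))
```

The group for daynumber 11 is never emitted in this run because no assertion has closed it yet.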
29
Queries Referencing Past Data

Ad hoc queries that are issued after some data
has already been discarded may be impossible to
answer accurately
One solution is to only allow ad hoc queries that
reference future data. It may be acceptable in
some applications
Another solution is to maintain summaries of data
streams (synopses or aggregates) that can
approximate answers to future ad hoc queries
Similar to problems in physical DB design such as
selection of indexes and materialized views. In
traditional DB design, we can still get the right
answer at higher cost if no index is present. In
the stream model, if no summary structure is
present, we can't get the answer at all

30
Stanford's Proposal for DSMS

STREAM (Stanford Stream Data Manager)
Query Language for a DSMS
Timestamps in Streams
Query Processing Architecture of a DSMS

31
Query Language for a DSMS

Modified version of SQL
Allowed the FROM clause to refer to streams as
well as relations
Allows an optional window specification to be
provided after a stream that is supplied in a
query's FROM clause
Sliding window requires an ordering of data
stream elements, using implicit timestamp
attached to each data element
Example: Compute average call length, considering
only the ten most recent long-distance calls
placed by each customer

SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customer_id
              ROWS 10 PRECEDING
              WHERE S.type = 'Long Distance']
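
The semantics of this windowed query can be sketched in Python, assuming calls arrive as `(customer_id, call_type, minutes)` tuples in stream order. The function name and tuple layout are assumptions made for illustration; this is not how STREAM evaluates the query.

```python
from collections import defaultdict, deque


def avg_recent_long_distance(calls, rows=10):
    """Average minutes over each customer's `rows` latest long-distance calls."""
    # One bounded window per customer (the PARTITION BY clause).
    windows = defaultdict(lambda: deque(maxlen=rows))
    for customer_id, call_type, minutes in calls:
        if call_type == "Long Distance":  # the WHERE inside the window spec
            windows[customer_id].append(minutes)
    values = [m for window in windows.values() for m in window]
    return sum(values) / len(values) if values else None


calls = [
    (1, "Long Distance", 10),
    (1, "Local", 99),
    (2, "Long Distance", 20),
    (1, "Long Distance", 30),
]
avg_two = avg_recent_long_distance(calls, rows=2)
avg_one = avg_recent_long_distance(calls, rows=1)
```

Using `deque(maxlen=rows)` evicts the oldest call automatically, so memory per customer stays bounded by the window size.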
32
Timestamps in Streams

Timestamps are ambiguous for streams derived from
multiple streams (e.g., join)
Previous example uses implicit timestamps, in
which system adds a special field to each
incoming tuple
Explicit timestamp is data attribute used as a
timestamp.
Used when each tuple corresponds to real-world
event at particular time that is of importance to
meaning of tuple
Drawback is that tuples may not arrive in the same
order as their timestamps: tuples with later
timestamps may come before tuples with earlier
timestamps. This makes it difficult to perform
sliding window computation
But if input stream is almost-sorted, we can
fix it with a little buffering.
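
A minimal sketch of that buffering, assuming no tuple arrives more than `slack` positions later than its place in timestamp order. The function name `reorder_almost_sorted` and the use of a heap are illustrative choices.

```python
import heapq


def reorder_almost_sorted(stream, slack):
    """Yield (timestamp, payload) pairs in timestamp order using a small buffer."""
    heap = []
    for seq, (ts, payload) in enumerate(stream):
        # `seq` breaks ties so payloads are never compared.
        heapq.heappush(heap, (ts, seq, payload))
        if len(heap) > slack:
            ts_out, _seq, payload_out = heapq.heappop(heap)
            yield ts_out, payload_out
    while heap:  # flush what is left when the input ends
        ts_out, _seq, payload_out = heapq.heappop(heap)
        yield ts_out, payload_out


arrivals = [(1, "a"), (3, "c"), (2, "b"), (4, "d"), (6, "f"), (5, "e")]
ordered = list(reorder_almost_sorted(arrivals, slack=1))
```

The buffer never holds more than `slack + 1` tuples, so the cost is a small delay rather than unbounded memory. If the assumption on lateness is violated, some tuples would still come out of order.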

33
Timestamps in Streams

Methods of assigning timestamps to the output of
binary operators
Provide no guarantee about output order of tuples
from a join operator.
Assume that tuples that arrive earlier are likely
to pass through join earlier.
Each tuple that is produced by join op is assigned
an implicit timestamp set to the time at which it
was produced by join op
Flexible in implementation
But impossible to impose defined deterministic
sliding window semantics on results of subqueries

34
Timestamps in Streams

User specifies as part of query what timestamp is
to be assigned to tuples resulting from join of
multiple streams
Order in which streams are listed in FROM clause
of query represents a prioritization of streams
Implementation can be difficult (e.g., if output
is to be sorted by timestamp, join op needs to
buffer output until it can be determined that
future input tuples will not disrupt ordering of
output tuples)

SELECT *
FROM S1 [ROWS 1000 PRECEDING],
     S2 [ROWS 100 PRECEDING]
WHERE S1.A = S2.B
Output tuple will have the same timestamp as the joining tuple from S1
35
Timestamps in Streams

Best-effort

36
Query Processing Architecture

Query execution plans consist of operators
connected by queues
Operators scheduled for execution by central
scheduler
During execution operator reads data from its
input queues, updates synopsis structure and
writes results to output queues
Period of execution of operator determined
dynamically by scheduler and operator returns
control back to scheduler once period expires
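
A toy sketch of this architecture, assuming a fixed per-turn tuple budget in place of a dynamically chosen execution period and omitting synopsis maintenance. The `Operator` class, the `schedule` function, and the round-robin policy are all illustrative assumptions.

```python
from collections import deque


class Operator:
    """Reads from an input queue, applies `fn`, writes to an output queue."""

    def __init__(self, fn, inbox, outbox):
        self.fn = fn
        self.inbox = inbox
        self.outbox = outbox

    def run(self, budget):
        """Process at most `budget` tuples, then return control to the scheduler."""
        done = 0
        while self.inbox and done < budget:
            result = self.fn(self.inbox.popleft())
            if result is not None:  # None means the tuple was filtered out
                self.outbox.append(result)
            done += 1
        return done


def schedule(operators, budget=2):
    """Central scheduler: round-robin until no operator makes progress."""
    progress = True
    while progress:
        progress = False
        for op in operators:
            if op.run(budget) > 0:
                progress = True


source, middle, sink = deque([1, 2, 3, 4, 5, 6]), deque(), deque()
keep_even = Operator(lambda x: x if x % 2 == 0 else None, source, middle)
square = Operator(lambda x: x * x, middle, sink)
schedule([keep_even, square])
```

Because operators only communicate through queues, the scheduler is free to change the order or the budget without touching operator code.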

37
Query Processing Architecture

To handle fluctuations in data stream
characteristics, operators are adaptive
(primarily with respect to memory)
Trading accuracy for memory
Operator maximizes accuracy of output based on
size of available memory
Handles dynamic changes in size of its available
memory
Example: For a sliding window join, the larger
the window, the better the approximation

38
Query Processing Architecture

Issues in Memory Management
How do different query ops produce approximate
answers under limited memory?
How do approximate results behave when operators
are composed in query plans?
How can the DSMS allocate memory to operators to
maximize accuracy of answer?
How can DSMS reallocate memory among operators
under changing conditions?
How does the query optimizer find the query plan
that, with the best memory allocation, minimizes
approximation? Should plans be modified when
conditions change?
Since synopses can be shared among query plans,
how do we optimally consider a set of queries,
which may be weighted by importance?

39
Query Processing Architecture

Issues in Scheduling
Scheduler needs to provide rate synchronization
within operators and pipelined operators in query
plans
Time-varying arrival rates of data streams and
time-varying output rates of operators complicate
matters
Need to take into account
Memory allocation across operators
Management of buffers for incoming streams
Availability of synopses on disk (instead of
memory)
Performance requirements of individual queries

40
Algorithmic Issues

Random Samples
Sketching Techniques
Histograms
Sliding Windows
Negative Results
Miscellaneous algorithms

41
Random Samples

Used as summary structure in many scenarios where
small sample is expected to capture essential
characteristics of data set
Easiest form of summarization
Other synopses can be built from sample itself
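
One standard way to keep a uniform random sample of fixed size over a stream of unknown length is reservoir sampling. The slide does not name a specific method, so the sketch below is a swapped-in example; the function name and the `seed` parameter are illustrative.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of up to k items in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = item     # replace with probability k / (i + 1)
    return sample


sample = reservoir_sample(range(1000), k=10, seed=1)
small = reservoir_sample(range(5), k=10)
```

Memory use is proportional to `k` regardless of how long the stream runs.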

42
Sketching Techniques

Building summary of data stream using small
amount of memory
Makes it possible to estimate answer to certain
queries (like distance queries) over data set
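Frequency moments of a sequence S: F_k is the sum
over values i of (m_i)^k, where m_i is the number
of occurrences of i in S
F0 is the number of distinct values in S
F1 is the length of S
F2 is the self-join size
F∞ is the most frequent item's multiplicity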
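
To make these definitions concrete, the sketch below computes the moments exactly from a full frequency table. This is not a sketching technique: it uses memory proportional to the number of distinct values, whereas sketches approximate the same quantities in small space. The function name and dictionary keys are illustrative.

```python
from collections import Counter


def frequency_moments(seq):
    """Exact F0, F1, F2 and F-infinity of a sequence."""
    counts = Counter(seq)
    return {
        "F0": len(counts),                               # distinct values
        "F1": sum(counts.values()),                      # sequence length
        "F2": sum(c * c for c in counts.values()),       # self-join size
        "Finf": max(counts.values(), default=0),         # top multiplicity
    }


moments = frequency_moments(["a", "b", "a", "c", "a", "b"])
```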

43
Histograms

V-Optimal Histograms: approximate the distribution
of a set of values by a piecewise-constant function
so as to minimize the sum of squared error.
Equi-Width Histograms: partition the domain into
buckets such that the number of values falling
into each bucket is uniform across all buckets
(a scheme more commonly called equi-depth).
They maintain quantiles for the underlying data
distribution as the bucket boundaries.
End-Biased Histograms: maintain exact counts of
items that occur with frequency above a
threshold, and approximate other counts by a
uniform distribution.
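
A minimal sketch of an end-biased histogram, assuming it is built offline from a full frequency table and that the infrequent values are summarized by their average count. A streaming construction would need more care; the function name and return shape are illustrative.

```python
from collections import Counter


def end_biased_histogram(values, threshold):
    """Return (exact counts above threshold, average count of the rest)."""
    counts = Counter(values)
    frequent = {v: c for v, c in counts.items() if c > threshold}
    rest = [c for c in counts.values() if c <= threshold]
    # All infrequent values share one uniform estimate.
    avg_rest = sum(rest) / len(rest) if rest else 0.0
    return frequent, avg_rest


values = ["x"] * 5 + ["y"] * 4 + ["a"] + ["b"] * 3
frequent, avg_rest = end_biased_histogram(values, threshold=3)
```

The space used depends only on how many values exceed the threshold, not on the number of distinct values overall.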

44
Wavelets

Used to provide a summary representation of data
Wavelet coefficients are projections of the given
signal onto an orthogonal set of basis vectors
Choice of basis vectors determines type of
wavelets
Haar wavelets are used in DB for ease of
computation
The signal reconstructed from the top few wavelet
coefficients best approximates the original signal
in terms of the L2 norm
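
A minimal sketch of the Haar case, assuming an orthonormal normalization (division by the square root of 2 at each level) and a signal whose length is a power of two. The function names are illustrative, and the code works offline on a complete signal rather than on a stream.

```python
import math


def haar_transform(signal):
    """Orthonormal Haar decomposition; len(signal) must be a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    coeffs = [float(x) for x in signal]
    length = n
    while length > 1:
        half = length // 2
        step = coeffs[:length]
        for i in range(half):
            a, b = step[2 * i], step[2 * i + 1]
            coeffs[i] = (a + b) / math.sqrt(2)         # smoothed value
            coeffs[half + i] = (a - b) / math.sqrt(2)  # detail coefficient
        length = half
    return coeffs


def inverse_haar(coeffs):
    """Invert haar_transform."""
    coeffs = list(coeffs)
    length = 1
    while length < len(coeffs):
        step = coeffs[:2 * length]
        for i in range(length):
            s, d = step[i], step[length + i]
            coeffs[2 * i] = (s + d) / math.sqrt(2)
            coeffs[2 * i + 1] = (s - d) / math.sqrt(2)
        length *= 2
    return coeffs


def top_k(coeffs, k):
    """Keep the k largest-magnitude coefficients and zero out the rest."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_transform(signal)
restored = inverse_haar(coeffs)
flat = inverse_haar(top_k(haar_transform([5, 5, 5, 5]), 1))
```

With this normalization the transform preserves the sum of squares, which is why dropping the smallest coefficients minimizes the squared reconstruction error for a given budget.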

45
Sliding Windows

Prevent stale data from influencing analysis and
statistics
Serve as tool for approximation in face of
bounded memory
Open problems
Clustering
Maintaining top wavelet coefficients
Maintaining statistics like variance
Computing correlated aggregates

46
Negative Results

Emerging set of negative results on space-time
requirements of algorithms that operate in stream
model
Henzinger, Raghavan, and Rajagopalan provided
space lower bounds for concrete problems in
stream model, derived from results in
communication complexity
Alon, Matias, and Szegedy provided almost tight
lower bounds for computing the frequency moments.
General lower bound technique for sampling-based
algorithms presented by Bar-Yossef et al.

47
Other Algorithms

Data Mining: Decision trees are another form of
synopsis used for prediction
Multiple Streams: Computing simple functions in
a distributed environment
Reduction of Streams: In list-efficient
algorithms, instead of being presented one data
item at a time, they are implicitly presented
with a list of data items in a succinct form
Property Testing: Programs that make one pass
over data and, using small space, verify whether
the data satisfies a certain property
Measuring Sortedness: Useful in determining the
choice of a sort algorithm for underlying data
48
Conclusion

Some existing techniques can be adapted to the
proposed model
Exact answers to data stream queries are often
not possible with bounded memory
There are a lot of ongoing projects that deal
with streams

49
References

B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom.
Models and Issues in Data Stream Systems. In
Proc. ACM SIGMOD/PODS 2002, June 3-5, 2002,
Madison, Wisconsin.

50
End of Presentation
