Efficient Processing of XML Update Streams - PowerPoint PPT Presentation

Slides: 28

Provided by: leonidas7

Learn more at: https://lambda.uta.edu


1
Efficient Processing of XML Update Streams

Leonidas Fegaras
University of Texas at Arlington

2
Data Stream Processing

What is a data stream?
continuous, time-varying data arriving at
unpredictable rates
continuous updates, long-running queries,
continuous results
Sought characteristics of stream processing
engines
real-time processing
high throughput, low latency, fast mean response
time, low jitter
low memory footprint
Why bother?
many data are already available in stream form
sensor networks, network traffic monitoring,
stock tickers
publisher-subscriber systems
data stream mining for fraud detection
data may be too volatile to index
continuous measurements

3
XML Stream Processing

Why XML?
There is no reason to normalize stream data
Various sources of XML streams
tokenized XML documents
RSS feeds
web service results
Granularity
XML tokens (events): <tag>, </tag>, text, etc.
region-encoded XML elements (e.g., based on
pre-order numbering)
XML fragments (hole-filler model)
Push-based processing: SAX
Pull-based processing: StAX

4
Traditional Stream Processing

Works on streams that consist of numerical values
or relational tuples
Focuses on a sliding window
fixed number of tuples, or
fixed time span
Calculates approximate results
Uses a small (bounded) state
Examples
top-k most frequent values
group-by SQL queries

[Figure: a stream engine with a bounded state reads the input stream through a sliding window between past and future and writes the output stream]

5
Our Goals

Handle continuous XQueries over continuous
streamed XML data
Embedded updates in the streams
Exact rather than approximate answers
Produce continuous results, even when the results
are not complete
Problem: most interesting operations are blocking
and/or require unbounded state
grouping aggregation
predicate evaluation
sorting
sequence concatenation
backward axis steps
We want to address the blocking problem
differently
Display the current result of the blocking
operation continuously in the form of an update
stream
incoming vs. generated updates

6
Our View of XML Update Streams

A continuous (possibly infinite) sequence of XML
tokens with embedded updates
Typically, a finite data stream followed by an
infinite stream of updates
SAX-like events
three basic types of tokens: <tag>, </tag>,
text
the target of an update is a stream subsequence
that contains zero, one, or more complete XML
elements
the source is also a token sequence that contains
complete XML elements
updates are embedded in the data stream and can
come at any time
update events can be interleaved with data events
and with each other
each event must now have an ID to associate it
with an update
updated regions can be updated too
to update a stream subsequence, you wrap it into
a Mutable region
three types of updates:
replace, insertBefore, insertAfter

7
Example

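id  Event
1   <...>
1   <...>
2   startMutable(1)
2   <...>
2   X
2   </...>
2   endMutable(1)
1   </...>
3   startInsertBefore(2)
3   <...>
3   Y
3   </...>
3   endInsertBefore(2)
1   </...>
This is equivalent to a stream in which the element containing Y comes immediately before the element containing X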
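
To make the example concrete, here is a minimal Python sketch that applies embedded updates and rebuilds the equivalent plain token sequence. Several things are assumptions made only for illustration: the tag names a, b and c are hypothetical (the slide's own tag names are not recoverable here), events are modelled as (id, kind, argument) triples, and the argument of startMutable is read as the id of the enclosing region.

```python
def materialize(events, root=1):
    """Apply embedded updates and return the resulting token list."""
    content = {root: []}  # region id -> tokens or ("ref", region id) placeholders
    parent = {}           # region id -> id of the region that holds its placeholder
    for eid, kind, arg in events:
        if kind == "data":
            content[eid].append(arg)
        elif kind == "startMutable":
            # Open a new mutable region inside the enclosing region `arg`.
            content[eid] = []
            parent[eid] = arg
            content[arg].append(("ref", eid))
        elif kind in ("startInsertBefore", "startInsertAfter", "startReplace"):
            # `arg` is the target region; the new events carry id `eid`.
            holder = content[parent[arg]]
            pos = holder.index(("ref", arg))
            content[eid] = []
            parent[eid] = parent[arg]
            if kind == "startInsertBefore":
                holder.insert(pos, ("ref", eid))
            elif kind == "startInsertAfter":
                holder.insert(pos + 1, ("ref", eid))
            else:
                holder[pos] = ("ref", eid)
        # end* events add no information in this simplified model

    def flatten(rid):
        out = []
        for item in content[rid]:
            if isinstance(item, tuple):
                out.extend(flatten(item[1]))
            else:
                out.append(item)
        return out

    return flatten(root)


events = [
    (1, "data", "<a>"), (1, "data", "<b>"),
    (2, "startMutable", 1),
    (2, "data", "<c>"), (2, "data", "X"), (2, "data", "</c>"),
    (2, "endMutable", 1),
    (1, "data", "</b>"),
    (3, "startInsertBefore", 2),
    (3, "data", "<c>"), (3, "data", "Y"), (3, "data", "</c>"),
    (3, "endInsertBefore", 2),
    (1, "data", "</a>"),
]
print("".join(materialize(events)))
```

Because the inserted events become a region of their own (id 3), a later update could target them in the same way, which mirrors the point that updated regions can be updated too.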

8
Continuous Results

Our stream engine is implemented as a pipeline
each pipeline stage performs a very simple task
The final pipeline stage is the Result Display
that displays the query results continuously
the display is an editable text window, where
text can be inserted, deleted, and replaced at
any point
when an update is coming in the input stream, it
is propagated all the way to the result display,
where it causes an update to the display text!

[Figure: the input stream with updates flows through the query pipeline, which sends an output stream with updates to the result display]
9
Motivating Example

Group and order book titles by author
let $al := distinct-values(doc("bib.xml")//book/author)
return
  for $a in $al order by $a
  return
    doc("bib.xml")//book[author=$a]/title
Multiple points of blocking
distinct-values
count
self-join
order-by

10
Motivating Example

Group and order book titles by author
The result display is refreshed continuously

[Figure: the input stream is a sequence of books, shown here as author/title pairs: D/T1, A/T2, A/T3, B/T4, B/T5, A/T6, C/T7, C/T8, A/T9, B/T10, D/T11, ... After the first book (D/T1), the display shows T1.]
11
Motivating Example

Group and order book titles by author
The result display is refreshed continuously

[Figure: same input stream. After the second book (A/T2), the display shows T2 for author A, followed by T1 for author D.]
12
Motivating Example

Group and order book titles by author
The result display is refreshed continuously

[Figure: same input stream. After the third book (A/T3), the display shows T2 and T3 for author A, followed by T1 for author D.]
13
Motivating Example

Group and order book titles by author
The result display is refreshed continuously

[Figure: same input stream. After the fourth book (B/T4), the display shows T2 and T3 for author A, T4 for author B, and T1 for author D.]
14
Motivating Example

Group and order book titles by author
The result display is refreshed continuously

[Figure: same input stream. After the fifth book (B/T5), the display shows T2 and T3 for author A, T4 and T5 for author B, and T1 for author D.]
15
Why?

Because this is what you really want to see as
the result of a query
e.g., in a stock ticker feed stream, where updates
to ticker values come continuously
It leads to optimistic evaluation where
results are displayed immediately, to be
retracted or modified later when more information
is available
addresses the blocking problem
we proceed without waiting, displaying the
results so far, but later we may have to send
updates
generalizes on-line aggregation

16
Optimistic Evaluation

Pessimistic evaluation: at all times, the query
display must show the correct results up
to that point
Optimistic evaluation: display any possible
output without delay and later, if necessary,
retract it or modify it to make it correct
complementary to, but different from, lazy
evaluation
How?
Generated and incoming updates are propagated
through the evaluation pipeline until they are
processed by the display
They may cause changes to the states of the
pipeline stages
Examples
Event counting: instead of waiting until we count
all events, we generate updates that continuously
display the counter so far
Predicate testing: assume the predicate is true,
but when you later find that it is false, retract
all output associated with this predicate
Sorting: wrap each element to be sorted in an
update that inserts it into the correct place in
the element sequence so far
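
The sorting case can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: each element gets a hypothetical integer region id, updates are plain tuples rather than start/end event pairs, and an element larger than everything seen so far is simply appended instead of being wrapped in an update.

```python
import bisect


class OptimisticSorter:
    """Emits one update per element instead of blocking until end of stream."""

    def __init__(self):
        self.keys = []    # sorted values seen so far
        self.ids = []     # region ids, parallel to self.keys
        self.next_id = 0

    def __call__(self, value):
        uid = self.next_id
        self.next_id += 1
        pos = bisect.bisect_right(self.keys, value)
        if pos < len(self.ids):
            # Insert before the region of the smallest larger element.
            update = ("insertBefore", self.ids[pos], uid, value)
        else:
            update = ("append", None, uid, value)
        self.keys.insert(pos, value)
        self.ids.insert(pos, uid)
        return [update]


def apply_to_display(display, update):
    """Toy display: a list of (region id, value) pairs kept in display order."""
    kind, target, uid, value = update
    if kind == "insertBefore":
        index = next(i for i, (rid, _) in enumerate(display) if rid == target)
        display.insert(index, (uid, value))
    else:
        display.append((uid, value))


sorter, display, snapshots = OptimisticSorter(), [], []
for author in ["D", "A", "A", "B"]:
    for update in sorter(author):
        apply_to_display(display, update)
    snapshots.append([value for _, value in display])
print(snapshots)
```

The display is correct with respect to the input seen so far after every element, which is the point of the optimistic approach: nothing waits for the end of the stream.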

17
Contributions

Instead of eagerly performing the updates on
cached portions of the stream, we propagate the
updates through the pipeline
all the way to the query result display
the display prints the results continuously,
replacing old results with new
Other approaches
display approximate answers continuously by
focusing on a sliding window over the stream
Our approach
generates exact answers continuously in the form
of an update stream
But the propagated updates may affect the state
of the operators
we developed a uniform methodology to incorporate
state change
Used this framework to unblock operations and
reduce buffering
let the operations themselves embed new updates
into the stream that retroactively perform the
blocking parts of the operation
why? because later is often better than now

18
State Transformers

Each stage in the query evaluation pipeline is a
state transformer
Input: a single event and a state S
Output: a sequence of events and a new state S'
Implemented as a function from an event to a
sequence of events that destructively modifies
the state
can be used in both pull- and push-based stream
processing
The state transformers need only handle the basic
SAX events: <tag>, </tag>, text, and
begin/end of stream
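
As an illustration of the "function from an event to a sequence of events" view, the following Python sketch chains such functions into a push-style pipeline. The event encoding as (kind, value) pairs and the two toy stages are assumptions made for this example, not the stages of the actual engine.

```python
def run_pipeline(stages, events):
    """Feed each input event through every stage in order."""
    for event in events:
        batch = [event]
        for stage in stages:
            # A stage maps one event to zero or more output events.
            batch = [out for e in batch for out in stage(e)]
        yield from batch


def drop_empty_text(event):
    """Stateless stage: discard whitespace-only text events."""
    kind, value = event
    return [] if kind == "text" and not value.strip() else [event]


class RunningTextLength:
    """Stateful stage: replace each text event by the total length so far."""

    def __init__(self):
        self.total = 0  # the state, modified destructively

    def __call__(self, event):
        kind, value = event
        if kind == "text":
            self.total += len(value)
            return [("text", str(self.total))]
        return [event]


stream = [("start", "a"), ("text", "  "), ("text", "xy"), ("text", "z"), ("end", "a")]
result = list(run_pipeline([drop_empty_text, RunningTextLength()], stream))
print(result)
```

The same stage objects could be driven from a pull loop instead, since each one only needs to be called once per event.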

[Figure: a state transformer with its state, reading and writing streams of regular and update events]
19
Example: Element Counting

The state is an integer counter, count
A blocking state transformer, f(e):
  if e is a text event
    count := count + 1
    return the empty sequence
  else if e is end-of-stream
    return the value of count
A non-blocking state transformer:
  if e is begin-of-stream
    return startMutable(id), 0, endMutable(id)
  else if e is a text event
    count := count + 1
    return startReplace(id), the value of count,
    endReplace(id)
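
The two transformers above can be written out as a short Python sketch. The (kind, value) event encoding and the region id "c0" are assumptions chosen for the example.

```python
class BlockingCounter:
    """Emits nothing until end-of-stream, then the final count."""

    def __init__(self):
        self.count = 0

    def __call__(self, event):
        kind, _ = event
        if kind == "text":
            self.count += 1
            return []
        if kind == "end-of-stream":
            return [("text", str(self.count))]
        return []


class NonBlockingCounter:
    """Emits a mutable 0 up front, then a replace update per text event."""

    def __init__(self, uid="c0"):
        self.count = 0
        self.uid = uid  # hypothetical id of the mutable region holding the count

    def __call__(self, event):
        kind, _ = event
        if kind == "begin-of-stream":
            return [("startMutable", self.uid), ("text", "0"), ("endMutable", self.uid)]
        if kind == "text":
            self.count += 1
            return [("startReplace", self.uid), ("text", str(self.count)),
                    ("endReplace", self.uid)]
        return []


stream = [("begin-of-stream", None), ("text", "x"), ("start", "a"),
          ("text", "y"), ("end-of-stream", None)]
blocking, live = BlockingCounter(), NonBlockingCounter()
blocking_out = [blocking(e) for e in stream]
live_out = [live(e) for e in stream]
print(blocking_out[-1], live_out[3])
```

The blocking version produces its only output at the last event, while the non-blocking one has already published the current count after every text event.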

20
XPath Steps

The state transformers of simple XPath forward
steps are trivial to implement
Example: the Child step (/tag)
state
need a counter nest to keep track of the nesting
depth, and
a flag pass to remember if we are currently
passing through or discarding events
logic
when we see the event <tag> at nest=1, we enter
pass mode and stay there until we see </tag> at
nest=1
when in pass mode, we return the current event
otherwise, we return the empty sequence
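
A possible Python rendering of this logic is sketched below. It assumes that nest counts the elements that are currently open, so a start tag seen while nest is 1 belongs to a child of the context element; the (kind, value) event encoding is also an assumption.

```python
class ChildStep:
    """State transformer for /tag: pass through matching children only."""

    def __init__(self, tag):
        self.tag = tag
        self.nest = 0         # number of currently open elements
        self.passing = False  # True while inside a matching child

    def __call__(self, event):
        kind, value = event
        out = []
        if kind == "start":
            if self.nest == 1 and value == self.tag:
                self.passing = True
            if self.passing:
                out = [event]
            self.nest += 1
        elif kind == "end":
            self.nest -= 1
            if self.passing:
                out = [event]
            if self.nest == 1 and value == self.tag:
                self.passing = False
        elif self.passing:
            out = [event]
        return out


# <a><b>x<b>y</b></b><c>z</c></a> filtered by /b
doc = [("start", "a"), ("start", "b"), ("text", "x"), ("start", "b"), ("text", "y"),
       ("end", "b"), ("end", "b"), ("start", "c"), ("text", "z"), ("end", "c"),
       ("end", "a")]
step = ChildStep("b")
selected = [out for e in doc for out in step(e)]
print(selected)
```

The nested b element is passed through as part of its parent rather than matched on its own, because matching only happens at nesting depth 1.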

21
Handling Updates

It would be cumbersome to modify each state
transformer to handle incoming update events
Our solution: the update events are handled in
the same way for all state transformers
Each state transformer is wrapped by a fixed
function that handles update events by adjusting
the states, while passing the regular events to
the state transformer

[Figure: a fixed wrapper receives regular and update events; update events drive a state adjustment over the stored states, while regular events are passed with the current state to the state transformer, which emits regular and update events]
22
Adjusting States

For each state transformer, we need to provide
only one function
adjust(s1,s2,s3)
if state s2 is replaced by s3, adjust the
succeeding state s1 accordingly
Example: the adjust function for element counting
is
adjust(s1,s2,s3).count = s1.count + (s3.count - s2.count)
The adjust function for XPath steps is the
identity
adjust(s1,s2,s3) = s1
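
The two adjust functions can be illustrated with a small Python sketch. The CountState class and the function names are hypothetical; only the arithmetic comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class CountState:
    count: int


def adjust_count(s1, s2, s3):
    """If state s2 is replaced by s3, shift the later state s1 by the difference."""
    return CountState(s1.count + (s3.count - s2.count))


def adjust_xpath(s1, s2, s3):
    """XPath steps: an update does not change the succeeding state."""
    return s1


# A region used to end with count 3 and, after an update, ends with count 5.
# A later state that had reached 7 must therefore become 9.
later = adjust_count(CountState(7), CountState(3), CountState(5))
print(later)
```

Only the difference between the old and new state matters, so the function does not need to know what the update actually contained.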

23
Adjusting State for Element Counting

id  Event                   element counting       after adjustment
1   <...>
1   <...>
2   startMutable(1)         start state(2) = n     n+1
2   <...>
2   X
2   </...>
2   endMutable(1)           end state(2) = n+1     n+2
1   </...>
3   startInsertBefore(2)    start state(3) = n
3   <...>
3   Y
3   </...>
3   endInsertBefore(2)      end state(3) = n+1
1   </...>

events with id=2 work with the id=2 state copy
events with id=3 work with the id=3 state copy
24
The Result Display

It is like any other state transformer, but it also
performs side effects on the display screen
The state of an update id is its position on the
screen
adjust(s1,s2,s3) = s1 + s3 - s2
Side effects
remove_text(start,end)
insert_text(position,text)
The state transformer is very simple
e.g., for a <tag> event, insert the text "<tag>" on the
screen at the current position
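
A minimal Python sketch of such a display follows. Only remove_text, insert_text and the position adjustment come from the slide; the Display class, the regions table and the append_region and replace helpers are assumptions made for illustration.

```python
class Display:
    """Editable text buffer that remembers where each update region sits."""

    def __init__(self):
        self.text = ""
        self.regions = {}  # update id -> (start, end) positions in self.text

    def remove_text(self, start, end):
        self.text = self.text[:start] + self.text[end:]

    def insert_text(self, position, text):
        self.text = self.text[:position] + text + self.text[position:]

    def append_region(self, uid, text):
        start = len(self.text)
        self.insert_text(start, text)
        self.regions[uid] = (start, start + len(text))

    def replace(self, uid, text):
        start, old_end = self.regions[uid]
        new_end = start + len(text)
        self.remove_text(start, old_end)
        self.insert_text(start, text)
        self.regions[uid] = (start, new_end)
        shift = new_end - old_end
        for other, (s, e) in list(self.regions.items()):
            if other != uid and s >= old_end:
                # adjust(s1, s2, s3) = s1 + s3 - s2 applied to later positions
                self.regions[other] = (s + shift, e + shift)


screen = Display()
screen.append_region("label", "n=")
screen.append_region("count", "9")
screen.append_region("tail", " items")
screen.replace("count", "10")
print(screen.text, screen.regions["tail"])
```

Regions that start after the replaced text are shifted by the change in length, so a later update to them still lands in the right place.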

25
Problem

For each update id, each state transformer must
keep a separate copy of the state
Will lead to space explosion for an infinite
stream of updates
Not applicable to replacement updates, since our
queries are snapshot, not historical
For incoming updates, we can ignore updates that
are irrelevant to the query
Hard for content-based predicates
For generated updates, the scope of an update is
usually limited
The scope is often known at run time
Allows the removal of out-of-scope states
eg, predicate testing

26
Unblocking XQuery Operations

We have used this technique for unblocking
concatenation
predicates
descendant
backward steps
sorting
In our preliminary results, many XQueries on
large data sets had high throughput, required
very little buffering, and (of course) had very
fast first response time

27
Future Work