Title: Efficient Processing of XML Update Streams
1Efficient Processing of XML Update Streams
- Leonidas Fegaras
- University of Texas at Arlington
2Data Stream Processing
- What is a data stream?
- continuous, time-varying data arriving at unpredictable rates
- continuous updates, long-running queries, continuous results
- Sought characteristics of stream processing engines
- real-time processing
- high throughput, low latency, fast mean response time, low jitter
- low memory footprint
- Why bother?
- many data are already available in stream form
- sensor networks, network traffic monitoring, stock tickers
- publisher-subscriber systems
- data stream mining for fraud detection
- data may be too volatile to index
- continuous measurements
3XML Stream Processing
- Why XML?
- There is no reason to normalize stream data
- Various sources of XML streams
- tokenized XML documents
- RSS feeds
- web service results
- Granularity
- XML tokens (events): <tag>, </tag>, text, etc.
- region-encoded XML elements (e.g., based on pre-order numbering)
- XML fragments (hole-filler model)
- Push-based processing: SAX
- Pull-based processing: StAX
4Traditional Stream Processing
- Works on streams that consist of numerical values or relational tuples
- Focuses on a sliding window
- fixed number of tuples, or
- fixed time span
- Calculates approximate results
- Uses a small (bounded) state
- Examples
- top-k most frequent values
- group-by SQL queries
[Figure: an input stream passes through a stream engine, which keeps a state over a sliding window between past and future, and produces an output stream]
5Our Goals
- Handle continuous XQueries over continuous streamed XML data
- Embedded updates in the streams
- Exact rather than approximate answers
- Produce continuous results, even when the results are not complete
- Problem: most interesting operations are blocking and/or require unbounded state
- grouping and aggregation
- predicate evaluation
- sorting
- sequence concatenation
- backward axis steps
- We want to address the blocking problem differently
- Display the current result of the blocking operation continuously in the form of an update stream
- incoming vs. generated updates
6Our View of XML Update Streams
- A continuous (possibly infinite) sequence of XML tokens with embedded updates
- Typically, a finite data stream followed by an infinite stream of updates
- SAX-like events
- three basic types of tokens: <tag>, </tag>, text
- the target of an update is a stream subsequence that contains zero, one, or more complete XML elements
- the source is also a token sequence that contains complete XML elements
- updates are embedded in the data stream and can come at any time
- update events can be interleaved with data events and with each other
- each event must now have an ID to associate it with an update
- updated regions can be updated too
- to update a stream subsequence, you wrap it into a Mutable region
- three types of updates
- replace, insertBefore, insertAfter
7Example
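- id Event
- 1 (start tag)
- 1 (start tag)
- 2 startMutable(1)
- 2 (start tag)
- 2 X
- 2 (end tag)
- 2 endMutable(1)
- 1 (end tag)
- 3 startInsertBefore(2)
- 3 (start tag)
- 3 Y
- 3 (end tag)
- 3 endInsertBefore(2)
- 1 (end tag)
- equivalent to: the same document with the inserted element (text Y) placed immediately before the original element (text X)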
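The effect of such a stream can be sketched in Python by eagerly applying the updates and flattening the result. This is only an illustration of what the stream is "equivalent to", not the engine described in these slides (which propagates updates instead of applying them). The event encoding `(kind, argument)`, the use of a single id per region, the tag names `a`, `b`, `c`, and the rule that updates are neither interleaved nor applied to a still-open region are all assumptions made for the sketch.

```python
class Region:
    """A mutable region plus the content inserted right before/after it."""
    def __init__(self):
        self.before, self.items, self.after = [], [], []

def flatten(items):
    """Turn nested regions back into a flat token list."""
    out = []
    for item in items:
        if isinstance(item, Region):
            out += flatten(item.before) + flatten(item.items) + flatten(item.after)
        else:
            out.append(item)
    return out

def materialize(events):
    """Apply all embedded updates and return the resulting token sequence."""
    root, regions = [], {}
    stack = [root]                      # stack[-1] collects incoming tokens
    for kind, arg in events:
        if kind == "token":
            stack[-1].append(arg)
        elif kind == "startMutable":
            region = regions[arg] = Region()
            stack[-1].append(region)
            stack.append(region.items)
        elif kind == "endMutable":
            stack.pop()
        elif kind in ("startReplace", "startInsertBefore", "startInsertAfter"):
            stack.append([])            # collect the source of the update
        elif kind == "endReplace":
            regions[arg].items[:] = stack.pop()
        elif kind == "endInsertBefore":
            regions[arg].before.extend(stack.pop())   # lands next to the target
        elif kind == "endInsertAfter":
            regions[arg].after[:0] = stack.pop()      # lands next to the target
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return flatten(root)

# The example stream, with hypothetical tag names.
events = [
    ("token", "<a>"), ("token", "<b>"),
    ("startMutable", 2), ("token", "<c>"), ("token", "X"), ("token", "</c>"),
    ("endMutable", 2),
    ("token", "</b>"),
    ("startInsertBefore", 2), ("token", "<c>"), ("token", "Y"), ("token", "</c>"),
    ("endInsertBefore", 2),
    ("token", "</a>"),
]
result = "".join(materialize(events))
```

Under these assumptions, `result` places the element containing Y directly before the element containing X, even though the insertion arrives after the enclosing element has been closed.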
8Continuous Results
- Our stream engine is implemented as a pipeline
- each pipeline stage performs a very simple task
- The final pipeline stage is the Result Display, which displays the query results continuously
- the display is an editable text window, where text can be inserted, deleted, and replaced at any point
- when an update arrives in the input stream, it is propagated all the way to the result display, where it causes an update to the display text
[Figure: input stream with updates -> query pipeline -> output stream with updates -> result display]
9Motivating Example
- Group and order book titles by author
- let $al := distinct-values(doc("bib.xml")//book/author)
- return
- for $a in $al order by $a
- return
- ...
- doc("bib.xml")//book[author=$a]/title
- ...
- Multiple points of blocking
- distinct-values
- count
- self-join
- order-by
10Motivating Example
- Group and order book titles by author
- (query as on slide 9)
- The result display is refreshed continuously
- display
- D: T1
- input stream, as (author, title) per book
- (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6) (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...
- currently: after book 1
11Motivating Example
- Group and order book titles by author
- (query as on slide 9)
- The result display is refreshed continuously
- display
- A: T2
- D: T1
- input stream, as (author, title) per book
- (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6) (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...
- currently: after book 2
12Motivating Example
- Group and order book titles by author
- (query as on slide 9)
- The result display is refreshed continuously
- display
- A: T2, T3
- D: T1
- input stream, as (author, title) per book
- (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6) (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...
- currently: after book 3
13Motivating Example
- Group and order book titles by author
- (query as on slide 9)
- The result display is refreshed continuously
- display
- A: T2, T3
- B: T4
- D: T1
- input stream, as (author, title) per book
- (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6) (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...
- currently: after book 4
14Motivating Example
- Group and order book titles by author
- (query as on slide 9)
- The result display is refreshed continuously
- display
- A: T2, T3
- B: T4, T5
- D: T1
- input stream, as (author, title) per book
- (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6) (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...
- currently: after book 5
15Why?
- Because this is what you really want to see as the result of a query
- e.g., in a stock ticker feed stream, where updates to ticker values come continuously
- It leads to optimistic evaluation, where results are displayed immediately, to be retracted or modified later when more information is available
- addresses the blocking problem
- we proceed without waiting, displaying the results so far, but later we may have to send updates
- generalizes on-line aggregation
16Optimistic Evaluation
- Pessimistic evaluation: at all times, the query display must show the correct results up to that point
- Optimistic evaluation: display any possible output without delay and later, if necessary, retract it or modify it to make it correct
- complementary to, but different from, lazy evaluation
- How?
- Generated and incoming updates are propagated through the evaluation pipeline until they are processed by the display
- They may cause changes to the states of the pipeline stages
- Examples
- Event counting: instead of waiting until we count all events, we generate updates that continuously display the counter so far
- Predicate testing: assume the predicate is true, but when you later find that it is false, retract all output associated with this predicate
- Sorting: wrap each element to be sorted in an update that inserts it into the correct place in the element sequence so far
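The sorting example can be sketched in Python as a stage that emits one update per incoming element. This is a minimal illustration, not the authors' operator: the class name `OptimisticSorter`, the `(kind, argument)` event encoding, and the assumption that insertBefore/insertAfter place content immediately next to the target region are all hypothetical. Each element is wrapped in its own Mutable region so that later elements can be inserted relative to it.

```python
import bisect
import itertools

class OptimisticSorter:
    """Emit each element at once, wrapped in an update that positions it."""

    def __init__(self):
        self.keys = []                  # sort keys seen so far, in order
        self.ids = []                   # region id of each key, same order
        self.fresh = itertools.count()  # generator of new region ids

    def feed(self, key, tokens):
        rid = next(self.fresh)
        wrapped = ([("startMutable", rid)]
                   + [("token", t) for t in tokens]
                   + [("endMutable", rid)])
        pos = bisect.bisect_right(self.keys, key)  # keeps equal keys stable
        if not self.keys:
            out = wrapped                          # first element: plain output
        elif pos < len(self.keys):
            target = self.ids[pos]                 # smallest larger element
            out = ([("startInsertBefore", target)] + wrapped
                   + [("endInsertBefore", target)])
        else:
            target = self.ids[-1]                  # current largest element
            out = ([("startInsertAfter", target)] + wrapped
                   + [("endInsertAfter", target)])
        self.keys.insert(pos, key)
        self.ids.insert(pos, rid)
        return out

sorter = OptimisticSorter()
emitted = [sorter.feed(k, [str(k)]) for k in [3, 1, 2, 5, 4]]
```

Nothing is buffered before output, so the first element is visible immediately. The stage still keeps one key per element, which matches the earlier remark that such operations may need unbounded state.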
17Contributions
- Instead of eagerly performing the updates on cached portions of the stream, we propagate the updates through the pipeline
- all the way to the query result display
- the display prints the results continuously, replacing old results with new
- Other approaches
- display approximate answers continuously by focusing on a sliding window over the stream
- Our approach
- generates exact answers continuously in the form of an update stream
- But the propagated updates may affect the state of the operators
- we developed a uniform methodology to incorporate state change
- Used this framework to unblock operations and reduce buffering
- let the operations themselves embed new updates into the stream that retroactively perform the blocking parts of the operation
- why? because later is often better than now
18State Transformers
- Each stage in the query evaluation pipeline is a state transformer
- Input: a single event and a state S
- Output: a sequence of events and a new state S'
- Implemented as a function from an event to a sequence of events that destructively modifies the state
- can be used in both pull- and push-based stream processing
- The state transformers need only handle the basic SAX events: <tag>, </tag>, text, and begin/end of stream
[Figure: a state transformer with its state, connecting a stream with regular events and a stream with regular and update events]
19Example Element Counting
- The state is an integer counter, count
- A blocking state transformer, f(e)
- if e is a text event
- count := count + 1
- return the empty sequence
- else if e is end-of-stream
- return the value of count
- A non-blocking state transformer
- if e is begin-of-stream
- return startMutable(id), 0, endMutable(id)
- else if e is a text event
- count := count + 1
- return startReplace(id), the value of count, endReplace(id)
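Both transformers can be written out as a short Python sketch. The event encoding `(kind, value)`, the kind names `begin_stream`, `end_stream`, and `text`, the region id `"c"`, and the helper `run` are assumptions made for illustration; only the logic follows the slide.

```python
from dataclasses import dataclass

@dataclass
class CountState:
    count: int = 0

def blocking_count(event, state):
    """Emit nothing until end-of-stream, then emit the final count."""
    kind, _ = event
    if kind == "text":
        state.count += 1
        return []
    if kind == "end_stream":
        return [("text", str(state.count))]
    return []

def nonblocking_count(event, state, uid="c"):
    """Emit 0 in a mutable region, then replace it on every text event."""
    kind, _ = event
    if kind == "begin_stream":
        return [("startMutable", uid), ("text", "0"), ("endMutable", uid)]
    if kind == "text":
        state.count += 1
        return [("startReplace", uid), ("text", str(state.count)),
                ("endReplace", uid)]
    return []

def run(transformer, stream):
    """Feed a whole stream through one transformer with a fresh state."""
    state, out = CountState(), []
    for event in stream:
        out.extend(transformer(event, state))
    return out

stream = [("begin_stream", None), ("text", "a"), ("start", "b"),
          ("text", "c"), ("end", "b"), ("end_stream", None)]
blocking_out = run(blocking_count, stream)
nonblocking_out = run(nonblocking_count, stream)
```

The blocking version produces its single output only at end-of-stream, while the non-blocking version produces output from the first event on and corrects it with replace updates.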
20XPath Steps
- The state transformers of simple XPath forward steps are trivial to implement
- Example: the Child step (/tag)
- state
- need a counter, nest, to keep track of the nesting depth, and
- a flag, pass, to remember if we are currently passing through or discarding events
- logic
- when we see the event <tag> at nest=1, we enter pass mode and stay there until we see </tag> at nest=1
- when in pass mode, we return the current event
- otherwise, we return the empty sequence
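The Child step logic can be sketched in Python as follows. The event encoding `("start", name)`, `("end", name)`, `("text", value)` is an assumption, the flag is named `passing` because `pass` is a Python keyword, and the sketch takes the context element itself to sit at depth 0 so that its children are seen at nest = 1.

```python
from dataclasses import dataclass

@dataclass
class ChildState:
    nest: int = 0          # nesting depth of the next event
    passing: bool = False  # True while inside a matching child element

def child_step(event, state, tag):
    """Pass through the children named `tag`, discard everything else."""
    kind, value = event
    out = []
    if kind == "start":
        if state.nest == 1 and value == tag:
            state.passing = True       # matching child opens: enter pass mode
        if state.passing:
            out = [event]
        state.nest += 1
    elif kind == "end":
        state.nest -= 1
        if state.passing:
            out = [event]
        if state.nest == 1 and value == tag:
            state.passing = False      # matching child closes: leave pass mode
    elif state.passing:
        out = [event]                  # text inside a matching child
    return out

def run_child(stream, tag):
    state, out = ChildState(), []
    for event in stream:
        out.extend(child_step(event, state, tag))
    return out

stream = [("start", "a"), ("start", "b"), ("text", "x"), ("end", "b"),
          ("start", "c"), ("text", "y"), ("end", "c"), ("end", "a")]
children_b = run_child(stream, "b")
```

The state stays constant in size regardless of the stream length, which is why forward steps of this kind do not block.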
21Handling Updates
- It would be cumbersome to modify each state transformer to handle incoming update events
- Our solution: the update events are handled in the same way for all state transformers
- Each state transformer is wrapped by a fixed function that handles update events by adjusting the states, while passing the regular events to the state transformer
[Figure: a wrapper around the state transformer; regular and update events come in, update events drive state adjustment over the saved states and the current state, regular events go to the state transformer, and regular and update events come out]
22Adjusting States
- For each state transformer, we need to provide only one function
- adjust(s1,s2,s3)
- if state s2 is replaced by s3, adjust the succeeding state s1 accordingly
- Example: the adjust function for element counting is
- adjust(s1,s2,s3).count = s1.count + (s3.count - s2.count)
- The adjust function for XPath steps is the identity
- adjust(s1,s2,s3) = s1
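How a fixed wrapper might use adjust can be sketched in Python for the counting transformer. This is a simplified illustration under stated assumptions, not the authors' implementation: the class name `UpdateWrapper`, the `(kind, argument)` event encoding, and the dictionary state are hypothetical; updates are assumed to arrive one at a time after their target region has closed; and the saved states of other regions are not re-adjusted, which is harmless for counting because only differences between states are used. Output events are omitted to keep the focus on the state.

```python
import copy

def count_text(event, state):
    """Regular-event transformer: count text events."""
    if event[0] == "text":
        state["count"] += 1

def adjust_count(s1, s2, s3):
    """State s2 was replaced by s3, so shift the succeeding state s1."""
    return {"count": s1["count"] + (s3["count"] - s2["count"])}

class UpdateWrapper:
    STARTS = ("startReplace", "startInsertBefore", "startInsertAfter")
    ENDS = ("endReplace", "endInsertBefore", "endInsertAfter")

    def __init__(self, transformer, adjust, initial_state):
        self.transformer, self.adjust = transformer, adjust
        self.state = initial_state
        self.start, self.end = {}, {}   # region id -> saved state copy
        self.pending = None             # (state before update, old state s2)

    def feed(self, event):
        kind, arg = event
        if kind == "startMutable":
            self.start[arg] = copy.deepcopy(self.state)
        elif kind == "endMutable":
            self.end[arg] = copy.deepcopy(self.state)
        elif kind in self.STARTS:
            # The update content is processed from the state at its position.
            base = self.end[arg] if kind == "startInsertAfter" else self.start[arg]
            old = self.end[arg] if kind == "startReplace" else base
            self.pending = (self.state, old)
            self.state = copy.deepcopy(base)
        elif kind in self.ENDS:
            s1, s2 = self.pending
            s3 = self.state
            if kind == "endReplace":
                self.end[arg] = copy.deepcopy(s3)   # region has new content
            self.state = self.adjust(s1, s2, s3)
        else:
            self.transformer(event, self.state)     # regular event

w = UpdateWrapper(count_text, adjust_count, {"count": 0})
for ev in [("text", "a"), ("startMutable", 2), ("text", "X"), ("endMutable", 2),
           ("text", "b"),
           ("startInsertBefore", 2), ("text", "Y"), ("endInsertBefore", 2)]:
    w.feed(ev)
count_after_insert = w.state["count"]
for ev in [("startReplace", 2), ("endReplace", 2)]:   # replace region 2 by nothing
    w.feed(ev)
count_after_replace = w.state["count"]
```

The counting transformer itself never sees an update event; swapping in the identity for `adjust` would give the behavior described for XPath steps.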
23Adjusting State for Element Counting
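- id, Event, element counting adjustment
- 1 (start tag)
- 1 (start tag)
- 2 startMutable(1): start state(2): n, n+1
- 2 (start tag)
- 2 X
- 2 (end tag)
- 2 endMutable(1): end state(2): n+1, n+2
- 1 (end tag)
- 3 startInsertBefore(2): start state(3): n
- 3 (start tag)
- 3 Y
- 3 (end tag)
- 3 endInsertBefore(2): end state(3): n+1
- 1 (end tag)
- events with id 2 work with the id 2 state copy
- events with id 3 work with the id 3 state copy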
24The Result Display
- It is like any other state transformer, but it also performs side effects on the display screen
- The state of an update id is its position on the screen
- adjust(s1,s2,s3) = s1 + s3 - s2
- Side effects
- remove_text(start,end)
- insert_text(position,text)
- The state transformer is very simple
- e.g., for a <tag> event, insert the text "<tag>" on the screen at the current position
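A small Python sketch of such a display, with the screen modeled as a string. The class `Display`, its `append` and `replace` methods, and the region ids are hypothetical; `insert_text`, `remove_text`, and `adjust` follow the names on the slide. The sketch assumes that regions do not nest or overlap.

```python
def adjust(s1, s2, s3):
    """Position s2 moved to s3, so shift the later position s1."""
    return s1 + s3 - s2

class Display:
    def __init__(self):
        self.text = ""
        self.regions = {}               # update id -> [start, end] positions

    def insert_text(self, position, text):
        self.text = self.text[:position] + text + self.text[position:]

    def remove_text(self, start, end):
        self.text = self.text[:start] + self.text[end:]

    def append(self, text, rid=None):
        """Print regular output; remember its position if it is mutable."""
        start = len(self.text)
        self.insert_text(start, text)
        if rid is not None:
            self.regions[rid] = [start, len(self.text)]

    def replace(self, rid, text):
        """Apply a replace update to the screen and shift later regions."""
        start, old_end = self.regions[rid]
        self.remove_text(start, old_end)
        self.insert_text(start, text)
        new_end = start + len(text)
        for other, span in self.regions.items():
            if other != rid and span[0] >= old_end:
                span[0] = adjust(span[0], old_end, new_end)
                span[1] = adjust(span[1], old_end, new_end)
        self.regions[rid] = [start, new_end]

d = Display()
d.append("count: ")
d.append("0", rid="c")
d.append(" items", rid="tail")
d.replace("c", "12")
screen_after_first = d.text
d.replace("tail", "!")
screen_after_second = d.text
```

Replacing the counter by a longer text moves the `tail` region one position to the right, so the second replace still hits the right span.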
25Problem
- For each update id, each state transformer must keep a separate copy of the state
- Will lead to space explosion for an infinite stream of updates
- Not applicable to replacement updates, since our queries are snapshot, not historical
- For incoming updates, we can ignore updates that are irrelevant to the query
- Hard for content-based predicates
- For generated updates, the scope of an update is usually limited
- The scope is often known at run time
- Allows the removal of out-of-scope states
- e.g., predicate testing
26Unblocking XQuery Operations
- We have used this technique for unblocking
- concatenation
- predicates
- descendant
- backward steps
- sorting
- In our preliminary results, many XQueries on
large data sets had high throughput, required
very little buffering, and (of course) had very
fast first response time
27Future Work
- Plan to cover most XQuery features completely
- Would like to handle historical queries
- Same model for update streams
- ... but now replacement updates may add a new version
- Example
- which stock increased its value by at least 10 since the last update?
- Need to extend the XQuery syntax with historical features
- Need to cut off out-of-scope historical data