1
XPath on Streaming XML using YFilter and χαος

Daniel Cairney

2
YFilter

XFilter is successful at filtering XML documents
A single scan of the document produces results
for many user profiles at once, based on the
user-specific XPath queries
As filtering systems are deployed on the Internet,
the number of users can become very large
However, there is likely to be significant
commonality among user interests and thus among
their XPath expressions
XFilter converts each user query into its own FSM
Movement between the states of the various
machines occurs as the document is processed
An XFilter-like approach may therefore result in
redundant processing, something YFilter tries to
avoid

3
YFilter

YFilter is based on the XFilter approach to
filtering documents
Event-driven XML document parsing
YFilter aims to exploit common XPath expressions
by using a combined state machine to represent
all path expressions
By putting all the user queries into a combined
state machine (a Nondeterministic Finite
Automaton), common XPath expressions among
queries can be shared, reducing the amount of
processing needed for each event

4
YFilter Advantages

YFilter's NFA approach has several advantages
over its predecessors
A relatively small number of states is required to
represent a large number of XPath expressions
It can support complicated document types (recursive
nesting and queries, multiple wildcards, etc.)
The NFA is constructed and maintained
incrementally
The shared path matching used by YFilter's NFA
approach has been shown to give significant
performance improvements over the XFilter approach

5
YFilter Disadvantages

Creating an NFA to match value-based predicates
would quickly explode the size of the NFA
To deal with this, value-based predicates are not
included in the NFA
When a state containing a value-based predicate
is reached, value-based predicate matching occurs
separately
Because this is handled separately, different
methods of value matching can be employed

6
YFilter NFA

YFilter represents all the user queries in a
single Non-Deterministic Finite Automaton (NFA)
The labeled transitions between the NFA states
form a trie over the location steps of the paths
Common prefixes of the paths only occur once in
the NFA
Identical queries share the same path in the NFA,
including the end state
YFilter also uses a stack to deal with
non-determinism
If a next state does not exist, the algorithm
must backtrack to the previous states
The YFilter NFA differs from a traditional NFA
YFilter must find all matching queries
This means the NFA execution must continue even
after an end/accepting state has been reached

7
Creating an NFA from XPath Expressions

Adding XPath expressions to an existing NFA is a
simple incremental process
Combining the XPath expressions /a and /b

[Figure: the automata for /a and /b and the combined NFA, with transitions labeled a and b]
8
Creating an NFA from XPath Expressions

When inserting a new query/path into the NFA
Traverse the current NFA for as long as it matches
the new expression, until either
The final step of the new XPath query is
reached
Make the final state an end/accepting state
Associate this query ID with the ending state
Or a state is reached where there is no transition
to match the next step in the expression
Create a new branch from the last state reached
in the combined NFA

10
Creating an NFA from XPath Expressions

The wildcard (*) and descendant (//) operators
create the non-determinism and are handled
specially when creating the NFA
Wildcards require 2 edges in the NFA, one marked
by the wildcard (*) and the other by the input that
will follow
Descendants are handled by looping, as the NFA
must be able to move to the next state while also
staying in the current location
Example: //a

11
YFilter NFA Example

Example: 8 XPath queries
Q1 = /a/b
Q2 = /a/c
Q3 = /a/b/c
Q4 = /a//b/c
Q5 = /a/*/c
Q6 = /a//c
Q7 = /a/*/*/c
Q8 = /a/b/c

[Figure: combined NFA for Q1 to Q8]
Darker states represent shared states
Bold outlined states represent accepting (end) states
22 XPath nodes represented in 13 states
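
The incremental construction described on the preceding slides can be sketched in a few lines of Python. This is only an illustrative sketch, not YFilter's code: the names `State`, `SharedNFA` and `insert` are made up for this example, a `//` step is modeled as an ε edge into a looping state, and Q5 and Q7 are taken to be /a/*/c and /a/*/*/c.

```python
import re
from dataclasses import dataclass, field

EPS = "ε"  # label of the epsilon transition that introduces a '//' step


@dataclass
class State:
    transitions: dict = field(default_factory=dict)  # label -> target state id
    self_loop: bool = False                          # True for a '//' state
    query_ids: list = field(default_factory=list)    # non-empty => accepting


class SharedNFA:
    """Hypothetical container for the combined automaton."""

    def __init__(self):
        self.states = [State()]  # state 0 is the common start state

    def _follow_or_create(self, src, label, self_loop=False):
        """Reuse the transition src --label--> if it exists, else branch."""
        target = self.states[src].transitions.get(label)
        if target is None:
            self.states.append(State(self_loop=self_loop))
            target = len(self.states) - 1
            self.states[src].transitions[label] = target
        return target

    def insert(self, query_id, path):
        """Insert a path made of '/', '//', element names and '*'."""
        current = 0
        for axis, name in re.findall(r"(//|/)([^/]+)", path):
            if axis == "//":
                # epsilon edge into a looping state that can absorb any depth
                current = self._follow_or_create(current, EPS, self_loop=True)
            current = self._follow_or_create(current, name)
        # the last state reached becomes accepting for this query
        self.states[current].query_ids.append(query_id)


QUERIES = {
    "Q1": "/a/b", "Q2": "/a/c", "Q3": "/a/b/c", "Q4": "/a//b/c",
    "Q5": "/a/*/c", "Q6": "/a//c", "Q7": "/a/*/*/c", "Q8": "/a/b/c",
}

nfa = SharedNFA()
for qid, path in QUERIES.items():
    nfa.insert(qid, path)

print(len(nfa.states))
```

Under these assumptions the eight queries end up in 13 states, and the identical queries Q3 and Q8 share one accepting state.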
12
Executing the NFA

The NFA execution implements a hash-table-based
approach for keeping track of states
Each state in the hash table includes
A state ID
Type information (an accepting end state, or a //
descendant state)
A small hash table which holds the transitions
from this state
The ID of the associated user query for accepting
states
In addition, the NFA execution uses a stack
mechanism capable of tracking multiple paths
The stack keeps track of the current state ID as
well as a set of target states
When the end element event occurs, the stack
backtracks to the previous state at the time of
the previous start element event

13
Executing the NFA

Start Element Event Handler
When a new element is read from the document, the
NFA follows the transitions from all currently
active states
For each active state there are 4 checks
The incoming element name is looked up in the
state's hash table
If it exists, the corresponding state ID is added
to the target states list
The * symbol is looked up in the hash table
If it exists, the corresponding state ID is added
to the target states list
If the state is a descendant (//) state, the state
itself is added to the target states list
This implements the loop
The hash table is checked for the ε symbol
If it exists, the descendant state it leads to is
processed recursively according to the previous 3 rules
Once all active states have been checked, the
list of target states is pushed onto the stack
These states become active for the next start
event

14
Executing the NFA

End Element Event Handler
When the end element occurs, the NFA backtracks
by popping the top set of states from the stack
Once popped, the new top of the stack
represents the active states
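
The two event handlers can be sketched as follows. This is a simplified illustration rather than YFilter's implementation: the state table is written out by hand for two hypothetical paths, P1 = /a/b and P2 = /a//c, and the document is given as a list of (event, name) pairs instead of coming from a real parser.

```python
EPS, ANY = "ε", "*"

# Hand-built NFA for P1 = /a/b and P2 = /a//c (hypothetical example paths).
# state id -> (transitions, is a '//' state, ids of queries accepted here)
NFA = {
    0: ({"a": 1}, False, []),
    1: ({"b": 2, EPS: 3}, False, []),
    2: ({}, False, ["P1"]),
    3: ({"c": 4}, True, []),
    4: ({}, False, ["P2"]),
}


def collect_targets(state_id, name, targets):
    """Apply the four checks of the start element handler to one state."""
    transitions, is_descendant, _ = NFA[state_id]
    for label in (name, ANY):          # checks 1 and 2: element name, then '*'
        if label in transitions:
            targets.add(transitions[label])
    if is_descendant:                  # check 3: a '//' state stays active
        targets.add(state_id)
    if EPS in transitions:             # check 4: process the '//' state too
        collect_targets(transitions[EPS], name, targets)


def run(events):
    """Return the ids of all queries matched by a stream of events."""
    stack = [{0}]                      # the top of the stack is the active set
    matched = set()
    for kind, name in events:
        if kind == "start":
            targets = set()
            for state_id in stack[-1]:
                collect_targets(state_id, name, targets)
            for target in targets:     # keep running after accepting states
                matched.update(NFA[target][2])
            stack.append(targets)
        else:                          # end element: backtrack
            stack.pop()
    return matched


# Events for <a><x><c/></x><b/></a>
EVENTS = [("start", "a"), ("start", "x"), ("start", "c"), ("end", "c"),
          ("end", "x"), ("start", "b"), ("end", "b"), ("end", "a")]
print(sorted(run(EVENTS)))
```

Pushing a possibly empty target set on every start event keeps the stack depth equal to the element depth, so a single pop on each end event is enough to restore the earlier active set.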

15
YFilter NFA Efficiency

Using a shared NFA results in a machine with
fewer states
The NFA could have performance problems due to
the need to support multiple transitions from
each state
The NFA could be converted to a DFA; however, this
would result in scalability problems, as the
number of states would quickly explode
Despite this concern, experimental results show
NFA performance not to be an issue
YFilter's evaluation is sufficiently fast
In many cases the cost of parsing the XML
document was more significant than processing the
NFA

16
Experimental Results

Experiments were performed on a number of random
queries comparing XFilter, YFilter, and a hybrid
of the two

[Figure: performance on distinct queries]
[Figure: performance on queries containing duplicates]
Multi-Query Processing Time (MQPT) = filtering
time - document parsing time
17
YFilter Conclusion

The statistical results show that YFilter
performs much better as the number of user
queries grows
As the number of distinct queries grows, YFilter
significantly outperforms XFilter
These results show that path sharing via an NFA
provides significant performance improvements over
traditional methods of XML document filtering

18
Additional Information

Not mentioned in the presentation are the various
methods by which YFilter may handle value-based
predicates
Value-based predicates are not handled in the
NFA, and were therefore excluded for simplicity
The results also show a hybrid approach that is a
combination of XFilter and YFilter
In the hybrid approach, rather than creating a
single NFA, XFilter-like FSMs are created for
XPaths/queries with similar prefixes (or
substrings)
The hybrid approach is quite similar to the XTrie
algorithm discussed in the previous presentation

19
YFilter Limitations

Although YFilter shows significant improvements
over many of its predecessors, there are still a
few areas where it is lacking
YFilter only supports forward axes (descendant,
child) and not backward axes (ancestor, parent)
Supporting backward axes in an efficient manner
can be a tricky problem, to be discussed next

20
Streaming XPath with Forward and Backward Axes
(χαος)

Main focus on the efficiency of the XPath engine
XPath provides the basis for a number of XML
tools
SQLX, XSLT, XQuery, etc.
Because so much has been built on XPath, an
efficient method to process streaming XML is
necessary

21
χαος Goals

χαος aims to provide significant improvements by
handling XML data somewhat differently from its
predecessors
Goals
Allow for efficient processing regardless of
document size
Process the document in a streaming manner
Process queries / filtering as the document is
parsed
Only 1 pass through the document, visiting each
element only once.
Support forward (descendant/child) and backwards
(ancestor/parent) axes

22
Existing XPath Engines

Most existing XPath engines require the entire
document to be in memory
This is too expensive for large/streaming XML
documents
Previous XPath processing engines only support
forward axes (XFilter, XTrie, YFilter, etc)
Many XPath processors require more than one pass
through the document for axis processing
This is also too expensive for extremely large
documents

23
Example

Example: the Apache Xalan processor on the expression
/descendant::x/ancestor::y (selects all y
ancestors of x elements)
Xalan traverses the document once to find all the
x elements
For each x element, visit the ancestors looking
for a matching y element
This process means Xalan visits some elements of
the document more than once
This is very costly for extremely large XML
documents
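
To make the cost concrete, here is a small sketch of this style of evaluation using Python's `xml.etree.ElementTree`. It is not Xalan's code; the function name and the `visits` counter are made up for this illustration, and the whole document is held in memory.

```python
import xml.etree.ElementTree as ET


def ancestors_of_descendants(xml_text, inner="x", outer="y"):
    """Evaluate /descendant::inner/ancestor::outer the multi-visit way."""
    root = ET.fromstring(xml_text)            # whole document in memory
    parent = {child: node for node in root.iter() for child in node}
    visits = 0                                # ancestor visits in step two
    result = []
    for elem in root.iter(inner):             # step one: find every x
        node = parent.get(elem)
        while node is not None:               # step two: climb to the root
            visits += 1
            if node.tag == outer and node not in result:
                result.append(node)
            node = parent.get(node)
    return result, visits


found, visits = ancestors_of_descendants("<y><y><x/><x/></y></y>")
print(len(found), visits)
```

In this tiny document the two y elements are each visited twice, once per x element, which is the repeated work the slide refers to.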

24
Example (continued)

By eliminating the second (or third) traversal,
χαος is often able to discard unnecessary nodes
sooner, saving on memory
The χαος algorithm is able to deal efficiently
with backward axes by converting them to forward
axes before processing the data

25
Components

The χαος algorithm is based on several key
components
An XPath expression is represented as
an x-tree and an x-dag
An XML document or XML stream is read and parsed
using an event-based parser (such as SAX)
Each parsed XML element is given a sequential ID
and a level
XML elements which match the XPath expression are
stored in a matching-structure

26
X-Tree Construction

The χαος algorithm maps each XPath
node to an x-node in the x-tree
The start of each x-tree is labeled Root
Following the Root node, x-tree nodes are
labeled with the tag name of each node in the
XPath expression
Edges are labeled with the axis specifier (child,
descendant, parent, ancestor)
The rightmost node in the XPath expression that
is not contained in a predicate is labeled as
the output node

27
X-Tree Construction

Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]

[Figure: x-tree for the example. Edges: Root -descendant-> Y,
Y -child-> U, Y -descendant-> W, W -ancestor-> Z, Z -child-> V]
W is marked as the output node
28
X-dag

Once the x-tree is created, the χαος algorithm
then uses a directed acyclic graph representation
of an XPath query called an x-dag
The x-dag is obtained from the x-tree
representation by converting the ancestor and
parent constraints to descendant and child
constraints
The x-dag is a directed, labeled graph G with the
same set of vertices as the previously defined
x-tree T. Edges in the x-dag are defined as
follows

29
X-dag Construction

Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]

Edges in the x-tree T labeled child or descendant
are also edges in the x-dag G

[Figure: x-tree for the example, with the ancestor edge from W to Z still in place]
30
X-dag Construction

Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]

Edges in the x-tree T labeled child or descendant
are also edges in the x-dag G
For each edge in T labeled ancestor, G contains
an edge joining the same nodes, but with the
direction reversed and the label changed to
descendant. (Similarly, parent edges become child
edges)

[Figure: the ancestor edge from W to Z replaced by a descendant edge from Z to W]
31
X-dag Construction

Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]

Edges in the x-tree T labeled child or descendant
are also edges in the x-dag G
For each edge in T labeled ancestor, G contains
an edge joining the same nodes, but with the
direction reversed and the label changed to
descendant. (Similarly, parent edges become child
edges)
For any non-root x-node v in G without an
incoming edge, add a descendant edge from Root to
v

[Figure: a descendant edge from Root to Z added to the graph]
32
X-dag Construction

Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]

[Figure: x-tree and resulting x-dag side by side. x-dag edges:
Root -descendant-> Y, Root -descendant-> Z, Y -child-> U,
Y -descendant-> W, Z -descendant-> W, Z -child-> V]
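
The three x-dag construction rules can be written down directly. The sketch below is only an illustration, assuming an x-tree stored as (from, axis, to) triples; the names `X_TREE` and `to_x_dag` are invented for this example and do not come from the paper.

```python
# x-tree edges for
# /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
X_TREE = [
    ("Root", "descendant", "Y"),
    ("Y", "child", "U"),
    ("Y", "descendant", "W"),
    ("W", "ancestor", "Z"),
    ("Z", "child", "V"),
]

# backward axis -> forward axis used for the reversed edge
REVERSED = {"ancestor": "descendant", "parent": "child"}


def to_x_dag(x_tree, root="Root"):
    dag = []
    for src, axis, dst in x_tree:
        if axis in REVERSED:
            # rule 2: reverse the edge and switch to the forward axis
            dag.append((dst, REVERSED[axis], src))
        else:
            # rule 1: child and descendant edges are kept as they are
            dag.append((src, axis, dst))
    nodes = {n for src, _, dst in x_tree for n in (src, dst)}
    has_incoming = {dst for _, _, dst in dag}
    # rule 3: non-root nodes without an incoming edge hang off the root
    for node in sorted(nodes - has_incoming - {root}):
        dag.append((root, "descendant", node))
    return dag


for edge in to_x_dag(X_TREE):
    print(edge)
```

For the example expression this yields six edges: the four forward edges of the x-tree, the reversed edge from Z to W, and the added edge from Root to Z.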
33
Reading the XML

XML is parsed using an event-driven parser (such
as SAX)
χαος focuses on start element and end element
events
The XML document is parsed depth first
Each element visited is assigned
a sequential ID that uniquely identifies that
element in the document
As the parser visits each node it records the
current level or depth
The level is needed for determining the
child/parent axes, as a child must be located
exactly 1 level deeper than its parent
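
A minimal sketch of this numbering with Python's standard `xml.sax` module is shown below. The handler name and the choice to start IDs and levels at 1 are assumptions made for illustration; they are chosen so that the numbers agree with the example later in the talk.

```python
import xml.sax


class NumberingHandler(xml.sax.ContentHandler):
    """Assigns a sequential id and a level to every element."""

    def __init__(self):
        super().__init__()
        self.next_id = 1      # assumption: ids start at 1
        self.level = 0        # assumption: the document element is level 1
        self.seen = []        # (tag, id, level) in document order

    def startElement(self, name, attrs):
        self.level += 1
        self.seen.append((name, self.next_id, self.level))
        self.next_id += 1

    def endElement(self, name):
        self.level -= 1


def number_elements(xml_text):
    handler = NumberingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.seen


DOC = "<X><Y><W></W><Z><V/><W/></Z><U/></Y></X>"
for tag, elem_id, level in number_elements(DOC):
    print(tag, elem_id, level)
```

Because the parser reports elements depth first, the ID order is document order, and the level of a child is always the level of its parent plus one.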

34
Matchings

As the SAX events occur, the χαος algorithm tries
to match elements of the current document with
the XPath expression, more specifically with the
x-nodes in the x-tree.
Formally, a partial matching of x-nodes from
x-tree T to elements of document D is a mapping
m: V_T → V_D with the following characteristics
All mapped vertices satisfy the node test
For all x-nodes v in domain(m), label(v) =
tag(m(v))
For all x-nodes v1 and v2 connected by an edge in
T such that v1, v2 in domain(m), (v1, m(v1)) is
consistent with (v2, m(v2))
In other words, in a partial matching each x-node
label must match the tag of its element, and the
relationship between 2 x-nodes must also hold
between the 2 mapped elements from the XML
document
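
The two conditions can be checked mechanically. The sketch below assumes a small in-memory document table and a virtual root element with id 0 for the x-tree Root to map to; both are illustration devices, as is the shortcut of using each x-node's label as its name.

```python
# element id -> (tag, parent id); id 0 is an assumed virtual document root
DOC = {0: ("Root", None), 1: ("X", 0), 2: ("Y", 1), 3: ("W", 2),
       4: ("Z", 2), 5: ("V", 4), 6: ("W", 4), 7: ("U", 2)}

# x-tree for /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
X_TREE = [("Root", "descendant", "Y"), ("Y", "child", "U"),
          ("Y", "descendant", "W"), ("W", "ancestor", "Z"),
          ("Z", "child", "V")]


def ancestors(elem):
    """Yield the ids of all proper ancestors of an element."""
    parent = DOC[elem][1]
    while parent is not None:
        yield parent
        parent = DOC[parent][1]


def consistent(axis, e1, e2):
    """Does element e2 stand in relation `axis` to element e1?"""
    if axis == "child":
        return DOC[e2][1] == e1
    if axis == "parent":
        return DOC[e1][1] == e2
    if axis == "descendant":
        return e1 in ancestors(e2)
    if axis == "ancestor":
        return e2 in ancestors(e1)
    raise ValueError(f"unknown axis: {axis}")


def is_partial_matching(m):
    """m maps x-node names to element ids; x-nodes are named by label."""
    if any(DOC[elem][0] != v for v, elem in m.items()):
        return False                      # node test: label(v) = tag(m(v))
    return all(consistent(axis, m[v1], m[v2])
               for v1, axis, v2 in X_TREE
               if v1 in m and v2 in m)    # only edges with both ends mapped


print(is_partial_matching({"Y": 2, "W": 6, "Z": 4}))
print(is_partial_matching({"Y": 2, "W": 3, "Z": 4}))
```

The first mapping is a partial matching because the W with id 6 lies inside the Z with id 4; the second is not, because the W with id 3 has no Z ancestor.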

35
Total Matchings (Overview)

A matching at an x-node v is total if its domain
contains all the vertices of the subtree rooted
at v
The result is the collection of total matchings
at the Root
Computing the total matching for the entire
expression involves collecting the matches as the
document tree and x-tree are traversed

36
Looking for Total Matchings

An element e is relevant if there exists some
document completion where e participates in a
total matching at the Root
All relevant elements must be processed
As events occur new relevant elements may appear,
while others may no longer be relevant.
The x-dag is used to determine which element is
relevant

37
Looking for Total Matchings

The x-dag is useful since it orders the nodes in
the order they should appear in the document
Example: given the XML string
<X><Y><W />
Since the document is processed depth-first, by
the time the start element event for W is
reached, the start elements for all of W's
ancestors have been reached
According to the x-dag, both a Y and a Z need to
be encountered before W becomes relevant
In this case, a Z has not appeared, so the W can
be discarded

38
Looking For Set

To help determine when elements are relevant, or
more importantly when they are not relevant (and
can be discarded), the χαος algorithm maintains a
looking-for set
The looking-for set L consists of the nodes and
levels that the χαος algorithm is looking for next
Which elements are looked for next is based on the
current level and the next corresponding node in the
x-dag

39
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞)}
Each entry is (v, l), where v is the tag/node name
and l is the level
Both Y and Z are descendants, so the level does not
matter
Matches are written as x-dag node → sequential
element ID

[Figure: x-dag of the example query]
40
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞)}
Each entry is (v, l), where v is the tag/node name
and l is the level
Both Y and Z are descendants, so the level does not
matter
The start X event fires; X is not in the
looking-for set, so it can be discarded.
41
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞), (U, 3)}
Y is in the looking-for set
(U, 3) is added, since the current level is 2 and
U must be a child of Y
Note: W is not added since Z has not occurred
yet.
42
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞)}
W is not in the looking-for set and can be discarded
(U, 3) is removed, since the current level is 3 and
the next event is either an element at level 4 or
an end element
Note: W is not added since Z has not occurred yet.
43
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞), (U, 3)}
W ends, returning to level 2; resume looking for
(U, 3)
Matches: Y → 2
44
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞), (W, ∞), (V, 4)}
Z matches the looking-for set
(V, 4) is added to the looking-for set
(W, ∞) is added to the looking-for set since both
Y and Z have been encountered
Matches: Y → 2, Z → 4
45
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞), (W, ∞)}
V on level 4 is in L
Matches: Y → 2, Z → 4, V → 5
46
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞), (W, ∞)}
W is in L
Matches: Y → 2, Z → 4, V → 5, W → 6
Even though the output node is reached, still
looking for Y/child::U
47
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞), (U, 3)}
Z has ended, so the next element is back at level 3
and U is looked for at that level
Matches: Y → 2, Z → 4, V → 5, W → 6
Even though the output node is reached, still
looking for Y/child::U
48
Simple Example

Looking-for set L for the following XML stream
<X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, ∞), (Z, ∞)}
U at level 3 found
Matches: Y → 2, U → 7, Z → 4, V → 5, W → 6
Total matching found. Add W to the
solutions, and continue looking for matches as
the document continues
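
The way the looking-for set evolved in this example can be reproduced with a short simulation. The sketch below is a deliberate simplification and not the algorithm from the paper: it only tracks L, not the matching-structure, it assumes every x-node label occurs once in the query, and it writes the "any level" marker as `math.inf`.

```python
import math

ANY = math.inf  # level marker for x-nodes reached through descendant edges

# x-dag of /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
X_DAG = [("Root", "descendant", "Y"), ("Root", "descendant", "Z"),
         ("Y", "child", "U"), ("Y", "descendant", "W"),
         ("Z", "descendant", "W"), ("Z", "child", "V")]


def looking_for(open_elems):
    """open_elems: stack of (x-node or None, level) for the open elements."""
    open_nodes = {node for node, _ in open_elems}
    innermost_node, innermost_level = open_elems[-1]
    wanted = set()
    for v in {dst for _, _, dst in X_DAG}:
        level, satisfied = ANY, True
        for src, axis, dst in X_DAG:
            if dst != v:
                continue
            if axis == "child":
                # a child can only start directly inside its open parent
                if src == innermost_node:
                    level = innermost_level + 1
                else:
                    satisfied = False
            elif src not in open_nodes:
                # every x-dag ancestor must currently be open
                satisfied = False
        if satisfied:
            wanted.add((v, level))
    return wanted


def trace(events):
    """Return the looking-for set as it stands after each event."""
    open_elems = [("Root", 0)]
    history = []
    for kind, tag in events:
        if kind == "start":
            level = open_elems[-1][1] + 1
            wanted = looking_for(open_elems)
            relevant = (tag, ANY) in wanted or (tag, level) in wanted
            # irrelevant elements are kept only as anonymous depth markers
            open_elems.append((tag if relevant else None, level))
        else:
            open_elems.pop()
        history.append(looking_for(open_elems))
    return history


# Events for <X><Y><W></W><Z><V /><W /></Z><U />
EVENTS = [("start", "X"), ("start", "Y"), ("start", "W"), ("end", "W"),
          ("start", "Z"), ("start", "V"), ("end", "V"), ("start", "W"),
          ("end", "W"), ("end", "Z"), ("start", "U")]
history = trace(EVENTS)
print(sorted(history[4], key=str))
```

Under these assumptions the set after each start or end event agrees with the L shown on the slides, for example {(Y, ∞), (Z, ∞), (W, ∞), (V, 4)} right after the start of Z.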
49
Incomplete Matches

The previous example showed a relatively
straight-forward successful example.
Items were added to the current
matching-structure, until all properties were
solved
This is not always the case
χαος takes an optimistic approach when adding
items to the matches list
As end tags occur, if not all of the required x-tree
nodes have been visited, items must then be
removed from the match list

50
Incomplete Matches Example

Consider the XML string
<X><Y><Z><W /></Z><U /></Y></X>
At the end tag of W, the looking-for set and the
matches are as follows
L = {(Y, ∞), (Z, ∞), (W, ∞), (V, 4)}
M = {Y → 2, Z → 3, W → 4}
One step later, at the end of Z, V can no longer be
a child of Z (id 3). Since Z (id 3) was added
optimistically, Z (id 3) and its children must
be removed from the match list
L = {(Y, ∞), (Z, ∞), (U, 3)}
M = {Y → 2}, with Z → 3 and W → 4 removed

51
Results

χαος was compared with Apache's Xalan on XMark
generated XML documents
χαος and Xalan are about even until the document
reaches about 100 MB in size
Beyond that point Xalan falls behind, since it
requires multiple traversals and cannot discard
processed XML nodes as quickly
Regardless of the document size, χαος discarded
99.8% of the elements encountered
This is primarily why the performance of χαος
remained steady (linear) regardless of the
document size

52
Overall Performance

χαος was compared with Apache's Xalan
Overall execution time: χαος performed about 25%
faster than Xalan.
Documents with 640,000 elements (6.7 MB): Xalan
52.28 seconds, χαος 39 seconds

53
XPath Performance

A comparison was then made, which excluded the
parsing time
Results showed that χαος outperformed Xalan by an
even larger margin than before
The performance gain is primarily a result of
avoiding unnecessary traversals

54
Optimizations / Extensions

Extensions to the χαος algorithm include support
for XPath expressions with multiple outputs
Handled by creating an x-dag as before with
multiple x-nodes marked as output nodes
Multiple output nodes may also be used to support
joins of XPath expressions
Optimize the storage for total matchings
If a branch of the x-dag does not contain an
output node, a true or satisfied value can
represent the subtree in the total matching,
rather than storing mappings for the entire
subtree

55
Conclusion

The χαος algorithm provides a very effective way
of processing XPath expressions with both
backward and forward axes
The ability to quickly discard non-relevant nodes
makes χαος very effective on extremely large XML
documents
Furthermore, the way in which the algorithm works
makes it very scalable for streaming XML data
By handling XML data and matches as they occur,
χαος has the ability to provide results as they
occur, before the end of the document/stream is
reached

56
References

C. Barton, P. Charles, D. Goyal, M. Raghavachari,
M. Fontoura, and V. Josifovski. Streaming XPath
Processing with Forward and Backward Axes. In
Proc. of ICDE, 2003.
Y. Diao, M. Altinel, M. Franklin, et al. Path
Sharing and Predicate Evaluation for
High-Performance XML Filtering. In TODS, pages
467-516, 2003.
