Title: XPath on Streaming XML using YFilter and χαος
1XPath on Streaming XML using YFilter and χαος
2YFilter
- XFilter is successful in filtering XML documents
- One scan of the document produces results for all user profiles simultaneously, based on each user's XPath queries
- As filtering systems are deployed on the Internet, the number of users can become very large
- However, there is likely to be significant commonality among user interests, and thus among their XPath expressions
- XFilter stores each user query in its own FSM
- Movement between the states of the various machines occurs as the document is processed
- An XFilter-like approach may result in redundant processing, something YFilter tries to avoid
3YFilter
- YFilter is based on the XFilter approach to filtering documents
- Event-driven XML document parsing
- YFilter aims to exploit commonality among XPath expressions by using a combined state machine to represent all path expressions
- By putting all the user queries into a combined state machine (a Nondeterministic Finite Automaton), common parts of XPath expressions can be shared among queries, reducing the amount of processing needed for each event
4YFilter Advantages
- YFilter's NFA approach has several advantages over its predecessors
- A relatively small number of states is required to represent a large number of XPath expressions
- Can support complicated document types (e.g., recursive nesting) and queries (e.g., multiple wildcards)
- The NFA is constructed and maintained incrementally
- The shared path matching used by YFilter's NFA approach has been shown to give significant performance improvements over the XFilter approach
5YFilter Disadvantages
- Creating an NFA to match value-based predicates would quickly explode the size of the NFA
- To deal with this, value-based predicates are not included in the NFA
- When a state containing a value-based predicate is reached, value-based predicate matching occurs separately
- Handling this separately allows different methods of value matching to be employed
6YFilter NFA
- YFilter represents all the user queries in a single Nondeterministic Finite Automaton (NFA)
- Transitions between the NFA states are labeled with location steps, so the NFA forms a trie over the location steps of the paths
- Common prefixes of the paths occur only once in the NFA
- Identical queries share the same path in the NFA, including the end state
- YFilter also uses a stack to deal with non-determinism
  - If a next state does not exist, the algorithm must backtrack to the previous states
- The YFilter NFA differs from a traditional NFA
  - YFilter must find all matching queries
  - This means the NFA execution must continue even after an end/accepting state has been reached
7Creating an NFA from XPath Expressions
- Adding XPath expressions to an existing NFA is a simple incremental process
- Combining the XPath expressions
  - /a
  - /b
[Figure: NFAs for /a and /b and the combined NFA, with transitions labeled a and b]
8Creating an NFA from XPath Expressions
- When inserting a new query/path into the NFA
- Traverse the current NFA for as long as it matches the new expression, until either
  - The accepting state of the new XPath query is reached
    - Make the final state an end/accepting state
    - Associate this query ID with the ending state
  - A state is reached where there is no transition to match the next step in the expression
    - Create a new branch from the last state reached in the combined NFA
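The insertion procedure above can be sketched as follows. This is a minimal illustration, assuming queries that use only child steps (no * or //); the names `State`, `insert_query` and `count_states` are hypothetical and are not taken from YFilter itself.

```python
class State:
    """One NFA state; names here are illustrative, not YFilter's own."""
    def __init__(self):
        self.transitions = {}  # location step label -> next State
        self.query_ids = []    # non-empty means this is an accepting state

def insert_query(root, query_id, path):
    """Insert a child-only path such as '/a/b/c' into the shared NFA."""
    state = root
    for step in (s for s in path.split("/") if s):
        # Follow an existing transition when one matches; otherwise branch.
        if step not in state.transitions:
            state.transitions[step] = State()
        state = state.transitions[step]
    state.query_ids.append(query_id)  # mark accepting, remember the query
    return state

def count_states(root):
    """Count all states reachable from root (the structure is a trie)."""
    return 1 + sum(count_states(s) for s in root.transitions.values())

root = State()
for qid, path in [("Q1", "/a/b"), ("Q2", "/a/c"),
                  ("Q3", "/a/b/c"), ("Q8", "/a/b/c")]:
    insert_query(root, qid, path)
```

In this small run the ten location steps of the four queries end up in five states, and the identical queries Q3 and Q8 share one accepting state.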
10Creating an NFA from XPath Expressions
- The wildcard (*) and descendant (//) operators create the non-determinism and are handled specially when creating the NFA
- A wildcard step requires 2 edges in the NFA: one labeled with the wildcard (*) and the other with the input that will follow
- Descendants are handled by looping, as the NFA must be able to move to the next state while also staying in the current one
- Example: //a
11YFilter NFA Example
- Example: 8 XPath queries
- Q1 = /a/b
- Q2 = /a/c
- Q3 = /a/b/c
- Q4 = /a//b/c
- Q5 = /a/*/c
- Q6 = /a//c
- Q7 = /a/*/*/c
- Q8 = /a/b/c
[Figure: combined NFA for Q1-Q8] Darker states represent shared states. Bold-outlined states represent accepting (end) states. The 22 XPath nodes of the queries are represented in 13 states.
12Executing the NFA
- The NFA execution uses a hash-table-based approach for keeping track of states
- Each state in the hash table includes
  - A state ID
  - Type information (an accepting end state, or a // descendant state)
  - A small hash table holding the transitions from this state
  - The IDs of the associated user queries, for accepting states
- In addition, the NFA execution uses a stack mechanism that can track multiple active states
- The stack keeps track of the current state ID as well as a set of target states
- When an end element event occurs, the stack backtracks to the states that were active at the time of the previous start element event
13Executing the NFA
- Start Element Event Handler
- When a new element is read from the document, the NFA follows the transitions from all currently active states
- For each active state there are 4 checks
  - The incoming element name is looked up in the state's hash table
    - If it exists, the corresponding state ID is added to the target states list
  - The * symbol is looked up in the hash table
    - If it exists, the corresponding state ID is added to the target states list
  - If the state is a descendant (//) state, the state itself is added to the target states list
    - This implements the loop
  - The hash table is checked for the ε symbol
    - If it exists, the descendant state it leads to is processed recursively according to the previous 3 rules
- Once all active states have been checked, the list of target states is pushed onto the stack
- These states become active for the next start element event
14Executing the NFA
- End Element Event Handler
- When an end element event occurs, the NFA backtracks by popping the top set of states from the stack
- Once popped, the new top of the stack represents the active states
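The start and end element handlers can be sketched together as below. This is a simplified illustration under stated assumptions, not YFilter's implementation: the `NFA` class, the `EPS` label and the `run` function are hypothetical names, events are plain `(kind, name)` tuples rather than real parser callbacks, and value-based predicates are ignored. A // step is encoded as an ε-transition into a state flagged as a descendant state, whose self-loop is applied by check 3.

```python
import re

EPS = "ε"  # label of the ε-transition into a //-state (assumed encoding)

class NFA:
    def __init__(self):
        self.trans = [{}]             # state id -> {symbol: target state id}
        self.is_descendant = [False]  # state id -> True for a //-state
        self.accepts = [[]]           # state id -> ids of queries accepted here

    def _step(self, state, symbol, descendant=False):
        # Reuse an existing transition (shared prefix) or create a new state.
        if symbol not in self.trans[state]:
            self.trans.append({})
            self.is_descendant.append(descendant)
            self.accepts.append([])
            self.trans[state][symbol] = len(self.trans) - 1
        return self.trans[state][symbol]

    def add_query(self, query_id, path):
        state = 0
        for axis, name in re.findall(r"(//|/)([^/]+)", path):
            if axis == "//":
                state = self._step(state, EPS, descendant=True)
            state = self._step(state, name)  # name may be the wildcard "*"
        self.accepts[state].append(query_id)

def run(nfa, events):
    """Return the ids of all queries matched by a stream of (kind, name) events."""
    stack = [[0]]  # each entry is the list of active states at one depth
    matched = set()
    for kind, name in events:
        if kind == "end":
            stack.pop()  # backtrack to the states active before this element
            continue
        targets = []

        def follow(state):
            for symbol in (name, "*"):           # checks 1 and 2
                target = nfa.trans[state].get(symbol)
                if target is not None and target not in targets:
                    targets.append(target)
            if nfa.is_descendant[state] and state not in targets:
                targets.append(state)            # check 3: the // self-loop
            eps_target = nfa.trans[state].get(EPS)
            if eps_target is not None:           # check 4: recurse into //-state
                follow(eps_target)

        for state in stack[-1]:
            follow(state)
        for target in targets:
            matched.update(nfa.accepts[target])  # keep running after accepting
        stack.append(targets)
    return matched

nfa = NFA()
for qid, path in [("Q1", "/a/b"), ("Q2", "/a/c"), ("Q4", "/a//b/c"),
                  ("Q6", "/a//c"), ("Q7", "/a/*/*/c")]:
    nfa.add_query(qid, path)

# Events for the document <a><b><c/></b></a>
events = [("start", "a"), ("start", "b"), ("start", "c"),
          ("end", "c"), ("end", "b"), ("end", "a")]
result = run(nfa, events)
```

For this document the sketch reports Q1, Q4 and Q6. Q2 and Q7 do not match because c is neither a child of a nor three levels below it.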
15YFilter NFA Efficiency
- Using a shared NFA results in a machine with fewer states
- The NFA could have performance problems due to the need to support multiple transitions from each state
- The NFA could be converted to a DFA; however, this would result in scalability problems, as the number of states would quickly explode
- Despite this concern, experimental results show NFA performance not to be an issue
- YFilter's evaluation is sufficiently fast
- In many cases the cost of parsing the XML document was more significant than the cost of processing the NFA
16Experimental Results
- Experiments were performed on a number of random queries, comparing XFilter, YFilter, and a hybrid of the two
[Figure: performance on distinct queries]
[Figure: performance on queries containing duplicates]
MultiQuery Processing Time (MQPT) = filtering time - document parsing time
17YFilter Conclusion
- The experimental results show that YFilter performs much better as the number of user queries grows
- As the number of distinct queries grows, YFilter significantly outperforms XFilter
- These results show that path sharing via an NFA provides significant performance improvements over traditional methods of XML document filtering
18Additional Information
- Not mentioned in the presentation are the various methods by which YFilter may handle value-based predicates
- Value-based predicates are not handled in the NFA, and thus were excluded for simplicity
- The results also show a hybrid approach that combines XFilter and YFilter
- In the hybrid approach, rather than creating a single NFA, XFilter-like FSMs are created for XPaths/queries with similar prefixes (or substrings)
- The hybrid approach is quite similar to the XTrie algorithm discussed in the previous presentation
19YFilter Limitations
- Although YFilter shows significant improvements over many of its predecessors, there are still a few areas where it is lacking
- YFilter only supports forward axes (descendant, child) and not backward axes (ancestor, parent)
- Supporting backward axes in an efficient manner can be a tricky problem, to be discussed next
20Streaming XPath with Forward and Backward Axes (χαος)
- Main focus is on the efficiency of the XPath engine
- XPath provides the basis for a number of XML tools
- SQLX, XSLT, XQuery, etc.
- Because so much has been built on XPath, an efficient method to process streaming XML is necessary
21χαος Goals
- χαος aims to provide significant improvements by handling XML data somewhat differently from its predecessors
- Goals
  - Allow for efficient processing regardless of document size
  - Process the document in a streaming manner
  - Process queries / filtering as the document is parsed
  - Only 1 pass through the document, visiting each element only once
  - Support forward (descendant/child) and backward (ancestor/parent) axes
22Existing XPath Engines
- Most existing XPath engines require the entire document to be in memory
  - This is too expensive for large/streaming XML documents
- Previous streaming XPath engines (XFilter, XTrie, YFilter, etc.) support only forward axes
- Many XPath processors require more than one pass through the document for axis processing
  - This is also too expensive for extremely large documents
23Example
- Example: the Apache Xalan processor on the expression /descendant::x/ancestor::y (selects all y ancestors of x elements)
- Xalan traverses the document once to find all the x elements
- For each x element, it visits the ancestors looking for a matching y element
- This process means Xalan visits some elements of the document more than once
- This is very costly for extremely large XML documents
24Example (continued)
- By eliminating the second (or third) traversal, χαος is often able to discard unnecessary nodes sooner, saving memory
- The χαος algorithm deals with backward axes efficiently by converting them to forward axes before processing the data
25Components
- The χαος algorithm is based on several key components
- An XPath expression is represented as an x-tree and an x-dag
- An XML document or XML stream is read and parsed using an event-based parser (such as SAX)
- Each parsed XML element is given a sequential ID and a level
- XML elements that match the XPath expression are stored in a matching-structure
26X-Tree Construction
- The χαος algorithm maps each XPath node to an x-node in the x-tree
- The start of each x-tree is labeled Root
- Following the Root node, x-tree nodes are labeled with the tag name of each node in the XPath expression
- Edges are labeled with the axis specifier (child, descendant, parent, ancestor)
- The rightmost node in the XPath expression that is not contained in a predicate is labeled as the output node
27X-Tree Construction
- Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
[Figure: x-tree with edges Root -descendant-> Y, Y -child-> U, Y -descendant-> W, W -ancestor-> Z, Z -child-> V]
W is marked as the output node
28X-dag
- Once the x-tree is created, the χαος algorithm uses a directed acyclic graph representation of the XPath query, called an x-dag
- The x-dag is obtained from the x-tree by converting the ancestor and parent constraints to descendant and child constraints
- The x-dag is a directed, labeled graph G with the same set of vertices as the previously defined x-tree T. Edges in the x-dag are defined as follows
29X-dag Construction
- Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
- Edges in the x-tree T labeled child or descendant are also edges in the x-dag G
[Figure: x-tree with edges Root -descendant-> Y, Y -child-> U, Y -descendant-> W, W -ancestor-> Z, Z -child-> V]
30X-dag Construction
- Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
- Edges in the x-tree T labeled child or descendant are also edges in the x-dag G
- For each edge in T labeled ancestor, G contains an edge joining the same nodes, but with the direction reversed and the label changed to descendant (similarly, parent edges become child edges)
[Figure: intermediate graph with edges Root -descendant-> Y, Y -child-> U, Y -descendant-> W, Z -descendant-> W, Z -child-> V]
31X-dag Construction
- Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
- Edges in the x-tree T labeled child or descendant are also edges in the x-dag G
- For each edge in T labeled ancestor, G contains an edge joining the same nodes, but with the direction reversed and the label changed to descendant (similarly, parent edges become child edges)
- For any non-root x-node v in G without an incoming edge, add a descendant edge from Root to v
[Figure: x-dag with edges Root -descendant-> Y, Root -descendant-> Z, Y -child-> U, Y -descendant-> W, Z -descendant-> W, Z -child-> V]
32X-dag Construction
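- Example: /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
- X-tree: Root -descendant-> Y, Y -child-> U, Y -descendant-> W, W -ancestor-> Z, Z -child-> V
- Resulting x-dag: Root -descendant-> Y, Root -descendant-> Z, Y -child-> U, Y -descendant-> W, Z -descendant-> W, Z -child-> V
[Figure: the x-tree shown next to the resulting x-dag]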
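The three x-dag construction rules can be sketched in a few lines. This is an illustrative sketch only: representing an edge as a `(source, target, label)` tuple and the names `XTREE_EDGES` and `xtree_to_xdag` are assumptions made here, not notation from the χαος paper.

```python
# x-tree of /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
# as (source, target, axis label) tuples.
XTREE_EDGES = [
    ("Root", "Y", "descendant"),
    ("Y", "U", "child"),
    ("Y", "W", "descendant"),
    ("W", "Z", "ancestor"),
    ("Z", "V", "child"),
]

# Backward axes and the forward axes they turn into once reversed.
REVERSED_AXIS = {"ancestor": "descendant", "parent": "child"}

def xtree_to_xdag(edges, root="Root"):
    dag = []
    for source, target, axis in edges:
        if axis in REVERSED_AXIS:
            # Rule 2: reverse the edge and relabel it with the forward axis.
            dag.append((target, source, REVERSED_AXIS[axis]))
        else:
            # Rule 1: child and descendant edges are kept as they are.
            dag.append((source, target, axis))
    nodes = {n for source, target, _ in edges for n in (source, target)}
    with_incoming = {target for _, target, _ in dag}
    for node in sorted(nodes):
        # Rule 3: non-root nodes left without an incoming edge hang off Root.
        if node != root and node not in with_incoming:
            dag.append((root, node, "descendant"))
    return dag

xdag = xtree_to_xdag(XTREE_EDGES)
```

Applied to the example, the ancestor edge from W to Z becomes a descendant edge from Z to W, and Z, which then has no incoming edge, receives a descendant edge from Root.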
33Reading the XML
- XML is parsed using an event-driven parser (such as SAX)
- χαος focuses on start element and end element events
- The XML document is parsed depth-first
- Each element visited is assigned a sequential ID that uniquely identifies that element in the document
- As the parser visits each node it records the current level, or depth
- The level is needed for determining the child/parent axes, as a child must be located exactly 1 level deeper than its parent
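A minimal sketch of this numbering step, using Python's standard `xml.sax` module. The handler name `IdLevelHandler` and the helper `number_elements` are hypothetical; the sketch only records the (tag, ID, level) triples and does no matching.

```python
import xml.sax

class IdLevelHandler(xml.sax.ContentHandler):
    """Assigns each element a sequential ID and records its level (depth)."""
    def __init__(self):
        super().__init__()
        self.next_id = 1
        self.level = 0
        self.seen = []  # (tag, id, level) in document order

    def startElement(self, name, attrs):
        self.level += 1  # entering an element goes one level deeper
        self.seen.append((name, self.next_id, self.level))
        self.next_id += 1

    def endElement(self, name):
        self.level -= 1  # leaving an element returns to the parent's level

def number_elements(xml_text):
    handler = IdLevelHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.seen

numbered = number_elements("<X><Y><W></W><Z><V/><W/></Z><U/></Y></X>")
```

For this input the two W elements receive IDs 3 and 6, at levels 3 and 4 respectively, and U receives ID 7 at level 3, one level deeper than its parent Y at level 2.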
34Matchings
- As the SAX events occur, the χαος algorithm tries to match elements of the current document with the XPath expression, more specifically with x-nodes in the x-tree
- Formally, a partial matching of x-nodes from x-tree T to elements of document D is a mapping m: V_T → V_D that satisfies the following
  - All mapped vertices satisfy the node test: for all x-nodes v in domain(m), label(v) = tag(m(v))
  - For all x-nodes v1 and v2 connected by an edge in T such that v1, v2 are in domain(m), (v1, m(v1)) is consistent with (v2, m(v2))
- In other words, in a partial matching each x-node label must match the tag of the element it maps to, and the relationship between 2 x-nodes must also hold between the 2 mapped elements of the XML document
35Total Matchings (Overview)
- A matching at an x-node v is total if its domain contains all the vertices of the subtree rooted at v
- The result is the collection of total matchings at the Root
- Computing the total matchings for the entire expression involves collecting the matches as the document tree and x-tree are traversed
36Looking for Total Matchings
- An element e is relevant if there exists some document completion where e participates in a total matching at the Root
- All relevant elements must be processed
- As events occur, new relevant elements may appear, while others may no longer be relevant
- The x-dag is used to determine which elements are relevant
37Looking for Total Matchings
- The x-dag is useful since it orders the nodes in the order they should appear in the document
- Example: given the XML string
- <X><Y><W />
- Since the document is processed depth-first, by the time the start element event for W is reached, the start element events for all of W's ancestors have already occurred
- According to the x-dag, both a Y and a Z need to be encountered before W becomes relevant
- In this case a Z has not appeared, so the W can be discarded
38Looking For Set
- To help determine when elements are relevant, or more importantly when they are not relevant (and can be discarded), the χαος algorithm maintains a looking-for set
- The looking-for set L consists of the nodes and levels that the χαος algorithm is looking for next
- Which elements are looked for next is based on the current level and the next corresponding node in the x-dag
39Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *)}
- Each entry is (v, l), where v is the tag/node name and l is the level
- Both Y and Z are descendants of Root, so the level does not matter
- Matches map an x-dag node → sequential element ID
[Figure: x-dag with edges Root -descendant-> Y, Root -descendant-> Z, Y -child-> U, Y -descendant-> W, Z -descendant-> W, Z -child-> V]
40Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *)}
- Each entry is (v, l), where v is the tag/node name and l is the level
- Both Y and Z are descendants of Root, so the level does not matter
- The start X event fires; X is not in the looking-for set, so it can be discarded
41Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *), (U, 3)}
- Y is in the looking-for set
- (U, 3) is added, since the current level is 2 and U must be a child of Y
- Note: W is not added, since Z has not occurred yet
42Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *)}
- W is not in the looking-for set and can be discarded
- (U, 3) is removed: the current level is now 3, so the next event is either a level-4 element or an end element
- Note: W is not added, since Z has not occurred yet
43Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *), (U, 3)}
- W ends, returning to level 2; resume looking for (U, 3)
- Matches: Y → 2
44Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *), (W, *), (V, 4)}
- Z matches the looking-for set; (V, 4) is added to the looking-for set
- (W, *) is added to the looking-for set, since both Y and Z have been encountered
- Matches: Y → 2, Z → 4
45Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *), (W, *)}
- V on level 4 is in L
- Matches: Y → 2, Z → 4, V → 5
46Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *), (W, *)}
- W is in L
- Matches: Y → 2, Z → 4, V → 5, W → 6
- Even though the output node has been reached, the algorithm is still looking for Y/child::U
47Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *), (U, 3)}
- Z ended, back at level 2, looking for U at level 3
- Matches: Y → 2, Z → 4, V → 5, W → 6
- Even though the output node has been reached, the algorithm is still looking for Y/child::U
48Simple Example
- Looking-for set L for the following XML stream
- <X><Y><W></W><Z><V /><W /></Z><U />
- L = {(Y, *), (Z, *)}
- U at level 3 found
- Matches: Y → 2, U → 7, Z → 4, V → 5, W → 6
- Total matching found. Add W to the solutions, and continue looking for matches as the document continues
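The walkthrough above can be reproduced with a small simulation. This is a deliberately simplified sketch, not the χαος algorithm: it hard-codes the example x-dag, assumes an x-node is looked for exactly when every x-dag parent has a matched element that is currently open (with a child edge also requiring that parent to be the innermost open element), and it omits the matching-structure and the removal of optimistic matches. All names (`XDAG`, `looking_for`, `simulate`, `EVENTS`) are hypothetical.

```python
# x-dag of the running example: x-node -> list of (parent x-node, edge label)
XDAG = {
    "Y": [("Root", "descendant")],
    "Z": [("Root", "descendant")],
    "U": [("Y", "child")],
    "W": [("Y", "descendant"), ("Z", "descendant")],
    "V": [("Z", "child")],
}

def looking_for(open_elems):
    """open_elems: list of (tag, matched) for open elements, outermost first."""
    depth = len(open_elems)
    result = set()
    for node, parents in XDAG.items():
        level = "*"  # "*" means any level
        for parent, axis in parents:
            if parent == "Root":
                levels = [0]
            else:
                levels = [i + 1 for i, (tag, ok) in enumerate(open_elems)
                          if ok and tag == parent]
            if not levels or (axis == "child" and depth not in levels):
                break  # this parent constraint cannot be met right now
            if axis == "child":
                level = depth + 1  # a child sits one level below its parent
        else:
            result.add((node, level))
    return result

def simulate(events):
    """Return (matches, trace): matched (tag, id) pairs and L after each event."""
    open_elems, matches, trace, next_id = [], [], [], 1
    for kind, tag in events:
        if kind == "start":
            wanted = looking_for(open_elems)
            level = len(open_elems) + 1
            ok = (tag, "*") in wanted or (tag, level) in wanted
            if ok:
                matches.append((tag, next_id))
            open_elems.append((tag, ok))
            next_id += 1  # every element gets an ID, matched or not
        else:
            open_elems.pop()
        trace.append(looking_for(open_elems))
    return matches, trace

# Events for <X><Y><W></W><Z><V /><W /></Z><U />
EVENTS = [("start", "X"), ("start", "Y"), ("start", "W"), ("end", "W"),
          ("start", "Z"), ("start", "V"), ("end", "V"), ("start", "W"),
          ("end", "W"), ("end", "Z"), ("start", "U"), ("end", "U")]
matches, trace = simulate(EVENTS)
```

Under these assumptions the looking-for set after each start element event agrees with the sets shown in the walkthrough, and the recorded matches are Y → 2, Z → 4, V → 5, W → 6, U → 7.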
49Incomplete Matches
- The previous example showed a relatively straightforward successful case
- Items were added to the current matching-structure until all constraints were satisfied
- This is not always the case
- χαος takes an optimistic approach when adding items to the match list
- As end tags occur, if not all the required x-tree nodes have been visited, items must be removed from the match list
50Incomplete Matches Example
- Consider the XML string
- <X><Y><Z><W /></Z><U /></Y></X>
- At the end tag of the W, the looking-for set and matches are as follows
  - L = {(Y, *), (Z, *), (W, *), (V, 4)}
  - M = {Y → 2, Z → 3, W → 4}
- One step later, at the end of Z, V can no longer be a child of Z (id 3). Since Z (id 3) was added optimistically, Z (id 3) and its children must be removed from the match list
  - L = {(Y, *), (Z, *), (U, 3)}
  - M = {Y → 2} (Z → 3 and W → 4 are removed)
51Results
- χαος was compared with Apache's Xalan on XMark-generated XML documents
- χαος and Xalan are about even until the document reaches about 100 MB in size, after which Xalan falls behind, since it requires multiple traversals and cannot discard processed XML nodes as quickly
- Regardless of the document size, χαος discarded 99.8% of the elements encountered
- This is primarily why the performance of χαος remained steady (linear) regardless of the document size
52Overall Performance
- χαος was compared with Apache's Xalan
- Overall execution time: χαος performed about 25% faster than Xalan
- For documents with 640,000 elements (6.7 MB): Xalan 52.28 seconds, χαος 39 seconds
53XPath Performance
- A comparison was then made that excluded the parsing time
- Results showed χαος outperforming Xalan by a wider margin than before
- The performance gain is primarily a result of avoiding unnecessary traversals
54Optimizations / Extensions
- Extensions to the χαος algorithm include support for XPath expressions with multiple outputs
  - Handled by creating an x-dag as before, with multiple x-nodes marked as output nodes
  - Multiple output nodes may also be used to support joins of XPath expressions
- Optimize the storage for total matchings
  - If a branch of the x-dag does not contain an output node, a true or satisfied value can represent the subtree in the total matching, rather than storing mappings for the entire subtree
55Conclusion
- The χαος algorithm provides a very effective way of processing XPath expressions with both backward and forward axes
- The ability to quickly discard non-relevant nodes makes χαος very effective on extremely large XML documents
- Furthermore, the way in which the algorithm works makes it very scalable for streaming XML data
- By handling XML data and matches as they occur, χαος is able to provide results before the end of the document/stream is reached
56References
- C. Barton, P. Charles, D. Goyal, M. Raghavachari, M. Fontoura, and V. Josifovski. Streaming XPath Processing with Forward and Backward Axes. In Proc. of ICDE, 2003.
- Y. Diao, M. Altinel, M. Franklin, et al. Path Sharing and Predicate Evaluation for High-Performance XML Filtering. In TODS, pages 467-516, 2003.