XPath on Streaming XML using YFilter and χαος

1
XPath on Streaming XML using YFilter and χαος
  • Daniel Cairney

2
YFilter
  • XFilter is successful in filtering XML documents
  • One scan of the document results in simultaneous
    results for different user profiles, based on the
    user-specific XPath queries
  • As filtering systems are deployed on the Internet,
    the number of users can become very large
  • However, there is likely to be significant
    commonality among user interests and thus their
    XPath expressions
  • XFilter stores each user query in its own finite
    state machine (FSM)
  • Movement between the states of the various
    machines occurs as the document is processed
  • An XFilter-like approach may result in redundant
    processing, something YFilter tries to avoid

3
YFilter
  • YFilter is based on the XFilter approach to
    filtering documents
  • Event-driven XML document parsing
  • YFilter aims to exploit common XPath expressions
    by using a combined state machine to represent
    all path expressions
  • By putting all the user queries into a combined
    state machine (a Nondeterministic Finite
    Automaton, NFA), common XPath expressions among
    queries can be shared, reducing the amount of
    processing needed for each event

4
YFilter Advantages
  • YFilter's NFA approach has several advantages
    over its predecessors
  • Relatively small number of states are required to
    represent a large number of XPath expressions
  • Can support complicated document types (recursive
    nesting and queries, multiple wildcards, etc.)
  • The NFA is constructed and maintained
    incrementally
  • The shared path matching used by YFilter's NFA
    approach has been shown to yield significant
    performance improvements over the XFilter approach

5
YFilter Disadvantages
  • Creating an NFA to match value-based predicates
    would quickly explode the size of the NFA
  • To deal with this, value-based predicates are not
    included in the NFA
  • When a state containing a value-based predicate
    is reached, value-based predicate matching occurs
    separately
  • Because this is handled separately, different
    methods of value matching can be employed
6
YFilter NFA
  • YFilter represents all the user queries in a
    single Nondeterministic Finite Automaton (NFA)
  • The labeled transitions between the NFA states
    form a trie over the location steps of the paths
  • Common prefixes of the paths only occur once in
    the NFA
  • Identical queries share the same path in the NFA,
    including the end state
  • YFilter also uses a stack to deal with
    non-determinism
  • If a next state does not exist, the algorithm
    must backtrack to the previous states
  • The YFilter NFA differs from a traditional NFA
  • YFilter must find all matching queries
  • This means the NFA execution must continue even
    after an end/accepting state has been reached

7
Creating an NFA from XPath Expressions
  • Adding XPath expressions to an existing NFA is a
    simple incremental process
  • Combining the XPath Expressions
  • /a
  • /b

[Figure: the NFAs for /a and /b and the combined NFA, with transitions labeled a and b]
8
Creating an NFA from XPath Expressions
  • When inserting a new query/path into the NFA,
    traverse the current NFA for as long as it
    matches the new expression, until one of the
    following happens
  • The end of the new XPath query is reached: make
    the state reached an end/accepting state and
    associate the query ID with it
  • A state is reached where there is no transition
    to match the next step in the expression: create
    a new branch from the last state reached in the
    combined NFA
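
The insertion steps above can be sketched as follows. This is a minimal illustration restricted to child (/) steps, not YFilter's actual code; the names State, insert_query and count_states are made up for the example.

```python
class State:
    """One NFA state: outgoing transitions plus the queries accepted here."""

    def __init__(self):
        self.transitions = {}  # element name -> State
        self.query_ids = []    # non-empty only for accepting states


def insert_query(root, query_id, path):
    """Insert a path such as '/a/b/c' (child steps only) into the shared NFA."""
    state = root
    for step in path.strip("/").split("/"):
        # Follow an existing transition while one matches; otherwise branch.
        if step not in state.transitions:
            state.transitions[step] = State()
        state = state.transitions[step]
    # The state reached by the last step becomes an accepting state.
    state.query_ids.append(query_id)
    return state


def count_states(state):
    """Count the states reachable from `state`, including itself."""
    return 1 + sum(count_states(s) for s in state.transitions.values())


root = State()
for qid, path in [("Q1", "/a/b"), ("Q2", "/a/c"), ("Q3", "/a/b/c"), ("Q8", "/a/b/c")]:
    insert_query(root, qid, path)

print(count_states(root))  # 5 states: start, a, a/b, a/c, a/b/c
print(root.transitions["a"].transitions["b"].transitions["c"].query_ids)
```

The common prefix /a is stored once, and the two identical queries end in the same accepting state, which simply carries both query IDs.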

10
Creating an NFA from XPath Expressions
  • The wildcard (*) and descendant (//) operators
    create the non-determinism and are handled
    specially when creating the NFA
  • Wildcards require 2 edges in the NFA, one marked
    by the wildcard (*) and the other by the input
    that will follow
  • Descendants are handled by looping, as the NFA
    must be able to move to the next state while
    also staying in the current one
  • Example: //a

11
YFilter NFA Example
  • Example: 8 XPath queries
  • Q1: /a/b
  • Q2: /a/c
  • Q3: /a/b/c
  • Q4: /a//b/c
  • Q5: /a/*/c
  • Q6: /a//c
  • Q7: /a/*/*/c
  • Q8: /a/b/c

[Figure: combined NFA for Q1 to Q8. Darker states represent shared states; bold outlined states represent accepting (end) states. 22 XPath nodes are represented in 13 states.]
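
The state count on this slide can be checked with a small sketch. It assumes the wildcards stripped from the query listing are restored as Q5 = /a/*/c and Q7 = /a/*/*/c, and it models a // step as an ε-transition into a self-looping state. The class and function names are illustrative, not YFilter's own.

```python
import re


class State:
    def __init__(self, descendant=False):
        self.transitions = {}         # symbol -> State; a tag name, "*" or "ε"
        self.descendant = descendant  # True for a //-state, which loops on itself
        self.query_ids = []


def steps(path):
    """Split '/a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", path)


def insert_query(root, query_id, path):
    state = root
    for axis, name in steps(path):
        if axis == "//":
            # Descendant step: ε-transition into a state that loops on itself.
            if "ε" not in state.transitions:
                state.transitions["ε"] = State(descendant=True)
            state = state.transitions["ε"]
        if name not in state.transitions:
            state.transitions[name] = State()
        state = state.transitions[name]
    state.query_ids.append(query_id)


def count_states(state):
    return 1 + sum(count_states(s) for s in state.transitions.values())


QUERIES = {
    "Q1": "/a/b", "Q2": "/a/c", "Q3": "/a/b/c", "Q4": "/a//b/c",
    "Q5": "/a/*/c", "Q6": "/a//c", "Q7": "/a/*/*/c", "Q8": "/a/b/c",
}

root = State()
for qid, path in QUERIES.items():
    insert_query(root, qid, path)

total_steps = sum(len(steps(path)) for path in QUERIES.values())
print(total_steps, count_states(root))  # 22 13
```

Under these assumptions the 22 location steps collapse into 13 states, the start state included.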
12
Executing the NFA
  • The NFA execution uses a hash-table-based
    approach for keeping track of states
  • Each state in the hash table includes
  • A state ID
  • Type information (an accepting end state, or a //
    descendant)
  • A small hash table which includes transitions
    from this state
  • The ID of the associated user query for accepting
    states
  • In addition, the NFA execution uses a stack
    mechanism capable of tracking multiple tasks
  • The stack keeps track of the current state ID as
    well as a set of target states
  • When the end element event occurs, the stack
    backtracks to the state at the time of the
    previous start element event

13
Executing the NFA
  • Start Element Event Handler
  • When a new element is read from the document, the
    NFA follows the transitions from all currently
    active states
  • For each active state there are 4 checks
  • The incoming element name is looked up in the
    state hash table
  • If it exists, the state ID is added to the
    target states list
  • The * symbol is looked up in the hash table
  • If it exists, the state ID is added to the
    target states list
  • If the state is a descendant (//) state, the
    current state is added to the target states list
  • This implements the loop
  • The hash table is checked for the ε symbol
  • If it exists, the descendant state is processed
    recursively according to the previous 3 rules
  • Once all active states have been checked, the
    list of target states is pushed onto the stack
  • These states become active for the next start
    event

14
Executing the NFA
  • End Element Event Handler
  • When the end element occurs, the NFA backtracks
    by popping the top set of states from the stack
  • Once popped, the new top of the stack
    represents the active states

15
YFilter NFA Efficiency
  • Using a shared NFA results in a machine with
    fewer states
  • The NFA could have performance problems due to
    the need to support multiple transitions from
    each state
  • The NFA could be converted to a DFA; however,
    this would result in scalability problems, as
    the number of states would quickly explode
  • Despite this concern, experimental results show
    NFA performance not to be an issue
  • YFilter's evaluation is sufficiently fast
  • In many cases the cost of parsing the XML
    document was more significant than processing the
    NFA

16
Experimental Results
  • Experiments were performed on a number of random
    queries comparing XFilter, YFilter, and a hybrid
    of the two

[Figures: performance on distinct queries; performance on queries containing duplicates]
Multi-Query Processing Time (MQPT) = filtering time - document parsing time
17
YFilter Conclusion
  • The statistical results show that YFilter
    performs much better as the number of user
    queries grows
  • As the number of distinct queries grows, YFilter
    significantly outperforms XFilter
  • These results show that path sharing via an NFA
    provides significant performance improvements over
    traditional methods of XML document filtering

18
Additional Information
  • Not mentioned in the presentation are the various
    methods in which YFilter may handle value-based
    predicates
  • Value-based predicates are not handled in the
    NFA, thus were excluded for simplicity
  • The results also show a hybrid approach that is a
    combination of XFilter and YFilter
  • In the hybrid approach, rather than creating a
    single NFA, XFilter-like FSMs are created for
    XPaths/queries with similar prefixes (or
    substrings)
  • The hybrid approach is quite similar to the XTrie
    algorithm discussed in the previous presentation

19
YFilter Limitations
  • Although YFilter shows significant improvements
    over many of its predecessors, there are still a
    few areas where it is lacking
  • YFilter only supports forward axes (descendant,
    child) and not backward axes (ancestor, parent)
  • Supporting backward axes in an efficient manner
    can be a tricky problem, to be discussed next

20
Streaming XPath with Forward and Backward Axes
(χαος)
  • Main focus on the efficiency of the XPath engine
  • XPath provides the basis for a number of XML
    tools
  • SQLX, XSLT, XQuery, etc.
  • Because so much has been built on XPath, an
    efficient method to process streaming XML is
    necessary

21
χαος Goals
  • χαος aims to provide significant improvements by
    handling XML data somewhat differently from its
    predecessors
  • Goals
  • Allow for efficient processing regardless of
    document size
  • Process the document in a streaming manner
  • Process queries / filtering as the document is
    parsed
  • Only 1 pass through the document, visiting each
    element only once.
  • Support forward (descendant/child) and backwards
    (ancestor/parent) axes

22
Existing XPath Engines
  • Most existing XPath engines require the entire
    document to be in memory
  • This is too expensive for large/streaming XML
    documents
  • Previous XPath processing engines only support
    forward axes (XFilter, XTrie, YFilter, etc)
  • Many XPath processors require more than one pass
    through the document for axis processing
  • This is also too expensive for extremely large
    documents

23
Example
  • Example: the Apache Xalan processor on the
    expression /descendant::x/ancestor::y (selects
    all y ancestors of x elements)
  • Xalan traverses the document once to find all the
    x elements
  • For each x element, visit the ancestors looking
    for a matching y element
  • This process means Xalan visits some elements of
    the document more than once
  • This is very costly for extremely large XML
    documents

24
Example (continued)
  • By eliminating the second (or third) traversal,
    χαος is often able to discard unnecessary nodes
    sooner, saving on memory
  • The χαος algorithm is able to deal efficiently
    with backward axes by converting them to forward
    axes before processing the data

25
Components
  • The χαος algorithm is based on several key
    components
  • An XPath expression is represented as an x-tree
    and an x-dag
  • An XML document or XML stream is read and parsed
    using an event based parser (such as SAX)
  • Each parsed XML element is given a sequential ID
    and a level
  • XML Elements which match the XPath expression are
    stored in a Matching-Structure

26
X-Tree Construction
  • The χαος algorithm maps each node in the XPath
    expression to an x-node in the x-tree
  • The start of each x-tree is labeled Root
  • Following the Root node, x-tree nodes are
    labeled with the tag name of each node in the
    XPath expression
  • Edges are labeled with the axis specifier (child,
    descendant, parent, ancestor)
  • The rightmost node in the XPath expression that
    is not contained in a predicate is labeled as
    the output node

27
X-Tree Construction
  • Example:
    /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]

[Figure: x-tree for the example. Root -descendant-> Y; Y -child-> U; Y -descendant-> W; W -ancestor-> Z; Z -child-> V. W is marked as the output node.]
28
X-dag
  • Once the x-tree is created, the χαος algorithm
    uses a directed acyclic graph representation
    of the XPath query, called an x-dag
  • The x-dag is obtained from the x-tree
    representation by converting the ancestor and
    parent constraints to descendant and child
    constraints
  • The x-dag is a directed, labeled graph G with the
    same set of vertices as the previously defined
    x-tree T. Edges in the x-dag are defined as
    follows:

29
X-dag Construction
  • Example:
    /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
  • Edges in the x-tree T labeled child or descendant
    are also edges in the x-dag G

[Figure: step 1. Root -descendant-> Y; Y -child-> U; Y -descendant-> W; Z -child-> V. The W -ancestor-> Z edge is not yet converted.]
30
X-dag Construction
  • Example:
    /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
  • Edges in the x-tree T labeled child or descendant
    are also edges in the x-dag G
  • For each edge in T labeled ancestor, G contains
    an edge joining the same nodes, but with the
    direction reversed and the label changed to
    descendant. (Similarly, parent edges become
    child edges)

[Figure: step 2. Root -descendant-> Y; Y -child-> U; Y -descendant-> W; Z -descendant-> W; Z -child-> V.]
31
X-dag Construction
  • Example:
    /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]
  • Edges in the x-tree T labeled child or descendant
    are also edges in the x-dag G
  • For each edge in T labeled ancestor, G contains
    an edge joining the same nodes, but with the
    direction reversed and the label changed to
    descendant. (Similarly, parent edges become
    child edges)
  • For any non-root x-node v in G without an
    incoming edge, add a descendant edge from Root to
    v

[Figure: step 3. Root -descendant-> Y; Root -descendant-> Z; Y -child-> U; Y -descendant-> W; Z -descendant-> W; Z -child-> V.]
32
X-dag Construction
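  • Example:
    /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V]

[Figure: the x-tree beside the resulting x-dag. X-tree: Root -descendant-> Y; Y -child-> U; Y -descendant-> W; W -ancestor-> Z; Z -child-> V. X-dag: Root -descendant-> Y; Root -descendant-> Z; Y -child-> U; Y -descendant-> W; Z -descendant-> W; Z -child-> V.]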
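
The three x-dag construction rules can be sketched in a few lines. The edge-list representation and the names XTREE, REVERSED and to_xdag are assumptions made for this illustration; the expression is the running example /descendant::Y[child::U]/descendant::W[ancestor::Z/child::V].

```python
# x-tree edges as (source x-node, axis, target x-node)
XTREE = [
    ("Root", "descendant", "Y"),
    ("Y", "child", "U"),
    ("Y", "descendant", "W"),
    ("W", "ancestor", "Z"),
    ("Z", "child", "V"),
]

# Backward axes and the forward axis each one turns into when reversed.
REVERSED = {"ancestor": "descendant", "parent": "child"}


def to_xdag(xtree_edges):
    dag = []
    for src, axis, dst in xtree_edges:
        if axis in REVERSED:
            # Rule 2: flip the edge and relabel it with the forward axis.
            dag.append((dst, REVERSED[axis], src))
        else:
            # Rule 1: child and descendant edges are kept as they are.
            dag.append((src, axis, dst))
    # Rule 3: any non-root node without an incoming edge hangs off Root.
    nodes = {n for src, _, dst in xtree_edges for n in (src, dst)}
    has_incoming = {dst for _, _, dst in dag}
    for node in sorted(nodes - has_incoming - {"Root"}):
        dag.append(("Root", "descendant", node))
    return dag


xdag = to_xdag(XTREE)
for edge in xdag:
    print(edge)
```

After the conversion W has two incoming descendant edges, one from Y and one from Z, which is why the result is a dag rather than a tree.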
33
Reading the XML
  • XML is parsed using an event driven parser (such
    as SAX)
  • χαος focuses on start element and end element
    events
  • The XML document is parsed depth first
  • Each element visited is assigned a sequential ID
    that uniquely identifies that element in the
    document
  • As the parser visits each node it records the
    current level or depth
  • The level is needed for determining the
    child/parent axes, as a child must be located
    exactly 1 level deeper than its parent
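
A minimal sketch of this numbering using Python's standard xml.sax module is shown below. The handler name and the choice to start levels at 1 for the document element are assumptions made for the example.

```python
import xml.sax


class NumberingHandler(xml.sax.ContentHandler):
    """Assign each element a sequential ID and record its level."""

    def __init__(self):
        super().__init__()
        self.next_id = 1
        self.level = 0
        self.seen = []  # (tag, id, level) in document order

    def startElement(self, name, attrs):
        self.level += 1
        self.seen.append((name, self.next_id, self.level))
        self.next_id += 1

    def endElement(self, name):
        self.level -= 1


handler = NumberingHandler()
xml.sax.parseString(b"<X><Y><W/><Z><V/></Z></Y></X>", handler)
for tag, elem_id, level in handler.seen:
    print(tag, elem_id, level)
```

Here W and Z both sit at level 3, one level deeper than their parent Y at level 2, which is the test a child or parent axis needs.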

34
Matchings
  • As the SAX events occur, the χαος algorithm tries
    to match elements of the current document with
    the XPath expression, more specifically with the
    x-nodes in the x-tree.
  • Formally, a partial matching of x-nodes from
    x-tree T to elements of document D, m: V_T → V_D,
    satisfies the following characteristics
  • All mapped vertices satisfy the node test
  • For all x-nodes v in domain(m), label(v) =
    tag(m(v))
  • For all x-nodes v1 and v2 connected by an edge in
    T such that v1, v2 are in domain(m), (v1, m(v1))
    is consistent with (v2, m(v2))
  • In other words, in a partial matching each x-node
    label must match the element tag, and the
    relationship between 2 x-nodes must also hold
    between the 2 mapped elements from the XML
    document
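
The two conditions can be checked mechanically. The sketch below assumes a document stored as a map from element ID to (tag, parent ID), uses x-node labels as keys because each label occurs once in the running example, and leaves out the Root edge. All names are illustrative.

```python
# Document <X><Y><W/><Z><V/><W/></Z><U/></Y></X>: id -> (tag, parent id)
DOC = {1: ("X", None), 2: ("Y", 1), 3: ("W", 2), 4: ("Z", 2),
       5: ("V", 4), 6: ("W", 4), 7: ("U", 2)}

# x-tree edges (v1, axis, v2) for the running example, Root edge omitted
XTREE = [("Y", "child", "U"), ("Y", "descendant", "W"),
         ("W", "ancestor", "Z"), ("Z", "child", "V")]


def ancestors(doc, elem):
    parent = doc[elem][1]
    while parent is not None:
        yield parent
        parent = doc[parent][1]


def consistent(doc, axis, e1, e2):
    """True if element e2 stands in relation `axis` to element e1."""
    if axis == "child":
        return doc[e2][1] == e1
    if axis == "parent":
        return doc[e1][1] == e2
    if axis == "descendant":
        return e1 in ancestors(doc, e2)
    if axis == "ancestor":
        return e2 in ancestors(doc, e1)
    raise ValueError(axis)


def is_partial_matching(doc, xtree, m):
    # Condition 1: label(v) = tag(m(v)) for every mapped x-node.
    if any(doc[elem][0] != label for label, elem in m.items()):
        return False
    # Condition 2: every x-tree edge with both ends mapped is respected.
    return all(consistent(doc, axis, m[v1], m[v2])
               for v1, axis, v2 in xtree if v1 in m and v2 in m)


print(is_partial_matching(DOC, XTREE, {"Y": 2, "U": 7, "W": 6, "Z": 4, "V": 5}))
print(is_partial_matching(DOC, XTREE, {"Y": 2, "W": 3, "Z": 4}))
```

The second mapping fails because the W with ID 3 has no Z ancestor, so the ancestor edge between W and Z is not respected.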

35
Total Matchings (Overview)
  • A matching at an x-node v is total if its domain
    contains all the vertices of the subtree rooted
    at v
  • The result is the collection of total matchings
    at the Root
  • Computing the total matching for the entire
    expression involves collecting the matches as the
    document tree and x-tree are traversed

36
Looking for Total Matchings
  • An element e is relevant if there exists some
    document completion where e participates in a
    total matching at the Root
  • All relevant elements must be processed
  • As events occur new relevant elements may appear,
    while others may no longer be relevant.
  • The x-dag is used to determine which element is
    relevant

37
Looking for Total Matchings
  • The x-dag is useful since it orders the nodes in
    the order they should appear in the document
  • Example: given the XML string
  • <X><Y><W />
  • Since the document is processed depth-first, by
    the time the start element event for W is
    reached, the start element events for all of W's
    ancestors have been reached
  • According to the x-dag, both a Y and a Z need to
    be encountered before W becomes relevant
  • In this case, a Z has not appeared, so the W can
    be discarded

38
Looking For Set
  • To help determine when elements are relevant, or
    more importantly when they are not relevant (and
    can be discarded), the χαος algorithm maintains a
    looking-for set
  • The looking-for set L consists of the nodes and
    levels that the χαος algorithm is looking for next
  • Which elements are looked for next is based on
    the current level and the next corresponding node
    in the x-dag

39
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /></Z><U />

L = {(Y, *), (Z, *)}
Each entry is (v, l), where v is the tag/node name and l is the level. Both Y and Z are descendants, so the level does not matter.
Matches: x-dag node → sequential element ID
[Figure: x-dag for the example expression. Root -descendant-> Y; Root -descendant-> Z; Y -child-> U; Y -descendant-> W; Z -descendant-> W; Z -child-> V.]
40
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /></Z><U />

L = {(Y, *), (Z, *)}
Each entry is (v, l), where v is the tag/node name and l is the level. Both Y and Z are descendants, so the level does not matter.
The start X event fires. X is not in the looking-for set, so it can be discarded.
[Figure: x-dag for the example expression]
41
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /></Z><U />

L = {(Y, *), (Z, *), (U, 3)}
Y is in the looking-for set. (U, 3) is added, since the current level is 2 and U must be a child of Y.
Note: W is not added, since Z has not occurred yet.
[Figure: x-dag for the example expression]
42
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /></Z><U />

L = {(Y, *), (Z, *)}
W is not in the looking-for set and can be discarded. (U, 3) is removed: since the current level is 3, the next event is either an element at level 4 or an end element.
Note: W is not added, since Z has not occurred yet.
[Figure: x-dag for the example expression]
43
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /></Z><U />

L = {(Y, *), (Z, *), (U, 3)}
W ends, returning to level 2; resume looking for (U, 3).
Matches: Y → 2
[Figure: x-dag for the example expression]
44
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /></Z><U />

L = {(Y, *), (Z, *), (W, *), (V, 4)}
Z matches the looking-for set, so (V, 4) is added to the looking-for set. (W, *) is also added, since both Y and Z have been encountered.
Matches: Y → 2, Z → 4
[Figure: x-dag for the example expression]
45
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, *), (Z, *), (W, *)}
V on level 4 is in L.
Matches: Y → 2, Z → 4, V → 5
[Figure: x-dag for the example expression]
46
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, *), (Z, *), (W, *)}
W is in L.
Matches: Y → 2, Z → 4, V → 5, W → 6
Even though the output node is reached, the algorithm is still looking for Y/child::U.
[Figure: x-dag for the example expression]
47
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /><W /></Z><U />

L = {(Y, *), (Z, *), (U, 3)}
Z has ended; back to looking for U at level 3.
Matches: Y → 2, Z → 4, V → 5, W → 6
Even though the output node is reached, the algorithm is still looking for Y/child::U.
[Figure: x-dag for the example expression]
48
Simple Example
  • Looking-for set L for the following XML stream
  • <X><Y><W></W><Z><V /><W /></Z><U />
  • L = {(Y, *), (Z, *)}
  • U at level 3 found
  • Matches: Y → 2, U → 7, Z → 4, V → 5, W → 6
  • Total matching found. Add W to the
    solutions, and continue looking for matches as
    the document continues

[Figure: x-dag for the example expression]
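
The trace on the preceding slides can be reproduced with a simplified sketch. It is not the full algorithm: it only tracks which x-nodes are being looked for, and it assumes that a descendant entry is written (v, "*") for any level and that a child entry is present only while its parent match is the innermost open element. The names XDAG, looking_for and trace are made up for the example.

```python
XDAG = {  # x-node -> incoming edges as (source x-node, axis)
    "Y": [("Root", "descendant")],
    "Z": [("Root", "descendant")],
    "U": [("Y", "child")],
    "W": [("Y", "descendant"), ("Z", "descendant")],
    "V": [("Z", "child")],
}


def looking_for(open_levels, depth):
    """Return the (x-node, level) pairs wanted at the next start event."""
    wanted = set()
    for node, incoming in XDAG.items():
        level = "*"
        for source, axis in incoming:
            levels = [0] if source == "Root" else open_levels.get(source, [])
            if axis == "child":
                if depth not in levels:  # parent must be the current element
                    break
                level = depth + 1
            elif not levels:             # no open match for the source node
                break
        else:
            wanted.add((node, level))
    return wanted


def trace(events):
    open_levels = {}  # x-node -> levels of its currently open matched elements
    stack = []        # per open element: the matched x-node, or None
    depth, next_id = 0, 1
    matches, history = [], []
    for kind, tag in events:
        if kind == "start":
            wanted = looking_for(open_levels, depth)
            depth += 1
            hit = (tag, "*") in wanted or (tag, depth) in wanted
            if hit:
                open_levels.setdefault(tag, []).append(depth)
                matches.append((tag, next_id))
            stack.append(tag if hit else None)
            next_id += 1
        else:
            node = stack.pop()
            if node is not None:
                open_levels[node].pop()
            depth -= 1
        history.append(looking_for(open_levels, depth))
    return matches, history


# Events for <X><Y><W></W><Z><V /><W /></Z><U />
EVENTS = [("start", "X"), ("start", "Y"), ("start", "W"), ("end", "W"),
          ("start", "Z"), ("start", "V"), ("end", "V"), ("start", "W"),
          ("end", "W"), ("end", "Z"), ("start", "U"), ("end", "U")]

matches, history = trace(EVENTS)
print(matches)
for (kind, tag), wanted in zip(EVENTS, history):
    print(kind, tag, sorted(wanted, key=str))
```

The first W (ID 3) is discarded because no Z is open when it starts, while the second W (ID 6) is kept, giving the matches Y → 2, Z → 4, V → 5, W → 6 and U → 7.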
49
Incomplete Matches
  • The previous example showed a relatively
    straightforward successful match.
  • Items were added to the current
    matching-structure until all constraints were
    satisfied
  • This is not always the case
  • χαος takes an optimistic approach when adding
    items to the match list
  • As end tags occur, if not all of the required
    x-tree nodes have been visited, items must be
    removed from the match list

50
Incomplete Matches Example
  • Consider the XML string
  • <X><Y><Z><W /></Z><U /></Y></X>
  • At the end tag of W, the looking-for set and
    matches are as follows
  • L = {(Y, *), (Z, *), (W, *), (V, 4)}
  • M = {Y → 2, Z → 3, W → 4}
  • One step later, at the end of Z, V can no longer
    be a child of Z (id 3). Since Z (id 3) was added
    optimistically, Z (id 3) and its children must
    be removed from the match list
  • L = {(Y, *), (Z, *), (U, 3)}
  • M = {Y → 2} (Z → 3 and W → 4 removed)
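
The clean-up at an end tag can be sketched as follows. This is a deliberately flat simplification of the matching-structure: it only knows which x-nodes need a matched child, and it reads "Z and its children" as Z plus every matched element nested inside it. PARENT, REQUIRED_CHILD and on_end are names invented for the example.

```python
# Document <X><Y><Z><W /></Z><U /></Y></X>: element id -> parent id
PARENT = {1: None, 2: 1, 3: 2, 4: 3, 5: 2}

# x-nodes that need a matched child element with the given label
REQUIRED_CHILD = {"Y": ["U"], "Z": ["V"]}


def inside(elem, ancestor):
    """True if elem is `ancestor` itself or nested anywhere below it."""
    while elem is not None:
        if elem == ancestor:
            return True
        elem = PARENT[elem]
    return False


def on_end(matches, elem):
    """Return the match list after the end tag of element `elem`."""
    xnode = matches.get(elem)
    if xnode is None:
        return matches  # the element was never added to the match list
    for needed in REQUIRED_CHILD.get(xnode, []):
        found = any(label == needed and PARENT[e] == elem
                    for e, label in matches.items())
        if not found:
            # The optimistic match failed: drop it and everything inside it.
            return {e: label for e, label in matches.items()
                    if not inside(e, elem)}
    return matches


matches = {2: "Y", 3: "Z", 4: "W"}  # element id -> x-node, as at the end of W
matches = on_end(matches, 4)         # W needs no children, nothing changes
matches = on_end(matches, 3)         # Z never got its V child
print(matches)
```

After the end tag of Z only the match for Y remains, and it is still waiting for its U child.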

51
Results
  • χαος was compared with Apache's Xalan on
    XMark-generated XML documents
  • χαος and Xalan are about even until the document
    reaches about 100MB in size
  • Beyond that point Xalan falls behind, since it
    requires multiple traversals and cannot discard
    processed XML nodes as quickly
  • Regardless of the document size, χαος discarded
    99.8% of the elements encountered
  • This is primarily why the performance of χαος
    remained steady (linear) regardless of the
    document size

52
Overall Performance
  • χαος was compared with Apache's Xalan
  • In overall execution time, χαος performed about
    25% faster than Xalan.
  • For documents with 640,000 elements (6.7 MB):
    Xalan 52.28 seconds, χαος 39 seconds
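
As a quick check of the figures on this slide, the two timings can be compared directly. The variable names are illustrative, and "faster" is read here as the reduction in running time.

```python
xalan_seconds = 52.28
xaos_seconds = 39.0

# Fraction of Xalan's running time that is saved.
reduction = (xalan_seconds - xaos_seconds) / xalan_seconds
# Ratio of the two running times.
speedup = xalan_seconds / xaos_seconds

print(f"{reduction:.1%}")  # 25.4%
print(f"{speedup:.2f}x")
```

The running time drops by roughly a quarter, which is consistent with the "about 25%" figure.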

53
XPath Performance
  • A comparison was then made which excluded the
    parsing time
  • Results showed that χαος outperformed Xalan by an
    even larger margin than before
  • The performance gain is primarily a result of
    avoiding unnecessary traversals

54
Optimizations / Extensions
  • Extensions to the χαος algorithm include support
    for XPath expressions with multiple outputs
  • Handled by creating an x-dag as before with
    multiple x-nodes marked as output nodes
  • Multiple output nodes may also be used to support
    joins of XPath expressions
  • Optimize the storage for total matchings
  • If a branch of the x-dag does not contain an
    output node, a true or satisfied value can
    represent the subtree in the total matching,
    rather than storing mappings for the entire
    subtree

55
Conclusion
  • The χαος algorithm provides a very effective way
    of processing XPath expressions with both
    backward and forward axes
  • The ability to quickly discard non-relevant nodes
    makes χαος very effective on extremely large XML
    documents
  • Furthermore, the way in which the algorithm works
    makes it very scalable for streaming XML data
  • By handling XML data and matches as they occur,
    χαος can provide results as soon as they are
    found, before the end of the document/stream is
    reached

56
References
  • C. Barton, P. Charles, D. Goyal, M. Raghavachari,
    M. Fontoura, and V. Josifovski. Streaming XPath
    Processing with Forward and Backward Axes. In
    Proc. of ICDE, 2003.
  • Y. Diao, M. Altinel, M. Franklin, et al. Path
    Sharing and Predicate Evaluation for
    High-Performance XML Filtering. In TODS, pages
    467–516, 2003.