Title: Xpath Query Evaluation
1Xpath Query Evaluation
2Goal
- Evaluating an XPath query against a given document
- To find all matches
- We will also consider the use of types
- Complexity is important
- Huge Documents
3Data complexity vs. Combined Complexity
- Two inputs to the query evaluation problem
- Data (XML document) of size D
- Query (Xpath expression) of size Q
- Usually Q << D
- Polynomial data complexity
- Complexity that is polynomial in D, possibly exponential in Q
- Polynomial combined complexity
- Complexity that is polynomial in D and Q
- Fixed Parameter Tractable complexity
- Complexity Poly(D) · f(Q), for an arbitrary function f
4Xpath standard semantics
5Core XPath
- locpath ::= '/' locpath | locpath '/' locpath | locpath '|' locpath | locstep.
- locstep ::= axis '::' ntst '[' bexpr ']' ... '[' bexpr ']'.
- bexpr ::= bexpr 'and' bexpr | bexpr 'or' bexpr | 'not(' bexpr ')' | locpath.
- axis ::= 'self' | 'child' | 'parent' | 'descendant' | 'descendant-or-self' | 'ancestor' | 'ancestor-or-self' | 'following' | 'following-sibling' | 'preceding' | 'preceding-sibling'.
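The grammar above maps naturally onto a small abstract syntax tree. The following is a minimal sketch in Python, not taken from the slides: the class names (`Step`, `Root`, `Seq`, `Alt`, `And`, `Or`, `Not`) and the `size` helper are assumptions chosen for illustration, and `size` is only one possible way to measure the query size Q.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Tuple, Union

# The eleven axes listed in the Core XPath grammar.
AXES = frozenset({
    "self", "child", "parent", "descendant", "descendant-or-self",
    "ancestor", "ancestor-or-self", "following", "following-sibling",
    "preceding", "preceding-sibling",
})

@dataclass(frozen=True)
class Step:                      # locstep: axis '::' ntst '[' bexpr ']' ...
    axis: str
    node_test: str               # an element name, or "*"
    predicates: Tuple[BExpr, ...] = ()

    def __post_init__(self):
        if self.axis not in AXES:
            raise ValueError(f"unknown axis: {self.axis}")

@dataclass(frozen=True)
class Root:                      # '/' locpath
    arg: LocPath

@dataclass(frozen=True)
class Seq:                       # locpath '/' locpath
    left: LocPath
    right: LocPath

@dataclass(frozen=True)
class Alt:                       # locpath '|' locpath
    left: LocPath
    right: LocPath

@dataclass(frozen=True)
class And:                       # bexpr 'and' bexpr
    left: BExpr
    right: BExpr

@dataclass(frozen=True)
class Or:                        # bexpr 'or' bexpr
    left: BExpr
    right: BExpr

@dataclass(frozen=True)
class Not:                       # 'not(' bexpr ')'
    arg: BExpr

LocPath = Union[Step, Root, Seq, Alt]
BExpr = Union[And, Or, Not, Step, Root, Seq, Alt]

def size(e) -> int:
    """Number of AST nodes (a hypothetical measure of the query size Q)."""
    if isinstance(e, Step):
        return 1 + sum(size(p) for p in e.predicates)
    if isinstance(e, (Root, Not)):
        return 1 + size(e.arg)
    if isinstance(e, (Seq, Alt, And, Or)):
        return 1 + size(e.left) + size(e.right)
    raise TypeError(f"not a Core XPath expression: {e!r}")

# descendant::b/following-sibling::*[not(child::c)]
query = Seq(
    Step("descendant", "b"),
    Step("following-sibling", "*", (Not(Step("child", "c")),)),
)
query_size = size(query)
```

Each AST node corresponds to one node of the query parse tree, so the tree can later be traversed bottom-up or top-down.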
6Xpath Query Evaluation
- Input: XML document D, XPath query Q
- Output: a subset of the nodes of D, as defined by Q
- We will follow "Efficient Algorithms for Processing XPath Queries", Gottlob, Koch, Pichler, TODS 2005
7Simple algorithm
- process-location-step(n, Q)
- {
- S := apply Q.first to n
- if |Q| > 1
- for each node n' in S do
- process-location-step(n', Q.next)
- }
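The recursive procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code of any real XPath engine: the `Node` class, the `AXES` table (only five axes), the representation of a query as a list of `(axis, node_test)` pairs, and the `out` list that collects the nodes reached by the last step are all hypothetical. The example document and query imitate the style of query that makes the naive strategy blow up.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)             # identity-based equality and hashing
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

def descendants(n: Node) -> List[Node]:
    out = []
    for c in n.children:
        out.append(c)
        out.extend(descendants(c))
    return out

def ancestors(n: Node) -> List[Node]:
    out = []
    while n.parent is not None:
        n = n.parent
        out.append(n)
    return out

# Hypothetical axis table; the remaining axes are omitted for brevity.
AXES = {
    "self": lambda n: [n],
    "child": lambda n: list(n.children),
    "parent": lambda n: [n.parent] if n.parent is not None else [],
    "descendant": descendants,
    "ancestor": ancestors,
}

def process_location_step(n: Node, query, out: List[Node]) -> None:
    """Naive evaluation: recurse once per node reached, with no sharing."""
    axis, test = query[0]                       # Q.first
    s = [m for m in AXES[axis](n) if test == "*" or m.label == test]
    if len(query) > 1:
        for m in s:
            process_location_step(m, query[1:], out)   # Q.next
    else:
        out.extend(s)                           # may contain duplicates

# Document <a><b/><b/></a>, query child::b/parent::a/child::b/... (3 repeats)
root = Node("a")
root.add(Node("b"))
root.add(Node("b"))
query = [("child", "b")] + [("parent", "a"), ("child", "b")] * 3

matches: List[Node] = []
process_location_step(root, query, matches)
visits = len(matches)                  # how many times a result node was reached
distinct = len({id(m) for m in matches})
```

On this tiny document the answer contains only two distinct nodes, yet the number of visits doubles with every added `parent::a/child::b` pair, because the same node is re-evaluated once for every path that reaches it.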
8Complexity
- Worst case: in each step of Q the axis is following
- So we apply the query in each step on O(D) nodes
- And we get Time(Q) = D · Time(Q-1)
- I.e. the complexity is O(D^Q), exponential in the query size
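Unrolling the recurrence makes the exponential bound explicit. The following is a sketch that writes |D| and |Q| for the sizes called D and Q on the slide and assumes that a single location step costs O(|D|):

```latex
\begin{align*}
\mathrm{Time}(|Q|) &= |D| \cdot \mathrm{Time}(|Q| - 1) \\
                   &= |D|^{2} \cdot \mathrm{Time}(|Q| - 2) \\
                   &= \dots \\
                   &= |D|^{|Q| - 1} \cdot \mathrm{Time}(1) \;=\; O\!\left(|D|^{|Q|}\right)
\end{align*}
```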
9Early Systems Performance
Figure taken from Gottlob, Koch, Pichler 05
10Internet Explorer 6
Figure taken from Gottlob, Koch, Pichler 05
11IE6 performance as a function of document size
Figure taken from Gottlob, Koch, Pichler 05
12Polynomial data complexity
- Polynomial data complexity is sometimes considered good even if exponential in the query size
- But can we have polynomial combined complexity for XPath query evaluation?
- Yes!
13Two main principles
- Query parse trees: the query is divided into parts according to its structure (not to be confused with the XML tree structure)
- Context-value tables: for every expression e occurring in the parse tree, compute a table of all valid combinations of context c and value v such that e evaluates to v in c.
14Xpath query parse tree
- descendant::b/following-sibling::*[position() != last()]
15Bottom-up vs. Top-down evaluation
- We will discuss two kinds of query evaluation algorithms
- Bottom-up means that the query parse tree is processed from the leaves up to the root
- Top-down means that the parse tree is processed from the root to the leaves
- While processing, we fill in the context-value tables
16Bottom-up evaluation
- Main idea: compute the value for each leaf for every possible context
- Propagate upwards until the root
- Dynamic programming algorithm, to avoid re-evaluation of queries in the same context
17Operational semantics
- Needed as a first step for evaluation algorithms
- Similar ideas are used in compiler design
- Here the semantics is based on the notion of
contexts
18Contexts
- The domain of contexts is
- C = dom × {<k,n> | 1 ≤ k ≤ n ≤ |dom|}
- A context is c = <x,k,n>
- where x is a context node
- k is a context position
- n is the context size
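To get a feeling for the size of the domain of contexts, the following Python sketch enumerates every triple <x,k,n> for a small node set. The function name `contexts` and the sample node names are assumptions made for illustration.

```python
def contexts(dom):
    """Yield every context (x, k, n) with x in dom and 1 <= k <= n <= |dom|."""
    n_max = len(dom)
    for x in dom:                        # context node
        for n in range(1, n_max + 1):    # context size
            for k in range(1, n + 1):    # context position
                yield (x, k, n)

dom = ["r", "a", "b", "c"]               # hypothetical document nodes
all_contexts = list(contexts(dom))
num_contexts = len(all_contexts)         # |dom| * |dom| * (|dom| + 1) / 2
```

The count grows cubically in the number of document nodes, which is where the cubic factor in the size of a full context-value table comes from.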
19Types
20Semantics for Xpath expressions
- The semantics of evaluating an expression is a
4-tuple where the first 3 elements are the
context, and the fourth is the value obtained by
evaluation in the context
21Some notations
- T(t): all nodes satisfying a predicate t
- E(e): all nodes satisfying a regular expression e (applied with respect to a given axis)
- idx_χ(x, S): the index of a node x in the set S with respect to a given axis χ and the document order
23Context-value Table
- Given a query sub-expression e, the context-value table of e specifies all combinations of context c and value v, such that computing e on the context c results in v
- The bottom-up algorithm follows: compute the context-value tables in a bottom-up fashion with respect to the query parse tree
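As a small illustration of context-value tables, the sketch below fills the tables of the two leaves position() and last() for every context and then combines them into the table of the parent expression position() != last(). The helper names (`contexts`, `leaf_table`, `combine`) and the use of plain dictionaries as tables are assumptions, not the data structures of the cited paper.

```python
import operator

def contexts(dom):
    """All contexts (x, k, n) with x in dom and 1 <= k <= n <= |dom|."""
    n_max = len(dom)
    return [(x, k, n)
            for x in dom
            for n in range(1, n_max + 1)
            for k in range(1, n + 1)]

def leaf_table(ctxs, fn):
    """Context-value table of a leaf: one value per context."""
    return {c: fn(c) for c in ctxs}

def combine(op, left, right):
    """Table of a binary parent node, built from its children's tables."""
    return {c: op(left[c], right[c]) for c in left}

dom = ["r", "a", "b"]                          # hypothetical document nodes
ctxs = contexts(dom)

t_position = leaf_table(ctxs, lambda c: c[1])  # position() returns k
t_last = leaf_table(ctxs, lambda c: c[2])      # last() returns n
t_ne = combine(operator.ne, t_position, t_last)

not_last = t_ne[("a", 1, 2)]   # position 1 of 2: not the last node
is_last = t_ne[("a", 2, 2)]    # position 2 of 2: the last node
```

Each table is computed exactly once and then reused by the parent, which is the dynamic-programming idea behind the bottom-up algorithm.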
24Bottom-up algorithm
25Example
4 times
26Complexity
- O(D^3 · Q) space, ignoring strings and numbers
- O(Q) tables, with 3 columns, each including values in 1..D, thus O(D^3 · Q)
- An extra O(D · Q) multiplicative factor for strings and numbers
- O(D^5 · Q) time, ignoring strings and numbers
- It can take O(D^2) to combine two node sets
- Extra O(Q) factor in case of strings and numbers
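Multiplying out the factors listed above gives the overall bounds of the bottom-up algorithm. The arithmetic below is a sketch that writes |D| and |Q| for the document and query sizes:

```latex
\begin{align*}
\text{Space:}\quad & O\!\left(|D|^{3} \cdot |Q|\right) \cdot O\!\left(|D| \cdot |Q|\right)
                     = O\!\left(|D|^{4} \cdot |Q|^{2}\right) \\
\text{Time:}\quad  & O\!\left(|D|^{5} \cdot |Q|\right) \cdot O\!\left(|Q|\right)
                     = O\!\left(|D|^{5} \cdot |Q|^{2}\right)
\end{align*}
```

Both bounds are polynomial in the document size and in the query size, so the combined complexity is polynomial.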
27Optimization
- Represent contexts as pairs of current and previous node
- This brings the time complexity down to O(D^4 · Q^2)
- Space complexity can be brought down to O(D^2 · Q^2) via more optimizations
28Top-down evaluation
- Similar idea
- But allows us to compute values only for the contexts that are needed
- Same worst-case bounds
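The difference from bottom-up evaluation can be sketched with memoization: values are computed on demand, starting from the root of the query parse tree, and cached per (expression, context) pair. This is a simplified illustration rather than the algorithm of the cited paper. The tuple encoding of expressions and the function `eval_top_down` are assumptions, and only position(), last() and != are supported.

```python
from functools import lru_cache

# Hypothetical encoding of parse-tree nodes as nested tuples.
POSITION = ("position",)
LAST = ("last",)
QUERY = ("ne", POSITION, LAST)        # position() != last()

@lru_cache(maxsize=None)              # the cache plays the role of the tables
def eval_top_down(expr, ctx):
    x, k, n = ctx                     # context node, position, size
    op = expr[0]
    if op == "position":
        return k
    if op == "last":
        return n
    if op == "ne":
        return eval_top_down(expr[1], ctx) != eval_top_down(expr[2], ctx)
    raise ValueError(f"unsupported expression: {expr!r}")

# Only two contexts are actually requested.
needed = [("b", 1, 3), ("c", 3, 3)]
results = {c: eval_top_down(QUERY, c) for c in needed}
filled_entries = eval_top_down.cache_info().currsize
```

Only the entries for the requested contexts are ever filled in (three expressions times two contexts here), whereas a bottom-up pass would fill every table for every possible context.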
29Top-down or bottom-up?
- General question in processing XML trees
- The tradeoff
- Usually easier to combine results computed in children to obtain the result at the parent
- So bottom-up traversal is usually easier to design
- On the other hand, some of the computation is redundant since we don't know if it will become relevant
- So top-down traversal may be more efficient
30Linear-time fragment
- Core XPath includes only navigation
- / and //
- Core XPath can be evaluated in O(D · Q), i.e. linear in both the document size and the query size
- Observation: no need to consider the entire context triple, only the current context node
- Top-down or bottom-up evaluation with essentially the same algorithm
- But smaller tables (for every query node, all document nodes and values of evaluation) are maintained.
31Types are helpful
- Can direct the search
- In some parts of the tree there is no hope of getting a match for a given sub-expression of the query
- As a result we may have tables with fewer entries.
- Whiteboard discussion
32Type Checking and Inference
- Type checking a single document: straightforward
- Polynomial combined complexity if the automaton representing the type is deterministic; otherwise exponential in the automaton size but polynomial in the document size
- Type checking the results of an (XPath) query
- Inferring the type of the results of a query
33Type Inference
- An (incomplete) algorithm for type inference can work its way to the top of the query parse tree to infer a type in a bottom-up fashion
- Start by inferring a type for the leaves (simple queries), then use it for their parents
- Type inference is inherently incomplete.
- It can be performed for some languages that are regular in a sense.
34Restricted language allowing for type inference
- Axes: child, descendant, parent, ancestor, following-sibling, etc.
- Variables can be bound to nodes in the input tree and then passed as parameters
- An equality test can be performed between node IDs, but not between node values.
35Type Checking
- In addition to inferring a type we need to verify containment in another type.
- Type inference can be used as a tool for type checking.
- Type checking was shown to be decidable for the same language fragment, but with high complexity.
36Intuitive connection to text
- Queries => regular expressions
- Types (tree automata) => context free languages
- Type Inference => intersection of context free and regular languages, resulting in a context free one
- Type checking => Type Inference + inclusion of context free languages (with some restrictions to guarantee decidability)