Title: A Transducer-Based XML Query Processor
1 A Transducer-Based XML Query Processor
2 Overview
- Introduction
- XML Stream Machine (XSM) Framework
- Translation to XSM Networks
- XSM Composition
- Optimizations
- Experimental Results
3 Introduction
- Web service implementations and mediators
exchange XML messages via XML import/export
mechanisms.
- A large number of important applications require
extremely efficient processing and transformation
of sequentially accessed streams.
4 Introduction (cont.)
- A qualitatively different architecture is needed,
where stream processing is performed on the fly
and within minimal memory.
- We propose the XSM-based architecture and
algorithms for the construction of XML query
processors that efficiently process XML streams
on the fly.
5 Introduction (cont.)
- XSM, the XML Stream Machine system, is a novel
XQuery processing paradigm that is tuned to the
efficient processing of sequentially accessed XML
data (streams).
6 Introduction (cont.)
[Figure not recoverable; surviving label: XQuery]
7 Introduction (cont.)
- The key challenge is in the XQuery-to-XSM
compiler, which takes into account both the query
and the schemas of the input streams.
8 Introduction (cont.)
- Goal
- Minimize the computation performed for each
incoming piece of stream data, i.e., reduce the
number of tests, read and write actions.
- Minimize the number and size of buffers
- Pipeline the computation and write tokens in the
output stream as soon as possible
10 XML Stream Machine Framework
- XML and XML streams
- In our model, we consider sets of element names E
and character data (strings) D.
- XQuery expressions contain variables drawn
from a set of variable names V.
11 XML Stream Machine Framework (cont.)
- An XML stream is a sequence of tokens, where the
set of tokens T is defined as
T = {<e> | e ∈ E} ∪ {d(x) | x ∈ D}
∪ {</e> | e ∈ E} ∪ {s_v, e_v | v ∈ V} ∪ {eol}.
- Example DTD:
<!ELEMENT root (a)> <!ELEMENT a (b)>
<!ELEMENT b (#PCDATA)>
- Corresponding stream:
s_v, <root>(<a>(<b>PCDATA</b>)</a>)</root>,
e_v, eol
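The token alphabet above can be made concrete with a small sketch. It flattens a document into a token list of the form s_v, <e>, d(x), </e>, e_v, eol. The tuple encoding of tokens and the function name `to_stream` are assumptions made for illustration only, and the sketch parses the whole document first, which an actual stream processor would avoid.

```python
import xml.etree.ElementTree as ET

def to_stream(xml_text, var):
    """Flatten an XML document into XSM-style tokens for stream variable `var`."""
    tokens = [("s", var)]                          # s_v: start of stream v

    def walk(elem):
        tokens.append(("open", elem.tag))          # <e>
        if elem.text and elem.text.strip():
            tokens.append(("d", elem.text.strip()))    # d(x)
        for child in elem:
            walk(child)
            if child.tail and child.tail.strip():
                tokens.append(("d", child.tail.strip()))
        tokens.append(("close", elem.tag))         # </e>

    walk(ET.fromstring(xml_text))
    tokens.append(("e", var))                      # e_v: end of stream v
    tokens.append(("eol",))                        # end-of-list marker
    return tokens

stream = to_stream("<root><a><b>hello</b></a></root>", "R")
```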
12 XML Stream Machine Framework (cont.)
- For querying streams, it is reasonable to assume
acyclic DTDs.
- This implies that all valid XML streams have
bounded depth, and hence no stack is required to
check well-formedness.
- In the sequel, we only consider valid streams
over acyclic DTDs.
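The bounded-depth claim can be checked mechanically. The sketch below assumes a DTD encoded as a dictionary from element name to child element names (a hypothetical encoding, here for a DTD root, a, b) and computes the maximum nesting depth, rejecting cyclic DTDs.

```python
# Hypothetical encoding of a DTD: element name -> names of its child elements.
DTD = {"root": ["a"], "a": ["b"], "b": []}

def max_depth(dtd, root):
    """Longest nesting depth below `root`; raises ValueError if the DTD is cyclic."""
    on_path = set()                       # elements on the current root-to-node path

    def depth(name):
        if name in on_path:
            raise ValueError(f"cyclic DTD at element {name!r}")
        on_path.add(name)
        d = 1 + max((depth(child) for child in dtd[name]), default=0)
        on_path.remove(name)
        return d

    return depth(root)

bound = max_depth(DTD, "root")
```

Because the bound is known at compile time, a machine with finitely many states can track the nesting level without a stack.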
13 XML Stream Machine Framework (cont.)
- XQuery Example
- XQE ::= XQE1/Constant // Path
- | XQE1, XQE2 // Concatenation
- | <Tag> XQE1 </Tag> // Element Creation
- | for Var in XQE1 // Generator
- where Cond // optional Condition
- return XQE2 // Body
- | Var
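14 XML Stream Machine Framework (cont.)
- Example query Q with labeled subexpressions:
for X in R/a return
for Y in X/b return
<res> Y, X </res>
- Q: the whole query (outer for expression)
- E: R/a
- F: X/b
- L: the inner for expression
- G: Y, X
- H: <res> Y, X </res>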
15 XML Stream Machine Framework (cont.)
- We say that a variable V is free in an XQuery
expression if it is not within the scope of a
"for V in ..." clause.
- We call input variables the variables that are free
within the outermost XQuery Q, as they correspond
to the input streams of Q.
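The definition of free variables translates directly into a recursive function. The sketch below assumes a small hypothetical AST (`Var`, `Path`, `Concat`, `Elem`, `For`) for the XQuery fragment and leaves out the optional where clause.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Path:
    base: object
    tag: str

@dataclass(frozen=True)
class Concat:
    left: object
    right: object

@dataclass(frozen=True)
class Elem:
    tag: str
    body: object

@dataclass(frozen=True)
class For:
    var: str
    source: object
    body: object

def free(q):
    """Set of variable names that occur free in expression q."""
    if isinstance(q, Var):
        return {q.name}
    if isinstance(q, Path):
        return free(q.base)
    if isinstance(q, Concat):
        return free(q.left) | free(q.right)
    if isinstance(q, Elem):
        return free(q.body)
    if isinstance(q, For):
        # The loop variable is bound inside the body only.
        return free(q.source) | (free(q.body) - {q.var})
    raise TypeError(f"unknown expression: {q!r}")

# for X in R/a return (for Y in X/b return <res> Y, X </res>)
inner = For("Y", Path(Var("X"), "b"), Elem("res", Concat(Var("Y"), Var("X"))))
query = For("X", Path(Var("R"), "a"), inner)
```

Here `free(query)` contains only R, so R is the single input variable (input stream) of the example query.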
16 XML Stream Machine Framework (cont.)
- XML Stream Machines
- XSM Buffers and Buffer Actions
- In state transitions, XSMs can access and query
buffer contents (via read operations such as
*p), and execute a sequence of actions
A = A1,...,Ak. An atomic action Ai can have the form
- p++: advance pointer p
- w(p,c): at p, write the constant c, then advance p
- w(p,r): at p, write *r, then advance p
- p := p': set p to the position of p'.
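A minimal sketch of buffers and the four atomic actions follows. The class and method names (`Buffer`, `Pointer`, `read`, `advance`, `write`, `set_to`) are hypothetical and chosen only to mirror the actions listed above.

```python
class Buffer:
    """A growable sequence of tokens."""
    def __init__(self, tokens=()):
        self.cells = list(tokens)

class Pointer:
    """A read or write position inside one buffer."""
    def __init__(self, buf, pos=0):
        self.buf, self.pos = buf, pos

    def read(self):                 # read operation: token under the pointer
        return self.buf.cells[self.pos]

    def advance(self):              # advance pointer
        self.pos += 1

    def write(self, token):         # w(p, c): write at p, then advance p
        cells = self.buf.cells
        if self.pos == len(cells):
            cells.append(token)
        else:
            cells[self.pos] = token
        self.pos += 1

    def set_to(self, other):        # set p to the position of another pointer
        self.pos = other.pos

# Copy a buffer token by token using w(p, r) followed by advancing r.
src, dst = Buffer(["<a>", "x", "</a>"]), Buffer()
r, w = Pointer(src), Pointer(dst)
while r.pos < len(src.cells):
    w.write(r.read())               # w(p, r): write the token read at r
    r.advance()

mark = Pointer(dst)
mark.set_to(w)                      # mark now sits at the write position
```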
17 XML Stream Machine Framework (cont.)
- XML Stream Machines (cont.)
- XSM control
- An XSM has a finite number of states Q, one of
which is the distinguished initial state q0.
- An XSM moves from the current state q to the next
state q', provided there is a transition t ∈ T
- t = q --[φ | A]--> q'
- whose condition φ is satisfied. Before entering
q', the action sequence A is executed.
18 XML Stream Machine Framework (cont.)
- XSM control (cont.)
- The transition condition φ is a Boolean
combination over the following atomic
expressions:
- p = p', p ≠ p', p < p' (pointer comparison)
- *r = c, *r ≠ c, *r = *r', *r ≠ *r' (token
comparison)
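Conditions and transitions can be sketched as plain functions over an environment. Everything below (the `env` layout, the helpers `tok_eq`, `ptr_lt`, `And`, `Not`, and `step`) is a hypothetical encoding for illustration, not the processor's actual data structures.

```python
def tok(env, r):
    """Token under read pointer r."""
    return env["buf"][env["pos"][r]]

def tok_eq(r, c):                   # token comparison against a constant
    return lambda env: tok(env, r) == c

def ptr_lt(p, q):                   # pointer comparison
    return lambda env: env["pos"][p] < env["pos"][q]

def And(f, g):
    return lambda env: f(env) and g(env)

def Not(f):
    return lambda env: not f(env)

def step(state, env, transitions):
    """Fire the first transition leaving `state` whose condition holds."""
    for src, cond, actions, dst in transitions:
        if src == state and cond(env):
            for act in actions:
                act(env)            # actions run before the target state is entered
            return dst
    return None                     # no transition is enabled

def advance_r(env):
    env["pos"]["r"] += 1

env = {"buf": ["<a>", "</a>", "eol"], "pos": {"r": 0, "wp": 3}}
transitions = [
    # Stay in state 0 while r trails wp and the current token is not eol.
    (0, And(ptr_lt("r", "wp"), Not(tok_eq("r", "eol"))), [advance_r], 0),
    # Move to state 1 once eol is seen.
    (0, tok_eq("r", "eol"), [], 1),
]

state = 0
while state == 0:
    state = step(state, env, transitions)
```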
19 XML Stream Machine Framework (cont.)
- XSM control (cont.)
- Example: input (X, Y), output (w)
[Figure: state diagram with states 0, 1, 2, 3;
transitions not recoverable]
21 Translation to XSM Networks
- The XSM compiler translates XQueries into
optimized XSMs.
- The process is based on building buffers for
subexpression results and variables, a basic
XSM for each kind of XQuery subexpression, and
appropriately connecting the buffers and XSMs.
22 Translation to XSM Networks (cont.)
- XSM Networks
- An XSM network is a directed acyclic graph (DAG)
whose nodes are XSMs and whose labeled edges are
of the form M1 --B--> M2, indicating that the output
buffer B of M1 is the input buffer of M2.
- We call M1 a producer XSM and M2 a consumer XSM.
- An input stream I: in --I--> M
- An output stream O: M --O--> out
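As a rough illustration, a network can be stored as a list of labeled edges (producer, buffer, consumer). The sketch below uses a made-up three-edge network and the standard-library `graphlib` module (Python 3.9+) to confirm acyclicity and list the machines in producer-before-consumer order. The function name `schedule` is an assumption.

```python
from graphlib import TopologicalSorter, CycleError

def schedule(edges):
    """Order nodes so every producer precedes its consumers.

    edges: iterable of (producer, buffer, consumer) triples.
    Raises graphlib.CycleError if the network is not a DAG.
    """
    preds = {}
    for producer, _buffer, consumer in edges:
        preds.setdefault(consumer, set()).add(producer)
        preds.setdefault(producer, set())
    return list(TopologicalSorter(preds).static_order())

# Hypothetical network: in --I--> M1 --B--> M2 --O--> out
edges = [("in", "I", "M1"), ("M1", "B", "M2"), ("M2", "O", "out")]
order = schedule(edges)
```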
23 Translation to XSM Networks (cont.)
- XSM Networks (cont.)
- Example
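24 Translation to XSM Networks (cont.)
- Example query:
for X in R/a return
for Y in X/b return
<res> Y, X </res>
[Figure: XSM network from in to out for the query
above; the recoverable nodes and buffers are listed below]
- M(E): R/a
- M(F): X/b
- M(L): ForVars Y: Y, X -> Y', X'
- M(G): Y', X'
- M(H): <res>Z</res>
- Buffers: R, X, Y, Z, O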
25 Translation to XSM Networks (cont.)
- Translation Algorithm
- Associate each input variable I with a
corresponding input buffer named I.
- For every path, concatenation, and element
creation (sub)expression Q (i.e., every
subexpression other than a for expression or a
single variable), we create a buffer named out(Q),
which will store the output results of the
subexpression.
- Which buffers are created for a for expression
- F = for V in Q1 return Q2
- depends on the free variables in the body Q2 of
F.
26 Translation to XSM Networks (cont.)
- Translation Algorithm (cont.)
- For translating an XQuery into an XSM network, we
use the following XSM templates:
- Path(InBuf, ChildTag, OutBuf)
- Concat(InBuf1, InBuf2, OutBuf)
- CreateEl(InBuf, ElemTag, OutBuf)
- ForVars(InVar, InVars, OutVars)
- The ChildTag and ElemTag parameters have to be
instantiated with constants; InBuf, OutBuf, and
InVar with buffer names; and InVars and OutVars
with lists of buffer names.
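27 Translation to XSM Networks (cont.)
- Translation Algorithm (cont.)
[Figure: state diagram of M(E) = Path(R,a,X) with
states 0, 1, 2. Transition labels (condition | actions)
recovered from the figure:]
- *r = s_r | r++
- *r ≠ <a> | r++
- *r = <a> | w(x, s_x), w(x, <a>), r++
- *r ≠ </a> | w(x, *r), r++
- *r = </a> | w(x, </a>), w(x, e_x), r++
- *r = e_r | w(x, eol), r++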
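The behaviour of Path(R,a,X) can be simulated in a few lines. The assignment of transitions to states 0, 1, 2 below is an assumption (state 0: before the stream starts, state 1: between a elements, state 2: inside an a element), and the token spellings such as "s_r" and "e_x" are hypothetical string encodings.

```python
def path_xsm(tokens, tag, in_var="r", out_var="x"):
    """Copy every <tag> element of the input stream to the output buffer."""
    open_t, close_t = f"<{tag}>", f"</{tag}>"
    out, state = [], 0
    for t in tokens:                      # every transition ends by advancing r
        if state == 0:
            if t != f"s_{in_var}":
                break                     # no transition enabled: stop
            state = 1
        elif state == 1:
            if t == open_t:               # start a new binding of the output variable
                out += [f"s_{out_var}", open_t]
                state = 2
            elif t == f"e_{in_var}":      # end of input stream
                out.append("eol")
                state = 0
            # any other token is skipped
        else:                             # state 2: inside the selected element
            if t == close_t:
                out += [close_t, f"e_{out_var}"]
                state = 1
            else:
                out.append(t)             # copy content through unchanged
    return out

stream_r = ["s_r", "<root>", "<a>", "<b>", "1", "</b>", "</a>",
            "<c>", "</c>", "<a>", "</a>", "</root>", "e_r", "eol"]
result = path_xsm(stream_r, "a")
```

Each selected element is written as it is read, so no element is buffered as a whole before output starts.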
28 Translation to XSM Networks (cont.)
- Translation Algorithm (cont.)
- We are now ready to produce XSM networks that
use the buffers described above.
- For every subexpression Q of the given XQuery:
- If Q = Var then
- /* skip for variables */
- Else if Q = Q1/c then
- produce Path(out(Q1), c, out(Q))
- Else if Q = Q1, Q2 then
- produce Concat(out(Q1), out(Q2), out(Q))
- Else if Q = <e>Q1</e> then
- produce CreateEl(out(Q1), e, out(Q))
- Else if Q = for Var in Q1 return Q2 and
free(Q2) \ {Var} ≠ ∅ then
- InVars = free(Q2)
- OutVars = {V' | V ∈ InVars}
- produce ForVars(Var, InVars, OutVars)
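The case analysis above can be sketched as a recursive walk over a query AST. The AST classes, the `render` helper, and the buffer-naming rules in `buf` are assumptions made for illustration: in particular, treating the buffer of a for expression as the buffer of its body is a simplification, and the rewiring of consumers to the primed output variables is not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Path:
    base: object
    tag: str

@dataclass(frozen=True)
class Concat:
    left: object
    right: object

@dataclass(frozen=True)
class Elem:
    tag: str
    body: object

@dataclass(frozen=True)
class For:
    var: str
    source: object
    body: object

def free(q):
    """Free variables of q."""
    if isinstance(q, Var):
        return {q.name}
    if isinstance(q, Path):
        return free(q.base)
    if isinstance(q, Concat):
        return free(q.left) | free(q.right)
    if isinstance(q, Elem):
        return free(q.body)
    return free(q.source) | (free(q.body) - {q.var})   # For

def render(q):
    """Readable text of q, used only to name buffers."""
    if isinstance(q, Var):
        return q.name
    if isinstance(q, Path):
        return f"{render(q.base)}/{q.tag}"
    if isinstance(q, Concat):
        return f"{render(q.left)}, {render(q.right)}"
    if isinstance(q, Elem):
        return f"<{q.tag}>{render(q.body)}</{q.tag}>"
    return f"for {q.var} in {render(q.source)} return {render(q.body)}"

def buf(q):
    """Buffer holding the result of q (naming scheme is an assumption)."""
    if isinstance(q, Var):
        return q.name
    if isinstance(q, For):
        return buf(q.body)
    return f"out({render(q)})"

def translate(q, net):
    """Append one template instantiation per subexpression of q to net."""
    if isinstance(q, Var):
        return                                          # skip variables
    if isinstance(q, Path):
        translate(q.base, net)
        net.append(f"Path({buf(q.base)}, {q.tag}, {buf(q)})")
    elif isinstance(q, Concat):
        translate(q.left, net)
        translate(q.right, net)
        net.append(f"Concat({buf(q.left)}, {buf(q.right)}, {buf(q)})")
    elif isinstance(q, Elem):
        translate(q.body, net)
        net.append(f"CreateEl({buf(q.body)}, {q.tag}, {buf(q)})")
    else:                                               # For
        translate(q.source, net)
        translate(q.body, net)
        if free(q.body) - {q.var}:                      # body uses outer variables
            in_vars = sorted(free(q.body))
            out_vars = [v + "'" for v in in_vars]
            net.append(f"ForVars({q.var}, [{', '.join(in_vars)}], [{', '.join(out_vars)}])")

inner = For("Y", Path(Var("X"), "b"), Elem("res", Concat(Var("Y"), Var("X"))))
query = For("X", Path(Var("R"), "a"), inner)
network = []
translate(query, network)
```

For the running example, only the inner for expression yields a ForVars machine, because its body also refers to the outer variable X.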
30 XSM Composition
- In XSM networks, consecutive XSMs are linked via
buffers. For example, consider two XSMs that are
linked via a shared buffer Bs: M1 --Bs--> M2.
- XSM composition allows us to replace M1 and M2
with a single XSM M3 = M1 ∘ M2.
- The composition creates opportunities for
eliminating the need for the shared buffer Bs and
for optimizing the composed XSM.
31 XSM Composition (cont.)
- For a state q, let readPtr(q) denote the set of
read pointers on which any outgoing transition
t = q --[φ | A]--> q' depends.
- scPtr(q) is the subset of readPtr(q) whose pointers
point into the shared connection buffers Bs:
- scPtr(q) = ptr(Bs) ∩ readPtr(q)
32 XSM Composition (cont.)
- Basic XSM composition algorithm
- Input
- Producer XSM M1 = (Q1, q01, B1, T1)
- Consumer XSM M2 = (Q2, q02, B2, T2)
- Shared connection buffers Bs = B1 ∩ B2
- Output
- Composed XSM M3 = (Q3, q03, B3, T3)
33 XSM Composition (cont.)
- Basic XSM composition algorithm
- Begin
- Q3 = Q1 × Q2; q03 = (q01, q02)
- B3 = B1 ∪ B2; T3 = ∅
- for (q1 --[φ1 | A1]--> q1') ∈ T1,
(q2 --[φ2 | A2]--> q2') ∈ T2 do
- if scPtr(q2) = ∅ then
- add(T3, (q1, q2) --[φ2 | A2]--> (q1, q2'))
- else
- ψ = ∨ {AE(r) : r ∈ scPtr(q2)}
- add(T3, (q1, q2) --[φ1 ∧ ψ | A1]--> (q1', q2))
- add(T3, (q1, q2) --[φ2 ∧ ¬ψ | A2]--> (q1, q2'))
- end
AE(r) (At-End r) is a runtime check r = wp,
comparing the position of r in Bs with the
position wp = writePtr(buffer(r)) of M1.
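The product construction can be sketched as follows, under one reading of the algorithm: when the consumer state reads no shared buffer, its transition is taken unchanged; otherwise the producer steps while some shared read pointer is at the end, and the consumer steps when none is. Conditions are kept as symbolic strings, and the tuple layout (source, condition, actions, target) and the name `compose` are assumptions for illustration.

```python
from itertools import product

def at_end(r_pos, wp_pos):
    """AE(r): the read pointer has caught up with the producer's write pointer."""
    return r_pos == wp_pos

def compose(T1, T2, sc_ptr):
    """Compose producer transitions T1 with consumer transitions T2.

    sc_ptr maps a consumer state to the set of its read pointers that
    point into shared connection buffers.
    """
    T3 = []

    def add(t):
        if t not in T3:                   # each composed transition is added once
            T3.append(t)

    for (q1, f1, A1, q1n), (q2, f2, A2, q2n) in product(T1, T2):
        shared = sc_ptr.get(q2, set())
        if not shared:
            # Assumption: the consumer proceeds on its own.
            add(((q1, q2), f2, A2, (q1, q2n)))
        else:
            psi = " or ".join(f"AE({r})" for r in sorted(shared))
            # Consumer would starve: let the producer step.
            add(((q1, q2), f"({f1}) and ({psi})", A1, (q1n, q2)))
            # Data is available: let the consumer step.
            add(((q1, q2), f"({f2}) and not ({psi})", A2, (q1, q2n)))
    return T3

# Hypothetical one-transition producer and two-transition consumer.
T1 = [("p0", "*r = <a>", ("w(s, <a>)", "r++"), "p1")]
T2 = [("c0", "*y = <a>", ("y++",), "c1"), ("c1", "true", (), "c0")]
T3 = compose(T1, T2, {"c0": {"y"}})
```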
35 Optimizations
- The efficiency of the resulting XSM can be
improved in several ways.
- Lockstep Optimization
- We exploit the fact that the basic algorithm
introduces runtime checks AE(p) which can be
shown to be valid or unsatisfiable using a static
analysis technique.
- The basic idea is to statically analyze when the
producer M1 and the consumer M2 operate in
lockstep on the shared connection buffer,
- i.e., when a read pointer r is trailing its
associated write pointer wp by at most one
position. In such cases the optimized composition
can eliminate AE checks.
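The lockstep condition itself is easy to state in code. The sketch below is a dynamic check over a given interleaving of producer and consumer steps, not the static analysis described above. The schedule encoding ('P' for one producer write, 'C' for one consumer read) is an assumption for illustration.

```python
def lags(schedule):
    """Distance wp - r after each step of a producer/consumer interleaving."""
    wp = r = 0
    result = []
    for step in schedule:
        if step == "P":
            wp += 1                       # producer writes one token
        elif step == "C":
            if r == wp:
                raise ValueError("consumer read past the write pointer")
            r += 1                        # consumer reads one token
        else:
            raise ValueError(f"unknown step {step!r}")
        result.append(wp - r)
    return result

def in_lockstep(schedule):
    """True if r never trails wp by more than one position."""
    return all(lag <= 1 for lag in lags(schedule))

alternating = lags("PCPCPC")
```

In an alternating schedule the lag is 1 after every producer step and 0 after every consumer step, so the outcome of each at-end check is known in advance and need not be tested at runtime.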
36 Optimizations
- Lockstep Optimization (cont.)
- Input
- Producer XSM M1 = (Q1, q01, B1, T1)
- Consumer XSM M2 = (Q2, q02, B2, T2)
- Shared connection buffers Bs = B1 ∩ B2
- Precondition
- For all q2 ∈ Q2: |scPtr(q2)| ≤ 1
- Output
- Optimized composed XSM M3 = (Q3, q03, B3, T3)
- Initialization
- Q3 = Q1 × Q2 × {go, no, ae}
- q03 = (q01, q02, no); B3 = B1 ∪ B2; T3 = ∅
37 Optimizations
- Schema-Based Optimization
- If the XML schema of the input stream is known,
further optimizations are possible.
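38 Optimizations
- For example,
- consider the XSM M(E) = Path(R,a,X). If we know that
on the input stream R only <a> elements can
appear, we can simplify the XSM further: the
transition that skips tokens other than <a> is
no longer needed.
[Figure: simplified state diagram of M(E) = Path(R,a,X)
with states 0, 1, 2; one transition label is not
recoverable. Recovered labels (condition | actions):]
- *r ≠ </a> | w(x, *r), r++
- *r = <a> | w(x, s_x), w(x, <a>), r++
- *r = </a> | w(x, </a>), w(x, e_x), r++
- *r = e_r | w(x, eol), r++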
40 Experimental Results
- The output of the XSM compiler is a C program
which uses a SAX parser on the incoming XML
stream.
- We measured the performance of our XSM-based
query processing engine and compared it to
several publicly available XSLT engines by
running the query on the DBLP XML database, a
popular online XML bibliography database used by
many researchers.
41 Experimental Results