Title: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching
1From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching
- Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen
- National University of Singapore
2Outline
- Background
- Define our problem: XML twig pattern matching
- Previous work and problems
- Our new twig matching algorithms
- A new labeling scheme: extended Dewey
- A new holistic algorithm: TJFast
- Experimental results
- Conclusion
3XML basics
- Short for Extensible Markup Language
- A language for defining the syntax and semantics of structured data
- An XML document is commonly modeled as a rooted, ordered and tagged tree.
[Figure: example XML document tree with root book and elements preface, chapter, section, title and paragraph, plus text values such as "Intro", "Data" and "XML".]
4Querying XML Data
- Major standards for querying XML data
- XPath and XQuery
- XML twig pattern matching is a core operation in XPath and XQuery
- Definition of XML twig pattern: An XML twig pattern is a small tree whose nodes are tags, attributes or text values and whose edges are either Parent-Child edges or Ancestor-Descendant edges
5An XML twig pattern example
- Create a flat list of all the title-author pairs
for every book in bibliography.
XQuery:
<results> {
  for $b in doc("bib.xml")/bib//book, $t in $b/title, $a in $b/author
  return <result> {$t} {$a} </result>
} </results>
To answer the XQuery, we first need to match the following XML twig pattern:
[Figure: twig pattern with root bib, an ancestor-descendant edge from bib to book ($b), and parent-child edges from book to title ($t) and to author ($a).]
6Our research problem
- Problem Statement
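- Given an XML twig pattern Q and an XML database D, we need to find ALL the matches of Q on D.
- E.g. consider the following twig pattern and document:
[Figure: an XML tree containing the section elements s1 and s2, the title elements t1 and t2, the figure element f1 and a further element p1; the twig pattern has root Section with the nodes Title and Figure below it.]
Query answers: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)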
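The problem statement can be made concrete with a brute-force matcher. The sketch below is only an illustration, not one of the algorithms discussed in this talk: the classes `Element` and `QueryNode`, the tree shape, and the choice of ancestor-descendant edges in the pattern are all assumptions, picked so that the result agrees with the three answers listed above.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Element:
    """A node of the XML tree: a tag, a display name, and ordered children."""
    tag: str
    name: str
    children: list = field(default_factory=list)


@dataclass
class QueryNode:
    """A twig node; axis is the edge from its parent: "/" or "//"."""
    tag: str
    axis: str = "//"
    children: list = field(default_factory=list)


def descendants(e):
    """Yield all proper descendants of e in document order."""
    for c in e.children:
        yield c
        yield from descendants(c)


def match_at(q, e):
    """Return all matches of the twig rooted at q when q is bound to e."""
    if q.tag != e.tag:
        return []
    per_child = []
    for qc in q.children:
        candidates = e.children if qc.axis == "/" else descendants(e)
        found = [m for c in candidates for m in match_at(qc, c)]
        if not found:
            return []  # one unmatched branch rules out this binding
        per_child.append(found)
    # Combine one sub-match per query child into a full match tuple.
    return [(e.name,) + sum(combo, ()) for combo in product(*per_child)]


def twig_matches(q, root):
    """Try every element of the document as a binding for the twig root."""
    return [m for e in [root, *descendants(root)] for m in match_at(q, e)]


# Hypothetical document: s1 contains t1 and s2; s2 contains p1, t2 and f1.
doc = Element("section", "s1", [
    Element("title", "t1"),
    Element("section", "s2", [
        Element("paragraph", "p1"),
        Element("title", "t2"),
        Element("figure", "f1"),
    ]),
])
# Assumed pattern: Section with descendant Title and descendant Figure.
query = QueryNode("section", children=[QueryNode("title"), QueryNode("figure")])
matches = twig_matches(query, doc)
```

On this hypothetical tree, `matches` holds (s1, t1, f1), (s1, t2, f1) and (s2, t2, f1). The matcher revisits subtrees repeatedly, which is exactly the cost that holistic twig joins are designed to avoid.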
10Related work
- TreeMerge and Stack-tree (Al-Khalifa et al., ICDE 2002)
- A stack-based binary join algorithm
- But large intermediate results
- TwigStack (Bruno et al., SIGMOD 2002)
- A holistic twig join algorithm
- Sub-optimal for queries with parent-child relationships
- TwigStackList (Lu et al., CIKM 2004)
- A new holistic twig join algorithm, which produces fewer useless intermediate results than TwigStack does for queries with parent-child relationships
11Our research goal
- In this research, we want to design a new holistic twig join algorithm which is more efficient than previous work.
- Two aspects to achieve this goal
- (1) Input: reduce the input I/O cost
- (2) Output: reduce the size of intermediate results
13Original Dewey Labeling Scheme
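- In the Dewey labeling scheme, each element is represented by a vector:
- (i) the root is labeled by the empty string ε
- (ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
- For example:
[Figure: example tree over the elements s1, s2, t1, t2, f1 and f2; the root s1 is labeled ε, its children are labeled 1, 2 and 3, and the children of the node labeled 2 are labeled 2.1 and 2.2.]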
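The two labeling rules translate directly into a recursive traversal. The following is a minimal sketch; the nested `(name, children)` tuple representation and the exact shape of the example tree are assumptions made for illustration, chosen only to reproduce the labels ε, 1, 2, 3, 2.1 and 2.2.

```python
def dewey_labels(tree, label=()):
    """Assign original Dewey labels.

    `tree` is a (name, children) pair. The root gets the empty label and the
    x-th child of a node labeled L gets L + (x,), with x starting at 1.
    """
    name, children = tree
    labels = {name: label}
    for x, child in enumerate(children, start=1):
        labels.update(dewey_labels(child, label + (x,)))
    return labels


def is_ancestor(a, d):
    """a is a proper ancestor of d iff label a is a proper prefix of label d."""
    return len(a) < len(d) and d[:len(a)] == a


def show(label):
    """Render a label tuple as dotted text, with ε for the empty label."""
    return ".".join(map(str, label)) or "ε"


# Hypothetical tree shape: s1 has children t1, s2, f2; s2 has children t2, f1.
tree = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
labels = dewey_labels(tree)
print({name: show(lab) for name, lab in labels.items()})
```

Ancestor-descendant tests reduce to prefix tests on labels, but a label alone says nothing about the tag names along the path, which is the limitation discussed next.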
14Main problem of the original Dewey
- If we use the original Dewey labeling scheme to answer a twig query, we need to read labels for all query nodes. Thus, we gain no performance benefit compared to previous methods.
- Our idea: Extend the original Dewey labeling scheme so that, given the label of any element e, we can know the path of e from this label alone.
15Modulo function
- We need to know some schema information: a DTD (Document Type Definition) or an XML Schema
- Given DTD information: book → author, title, chapter
- Our solution: using a modulo function, we create a mapping between an element tag and an integer.
- We define X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2, where X_t is the last component of the label of an element with tag t.
Why 5 and not 3, as in the original Dewey?
[Figure: element book (label ε) with children author (0), title (1), chapter (2) and chapter (5).]
16Derive element tag
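- From a label, we can derive its tag name.
- book → author, title, chapter
- Recall that we define X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2.
- For example, the label 5 gives 5 mod 3 = 2, so the element is a chapter.
[Figure: the same tree, with the tags of the children labeled 0, 1, 2 and 5 to be derived from their labels.]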
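The modulo mapping can be sketched in a few lines. The slides do not spell out how the last component is chosen, so the rule used here (take the smallest integer larger than the previous sibling's component that has the required remainder) is an assumption; it does reproduce the labels 0, 1, 2 and 5 of the example. The names `next_component`, `label_children` and `tag_of` are illustrative.

```python
def next_component(prev, k, n):
    """Smallest x > prev with x % n == k (prev is None for the first child)."""
    if prev is None:
        return k
    x = (prev // n) * n + k
    return x if x > prev else x + n


def label_children(child_tags, allowed):
    """Last label components for a sequence of sibling tags.

    `allowed` lists the child tags permitted by the schema; the tag at index k
    receives a component x with x % len(allowed) == k.
    """
    n, prev, out = len(allowed), None, []
    for tag in child_tags:
        prev = next_component(prev, allowed.index(tag), n)
        out.append(prev)
    return out


def tag_of(component, allowed):
    """Recover a child's tag from the last component of its label."""
    return allowed[component % len(allowed)]


BOOK = ["author", "title", "chapter"]  # book -> author, title, chapter
components = label_children(["author", "title", "chapter", "chapter"], BOOK)
print(components)                            # [0, 1, 2, 5]
print([tag_of(x, BOOK) for x in components])
```

The second chapter cannot take 3 or 4, because 3 mod 3 = 0 would decode to author and 4 mod 3 = 1 to title; 5 is the next value that decodes to chapter.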
17Derive the path from a label
- By following a finite state transducer (FST), we can recursively derive the whole path from any extended Dewey label.
- For example:
DTD: book → author, title, chapter; chapter → (paragraph | section); section → (paragraph | section)
[Figure: FST with the states book, author, title, chapter, section and paragraph. Transitions from book: mod 3 = 0 to author, mod 3 = 1 to title, mod 3 = 2 to chapter. Transitions from chapter and from section: mod 2 = 0 to paragraph, mod 2 = 1 to section. Beside it, a document tree rooted at book.]
Question: Given a label 5.1.0 for an element, what is the corresponding path?
18Derive the path from a label
- Following the highlighted (red) path in the FST, we get that 5.1.0 denotes book/chapter/section/paragraph.
- Step by step: 5 mod 3 = 2 selects chapter, 1 mod 2 = 1 selects section, and 0 mod 2 = 0 selects paragraph.
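The walk through the transducer can be sketched as a table lookup per label component. The dictionary `SCHEMA` below is a hypothetical encoding of the FST for the DTD on the slide, with the child tags listed in the order assumed for the modulo mapping (paragraph = 0, section = 1 under chapter and section).

```python
# Child tags allowed under each tag, in the order used for the modulo mapping.
# Assumption: the order follows the DTD as written on the slide.
SCHEMA = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def derive_path(label, root="book", schema=SCHEMA):
    """Return the tag path encoded by an extended Dewey label such as '5.1.0'."""
    path = [root]
    for part in filter(None, label.split(".")):
        allowed = schema[path[-1]]  # transitions leaving the current state
        path.append(allowed[int(part) % len(allowed)])
    return "/".join(path)


print(derive_path("5.1.0"))  # book/chapter/section/paragraph
```

Each component costs one dictionary lookup and one modulo operation, so decoding is linear in the length of the label.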
19Two properties of extended Dewey
- Find Ancestor Label
- From the label of any element, we can derive the labels of all its ancestors.
- Find Ancestor Name
- From the label of any element, we can derive the tag names of all its ancestors.
- These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
21A new algorithm: TJFast (a Fast Twig Join algorithm)
- For each leaf node n in the query, there exists a corresponding input stream Tn.
- Tn contains the extended Dewey labels of the elements with tag n. These labels are arranged in document order.
- For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that possibly contribute to query answers. (Compared to TwigStack, what is the difference?)
- At any point during the computation, the size of the set Sb is bounded by the depth of the XML document.
22A new algorithm: TJFast
- Two-phase algorithm
- Phase 1: parts of the intermediate root-leaf paths are output
- Insert elements that possibly contribute to query answers into the sets
- Output intermediate paths according to the elements in the sets
- Phase 2: the intermediate paths are merge-joined to get the final results
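The core step of Phase 1, matching one root-to-leaf query path against the tag path decoded from a leaf label, can be sketched as follows. This is a simplified illustration under stated assumptions, not the TJFast procedure itself: the function `path_solutions` and its step encoding are hypothetical, and the sketch omits the pruning that TJFast performs with the sets of the branching nodes.

```python
def path_solutions(query, tags, names):
    """Enumerate bindings of a root-to-leaf query path on one element path.

    query: list of (axis, tag) steps, e.g. [("//", "a"), ("//", "d")];
           axis is the edge leading into the step ("/" child, "//" descendant).
    tags:  tag path of a leaf element, root first (as decoded from its label).
    names: element names aligned with `tags`, used only for display.
    The last query step must bind to the last element of the path.
    """
    results = []

    def extend(qi, lo, bound):
        # qi: index of the next query step; lo: smallest index it may bind to.
        if qi == len(query):
            if bound[-1] == len(tags) - 1:
                results.append(tuple(names[i] for i in bound))
            return
        axis, tag = query[qi]
        positions = [lo] if axis == "/" else range(lo, len(tags))
        for i in positions:
            if i < len(tags) and tags[i] == tag:
                extend(qi + 1, i + 1, bound + [i])

    extend(0, 0, [])
    return results


a_d = [("//", "a"), ("//", "d")]                 # query path A//D
a_b_c = [("//", "a"), ("/", "b"), ("//", "c")]   # query path A/B//C
print(path_solutions(a_d, ["a", "a", "d"], ["a1", "a2", "d1"]))
print(path_solutions(a_b_c, ["a", "a", "b", "c"], ["a1", "a3", "b1", "c1"]))
```

Without pruning, the first call also reports (a2, d1). In the worked example on the following slides TJFast does not output that path, because a2 is never inserted into the set of the branching node A.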
23An example for TJFast algorithm
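[Figure: Query: A with descendant D (A//D) and child B (A/B), where B has descendant C (B//C). Document, with extended Dewey labels (the document root is labeled ε): a1 (0), a2 (0.0), a3 (0.3), b2 (0.5), d1 (0.0.1), d2 (0.3.1), b1 (0.3.2), d3 (0.5.0), c1 (0.3.2.1), c2 (0.5.0.0). A set is kept for the branching node A.]
DTD: a → a, d, b; b → d, c; d → c
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
Why do we not need TA, TB streams?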
24An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23.]
By the finite state transducer of the extended Dewey labeling scheme, we derive:
0.0.1 → a1/a2/d1
0.3.2.1 → a1/a3/b1/c1
25An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23.]
Both a1 and a3 possibly contribute to query answers. (Why not a2?)
26An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23.]
Then we insert a1 into the set, since a1 is an ancestor of a3.
27An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23; set for A: {a1}.]
Move the cursor of TD from d1 to d2 and output one path solution <a1, d1>.
28An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23; set for A: {a1, a3}.]
We derive 0.3.1 → a1/a3/d2.
We insert a3 into the set, since a3 definitely contributes to query answers.
29An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23; set for A: {a1, a3}.]
Move the cursor of stream TD from d2 to d3 and output <a1, d2> and <a3, d2>.
30An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23; set for A: {a1, a3}.]
Move the cursor of stream TC from c1 to c2 and output the path <a3, b1, c1>.
31An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23; set for A: {a1, a3}.]
- Move the cursor of TD to the end and output the path solution <a1, d3>.
32An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23; set for A: {a1, a3}.]
- Move the cursor of TC to the end and output <a1, b2, c2>.
33An example for TJFast algorithm
[Figure: same query, document and streams as on slide 23; set for A: {a1, a3}.]
Now all five elements in the streams have been scanned; in the second phase we merge-join all output path solutions.
34An example for TJFast algorithm
[Figure: same query and document as on slide 23.]
Phase 1. Intermediate paths
A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
A/B//C: <a1, b2, c2>, <a3, b1, c1>
Phase 2. Final solutions, joined on A and listed as <A, D, B, C>
<a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
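Phase 2 for this example can be reproduced with a small join on the shared branching node A. The sketch below is an illustration only: it uses a dictionary lookup rather than the sort-merge join named on the slides, and the function `join_on_branching_node` is a hypothetical helper that assumes the shared node is the first component of every path solution.

```python
from collections import defaultdict


def join_on_branching_node(left, right):
    """Join two lists of path solutions that share their first component."""
    by_root = defaultdict(list)
    for path in right:
        by_root[path[0]].append(path[1:])
    # Pair every left path with the right-hand remainders under the same root.
    return [l + r for l in left for r in by_root.get(l[0], [])]


# Phase 1 output, copied from the slide.
a_d = [("a1", "d1"), ("a1", "d2"), ("a1", "d3"), ("a3", "d2")]
a_b_c = [("a1", "b2", "c2"), ("a3", "b1", "c1")]

final = join_on_branching_node(a_d, a_b_c)
for solution in final:
    print(solution)  # tuples in <A, D, B, C> order
```

The result is the four final solutions listed above. For twigs with several branching nodes the join key would be the whole shared prefix of the two query paths, which this sketch does not cover.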
36Experiments
- Benchmarks
- XMark: synthetic data
- DBLP: real data from the DBLP database
- Treebank: real data from the Wall Street Journal
                 XMark  DBLP   Treebank
Data size (MB)   582    130    82
Nodes (million)  8      3.3    2.4
Max/Avg depth    12/5   6/2.9  36/7.8
37Path query
We compared PathStack [1] and TJFast on the following four path queries on XMark data.
Query  Path query
PQ1    /site/closed_auctions/closed_auction/price
PQ2    /site/regions//item/location
PQ3    /site/people/person/gender
PQ4    /site/open_auctions/open_auction/reserve
38Experiments: Number of elements read and input file size for path queries
Observation: TJFast scans fewer elements than PathStack does.
Explanation: TJFast only scans labels for the leaf nodes of a query, whereas PathStack scans elements for all nodes in the query.
39Experiments: Execution time for path queries
Observation: TJFast performs better than PathStack on all four path queries.
Explanation: TJFast reduces I/O cost by reading fewer elements.
40Twig queries
We compared TwigStack, TwigStackList and TJFast on the following five twig queries on DBLP and TreeBank data.
Query  Source    Twig query
TQ1    DBLP      //proceedings//title[.//i]//sup
TQ2    DBLP      //article[.//sup]//title//sub
TQ3    Treebank  /S[.//VP/IN]//NP
TQ4    Treebank  /S/VP/PP[IN]/NP/VBN
TQ5    Treebank  //VP[DT]//PRP_DOLLAR_
41Experiments: Number of elements read and input file size for twig queries
Observation: TJFast scans far fewer elements than TwigStack and TwigStackList do in two twig queries.
Explanation: TJFast only scans elements for the leaf nodes of a query, whereas TwigStack/TwigStackList need to scan elements for all nodes, and the number of elements for non-leaf nodes is much larger than that for leaf nodes.
42Experiments: Execution time for twig queries
TW-SS and TJ-SS denote the sequential scan time of the input data for TwigStack/TwigStackList and TJFast, respectively.
Observation: For DBLP data, TJFast performs much better than TwigStack/TwigStackList.
Explanation: TJFast reduces I/O cost by reading fewer elements.
44Conclusions
- Efficient processing of twig queries is a core operation in XPath and XQuery
- We have proposed a new labeling scheme, extended Dewey, and a new holistic twig pattern matching algorithm, TJFast.
- Compared to previous work
- TJFast reduces the input I/O cost
- TJFast reduces the output I/O cost for intermediate results.
45Reference
- [1] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. ICDE 2002: 141-152. Proposes the StackTree algorithm.
- [2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002. Proposes the TwigStack algorithm.
- [3] T. Chen, J. Lu, and T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD, 2005. Proposes two new data streaming techniques.
- [4] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58, 2004. Proposes a new algorithm for XPath queries.
46Reference
- [5] H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed XML documents. VLDB 2003. Proposes the TSGeneric algorithm.
- [6] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004. Proposes the TwigStackList algorithm.
- [7] P. Rao and B. Moon. PRIX: Indexing and querying XML using Prüfer sequences. In ICDE, pages 288-300, 2004. Proposes the PRIX system.
- [8] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: A dynamic index method for querying XML data by tree structures. In SIGMOD, 2003. Proposes the ViST system.
- [9] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, and K. S. Beyer. Virtual Cursors for XML joins. In CIKM, pages 523-532, 2004. Proposes the Virtual Cursor algorithm.
47END
48Related work
- Comparison between Virtual Cursor (VC) (Yang et al., CIKM 2004) and our work
- Developed independently
- TJFast uses a finite state transducer; VC uses a path table
- The size of the path table depends on the number of distinct paths, whereas the size of the FST depends on the number of distinct element types.
- TJFast reduces the number of useless intermediate paths for queries with parent-child edges, but VC does not have this property
49Backup
[Figure: a twig query whose nodes include a, b, c, d and e, and a document with the elements a1, a2, b1, c1, c2, d1, e1, f1 and f2.]
TwigStackList outputs <a1, b1>, but TJFast does not output this path solution.
50Labels size
                      XMark  DBLP  TreeBank
Region encoding (MB)  71.9   21.6  23.3
Original Dewey (MB)   56.2   18.1  22.8
Extended Dewey (MB)   72.6   19.5  28.7
51Optimal query classes
- If an algorithm does not output any useless intermediate results for a query Q on any given document, we say that the algorithm is optimal for Q.
- If an algorithm has a larger optimal query class, it has a better ability to control the size of intermediate results.
52Optimal class of TJFast and TwigStack
Optimal query class
- TwigStack: all edges are ancestor-descendant relationships
- TJFast: all edges connecting branching nodes to their children are ancestor-descendant relationships
Even for non-optimal queries, TJFast usually outputs fewer useless intermediate paths than TwigStack does.
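The two class definitions can be checked mechanically on a twig. The sketch below assumes a hypothetical nested representation in which a twig node is a `(tag, children)` pair and each child entry is an `(axis, node)` pair; a branching node is taken to be a node with at least two children.

```python
def twig_edges(node):
    """Yield (number of children of the parent, axis) for each edge of the twig."""
    _tag, children = node
    for axis, child in children:
        yield len(children), axis
        yield from twig_edges(child)


def in_twigstack_optimal_class(twig):
    """Every edge must be ancestor-descendant."""
    return all(axis == "//" for _, axis in twig_edges(twig))


def in_tjfast_optimal_class(twig):
    """Only edges leaving a branching node (two or more children) are constrained."""
    return all(axis == "//" for fanout, axis in twig_edges(twig) if fanout >= 2)


# A/B with B//C and B//D: the parent-child edge leaves a non-branching node.
q1 = ("A", [("/", ("B", [("//", ("C", [])), ("//", ("D", []))]))])
# The query of the worked example: A//D and A/B//C; A is branching and has a "/" edge.
q2 = ("A", [("//", ("D", [])), ("/", ("B", [("//", ("C", []))]))])

print(in_twigstack_optimal_class(q1), in_tjfast_optimal_class(q1))  # False True
print(in_twigstack_optimal_class(q2), in_tjfast_optimal_class(q2))  # False False
```

Every query accepted by the first check is also accepted by the second, which is the sense in which the optimal class of TJFast is larger.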
53Update of XML documents
- In order to support updates of XML documents, we need to slightly modify the extended Dewey labeling scheme.
- Our idea comes from ORDPATH.
- We can avoid relabeling the document under any update.
P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-friendly XML node labels. In SIGMOD, pages 903-908, 2004.
54More examples for assigning labels
- Let us consider a more complicated DTD:
- a → (b | c), d?, c
- We define X_b mod 3 = 0, X_c mod 3 = 1, X_d mod 3 = 2
- (Why do we use mod 3 instead of 4?)
[Figure: element a (label ε) with children b (0), d (2), c (4) and c (7).]
55Computing cost of FST
- The CPU time of the FST is linear in the length of an extended Dewey label, and independent of the complexity of the schema definition.
- The main memory size of the FST is quadratic in the number of distinct element names in the XML documents, as the number of transitions in the FST is quadratic in the worst case.