From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching


1
From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching

Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen
National University of Singapore

2
Outline

Background
Define our problem: XML twig pattern matching
Previous work and problems
Our new twig matching algorithms
A new labeling scheme: extended Dewey
A new holistic algorithm: TJFast
Experimental results
Conclusion

3
XML basics

Short for Extensible Markup Language
A language for defining the syntax and semantics
of structured data
An XML document is commonly modeled as a rooted,
ordered and tagged tree.

[Figure: an example XML tree rooted at book, with preface, chapter, section, title and paragraph elements and text values such as "Intro", "Data" and "XML"]

4
Querying XML Data

Major standards for querying XML data: XPath and XQuery
XML twig pattern matching is a core operation in XPath and XQuery
Definition (XML twig pattern): An XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either parent-child edges or ancestor-descendant edges

5
An XML twig pattern example

Create a flat list of all the title-author pairs for every book in the bibliography.

XQuery:
  <results>
  {
    for $b in doc("bib.xml")/bib//book,
        $t in $b/title,
        $a in $b/author
    return <result> { $t } { $a } </result>
  }
  </results>
To answer the XQuery, we first need to match the following XML twig pattern:

[Figure: twig pattern with root bib; bib to book ($b) is an ancestor-descendant edge; book to title ($t) and book to author ($a) are parent-child edges]
6
Our research problem

Problem Statement
Given an XML twig pattern Q and an XML database D, we need to find ALL the matches of Q on D.
E.g., consider the following twig pattern and document:

[Figure: an XML tree with elements s1, s2, t1, t2, p1 and f1, and a twig pattern Section with branches Title and Figure]
Query answers: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
9
Outline

Background
Define our problem: XML twig pattern matching
Previous work and challenges
Our new twig matching algorithms
A new labeling scheme: extended Dewey
A new holistic algorithm: TJFast
Experiments
Conclusion

10
Related work

TreeMerge and Stack-Tree [Al-Khalifa, ICDE 2002]
  Stack-based binary join algorithms
  But they produce large intermediate results
TwigStack [Bruno, SIGMOD 2002]
  A holistic twig join algorithm
  Sub-optimal for queries with parent-child relationships
TwigStackList [Lu, CIKM 2004]
  A holistic twig join algorithm that produces fewer useless intermediate results than TwigStack does for queries with parent-child relationships

11
Our research goal

In this research, we want to design a new
holistic twig join algorithm which is more
efficient than previous work.
Two aspects to achieve this goal:
(1) Input: reduce the input I/O cost
(2) Output: reduce the size of intermediate results

12
Outline

Background
Define our problem: XML twig pattern matching
Previous work and challenges
Our new twig matching algorithms
A new labeling scheme: extended Dewey
A new holistic algorithm: TJFast
Experiments
Conclusion

13
Original Dewey Labeling Scheme

In the Dewey labeling scheme, each element is represented by a vector:
(i) the root is labeled by the empty string ε;
(ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
For example:

[Figure: s1 (ε) with children t1 (1), s2 (2) and f2 (3); the two children of s2, t2 and f1, are labeled 2.1 and 2.2]
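The two labeling rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the nested-tuple tree format, the function names, and the order of t2 and f1 under s2 are assumptions made for the example, not part of the original slides.

```python
def dewey_labels(tree, prefix=()):
    """Yield (name, label) pairs; a label is a tuple of 1-based child positions.

    The root gets the empty tuple, which plays the role of the empty string.
    """
    name, children = tree
    yield name, prefix
    for position, child in enumerate(children, start=1):
        yield from dewey_labels(child, prefix + (position,))


def is_ancestor(a, b):
    """a is a proper ancestor of b exactly when a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


# Example tree from the slide (child order under s2 is assumed).
doc = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
labels = dict(dewey_labels(doc))

for name, label in labels.items():
    print(name, ".".join(map(str, label)) or "(empty)")
```

With these labels, ancestor-descendant tests reduce to prefix tests, but a label by itself says nothing about tag names, which is the limitation the next slides address.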
14
Main problem of the original Dewey

If we use the original Dewey labeling scheme to answer a twig query, we need to read the labels for all query nodes. Thus, we gain no performance benefit compared to previous methods.
Our idea: extend the original Dewey labeling scheme so that, given the label of any element e, we can know the path of e from this label alone.

15
Modulo function

We need to know some schema information: a DTD (Document Type Definition) or an XML schema.
Given DTD information: book → author, title, chapter
Our solution: using a modulo function, we create a mapping between an element tag and an integer.
We define $X_{\text{author}} \bmod 3 = 0$, $X_{\text{title}} \bmod 3 = 1$, $X_{\text{chapter}} \bmod 3 = 2$,
where $X_t$ is the last component of the label of an element with tag t.

Why not 3, as in the original Dewey?
[Figure: book (ε) with children author (0), title (1), chapter (2) and chapter (5)]
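One simple assignment rule that reproduces the labels 0, 1, 2, 5 in the figure is: give each child the smallest integer that is larger than its left sibling's component and has the right remainder. The sketch below assumes this rule; the function name and argument layout are hypothetical, and the exact formula used by the authors may be stated differently.

```python
def extended_dewey_components(child_tags, tag_order):
    """Assign the last label component to each child of one parent.

    child_tags: tags of the children, in document order.
    tag_order:  distinct child tags allowed by the parent's DTD rule;
                a child with tag tag_order[k] needs component % n == k.
    """
    n = len(tag_order)
    components = []
    previous = -1  # no left sibling yet
    for tag in child_tags:
        k = tag_order.index(tag)
        x = previous + 1
        while x % n != k:  # smallest x > previous with the right remainder
            x += 1
        components.append(x)
        previous = x
    return components


print(extended_dewey_components(
    ["author", "title", "chapter", "chapter"],
    ["author", "title", "chapter"],
))  # [0, 1, 2, 5]
```

The second chapter cannot be labeled 3, because 3 mod 3 = 0 would decode to author; 5 is the next integer whose remainder is 2.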
16
Derive element tag

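From a label, we can derive its tag name.
DTD: book → author, title, chapter
Recall that we define $X_{\text{author}} \bmod 3 = 0$, $X_{\text{title}} \bmod 3 = 1$, $X_{\text{chapter}} \bmod 3 = 2$.

[Figure: book (ε) with four children labeled 0, 1, 2 and 5; the tag of each child (author, title, chapter, chapter) is to be derived from its label]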
17
Derive the path from a label

By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.
For example:

DTD: book → author, title, chapter;  chapter → (paragraph | section);  section → (paragraph | section)
FST transitions:
  from book: mod 3 = 0 → author, mod 3 = 1 → title, mod 3 = 2 → chapter
  from chapter: mod 2 = 0 → paragraph, mod 2 = 1 → section
  from section: mod 2 = 0 → paragraph, mod 2 = 1 → section
[Figure: FST diagram and example document tree (book with author, title, chapter, section and paragraph elements)]
Question: Given a label 5.1.0 for an element, what is the corresponding path?
18
Derive the path from a label

By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.
For example:

DTD: book → author, title, chapter;  chapter → (paragraph | section);  section → (paragraph | section)
[Figure: same FST and document as on the previous slide, with the transitions for 5.1.0 highlighted in red]
Following the red path, we find that 5.1.0 denotes book/chapter/section/paragraph.
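The decoding step can be sketched as a table lookup per label component. The dictionary below encodes the child-tag order from the slide's DTD; the names CHILD_TAGS and decode_path are hypothetical, and the sketch assumes every tag on the path has a rule in the table.

```python
# Child tags per parent tag, in the order used for the modulo assignment.
CHILD_TAGS = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def decode_path(label, root_tag="book"):
    """Return the list of tags from the root to the element with this label.

    label is a dotted extended Dewey label; the empty string is the root.
    """
    path = [root_tag]
    if label == "":
        return path
    for component in label.split("."):
        tags = CHILD_TAGS[path[-1]]            # transitions out of the current state
        path.append(tags[int(component) % len(tags)])
    return path


print("/".join(decode_path("5.1.0")))  # book/chapter/section/paragraph
```

Each component costs one modulo and one lookup, so decoding is linear in the length of the label.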
19
Two properties of extended Dewey

Find Ancestor Label
From the label of any element, we can derive the labels of all its ancestors.
Find Ancestor Name
From the label of any element, we can derive the tag names of all its ancestors.
These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
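Both properties can be illustrated together: ancestor labels are the proper prefixes of a label, and ancestor names fall out of decoding those prefixes. The sketch below reuses the book DTD from the earlier slides; the function name and table layout are assumptions for illustration.

```python
CHILD_TAGS = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def ancestors(label, child_tags=CHILD_TAGS, root_tag="book"):
    """Return [(ancestor_label, ancestor_tag)] from the root down to the parent."""
    components = label.split(".") if label else []
    result = []
    tag = root_tag
    for i, component in enumerate(components):
        # The first i components form the label of the ancestor at depth i.
        result.append((".".join(components[:i]), tag))
        tags = child_tags[tag]
        tag = tags[int(component) % len(tags)]
    return result


for anc_label, anc_tag in ancestors("5.1.0"):
    print(repr(anc_label), anc_tag)
```

For the label 5.1.0 this lists the root ('' , book), then 5 (chapter), then 5.1 (section), without reading any other label from disk.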

20
Outline

Background
Define our problem: XML twig pattern matching
Previous work and challenges
Our new twig matching algorithms
A new labeling scheme: extended Dewey
A new holistic algorithm: TJFast (a Fast Twig Join algorithm)
Experiments
Conclusion

21
A new algorithm TJFast

For each node n in the query, there exists a corresponding input stream Tn.
Tn contains the extended Dewey labels of the elements with tag n, arranged in document order.
For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that are possibly involved in query answers. (What is the difference compared to TwigStack?)
At any point of the computation, the size of set Sb is bounded by the depth of the XML document.

22
A new algorithm TJFast

Two-phase algorithm
Phase 1: parts of intermediate root-leaf paths are output
  Insert elements that are possibly involved in query answers into the sets
  Output intermediate paths according to the elements in the sets
Phase 2: the intermediate paths are merge-joined to get the final results

23
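An example for TJFast algorithm

[Figure: document tree, twig query, and an initially empty set for the branching node A]
Document (extended Dewey labels in parentheses, under a virtual root ε):
a1 (0)
  a2 (0.0)
    d1 (0.0.1)
  a3 (0.3)
    d2 (0.3.1)
    b1 (0.3.2)
      c1 (0.3.2.1)
  b2 (0.5)
    d3 (0.5.0)
      c2 (0.5.0.0)
Query: A with branches D and B; B has child C
DTD: a → a, d, b;  b → d, c;  d → c
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
Why do we not need TA, TB streams?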
24
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Derive, by the finite state transducer of the extended Dewey labeling scheme:
0.0.1 → a1/a2/d1
0.3.2.1 → a1/a3/b1/c1
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
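The derive step, followed by a naive match of one root-to-leaf query path, can be sketched as below. This is not TJFast itself: it enumerates every binding of a single query path against a decoded tag path and leaves out the set-based pruning for branching nodes. The "#root" entry (a virtual root whose only child tag is a), the function names, and the (axis, tag) query encoding are assumptions for the example.

```python
# Child tags per parent, in DTD order (a -> a, d, b; b -> d, c; d -> c).
CHILD_TAGS = {"#root": ["a"], "a": ["a", "d", "b"], "b": ["d", "c"], "d": ["c"]}


def decode_tags(label):
    """Derive the root-to-element tag path from a dotted extended Dewey label."""
    tags, parent = [], "#root"
    for component in label.split("."):
        options = CHILD_TAGS[parent]
        parent = options[int(component) % len(options)]
        tags.append(parent)
    return tags


def match_path(query, tags):
    """Enumerate bindings of a query path to positions in a tag path.

    query: list of (axis, tag); axis "/" or "//" describes the edge from the
    previous query node. The first node may bind anywhere; the last query
    node must bind to the last tag (the element the label belongs to).
    """
    results = []

    def extend(qi, pos, binding):
        if qi == len(query):
            if pos == len(tags) - 1:
                results.append(tuple(binding))
            return
        axis, tag = query[qi]
        if qi == 0:
            candidates = range(len(tags))
        elif axis == "/":
            candidates = range(pos + 1, min(pos + 2, len(tags)))
        else:
            candidates = range(pos + 1, len(tags))
        for p in candidates:
            if tags[p] == tag:
                extend(qi + 1, p, binding + [p])

    extend(0, -1, [])
    return results


A_D = [("//", "a"), ("//", "d")]
A_B_C = [("//", "a"), ("/", "b"), ("//", "c")]
print(match_path(A_D, decode_tags("0.0.1")))      # [(0, 2), (1, 2)]
print(match_path(A_B_C, decode_tags("0.3.2.1")))  # [(1, 2, 3)]
```

Position p in a binding corresponds to the element whose label is the first p + 1 components, so (0, 2) for 0.0.1 is the pair a1, d1 and (1, 2) is a2, d1. The naive match keeps a2, whereas the walkthrough on the following slides never inserts a2 into the set for A, so TJFast does not output that path.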
25
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Both a1 and a3 are possibly involved in query answers. (Why not a2?)
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
26
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Then we insert a1 into the set, since a1 is an ancestor of a3.
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
27
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Set for A: a1
Move the cursor of TD from d1 to d2 and output one path solution <a1, d1>.
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
28
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Set for A: a1, a3
Derive: 0.3.1 → a1/a3/d2
We insert a3 into the set, since a3 is definitely involved in query answers.
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
29
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Set for A: a1, a3
Move the cursor of stream TD from d2 to d3 and output <a1, d2> and <a3, d2>.
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
30
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Set for A: a1, a3
Move the cursor of stream TC from c1 to c2 and output the path <a3, b1, c1>.
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
31
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Set for A: a1, a3
Move the cursor of TD to the end and output the path solution <a1, d3>.
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
32
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Set for A: a1, a3
Move the cursor of TC to the end and output <a1, b2, c2>.
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
33
An example for TJFast algorithm

[Figure: same document tree and query as on the previous slide]
Set for A: a1, a3
Now all five elements have been scanned. In the second phase we merge-join all output path solutions.
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
34
An example for TJFast algorithm

[Figure: document tree (a1, a2, a3, b1, b2, c1, c2, d1, d2, d3) and query (A with branches D and B; B has child C)]
Phase 1. Intermediate paths
A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
A/B//C: <a1, b2, c2>, <a3, b1, c1>
Phase 2. Final solutions (join), as <A, D, B, C>
<a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
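The second phase for this example can be sketched as a join of the two path lists on the binding of the branching node A. For brevity the sketch groups one side in a dictionary (a hash join) rather than performing the merge-join named on the slides; the function name and tuple layout are assumptions.

```python
from collections import defaultdict


def join_on_branching_node(paths_ad, paths_abc):
    """Join <A, D> paths with <A, B, C> paths that share the same A element."""
    by_a = defaultdict(list)
    for a, b, c in paths_abc:
        by_a[a].append((b, c))
    return [(a, d, b, c) for a, d in paths_ad for b, c in by_a.get(a, [])]


paths_ad = [("a1", "d1"), ("a1", "d2"), ("a1", "d3"), ("a3", "d2")]
paths_abc = [("a1", "b2", "c2"), ("a3", "b1", "c1")]

for solution in join_on_branching_node(paths_ad, paths_abc):
    print(solution)
```

The output is the four final solutions listed on the slide, in the same order.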
35
Outline

Background
Define our problem: XML twig pattern matching
Previous work and challenges
Our new twig matching algorithms
A new labeling scheme: extended Dewey
A new holistic algorithm: TJFast
Experimental results
Conclusion

36
Experiments

Benchmarks
XMark: synthetic data
DBLP: real data from the DBLP database
Treebank: real data from the Wall Street Journal

                  XMark   DBLP    Treebank
Data size (MB)    582     130     82
Nodes (million)   8       3.3     2.4
Max/Avg depth     12/5    6/2.9   36/7.8
37
Path query
We compared PathStack1 and TJFast on the
following four path queries on XMark data.
Path queries:
PQ1: /site/closed_auctions/closed_auction/price
PQ2: /site/regions//item/location
PQ3: /site/people/person/gender
PQ4: /site/open_auctions/open_auction/reserve
38
Experiments: Number of elements read and input file size for path queries
Observation: TJFast scans fewer elements than PathStack does.
Explanation: TJFast only scans labels for leaf nodes in queries, but PathStack scans elements for all nodes in the query.
39
Experiments: Execution time for path queries
Observation: TJFast performs better than PathStack for all four path queries.
Explanation: TJFast reduces I/O cost by reading fewer elements.
40
Twig queries
We compared TwigStack, TwigStackList and TJFast on the following five twig queries on DBLP and TreeBank data.

Query   Source     Twig query
TQ1     DBLP       //proceedings//title[.//i]//sup
TQ2     DBLP       //article[.//sup]//title//sub
TQ3     Treebank   /S[.//VP/IN]//NP
TQ4     Treebank   /S/VP/PP[IN]/NP/VBN
TQ5     Treebank   //VP[DT]//PRP_DOLLAR_
41
Experiments: Number of elements read and input file size for twig queries
Observation: TJFast scans far fewer elements than TwigStack and TwigStackList do in two twig queries.
Explanation: TJFast only scans elements for leaf nodes in queries, whereas TwigStack/TwigStackList need to scan elements for all nodes, and the number of elements for non-leaf nodes is much larger than that for leaf nodes.
42
Experiments: Execution time for twig queries
TW-SS and TJ-SS denote the sequential scan time of input data for TwigStack/TwigStackList and TJFast, respectively.
Observation: For DBLP data, TJFast performs much better than TwigStack/TwigStackList.
Explanation: TJFast reduces I/O cost by reading fewer elements.
43
Outline

Background
Define our problem: XML twig pattern matching
Previous work and challenges
Our new twig matching algorithms
A new labeling scheme: extended Dewey
A new holistic algorithm: TJFast
Experimental results
Conclusion

44
Conclusions

Efficient processing of twig queries is a core operation in XPath and XQuery.
We have proposed a new labeling scheme, extended Dewey, and a new holistic twig pattern matching algorithm, TJFast.
Compared to previous work:
TJFast reduces the input I/O cost
TJFast reduces the output I/O cost for intermediate results

45
Reference

[1] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y. Wu, N. Koudas, D. Srivastava. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. ICDE 2002, pages 141-152.
    Proposes the StackTree algorithm
[2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
    Proposes the TwigStack algorithm
[3] T. Chen, J. Lu, and T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD, 2005.
    Proposes two new data streaming techniques
[4] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58, 2004.
    Proposes a new algorithm for XPath queries

46
Reference

[5] H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed XML documents. VLDB 2003.
    Proposes the TSGeneric algorithm
[6] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.
    Proposes the TwigStackList algorithm
[7] P. Rao and B. Moon. PRIX: Indexing and querying XML using Prüfer sequences. In ICDE, pages 288-300, 2004.
    Proposes the PRIX system
[8] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: A dynamic index method for querying XML data by tree structures. In SIGMOD, 2003.
    Proposes the ViST system
[9] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, and K. S. Beyer. Virtual Cursors for XML joins. CIKM, pages 523-532, 2004.
    Proposes the Virtual Cursor algorithm

47
END

Thank you!
Q & A

48
Related work

Comparison between Virtual Cursor (VC) [Yang, CIKM 2004] and our work:
Developed independently
TJFast uses a finite state transducer; VC uses a path table
The size of the path table depends on the number of distinct paths, whereas that of the FST depends on the number of distinct element types.
TJFast reduces the number of useless intermediate paths for queries with parent-child edges; VC does not have this property.

49
Backup
[Figure: query tree with nodes a, b, c, d and e, and document tree with elements a1, a2, b1, c1, c2, d1, e1, f1 and f2]
TwigStackList outputs <a1, b1>, but TJFast does not output this path solution.
50
Labels size
                       XMark   DBLP   TreeBank
Region encoding (MB)   71.9    21.6   23.3
Original Dewey (MB)    56.2    18.1   22.8
Extended Dewey (MB)    72.6    19.5   28.7
51
Optimal query classes

If an algorithm does not output any useless intermediate results for a query Q on any given document, we say that the algorithm is optimal for Q.
If an algorithm has a larger optimal query class, it is better able to control the size of intermediate results.

52
Optimal class of TJFast and TwigStack
Algorithm   Optimal query class
TwigStack   All edges are ancestor-descendant relationships
TJFast      All edges connecting branching nodes and their children are ancestor-descendant relationships
Even for non-optimal queries, TJFast usually outputs fewer useless intermediate paths than TwigStack does.
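The two optimal classes in the table can be checked mechanically on a query tree. The sketch below assumes a query is given as a dictionary from each node to its list of (axis, child) edges, and treats a node with more than one child as a branching node; both function names are hypothetical.

```python
def twigstack_optimal(query):
    """True if every edge in the query is ancestor-descendant ("//")."""
    return all(axis == "//" for edges in query.values() for axis, _ in edges)


def tjfast_optimal(query):
    """True if every edge leaving a branching node is ancestor-descendant."""
    return all(
        axis == "//"
        for edges in query.values()
        if len(edges) > 1          # only branching nodes are constrained
        for axis, _ in edges
    )


# A//D, A//B, B/C: the only parent-child edge is below a non-branching node.
q1 = {"A": [("//", "D"), ("//", "B")], "B": [("/", "C")], "C": [], "D": []}
# A//D, A/B, B//C (the query of the walkthrough): A is branching and has a "/" edge.
q2 = {"A": [("//", "D"), ("/", "B")], "B": [("//", "C")], "C": [], "D": []}

print(twigstack_optimal(q1), tjfast_optimal(q1))  # False True
print(twigstack_optimal(q2), tjfast_optimal(q2))  # False False
```

Every query accepted by twigstack_optimal is also accepted by tjfast_optimal, which is the sense in which TJFast's optimal class is larger.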
53
Update of XML documents

In order to support updates of XML documents, we need to slightly modify the extended Dewey labeling scheme.
Our idea comes from ORDPATH.
We can avoid relabeling the document under any update.

P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-friendly XML node labels. In SIGMOD, pages 903-908, 2004.
54
More examples for assigning labels

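Let us consider a more complicated DTD:
a → (b | c), d?, c
We define $X_b \bmod 3 = 0$, $X_c \bmod 3 = 1$, $X_d \bmod 3 = 2$.
(Why do we use mod 3 instead of 4?)

[Figure: a (ε) with four children, tagged b, c, c and d, carrying the labels 0, 2, 4 and 7]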
55
Computing cost of FST

The CPU time complexity of the FST is linear in the length of an extended Dewey label, and independent of the complexity of the schema definition.
The main memory size of the FST is quadratic in the number of distinct element names in the XML documents, as the number of transitions in the FST is quadratic in the worst case.
