From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching - PowerPoint PPT Presentation

From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen – PowerPoint PPT presentation

1
From Region Encoding To Extended Dewey On
Efficient Processing of XML Twig Pattern Matching
  • Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting
    Chen
  • National University of Singapore

2
Outline
  • Background
  • Defining our problem: XML twig pattern matching
  • Previous work and problems
  • Our new twig matching algorithms
    • A new labeling scheme: extended Dewey
    • A new holistic algorithm: TJFast
  • Experimental results
  • Conclusion

3
XML basics
  • Short for Extensible Markup Language
  • A language for defining the syntax and semantics
    of structured data
  • An XML document is commonly modeled as a rooted,
    ordered and tagged tree.

[Figure: an example XML document tree with element tags book, preface, chapter, section, title and paragraph, and text values such as "Intro", "Data" and "XML".]


4
Querying XML Data
  • Major standards for querying XML data: XPath and XQuery
  • XML twig pattern matching is a core operation in
    XPath and XQuery
  • Definition: an XML twig pattern is a small tree
    whose nodes are tags, attributes or text values,
    and whose edges are either parent-child edges or
    ancestor-descendant edges

5
An XML twig pattern example
  • Create a flat list of all the title-author pairs
    for every book in bibliography.

XQuery:
  <results>
  {
    for $b in doc("bib.xml")/bib//book,
        $t in $b/title,
        $a in $b/author
    return <result>{ $t } { $a }</result>
  }
  </results>
To answer the XQuery, we first need to match the
following XML twig pattern:

[Figure: twig pattern with root bib, an ancestor-descendant edge from bib to book ($b), and parent-child edges from book to title ($t) and to author ($a).]
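
As a rough illustration of the twig pattern just described, the sketch below models a query node with its tag and the axis of the edge from its parent, then lists the root-to-leaf paths of the twig. The names `QueryNode` and `root_to_leaf_paths` are hypothetical and do not come from the slides; this is only one possible in-memory representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryNode:
    """One node of a twig pattern (hypothetical representation)."""
    tag: str
    axis: str = "/"  # edge from the parent: "/" parent-child, "//" ancestor-descendant
    children: List["QueryNode"] = field(default_factory=list)


def root_to_leaf_paths(node: QueryNode, prefix: str = "") -> List[str]:
    """Return every root-to-leaf path of the twig as an XPath-like string."""
    here = prefix + node.axis + node.tag if prefix else node.tag
    if not node.children:
        return [here]
    paths: List[str] = []
    for child in node.children:
        paths.extend(root_to_leaf_paths(child, here))
    return paths


# The twig pattern of the bibliography example: bib//book with children title and author.
twig = QueryNode("bib", children=[
    QueryNode("book", "//", [QueryNode("title"), QueryNode("author")]),
])
print(root_to_leaf_paths(twig))
```

Decomposing a twig into root-to-leaf paths is also how the two-phase algorithm later in the talk organizes its intermediate results.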
6
Our research problem
  • Problem Statement
  • Given an XML twig pattern Q, and an XML database
    D, we need to find ALL the matches of Q on D.
  • E.g. Consider the following twig pattern and
    document


[Figure: an XML tree with elements s1, t1, s2, p1, t2 and f1, and a twig pattern with nodes Section, Title and Figure.]
Query answers: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
10
Related work
  • TreeMerge and Stack-Tree [Al-Khalifa, ICDE 2002]
    • A stack-based binary join algorithm
    • But it produces large intermediate results
  • TwigStack [Bruno, SIGMOD 2002]
    • A holistic twig join algorithm
    • Sub-optimal for queries with parent-child
      relationships
  • TwigStackList [Lu, CIKM 2004]
    • A holistic twig join algorithm that produces
      fewer useless intermediate results than
      TwigStack for queries with parent-child
      relationships

11
Our research goal
  • In this research, we want to design a new
    holistic twig join algorithm that is more
    efficient than previous work.
  • Two ways to achieve this goal:
    • (1) Input: reduce the input I/O cost
    • (2) Output: reduce the size of intermediate
      results

13
Original Dewey Labeling Scheme
  • In the Dewey labeling scheme, each element is
    represented by a vector:
  • (i) the root is labeled by the empty string ε
  • (ii) for a non-root element u, label(u) =
    label(s).x, where u is the x-th child of s.
  • For example

[Figure: an example tree with elements s1, s2, t1, t2, f1 and f2, annotated with the Dewey labels ε, 1, 2, 3, 2.1 and 2.2.]
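
The two labeling rules can be sketched in a few lines of Python. The tree shape below is an assumption chosen to be consistent with the labels listed on the slide (ε, 1, 2, 3, 2.1, 2.2); the nested-tuple encoding and the function name `dewey_labels` are illustrative only.

```python
def dewey_labels(tree, label=()):
    """Yield (element name, Dewey label) pairs in document order.

    `tree` is a (name, [children]) tuple. The root gets the empty label,
    shown as "ε"; the x-th child of s gets label(s).x, counting from 1.
    """
    name, children = tree
    yield name, ".".join(map(str, label)) or "ε"
    for x, child in enumerate(children, start=1):
        yield from dewey_labels(child, label + (x,))


# Assumed tree shape (not confirmed by the slide): s1 has children t1, s2, f2,
# and s2 has children t2, f1.
doc = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
for name, label in dewey_labels(doc):
    print(name, label)
```

Each label is the parent's label plus one more component, so an ancestor's label is always a proper prefix of its descendants' labels.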
14
Main problem of the original Dewey
  • If we use the original Dewey labeling scheme to
    answer a twig query, we need to read labels for
    all query nodes. Thus, we gain no performance
    benefit compared to previous methods.
  • Our idea: extend the original Dewey labeling
    scheme so that, given the label of any element e,
    we can derive the path from the root to e from
    this label alone.

15
Modulo function
  • We need some schema information: a DTD
    (Document Type Definition) or an XML Schema
  • Given the DTD information book → author, title,
    chapter
  • Our solution: using a modulo function, we create a
    mapping between element tags and integers.
  • We define X_author mod 3 = 0, X_title mod 3 = 1,
    X_chapter mod 3 = 2,
  • where X_t is the last component of the label
    of an element with tag t.


[Figure: root book (label ε) with four children labeled 0 (author), 1 (title), 2 (chapter) and 5 (chapter).]
Why is the second chapter labeled 5 and not 3, as sequential numbering in the original Dewey scheme would give?
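
A minimal sketch of how such label components could be assigned, under one assumption that the slide does not state explicitly: each child receives the smallest integer that is larger than its left sibling's component and has the residue required by its tag. This assumption reproduces the labels 0, 1, 2, 5 from the figure. The names `CHILD_TAGS` and `extended_dewey_components` are hypothetical.

```python
# Child tags allowed under each parent tag, in the order used for the modulo mapping.
CHILD_TAGS = {"book": ["author", "title", "chapter"]}


def extended_dewey_components(parent_tag, child_tags, child_order=CHILD_TAGS):
    """Return the last label component of each child, given in document order.

    Assumed rule: pick the smallest integer x greater than the left sibling's
    component such that x mod n == k, where n is the number of possible child
    tags of the parent and k is the index of the child's tag among them.
    """
    tags = child_order[parent_tag]
    n = len(tags)
    components = []
    previous = -1  # so that the first child may receive 0
    for tag in child_tags:
        k = tags.index(tag)
        x = previous + 1
        while x % n != k:
            x += 1
        components.append(x)
        previous = x
    return components


print(extended_dewey_components("book", ["author", "title", "chapter", "chapter"]))
```

Under this rule the second chapter cannot be labeled 3, because 3 mod 3 = 0 would decode to author; the next integer with residue 2 is 5.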
16
Derive element tag
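  • From a label, we can derive the tag name of the element.
  • book → author, title, chapter
  • Recall that we define X_author mod 3 = 0,
    X_title mod 3 = 1, X_chapter mod 3 = 2.

[Figure: root book (label ε) with four children labeled 0, 1, 2 and 5, whose tags are to be derived; by the modulo rule they are author, title, chapter and chapter.]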
17
Derive the path from a label
  • By following a finite state transducer (FST), we
    can recursively derive the whole path from any
    extended Dewey label.
  • For example:

DTD: book → author, title, chapter; chapter →
(paragraph | section); section → (paragraph |
section)
[Figure: the FST for this DTD. From book: mod 3 = 0 leads to author, mod 3 = 1 to title, mod 3 = 2 to chapter. From chapter and from section: mod 2 = 0 leads to paragraph, mod 2 = 1 to section. Next to it, an example document rooted at book.]
Question: given the label 5.1.0 of an element,
what is the corresponding path?
18
Derive the path from a label
  • By following a finite state transducer (FST), we
    can recursively derive the whole path from any
    extended Dewey label.
  • For example:

[Figure: the same DTD, FST and document as on the previous slide, with the transitions for 5.1.0 highlighted in red.]
Following the red path in the FST, the label 5.1.0
denotes the path book/chapter/section/paragraph
(5 mod 3 = 2, 1 mod 2 = 1, 0 mod 2 = 0).
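
The path derivation can be sketched with a plain dictionary standing in for the FST: each state (a tag) maps to the list of child tags, indexed by residue. The dictionary layout and the function name `derive_path` are illustrative assumptions, not the authors' data structure.

```python
# Transitions of the example FST: for a state with n child tags,
# component x leads to child_tags[x mod n].
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def derive_path(label, root_tag="book", fst=FST):
    """Translate an extended Dewey label such as "5.1.0" into a tag path."""
    path = [root_tag]
    if label:  # the empty label denotes the root itself
        for component in label.split("."):
            child_tags = fst[path[-1]]
            path.append(child_tags[int(component) % len(child_tags)])
    return "/".join(path)


print(derive_path("5.1.0"))  # book/chapter/section/paragraph
```

The loop does one modulo operation and one lookup per label component, which matches the claim later in the talk that the cost is linear in the label length.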
19
Two properties of extended Dewey
  • Find ancestor labels
    • From the label of any element, we can derive the
      labels of all its ancestors.
  • Find ancestor names
    • From the label of any element, we can derive the
      tag names of all its ancestors.
  • These two properties enable us to design a new and
    efficient algorithm for XML twig pattern
    matching.
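
Both properties can be illustrated together: ancestor labels are the proper prefixes of a label, and ancestor names follow from the modulo transitions. The sketch below reuses the book/chapter/section example DTD; the names `FST` and `ancestors` are hypothetical.

```python
# Child tags per parent tag for the example DTD, indexed by residue.
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def ancestors(label, root_tag="book", fst=FST):
    """Return [(ancestor label, ancestor tag), ...] from the root down to the parent."""
    components = label.split(".") if label else []
    result = []
    tag = root_tag
    for depth, component in enumerate(components):
        # The ancestor at this depth is labeled by the first `depth` components.
        result.append((".".join(components[:depth]) or "ε", tag))
        child_tags = fst[tag]
        tag = child_tags[int(component) % len(child_tags)]
    return result


print(ancestors("5.1.0"))
```

For the label 5.1.0 this lists the root book (ε), the chapter labeled 5 and the section labeled 5.1, without reading any label other than 5.1.0 itself.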

20
Outline
  • Background
  • Defining our problem: XML twig pattern matching
  • Previous work and challenges
  • Our new twig matching algorithms
    • A new labeling scheme: extended Dewey
    • A new holistic algorithm: TJFast (a Fast Twig
      Join algorithm)
  • Experiments
  • Conclusion

21
A new algorithm TJFast
  • For each node n in the query, there exists a
    corresponding input stream T_n.
  • T_n contains the extended Dewey labels of the
    elements with tag n, arranged in document order.
  • For each branching node b of the twig pattern,
    there is a corresponding set S_b, which contains
    elements that may be involved in query answers.
    (What is the difference compared to TwigStack?)
  • At any point of the computation, the size of S_b
    is bounded by the depth of the XML document.
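
One way to arrange a stream in document order is to compare labels component by component as integers, as in the sketch below. It assumes that sibling components increase from left to right, which holds in the examples of this talk. The function names are hypothetical, and the label 0.10.2 is a made-up value added only to show why plain string comparison would not work.

```python
def document_order_key(label):
    """Sort key for a Dewey-style label: a tuple of integer components.

    Tuple comparison places an ancestor (a proper prefix) before its
    descendants and orders siblings by their last component.
    """
    return tuple(int(component) for component in label.split("."))


def build_stream(labels):
    """Return the labels of one stream T_n sorted in document order."""
    return sorted(labels, key=document_order_key)


print(build_stream(["0.5.0", "0.10.2", "0.3.1", "0.0.1"]))
```

Sorting the same labels as strings would place 0.10.2 before 0.3.1, which is why the key converts every component to an integer first.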

22
A new algorithm TJFast
  • Two-phase algorithm
  • Phase 1: intermediate root-to-leaf paths are
    output
    • Insert elements that may be involved in query
      answers into the sets
    • Output intermediate paths according to the
      elements in the sets
  • Phase 2: the intermediate paths are merge-joined
    to get the final results

23
An example for TJFast algorithm
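A set S_A is kept for the branching node A.
Query: A with an ancestor-descendant edge to D and a parent-child edge to B; B has an ancestor-descendant edge to C (that is, A//D and A/B//C).
DTD: a → a, d, b;  b → d, c;  d → c
Document (element: extended Dewey label):
  document root: ε
    a1: 0
      a2: 0.0
        d1: 0.0.1
      a3: 0.3
        d2: 0.3.1
        b1: 0.3.2
          c1: 0.3.2.1
      b2: 0.5
        d3: 0.5.0
          c2: 0.5.0.0
Input streams:
  T_D: 0.0.1, 0.3.1, 0.5.0
  T_C: 0.3.2.1, 0.5.0.0
Why do we not need the T_A and T_B streams?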
24
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23.]
Using the finite state transducer of the extended Dewey
labeling scheme, we derive the paths of the first
element in each stream:
  0.0.1 → a1/a2/d1
  0.3.2.1 → a1/a3/b1/c1
25
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23.]
Both a1 and a3 may be involved in query answers.
(Why not a2?)
26
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23.]
We then insert a1 into the set, since a1 is an
ancestor of a3.
27
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23. Set S_A = {a1}.]
Move the cursor of T_D from d1 to d2 and output
one path solution <a1, d1>.
28
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23. Set S_A = {a1, a3}.]
Derive: 0.3.1 → a1/a3/d2
We insert a3 into the set, since a3 is definitely
involved in query answers.
29
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23. Set S_A = {a1, a3}.]
Move the cursor of stream T_D from d2 to d3 and
output <a1, d2> and <a3, d2>.
30
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23. Set S_A = {a1, a3}.]
Move the cursor of stream T_C from c1 to c2 and
output the path <a3, b1, c1>.
31
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23. Set S_A = {a1, a3}.]
  • Move the cursor of T_D to the end and output the
    path solution <a1, d3>.
32
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23. Set S_A = {a1, a3}.]
  • Move the cursor of T_C to the end and output
    <a1, b2, c2>.
33
An example for TJFast algorithm
[Figure: document, query and streams T_D, T_C as on slide 23. Set S_A = {a1, a3}.]
Now all five elements have been scanned. In
the second phase, we merge-join all output path
solutions.
34
An example for TJFast algorithm
[Figure: the query (A, with D and B as children, and C below B) and the document tree (a1, a2, a3, b1, b2, c1, c2, d1, d2, d3).]
Phase 1. Intermediate paths:
  A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
  A/B//C: <a1, b2, c2>, <a3, b1, c1>
Phase 2. Final solutions, obtained by joining the paths, as <A, D, B, C>:
  <a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
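
The second phase of this example can be reproduced with a short sketch that joins the two lists of path solutions on the branching node A. For brevity it uses a dictionary lookup (a hash join) instead of the merge-join named in the slides; the result is the same for this input. The function name `join_on_branching_node` is hypothetical.

```python
from collections import defaultdict


def join_on_branching_node(ad_paths, abc_paths):
    """Join <A, D> paths with <A, B, C> paths on their shared A element.

    Simplification: a hash join on A rather than a merge-join over
    sorted path lists.
    """
    by_a = defaultdict(list)
    for a, b, c in abc_paths:
        by_a[a].append((b, c))
    return [(a, d, b, c) for a, d in ad_paths for b, c in by_a.get(a, [])]


# Intermediate path solutions produced in Phase 1 of the example.
ad_paths = [("a1", "d1"), ("a1", "d2"), ("a1", "d3"), ("a3", "d2")]
abc_paths = [("a1", "b2", "c2"), ("a3", "b1", "c1")]

for solution in join_on_branching_node(ad_paths, abc_paths):
    print(solution)
```

The four printed tuples correspond to the four final solutions listed for Phase 2.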
36
Experiments
  • Benchmarks
    • XMark: synthetic data
    • DBLP: real data from the DBLP database
    • Treebank: real data from the Wall Street Journal

                  XMark   DBLP    Treebank
Data size (MB)    582     130     82
Nodes (million)   8       3.3     2.4
Max/Avg depth     12/5    6/2.9   36/7.8
37
Path query
We compared PathStack1 and TJFast on the
following four path queries on XMark data.
Path Queries
PQ1 /site/closed-auctions/closed_auction/price
PQ2 /site/regions//item/location
PQ3 /site/people/person/gender
PQ4 /site/open_auctions/open_auction/reserve
38
Experiments: number of elements read and input
file size for path queries
Observation: TJFast scans fewer elements than
PathStack does.
Explanation: TJFast only scans labels for the leaf
nodes of a query, whereas PathStack scans elements
for all nodes of the query.
39
Experiments: execution time for path queries
Observation: TJFast performs better than PathStack
on all four path queries.
Explanation: TJFast reduces I/O cost by reading
fewer elements.
40
Twig queries
We compared TwigStack, TwigStackList and TJFast
on the following five twig queries on DBLP and
TreeBank data.
Query   Source     Twig query
TQ1     DBLP       //proceedings//title.//i//sup
TQ2     DBLP       //article.//sup//title//sub
TQ3     Treebank   /S.//VP/IN//NP
TQ4     Treebank   /S/VP/PPIN/NP/VBN
TQ5     Treebank   //VPDT//PRP_DOLLAR_
41
Experiments: number of elements read and input
file size for twig queries
Observation: TJFast scans far fewer elements than
TwigStack and TwigStackList do in two of the twig
queries.
Explanation: TJFast only scans elements for the
leaf nodes of a query, whereas
TwigStack/TwigStackList need to scan elements
for all nodes, and there are many more elements
for the non-leaf nodes than for the leaf nodes.
42
Experiments: execution time for twig queries
TW-SS and TJ-SS denote the sequential scan time
of the input data for TwigStack/TwigStackList and
TJFast, respectively.
Observation: for DBLP data, TJFast performs much
better than TwigStack/TwigStackList.
Explanation: TJFast reduces I/O cost by reading
fewer elements.
44
Conclusions
  • Efficient processing of twig queries is a core
    operation in XPath and XQuery
  • We have proposed a new labeling scheme, extended
    Dewey, and a new holistic twig pattern matching
    algorithm, TJFast.
  • Compared to previous work:
    • TJFast reduces the input I/O cost
    • TJFast reduces the output I/O cost for
      intermediate results.

45
Reference
  • [1] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y.
    Wu, N. Koudas, D. Srivastava. Structural joins: a
    primitive for efficient XML query pattern
    matching. In ICDE, pages 141-152, 2002.
    • Proposes the StackTree algorithm
  • [2] N. Bruno, D. Srivastava, N. Koudas.
    Holistic twig joins: optimal XML pattern
    matching. In SIGMOD, 2002.
    • Proposes the TwigStack algorithm
  • [3] T. Chen, J. Lu, T. Ling. On boosting
    holism in XML twig pattern matching using
    structural indexing techniques. In SIGMOD, 2005.
    • Proposes two new data streaming techniques
  • [4] Y. Chen, S. B. Davidson, Y. Zheng. BLAS:
    an efficient XPath processing system. In
    SIGMOD, pages 47-58, 2004.
    • Proposes a new algorithm for XPath queries
46
Reference
  • [5] H. Jiang, W. Wang, H. Lu. Holistic twig
    joins on indexed XML documents. In VLDB, 2003.
    • Proposes the TSGeneric algorithm
  • [6] J. Lu, T. Chen, T. W. Ling. Efficient
    processing of XML twig patterns with parent child
    edges: a look-ahead approach. In CIKM, pages
    533-542, 2004.
    • Proposes the TwigStackList algorithm
  • [7] P. Rao, B. Moon. PRIX: indexing and
    querying XML using Prüfer sequences. In ICDE,
    pages 288-300, 2004.
    • Proposes the PRIX system
  • [8] H. Wang, S. Park, W. Fan, P. S. Yu. ViST: a
    dynamic index method for querying XML data by
    tree structures. In SIGMOD, 2003.
    • Proposes the ViST system
  • [9] B. Yang, M. Fontoura, E. J. Shekita, S.
    Rajagopalan, K. S. Beyer. Virtual cursors
    for XML joins. In CIKM, pages 523-532, 2004.
    • Proposes the virtual cursor algorithm
47
END
  • Thank you!
  • Q & A

48
Related work
  • Comparison between Virtual Cursor (VC) [Yang, CIKM
    2004] and our work
    • Developed independently
    • TJFast uses a finite state transducer; VC uses a
      path table
    • The size of the path table depends on the number
      of distinct paths, whereas the size of the FST
      depends on the number of distinct element types.
    • TJFast reduces the number of useless intermediate
      paths for queries with parent-child edges; VC
      does not have this property

49
Backup
[Figure: a query with nodes a, b, c, d and e, and a document with elements a1, a2, b1, c1, c2, d1, e1, f1 and f2.]
TwigStackList outputs <a1, b1>, but TJFast does
not output this path solution.
50
Labels size
                        XMark   DBLP    TreeBank
Region encoding (MB)    71.9    21.6    23.3
Original Dewey (MB)     56.2    18.1    22.8
Extended Dewey (MB)     72.6    19.5    28.7
51
Optimal query classes
  • If an algorithm does not output any useless
    intermediate results for a query Q on any given
    document, we say that the algorithm is optimal
    for Q.
  • The larger the optimal query class of an
    algorithm, the better its ability to control the
    size of intermediate results.

52
Optimal class of TJFast and TwigStack
Optimal query class:
  • TwigStack: all edges are ancestor-descendant
    relationships
  • TJFast: all edges connecting branching nodes and
    their children are ancestor-descendant
    relationships
Even for non-optimal queries, TJFast usually
outputs fewer useless intermediate paths than
TwigStack does.
53
Update of XML documents
  • To support updates of XML documents,
    we need to slightly modify the extended Dewey
    labeling scheme.
  • Our idea comes from ORDPATH.
  • We can avoid relabeling the document under any
    kind of update.

P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G.
Schaller, and N. Westbury. ORDPATHs:
insert-friendly XML node labels. In SIGMOD, pages
903-908, 2004.
54
More examples for assigning labels
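  • Let us consider a more complicated DTD:
  • a → (b | c), d?, c
  • We define X_b mod 3 = 0, X_c mod 3 = 1,
    X_d mod 3 = 2
  • (Why do we use mod 3 instead of 4?)


[Figure: root a (label ε) with four children labeled 0, 2, 4 and 7; by the modulo rule their tags are b, d, c and c.]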
55
Computing cost of FST
  • The CPU time complexity of the FST is linear in the
    length of an extended Dewey label, and
    independent of the complexity of the schema
    definition.
  • The main memory size of the FST is quadratic in the
    number of distinct element names in the XML
    documents, as the number of transitions in the FST
    is quadratic in the worst case.