Title: Holistic Twig Joins: Optimal XML Pattern Matching
1Holistic Twig Joins: Optimal XML Pattern Matching
- Authors: Nicolas Bruno, Nick Koudas, Divesh Srivastava
- Presented by Huang Yukai
2Outline
- 1. Background
- 2. Algorithm PathStack
- 3. Algorithm TwigStack
- 4. Experiments
- 5. Conclusion and References
3Background (1)
- An XML database is a forest of rooted, ordered, labeled trees. Each node corresponds to an element or a value, and each edge represents a direct element-subelement or element-value relationship.
<book> <title> XML </title> <allauthors>
<author> <fn> jane </fn> <ln> poe </ln> </author> ......
</allauthors> <year> 2000 </year> ...... </book>
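[Figure: tree representation of the document. The root book has children title (value XML), allauthors and year (value 2000); allauthors contains author elements, each with children fn (jane) and ln (poe).]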
4Background (2)
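- XML queries specify patterns of selection predicates on multiple elements.
- e.g. book[title = 'xml']//author[fn = 'jane' AND ln = 'poe']
- Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing.
[Figure: the twig pattern for this query. The root book has a child title (value xml) and a descendant author; author has children fn (value jane) and ln (value poe).]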
5Background (3)
- Previous work:
- (i) Decompose the twig pattern into binary structural relationships.
- (ii) Match the binary relationships against the XML database.
- (iii) Stitch together these basic matches.
- Drawback:
- The intermediate results can become very large, even when the input and the final output are small.
- Motivation: the algorithm's complexity should be independent of the size of the intermediate results.
[Figure: the twig pattern book, title (xml), author, fn (jane), ln (poe) from the previous slide.]
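Step (i) above can be made concrete with a small sketch. The `QNode` class and the `binary_relationships` function are hypothetical names introduced here for illustration, not part of the original slides; the twig is the book/author pattern from Background (2).

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    """A twig pattern node (hypothetical helper type)."""
    label: str
    axis: str = "/"  # edge from the parent: "/" (child) or "//" (descendant)
    children: list = field(default_factory=list)


def binary_relationships(root):
    """Return one (parent_label, axis, child_label) triple per twig edge."""
    edges = []
    pending = [root]
    while pending:
        node = pending.pop()
        for child in node.children:
            edges.append((node.label, child.axis, child.label))
            pending.append(child)
    return edges


# book[title = 'xml']//author[fn = 'jane' AND ln = 'poe']
twig = QNode("book", children=[
    QNode("title", "/", [QNode("xml")]),
    QNode("author", "//", [
        QNode("fn", "/", [QNode("jane")]),
        QNode("ln", "/", [QNode("poe")]),
    ]),
])

edges = binary_relationships(twig)
for edge in edges:
    print(edge)
```

Each of the seven triples would be matched against the database separately in step (ii), which is where the large intermediate results come from.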
6Background (4)
8PathStack (1)
- Notation
- q denotes a twig pattern and, interchangeably, the root node of that twig pattern.
- Each node q in the query twig pattern has an associated stream T_q. It contains the positional representations of all database nodes that match q, sorted by (DocId, LeftPos).
- Each node q also has an associated stack S_q. Each stack entry is a pair: (positional representation of a node from T_q, pointer to an entry in S_parent(q)).
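The positional representation is what makes structural tests cheap. A minimal sketch, assuming the usual region encoding (DocId, LeftPos:RightPos, LevelNum); the `Pos` type, the function names and the sample values are illustrative assumptions:

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Positional representation of one element (hypothetical helper type)."""
    doc_id: int
    left: int
    right: int
    level: int


def is_ancestor(a, d):
    """True if a strictly contains d within the same document."""
    return a.doc_id == d.doc_id and a.left < d.left and d.right < a.right


def is_parent(a, d):
    """Parent-child is ancestor-descendant plus a level difference of one."""
    return is_ancestor(a, d) and a.level + 1 == d.level


# Illustrative values only.
a1 = Pos(1, 1, 9, 1)
b1 = Pos(1, 2, 8, 2)
c1 = Pos(1, 5, 5, 5)

# A stream T_q is a list of positions sorted by (DocId, LeftPos).
stream = sorted([c1, a1, b1], key=lambda p: (p.doc_id, p.left))

print(is_ancestor(a1, c1), is_parent(a1, b1), is_parent(a1, c1))
```

Both tests need only two elements and a few integer comparisons, with no access to the tree itself.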
9PathStack (2.1) an example
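- Positional representations (DocId, LeftPos:RightPos, LevelNum) of the nodes in the data:
A1 (1, 1:9, 1), B1 (1, 2:8, 2), A2 (1, 3:7, 3), B2 (1, 4:6, 4), C1 (1, 5:5, 5)
[Figure: the query path Q with nodes A, B, C, and the data D, a single path A1, B1, A2, B2, C1 nested in that order.]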
10PathStack (2.2) an example
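(1) For each node q in the query, get the stream T_q from the XML database:
T_A: A1, A2 (eof); T_B: B1, B2 (eof); T_C: C1 (eof)
(2) Loop while eof(T_q) is false, where q is the leaf node. In each iteration, take the stream element with the smallest LeftPos and push it onto the stack of its query node.
Push order: A1, B1, A2, B2, C1
[Figure: the stacks after the last push. S_A holds A1 (bottom) and A2 (top), S_B holds B1 (bottom) and B2 (top), S_C holds C1.]
The loop stops here because C is the leaf node and T_C is at eof.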
11PathStack (2.3) an example
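- Phase 2: output all the solutions encoded in the stacks.
[Figure: the final stacks. S_A holds A1 (bottom) and A2 (top), S_B holds B1 (bottom) and B2 (top), S_C holds C1.]
Query results: A1B1C1, A1B2C1, A2B2C1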
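The example can be replayed with a compact sketch of the stack-based idea. This is a simplified illustration, not the authors' code: it assumes a single document (DocId is dropped), a query path with only ancestor-descendant edges, and it skips pushing an element when its parent stack is empty. The name `path_stack` and the tuple layout are assumptions.

```python
def path_stack(labels, streams):
    """labels: query path from root to leaf.
    streams: label -> list of (name, left, right), sorted by left."""
    cursors = {q: 0 for q in labels}
    stacks = {q: [] for q in labels}  # entries: (name, left, right, parent_top)
    leaf = labels[-1]
    results = []

    def expand(depth, top, suffix):
        # Entries 0..top of the stack at this depth are all ancestors
        # of the first element in suffix.
        if depth < 0:
            results.append(suffix)
            return
        for i in range(top + 1):
            name, _, _, ptr = stacks[labels[depth]][i]
            expand(depth - 1, ptr, (name,) + suffix)

    while cursors[leaf] < len(streams[leaf]):
        # Pick the query node whose next element has the smallest LeftPos.
        live = [q for q in labels if cursors[q] < len(streams[q])]
        qmin = min(live, key=lambda q: streams[q][cursors[q]][1])
        name, left, right = streams[qmin][cursors[qmin]]
        cursors[qmin] += 1
        # Clean the stacks: pop entries that end before the new element starts.
        for q in labels:
            while stacks[q] and stacks[q][-1][2] < left:
                stacks[q].pop()
        depth = labels.index(qmin)
        if depth > 0 and not stacks[labels[depth - 1]]:
            continue  # no candidate ancestor, so the element cannot match
        ptr = len(stacks[labels[depth - 1]]) - 1 if depth > 0 else -1
        stacks[qmin].append((name, left, right, ptr))
        if qmin == leaf:
            stacks[leaf].pop()
            expand(depth - 1, ptr, (name,))
    return results


streams = {
    "A": [("A1", 1, 9), ("A2", 3, 7)],
    "B": [("B1", 2, 8), ("B2", 4, 6)],
    "C": [("C1", 5, 5)],
}
solutions = path_stack(["A", "B", "C"], streams)
print(solutions)
# [('A1', 'B1', 'C1'), ('A1', 'B2', 'C1'), ('A2', 'B2', 'C1')]
```

The stacks encode all three answers compactly: each entry stores only one pointer into its parent stack, and every entry at or below that pointer is a valid ancestor.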
12PathStack (3)
13PathStack (4)
15TwigStack (1)
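- A straightforward approach is to decompose the twig into multiple root-to-leaf path patterns, use PathStack to find the solutions of each path, and finally merge these partial solutions.
- Drawback: many path solutions may not be part of any final twig match, so the intermediate results can again be large.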
16TwigStack (2) an example
[Figure: an example query twig Q and a data tree D. The data tree has root R and contains the nodes A1, A2, B1, B2, C1, C2, D1 and D2.]
18XB-trees
- XB-tree: a variant of the B-tree for indexing the positional representations (DocId, LeftPos:RightPos, LevelNum) of the elements in an XML tree.
- The nodes in a page are sorted by DocId and LeftPos.
- Each node N in an internal page carries a bounding segment [N.L, N.R] that includes all of its child nodes.
- Two operations over XB-trees: advance and drilldown.
- TwigStackXB (omitted)
20Experiments (1)
- Experimental setting
- Implemented in C++
- A computer with a 550 MHz Pentium III processor, 768 MB of main memory and a 2 GB disk.
- Datasets
- (a) Synthetic data: randomly generated trees, controlled by the parameters depth, fan-out and number of labels.
- (b) Real-world data: an unfolded fragment of the DBLP database.
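As a rough illustration of dataset (a), the sketch below generates a random labeled tree and assigns each node a (LeftPos, RightPos, LevelNum) triple. The generator used in the actual experiments is not described in these slides, so the function names, the uniform choices and the fixed seed are all assumptions.

```python
import random


def random_tree(depth, fan_out, labels, rng):
    """Return a nested (label, children) tree of the given depth;
    every inner node gets between 1 and fan_out children."""
    label = rng.choice(labels)
    if depth <= 1:
        return (label, [])
    count = rng.randint(1, fan_out)
    return (label, [random_tree(depth - 1, fan_out, labels, rng)
                    for _ in range(count)])


def assign_positions(tree):
    """Return (label, left, right, level) tuples in document order."""
    out = []
    counter = 0

    def visit(node, level):
        nonlocal counter
        label, children = node
        counter += 1
        left = counter
        slot = len(out)
        out.append(None)  # reserve the slot to keep document order
        for child in children:
            visit(child, level + 1)
        counter += 1
        out[slot] = (label, left, counter, level)

    visit(tree, 1)
    return out


rng = random.Random(0)  # fixed seed so the run is repeatable
tree = random_tree(depth=4, fan_out=3, labels=["A", "B", "C"], rng=rng)
elements = assign_positions(tree)
print(len(elements), elements[0])
```

Grouping `elements` by label yields one stream per label, already sorted by LeftPos.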
21Experiments (2)
- PathStack vs. Binary Structural Joins
22Experiments (3.1)
23Experiments (3.2)
24Experiments (4)
- PathStack vs. TwigStack
- Queries
26Experiments (5)
27Conclusions
- Holistic join algorithms: PathStack and TwigStack.
- Open issues:
- handling more complicated XPath expressions
- value-based joins (e.g. links across documents)
28References
- [1] N. Bruno, N. Koudas, D. Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. Technical Report. ...
- [2] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. ICDE '02. (some stack-based algorithms for joins)
- [3] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman. On Supporting Containment Queries in Relational Database Management Systems. SIGMOD '01. (the MPMGJN algorithm)