Title: PRIX: Indexing and Querying XML Using Prufer Sequences
1PRIX: Indexing and Querying XML Using Prufer Sequences
- By Praveen Rao and Bongki Moon,
- ICDE Conf. 2004
- Presented by Van Le-Pham
- May 3, 2005
2Outline
- Introduction
- Background and Motivations
- Overview of PRIX Approach
- Finding Twig Matches
- Implementation Issues in the PRIX System
- Experimental Result
- Conclusions and Future Work
4Introduction
- XML as a new standard for information representation
- Major issues: storing, indexing, and querying
- Two main lines of research, based on
  - Structural indexes (DataGuide, 1-index, F&B index)
  - Numbering schemes that encode each element by its position (based on pre- and postorder tree traversal)
5Background and Motivations I
- Most existing approaches process each root-to-leaf path of the query twig separately and then merge the results.
6Background and Motivations II
- PathStack and TwigStack algorithms
  - Operate on the positional representation of the elements to find twig matches
- Variant of TwigStack: TwigStackXB
  - Uses XB-trees to speed up processing when the input lists are long
- Limitations
  - Skipping data depends on the distribution of the matches in the input lists. If the matches are spread all over the dataset, TwigStackXB has to go down to lower regions of the tree to avoid missing matches => ineffective.
  - Parent-child relationships are not handled very well.
7Background and Motivations III
- ViST (Virtual Suffix Tree)
  - Transforms XML data and queries into structure-encoded sequences
  - Finds the matching subsequences
  - Uses a virtual suffix tree to speed up the search
- Limitations
  - The worst-case storage for a D-Ancestorship is higher than linear in the total number of elements in the XML documents
  - May produce false alarms
8Example of false alarm by ViST
- The structure-encoded sequence of Q is a subsequence of the structure-encoded sequences of both Doc1 and Doc2. However, Q occurs only in Doc1 => the match detected in Doc2 is a false alarm
9Background and Motivations IV
- Motivations
  - Develop a method that allows holistic processing without breaking a twig into root-to-leaf paths
  - Construct a tree-to-sequence transformation with linear storage
  - Index the sequences to reduce the amount of data that needs to be searched
10Outline
- Introduction
- Background and Motivations
- Overview of PRIX Approach
- Finding Twig Matches
- Implementation Issues in the PRIX System
- Experimental Result
- Conclusions and Future Work
11XML Documents
- Represented by a labeled tree in which each node is associated with its element tag and a number
- Any numbering scheme can be used as long as it keeps the node numbers unique (here: postorder traversal)
12Prufer Sequences for Labeled Trees
- Tree-to-sequence: Prufer's method applies to a tree T_n with n nodes labeled from 1 to n
- Delete the smallest labeled leaf from T_n => tree T_{n-1}
- Collect the label a_1 of the deleted node's parent
- Repeat the above 2 steps for tree T_{n-1} and get a_2
- Continue until only 1 node is left
- Create the sequence (a_1, a_2, ..., a_{n-1}) of length (n - 1)
13Prufer Sequences for Labeled Trees II
- LPS (Labeled Prufer Sequence): the sequence (a_1, a_2, ..., a_{n-1}) collected on the previous slide
- NPS (Numbered Prufer Sequence): the sequence of postorder numbers corresponding to the LPS
14Example
- Tree T_n with n = 15 (figure)
15Example II
- T_n (figure)
16Example III
- T_{n-1} (figure)
17Example IV
- T_{n-2} (figure)
18Example V
- T_{n-3} (figure)
19Example VI
- T_{n-4} (figure)
- NPS: 15 3 7 6 6
- LPS: A C B C C
20Example VII
- Continue in this way until we have
- NPS: 15 3 7 6 6 7 15 9 15 13 13 13 14 15
- LPS: A C B C C B A C A E E E D A
- (length n - 1 = 15 - 1 = 14)
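The construction can be sketched in a few lines of Python. This is only an illustrative sketch, not the PRIX implementation: the function name prufer_sequences and the dictionary-based tree encoding are assumptions made here, and the parent of each node is read off the NPS shown on the slide (by Lemma 1, the i-th NPS entry is the parent of node i), so the run simply confirms that "delete the smallest leaf, record its parent" reproduces both sequences.

```python
import heapq

def prufer_sequences(parent, label):
    """Return (LPS, NPS) for a tree given as node -> parent (root maps to None)."""
    child_count = {v: 0 for v in parent}
    for p in parent.values():
        if p is not None:
            child_count[p] += 1
    # Min-heap of current leaves, so the smallest labeled leaf is deleted first.
    leaves = [v for v, c in child_count.items() if c == 0]
    heapq.heapify(leaves)
    lps, nps = [], []
    for _ in range(len(parent) - 1):      # stop when only one node is left
        v = heapq.heappop(leaves)         # delete the smallest leaf
        p = parent[v]
        nps.append(p)                     # postorder number of its parent
        lps.append(label[p])              # label of its parent
        child_count[p] -= 1
        if child_count[p] == 0:           # the parent has become a leaf
            heapq.heappush(leaves, p)
    return lps, nps

# Tree T from the example: the parent of node i is the i-th NPS entry.
NPS_T = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]
parent = {i + 1: p for i, p in enumerate(NPS_T)}
parent[15] = None                         # node 15 is the root
# Only non-leaf labels are needed, because an LPS records parent labels.
label = {15: "A", 3: "C", 7: "B", 6: "C", 9: "C", 13: "E", 14: "D"}

lps, nps = prufer_sequences(parent, label)
print(" ".join(lps))
print(" ".join(map(str, nps)))
```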
21Processing Twig Queries by Prufer Sequences
- A query twig is transformed into Prufer sequences, just like the XML documents
- Non-matches are filtered out by subsequence matching on the indexed sequences
- Twig matches are then found by applying a series of refinement strategies
22Architectural Overview of PRIX
23Outline
- Introduction
- Background and Motivations
- Overview of PRIX Approach
- Finding Twig Matches
- Implementation Issues in the PRIX System
- Experimental Result
- Conclusions and Future Work
24Notations Used
25Problem Statement
- Given a collection of XML documents Δ and a query twig Q, report all the occurrences of twig Q in Δ
- Finding twig matches in PRIX involves four phases of filtering and refinement
26Phase 1 Filtering by Subsequence Matching
- Definition 1: A subsequence is any string that can be obtained by deleting zero or more symbols from a given string
- So the problem becomes
  - Given query twig Q => LPS(Q)
  - Given XML documents Δ => G, the set of LPSs of Δ
  - We need to find all subsequences in G that match LPS(Q)
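As a naive illustration of this phase (no index involved), the sketch below enumerates every position tuple at which a query LPS occurs as a subsequence of a data LPS. The function name subsequence_matches is an assumption made for this example; the two sequences are the LPS(T) and LPS(Q) used in the examples of this talk.

```python
def subsequence_matches(query, data):
    """Yield 1-based position tuples where `query` is a subsequence of `data`."""
    def extend(qi, start, chosen):
        if qi == len(query):
            yield tuple(chosen)
            return
        for pos in range(start, len(data)):
            if data[pos] == query[qi]:
                # Fix this symbol here and match the rest strictly to its right.
                yield from extend(qi + 1, pos + 1, chosen + [pos + 1])
    yield from extend(0, 0, [])

LPS_T = "A C B C C B A C A E E E D A".split()
LPS_Q = "B A E D A".split()

matches = list(subsequence_matches(LPS_Q, LPS_T))
print(len(matches), "candidate matches")
print((6, 7, 11, 13, 14) in matches)
```

Every tuple produced here is only a candidate: the later refinement phases decide which candidates are real twig matches.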
27Phase 1 Filtering by Subsequence Matching II
- Lemma 1: Given a tree T with n nodes, numbered from 1 to n in postorder, the node deleted at the i-th step of Prufer sequence construction is the node numbered i
- Theorem 1: If tree Q is a subgraph of tree T, then LPS(Q) is a subsequence of LPS(T)
- (Proof omitted)
28Example
- Tree T and tree Q (figure)
- LPS(T): A C B C C B A C A E E E D A
- LPS(Q): B A E D A
- NPS(T): 15 3 7 6 6 7 15 9 15 13 13 13 14 15
- NPS(Q): 2 6 4 5 6
29Example II
- LPS(T): A C B C C B A C A E E E D A
- NPS(T): 15 3 7 6 6 7 15 9 15 13 13 13 14 15
- LPS(Q): B A E D A
- NPS(Q): 2 6 4 5 6
- Q is a (labeled) subgraph of T, and LPS(Q) matches a subsequence S of LPS(T) at positions (6, 7, 11, 13, 14)
- The postorder number sequence of subsequence S is 7 15 13 14 15
- There may be more than one subsequence in LPS(T) that matches LPS(Q)
30Phase 2 Refinement by Connectedness
- Refine the subsequences matched in phase 1 by their connectedness property
- The sequences that satisfy Theorem 2 are called tree sequences
31Intuition
- Let i be the index of the last occurrence of a postorder number a in an NPS
- This last occurrence results from the deletion of the last child of a
- By Lemma 1, the next node to be deleted is a itself
- So if the tree is connected, the postorder number at the (i + 1)th index should be the parent of a
32Example
- Subsequences
- S_A: C B C E D
- N_A: 3 7 9 13 14
- S_B: C B A C A E D A
- N_B: 3 7 15 9 15 13 14 15
33Example II
- N_T = NPS(T): 15 3 7 6 6 7 15 9 15 13 13 13 14 15
- For S_A, max(N_A1, N_A2, ..., N_A5) = 14
- N_A: 3 7 9 13 14
  - i = 1, N_A1 = 3
  - N_A2 = 7 and N_T(3) = 7 => connected here
- N_A: 3 7 9 13 14
  - i = 2, N_A2 = 7
  - N_A3 = 9 ≠ N_T(7) = 15 => disconnected here
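The check illustrated above can be sketched as follows. Theorem 2 itself is not reproduced in these slides, so the condition coded here is an assumption reconstructed from the intuition and the example: for every number a other than the maximum, the entry right after the last occurrence of a must be the parent of a in T, which is the a-th entry of NPS(T). The function name is_connected is hypothetical.

```python
def is_connected(n_sub, nps_t):
    """Connectedness test for the postorder numbers of a matched subsequence.

    n_sub : postorder numbers N of the matched subsequence
    nps_t : NPS of the data tree T (entry a-1 is the parent of node a)
    """
    top = max(n_sub)
    last = {a: i for i, a in enumerate(n_sub)}   # last index of each number
    for a, i in last.items():
        if a == top:
            continue                             # nothing is required above the top node
        # The entry after the last occurrence of a must be a's parent in T.
        if i + 1 >= len(n_sub) or n_sub[i + 1] != nps_t[a - 1]:
            return False
    return True

NPS_T = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]
N_A = [3, 7, 9, 13, 14]
N_B = [3, 7, 15, 9, 15, 13, 14, 15]

print(is_connected(N_A, NPS_T))   # 7 is followed by 9, but the parent of 7 is 15
print(is_connected(N_B, NPS_T))
```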
34Phase 3 Refinement by Twig Structure
- Refine the tree sequences from phase 2 by their twig structure property
- Determine whether the structure of the tree represented by a tree sequence matches the query twig structure
35Some definitions
- Gap between nodes
-
- (can be computed using the NPS of the tree)
- Gap consistency
36Example of gap consistency
37Example of gap consistency II
- S1: B A E E A, N_S1: 7 15 13 13 15
- S2: B A E E A, N_S2: 2 7 6 6 7
- N_S1: 7 (-) 15 (-) 13 (-) 13 (-) 15
  - gap: -8 2 0 -2
- N_S2: 2 (-) 7 (-) 6 (-) 6 (-) 7
  - gap: -5 1 0 -1
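The gap computation in this example is just the difference between adjacent NPS entries. The formal definition of gap consistency is not reproduced in this transcript, so the test below is an assumption inferred from the example and the intuition that the query must not traverse more nodes than the data: corresponding gaps must both be zero, or have the same sign with the query gap no larger in magnitude than the data gap. The names gaps and gap_consistent are hypothetical.

```python
def gaps(nps):
    """Differences between adjacent entries of an NPS."""
    return [a - b for a, b in zip(nps, nps[1:])]

def gap_consistent(n_query, n_data):
    """Assumed test: gaps agree in sign, and no query gap exceeds the data gap."""
    if len(n_query) != len(n_data):
        return False
    for gq, gd in zip(gaps(n_query), gaps(n_data)):
        if (gq == 0) != (gd == 0):     # one gap is zero and the other is not
            return False
        if gq * gd < 0:                # gaps point in opposite directions
            return False
        if abs(gq) > abs(gd):          # query skips more nodes than the data
            return False
    return True

N_S1 = [7, 15, 13, 13, 15]   # data subsequence
N_S2 = [2, 7, 6, 6, 7]       # query

print(gaps(N_S1))
print(gaps(N_S2))
print(gap_consistent(N_S2, N_S1))
```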
38Intuition
- Gap(a, b): the number of nodes encountered during postorder traversal between a and b
- Therefore, if more nodes are traversed in the query twig than in the data twig => there is a structural difference between the data and the query twig
- The number of times a number a occurs in an NPS = the number of child nodes of a
39Definitions
40Example of frequency consistency
- N_S1: 7 15 13 13 15, N_S2: 2 7 6 6 7
- Freq(7) = 1 = Freq(2), and each occurs at position 1
- Freq(15) = 2 = Freq(7), and each occurs at positions 2, 5
- Freq(13) = 2 = Freq(6), and each occurs at positions 3, 4
- => S1 and S2 are frequency consistent
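A small sketch of this comparison follows. The formal definition is not reproduced in the transcript, so the rule coded here is an assumption read off the example: two NPSs of equal length are treated as frequency consistent when their numbers repeat at exactly the same positions. The function names are hypothetical.

```python
def occurrence_positions(nps):
    """Map each postorder number to the 1-based positions where it occurs."""
    positions = {}
    for i, a in enumerate(nps, start=1):
        positions.setdefault(a, []).append(i)
    return positions

def frequency_consistent(n_a, n_b):
    """Assumed test: both sequences repeat their numbers at the same positions."""
    if len(n_a) != len(n_b):
        return False
    return (sorted(occurrence_positions(n_a).values())
            == sorted(occurrence_positions(n_b).values()))

N_S1 = [7, 15, 13, 13, 15]
N_S2 = [2, 7, 6, 6, 7]

print(occurrence_positions(N_S1))
print(occurrence_positions(N_S2))
print(frequency_consistent(N_S1, N_S2))
```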
41What we have so far
(proof omitted)
42What we have so far
- From the lemmas and theorems => LPS(Q) and subsequence S of the data are identical
- NPS(Q) is gap consistent and frequency consistent with N
43Remind
- An LPS contains only the non-leaf nodes, therefore
  - we need to store the label and postorder number of every leaf in the database
- After what we have done so far, we can only find twig matches whose
  - tree structure is the same as the query tree
  - non-leaf node labels match the non-leaf node labels of the query twig
- We call these partial twig matches (because we don't check the leaves yet)
44Remind II
- To find complete twig match, the leaf node labels
of the partial twig match in the data should be
matched with the leaf node labels of query twig
45Phase 4 Refinement by Matching Leaf Nodes
- The leaf node labels of the query twig are tested
with leaf node labels of partial twig matches
found in phase 3
46Example (see example 6)
- LPS(T): A C B C C B A C A E E E D A
- LPS(Q): B A E D A
- NPS(T): 15 3 7 6 6 7 15 9 15 13 13 13 14 15
- NPS(Q): 2 6 4 5 6
- LPS(Q) matches a subsequence S = B A E D A in LPS(T) at positions P = (3rd, 7th, 11th, 13th, 14th), with postorder number sequence N = 7 15 13 14 15
- Leaf nodes(T): (D,2), (D,4), (E,5), (G,10), (F,11), (F,12)
- Leaf nodes(Q): (F,3)
- (figure: the matches are marked)
47Optimize
- Phase 4 can be eliminated if we put the leaf
nodes of the query twig and data trees inside the
LPSs
48Processing Wildcards (// and *)
- Q = //A//C/D
- LPS(Q) = C A
- NPS(Q) = 2 3
- (figure: query Q, with the first and second wildcards marked)
49Processing Wildcards
- The first wildcard is already handled by the method
- To handle the second wildcard, we need to modify the refinement-by-connectedness step
- LPS(T): A C B C C B A C A E E E D A
- LPS(Q): C A
- NPS(T): 15 3 7 6 6 7 15 9 15 13 13 13 14 15
- NPS(Q): 2 3
- S = C A, N = 3 15, and P = (2nd, 7th)
- By Theorem 2 we would discard this subsequence (the last occurrence of 3 must be followed by 7, but here it is followed by 15 => unconnected)
- To avoid this, we check recursively whether node 3 can lead to node 15 by following a series of edges in T
- Subsequences that pass this test move on to the next phase
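The relaxed test for "//" can be sketched by walking parent edges in NPS(T): by Lemma 1, the i-th entry of NPS(T) is the parent of node i, so following entries upward from node 3 shows whether node 15 is one of its ancestors. The function name leads_to is hypothetical, and an iterative walk is used here in place of recursion.

```python
def leads_to(nps_t, node, ancestor):
    """True if `ancestor` is reachable from `node` by following parent edges in T."""
    n = len(nps_t) + 1              # the root carries the largest postorder number
    while node < n:                 # the root has no parent entry in the NPS
        node = nps_t[node - 1]      # entry i of NPS(T) is the parent of node i
        if node == ancestor:
            return True
    return False

NPS_T = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]

print(leads_to(NPS_T, 3, 15))   # path 3 -> 7 -> 15
print(leads_to(NPS_T, 3, 9))    # 9 is not on the path from 3 to the root
```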
50Outline
- Introduction
- Background and Motivations
- Overview of PRIX Approach
- Finding Twig Matches
- Implementation Issues in the PRIX System
- Experimental Result
- Conclusions and Future Work
51Building Prufer Sequences
- Tree-to-sequence: Prufer's method applies to a tree T_n with n nodes labeled from 1 to n
- Delete the smallest labeled leaf from T_n => tree T_{n-1}
- Collect the label a_1 of the deleted node's parent
- Repeat the above 2 steps for tree T_{n-1} and get a_2
- Continue until only 1 node is left
- Create the sequence (a_1, a_2, ..., a_{n-1}) of length (n - 1)
52Indexing Sequence Using B-tree
- Index the set of LPSs of the XML documents to speed up the subsequence matching step
- Use B-trees (in a way similar to ViST) to build a virtual trie
53Virtual Trie
- Each node in the trie is labeled with a range (LeftPos, RightPos) (called its positional representation) such that the containment property is satisfied
- The range of the root is (1, MAX_INT)
- For each element tag e, we build a B-tree that indexes the positional representation of every occurrence of e using its LeftPos as the key => called the Trie-Symbol index
- Index each document tree identifier in a separate B-tree, using as the key the LeftPos of the node where the LPS ends in the virtual trie => called the Docid index
54Space Complexity
- The size of the trie grows linearly with the length of the sequences stored in it
- The length of a Prufer sequence is linear in the number of nodes in the tree
- => The index size is linear in the number of nodes in the tree
55Filtering by Subsequence Matching
- Let Qs = LPS(Q) = Qs1 Qs2 ... Qsk (a sequence of length k)
- Algorithm 1: find all occurrences of Qs using the Trie-Symbol index
- Call FindSubsequence(Qs, 1, 0, MAX_INT)
56Filtering by Subsequence Matching
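Since the algorithm listing itself did not survive in this transcript, the following is a rough in-memory sketch of how a FindSubsequence-style search over (LeftPos, RightPos) ranges could look. Everything here is an assumption for illustration: sorted Python lists with bisect stand in for the B-tree indexes, build_indexes and its numbering scheme are invented for the example, the symbol index i is 0-based (the slides call with 1), and only document identifiers are returned.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

MAX_INT = 2**31 - 1

def build_indexes(sequences):
    """Build list-based stand-ins for the Trie-Symbol and Docid indexes."""
    root = {}
    for doc_id, seq in sequences.items():
        node = root
        for sym in seq:
            node = node.setdefault(sym, {})
        node.setdefault(None, []).append(doc_id)   # key None: documents ending here

    symbol_index = defaultdict(list)   # symbol -> [(LeftPos, RightPos)]
    docid_index = []                   # [(LeftPos of end node, doc_id)]
    counter = 1                        # the root takes LeftPos 1

    def number(node):
        nonlocal counter
        for sym, child in node.items():
            if sym is None:
                continue
            counter += 1
            left = counter
            for doc_id in child.get(None, []):
                docid_index.append((left, doc_id))
            number(child)
            # Every descendant d satisfies left < d.LeftPos < counter + 1.
            symbol_index[sym].append((left, counter + 1))

    number(root)
    for entries in symbol_index.values():
        entries.sort()
    docid_index.sort()
    return symbol_index, docid_index

def find_subsequence(qs, i, ql, qr, symbol_index, docid_index, out):
    """Collect ids of documents whose sequence contains qs[i:] inside range (ql, qr)."""
    entries = symbol_index.get(qs[i], [])
    lo = bisect_right(entries, (ql, MAX_INT))   # first entry with LeftPos > ql
    hi = bisect_left(entries, (qr, 0))          # first entry with LeftPos >= qr
    for left, right in entries[lo:hi]:
        if i == len(qs) - 1:
            # Documents ending at this trie node or anywhere below it.
            d_lo = bisect_left(docid_index, (left,))
            d_hi = bisect_left(docid_index, (right,))
            out.update(doc for _, doc in docid_index[d_lo:d_hi])
        else:
            find_subsequence(qs, i + 1, left, right, symbol_index, docid_index, out)
    return out

sequences = {
    "T": "A C B C C B A C A E E E D A".split(),
    "U": "A B A B".split(),   # hypothetical second document
}
symbol_index, docid_index = build_indexes(sequences)
hits = find_subsequence("B A E D A".split(), 0, 0, MAX_INT,
                        symbol_index, docid_index, set())
print(hits)
```

The containment property does the work here: restricting the next range query to (left, right) keeps the search inside the subtree of the trie node just matched.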
57Optimized Subsequence Matching
- Speed up subsequence matching further
- Reduce the number of range queries performed by Algorithm 1 without causing any false dismissals
- Prune some nodes (r in line 2) using the gap between two adjacent nodes in the query sequence
- Given Δ, a set of XML documents, and a node label e in Δ, define MaxGap(e, Δ)
58Example of MaxGap(e, Δ)
- In tree P, difference(A) = 14 - 8 = 6
- In tree Q, difference(A) = 3 - 1 = 2
- MaxGap(A, {P, Q}) = max{6, 2} = 6
- MaxGap(e, Δ) = 0 if every occurrence of e has at most 1 child
59How MaxGap Is Useful
- Suppose the query twig has B as the parent of C => LPS(Q) = C B
- Then C B has 8 matches in LPS(P), at positions (1,3), (1,4), (1,7), (1,9), (2,3), (2,4), (2,7), and (2,9)
- LPS(P): C C B B E E B A B C D D C A
- Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
- MaxGap(C, {P, Q}) = max{(2 - 1), (13 - 10), (8 - 7)} = 3, so
  - the gap between the first and last occurrences of C is at most 3
  - the gap between the first occurrence of C and its parent B is at most 4
- => only 4 matches are considered in the next step: (1,3), (1,4), (2,3), and (2,4)
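The pruning effect in this example can be reproduced with a short sketch. The helper names are hypothetical, the MaxGap value of 3 is taken from the slide rather than computed, and the bound "distance at most MaxGap + 1" is the reading of the example used here: C is deleted right after its last child, so its parent appears at most MaxGap + 1 positions after the first occurrence of C.

```python
def pair_matches(lps, child_label, parent_label):
    """All 1-based (i, j), i < j, with child_label at i and parent_label at j."""
    return [(i + 1, j + 1)
            for i, a in enumerate(lps) if a == child_label
            for j in range(i + 1, len(lps)) if lps[j] == parent_label]

def prune_by_max_gap(matches, max_gap):
    """Keep only the pairs whose distance is at most max_gap + 1."""
    return [(i, j) for i, j in matches if j - i <= max_gap + 1]

LPS_P = "C C B B E E B A B C D D C A".split()

all_matches = pair_matches(LPS_P, "C", "B")
kept = prune_by_max_gap(all_matches, 3)   # MaxGap(C, {P, Q}) = 3, from the slide
print(len(all_matches), "matches before pruning")
print(kept)
```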
60Using MaxGap
61The Refinement Phases
- Use Algorithm 2 to refine the set of ordered pairs (D, S) returned by Algorithm 1
  - D: set of document identifiers
  - S: positions of the matched subsequences
- Refinement includes
  - Test for connectedness (refinement by connectedness)
  - Test for gap consistency (refinement by structure)
  - Test for frequency consistency (refinement by structure)
  - Find matched leaves (refinement by leaf matching)
62Connectedness (Theorem 2)
Gap consistency (Definition 3)
Frequency consistency (Definition 4)
Match leaves
63Extended Prufer Sequences
- A Regular-Prufer sequence contains only non-leaf nodes
- Add a dummy child node to each leaf node to create the Extended-Prufer sequence, which contains the labels of all nodes
- RPIndex: index based on Regular-Prufer sequences
- EPIndex: index based on Extended-Prufer sequences
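The difference between the two sequences can be seen on a toy tree. The sketch below relies on Lemma 1: with postorder numbering, nodes are deleted in postorder, so the LPS is the list of parent labels emitted in postorder. The nested-tuple tree encoding, the function name lps, and the toy tree are assumptions made for this example.

```python
def lps(tree, extended=False):
    """LPS of a tree given as (label, (child, child, ...)) nested tuples.

    With extended=True, every leaf is treated as if it had one dummy child,
    so leaf labels appear in the sequence as well.
    """
    out = []

    def visit(node, parent_label):
        label, children = node
        if extended and not children:
            out.append(label)          # deleting the dummy child records the leaf's label
        for child in children:
            visit(child, label)
        if parent_label is not None:
            out.append(parent_label)   # deleting this node records its parent's label

    visit(tree, None)
    return out

# Toy tree: A has children B and D; B has a single child C.
toy = ("A", (("B", (("C", ()),)), ("D", ())))

print(lps(toy))                  # Regular-Prufer sequence: non-leaf labels only
print(lps(toy, extended=True))   # Extended-Prufer sequence: all labels appear
```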
64Regular vs Extended Prufer Sequences
- RPSequence
  - Short
- RPIndex
  - Good for processing twig queries having no values
- EPSequence
  - Longer than the Regular-Prufer sequence
- EPIndex
  - Good for processing twig queries with values
- Optimization: both can coexist in the PRIX system
65Ordered and Unordered Twig Match
- All ordered twig matches can be found by using the Prufer sequence of the query twig
- To find unordered matches
  - Construct the Prufer sequences for the different arrangements of the branches of the query twig
  - Usually, the number of twig branches in a query is small => the number of arrangements is small
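One way to picture the arrangements is to permute the children of every query node and compute an LPS for each resulting tree. This is only a sketch under assumptions: the nested-tuple encoding, the function names, and the toy query are invented here, and the LPS is again derived via Lemma 1 (parent labels in postorder).

```python
from itertools import permutations, product

def arrangements(node):
    """All orderings of the branches of a twig given as (label, (children...))."""
    label, children = node
    if not children:
        return [(label, ())]
    result = []
    for order in permutations(children):
        # Combine every arrangement of each child, in this sibling order.
        for combo in product(*(arrangements(child) for child in order)):
            result.append((label, tuple(combo)))
    return result

def lps(node):
    """LPS of a twig: parent labels emitted in postorder."""
    out = []

    def visit(n, parent_label):
        label, children = n
        for child in children:
            visit(child, label)
        if parent_label is not None:
            out.append(parent_label)

    visit(node, None)
    return out

# Toy query twig: A with two branches, B/D and C/E.
query = ("A", (("B", (("D", ()),)), ("C", (("E", ()),))))

sequences = sorted({" ".join(lps(t)) for t in arrangements(query)})
print(sequences)
```

Each distinct sequence would then be run through the ordered matching pipeline, and the results combined.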
66Outline
- Introduction
- Background and Motivations
- Overview of PRIX Approach
- Finding Twig Matches
- Implementation Issues in the PRIX System
- Experimental Result
- Conclusions and Future Work
67- Compared the query performance of
- PRIX
- ViST
- TwigStack/TwigStackXB
68Experimental Setup
- 1.8GHz Pentium IV processor
- 512 MB RAM running Solaris 8
- 40GB EIDE disk drive (store data and indexes)
- Compiled with the GNU g++ compiler, version 2.95.3
- Buffer pool size 2000 pages of size 8K
69Data Sets
- Each has different characteristics
- DBLP: highly similar structure, shallow
- SWISSPROT: bushy and shallow
- TREEBANK: skinny and deep, with recursion of element names
70Queries
The following queries are tested
These queries have different characteristics in
terms of selectivity, presence of value, and twig
structure
71Performance in Terms of Time
72Performance Analysis: PRIX vs ViST
74PRIX vs ViST
75PRIX vs TwigStack/TwigStackXB
(comparable)
- Distribution of possible solutions in the data set
- Sub-optimality for parent/child relationships
76Outline
- Introduction
- Background and Motivations
- Overview of PRIX Approach
- Finding Twig Matches
- Implementation Issues in the PRIX System
- Experimental Result
- Conclusions and Future Work
77Conclusion and Future Work
- Transform XML documents and twig queries into Prufer sequences
- Query without breaking the queries into root-to-leaf paths and processing them individually
- Perform
  - Subsequence matching
  - Refinement by connectedness
  - Refinement by twig structure
  - Refinement by matching leaf nodes
- Future work
  - Explore the behavior of the PRIX system for different query characteristics
  - Analyze the complexity of query processing time
78Thank You!
(For detailed proofs, go to
http://www.cs.arizona.edu/research/reports.html)
79Question?