Title: PRIX: Indexing and Querying XML Using Prufer Sequences

1
PRIX: Indexing and Querying XML Using Prufer Sequences

By Praveen Rao and Bongki Moon,
ICDE Conf. 2004
Presented by Van Le-Pham
May 3, 2005

2
Outline

Introduction
Background and Motivations
Overview of PRIX Approach
Finding Twig Matches
Implementation Issues in the PRIX System
Experimental Result
Conclusions and Future Work

4
Introduction

XML is a new standard for information representation
Major issues: storing, indexing, and querying
Two main lines of research:
Structural indexes (DataGuide, 1-index, F&B index)
Numbering schemes that encode each element by its position (based on preorder and postorder tree traversal)

5
Background and Motivations I

Most existing approaches process each root-to-leaf path in the query twig separately and then merge the results.

6
Background and Motivations II

PathStack and TwigStack algorithms
Operate on the positional representation of the elements to find twig matches
Variant of TwigStack: TwigStackXB
Uses XB-trees to speed up processing when the input lists are long
Limitations
How much data can be skipped depends on the distribution of the matches in the input lists. If the matches are spread all over the dataset, TwigStackXB has to descend to the lower regions of the tree to avoid missing matches => ineffective.
Parent-child relationships are not handled very well.

7
Background and Motivations III

ViST (Virtual Suffix Tree)
Transforms XML data and queries into structure-encoded sequences
Finds the matching subsequences
Uses a virtual suffix tree to speed up matching
Limitations
The worst-case storage for the D-Ancestorship index is higher than linear in the total number of elements in the XML documents
May produce false alarms

8
Example of false alarm by ViST

The structure-encoded sequence of Q is a subsequence of the structure-encoded sequences of both Doc1 and Doc2. However, Q occurs only in Doc1 => the match detected in Doc2 is a false alarm

9
Background and Motivations IV

Motivations
Develop a method that allows holistic processing
without breaking a twig into root-to-leaf paths
Use a tree-to-sequence transformation that needs only linear storage
Index the sequences to reduce the amount of data that needs to be searched

10
Outline

Introduction
Background and Motivations
Overview of PRIX Approach
Finding Twig Matches
Implementation Issues in the PRIX System
Experimental Result
Conclusions and Future Work

11
XML Documents

Represented by a labeled tree such that
Each node is associated with its element tag and a number
Any numbering scheme can be used as long as it keeps the node numbers unique
Postorder traversal is used here

12
Prufer Sequences for Labeled Trees

Tree-to-sequence: Prufer's method applies to a tree T_n with n nodes labeled from 1 to n
Delete the leaf with the smallest label => tree T_{n-1}
Record the label a_1 of the deleted node's parent
Repeat the two steps above on tree T_{n-1} to get a_2
Continue until only 1 node is left
The result is the sequence (a_1, a_2, ..., a_{n-1}) of length (n - 1)

13
Prufer Sequences for Labeled Trees II

LPS (Labeled Prufer Sequence): the sequence (a_1, a_2, ..., a_{n-1}) collected on the previous slide, written as element tags
NPS (Numbered Prufer Sequence): the sequence of postorder numbers corresponding to the LPS

14
Example
Tree T_n with n = 15
15
Example II

Tree: T_n
NPS: 15
LPS: A

16
Example III

Tree: T_{n-1}
NPS: 15 3
LPS: A C

17
Example IV

Tree: T_{n-2}
NPS: 15 3 7
LPS: A C B

18
Example V

Tree: T_{n-3}
NPS: 15 3 7 6
LPS: A C B C

19
Example VI

Tree: T_{n-4}
NPS: 15 3 7 6 6
LPS: A C B C C

20
Example VII

Continue in the same way until we have
NPS: 15 3 7 6 6 7 15 9 15 13 13 13 14 15
LPS: A C B C C B A C A E E E D A
(length n - 1 = 15 - 1 = 14)
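
The deletion procedure walked through in Examples II to VII can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `prufer_sequences` and the `parent` and `label` dictionaries are assumptions, and the tree is read off the NPS of the example (node k's parent is the k-th NPS entry).

```python
import heapq

def prufer_sequences(parent, label):
    """Return (NPS, LPS) for a tree given as a child -> parent map.

    parent: maps every non-root node number to its parent's number.
    label:  maps every non-leaf node number to its element tag.
    """
    # Count children so we can tell when a node becomes a leaf.
    child_count = {}
    for child, par in parent.items():
        child_count[par] = child_count.get(par, 0) + 1

    nodes = set(parent) | set(parent.values())
    leaves = [v for v in nodes if child_count.get(v, 0) == 0]
    heapq.heapify(leaves)

    nps = []
    while leaves:
        leaf = heapq.heappop(leaves)   # smallest-numbered leaf
        if leaf not in parent:         # only the root is left: stop
            break
        par = parent[leaf]
        nps.append(par)                # record the deleted node's parent
        child_count[par] -= 1
        if child_count[par] == 0:      # the parent has become a leaf
            heapq.heappush(leaves, par)
    lps = [label[a] for a in nps]
    return nps, lps

# Tree from the example (n = 15), numbered in postorder.
parent = {1: 15, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 15, 8: 9, 9: 15,
          10: 13, 11: 13, 12: 13, 13: 14, 14: 15}
label = {3: "C", 6: "C", 7: "B", 9: "C", 13: "E", 14: "D", 15: "A"}

nps, lps = prufer_sequences(parent, label)
print("NPS:", " ".join(map(str, nps)))
print("LPS:", " ".join(lps))
```

A heap keeps "delete the smallest leaf" cheap; any structure that yields the smallest remaining leaf would do.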

21
Processing Twig Queries by Prufer Sequences

A query twig is transformed into Prufer sequences, just like the XML documents
Non-matches are filtered out by subsequence matching on the indexed sequences
Twig matches are then found by applying a series of refinement strategies

22
Architectural Overview of PRIX
23
Outline

Introduction
Background and Motivations
Overview of PRIX Approach
Finding Twig Matches
Implementation Issues in the PRIX System
Experimental Result
Conclusions and Future Work

24
Notations Used
25
Problem Statement

Given a collection of XML documents Δ and a query twig Q, report all the occurrences of twig Q in Δ
Finding twig matches in PRIX involves four phases of filtering and refinement

26
Phase 1: Filtering by Subsequence Matching

Definition 1: A subsequence is any string that can be obtained by deleting zero or more symbols from a given string
So the problem becomes:
Given query twig Q => LPS(Q)
Given XML documents Δ => G, the set of LPSs of Δ
We need to find all subsequences in G that match LPS(Q)

27
Phase 1: Filtering by Subsequence Matching II

Lemma 1: Given a tree T with n nodes, numbered from 1 to n in postorder, the node deleted at the i-th step of Prufer sequence construction is the node numbered i
Theorem 1: If tree Q is a subgraph of tree T, then LPS(Q) is a subsequence of LPS(T).
(Proof omitted)

28
Example

Tree T and tree Q
LPS(T) = A C B C C B A C A E E E D A
LPS(Q) = B A E D A
NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
NPS(Q) = 2 6 4 5 6

29
Example II

LPS(T) = A C B C C B A C A E E E D A
NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
LPS(Q) = B A E D A
NPS(Q) = 2 6 4 5 6
Q is a (labeled) subgraph of T, and LPS(Q) matches a subsequence S of LPS(T) at positions (6, 7, 11, 13, 14)
The postorder number sequence of subsequence S is 7 15 13 14 15
There may be more than one subsequence in LPS(T) that matches LPS(Q)
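
To make the last point concrete, the sketch below enumerates every position tuple at which LPS(Q) occurs in LPS(T) as a subsequence. It is a brute-force illustration only (PRIX uses an index for this step); the function name `subsequence_matches` is an assumption.

```python
def subsequence_matches(query, data):
    """Return every tuple of 1-based positions at which `query`
    occurs in `data` as a subsequence (brute force)."""
    matches = []

    def extend(qi, start, chosen):
        if qi == len(query):              # all query symbols placed
            matches.append(tuple(chosen))
            return
        for pos in range(start, len(data)):
            if data[pos] == query[qi]:
                extend(qi + 1, pos + 1, chosen + [pos + 1])

    extend(0, 0, [])
    return matches

lps_t = "A C B C C B A C A E E E D A".split()
nps_t = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]
lps_q = "B A E D A".split()

matches = subsequence_matches(lps_q, lps_t)
print(len(matches), "candidate subsequences")
# Postorder numbers of the match at positions (6, 7, 11, 13, 14):
print([nps_t[p - 1] for p in (6, 7, 11, 13, 14)])
```

Most of these candidates do not correspond to real twig occurrences, which is why the refinement phases follow.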

30
Phase 2: Refinement by Connectedness

Refine the subsequences matched in phase 1 by checking their connectedness property
The sequences that satisfy Theorem 2 are called tree sequences

31
Intuition

Let i be the index of the last occurrence of a postorder number a in an NPS
This last occurrence results from the deletion of the last child of a
By Lemma 1, the next node to be deleted is a itself
So if the tree is connected, the postorder number at index (i + 1) should be the parent of a

32
Example

Subsequences
S_A = C B C E D
N_A = 3 7 9 13 14
S_B = C B A C A E D A
N_B = 3 7 15 9 15 13 14 15

33
Example II

NPS(T): N_T = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
In the case of S_A: max(N_A1, N_A2, ..., N_A5) = 14
N_A = 3 7 9 13 14
i = 1, N_A1 = 3: N_A2 = 7 and N_T(3) = 7 => connected here
N_A = 3 7 9 13 14
i = 2, N_A2 = 7: N_A3 = 9 but N_T(7) = 15, and 9 ≠ 15 => disconnected here
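
The check in this example can be sketched as follows. The statement of Theorem 2 is not reproduced in this transcript, so the code follows the intuition slide and the worked example: for every number a other than the maximum, the element right after a's last occurrence must equal a's parent, which by Lemma 1 is N_T(a). The name `is_tree_sequence` is an assumption.

```python
def is_tree_sequence(n_sub, nps_t):
    """Connectedness test for the postorder numbers `n_sub` of a matched
    subsequence, given the NPS of the data tree `nps_t`."""
    top = max(n_sub)  # plays the role of the root of the matched subtree
    # Later occurrences overwrite earlier ones, leaving the last index.
    last_index = {a: i for i, a in enumerate(n_sub)}
    for a, i in last_index.items():
        if a == top:
            continue
        # By Lemma 1, node a is the a-th node deleted, so its parent
        # is the a-th entry of NPS(T).
        parent_of_a = nps_t[a - 1]
        if i + 1 >= len(n_sub) or n_sub[i + 1] != parent_of_a:
            return False
    return True

nps_t = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]
n_a = [3, 7, 9, 13, 14]
n_b = [3, 7, 15, 9, 15, 13, 14, 15]

print(is_tree_sequence(n_a, nps_t))  # N_A breaks at 7 -> 9
print(is_tree_sequence(n_b, nps_t))
```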
34
Phase 3: Refinement by Twig Structure

Refine the tree sequences from phase 2 by checking their twig structure property
Determine whether the structure of the tree represented by a tree sequence matches the query twig structure

35
Some definitions

Gap between nodes
(can be computed using the NPS of the tree)
Gap consistency

36
Example of gap consistency
37
Example of gap consistency II
S1 = B A E E A, S2 = B A E E A
N_S1 = 7 15 13 13 15, N_S2 = 2 7 6 6 7

N_S1: 7 (-) 15 (-) 13 (-) 13 (-) 15
gaps: -8, 2, 0, -2
N_S2: 2 (-) 7 (-) 6 (-) 6 (-) 7
gaps: -5, 1, 0, -1

38
Intuition

Gap(a, b) = the number of nodes encountered during postorder traversal between a and b
Therefore, if more nodes are traversed in the query twig than in the data twig => there is a structural difference between the data and the query twig
The number of times a number a occurs in an NPS = the number of child nodes of a
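
A small sketch of the gap test, under stated assumptions. The formal definition of gap consistency is not captured in this transcript, so the rule coded here is inferred from the example and the intuition above: corresponding gaps must have the same sign, a zero gap must pair with a zero gap, and the query gap must not be larger in magnitude than the data gap. The names `gaps` and `gap_consistent` are illustrative.

```python
def gaps(n_seq):
    """Differences between adjacent postorder numbers."""
    return [a - b for a, b in zip(n_seq, n_seq[1:])]

def gap_consistent(n_data, n_query):
    """Assumed rule: same length, same sign per gap, and the query
    never traverses more nodes than the data between two elements."""
    if len(n_data) != len(n_query):
        return False
    for g_data, g_query in zip(gaps(n_data), gaps(n_query)):
        if g_data == 0 or g_query == 0:
            if g_data != g_query:
                return False
        elif (g_data > 0) != (g_query > 0) or abs(g_query) > abs(g_data):
            return False
    return True

ns1 = [7, 15, 13, 13, 15]   # data
ns2 = [2, 7, 6, 6, 7]       # query

print(gaps(ns1), gaps(ns2))
print(gap_consistent(ns1, ns2))
```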

39
Definitions

Frequency consistency

40
Example of frequency consistency

N_S1 = 7 15 13 13 15, N_S2 = 2 7 6 6 7
Freq(7) = 1 = Freq(2), and occurs at position 1
Freq(15) = 2 = Freq(7), and occurs at positions 2, 5
Freq(13) = 2 = Freq(6), and occurs at positions 3, 4
=> S1 and S2 are frequency consistent
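
The frequency test in this example amounts to checking that equal numbers sit at the same positions in both sequences. The sketch below encodes that reading; the formal definition is not in this transcript, so treat the rule and the names `occurrence_pattern` and `frequency_consistent` as assumptions.

```python
def occurrence_pattern(n_seq):
    """Replace each number by the index of its first occurrence, so two
    sequences share a pattern exactly when equal numbers occupy the
    same positions in both."""
    first_seen = {}
    return [first_seen.setdefault(x, i) for i, x in enumerate(n_seq)]

def frequency_consistent(n_a, n_b):
    return (len(n_a) == len(n_b)
            and occurrence_pattern(n_a) == occurrence_pattern(n_b))

ns1 = [7, 15, 13, 13, 15]
ns2 = [2, 7, 6, 6, 7]

print(occurrence_pattern(ns1), occurrence_pattern(ns2))
print(frequency_consistent(ns1, ns2))
```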
41
What we have so far
(proof omitted)
42
What we have so far

From the lemmas and theorems => LPS(Q) and subsequence S of the data are identical
NPS(Q) is gap consistent and frequency consistent with N, the postorder number sequence of S

43
Reminder

An LPS contains only the non-leaf nodes, so we need to store the label and postorder number of every leaf in the database
With what we have done so far, we can only find twig matches whose
Tree structure is the same as the query tree
Non-leaf node labels match the non-leaf node labels of the query twig
We call these partial twig matches (because we don't check the leaves yet)

44
Reminder II

To find complete twig match, the leaf node labels
of the partial twig match in the data should be
matched with the leaf node labels of query twig

45
Phase 4: Refinement by Matching Leaf Nodes

The leaf node labels of the query twig are tested
with leaf node labels of partial twig matches
found in phase 3

46
Example (see example 6)

LPS(T) = A C B C C B A C A E E E D A
LPS(Q) = B A E D A
NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
NPS(Q) = 2 6 4 5 6
LPS(Q) matches a subsequence S = B A E D A in LPS(T) at positions P = (3rd, 7th, 11th, 13th, 14th), with postorder number sequence N = 7 15 13 14 15
Leaf nodes(T) = (D,2), (D,4), (E,5), (G,10), (F,11), (F,12)
Leaf nodes(Q) = (F,3)

47
Optimization

Phase 4 can be eliminated if we put the leaf
nodes of the query twig and data trees inside the
LPSs

48
Processing Wildcards (// and *)
First wildcard

Q = //A//C/D
LPS(Q) = C A
NPS(Q) = 2 3

Second wildcard
Query Q
49
Processing Wildcards

The first wildcard is already handled by the method
To handle the second wildcard, we need to modify the refinement-by-connectedness step
LPS(T) = A C B C C B A C A E E E D A
LPS(Q) = C A
NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
NPS(Q) = 2 3
S = C A, N = 3 15, and P = (2nd, 7th)
By Theorem 2 we would discard this subsequence (the last occurrence of 3 must be followed by 7, but here it is followed by 15 => unconnected)
To avoid this, we check recursively whether node 3 can lead to node 15 by following a series of edges in T

Subsequences that pass this test move on to the next phase
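
The "can node 3 lead to node 15" test can be sketched by walking parent pointers, since by Lemma 1 the k-th entry of NPS(T) is the parent of node k. This is an illustrative sketch, written iteratively rather than recursively; the name `is_ancestor` is an assumption.

```python
def is_ancestor(node, ancestor, nps_t):
    """True if `ancestor` is reached from `node` by following edges
    toward the root of T. nps_t[k - 1] is the parent of node k."""
    root = len(nps_t) + 1          # the root has postorder number n
    current = node
    while current < root:          # the root has no parent entry
        current = nps_t[current - 1]
        if current == ancestor:
            return True
    return False

nps_t = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]

print(is_ancestor(3, 15, nps_t))   # 3 -> 7 -> 15
print(is_ancestor(3, 9, nps_t))
```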
50
Outline

Introduction
Background and Motivations
Overview of PRIX Approach
Finding Twig Matches
Implementation Issues in the PRIX System
Experimental Result
Conclusions and Future Work

51
Building Prufer Sequences

Tree-to-sequence: Prufer's method applies to a tree T_n with n nodes labeled from 1 to n
Delete the leaf with the smallest label => tree T_{n-1}
Record the label a_1 of the deleted node's parent
Repeat the two steps above on tree T_{n-1} to get a_2
Continue until only 1 node is left
The result is the sequence (a_1, a_2, ..., a_{n-1}) of length (n - 1)

52
Indexing Sequence Using B-tree

Index the set of LPSs of the XML documents to speed up the subsequence matching step
Use B-trees (in a way similar to ViST) to build a virtual trie

53
Virtual Trie

Each node in the trie is labeled with a range (LeftPos, RightPos) (called its positional representation) such that the containment property is satisfied
The range of the root is (1, MAX_INT)
For each element tag e, we build a B-tree that indexes the positional representation of every occurrence of e, using its LeftPos as the key => called the Trie-Symbol index
Each document tree identifier is indexed in a separate B-tree, using as the key the LeftPos of the node where its LPS ends in the virtual trie => called the Docid index

54
Space Complexity

The size of the trie grows linearly with the length of the sequences stored in it
And the length of a Prufer sequence is linear in the number of nodes in the tree
=> index size is linear in the number of nodes in the tree

55
Filtering by Subsequence Matching

Let Qs = LPS(Q) = Qs1 Qs2 .. Qsk (a sequence of length k)
Algorithm 1: find all occurrences of Qs using the Trie-Symbol index
Call FindSubsequence(Qs, 1, 0, MAX_INT)
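
The listing of Algorithm 1 is not captured in this transcript, so the following is only a rough sketch of the idea described on the Virtual Trie slide. Assumptions: sorted Python lists stand in for the B-trees, the trie is numbered statically by a depth-first walk (RightPos is the largest LeftPos in the subtree), the query index is 0-based, and `build_indexes` and `find_subsequence` are illustrative names.

```python
import bisect
from collections import defaultdict

MAX_INT = 2**31 - 1

def build_indexes(lps_by_doc):
    # Build a plain trie of all LPSs.
    root = {"children": {}, "docs": []}
    for doc_id, lps in lps_by_doc.items():
        node = root
        for symbol in lps:
            node = node["children"].setdefault(
                symbol, {"children": {}, "docs": []})
        node["docs"].append(doc_id)

    symbol_index = defaultdict(list)  # tag -> [(LeftPos, RightPos, depth)]
    docid_index = []                  # [(LeftPos of end node, doc_id)]
    counter = 0

    def number(node, symbol, depth):
        nonlocal counter
        counter += 1
        left = counter
        for doc_id in node["docs"]:
            docid_index.append((left, doc_id))
        for child_symbol, child in node["children"].items():
            number(child, child_symbol, depth + 1)
        if symbol is not None:        # the root carries no symbol
            symbol_index[symbol].append((left, counter, depth))

    number(root, None, 0)
    for entries in symbol_index.values():
        entries.sort()
    docid_index.sort()
    return symbol_index, docid_index

def find_subsequence(query, i, left, right, symbol_index, docid_index,
                     positions, out):
    """Range query: trie nodes tagged query[i] with left < LeftPos <= right."""
    entries = symbol_index.get(query[i], [])
    start = bisect.bisect_right(entries, (left, MAX_INT, MAX_INT))
    for node_left, node_right, depth in entries[start:]:
        if node_left > right:
            break
        matched = positions + (depth,)   # depth = position in the LPS
        if i == len(query) - 1:
            # Documents whose LPS ends at this node or below it.
            lo = bisect.bisect_left(docid_index, (node_left,))
            for end_left, doc_id in docid_index[lo:]:
                if end_left > node_right:
                    break
                out.append((doc_id, matched))
        else:
            find_subsequence(query, i + 1, node_left, node_right,
                             symbol_index, docid_index, matched, out)

docs = {"T": "A C B C C B A C A E E E D A".split(),
        "U": "A C B A".split()}       # "U" is a made-up second document
symbol_index, docid_index = build_indexes(docs)
out = []
find_subsequence("B A E D A".split(), 0, 0, MAX_INT,
                 symbol_index, docid_index, (), out)
print(sorted({doc for doc, _ in out}), len(out))
```

Each result pairs a document identifier with the positions of the matched subsequence, which is the input the refinement phases expect.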
56
Filtering by Subsequence Matching

57
Optimized Subsequence Matching

Speeds up subsequence matching further
Reduces the number of range queries performed by Algorithm 1 without causing any false dismissals
Prunes some nodes (r in line 2) using the gap between two adjacent nodes in the query sequence
Given a set of XML documents Δ and a node label e in Δ, define MaxGap(e, Δ)

58
Example of MaxGap(e, Δ)

In tree P, difference(A) = 14 - 8 = 6
In tree Q, difference(A) = 3 - 1 = 2
MaxGap(A, {P,Q}) = max{6, 2} = 6
MaxGap(e, Δ) = 0 if every occurrence of e has at most 1 child

59
How MaxGap Is Useful

Suppose the query twig has B as the parent of C => LPS(Q) = C B
Then C B has 8 matches in LPS(P), at these positions: (1,3), (1,4), (1,7), (1,9), (2,3), (2,4), (2,7), and (2,9)
LPS(P)   = C C B B E E B A B C  D  D  C  A
Position = 1 2 3 4 5 6 7 8 9 10 11 12 13 14
MaxGap(C, {P,Q}) = max{(2 - 1), (13 - 10), (8 - 7)} = 3, so
The gap between the first and last occurrences of C <= 3
The gap between the first occurrence of C and its parent B <= 4
=> only 4 matches are considered for the next step: (1,3), (1,4), (2,3), and (2,4)
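
A hedged sketch of both halves of this idea. `max_gap` computes the largest spread between the first and last child of any node with a given tag, reading child positions off the LPS and NPS (by Lemma 1, position k belongs to node k). `prune_pairs` applies the bound from this slide, parent position minus child position <= MaxGap + 1, to a parent-child query pair. Both names are assumptions, and MaxGap(C) = 3 is taken from the slide because NPS(P) is not given in this transcript.

```python
def max_gap(tag, documents):
    """documents: iterable of (lps, nps) pairs, one per document tree."""
    best = 0
    for lps, nps in documents:
        first, last = {}, {}
        for pos, (symbol, number) in enumerate(zip(lps, nps), start=1):
            if symbol == tag:
                first.setdefault(number, pos)   # first child of this node
                last[number] = pos              # last child seen so far
        for number in first:
            best = max(best, last[number] - first[number])
    return best

def prune_pairs(lps, child_tag, parent_tag, gap):
    """Keep (child, parent) position pairs within gap + 1 of each other."""
    pairs = []
    for i, symbol in enumerate(lps, start=1):
        if symbol != child_tag:
            continue
        for j in range(i + 1, len(lps) + 1):
            if lps[j - 1] == parent_tag and j - i <= gap + 1:
                pairs.append((i, j))
    return pairs

lps_t = "A C B C C B A C A E E E D A".split()
nps_t = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]
print(max_gap("E", [(lps_t, nps_t)]))

lps_p = "C C B B E E B A B C D D C A".split()
print(prune_pairs(lps_p, "C", "B", gap=3))
```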

60
Using MaxGap
61
The Refinement Phases

Algorithm 2 refines the set of ordered pairs (D, S) returned by Algorithm 1
D: set of document identifiers
S: positions of the matched subsequences
Refinement includes
Test for connectedness (refinement by connectedness)
Test for gap consistency (refinement by structure)
Test for frequency consistency (refinement by structure)
Find matching leaves (refinement by leaf matching)

62
Connectedness (Theorem 2)
Gap consistency (Definition 3)
Frequency consistency (Definition 4)
Match leaves
63
Extended Prufer Sequences

A Regular-Prufer sequence contains only non-leaf nodes
Adding a dummy child node to each leaf node gives the Extended-Prufer sequence, which contains the labels of all nodes
RPIndex: index based on Regular-Prufer sequences
EPIndex: index based on Extended-Prufer sequences
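
A minimal sketch of the dummy-child construction, assuming trees are written as nested `(tag, [children])` tuples. By Lemma 1 the LPS can be read off a postorder walk: each non-root node contributes its parent's tag. The names `lps_of` and `extend_tree`, the dummy tag `"#"`, and the tag `"X"` for the query's first leaf (not given in this transcript) are all placeholders.

```python
def lps_of(tree):
    """LPS of a tree given as (tag, [children]), children left to right."""
    sequence = []

    def visit(node, parent_tag):
        tag, children = node
        for child in children:
            visit(child, tag)
        if parent_tag is not None:     # the root is never deleted
            sequence.append(parent_tag)

    visit(tree, None)
    return sequence

def extend_tree(tree, dummy="#"):
    """Attach one dummy child to every leaf."""
    tag, children = tree
    if not children:
        return (tag, [(dummy, [])])
    return (tag, [extend_tree(child, dummy) for child in children])

# Query twig Q from the earlier example; "X" is a placeholder leaf tag.
q = ("A", [("B", [("X", [])]),
           ("D", [("E", [("F", [])])])])

print("Regular: ", " ".join(lps_of(q)))
print("Extended:", " ".join(lps_of(extend_tree(q))))
```

The extended sequence now mentions the leaf tags as well, at the cost of being longer.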

64
Regular vs Extended Prufer Sequences

RPSequence
Short
RPIndex
Good for processing twig queries having no values

EPSequence
Longer than the Regular-Prufer sequence
EPIndex
Good for processing twig queries with values

Optimization: both can coexist in the PRIX system
65
Ordered and Unordered Twig Match

All ordered twig matches can be found by using the Prufer sequence of the query twig
To find unordered matches
Construct the Prufer sequence for each arrangement of the branches of the query twig
Usually the number of twig branches in a query is small => the number of arrangements is small
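
One way to enumerate the arrangements, sketched under the same nested-tuple assumption as before: permute the children at every node and take the LPS of each resulting tree. The names `lps_of` and `arrangements` and the leaf tag `"X"` are placeholders, not part of PRIX.

```python
from itertools import permutations, product

def lps_of(tree):
    """LPS of a tree given as (tag, [children]), via a postorder walk."""
    sequence = []

    def visit(node, parent_tag):
        tag, children = node
        for child in children:
            visit(child, tag)
        if parent_tag is not None:
            sequence.append(parent_tag)

    visit(tree, None)
    return sequence

def arrangements(tree):
    """All trees obtained by reordering the branches at every node."""
    tag, children = tree
    if not children:
        return [tree]
    options = [arrangements(child) for child in children]
    result = []
    for combo in product(*options):        # one arrangement per child
        for order in permutations(combo):  # every order of the children
            result.append((tag, list(order)))
    return result

q = ("A", [("B", [("X", [])]),
           ("D", [("E", [("F", [])])])])

query_sequences = {tuple(lps_of(t)) for t in arrangements(q)}
for seq in sorted(query_sequences):
    print(" ".join(seq))
```

The count grows factorially with the fan-out of each query node, which is acceptable only because query twigs usually have few branches.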

66
Outline

Introduction
Background and Motivations
Overview of PRIX Approach
Finding Twig Matches
Implementation Issues in the PRIX System
Experimental Result
Conclusions and Future Work

Compared the query performance of
PRIX
ViST
TwigStack/TwigStackXB

68
Experimental Setup

1.8 GHz Pentium IV processor
512 MB RAM, running Solaris 8
40 GB EIDE disk drive (stores data and indexes)
Compiled with the GNU g++ compiler, version 2.95.3
Buffer pool size: 2000 pages of 8 KB each

69
Data Sets

Each has different characteristics
DBLP: high structural similarity, shallow
SWISSPROT: bushy and shallow
TREEBANK: skinny and deep, with recursion of element names

70
Queries
The following queries are tested
These queries have different characteristics in
terms of selectivity, presence of value, and twig
structure
71
Performance in terms of time
72
Performance Analysis: PRIX vs ViST
74
PRIX vs ViST
75
PRIX vs TwigStack/TwigStackXB
(comparable)
1- Distribution of possible solutions in the data set
2- Sub-optimality for parent/child relationships
76
Outline

Introduction
Background and Motivations
Overview of PRIX Approach
Finding Twig Matches
Implementation Issues in the PRIX System
Experimental Result
Conclusions and Future Work

77
Conclusion and Future Work

Transform XML documents and twig queries into Prufer sequences
Perform querying without breaking the queries into root-to-leaf paths and processing them individually
Query processing performs
Subsequence matching
Refinement by connectedness
Refinement by twig structure
Refinement by matching leaf nodes
Future work
Explore the behavior of the PRIX system for different query characteristics
Analyze the complexity of query processing time