PRIX: Indexing and Querying XML Using Prufer Sequences

1
PRIX: Indexing and Querying XML Using Prufer Sequences
  • By Praveen Rao and Bongki Moon,
  • ICDE Conf. 2004
  • Presented by Van Le-Pham
  • May 3, 2005

2
Outline
  • Introduction
  • Background and Motivations
  • Overview of PRIX Approach
  • Finding Twig Matches
  • Implementation Issues in the PRIX System
  • Experimental Result
  • Conclusions and Future Work

3
Outline
  • Introduction
  • Background and Motivations
  • Overview of PRIX Approach
  • Finding Twig Matches
  • Implementation Issues in the PRIX System
  • Experimental Result
  • Conclusions and Future Work

4
Introduction
  • XML as a new standard for information
    representation
  • Major issues: storing, indexing, and querying
  • Two main lines of research, based on:
  • Structural indexes (DataGuide, 1-Index, F&B-Index)
  • Numbering schemes that encode each element by its
    position (based on preorder and postorder tree
    traversal)

5
Background and Motivations I
  • Most existing approaches process each of the
    root-to-leaf paths in the query twig separately
    and then merge the results.

6
Background and Motivations II
  • PathStack and TwigStack algorithms
  • Operate on the positional representation of the
    elements to find twig matches
  • Variant of TwigStack: TwigStackXB
  • Uses XB-trees to speed up processing when the input
    lists are long
  • Limitations
  • How much data can be skipped depends on the
    distribution of the matches in the input lists. If
    the matches are spread all over the dataset,
    TwigStackXB has to descend to the lower regions of
    the XB-tree to avoid missing matches => ineffective.
  • Parent-child relationships are not handled very
    well.

7
Background and Motivations III
  • ViST (Virtual Suffix Tree)
  • Transforms XML data and queries into
    structure-encoded sequences
  • Finds the matching subsequences
  • Uses a virtual suffix tree to speed up matching
  • Limitations
  • The worst-case storage for D-Ancestorship is
    higher than linear in the total number of
    elements in the XML documents
  • May produce false alarms

8
Example of false alarm by ViST
  • The structure-encoded sequence of Q is a
    subsequence of the structure-encoded sequences of
    Doc1 and Doc2. However, Q occurs only in Doc1 =>
    the match detected in Doc2 is a false alarm

9
Background and Motivations IV
  • Motivations
  • Develop a method that allows holistic processing
    without breaking a twig into root-to-leaf paths
  • Construct a tree-to-sequence transformation with
    linear storage
  • Index the sequences to reduce the amount of data
    that needs to be searched

10
Outline
  • Introduction
  • Background and Motivations
  • Overview of PRIX Approach
  • Finding Twig Matches
  • Implementation Issues in the PRIX System
  • Experimental Result
  • Conclusions and Future Work

11
XML Documents
  • Represented by a labeled tree such that each node
    is associated with its element tag and a number
  • Any numbering scheme can be used as long as it
    keeps the node numbers unique
  • Here: numbering by postorder traversal

12
Prufer Sequences for Labeled Trees
  • Tree-to-sequence: Prufer's method applies to a tree
    T_n with n nodes labeled from 1 to n
  • Delete the smallest-labeled leaf from T_n => tree
    T_{n-1}
  • Collect the label a_1 of the deleted node's
    parent
  • Repeat the above 2 steps for tree T_{n-1} and
    get a_2
  • Continue until only 1 node is left
  • Create the sequence (a_1, a_2, ..., a_{n-1}) of
    length (n - 1)

13
Prufer Sequences for Labeled Trees II
  • LPS (Labeled Prufer Sequence): the sequence
    (a_1, a_2, ..., a_{n-1}) of node labels collected
    as on the previous slide
  • NPS (Numbered Prufer Sequence): the sequence of
    postorder numbers of the nodes in the
    corresponding LPS

14
Example
T_n with n = 15
15
Example II
  • NPS: 15
  • LPS: A

T_n
16
Example III
  • NPS: 15 3
  • LPS: A C

T_{n-1}
17
Example IV
  • NPS: 15 3 7
  • LPS: A C B

T_{n-2}
18
Example V
  • NPS: 15 3 7 6
  • LPS: A C B C

T_{n-3}
19
Example VI
  • NPS: 15 3 7 6 6
  • LPS: A C B C C

T_{n-4}
20
Example VII
  • Keep going like this until we have
  • NPS: 15 3 7 6 6 7 15 9 15 13 13 13 14 15
  • LPS: A C B C C B A C A E E E D A
  • (length = n - 1 = 15 - 1 = 14)
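
The construction on the preceding slides can be sketched in a few lines of Python. This is only an illustration, not the PRIX implementation: the tree figure is missing from the transcript, so the PARENT map below is inferred from the final NPS on the slide (the i-th entry taken as the parent of node i), and the names PARENT, LABEL and prufer_sequences are made up for this sketch. Leaf tags are not needed because only parent labels are collected.

```python
import heapq

# Assumed encoding of the example tree: child -> parent, inferred from the NPS.
PARENT = {1: 15, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 15, 8: 9, 9: 15,
          10: 13, 11: 13, 12: 13, 13: 14, 14: 15}
# Tags of the non-leaf nodes, read off the LPS/NPS pair on the slide.
LABEL = {3: "C", 6: "C", 7: "B", 9: "C", 13: "E", 14: "D", 15: "A"}


def prufer_sequences(parent, label):
    """Return (LPS, NPS) by repeatedly deleting the smallest-numbered leaf."""
    nodes = set(parent) | set(parent.values())
    child_count = {v: 0 for v in nodes}
    for p in parent.values():
        child_count[p] += 1

    # Min-heap of the current leaves, so the smallest one is deleted first.
    leaves = [v for v in nodes if child_count[v] == 0]
    heapq.heapify(leaves)

    lps, nps = [], []
    while len(nps) < len(nodes) - 1:
        v = heapq.heappop(leaves)      # delete the smallest leaf
        p = parent[v]
        lps.append(label[p])           # record the parent's tag
        nps.append(p)                  # record the parent's number
        child_count[p] -= 1
        if child_count[p] == 0:        # the parent has become a leaf
            heapq.heappush(leaves, p)
    return lps, nps


lps, nps = prufer_sequences(PARENT, LABEL)
print(" ".join(lps))
print(" ".join(map(str, nps)))
```

Running it reproduces the two sequences of Example VII, each of length 14.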

21
Processing Twig Queries by Prufer Sequences
  • A query twig is transformed into Prufer sequences
    in the same way as XML documents
  • Non-matches are filtered out by subsequence
    matching on the indexed sequences
  • Twig matches are then found by applying a
    series of refinement strategies

22
Architectural Overview of PRIX
23
Outline
  • Introduction
  • Background and Motivations
  • Overview of PRIX Approach
  • Finding Twig Matches
  • Implementation Issues in the PRIX System
  • Experimental Result
  • Conclusions and Future Work

24
Notations Used
25
Problem Statement
  • Given a collection of XML documents Δ and a query
    twig Q, report all the occurrences of twig Q in Δ
  • Finding twig matches in PRIX includes four phases
    of filtering and refinement

26
Phase 1 Filtering by Subsequence Matching
  • Definition 1: A subsequence is any string that
    can be obtained by deleting zero or more symbols
    from a given string
  • So the problem becomes
  • Given query twig Q => LPS(Q)
  • Given XML documents Δ => Γ, the set of LPSs of Δ
  • We need to find all subsequences in Γ that match
    LPS(Q)

27
Phase 1 Filtering by Subsequence Matching II
  • Lemma 1: Given a tree T with n nodes, numbered
    from 1 to n in postorder, the node deleted at the
    i-th step of Prufer sequence construction is
    the node numbered i
  • Theorem 1: If tree Q is a subgraph of tree T,
    then LPS(Q) is a subsequence of LPS(T).
  • (Proof omitted)

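
One consequence of Lemma 1 is that the NPS alone encodes the tree shape: the i-th entry is the parent of node i, and the numbers that never appear are the leaves. A small sketch using NPS(T) from the running example (the function names are illustrative, not from the paper):

```python
NPS_T = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]


def parents_from_nps(nps):
    """Map each node i (1-based) to its parent, which is the i-th NPS entry."""
    return {node: parent for node, parent in enumerate(nps, start=1)}


def leaves_from_nps(nps):
    """Nodes that never occur in the NPS have no children, so they are leaves."""
    internal = set(nps)
    return [node for node in range(1, len(nps) + 2) if node not in internal]


parents = parents_from_nps(NPS_T)
print(parents[3])               # parent of node 3
print(leaves_from_nps(NPS_T))   # leaf numbers of T
```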
28
Example
  • Tree T and tree Q
  • LPS(T) = A C B C C B A C A E E E D A
    LPS(Q) = B A E D A
  • NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
    NPS(Q) = 2 6 4 5 6

29
Example II
  • LPS(T) = A C B C C B A C A E E E D A
  • NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
  • LPS(Q) = B A E D A
  • NPS(Q) = 2 6 4 5 6
  • Q is a (labeled) subgraph of T, and LPS(Q)
    matches a subsequence S of LPS(T) at
    positions (6, 7, 11, 13, 14)
  • The postorder number sequence of subsequence S is
  • 7 15 13 14 15
  • There may be more than one subsequence in
    LPS(T) that matches LPS(Q)
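
To make the last bullet concrete, a brute-force sketch can enumerate every way LPS(Q) occurs as a subsequence of LPS(T). PRIX performs this step through its index; the recursive scan below (function name is illustrative) only shows what phase 1 has to return.

```python
LPS_T = "A C B C C B A C A E E E D A".split()
LPS_Q = "B A E D A".split()


def subsequence_matches(text, pattern):
    """Yield 1-based position tuples at which pattern is a subsequence of text."""
    def extend(start, k, chosen):
        if k == len(pattern):
            yield tuple(chosen)
            return
        for i in range(start, len(text)):
            if text[i] == pattern[k]:
                # Fix pattern[k] at position i + 1 and match the rest after it.
                yield from extend(i + 1, k + 1, chosen + [i + 1])
    yield from extend(0, 0, [])


matches = list(subsequence_matches(LPS_T, LPS_Q))
print(len(matches))
print((6, 7, 11, 13, 14) in matches)
```

The match at positions (6, 7, 11, 13, 14) is one of several candidates; the later refinement phases decide which candidates are real twig matches.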

30
Phase 2 Refinement by Connectedness
  • Refine the matched subsequences in phase 1 for
    their connectedness property
  • The sequences that satisfy Theorem 2 are called
    tree sequences

31
Intuition
  • Let i be the index of the last occurrence of a
    postorder number a in an NPS
  • This last occurrence is the result of deleting
    the last child of a
  • Based on Lemma 1, the next node to be deleted is
    a itself
  • So if the tree is connected, then the postorder
    number at index (i + 1) should be the
    parent of a

32
Example
  • Subsequences
  • S_A = C B C E D
  • N_A = 3 7 9 13 14
  • S_B = C B A C A E D A
  • N_B = 3 7 15 9 15 13 14 15

33
Example II
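  • NPS(T): N_T = 15 3 7 6 6 7 15 9 15 13 13 13 14
    15 (N_T(a) is the a-th element, i.e. the parent
    of node a)
  • In the case of S_A: max(N_A1, N_A2, ..., N_A5) = 14
  • N_A = 3 7 9 13 14
  • i = 1, N_A1 = 3
  • N_A2 = 7 and N_T(3) = 7, so 7 = 7 => connected
    here
  • N_A = 3 7 9 13 14
  • i = 2, N_A2 = 7
  • N_A3 = 9 but N_T(7) = 15, so 9 ≠ 15 => disconnected
    here
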
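
The check walked through above can be sketched as follows. The statement of Theorem 2 did not survive in the transcript, so this follows the Intuition slide only: for every number a other than the maximum, the element right after the last occurrence of a must be the parent of a in T. The function name is hypothetical.

```python
NPS_T = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]


def is_tree_sequence(n_sub, nps_t):
    """Connectedness test on the postorder numbers of a matched subsequence."""
    top = max(n_sub)
    for i, a in enumerate(n_sub):
        is_last_occurrence = a not in n_sub[i + 1:]
        if a == top or not is_last_occurrence:
            continue
        # After its last child is deleted, a itself is deleted next,
        # so the following entry must be the parent of a in T.
        if i + 1 >= len(n_sub) or n_sub[i + 1] != nps_t[a - 1]:
            return False
    return True


print(is_tree_sequence([3, 7, 9, 13, 14], NPS_T))              # N_A
print(is_tree_sequence([3, 7, 15, 9, 15, 13, 14, 15], NPS_T))  # N_B
```

N_A fails at i = 2 exactly as on the slide, while N_B passes and moves on to phase 3.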
34
Phase 3 Refinement by Twig Structure
  • Refine the tree sequences from phase 2 by their
    twig structure property
  • Determine whether the structure of the tree
    represented by a tree sequence matches the query
    twig structure

35
Some definitions
  • Gap between nodes
  • (can be computed using the NPS of the tree)
  • Gap consistency

36
Example of gap consistency
37
Example of gap consistency II
S1 = B A E E A, S2 = B A E E A
NS1 = 7 15 13 13 15, NS2 = 2 7 6 6 7
  • NS1: 7 (-) 15 (-) 13 (-) 13 (-) 15
  • gaps: -8 2 0 -2
  • NS2: 2 (-) 7 (-) 6 (-) 6 (-) 7
  • gaps: -5 1 0 -1

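
The gap rows above can be recomputed mechanically. The formal definition of gap consistency is not preserved in the transcript, so the test below is an assumption pieced together from the example and the Intuition slide: corresponding gaps must have the same sign (or both be zero), and the query-side gap must not be larger in magnitude than the data-side gap. Treating NS2 as the query side and NS1 as the data side is also an assumption.

```python
NS1 = [7, 15, 13, 13, 15]   # assumed data side
NS2 = [2, 7, 6, 6, 7]       # assumed query side


def gaps(seq):
    """Differences between adjacent postorder numbers, as on the slide."""
    return [a - b for a, b in zip(seq, seq[1:])]


def gap_consistent(query_nps, data_nps):
    """Assumed reading of gap consistency (see the lead-in above)."""
    if len(query_nps) != len(data_nps):
        return False
    for gq, gd in zip(gaps(query_nps), gaps(data_nps)):
        if gq == 0 or gd == 0:
            if gq != gd:                 # a zero gap must pair with a zero gap
                return False
        elif (gq > 0) != (gd > 0) or abs(gq) > abs(gd):
            return False                 # sign differs, or query gap too large
    return True


print(gaps(NS1), gaps(NS2))
print(gap_consistent(NS2, NS1))
```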
38
Intuition
  • Gap(a, b) = number of nodes encountered
    during postorder traversal between a and b
  • Therefore, if more nodes are traversed in the
    query twig than in the data twig => there
    is a structural difference between the data and
    the query twig
  • The number of times a number a occurs in an NPS
    = the number of child nodes of a

39
Definitions
  • Frequency consistency

40
Example of frequency consistency
  • NS1 = 7 15 13 13 15, NS2 = 2 7 6 6 7
  • Freq(7) in NS1 = 1 = Freq(2) in NS2, and both
    occur at position 1
  • Freq(15) = 2 = Freq(7), and both occur at positions
    2, 5
  • Freq(13) = 2 = Freq(6), and both occur at positions
    3, 4
  • => S1 and S2 are frequency consistent

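
The example suggests a simple way to test this property: two number sequences agree if equal numbers appear at exactly the same positions in both. The sketch below encodes each sequence by the index of the first occurrence of every element and compares the encodings. The formal definition is missing from the transcript, so this is an interpretation of the example rather than the paper's wording.

```python
NS1 = [7, 15, 13, 13, 15]
NS2 = [2, 7, 6, 6, 7]


def occurrence_pattern(seq):
    """Replace each element by the index at which its value first appears."""
    first_seen = {}
    return [first_seen.setdefault(x, i) for i, x in enumerate(seq)]


def frequency_consistent(seq_a, seq_b):
    """True if repeated numbers occupy the same positions in both sequences."""
    return occurrence_pattern(seq_a) == occurrence_pattern(seq_b)


print(occurrence_pattern(NS1))
print(frequency_consistent(NS1, NS2))
```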
41
What we have so far
(proof omitted)
42
What we have so far
  • From the lemmas and theorems => LPS(Q) and the
    subsequence S of the data are identical
  • NPS(Q) is gap consistent and frequency consistent
    with N, the postorder number sequence of S

43
Reminder
  • The LPS contains only the non-leaf nodes, therefore
  • we need to store the label and postorder number of
    every leaf in the database
  • With what we have done so far, we can only find
    twig matches whose
  • Tree structure is the same as the query tree
  • Non-leaf node labels match the non-leaf node
    labels of the query twig
  • We call these partial twig matches (because we
    don't check the leaves yet)

44
Reminder II
  • To find complete twig match, the leaf node labels
    of the partial twig match in the data should be
    matched with the leaf node labels of query twig

45
Phase 4 Refinement by Matching Leaf Nodes
  • The leaf node labels of the query twig are tested
    with leaf node labels of partial twig matches
    found in phase 3

46
Example (see example 6)
  • LPS(T) = A C B C C B A C A E E E D A
    LPS(Q) = B A E D A
  • NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
    NPS(Q) = 2 6 4 5 6
  • LPS(Q) matches a subsequence
  • S = B A E D A in LPS(T) at positions
  • P = (3rd, 7th, 11th, 13th, 14th), with postorder
    number sequence
  • N = 7 15 13 14 15
  • Leaf nodes(T): (D,2), (D,4), (E,5), (G,10),
    (F,11), (F,12)
  • Leaf nodes(Q): (F,3)

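
A possible reading of the leaf test for this example, sketched under stated assumptions: by Lemma 1, query node i is deleted at position i of LPS(Q) and data node P[i] is deleted at position P[i] of LPS(T), so query leaf i is compared with data node P[i]. The sketch only covers query leaves that land on data leaves, uses the leaf lists exactly as they appear on the slide, and the function name is made up.

```python
DATA_LEAVES = [("D", 2), ("D", 4), ("E", 5), ("G", 10), ("F", 11), ("F", 12)]
QUERY_LEAVES = [("F", 3)]
P = (3, 7, 11, 13, 14)   # matched positions in LPS(T)


def leaves_match(query_leaves, data_leaves, positions):
    """Check that each query leaf maps to a data leaf with the same tag."""
    data_label = {number: label for label, number in data_leaves}
    for label, q_number in query_leaves:
        t_number = positions[q_number - 1]   # data node paired with the query leaf
        if data_label.get(t_number) != label:
            return False
    return True


print(leaves_match(QUERY_LEAVES, DATA_LEAVES, P))
```

Here query leaf (F,3) is paired with data node 11, which is listed as (F,11), so the partial match becomes a complete match.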
47
Optimize
  • Phase 4 can be eliminated if we put the leaf
    nodes of the query twig and data trees inside the
    LPSs

48
Processing Wildcards (// and *)
  • Query Q = //A//C/D (the leading // is the first
    wildcard, the // between A and C is the second)
  • LPS(Q) = C A
  • NPS(Q) = 2 3

49
Processing Wildcards
  • The first wildcard is already handled by the method
  • To handle the second wildcard, we need to modify
    the refinement-by-connectedness step
  • LPS(T) = A C B C C B A C A E E E D A
    LPS(Q) = C A
  • NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
    NPS(Q) = 2 3
  • S = C A, N = 3 15, and P = (2nd, 7th)
  • Theorem 2 => we would discard this subsequence
    (since the last occurrence of 3 must be followed
    by 7, but in this case it is followed by 15 =>
    unconnected)
  • To avoid this, we check recursively whether node 3
    can lead to node 15 by following a series of
    edges in T

Subsequences that pass this test move on to the
next phase
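
The relaxed test for the second wildcard amounts to an ancestor check. A minimal sketch, assuming parents are read from NPS(T) via Lemma 1 (the i-th entry is the parent of node i); the function name leads_to is illustrative:

```python
NPS_T = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]


def leads_to(node, target, nps_t):
    """Follow parent edges upward from node and report whether target is reached."""
    while node <= len(nps_t):        # the root (number n) has no NPS entry
        node = nps_t[node - 1]       # step to the parent
        if node == target:
            return True
    return False


print(leads_to(3, 15, NPS_T))   # 3 -> 7 -> 15
print(leads_to(3, 9, NPS_T))    # 9 is not an ancestor of 3
```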
50
Outline
  • Introduction
  • Background and Motivations
  • Overview of PRIX Approach
  • Finding Twig Matches
  • Implementation Issues in the PRIX System
  • Experimental Result
  • Conclusions and Future Work

51
Building Prufer Sequences
  • Tree-to-sequence: Prufer's method applies to a tree
    T_n with n nodes labeled from 1 to n
  • Delete the smallest-labeled leaf from T_n => tree
    T_{n-1}
  • Collect the label a_1 of the deleted node's
    parent
  • Repeat the above 2 steps for tree T_{n-1} and
    get a_2
  • Continue until only 1 node is left
  • Create the sequence (a_1, a_2, ..., a_{n-1})
    of length (n - 1)

52
Indexing Sequence Using B-tree
  • Index the set of LPSs of the XML documents to speed
    up the subsequence matching step
  • Use B-trees (in a way similar to ViST) to build
    a virtual trie

53
Virtual Trie
  • Each node in the trie is labeled with a range
    (LeftPos, RightPos) (called its positional
    representation) such that the containment
    property is satisfied
  • The range of the root is (1, MAX_INT)
  • For each element tag e, we build a B-tree that
    indexes the positional representation of every
    occurrence of e using its LeftPos as the key =>
    called the Trie-Symbol index
  • Index each document tree identifier in a separate
    B-tree using the LeftPos of the node where the
    LPS ends in the virtual trie as the key => called
    the Docid index

54
Space Complexity
  • The size of the trie grows linearly with the
    length of the sequences stored in it
  • And the length of a Prufer sequence is linear in
    the number of nodes in the tree
  • => The index size is linear in the number of
    nodes in the tree

55
Filtering by Subsequence Matching
  • Let LPS(Q) = Qs = Qs_1 Qs_2 ... Qs_k (a sequence of
    length k)
  • Algorithm 1 finds all occurrences of Qs using the
    Trie-Symbol index
  • Call FindSubsequence(Qs, 1, 0, MAX_INT)

56
Filtering by Subsequence Matching

57
Optimized Subsequence Matching
  • Speed up subsequence matching further
  • Reduce the number of range queries performed
    by Algorithm 1 without causing any
    false dismissals
  • Prune some nodes (r in line 2) using the gap
    between two adjacent nodes in the query sequence
  • Given a set of XML documents Δ and a node label e
    in Δ, define MaxGap(e, Δ)

58
Example of MaxGap(e, Δ)
  • In tree P, difference(A) = 14 - 8 = 6
  • In tree Q, difference(A) = 3 - 1 = 2
  • MaxGap(A, {P, Q}) = max{6, 2} = 6
  • MaxGap(e, Δ) = 0 if every occurrence of e has at
    most 1 child

59
How MaxGap is useful
  • Suppose we have a query twig in which B is the
    parent of C => LPS(Q) = C B
  • Then C B has 8 matches in LPS(P), at
    positions (1,3), (1,4), (1,7), (1,9),
    (2,3), (2,4), (2,7), and (2,9)
  • LPS(P)   C C B B E E B A B C  D  D  C  A
  • Position 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  • MaxGap(C, {P, Q}) = max{(2 - 1), (13 - 10), (8 - 7)}
    = 3, so
  • The gap between the first and last occurrences of C
    ≤ 3
  • The gap between the first occurrence of C and its
    parent B ≤ 4
  • => only 4 matches will be considered for the next
    step: (1,3), (1,4), (2,3) and (2,4)
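
The pruning in this example can be replayed with a short sketch. It enumerates all (C, B) position pairs in LPS(P) and keeps those whose distance is at most MaxGap + 1, reading the slide's second bound as "at most 4". The value MaxGap = 3 is taken from the slide rather than recomputed, because NPS(P) and tree Q are not available in the transcript. Function names are illustrative.

```python
LPS_P = "C C B B E E B A B C D D C A".split()
MAX_GAP_C = 3   # MaxGap(C, {P, Q}) as given on the slide


def pair_matches(lps, first, second):
    """All 1-based position pairs (i, j), i < j, matching the sequence 'first second'."""
    return [(i + 1, j + 1)
            for i, x in enumerate(lps) if x == first
            for j in range(i + 1, len(lps)) if lps[j] == second]


def prune_by_max_gap(matches, max_gap):
    """Keep pairs whose distance does not exceed max_gap + 1 (assumed bound)."""
    return [(i, j) for i, j in matches if j - i <= max_gap + 1]


all_pairs = pair_matches(LPS_P, "C", "B")
print(all_pairs)
print(prune_by_max_gap(all_pairs, MAX_GAP_C))
```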

60
Using MaxGap
61
The Refinement Phases
  • Algorithm 2 refines the set of ordered
    pairs (D, S) returned by Algorithm 1
  • D: set of document identifiers
  • S: positions of the matched subsequences
  • Refinement includes
  • Test for connectedness (refinement by
    connectedness)
  • Test for gap consistency (refinement by
    structure)
  • Test for frequency consistency (refinement by
    structure)
  • Find matched leaves (refinement by leaf matching)

62
Connectedness (Theorem 2)
Gap consistency (Definition 3)
Frequency consistency (Definition 4)
Match leaves
63
Extended Prufer Sequences
  • A Regular-Prufer sequence contains only non-leaf
    nodes
  • Adding a dummy child node to each leaf node
    creates the Extended-Prufer sequence, which contains
    the labels of all nodes
  • RPIndex: index based on Regular-Prufer sequences
  • EPIndex: index based on Extended-Prufer sequences
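
The difference between the two sequence types can be seen on a tiny tree. The sketch below uses the three-node path A / C / D as an illustrative input and relies on Lemma 1: with postorder numbering, nodes are deleted in postorder, so the LPS is the parent tag of every non-root node listed in postorder. The tuple representation, the "dummy" tag and the function names are assumptions made for this sketch.

```python
def lps(tree, parent_label=None):
    """LPS of a (label, children) tree: parent tags of non-root nodes in postorder."""
    label, children = tree
    out = []
    for child in children:
        out.extend(lps(child, label))
    if parent_label is not None:     # the root has no parent and emits nothing
        out.append(parent_label)
    return out


def add_dummy_children(tree):
    """Give every leaf a dummy child so that leaf tags show up in the LPS."""
    label, children = tree
    if not children:
        return (label, [("dummy", [])])
    return (label, [add_dummy_children(c) for c in children])


path = ("A", [("C", [("D", [])])])
print(lps(path))                       # Regular-Prufer labels
print(lps(add_dummy_children(path)))   # Extended-Prufer labels
```

The regular sequence omits the leaf tag D, while the extended one includes it, which is why the extended form suits queries that test leaf values.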

64
Regular vs Extended Prufer Sequences
  • RPSequence
  • Short
  • RPIndex
  • Good for processing twig queries having no values
  • EPSequence
  • Longer than Regular Prufer sequence
  • EPIndex
  • Good for processing twig queries with values

Optimization: both can coexist in the PRIX system
65
Ordered and Unordered Twig Match
  • All ordered twig matches can be found by using
    the Prufer sequence of the query twig
  • To find unordered matches
  • Construct the Prufer sequence for each
    arrangement of the branches of the query twig
  • Usually, the number of twig branches in a query is
    small => the number of arrangements is small

66
Outline
  • Introduction
  • Background and Motivations
  • Overview of PRIX Approach
  • Finding Twig Matches
  • Implementation Issues in the PRIX System
  • Experimental Result
  • Conclusions and Future Work

67
  • Compared the query performance of
  • PRIX
  • ViST
  • TwigStack/TwigStackXB

68
Experimental Setup
  • 1.8 GHz Pentium IV processor
  • 512 MB RAM, running Solaris 8
  • 40 GB EIDE disk drive (stores data and indexes)
  • Compiled with the GNU g++ compiler, version 2.95.3
  • Buffer pool size: 2000 pages of 8 KB each

69
Data Sets
  • Each has different characteristics
  • DBLP: good similarity in structure, and shallow
  • SWISSPROT: bushy and shallow
  • TREEBANK: skinny and deep, with recursion of element
    names

70
Queries
The following queries are tested
These queries have different characteristics in
terms of selectivity, presence of value, and twig
structure
71
Performance in terms of time
72
Performance Analysis: PRIX vs ViST
73
(No Transcript)
74
PRIX vs ViST
75
PRIX vs TwigStack/TwigStackXB
(comparable)
1- Distribution of possible solutions in the data
set
2- Sub-optimality for parent/child
relationships
76
Outline
  • Introduction
  • Background and Motivations
  • Overview of PRIX Approach
  • Finding Twig Matches
  • Implementation Issues in the PRIX System
  • Experimental Result
  • Conclusions and Future Work

77
Conclusion and Future Work
  • Transform XML documents and twig queries into Prufer
    sequences
  • Perform querying without breaking the queries
    into root-to-leaf paths and processing them
    individually
  • Perform
  • Subsequence matching
  • Refinement by connectedness
  • Refinement by twig structure
  • Refinement by matching leaf nodes
  • Future work
  • Explore the behavior of the PRIX system for different
    query characteristics
  • Analyze the complexity of query processing time

78
Thank You!
(For detailed proofs, go to
http://www.cs.arizona.edu/research/reports.html)
79
Question?