QED: A Novel Quaternary Encoding to Completely Avoid Relabeling in XML Updates

1
QED: A Novel Quaternary Encoding to Completely
Avoid Re-labeling in XML Updates
  • Changqing Li, Tok Wang Ling

2
Outline
  • Background and related work
  • Our QED encoding
  • Completely avoid re-labeling in XML updates based
    on our QED
  • Experiments
  • Conclusion

3
Background
  • Three main categories of labeling schemes to
    process XML queries
  • Containment labeling scheme [Zhang et al., SIGMOD'01],
    etc.
  • Prefix labeling scheme [Tatarinov et al., SIGMOD'02],
    etc.
  • Prime number labeling scheme [Wu et al., ICDE'04]

4
(1) Containment Scheme
  • Each label consists of start, end, and level
  • Determine ancestor-descendant and parent-child
    relationships based on the containment property
  • (5,6,3) is a descendant of (1,16,1) because
    interval [5,6] is contained in interval [1,16]
  • (5,6,3) is a child of (4,9,2) because interval
    [5,6] is contained in interval [4,9], and the levels
    differ by one: 3 - 2 = 1
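
The containment test above can be sketched in a few lines of Python. The function names and the tuple layout (start, end, level) are illustrative assumptions, not taken from the slides.

```python
# Labels are (start, end, level) triples, as on the slide.
def is_descendant(node, ancestor):
    """True if node's interval lies strictly inside ancestor's interval."""
    return ancestor[0] < node[0] and node[1] < ancestor[1]


def is_child(node, parent):
    """A child is a descendant exactly one level below its parent."""
    return is_descendant(node, parent) and node[2] - parent[2] == 1


print(is_descendant((5, 6, 3), (1, 16, 1)))  # True
print(is_child((5, 6, 3), (4, 9, 2)))        # True
print(is_child((5, 6, 3), (1, 16, 1)))       # False: levels differ by 2
```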

5
Containment Scheme: containment is bad at
processing updates
  • Need to re-label all the ancestor nodes and all
    the nodes after the inserted node in document
    order

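[Figure: XML tree labeled with (start,end,level). Root 1,16,1; its children 2,3,2, 4,9,2, 10,13,2, and 14,15,2; 5,6,3 and 7,8,3 under 4,9,2; 11,12,3 under 10,13,2]
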
6
Containment Scheme: containment is bad at
processing updates
  • Need to re-label all the ancestor nodes and all
    the nodes after the inserted node in document
    order
  • All the numbers shown in red need to be changed,
    which is very expensive
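
To make the cost concrete, here is a small Python sketch that replays the insertion from the two figures. The helper name `insert_leaf` and the rule "shift every value at or after the new start by 2" are assumptions inferred from the labels before and after the insertion.

```python
def insert_leaf(labels, start, level):
    """Insert a leaf whose start value is `start`; shift every value >= start by 2."""
    def bump(value):
        return value + 2 if value >= start else value

    relabeled = [(bump(s), bump(e), lv) for s, e, lv in labels]
    changed = sum(old != new for old, new in zip(labels, relabeled))
    return relabeled + [(start, start + 1, level)], changed


before = [(1, 16, 1), (2, 3, 2), (4, 9, 2), (5, 6, 3), (7, 8, 3),
          (10, 13, 2), (11, 12, 3), (14, 15, 2)]
after, changed = insert_leaf(before, start=10, level=2)
print(sorted(after))
print(changed)  # 4 existing nodes get new labels
```

Four of the eight existing nodes are re-labeled for a single inserted leaf: the root and every node that follows the insertion point in document order.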

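[Figure: the same tree after inserting node 10,11,2. Root 1,18,1; its children 2,3,2, 4,9,2, 10,11,2, 12,15,2, and 16,17,2; 5,6,3 and 7,8,3 under 4,9,2; 13,14,3 under 12,15,2]
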
7
Containment Scheme: approaches to solve the
update problem
  • Increase the interval size and leave some values
    unused [Li et al., VLDB'01]
  • When the unused values are used up, nodes have to be
    re-labeled
  • Use floating-point values [Amagasa et al., ICDE'03]
  • A floating-point value is represented in a computer with
    a fixed number of bits
  • Due to the limited floating-point precision, nodes
    eventually have to be re-labeled
  • Neither approach can completely avoid re-labeling
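
The precision limit is easy to reproduce. The Python sketch below assumes a simple midpoint strategy (each new value is the average of its two neighbors), which is an illustrative assumption and not necessarily the exact scheme of Amagasa et al. It always inserts right after the same left neighbor, and counts how many insertions succeed before no distinct value remains.

```python
def count_midpoint_insertions(lo=1.0, hi=2.0):
    """Keep inserting a new value right after `lo` until no distinct midpoint exists."""
    count = 0
    while True:
        mid = (lo + hi) / 2
        if mid <= lo or mid >= hi:   # precision exhausted: re-labeling needed
            return count
        hi = mid                     # the next insertion again goes right after lo
        count += 1


print(count_midpoint_insertions())
```

With 64-bit floats the loop stops after only a few dozen insertions at the same place, at which point existing values would have to be re-assigned.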

8
(2) Prefix Scheme
  • Determine ancestor-descendant and parent-child
    relationships based on the prefix property
  • 2.1 is a descendant of the root, because the
    label of the root is empty, which is a prefix of
    2.1
  • 2.1 is a child of 2 because 2 is an
    immediate prefix of 2.1, i.e. after 2 is removed
    from the left side of 2.1, the remainder has no other
    prefixes.
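
A minimal Python sketch of the prefix test, assuming labels are dotted strings and the root label is the empty string (the function names are illustrative):

```python
def components(label):
    """Split a dotted label such as '2.1'; the root label is the empty string."""
    return label.split(".") if label else []


def is_descendant(node, ancestor):
    """True if ancestor's components are a proper prefix of node's components."""
    n, a = components(node), components(ancestor)
    return len(a) < len(n) and n[:len(a)] == a


def is_child(node, parent):
    """A child has exactly one more component than its parent."""
    return (is_descendant(node, parent)
            and len(components(node)) == len(components(parent)) + 1)


print(is_descendant("2.1", ""))  # True: the empty root label is a prefix
print(is_child("2.1", "2"))      # True
print(is_child("2.1", ""))       # False: two levels apart
```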

9
(2) Prefix Scheme: prefix is bad at processing
order-sensitive updates
  • Order-sensitive updates: the document order must be
    maintained when updates are performed
  • Need to re-label all the sibling nodes after the
    inserted node and all the descendants of these
    siblings

[Figure: XML tree with prefix labels. Children of the root: 1, 2, 3, 4; children of 2: 2.1, 2.2; child of 3: 3.1]

10
(2) Prefix Scheme: prefix is bad at processing
order-sensitive updates
  • Order-sensitive updates: the document order must be
    maintained when updates are performed
  • Need to re-label all the sibling nodes after the
    inserted node and all the descendants of these
    siblings
  • All the numbers shown in red need to be changed,
    which is very expensive

11
(2) Prefix Scheme: approaches to solve the update
problem
  • OrdPath [O'Neil et al., SIGMOD'04]
  • At the beginning, use odd numbers only

[Figure: XML tree with OrdPath labels. Children of the root: 1, 3, 5, 7; children of 3: 3.1, 3.3; child of 5: 5.1]

12
(2) Prefix Scheme: approaches to solve the update
problem
  • OrdPath [O'Neil et al., SIGMOD'04]
  • On insertion, use even numbers together with odd
    numbers

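Label of node a: -1; label of node b: 6.1; label of node c: 6.3; label of node d: 6.2.1
[Figure: the OrdPath tree with children 1, 3, 5, 7 under the root and the inserted nodes a, b, c, d]
  • All of these nodes are at the same level, which is bad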

13
(2) Prefix Scheme: problems of OrdPath
  • Nodes a, b, and c are at the same level, but
    their labels -1, 6.1, and 6.3 do not look
    like it; more time is needed to determine this, which
    decreases query performance
  • Half of the numbers (the even numbers) are wasted, which
    increases the label size
  • The even number between two odd numbers must be
    calculated, so the update cost is not cheap
  • A fixed-length field indicates the size of a
    label; this field will eventually overflow when
    many nodes are inserted, so OrdPath cannot
    completely avoid re-labeling

14
(3) Prime scheme
  • Based on a top-down approach, each node is given
    a unique prime number (self_label), and the label
    of each node is the product of its parent node's
    label (parent_label) and its own self_label.
  • Query
  • Uses modular and division operations to
    determine the ancestor-descendant and ordering
    relationships, which are very expensive
  • Update
  • When nodes are inserted into the XML tree, the SC
    (simultaneous congruence) values must be re-calculated,
    which is much more expensive than re-labeling
  • Details can be found in [Wu et al., ICDE'04]
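
The label construction and the modular ancestor test can be sketched as follows in Python. The root label 1, the chosen primes, and the function names are assumptions for illustration; the ordering test based on SC values is not covered here.

```python
def make_label(parent_label, self_label):
    """Top-down labeling: a node's label is parent_label * self_label."""
    return parent_label * self_label


def is_ancestor(anc_label, desc_label):
    """anc is an ancestor of desc if its label divides desc's label."""
    return anc_label != desc_label and desc_label % anc_label == 0


root = 1                   # assumed label for the root
a = make_label(root, 2)    # child of root with self_label 2
b = make_label(root, 3)    # child of root with self_label 3
c = make_label(a, 5)       # child of a with self_label 5, label 10

print(is_ancestor(a, c))   # True: 10 % 2 == 0
print(is_ancestor(b, c))   # False: 10 % 3 != 0
```

Because labels are products, they grow quickly with depth, and every relationship test needs a modulo on large integers.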

15
Our QED encoding
  • Dynamic Quaternary Encoding (QED)
  • Four quaternary symbols 0, 1, 2 and 3 are
    used in the code and each symbol is stored with
    two bits, i.e. 00, 01, 10 and 11.
  • The quaternary symbol 0 is used as the
    separator, and only 1, 2, and 3 are used in
    the QED codes themselves.
  • QED codes are compared in lexicographical
    order

16
Example about QED
  • We show how to encode 16 numbers. We choose 16
    because the total number of start and end values in the
    containment scheme example is 16; this is only an example
  • Any other number can also be encoded by our QED
  • At each step, encode the (1/3)th and (2/3)th numbers
    between two numbers
  • 0 is the separator, and only 1, 2, and 3
    appear in the QED codes, so (1/3)th and (2/3)th
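
The thirds-based encoding can be sketched in Python as below. The slides do not spell out the details, so several points are assumptions chosen to reproduce the codes quoted elsewhere in this deck (12, 2, 3, 32 for four numbers; 112, 12, 122 as the first codes for sixteen): the rounding of the 1/3 and 2/3 positions, the empty sentinel codes at positions 0 and n+1, and the rule inside the hypothetical helper `between`.

```python
def qed_encode(n):
    """Return QED codes for positions 1..n (assumed thirds-based recursion)."""
    codes = [""] * (n + 2)   # positions 0 and n+1 hold empty sentinel codes

    def between(left, right):
        # Assumed rule: extend the left code, unless the right code is longer,
        # in which case start from the right code with its last symbol set to 1.
        base = right[:-1] + "1" if len(left) < len(right) else left
        return base + "2", base + "3"

    def fill(pl, pr):
        d = pr - pl
        if d < 2:
            return                       # no position left between pl and pr
        pm1 = pl + (d + 1) // 3          # round(d / 3)
        pm2 = pl + (2 * d + 1) // 3      # round(2 * d / 3)
        code1, code2 = between(codes[pl], codes[pr])
        codes[pm1] = code1
        if pm2 != pm1:
            codes[pm2] = code2
        fill(pl, pm1)
        fill(pm1, pm2)
        fill(pm2, pr)

    fill(0, n + 1)
    return codes[1:n + 1]


print(qed_encode(4))    # ['12', '2', '3', '32']
print(qed_encode(16))
```

Every generated code uses only the symbols 1, 2, and 3 and ends in 2 or 3, and the list comes out in lexicographical order.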

17
Example about QED
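[Table not captured in this transcript; only the values 0 and 17 survive]

18
Example about QED
[Table not captured in this transcript; only the values 0 and 17 survive]

19
Example about QED
[Table not captured in this transcript; only the values 0 and 17 survive]
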
20
Overflow problem of other methods
  • On the previous page, we can see that the
    FixedLength codes are stored with length 5, i.e.
    each code is 5 bits long
  • When many codes are inserted, length 5 is
    no longer enough, and all the FixedLength codes need
    to be changed.
  • For the VarLength codes, we also need to store
    the length of each VarLength code, e.g., the
    length of 10000 is 5. We need to store this 5
    using a fixed number of bits (101, i.e. 3 bits). The
    sizes of the other codes must also be stored using
    the same fixed number of bits (3 bits).
  • When many codes are inserted, the 3-bit
    size field is no longer large enough, and all
    the codes must be changed
  • This is called the overflow problem.
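
A short Python sketch of the VarLength case, assuming the size field is simply prepended to the code bits (the layout and the function name are illustrative assumptions):

```python
SIZE_FIELD_BITS = 3  # width of the size field, as in the slide's example


def store_var_length(code_bits):
    """Prefix a VarLength code with its length in a fixed-width size field."""
    length = len(code_bits)
    if length >= 2 ** SIZE_FIELD_BITS:
        raise OverflowError("length does not fit in the size field")
    return format(length, f"0{SIZE_FIELD_BITS}b") + code_bits


print(store_var_length("10000"))  # 10110000: size field 101, then the code
```

A 3-bit field can describe codes of at most 7 bits; once an insertion needs an 8-bit code, the field must be widened and every stored code rewritten.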

21
Our QED uses 0 to separate different codes:
it will never encounter the overflow problem
  • The QED codes 112, 12, 122, etc. in
    the table are separated with 0
  • They are stored as 11201201220; based on the separator
    0, we can tell the different codes apart
  • The separator 0 will never encounter the overflow problem
  • Our QED encoding can help to completely avoid
    re-labeling
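
The separator-based storage can be sketched in Python as follows. The function names are illustrative, and the symbol stream is kept as a string for readability; `to_bits` shows the two-bits-per-symbol mapping given earlier in the deck.

```python
def pack(codes):
    """Concatenate QED codes, terminating each one with the separator 0."""
    return "".join(code + "0" for code in codes)


def unpack(stream):
    """Recover the codes by splitting on the separator."""
    return [code for code in stream.split("0") if code]


def to_bits(stream):
    """Each quaternary symbol takes two bits: 0->00, 1->01, 2->10, 3->11."""
    return "".join(format(int(symbol), "02b") for symbol in stream)


stream = pack(["112", "12", "122"])
print(stream)           # 11201201220
print(unpack(stream))   # ['112', '12', '122']
print(to_bits("120"))   # 011000
```

No length field is stored anywhere, so a code can grow by any number of symbols without affecting how its neighbors are stored.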

22
Lexicographical order for our QED
  • Our QED compares codes based on the
    lexicographical order
  • The QED codes in the table are lexicographically
    ordered from top to bottom.
  • E.g., 132 < 2 lexicographically because the
    comparison is from left to right, and the 1st
    symbol of 132 is 1, while the 1st symbol of
    2 is 2.
  • Another example: 23 < 232 lexicographically
    because 23 is a prefix of 232.
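
For QED codes written as strings of the digits 1 to 3, Python's built-in string comparison follows the same two rules (left-to-right comparison, and a proper prefix sorts first), so it can serve as a quick check of the examples above:

```python
codes = ["2", "232", "132", "23"]

print("132" < "2")     # True: first symbols differ, 1 < 2
print("23" < "232")    # True: 23 is a proper prefix of 232
print(sorted(codes))   # ['132', '2', '23', '232']
```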

23
(a) Applying QED encoding to the containment
scheme
  • Replace the start and end values 1 to 16
    with our QED codes
  • This forms a QED encoding based on the containment
    scheme
  • Compare labels based on lexicographical order
  • Note that we drop the level values from the right
    graph only for clarity of presentation

24
(b) Applying QED encoding to the prefix scheme
  • The root has 4 children. To encode 4 numbers
    based on our QED, the codes will be 12, 2,
    3 and 32.
  • Similarly if there are 2 siblings, their
    self_labels (last component, e.g., 3 in 2.3
    is the self_label) are 2 and 3.
  • If there is only 1 sibling, its self_label is 2.

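[Figure: XML tree with QED-based prefix labels. Children of the root: 12, 2, 3, 32; children of 2: 2.2, 2.3; child of 3: 3.2]
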
25
(b) Processing the delimiters of the prefix
scheme based on our QED
  • For the prefix scheme, the delimiter "." cannot
    be stored together with the numbers in the
    implementation to separate different components.
  • For our QED encoding, we use the following
    approach to process the delimiters.
  • We use one 0 as the delimiter to separate
    different components of a prefix label
  • e.g. to separate 12 and 3 in 12.3, the
    delimiter 0 is equivalent to the "."; 12.3 is
    stored as 1203 in the implementation
  • We use two consecutive separators, 00, to
    separate different labels
  • e.g. 1202001203 represents 2 labels, i.e.
    1202 and 1203.
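
A small Python sketch of this storage convention, with illustrative function names. It relies on the fact that QED components contain only 1, 2, and 3, so a single 0 can only be a component delimiter and 00 can only be a label separator.

```python
def store_labels(labels):
    """'12.3' becomes '1203'; labels are joined with the separator '00'."""
    return "00".join(label.replace(".", "0") for label in labels)


def load_labels(stream):
    """Inverse of store_labels: split on '00', then turn each 0 back into '.'."""
    return [chunk.replace("0", ".") for chunk in stream.split("00")]


print(store_labels(["12.2", "12.3"]))  # 1202001203
print(load_labels("1202001203"))       # ['12.2', '12.3']
```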

26
Algorithm for insertion based on QED
Algorithm GetInsertedCode
Input: Left_Code, Right_Code
Output: Inserted_Code, such that Left_Code < Inserted_Code < Right_Code lexicographically
  get the sizes of Left_Code and Right_Code
  if size(Left_Code) < size(Right_Code)   //Case (1)
    then Inserted_Code = (the Right_Code with the last symbol changed to 1) concatenate 2
  else if size(Left_Code) > size(Right_Code)
    if the last symbol of Left_Code is 2   //Case (2)
      then Inserted_Code = the Left_Code with the last symbol changed from 2 to 3
    else if the last symbol of Left_Code is 3   //Case (3)
      then Inserted_Code = Left_Code concatenate 2
  else if size(Left_Code) = size(Right_Code)   //Case (4)
    then Inserted_Code = Left_Code concatenate 2
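
A direct Python transcription of the four cases, as a sketch. It assumes both inputs are non-empty, valid QED codes (strings over 1, 2, 3 ending in 2 or 3) with the left code smaller than the right one; inserting before the first or after the last code is not handled.

```python
def get_inserted_code(left, right):
    """Return a QED code strictly between left and right."""
    if len(left) < len(right):                # Case (1)
        return right[:-1] + "1" + "2"
    if len(left) > len(right):
        if left[-1] == "2":                   # Case (2)
            return left[:-1] + "3"
        return left + "2"                     # Case (3): last symbol is 3
    return left + "2"                         # Case (4): equal sizes


print(get_inserted_code("23", "232"))    # 2312
print(get_inserted_code("2312", "232"))  # 2313
print(get_inserted_code("2", "3"))       # 22
```

In every case the new code is built from one neighbor alone, and neither neighbor is modified.
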
27
XML updates based on our QED: containment scheme
  • When we insert a node as shown in the figure
    below
  • We should insert two QED codes between 23 and
    232
  • First create the start value
  • i.e. a code between 23 and 232; the new code
    is 2312
  • see Case (1) of the GetInsertedCode algorithm
  • Then create the end value
  • i.e. a code between 2312 and 232; the new
    code is 2313
  • see Case (2) of the GetInsertedCode algorithm
  • 23 < 2312 < 2313 < 232 lexicographically, so
    we need not re-label any existing nodes.

28
XML updates based on our QED: prefix
scheme
  • When we insert a node as shown in the figure
    below
  • We should insert one QED code between 2 and 3
  • The new QED code between 2 and 3 is 22
  • see Case (4) of the GetInsertedCode algorithm
  • 2 < 22 < 3 lexicographically, so we need not
    re-label any existing nodes, and the order is
    still preserved.

29
Experimental results: experimental setup
  • We mainly report the results for updates
  • We select the Hamlet file in the Shakespeare's plays
    dataset
  • Intermittent updates
  • The Hamlet file has 5 act elements, giving 6 insertion
    cases, i.e. before act1, between act1 and
    act2, ..., between act4 and act5, and after
    act5.
  • Uniformly frequent updates
  • Insertions happen randomly at different places
    of the Hamlet file
  • Skewed frequent updates
  • Insertions always happen at a fixed place of the
    Hamlet file

30
Experimental results: intermittent updates
  • Prime needs to re-calculate fewer SC values, but
    its re-calculation time is very large
  • Theorem. Our QED never needs to re-label any
    existing nodes
  • The update time of our QED is much smaller
  • The update performance differences among OrdPath,
    Float-point, and our QED can be seen on the next
    page
  • Note that QED represents both the QED encoding
    and the QED-containment scheme; QED-PREFIX
    represents the scheme obtained when we apply the QED
    encoding to the prefix scheme.

[Figures: (a) Number of nodes to re-label; (b) Time to re-label]

31
Experimental results: uniformly frequent updates
  • When uniformly frequent updates are performed,
    the update time of OrdPath and Float-point is
    much larger (more than 386 times) than the time
    required by our QED approaches
  • Our QED encoding only needs to modify the last 2
    bits of the neighbor label, which is very cheap
  • Neither OrdPath nor Float-point can completely
    avoid re-labeling

[Figures: (a) OrdPath12 vs QED-PREFIX; (b) Float-point vs QED]

32
Experimental results: skewed frequent updates
  • When skewed frequent updates are performed,
    the update time of OrdPath and Float-point is
    much larger (more than 8126 times) than the time
    required by our QED approaches
  • The very large update time makes OrdPath and
    Float-point unsuitable for answering queries in an
    environment with frequent insertions.
  • Our QED still works best for answering queries in
    an environment where insertions are frequent

[Figures: (a) OrdPath12 vs QED-PREFIX; (b) Float-point vs QED]

33
Conclusion
  • We propose the QED encoding
  • QED can be applied broadly to different labeling
    schemes
  • QED can completely avoid re-labeling in XML
    updates