QED: A Novel Quaternary Encoding to Completely Avoid Relabeling in XML Updates

1
QED: A Novel Quaternary Encoding to Completely
Avoid Re-labeling in XML Updates

Changqing Li, Tok Wang Ling

2
Outline

Background and related work
Our QED encoding
Completely avoid re-labeling in XML updates based
on our QED
Experiments
Conclusion

3
Background

Three main categories of labeling schemes for
processing XML queries:
Containment labeling scheme (Zhang et al., SIGMOD'01, etc.)
Prefix labeling scheme (Tatarinov et al., SIGMOD'02, etc.)
Prime number labeling scheme (Wu et al., ICDE'04)

4
(1) Containment Scheme

Each node is labeled with start, end, and level.
Ancestor-descendant and parent-child relationships are
determined from the containment property.

(5,6,3) is a descendant of (1,16,1) because
interval [5,6] is contained in interval [1,16].
(5,6,3) is a child of (4,9,2) because interval
[5,6] is contained in interval [4,9] and the levels
differ by one: 3 - 2 = 1.
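
The two containment checks above can be sketched in a few lines of Python. The names `Label`, `is_descendant` and `is_child` are illustrative and do not come from the slides; the labels are the ones used in the example.

```python
from typing import NamedTuple


class Label(NamedTuple):
    start: int
    end: int
    level: int


def is_descendant(d: Label, a: Label) -> bool:
    """True if the interval of d lies strictly inside the interval of a."""
    return a.start < d.start and d.end < a.end


def is_child(c: Label, p: Label) -> bool:
    """A child is a descendant exactly one level below its parent."""
    return is_descendant(c, p) and c.level - p.level == 1


root = Label(1, 16, 1)
parent = Label(4, 9, 2)
node = Label(5, 6, 3)

print(is_descendant(node, root))  # True
print(is_child(node, parent))     # True
print(is_child(node, root))       # False: levels differ by 2
```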

5
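(1) Containment Scheme: containment is bad at
processing updates

An insertion requires re-labeling all the ancestor nodes and all
the nodes after the inserted node in document
order.

[Figure: XML tree before the insertion, labeled start,end,level. Root: 1,16,1. Level 2: 2,3,2; 4,9,2; 10,13,2; 14,15,2. Level 3: 5,6,3; 7,8,3; 11,12,3.]
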
6
(1) Containment Scheme: containment is bad at
processing updates

An insertion requires re-labeling all the ancestor nodes and all
the nodes after the inserted node in document
order.

All the numbers shown in red on the slide need to be changed,
which is very expensive.

[Figure: XML tree after inserting node 10,11,2. Root: 1,18,1. Level 2: 2,3,2; 4,9,2; 10,11,2; 12,15,2; 16,17,2. Level 3: 5,6,3; 7,8,3; 13,14,3.]

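
To make the cost concrete, the sketch below replays the insertion on the label values listed on these two slides. The function `insert_containment` is a hypothetical helper, and it assumes the new node is a leaf that takes two consecutive values starting at `new_start`.

```python
def insert_containment(labels, new_start):
    """Shift every start/end value >= new_start by 2 to make room for a leaf.

    Returns the shifted labels and the number of existing nodes whose
    label had to change.
    """
    shifted = []
    changed = 0
    for start, end, level in labels:
        s = start + 2 if start >= new_start else start
        e = end + 2 if end >= new_start else end
        if (s, e) != (start, end):
            changed += 1
        shifted.append((s, e, level))
    return shifted, changed


tree = [(1, 16, 1), (2, 3, 2), (4, 9, 2), (5, 6, 3), (7, 8, 3),
        (10, 13, 2), (11, 12, 3), (14, 15, 2)]

new_tree, changed = insert_containment(tree, 10)
new_tree.append((10, 11, 2))  # the inserted node itself

print(changed)  # 4 existing nodes re-labeled for a single insertion
```

The root (an ancestor) and the three nodes that follow the insertion point are all re-labeled, which matches the values in the second tree.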
7
(1) Containment Scheme: approaches to solve the
update problem

Increase the interval size and leave some values
unused (Li et al., VLDB'01)
When the unused values are used up, nodes have to be re-labeled
Use floating-point values (Amagasa et al., ICDE'03)
A floating-point value is represented in a computer with
a fixed number of bits
Because of the limited floating-point precision, nodes eventually have to be re-labeled
Neither approach can completely avoid re-labeling
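
The precision limit is easy to observe. The sketch below is an illustration, not the scheme of Amagasa et al.: it assumes new values are always created as the midpoint of two neighbours, and it keeps inserting right after the same left neighbour. The name `count_midpoint_insertions` is hypothetical.

```python
def count_midpoint_insertions(lo: float, hi: float, limit: int = 10_000) -> int:
    """Count how many distinct midpoints fit between lo and hi when every
    insertion lands immediately after lo."""
    count = 0
    while count < limit:
        mid = (lo + hi) / 2
        if not (lo < mid < hi):
            break  # no representable value is left between the neighbours
        hi = mid
        count += 1
    return count


n = count_midpoint_insertions(1.0, 2.0)
print(n)  # a few dozen with 64-bit floats, after which re-labeling is required
```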

8
(2) Prefix Scheme

Determine ancestor-descendant and parent-child
relationships based on the prefix property

"2.1" is a descendant of the root because the
label of the root is empty, and the empty label is a prefix of
"2.1".
"2.1" is a child of "2" because "2" is the
immediate prefix of "2.1", i.e. once "2" is removed
from the left side of "2.1", what remains has no other
prefixes.
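
The prefix tests above can be sketched as follows. Labels are assumed to be dot-separated strings with the empty string for the root; the function names are illustrative.

```python
def components(label):
    """Split a prefix label into its components; the root label is empty."""
    return label.split(".") if label else []


def is_ancestor(a, d):
    """a is an ancestor of d if a's components are a proper prefix of d's."""
    ca, cd = components(a), components(d)
    return len(ca) < len(cd) and cd[:len(ca)] == ca


def is_parent(p, c):
    """p is the parent of c if p is c with the last component removed."""
    cp, cc = components(p), components(c)
    return len(cc) == len(cp) + 1 and cc[:-1] == cp


print(is_ancestor("", "2.1"))   # True: the root label is a prefix of everything
print(is_parent("2", "2.1"))    # True
print(is_parent("", "2.1"))     # False: the root is an ancestor, not the parent
```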

9
(2) Prefix Scheme: prefix is bad at processing
order-sensitive updates

Order-sensitive updates: updates that must maintain the
document order
They require re-labeling all the sibling nodes after the
inserted node and all the descendants of these
siblings

[Figure: XML tree with prefix labels. Children of the root: 1, 2, 3, 4. Children of 2: 2.1, 2.2. Child of 3: 3.1.]

10
(2) Prefix Scheme: prefix is bad at processing
order-sensitive updates

Order-sensitive updates: updates that must maintain the
document order
They require re-labeling all the sibling nodes after the
inserted node and all the descendants of these
siblings

All the numbers shown in red on the slide need to be changed,
which is very expensive.

11
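(2) Prefix Scheme: approaches to solve the update
problem

OrdPath (O'Neil et al., SIGMOD'04)
At the beginning, use odd numbers only

[Figure: XML tree with OrdPath labels. Children of the root: 1, 3, 5, 7. Children of 3: 3.1, 3.3. Child of 5: 5.1.]
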
12
(2) Prefix Scheme: approaches to solve the update
problem

OrdPath (O'Neil et al., SIGMOD'04)
On insertion, use even numbers together with odd
numbers

Label of node a: -1
Label of node b: 6.1
Label of node c: 6.3
Label of node d: 6.2.1

[Figure: the OrdPath tree (1, 3, 5, 7, 3.1, 3.3, 5.1) with the inserted nodes a, b, c and d.]

All are at the same level, which is bad

13
(2) Prefix Scheme: problems of OrdPath

Nodes a, b, and c are at the same level, but
their labels -1, 6.1, and 6.3 do not look
like it; more time is needed to determine this, which
decreases the query performance
Half of the numbers (the even numbers) are wasted, which makes
the label size increase
The even number between two odd
numbers has to be calculated, so the update cost is not cheap
A fixed-length field is used to indicate the size of a
label; this fixed-length size field will
eventually encounter the overflow problem when a
lot of nodes are inserted, so OrdPath cannot
completely avoid re-labeling

14
(3) Prime scheme

Based on a top-down approach, each node is given
a unique prime number (self_label), and the label
of each node is the product of its parent node's
label (parent_label) and its own self_label.
Query
The modulo and division operations are used to
determine the ancestor-descendant and ordering
relationships, and they are very expensive
Update
When nodes are inserted into the XML tree, the SC
(simultaneous congruence) values have to be re-calculated, which is much more
expensive than re-labeling
Details can be found in Wu et al., ICDE'04
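
A minimal sketch of the ancestor test in the prime scheme follows. The concrete primes and the root label 1 are assumptions made for the illustration, and the SC values used for ordering are not modeled here.

```python
def make_label(parent_label: int, self_label: int) -> int:
    """Label of a node = parent's label * the node's own prime."""
    return parent_label * self_label


def is_ancestor(anc: int, desc: int) -> bool:
    """anc is an ancestor of desc if anc's label divides desc's label."""
    return desc != anc and desc % anc == 0


root = 1                  # assumed root label
a = make_label(root, 2)   # 2
b = make_label(root, 3)   # 3
c = make_label(a, 5)      # 10, child of a
d = make_label(c, 7)      # 70, child of c

print(is_ancestor(a, d))  # True: 70 is divisible by 2
print(is_ancestor(b, d))  # False: 70 is not divisible by 3
print(d // 7 == c)        # True: dividing by the self_label gives the parent label
```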

15
Our QED encoding

Dynamic Quaternary Encoding (QED)
Four quaternary numbers "0", "1", "2" and "3" are
used in the code, and each number is stored with
two bits, i.e. "00", "01", "10" and "11".
The quaternary number "0" is used only as the
separator; only "1", "2", and "3" are used in
the QED codes themselves.
QED codes are compared based on the lexicographical
order

16
Example about QED

We show how to encode 16 numbers. We choose 16
because the start and end values in the
containment scheme example total 16; this is only an example.
Any other count of numbers can be encoded by our QED.
At each step, encode the (1/3)th and (2/3)th numbers
between two numbers.
"0" is the separator, and only "1", "2", and "3"
appear in the QED codes, so (1/3)th and (2/3)th

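17
Example about QED
[Table not recovered in the transcript; only the boundary positions 0 and 17 survive.]

18
Example about QED
[Table not recovered in the transcript; only the boundary positions 0 and 17 survive.]

19
Example about QED
[Table not recovered in the transcript; only the boundary positions 0 and 17 survive.]
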
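
The recursive (1/3)th and (2/3)th encoding can be sketched as below. This is a reconstruction under assumptions, not the authors' code: the helper `two_codes_between` and its rule for producing two codes at once are chosen so that the output agrees with the codes quoted elsewhere in these slides ("12", "2", "3", "32" for four numbers; "23" followed by "232" among sixteen). Positions 0 and n+1 are treated as empty boundary codes.

```python
def two_codes_between(left: str, right: str) -> tuple:
    """Return two codes m1 < m2 lying strictly between left and right.

    An empty string stands for a missing boundary (position 0 or n+1).
    """
    if len(left) < len(right):
        base = right[:-1] + "1"  # right code with its last symbol lowered to 1
    else:
        base = left
    return base + "2", base + "3"


def qed_encode(n: int) -> list:
    """Assign lexicographically ordered QED codes to positions 1..n."""
    codes = [""] * (n + 2)

    def fill(pl: int, pr: int) -> None:
        gap = pr - pl
        if gap < 2:
            return  # no free position between pl and pr
        pm1 = pl + (gap + 1) // 3      # round(pl + gap / 3)
        pm2 = pl + (2 * gap + 1) // 3  # round(pl + 2 * gap / 3)
        m1, m2 = two_codes_between(codes[pl], codes[pr])
        codes[pm1] = m1
        if pm2 != pm1:
            codes[pm2] = m2
        fill(pl, pm1)
        fill(pm1, pm2)
        fill(pm2, pr)

    fill(0, n + 1)
    return codes[1:-1]


print(qed_encode(4))   # ['12', '2', '3', '32']
print(qed_encode(16))
```

No code ever ends in "1", which is what leaves room to insert a new code before any existing one.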
20
Overflow problem of other methods

In the previous page, we can see that the
FixedLength codes are stored with length 5, i.e.
the length of each code is 5 bits.
When a lot of codes are inserted, the length 5 is
not large enough, and all the FixedLength codes need
to be changed.
For the VarLength codes, we also need to store
the length of each VarLength code, e.g., the
length of "10000" is 5. We need to store this 5
using a fixed number of bits ("101", 3 bits). The
sizes of the other codes should also be stored using
the same fixed number of bits (3 bits).
When a lot of codes are inserted, the 3-bit
size field is not large enough, and then all
the codes must be changed.
This is called the overflow problem.

21
Our QED uses "0" to separate different codes and
will never encounter the overflow problem

The QED codes "112", "12", "122" etc. in
the table are separated with "0".
They are stored as "11201201220"; based on the separator
"0", we can separate the different codes.
The separator "0" will never encounter the overflow problem.
Our QED encoding can therefore help to completely avoid
re-labeling.
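
The separator-based storage can be sketched as follows. The function names are illustrative; the 2-bit mapping is the one given earlier ("0" to 00, "1" to 01, "2" to 10, "3" to 11).

```python
def pack(codes):
    """Concatenate QED codes, terminating each one with the separator 0."""
    return "".join(code + "0" for code in codes)


def unpack(stream):
    """Split a stored stream back into codes at every separator 0."""
    return [code for code in stream.split("0") if code]


def to_bits(stream):
    """Render each quaternary symbol with two bits."""
    return "".join(format(int(symbol), "02b") for symbol in stream)


stored = pack(["112", "12", "122"])
print(stored)           # 11201201220
print(unpack(stored))   # ['112', '12', '122']
print(to_bits("120"))   # 011000
```

No length field is stored anywhere, so a code can grow to any length without a size field overflowing.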

22
Lexicographical order for our QED

Our QED compares codes based on the
lexicographical order.
The QED codes in the table are lexicographically
ordered from top to bottom.
E.g., "132" < "2" lexicographically because the
comparison is from left to right, and the 1st
symbol of "132" is "1", while the 1st symbol of
"2" is "2".
Another example: "23" < "232" lexicographically
because "23" is a prefix of "232".
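
Python's built-in string comparison follows the same two rules (left-to-right comparison, and a proper prefix sorts first), so the examples can be checked directly when codes are held as strings. The sample list is arbitrary.

```python
first = "132" < "2"     # decided by the first symbol: "1" < "2"
second = "23" < "232"   # "23" is a proper prefix of "232"
ordered = sorted(["2", "132", "232", "23", "12"])

print(first, second)  # True True
print(ordered)        # ['12', '132', '2', '23', '232']
```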

23
(a) Applying QED encoding to the containment
scheme

Replace the start and end values 1 to 16
with our QED codes
A QED encoding based on containment scheme is
formed
Compare labels based on lexicographical order

Note that we drop the level values from the right
graph just for a clear presentation

24
(b) Applying QED encoding to the prefix scheme

The root has 4 children. To encode 4 numbers
based on our QED, the codes will be "12", "2",
"3" and "32".
Similarly, if there are 2 siblings, their
self_labels (the last component, e.g., "3" in "2.3"
is the self_label) are "2" and "3".
If there is only 1 sibling, its self_label is "2".

[Figure: XML tree with QED prefix labels. Children of the root: 12, 2, 3, 32. Children of 2: 2.2, 2.3. Child of 3: 3.2.]

25
(b) Processing the delimiters of the prefix
scheme based on our QED

For the prefix scheme, the delimiter "." cannot
be stored together with the numbers in the
implementation to separate different components.
For our QED encoding, we use the following
approach to process the delimiters.
We use one "0" as the delimiter to separate
different components of a prefix label
e.g. to separate "12" and "3" in "12.3"; the
delimiter "0" is equivalent to the "."; "12.3" is
stored as "1203" in the implementation
We use two consecutive separators "00" as the
separator to separate different labels
e.g. "1202001203" represents 2 labels, i.e.
"1202" and "1203" (the stored forms of "12.2" and "12.3").
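
A sketch of this two-level delimiter handling, with labels written in dotted form on the outside and stored with "0" and "00" on the inside. The function names are illustrative, and every component is assumed to be a non-empty QED code (so it never contains "0").

```python
def store_labels(labels):
    """Replace '.' by 0 inside each label and join labels with 00."""
    return "00".join(label.replace(".", "0") for label in labels)


def load_labels(stream):
    """Split at 00 to recover labels, then turn each remaining 0 back into '.'."""
    return [part.replace("0", ".") for part in stream.split("00")]


stored = store_labels(["12.2", "12.3"])
print(stored)                # 1202001203
print(load_labels(stored))   # ['12.2', '12.3']
```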

26
Algorithm for insertion based on QED

Algorithm GetInsertedCode
Input: Left_Code, Right_Code
Output: Inserted_Code, such that Left_Code < Inserted_Code < Right_Code lexicographically
get the sizes of Left_Code and Right_Code
if size(Left_Code) < size(Right_Code)   //Case (1)
  then Inserted_Code = (the Right_Code with the last symbol changed to 1) concatenate 2
else if size(Left_Code) > size(Right_Code)
  if the last symbol of Left_Code is 2   //Case (2)
    then Inserted_Code = the Left_Code with the last symbol changed from 2 to 3
  else if the last symbol of Left_Code is 3   //Case (3)
    then Inserted_Code = Left_Code concatenate 2
else if size(Left_Code) = size(Right_Code)   //Case (4)
  then Inserted_Code = Left_Code concatenate 2

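
The four cases translate almost line for line into Python. This is a sketch that assumes codes are held as strings over "1", "2", "3", that size means the number of symbols, and that both inputs are valid QED codes (ending in "2" or "3") with the left code smaller than the right one.

```python
def get_inserted_code(left: str, right: str) -> str:
    """Return a code strictly between left and right in lexicographical order."""
    if len(left) < len(right):        # Case (1)
        return right[:-1] + "1" + "2"
    if len(left) > len(right):
        if left[-1] == "2":           # Case (2)
            return left[:-1] + "3"
        if left[-1] == "3":           # Case (3)
            return left + "2"
        raise ValueError("a QED code must end in 2 or 3")
    return left + "2"                 # Case (4): equal sizes


start = get_inserted_code("23", "232")
end = get_inserted_code(start, "232")
print(start, end)                    # 2312 2313
print(get_inserted_code("2", "3"))   # 22
```

Neither neighbour is modified in any of the cases; only the new code is computed.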
27
XML updates based on our QED-containment scheme

When we insert a node as shown in the figure
below,
we should insert two QED codes between "23" and
"232".
First create the start value,
i.e. a code between "23" and "232"; the new code
is "2312"
(see Case (1) of the GetInsertedCode algorithm).
Then create the end value,
i.e. a code between "2312" and "232"; the new
code is "2313"
(see Case (2) of the GetInsertedCode algorithm).
"23" < "2312" < "2313" < "232" lexicographically, so
we need not re-label any existing nodes.

28
XML updates based on our QED-prefix
scheme

When we insert a node as shown in the figure
below,
we should insert one QED code between "2" and "3".
The new QED code between "2" and "3" is "22"
(see Case (4) of the GetInsertedCode algorithm).
"2" < "22" < "3" lexicographically, so we need not
re-label any existing nodes, and we still keep the
order.

29
Experimental results: experimental setup

We mainly report the results for updates
We select the Hamlet file in the Shakespeare's plays
dataset
Intermittent updates
The Hamlet file has 5 act elements, giving 6 insertion
cases, i.e. before act1, between act1 and
act2, ..., between act4 and act5, and after
act5.
Uniformly frequent updates
Insertions happen randomly at different places
of the Hamlet file
Skewed frequent updates
Insertions always happen at a fixed place of the
Hamlet file

30
Experimental results: intermittent updates

Prime needs to re-calculate fewer SC values, but
its re-calculation time is very large
Theorem. Our QED never needs to re-label any
existing nodes
The update time of our QED is much smaller
The update performance differences among OrdPath,
Float-point, and our QED can be seen in the next
page
Note that QED represents both the QED encoding
and the QED-containment scheme; QED-PREFIX
represents the scheme obtained when we apply the QED encoding
to the prefix scheme.

(a) Number of nodes to re-label
(b) Time to re-label
31
Experimental results: uniformly frequent updates

When uniformly frequent updates are performed,
the update time of OrdPath and Float-point is
much larger (more than 386 times) than the time
required by our QED approaches
Our QED encoding only needs to modify the last 2
bits of the neighbor label, which is very cheap
Neither OrdPath nor Float-point can completely
avoid re-labeling

(a) OrdPath12 vs QED-PREFIX
(b) Float-point vs QED
32
Experimental results: skewed frequent updates

When skewed frequent updates are performed,
the update time of OrdPath and Float-point is
much larger (more than 8126 times) than the time
required by our QED approaches
The very large update time makes OrdPath and
Float-point unsuitable for answering queries in an
environment with frequent insertions.
Our QED still works the best for answering queries in
an environment where frequent insertions are
executed

(a) OrdPath12 vs QED-PREFIX
(b) Float-point vs QED
33
Conclusion