Title: Core Labeling: A New Way to Compress Transitive Closure
1Core Labeling: A New Way to Compress Transitive Closure
- Yangjun Chen
- Dept. Applied Computer Science,
- University of Winnipeg
- 515 Portage Ave.
- Winnipeg, Manitoba, Canada R3B 2E9
2Outline
- Motivation
- Tree labeling
- Main algorithm
- - Core tree
- - Graph labeling Core-I
- - Graph labeling Core-II
- Conclusion
3Motivation
- Efficient method to evaluate sparse graph reachability queries
- Given a directed sparse graph G, check whether a node v is reachable from another node u through a path in G.
- Application
- XML data processing, gene-regulatory networks or metabolic networks. It is well known that XML documents are often represented by a tree structure. However, an XML document may contain IDREF/ID references that turn it into a directed but sparse graph: a tree structure plus a few reference links. For a metabolic network, graph reachability models whether two genes interact with each other or whether two proteins participate in a common pathway. Many such graphs are sparse.
4Motivation
- A simple method
- - store a transitive closure as a matrix: O(n^2) space
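As a point of comparison, the simple method can be sketched as follows. This is a minimal illustration using Warshall's algorithm, which the slides do not name; the function name `transitive_closure` and the sample graph are assumptions made for the example.

```python
def transitive_closure(n, edges):
    """Return an n x n boolean matrix m with m[u][v] true iff
    v is reachable from u through a path of at least one edge."""
    m = [[False] * n for _ in range(n)]
    for u, v in edges:
        m[u][v] = True
    # Warshall's algorithm: allow node k as an intermediate stop.
    for k in range(n):
        for i in range(n):
            if m[i][k]:
                for j in range(n):
                    if m[k][j]:
                        m[i][j] = True
    return m


# Hypothetical 4-node chain 0 -> 1 -> 2 -> 3.
closure = transitive_closure(4, [(0, 1), (1, 2), (2, 3)])
```

A query is then a single lookup `closure[u][v]`, but the matrix always occupies n^2 entries, however sparse the graph is.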
5Tree labeling
- Tree encoding
- Let G be a sparse graph. We will first find a spanning tree T of G.
- Each node v in T will be assigned an interval [start, end), where start is v's preorder number and end - 1 is the largest preorder number among all the nodes in T[v], the subtree of T rooted at v. So another node u labeled [start', end') is a descendant of v (with respect to T) iff start' ∈ [start, end).
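[Figure: spanning tree T with its interval labels. a [0, 12); b [1, 5); c [2, 4); k [3, 4); d [4, 5); r [5, 9); e [6, 9); f [7, 8); g [8, 9); h [9, 12); the two children i and j of h take [10, 11) and [11, 12).]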
Let v and u be two nodes in T, labeled [a, b) and [a', b'), respectively. If a ∈ [a', b'), v is a descendant of u. In this case, we say [a, b) is subsumed by [a', b'). Also, we must have b ≤ b'. Therefore, if v and u are not on the same path in T, we have either a' ≥ b or a ≥ b'. In the former case, we say [a, b) is smaller than [a', b'), denoted [a, b) ≺ [a', b'). In the latter case, [a', b') is smaller than [a, b).
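The tree encoding can be sketched in a few lines. This is an illustration only: the function names and the `children` dictionary are assumptions, and the tree shape is one reading of the example figure.

```python
def label_tree(children, root):
    """Assign [start, end) to every node: start is the preorder number,
    end - 1 is the largest preorder number in the node's subtree."""
    labels, start = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            # All descendants are numbered; counter is one past the largest.
            labels[node] = (start[node], counter)
            continue
        start[node] = counter
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return labels


def is_descendant(u_label, v_label):
    """True iff u's interval is subsumed by v's (u lies in the subtree of v)."""
    return v_label[0] <= u_label[0] < v_label[1]


# Assumed shape of the example spanning tree T.
children = {"a": ["b", "r", "h"], "b": ["c", "d"], "c": ["k"],
            "r": ["e"], "e": ["f", "g"], "h": ["i", "j"]}
labels = label_tree(children, "a")
```

Note that `is_descendant` treats a node as a descendant of itself, since its own start value lies inside its own interval.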
6Tree labeling
Interval sequences (label space)
[Figure: the tree T of the previous slide, each node shown with its interval label.]
7Main Algorithm
- Core tree (core of G)
- Let T be a spanning tree. We denote by E' the set of all the non-tree edges and by V' the set of all the end points of the non-tree edges. Then, V' = Vstart ∪ Vend, where Vstart stands for the set containing all the start nodes of the non-tree edges and Vend for the set of all the end nodes of the non-tree edges.
- Definition 1. (anti-subsuming subset) A subset S ⊆ Vstart is called an anti-subsuming set iff |S| > 1 and no two nodes in S are related by the ancestor-descendant relationship with respect to T.
Vstart = {d, f, g, h}, Vend = {c, k, e, d, g}
Anti-subsuming subsets:
{d, f}, {d, g}, {d, h}, {f, g}, {f, h}, {g, h},
{d, f, g}, {d, f, h}, {d, g, h}, {f, g, h}, {d, f, g, h}
[Figure: the example graph G, i.e., the spanning tree T plus the non-tree edges.]
8Main Algorithm
- Core tree (core of G)
- Definition 2. (critical node) A node v in a spanning tree T of G is critical if v ∈ Vstart or there exists an anti-subsuming subset S = {v1, v2, ..., vk} for k ≥ 2 such that v is the lowest common ancestor of v1, v2, ..., vk. We denote by Vcritical the set of all critical nodes.
- In the graph, node e is the lowest common ancestor of {f, g}, and node a is the lowest common ancestor of {d, f, g, h}. So e and a are critical nodes. In addition, each v ∈ Vstart is a critical node. So all the critical nodes of G with respect to T are {d, f, g, h, e, a}.
[Figure: the example graph G with its critical nodes.]
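One way to compute the critical nodes without enumerating anti-subsuming subsets is sketched below. It relies on an observation that is not stated in the slides: a node is the lowest common ancestor of some anti-subsuming subset exactly when at least two of its child subtrees contain a node of Vstart. The function name and the tree shape are assumptions.

```python
def critical_nodes(children, root, v_start):
    """Return the set of critical nodes of the tree rooted at `root`."""
    critical = set()

    def visit(node):
        # hits = number of child subtrees containing a Vstart node.
        hits = sum(1 for c in children.get(node, []) if visit(c))
        if node in v_start or hits >= 2:
            critical.add(node)
        # Report whether this subtree contains any Vstart node.
        return node in v_start or hits > 0

    visit(root)
    return critical


# Assumed shape of the example spanning tree T and its Vstart.
children = {"a": ["b", "r", "h"], "b": ["c", "d"], "c": ["k"],
            "r": ["e"], "e": ["f", "g"], "h": ["i", "j"]}
found = critical_nodes(children, "a", {"d", "f", "g", "h"})
```

The recursion is fine for a small example; a deep tree would need an explicit stack.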
9Main Algorithm
- Core tree (core of G)
- Definition 3. (core of G) Let G = (V, E) be a directed graph. Let T be a spanning tree of G. The core of G with respect to T is a tree structure with the node set being Vcritical, in which there is an edge from u to v (u, v ∈ Vcritical) iff there is a path p from u to v in T and p contains no other critical nodes. The core of G with respect to T is denoted Gcore = (Vcore, Ecore).
Node: interval sequence
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
f: [3, 4)[4, 5)[7, 8)
d: [3, 4)[4, 5)
g: [2, 4)[8, 9)
[Figure: the example graph G (left) and its core Gcore over the nodes a, e, h, d, f, g (right).]
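Definition 3 can be turned into a small sketch directly: every critical node is linked to its nearest critical proper ancestor in T. This is a plain top-down pass written for illustration, not the bottom-up core-generation procedure of the slides; the function name and the tree shape are assumptions.

```python
def core_edges(children, root, critical):
    """Return the edges (u, v) of the core: u is the nearest critical
    proper ancestor of the critical node v in the spanning tree."""
    edges = []
    stack = [(root, None)]  # (tree node, nearest critical proper ancestor)
    while stack:
        node, anchor = stack.pop()
        if node in critical:
            if anchor is not None:
                edges.append((anchor, node))
            anchor = node
        for child in children.get(node, []):
            stack.append((child, anchor))
    return sorted(edges)


# Assumed shape of the example spanning tree T; critical nodes as listed
# in the slides.
children = {"a": ["b", "r", "h"], "b": ["c", "d"], "c": ["k"],
            "r": ["e"], "e": ["f", "g"], "h": ["i", "j"]}
edges = core_edges(children, "a", {"a", "d", "e", "f", "g", "h"})
```

In this example the root is itself critical, so the result is a single tree.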
10Main Algorithm
- Core generation
- Algorithm core-generation(T)
- 1. Mark any node in T which belongs to Vstart.
- 2. Let v be the first marked node encountered during the bottom-up searching of T. Create the first node for v in Gcore.
- 3. Let u be the currently encountered node in T. Let u' be a node in T for which a node in Gcore is created just before u is met. Do (4) or (5), depending on whether u is a marked node or not.
- 4. If u is a marked node, then do the following.
- (a) If u' is not a child (descendant) of u, create a link from u to u', called a left-sibling link and denoted as left-sibling(u) = u'.
11Main Algorithm
- Core generation
- Algorithm core-generation(T) (continued)
- (b) If u' is a child (descendant) of u, we will first create a link from u' to u, called a parent link and denoted as parent(u') = u. Then, we will go along the left-sibling chain starting from u' until we meet a node u'' which is not a child (descendant) of u. For each encountered node w except u'', set parent(w) ← u. Set left-sibling(u) ← u''. Remove left-sibling(w) for each child w of u.
- 5. If u is a non-marked node, then do the following.
- (c) If u' is not a child (descendant) of u, no node will be created.
- (d) If u' is a child (descendant) of u, we will go along the left-sibling chain starting from u' until we meet a node u'' which is not a child (descendant) of u. If the number of the nodes encountered during the chain navigation (not including u'') is more than 1, we will create a new node in Gcore and do the same operation as (4.b). Otherwise, no node is created.
12Main Algorithm
[Figure: steps (a)-(f) of core generation on the example tree. The early steps show the case where u' is not a child of u, in which a link to the left sibling is created; the final step shows the complete Gcore over a, e, h, d, f, g.]
13Main Algorithm
- Graph labeling Core-I
- Definition 4. Let Vcore = {v1, ..., vg} be the node set of Gcore. The core label for G is a set {L(v1), ..., L(vg)}, where each L(vl) (l = 1, ..., g) is an interval sequence associated with vl, satisfying the following two properties:
- (1) Let L(vl) = [al1, bl1), ..., [alr, blr) for some r. Then, for any i, j ∈ {1, ..., r}, [ali, bli) ≺ [alj, blj) if i < j, that is, bli ≤ alj. (In this sense, the intervals in L(vl) are considered to be sorted.)
- (2) Let [a, b) be the interval associated with a descendant of vl with respect to G. There exists an interval [ali, bli) (1 ≤ i ≤ r) in L(vl) such that a ∈ [ali, bli).
- Definition 5. (link graph) Let G = (V, E) be a directed graph. Let T be a spanning tree of G. The link graph of G with respect to T is a graph, denoted Glink, with the node set being V' (the end points of all the non-tree edges) and the edge set E' ∪ E'', where (v, u) ∈ E'' iff v ∈ Vend, u ∈ Vstart, and there exists a path from v to u in T.
14Main Algorithm
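[Figure: the link graph Glink over the nodes e, h, g, c, d, f, k.]
Gcom = Gcore ∪ Glink
Node: interval sequence
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
f: [3, 4)[4, 5)[7, 8)
d: [3, 4)[4, 5)
k: [3, 4)
g: [2, 4)[8, 9)
c: [2, 4)
[Figure: Gcom with the tree interval of each node (a [0, 12); h [9, 12); e [6, 9); c [2, 4); d [4, 5); f [7, 8); g [8, 9); k [3, 4)), processed in reverse topological order.]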
15Main Algorithm
- Generation of interval sequences
- 1. Scan the nodes of Gcom in reverse topological order.
- 2. For each node v, the interval sequence L(v) is stored in a linked list Av. Initially, Av contains only one interval, which is generated by labeling T.
- 3. Let v1, ..., vk be the children of v (in Gcom). Merge Av with each Avl for the child node vl (l = 1, ..., k) as follows. Assume Av = p1 → p2 → ... → pg and Avl = q1 → q2 → ... → qh. Assume that both Av and Avl are increasingly ordered. (As we will see soon, any interval sequence generated by the following algorithm has this nice property: it contains only intervals not on the same path in T. Initially, Av contains only one interval, so it is trivially sorted.)
16Main Algorithm
- - Generation of interval sequences
- 4. We step through both Av and Avl from left to right. Let pi = [ai, bi) and qj = [a'j, b'j) be the intervals encountered. We will conduct the following checkings.
- (i) If ai ≥ b'j, insert qj into Av after pi-1 and before pi and move to qj+1.
- (ii) If ai ∈ [a'j, b'j), remove pi from Av and move to pi+1. (pi is subsumed by qj.)
- (iii) If a'j ∈ [ai, bi), ignore qj and move to qj+1. (qj is subsumed by pi, but it should not be removed from Avl.)
- (iv) If a'j ≥ bi, ignore pi and move to pi+1.
- (v) If ai = a'j and bi = b'j, ignore both pi and qj, and move to pi+1 and qj+1.
17Main Algorithm
- Generation of interval sequences Example.
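Before merging: A1 = [2, 4)[4, 5)[7, 8), A2 = [2, 4)[8, 9)
After merging A2 into A1: A1 = [2, 4)[4, 5)[7, 8)[8, 9), A2 = [2, 4)[8, 9)
[Figure: the pointers p (into A1) and q (into A2) stepping through both lists from left to right.]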
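The merge of two interval sequences can be sketched as below. It uses Python lists and builds a new list instead of editing a linked list in place, and it appends whatever is left of the child's list once Av is exhausted; both choices are assumptions made for the sketch, as is the function name. The equal-interval test is checked first because it would otherwise also satisfy the subsumption tests.

```python
def merge(av, avl):
    """Merge the sorted interval sequence avl into av and return the result.
    Intervals are (start, end) pairs standing for [start, end)."""
    out = []
    i = j = 0
    while i < len(av) and j < len(avl):
        (a, b), (a2, b2) = av[i], avl[j]
        if (a, b) == (a2, b2):   # same interval: keep one copy
            out.append(av[i])
            i += 1
            j += 1
        elif a >= b2:            # q lies entirely before p: insert q
            out.append(avl[j])
            j += 1
        elif a2 <= a < b2:       # p is subsumed by q: drop p
            i += 1
        elif a <= a2 < b:        # q is subsumed by p: skip q
            j += 1
        else:                    # p lies entirely before q: keep p
            out.append(av[i])
            i += 1
    out.extend(av[i:])
    out.extend(avl[j:])          # assumed handling of the leftover tail
    return out


merged = merge([(2, 4), (4, 5), (7, 8)], [(2, 4), (8, 9)])
```

Because a new list is returned, the child's sequence is left untouched, which matches the remark that a subsumed qj must not be removed from Avl.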
18Main Algorithm
- Core labels
[Figure: Gcore with the core label of each node.]
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
f: [3, 4)[4, 5)[7, 8)
d: [3, 4)[4, 5)
g: [2, 4)[8, 9)
19Main Algorithm
- Non-tree labeling
- Let Vcore = {v1, ..., vj}. We store the core label of G as a list s1 = L(v1), ..., sj = L(vj). Then, we define a function f: Vcore → {1, ..., j} such that for each v ∈ Vcore, f(v) = i iff si = L(v). Based on the above concepts, we define Core-I below.
f(a) = 1, f(h) = 2, f(e) = 3, f(f) = 4, f(d) = 5, f(g) = 6
s1 = L(a) = [0, 12)
s2 = L(h) = [2, 4)[4, 5)[6, 9)[9, 12)
s3 = L(e) = [2, 4)[4, 5)[6, 9)
s4 = L(f) = [3, 4)[4, 5)[7, 8)
s5 = L(d) = [3, 4)[4, 5)
s6 = L(g) = [2, 4)[8, 9)
20Main Algorithm
- Non-tree labeling
- Each node v in V is associated with two nodes v- and v*.
- v-: a critical node in T[v], which is closest to v.
- v*: the lowest ancestor of v (in T), which has a non-tree incoming edge.
Example. r- = e; r* does not exist. e- = e, e* = e.
[Figure: the example graph G.]
21Main Algorithm
- Non-tree labeling
- Definition (Core-I) Let v be a node in G. The non-tree label of v is a pair <d, t>, where
- d = i if v- exists and f(v-) = i. If v- does not exist, let d be the special symbol "-".
- t = [x, y) if v* exists and [x, y) is the interval of v*. If v* does not exist, let t be "-".
22Main Algorithm
- Non-tree labeling
- Proposition Assume that u and v are two nodes in G, labeled ([a1, b1), <x1, y1>) and ([a2, b2), <x2, y2>), respectively. Node v is reachable from u iff one of the following conditions holds:
- (i) [a2, b2) is subsumed by [a1, b1), or
- (ii) there exists an interval [a, b) in s_x1 such that for y2 = [a', b') we have a' ∈ [a, b) (i.e., y2 is subsumed by [a, b)).
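A Core-I query can be sketched as below, assuming that each interval sequence is kept as a sorted Python list so that condition (ii) becomes a binary search. The list `S` copies the core labels of the example. The `LABEL` table is hand-derived from the definitions for a few nodes of the example and is an assumption, as are all names; `None` stands for the special symbol "-".

```python
from bisect import bisect_right

# Core label list s1..s6 of the example (position 0 is unused padding).
S = [None,
     [(0, 12)],
     [(2, 4), (4, 5), (6, 9), (9, 12)],
     [(2, 4), (4, 5), (6, 9)],
     [(3, 4), (4, 5), (7, 8)],
     [(3, 4), (4, 5)],
     [(2, 4), (8, 9)]]

# node -> (tree interval, d, t), derived by hand for illustration.
LABEL = {
    "r": ((5, 9), 3, None),       # nearest critical node below r is e
    "e": ((6, 9), 3, (6, 9)),     # e is critical and has an incoming non-tree edge
    "h": ((9, 12), 2, None),
    "f": ((7, 8), 4, (6, 9)),     # lowest ancestor with an incoming non-tree edge is e
    "k": ((3, 4), None, (3, 4)),  # no critical node in the subtree of k
}


def contains(seq, point):
    """Binary search: does some interval of the sorted sequence contain point?"""
    k = bisect_right(seq, (point, float("inf"))) - 1
    return k >= 0 and seq[k][0] <= point < seq[k][1]


def reachable(u, v):
    """True iff v is reachable from u according to conditions (i) and (ii)."""
    (a1, b1), x1, _ = LABEL[u]
    (a2, _b2), _, y2 = LABEL[v]
    if a1 <= a2 < b1:                   # (i) tree path
        return True
    if x1 is None or y2 is None:
        return False
    return contains(S[x1], y2[0])       # (ii) via the core label
```

For instance, `reachable("r", "k")` succeeds through condition (ii): the start value 3 of k's interval falls inside [2, 4) in s3.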
23Main Algorithm
- Graph labeling Core-II
- We can store the core label of G as a d × g boolean matrix M, where d is the number of the end nodes of all non-tree edges and g the number of the nodes in Gcore.
- Let u1, u2, ..., ud be all the end nodes of the non-tree edges. Let v1, v2, ..., vg be all the nodes in Gcore. Assign each ui an index, denoted index(ui) (i.e., u1, u2, ..., ud will be assigned contiguous integers, starting from 0). Assign each vj an index, denoted Index(vj). An entry M[index(ui), Index(vj)] is set to 1 if there exists an interval [a, b) in L(vj) such that for ui's interval [a', b') we have a' ∈ [a, b); otherwise, it is set to 0.
Matrix as displayed (one row per node of Gcore, one column per end node):
     0 1 2 3 4
0    1 1 1 1 1
1    1 1 1 1 1
2    1 1 1 1 1
3    0 1 1 0 0
4    0 1 1 0 0
5    1 0 0 0 1
index(c) = 0, index(k) = 1, index(d) = 2, index(e) = 3, index(g) = 4
Index(a) = 0, Index(h) = 1, Index(e) = 2, Index(f) = 3, Index(d) = 4, Index(g) = 5
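Filling the Core-II matrix from the core labels can be sketched as follows. The sketch applies the stated rule literally, with one row per end node and one column per core node (the d × g orientation of the text); the function names and the dictionaries are assumptions built from the example's intervals and core labels.

```python
# Tree intervals of the end nodes of the non-tree edges, in index order.
END = {"c": (2, 4), "k": (3, 4), "d": (4, 5), "e": (6, 9), "g": (8, 9)}

# Core labels of the nodes of Gcore, in index order.
CORE = {
    "a": [(0, 12)],
    "h": [(2, 4), (4, 5), (6, 9), (9, 12)],
    "e": [(2, 4), (4, 5), (6, 9)],
    "f": [(3, 4), (4, 5), (7, 8)],
    "d": [(3, 4), (4, 5)],
    "g": [(2, 4), (8, 9)],
}


def covers(seq, point):
    """True iff some interval [a, b) of seq contains point."""
    return any(a <= point < b for a, b in seq)


def build_matrix(end, core):
    """M[index(u)][Index(v)] = 1 iff u's start value lies in an interval of L(v)."""
    return [[1 if covers(seq, interval[0]) else 0 for seq in core.values()]
            for interval in end.values()]


M = build_matrix(END, CORE)
row = {name: i for i, name in enumerate(END)}
col = {name: i for i, name in enumerate(CORE)}
```

Once M is built, the binary search over a core label is replaced by a single lookup such as `M[row["k"]][col["f"]]`, which is where the constant query time comes from.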
24- A new algorithm for graph reachability
- - Core tree
- - Graph labeling Core-I
- query time O(log(min{b, s}))
- labeling time O(n + e + t·min{b, s})
- space overhead O(n + s·min{b, s})
- - Graph labeling Core-II
- query time O(1)
- labeling time O(n + e + t·min{b, s} + d·s·log(min{b, s}))
- space overhead O(n + d·s)
25Evaluation of Twig Pattern Queries Based on
Ordered Tree matching
Yangjun Chen Dept. Applied Computer
Science, University of Winnipeg 515 Portage
Ave. Winnipeg, Manitoba, Canada R3B 2E9
26Outline
- Motivation
- Algorithm for tree pattern query evaluation based on ordered tree matching
- - Tree encoding
- - Algorithm description
- Index-based algorithm
- Conclusion
27Motivation
- XPath evaluation against XML documents
- - XPath expression
- a[b[c and .//d]]/b[c and e]//d
- book[title = 'Art of Programming']//author[fn = 'Donald' and ln = 'Knuth']
<document> <book> <title> Art of Programming </title> <author> <fn>Donald Knuth</fn>
[Figure: document tree. book has the children title ("Art of Programming") and author; author has the children fn ("Donald") and ln ("Knuth").]
28Motivation
- XPath evaluation against XML documents
- Evaluation based on unordered tree matching
- XPath expression
- Definition An embedding of a twig pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
- (i) Preserve node label: For each u ∈ Q, label(u) matches label(f(u)).
- (ii) Preserve parent-child/ancestor-descendant relationships: If (u, v) is a /-edge in Q, then f(v) is a child of f(u) in T; if (u, v) is a //-edge in Q, then f(v) is a descendant of f(u) in T.
[Figure: a twig pattern Q and a document tree T.]
29Motivation
- XPath evaluation against XML documents
- - Evaluation based on ordered tree matching
- XPath expression
- abc/following-sibling .//d/following-sibli
ngbc/following- sibling e//d
30Motivation
- XPath evaluation against XML documents
- - Evaluation based on ordered tree matching
- Definition An embedding of a twig pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
- (i) Preserve node label: For each u ∈ Q, label(u) matches label(f(u)).
- (ii) Preserve parent-child/ancestor-descendant relationships: If (u, v) is a /-edge in Q, then f(v) is a child of f(u) in T; if (u, v) is a //-edge in Q, then f(v) is a descendant of f(u) in T.
- (iii) Preserve sibling order: For any two nodes v1 ∈ Q and v2 ∈ Q, if v1 is to the left of v2, then f(v1) is to the left of f(v2) in T.
[Figure: a document tree T with the nodes v1, ..., v6 and a twig pattern Q with the nodes q1, q2, q3.]
31Algorithm for tree pattern query evaluation
- Tree encoding
- Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted as a(v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
- (i) ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.
- (ii) parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.
- (iii) from left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2 and r1 < l2.
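The three tests on quadruples translate directly into code. The sketch below uses a named tuple whose field names are assumptions; the sample values are the quadruples of the example document tree.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    doc: int    # DocId
    left: int   # LeftPos
    right: int  # RightPos
    level: int  # LevelNum


def is_ancestor(v1, v2):
    """v1 is an ancestor of v2: same document and v1's range encloses v2's."""
    return v1.doc == v2.doc and v1.left < v2.left and v1.right > v2.right


def is_parent(v1, v2):
    """v1 is the parent of v2: an ancestor exactly one level above."""
    return is_ancestor(v1, v2) and v2.level == v1.level + 1


def is_left_of(v1, v2):
    """v1 is to the left of v2: v1 ends before v2 starts."""
    return v1.doc == v2.doc and v1.right < v2.left


# Quadruples of the example tree, listed in postorder.
v1, v2, v3 = Pos(1, 3, 3, 3), Pos(1, 5, 5, 4), Pos(1, 4, 6, 3)
v4, v5, v6 = Pos(1, 2, 7, 2), Pos(1, 8, 8, 2), Pos(1, 1, 9, 1)
```

Each test costs a constant number of comparisons, so no tree traversal is needed to relate two nodes.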
32Algorithm for tree pattern query evaluation
[Figure: document tree T with the quadruple of each node.]
v6: (1, 1, 9, 1)
v4: (1, 2, 7, 2)
v5: (1, 8, 8, 2)
v1: (1, 3, 3, 3)
v3: (1, 4, 6, 3)
v2: (1, 5, 5, 4)
33Algorithm for tree pattern query evaluation
- Main algorithm
- 1. First, we will number both T and Q in postorder. So the nodes in both trees will be referenced by their postorder numbers.
[Figure: T and Q numbered in postorder. T: v1 = 1, v2 = 2, v3 = 3, v4 = 4, v5 = 5, v6 = 6. Q: q1 = 1, q2 = 2, q3 = 3.]
- 2. We will access the nodes in T and the nodes in Q along their postorder numbers. Each time we meet a node i in Q, we will associate it with an array, Ai, of length |T|, indexed from 0 to |T| - 1. The Ai's are manipulated as follows.
34Algorithm for tree pattern query evaluation
- (i) We set a virtual node for T, numbered 0, which is considered to be to the left of any node in T.
- (ii) If we find that Q[i] can be embedded in T[j], we will set Ai[j1], ..., Ai[jk] (0 ≤ k ≤ j - 1) to j, where each jl (0 ≤ l ≤ k) is a node to the left of j, to record the fact that j is the closest node to the right of jl such that T[j] embeds Q[i].
[Figure: T in postorder numbering, with the virtual node v0 (numbered 0) added to the left of all nodes.]
35Algorithm for tree pattern query evaluation
- (iii) If some time later we find another node p such that Q[i] can be embedded in T[p], we will set Ai[p1], ..., Ai[pq] to p, where each ps (1 ≤ s ≤ q) is to the left of p but to the right of jk.
- For all the other nodes j' such that T[j'] embeds Q[i], we will set values for the entries in Ai in the same way as (ii) and (iii).
- 3. During the process, when we meet i in Q and j in T, we will do the following:
- Let i1, ..., ik be the child nodes of i in Q. We first check starting from Ai1[l], where l = min{desc(j)} - 1 and desc(j) represents all the descendants of j. We begin the searching from min{desc(j)} - 1 because it is the closest node to the left of a descendant of j which has the least postorder number. Let Ai1[l] = j'. If (i, i1) is a /-edge, we will check whether j' is a child of j. Otherwise, we only check whether j' is a descendant of j. If it is not the case, we will check Ai1[j']. This process continues until one of the following conditions is satisfied:
- (i) Ai1 is exhausted (we cannot find a descendant j' of j such that T[j'] contains Q[i1]), or
- (ii) we find a j' satisfying the parent-child or ancestor-descendant relationship, depending on whether (i, i1) is a /-edge or a //-edge. Then, we will check Ai2[j'].
36Algorithm for tree pattern query evaluation
- If Ai1 is exhausted (case (i)), it shows that Q[i1] cannot be embedded in any subtree rooted at a child node (for a /-edge) or a descendant (for a //-edge) of j. It indicates that Q[i1] cannot be embedded into T[j] and thus T[j] cannot embed Q[i]. We will continue to check i against the next node in T.
- If it is case (ii), we will check Ai2, starting from j'. For all the other Ail's (l = 3, ..., k), we will do the same checkings. If for each il (l = 1, ..., k) we can find a j' such that T[j'] embeds Q[il], it shows that T[j] embeds Q[i] and we will set some new values in Ai as described in (2).
[Figure: node i of Q with its children i1, i2, ..., ik, and node j of T with the descendants j' found for them, searched from position l.]
37Algorithm for tree pattern query evaluation
Example.
[Figure: the algorithm run on T (with virtual node v0) and Q, shown step by step up to panels (e) and (f).]
The time complexity of the algorithm is O(|T|·|Q|).
38Index-base algorithm
- XB-tree
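- An XB-tree is a variant of the B-tree over a sequence of quadruples.
(1, 3, 3, 3) (1, 5, 5, 4) (1, 4, 6, 3) (1, 2, 7, 2) (1, 8, 8, 2) (1, 1, 9, 1), sorted by RightPos values
[Figure: XB-tree over this sequence. The root page P1 holds the entries [3, 5], [2, 7], [1, 9], which point to the leaf pages P2 = {[3, 3], [5, 5]}, P3 = {[4, 6], [2, 7]} and P4 = {[8, 8], [1, 9]}. Each page P stores P.parent and P.parentIndex. The leaf entries carry the labels c, b, c, b, c, a.]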
39Index-base algorithm
- Searching an XB-tree
- - b = (P, i) indicates that the ith entry in the page P is currently accessed.
- - advance(b) (going up from a page to its parent): If b = (P, i) does not point to the last entry of P, i ← i + 1. Otherwise, b ← (P.parent, P.parentIndex).
- - drilldown(b) (going down from a page to one of its children): If b = (P, i) and P is not a leaf page, b ← (P', 1), where P' is the ith child page of P.
- - Initially, b ← (rootPage, 1), pointing to the first entry in the root page. We finish a traversal of the XB-tree when b = (rootPage, last), where last points to the last entry in the root page, and we advance it (in this case, we set b to nil).
40Index-base algorithm
- Searching an XB-tree
- Assume that i in Q is the node currently encountered. We will find, by searching the XB-tree, a node j of T with label(i) = label(j), for which it is possible that T[j] embeds Q[i].
- - L(i): the most recently found node such that Q[i] can be embedded into T[L(i)].
- Procedure search(XB, i)
- 1. Let i1, ..., ik be the children of i. Assume that L(ik) = v. l ← v.LeftPos. r ← v.RightPos. If i is a leaf node, then l ← ∞, r ← 0.
- 2. Assume that b = (P, c). Let j be the entry pointed to by b. We will do the following checkings.
- (i) If P is a leaf page, label(j) = label(i), j.LeftPos < l and j.RightPos > r, then b ← advance(b), return j.
- (ii) If P is an internal page, and j.LeftPos < l and j.RightPos > r, b ← drilldown(b).
- (iii) If j.RightPos < r, then b ← advance(b). If b = nil, return nil.
- 3. Repeat (2) until the whole XB-tree is traversed (i.e., when b = nil) or a node j is found (i.e., the condition in (2)-(i) is satisfied).
41- Algorithm for evaluating tree pattern queries based on ordered tree matching
- Time complexity: O(|T|·|Q|).
- Space complexity: O(|T|·|Q|).
- The algorithm can be integrated into an index environment by using XB-trees.