A Summary of XISS and Index Fabric

Provided by: CSI115

Category:

more less

Transcript and Presenter's Notes

Title: A Summary of XISS and Index Fabric

1
A Summary of XISS and Index Fabric

Ho Wai Shing

2
Contents

Definition of Terms
XISS (Li and Moon, VLDB 2001)
  Numbering Scheme
  Indices Stored
  Join Algorithms
Index Fabric (Cooper et al., VLDB 2001)
  Patricia
  Balanced Trie
  Raw Path Index

3
Definition of Terms

Absolute Path Expression (APE)
a path that starts from the root; each step traverses the child axis or the attribute axis, with no wildcards
e.g., /, /A/B, /A/@C

4
Definition of Terms

Regular Path Expression (RPE)
may or may not start from the root
may traverse different axes (restricted here to child, descendant-or-self, and attribute, since they are the most commonly used)
may contain wildcards
e.g., //, /A//C, /A/_/B, //A/B//C/D/@E

5
XISS

XISS: XML Indexing and Storage System
by Li and Moon, published in VLDB 2001 under the title "Indexing and Querying XML Data for Regular Path Expressions"
decomposes XML documents and stores them in the indices
can answer regular path expressions

6
XISS - General Idea

solves an RPE by decomposing it into these 5 basic subexpressions
element retrieval
attribute retrieval
steps involving an element and an attribute
steps involving two elements
a Kleene closure of another subexpression

7
XISS - General Idea

each subexpression is solved by its own method
element index lookup
attribute index lookup
EA-join
EE-join
KC-join

8
XISS - General Idea

result lists from the subexpressions are joined to produce the final result
to make this decomposition and join efficient, an efficient method for determining ancestor-descendant relationships is needed
XISS uses a numbering scheme based on an extended preorder traversal

9
XISS - Numbering Scheme

number all the nodes with an <order, size> tuple
order is assigned based on an extended preorder traversal
size can be thought of as the size of the subtree rooted at that node

10
XISS - Numbering Scheme

The rules for number assignment
if x precedes y in the preorder traversal, x.order < y.order (preorder)
if x and y are siblings, either x.order + x.size < y.order or y.order + y.size < x.order (siblings won't overlap)
if x is an ancestor of y, x.order < y.order < x.order + x.size (ancestor contains descendant)
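
The rules above can be checked mechanically. A minimal Python sketch, assuming each node is represented only by its (order, size) pair; the Node type, the function names, and the sample numbers are illustrative and are not taken from the XISS paper:

```python
from typing import NamedTuple


class Node(NamedTuple):
    order: int
    size: int


def is_ancestor(x: Node, y: Node) -> bool:
    """Ancestor rule as stated on the slide: x.order < y.order < x.order + x.size."""
    return x.order < y.order < x.order + x.size


def siblings_disjoint(x: Node, y: Node) -> bool:
    """Sibling rule: the two (order, order + size) ranges must not overlap."""
    return x.order + x.size < y.order or y.order + y.size < x.order


# Hypothetical numbering of a small tree: root has children a and b, a has child g.
root, a, b, g = Node(1, 100), Node(10, 30), Node(45, 20), Node(15, 5)
print(is_ancestor(root, g), is_ancestor(a, g), is_ancestor(a, b))  # True True False
```

The point of the scheme is that an ancestor-descendant test needs only two integer comparisons and no tree traversal.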

11
XISS - Numbering Scheme

Actual Assignment
uses heuristics to reserve some space between orders
makes sizes larger than currently needed, leaving room for future node insertions
attributes are placed before sibling elements
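
One way to picture such an assignment is a preorder walk that leaves a fixed gap before every node. This is only a sketch: the fixed `gap` parameter is an assumed stand-in for the heuristics, which the slides do not spell out, and attributes are not modeled. The `number` function and the tuple-based tree encoding are hypothetical.

```python
def number(tree, order=1, gap=2, out=None):
    """Assign (order, size) to every node of tree = (label, [children]).

    Labels are assumed to be unique so they can serve as dictionary keys.
    """
    if out is None:
        out = {}
    label, children = tree
    next_free = order + 1 + gap              # unused orders before the first child
    for child in children:
        number(child, next_free, gap, out)
        child_order, child_size = out[child[0]]
        next_free = child_order + child_size + 1 + gap   # unused orders between siblings
    out[label] = (order, next_free - order)  # size covers all descendants plus slack
    return out


# A has children B and C; C has child D.
numbers = number(("A", [("B", []), ("C", [("D", [])])]))
print(numbers)
```

Because orders are left unused between siblings and inside each size, a new node can later be slotted in without renumbering its neighbors, as long as the reserved space lasts.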

12
XISS - Index Organization

There are 5 indices
Name Index
Element Index
Attribute Index
Structure Index
Value Table

13
XISS - Name Index

maps an element or attribute name to a name identifier (nid)
the nid represents that element or attribute in further query evaluation
reduces the time spent on string comparison in further index lookups
stored in a B-tree

14
XISS - Name Index
[Figure: B-tree mapping each Name to its nid]
15
XISS - Value Table

stores all the string values of the XML document

16
XISS - Element Index

input: nid; output: list of element records
implemented by a B-tree
leaves point to a list of document IDs (did); each list entry points to a list of all elements with the same name in the same document
17
XISS - Element Index
[Figure: B-tree keyed by nid; each leaf points to a did list, and each did entry points to an element list. Element record: <order, size>, Depth, Parent ID]
18
XISS - Attribute Index

very similar to the Element Index
each attribute record always has a value identifier (vid)

19
XISS - Structure Index

input: did; output: array containing all the elements and attributes in the document
implemented by a B-tree

20
XISS - Structure Index
[Figure: B-tree keyed by did; each leaf points to a record array. Record: nid, <order, size>, Parent order, Child order, Sibling order, Attribute order]
21
XISS - Indices

When to use which index?
first use the Name Index to find the nid of the element/attribute to be queried
search the Element/Attribute Index for the records
if values are needed, look them up in the Value Table
use the Structure Index to rebuild or traverse the XML document tree
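
The first two lookup steps can be sketched with plain Python dictionaries standing in for the B-trees. This is an illustration under that assumption; the function names and sample records are hypothetical, and the record fields follow the element record described earlier (<order, size>, Depth, Parent ID).

```python
from collections import namedtuple

ElementRecord = namedtuple("ElementRecord", "order size depth parent")

name_index = {}      # name -> nid                          (a B-tree in XISS)
element_index = {}   # nid -> {did: [ElementRecord, ...]}   (a B-tree in XISS)


def nid_of(name):
    """Return the nid for a name, allocating a new one on first use."""
    return name_index.setdefault(name, len(name_index) + 1)


def add_element(did, name, record):
    element_index.setdefault(nid_of(name), {}).setdefault(did, []).append(record)


def find_elements(name):
    """Name Index lookup first, then Element Index lookup by nid."""
    nid = name_index.get(name)
    return {} if nid is None else element_index.get(nid, {})


add_element(1, "A", ElementRecord(1, 20, 1, 0))
add_element(1, "B", ElementRecord(5, 3, 2, 1))
add_element(2, "B", ElementRecord(7, 2, 3, 4))
print(find_elements("B"))   # records for B, grouped by document ID
```

After the first step only integers (nid, did, order) are compared, which is the saving in string comparison that the Name Index is meant to provide.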

22
XISS - Join Algorithms

After getting the record lists for each subexpression, we need to find out which records are answers to the original query
e.g., to find /A/B, we retrieve a record list of all A elements and another list of all B elements, and then have to find out which Bs match A/B

23
XISS - Join Algorithms

Three join algorithms are proposed
EA-join - merges an element record list and an attribute record list (solves A/@B)
EE-join - merges two element record lists (solves A/B or A//B)
KC-join - self-merges an element record list (solves (E)*)

24
XISS - EA-Join

to solve E/@A
input: an element record list and an attribute record list
finds the attribute records whose parents are in the element record list
both lists are sorted by did and then by order

25
XISS - EA-join

2-stage sort-merge
group by did first
then merge using order
output criterion: E is a parent of A
a single scan of both lists is enough
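
A single-scan merge of this kind might look as follows. It is a sketch, not the paper's algorithm verbatim: it assumes each attribute record carries the order of its owning element (`parent`), and that attributes are numbered right after their element so that parent orders never decrease along the attribute list. The record types and `ea_join` are illustrative names.

```python
from collections import namedtuple

Element = namedtuple("Element", "did order size")
Attribute = namedtuple("Attribute", "did order parent")  # parent = order of the owning element


def ea_join(elements, attributes):
    """Pair each attribute with its parent element; both lists sorted by (did, order)."""
    pairs = []
    i = 0
    for attr in attributes:
        key = (attr.did, attr.parent)
        # Advance the element cursor; it never moves backwards (single scan).
        while i < len(elements) and (elements[i].did, elements[i].order) < key:
            i += 1
        if i < len(elements) and (elements[i].did, elements[i].order) == key:
            pairs.append((elements[i], attr))
    return pairs


elements = [Element(1, 1, 10), Element(1, 5, 3), Element(2, 1, 4)]
attributes = [Attribute(1, 2, 1), Attribute(1, 20, 19), Attribute(2, 2, 1)]
print(len(ea_join(elements, attributes)))  # 2: the attribute whose parent has order 19 has no match
```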

26
XISS - EE-join

to solve E/_/E, e.g., E/E, E//E, E/_/E
input: two element record lists, E and F
output: pairs (e, f) where e is an ancestor of f
also uses a 2-stage sort-merge
however, the lists may need to be scanned multiple times (for special cases, e.g., the document has /A/A/B/B)
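
The need for repeated scanning shows up in a small sketch. The code below is an illustrative merge, not the paper's exact procedure: it uses the ancestor test x.order < y.order < x.order + x.size from the numbering scheme, and the `Rec` type and `ee_join` name are assumptions.

```python
from collections import namedtuple

Rec = namedtuple("Rec", "did order size")


def ee_join(E, F):
    """Return (e, f) pairs with e an ancestor of f; inputs sorted by (did, order)."""
    pairs = []
    start = 0
    for e in E:
        # F records at or before e can match neither e nor any later e.
        while start < len(F) and (F[start].did, F[start].order) <= (e.did, e.order):
            start += 1
        j = start  # rescanned for every e: nested ancestors share descendants
        while j < len(F) and F[j].did == e.did and F[j].order < e.order + e.size:
            pairs.append((e, F[j]))
            j += 1
    return pairs


# Hypothetical numbering of a document containing /A/A/B/B
A = [Rec(1, 1, 20), Rec(1, 2, 10)]
B = [Rec(1, 3, 5), Rec(1, 4, 1)]
print(len(ee_join(A, B)))  # 4: both Bs lie under both As
```

For a parent-child step such as A/B, the same loop could additionally compare the Depth fields of the two records; that filter is left out here.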

27
XISS - KC-join

to solve the Kleene closure of a subexpression
input: a list of element records that fit the base case
applies EE-join to the list repeatedly and stops when the result list no longer grows
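
The iterate-until-no-growth idea can be sketched as a fixpoint loop. To keep the example short, a naive nested-loop parent-child test stands in for EE-join, and results are kept as chains of records; both choices, along with the names `Rec`, `is_child`, and `kc_join`, are assumptions made for illustration.

```python
from collections import namedtuple

Rec = namedtuple("Rec", "did order size depth")


def is_child(parent, child):
    """Naive stand-in for an EE-join restricted to parent-child pairs."""
    return (child.did == parent.did
            and parent.order < child.order < parent.order + parent.size
            and child.depth == parent.depth + 1)


def kc_join(base):
    """Return all chains e1/e2/.../ek (k >= 1) of base records linked parent to child."""
    chains = [(e,) for e in base]
    frontier = chains
    while frontier:  # stop once an iteration adds nothing new
        frontier = [c + (f,) for c in frontier for f in base if is_child(c[-1], f)]
        chains = chains + frontier
    return chains


# Hypothetical A elements: a1 contains a2 and a4; a2 contains a3.
a1, a2, a3, a4 = Rec(1, 1, 20, 1), Rec(1, 2, 10, 2), Rec(1, 3, 2, 3), Rec(1, 15, 2, 2)
chains = kc_join([a1, a2, a3, a4])
print(len(chains), max(len(c) for c in chains))  # 8 3
```

The loop terminates because each extension step moves one level deeper in the document tree.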

28
Index Fabric

by Cooper et al., published in VLDB 2001 under the title "A Fast Index for Semistructured Data"
has 2 subtypes: raw path index and refined path index
uses the Patricia technique to compress the index

29
Index Fabric - General Idea

it is a balanced, disk-based indexing structure built on Patricia
each data node is associated with a key string, and this string is stored in the trie index for retrieval
the layered approach to building the index bounds the number of disk pages accessed per query

30
Index Fabric - General Idea

raw path index answers absolute path queries
refined path index answers any predefined queries
the difference between them is how the key is generated

31
Patricia

Patricia: Practical Algorithm To Retrieve Information Coded in Alphanumeric
by Morrison, in JACM 1968
a method for storing and retrieving strings in a space-efficient way
binary; uses bit comparisons; has a skip in each internal node

32
Patricia

an example Patricia trie

[Figure: binary Patricia trie with branches labeled 0 and 1 and leaves 101110, 101111, 110000, 110011]

33
Patricia

it's basically a trie with single-child internal nodes removed
search is done as follows
branch according to the value of the bit at each skip position
retrieve the string at the leaf
compare it with the query string
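
The search procedure can be tried on the four example strings 101110, 101111, 110000, and 110011. The sketch below assumes equal-length, distinct bit strings and stores in each internal node the absolute bit position to test rather than a relative skip count; `build` and `search` are illustrative names, not part of Morrison's paper.

```python
def build(keys):
    """Build a Patricia-style trie: a leaf is a string, an internal node is (bit, zero, one)."""
    if len(keys) == 1:
        return keys[0]
    # First position at which the keys disagree; earlier positions are skipped.
    bit = next(i for i in range(len(keys[0])) if len({k[i] for k in keys}) > 1)
    zeros = [k for k in keys if k[bit] == "0"]
    ones = [k for k in keys if k[bit] == "1"]
    return (bit, build(zeros), build(ones))


def search(node, query):
    """Branch on the tested bits only, then verify the leaf against the whole query."""
    while isinstance(node, tuple):
        bit, zero, one = node
        if bit >= len(query):
            return None
        node = one if query[bit] == "1" else zero
    return node if node == query else None


trie = build(["101110", "101111", "110000", "110011"])
print(trie)                      # (1, (5, '101110', '101111'), (4, '110000', '110011'))
print(search(trie, "110011"))    # 110011
print(search(trie, "100011"))    # None: the leaf reached is 101111, which differs
```

The final comparison is essential: skipped bits are never inspected on the way down, so a query can reach a leaf without actually matching it.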

34
Index Fabric - Balanced Trie

The number of disk pages accessed per query is bounded by the number of layers in the layered index
The idea is similar to that of a B-tree: the Patricia trie is decomposed into blocks, and an upper-layer trie is used to navigate among the blocks

35
Index Fabric - Balanced Trie
e.g.

[Figure: the example Patricia trie (leaves 101110, 101111, 110000, 110011) drawn in two layers, Layer 0 and Layer 1]

36
Index Fabric - Balanced Trie

There are 3 types of links in the balanced trie
far link: crosses layers, a result of branching
near link: stays within the same block, a result of branching
direct link: crosses layers, the root nodes are the same
Each query accesses one block in each layer

37
Index Fabric - Balanced Trie

increases speed by using traversals in the upper layers to skip nodes of the original trie
the number of pages accessed is bounded

38
Index Fabric - Raw Path

each data node is associated with a key
key = path (encoded in designators) + value
designators are special characters, each representing a name
APE queries are translated into key prefixes and submitted to the index trie

39
Index Fabric - Raw Path

Example
<invoice><buyer><name>HKU</name></buyer></invoice> is translated to IBNHKU (I, B, and N are designators)
a query for /invoice/buyer/name with value HKU is translated to the query string IBNHKU
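
The key encoding and prefix lookup can be sketched in a few lines. This sketch assumes the designator table I, B, N from the example and uses a sorted Python list with binary search as a stand-in for the index trie; `encode`, `prefix_lookup`, and the extra sample keys are hypothetical. A real index would use designator characters that cannot be confused with value text.

```python
import bisect

DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N"}  # assumed mapping from the example


def encode(path, value=""):
    """Translate an absolute path (plus optional value) into a key or key prefix."""
    steps = [s for s in path.split("/") if s]
    return "".join(DESIGNATORS[s] for s in steps) + value


def prefix_lookup(sorted_keys, prefix):
    """Return all keys starting with prefix; the sorted list stands in for the trie."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[lo:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


keys = sorted(["IBNHKU", "IBNMIT", "ISNAcme"])  # the last two keys are made-up samples
print(encode("/invoice/buyer/name", "HKU"))                 # IBNHKU
print(prefix_lookup(keys, encode("/invoice/buyer/name")))   # ['IBNHKU', 'IBNMIT']
```

Because every key that shares a path also shares a prefix, an absolute path query becomes one prefix search, which is why a single index lookup is enough.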

40
Index Fabric - Refined Path

Special designators can be assigned to special queries (which can be regular path expressions)
e.g., we define P as the path //buyer/name; PHKU then means that the document contains a buyer/name with value HKU
can answer any predefined RPE very quickly

41
Comparison