Title: A Summary of XISS and Index Fabric
2. Contents
- Definition of Terms
- XISS (Li and Moon, VLDB 2001)
  - Numbering Scheme
  - Indices Stored
  - Join Algorithms
- Index Fabric (Cooper et al., VLDB 2001)
  - Patricia
  - Balanced Trie
  - Raw Path Index
3. Definition of Terms
- Absolute Path Expression (APE)
  - a path that starts from the root; each step traverses the child axis or the attribute axis
  - no wildcards
  - e.g., /, /A/B, /A/@C
4. Definition of Terms
- Regular Path Expression (RPE)
  - may or may not start from the root
  - may traverse different axes (restricted here to child, descendant-or-self, and attribute, since these are the most commonly used)
  - may contain wildcards
  - e.g., //, /A//C, /A/*/B, //A/B//C/D/@E
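The two definitions can be told apart mechanically. The sketch below is only an illustration: `classify_path` is a hypothetical helper, it works on the abbreviated string form used in these slides, and it reports "APE" when the stricter definition holds and "RPE" otherwise (an APE is also a valid RPE).

```python
def classify_path(expr: str) -> str:
    """Return "APE" if expr meets the stricter definition, else "RPE"."""
    # An APE starts at the root with a single slash ...
    is_absolute = expr.startswith("/") and not expr.startswith("//")
    # ... never uses the descendant-or-self axis ...
    has_descendant_axis = "//" in expr
    # ... and contains no wildcard step.
    has_wildcard = "*" in expr
    if is_absolute and not has_descendant_axis and not has_wildcard:
        return "APE"
    return "RPE"


for example in ["/", "/A/B", "/A/@C", "/A//C", "/A/*/B", "//A/B//C/D/@E"]:
    print(example, classify_path(example))
```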
5. XISS
- XISS: XML Indexing and Storage System
- by Li and Moon, published in VLDB 2001 under the title "Indexing and Querying XML Data for Regular Path Expressions"
- decomposes XML documents and stores them in the indices
- can answer regular path expressions
6. XISS - General Idea
- solves an RPE by decomposing it into 5 basic kinds of subexpression
  - element retrieval
  - attribute retrieval
  - a step involving an element and an attribute
  - a step involving two elements
  - a Kleene closure of another subexpression
7. XISS - General Idea
- each subexpression is solved by its own method
  - element index lookup
  - attribute index lookup
  - EA-join
  - EE-join
  - KC-join
8. XISS - General Idea
- result lists from the subexpressions are joined to produce the final result
- to make this decomposition and join efficient, a fast way to determine ancestor-descendant relationships is needed
- XISS uses a numbering scheme based on an extended preorder
9. XISS - Numbering Scheme
- number every node with an <order, size> tuple
- order is assigned by an extended preorder traversal
- size can be thought of as the size of the subtree rooted at that node
10. XISS - Numbering Scheme
- The rules for number assignment
  - if x precedes y in the preorder traversal, x.order < y.order (preorder)
  - if x and y are siblings, either x.order + x.size < y.order or y.order + y.size < x.order (siblings do not overlap)
  - if x is an ancestor of y, x.order < y.order < x.order + x.size (ancestor contains descendant)
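These rules reduce the ancestor-descendant test to two integer comparisons. The sketch below shows that check; the `Node` type, the function names, and the sample numbers are illustrative assumptions, not taken from the XISS paper.

```python
from typing import NamedTuple


class Node(NamedTuple):
    order: int
    size: int


def is_ancestor(x: Node, y: Node) -> bool:
    """True if y's order falls inside the interval that x reserves."""
    return x.order < y.order < x.order + x.size


def siblings_disjoint(x: Node, y: Node) -> bool:
    """True if the intervals of two siblings do not overlap."""
    return x.order + x.size < y.order or y.order + y.size < x.order


# Hypothetical numbering: root contains a and b; a contains c.
root, a, b, c = Node(1, 100), Node(10, 30), Node(50, 20), Node(15, 5)
print(is_ancestor(root, c), is_ancestor(a, c), is_ancestor(b, c))
print(siblings_disjoint(a, b))
```

No tree traversal is needed: the test uses only the two tuples being compared.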
11. XISS - Numbering Scheme
- Actual Assignment
  - uses heuristics to reserve some space between orders
  - reserves extra room in the sizes for future node insertions
  - attributes are placed before sibling elements
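One way to picture the assignment is a preorder walk that leaves a fixed gap after every node. The sketch below is an assumption-laden illustration: the constant `gap` is a stand-in for the heuristics mentioned above (which the slides do not specify), the tree is given as nested `(name, children)` tuples with unique names, and the attribute `@id` is listed before its sibling elements.

```python
def assign_numbers(tree, start=1, gap=2):
    """Return {name: (order, size)} for a tree of (name, [children]) tuples."""
    numbers = {}

    def visit(node, order):
        name, children = node
        # Leave `gap` unused orders right after this node for later insertions.
        next_free = order + 1 + gap
        for child in children:
            next_free = visit(child, next_free)
        # The size covers every descendant plus the reserved slack.
        size = next_free - order
        numbers[name] = (order, size)
        # The next sibling starts beyond this node's interval, again with a gap.
        return order + size + 1 + gap

    visit(tree, start)
    return numbers


doc = ("invoice", [("@id", []), ("buyer", [("name", [])]), ("total", [])])
numbers = assign_numbers(doc)
print(numbers)
```

With the default gap, `buyer` receives (10, 9) and `name` receives (13, 3), so `name` lies strictly inside the interval of `buyer`, while `total` starts at 22, past the end of that interval.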
12. XISS - Index Organization
- There are 5 indices
  - Name Index
  - Element Index
  - Attribute Index
  - Structure Index
  - Value Table
13. XISS - Name Index
- maps an element or attribute name to a name identifier (nid)
- the nid represents that element or attribute in further query evaluation
- reduces the time spent on string comparison in later index lookups
- stored in a B-tree
14. XISS - Name Index
[Figure: a B-tree mapping each name to its nid]
15. XISS - Value Table
- stores all the string values of the XML document
16. XISS - Element Index
- input: nid; output: list of element records
- implemented as a B-tree
- each leaf points to a list of document IDs (did); each entry in that list points to a list of all elements with the same name in the same document
17. XISS - Element Index
[Figure: a B-tree keyed by nid leads to a did list; each did entry points to an element list; each element record holds <order, size>, Depth, ParentID]
18. XISS - Attribute Index
- very similar to the Element Index
- each attribute record always has a value identifier (vid)
19. XISS - Structure Index
- input: did; output: an array containing all the elements and attributes in the document
- implemented as a B-tree
20. XISS - Structure Index
[Figure: a B-tree keyed by did leads to a record array; each record holds nid, <order, size>, Parent order, Child order, Sibling order, Attribute order]
21. XISS - Indices
- When to use which index?
  - first use the Name Index to find the nid of the element or attribute being queried
  - search the Element or Attribute Index for the records
  - if values are needed, look them up in the Value Table
  - use the Structure Index to rebuild or traverse the XML document tree
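The lookup order above can be sketched with in-memory stand-ins. In this sketch plain dicts replace the B-trees, and every name, identifier, and record value is hypothetical sample data chosen for illustration only.

```python
# Plain dicts stand in for the B-trees; all sample data is hypothetical.
name_index = {"buyer": 1, "name": 2, "@id": 3}   # name -> nid
element_index = {                                 # nid -> did -> element records
    1: {"doc1": [{"order": 10, "size": 9, "depth": 2, "parent": 1}]},
    2: {"doc1": [{"order": 13, "size": 3, "depth": 3, "parent": 10}]},
}
attribute_index = {                               # nid -> did -> attribute records
    3: {"doc1": [{"order": 4, "size": 3, "parent": 1, "vid": 7}]},
}
value_table = {7: "A-1001"}                       # vid -> string value


def lookup(index, name):
    """Step 1: resolve the name to a nid. Step 2: fetch its records per document."""
    nid = name_index.get(name)
    if nid is None:
        return []
    by_doc = index.get(nid, {})
    return [(did, rec) for did in sorted(by_doc) for rec in by_doc[did]]


def attribute_values(name):
    """Step 3: follow each attribute record's vid into the value table."""
    return [value_table[rec["vid"]] for _, rec in lookup(attribute_index, name)]


print(lookup(element_index, "name"))
print(attribute_values("@id"))
```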
22. XISS - Join Algorithms
- after getting the record list for each subexpression, we need to find out which records answer the original query
- e.g., to find /A/B, we retrieve a record list of all A elements and another of all B elements, and must then determine which B's match A/B
23. XISS - Join Algorithms
- Three join algorithms are proposed
  - EA-join: merges an element record list and an attribute record list (solves A/@B)
  - EE-join: merges two element record lists (solves A/B or A//B)
  - KC-join: self-merges an element record list (solves (E)*)
24. XISS - EA-Join
- solves E/@A
- input: an element record list and an attribute record list
- finds the attribute records whose parents are in the element record list
- both lists are sorted by did and then by order
25. XISS - EA-join
- 2-stage sort-merge
  - group by did first
  - then merge on order
- output criterion: E is the parent of A
- a single scan of both lists is enough
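A single-scan merge of this kind might look as follows. This is a sketch under assumptions: element records are reduced to `(did, order)`, attribute records to `(did, order, parent_order)`, and attributes are numbered directly after their parent element, so parents appear in the same relative order as their attributes. Comparing `(did, order)` tuples handles the grouping by did and the merge on order in one comparison.

```python
def ea_join(elements, attributes):
    """Pair each attribute with its parent element in one pass over both lists.

    elements:   sorted list of (did, order)
    attributes: sorted list of (did, order, parent_order)
    """
    result = []
    i = 0
    for did, a_order, parent in attributes:
        # Advance the element cursor until it reaches or passes the parent.
        while i < len(elements) and elements[i] < (did, parent):
            i += 1
        # The cursor stays on a match so sibling attributes can reuse it.
        if i < len(elements) and elements[i] == (did, parent):
            result.append((elements[i], (did, a_order, parent)))
    return result


elements = [("d1", 1), ("d1", 10), ("d2", 1)]
attributes = [("d1", 2, 1), ("d1", 3, 1), ("d1", 11, 10), ("d2", 5, 4)]
pairs = ea_join(elements, attributes)
print(pairs)
```

The last attribute has a parent (order 4 in d2) that is absent from the element list, so it produces no output pair.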
26. XISS - EE-join
- solves steps between two elements, e.g., E/E, E//E, E/*/E
- input: two element record lists, E and F
- output: pairs (e, f) where e is an ancestor of f
- also uses a 2-stage sort-merge
- however, the lists may need to be scanned multiple times (in special cases, e.g., when the document contains /A/A/B/B)
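The repeated scanning in the nested case can be seen in a small sketch. The record layout `(did, order, size)`, the function name, and the sample numbers are assumptions; the ancestor test is the interval rule from the numbering scheme. For a parent-child step such as A/B, an additional check on a depth or parent field would be needed, which this sketch omits.

```python
def ee_join(E, F):
    """Return (e, f) pairs where e is an ancestor of f.

    E, F: lists of (did, order, size), each sorted by (did, order).
    """
    pairs = []
    start = 0
    for e in E:
        did, order, size = e
        # Skip F records that cannot be descendants of this or any later e.
        while start < len(F) and (F[start][0], F[start][1]) <= (did, order):
            start += 1
        # Rescan from `start`: nested ancestors revisit the same F records.
        j = start
        while j < len(F) and F[j][0] == did and F[j][1] < order + size:
            pairs.append((e, F[j]))
            j += 1
    return pairs


# Hypothetical numbering for a document containing /A/A/B/B.
A = [("d1", 1, 20), ("d1", 4, 14)]
B = [("d1", 7, 8), ("d1", 10, 3)]
ab_pairs = ee_join(A, B)
print(len(ab_pairs))  # 4: both B records are read once for each A
```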
27. XISS - KC-join
- solves the Kleene closure of a subexpression
- input: a list of element records that fit the base case
- applies EE-join to the list recursively, stopping when the result list no longer grows
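The stop-when-nothing-grows loop is a fixpoint iteration. The sketch below is one interpretation, not the paper's algorithm: records are reduced to `(order, parent_order)` for same-named elements of a single document, the base case is one parent-child step between two records of the list, and a simple nested-loop join is swapped in for the sort-merge EE-join to keep the example short.

```python
def kc_join(records):
    """Return (ancestor_order, descendant_order) pairs linked by a chain
    of parent-child steps that stays inside the record list."""
    orders = {order for order, _ in records}
    # Base case: a single parent-child step between two listed records.
    base = {(parent, order) for order, parent in records if parent in orders}
    result = set(base)
    frontier = set(base)
    while frontier:
        # Extend the newest chains by one more base step.
        extended = {(a, c) for a, b in frontier for b2, c in base if b == b2}
        # Keep only pairs not seen before; stop when there are none.
        frontier = extended - result
        result |= frontier
    return result


# Hypothetical chain /A/A/A (orders 1, 2, 3) plus an unrelated A at order 9.
closure = kc_join([(1, None), (2, 1), (3, 2), (9, 5)])
print(sorted(closure))
```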
28. Index Fabric
- by Cooper et al., published in VLDB 2001 under the title "A Fast Index for Semistructured Data"
- has 2 subtypes: raw path index and refined path index
- uses the Patricia technique to compress the index
29. Index Fabric - General Idea
- it is a balanced, disk-based indexing structure built on Patricia
- each data node is associated with a key string, and this string is stored in the trie index for retrieval
- the layered approach to building the index bounds the number of disk pages accessed per query
30. Index Fabric - General Idea
- the raw path index answers absolute path queries
- the refined path index answers any predefined query
- the two differ in how the key is generated
31. Patricia
- Patricia: Practical Algorithm To Retrieve Information Coded in Alphanumeric
- by Morrison, in JACM 1968
- a space-efficient method to store and retrieve strings
- binary: uses bit comparisons and keeps a skip value in each internal node
32. Patricia
[Figure: a Patricia trie over the keys 101110, 101111, 110000, 110011, with edges labeled 0 and 1]
33. Patricia
- it is basically a trie with the single-child internal nodes removed
- search is done by
  - branching on the value of the bit at each skip position
  - retrieving the string at the leaf
  - comparing it with the query string
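The three search steps can be sketched on a hand-built trie over the four keys 101110, 101111, 110000, and 110011. The class and function names are assumptions, and each internal node stores the absolute position of the bit to test rather than a relative skip count, which is an equivalent simplification.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    key: str


@dataclass
class Inner:
    bit: int                      # position of the bit this node tests
    zero: Union["Inner", Leaf]    # subtree followed when the bit is 0
    one: Union["Inner", Leaf]     # subtree followed when the bit is 1


def search(node, query: str) -> bool:
    # Step 1: branch on the tested bit only; skipped bits are not examined.
    while isinstance(node, Inner):
        if node.bit >= len(query):
            return False
        node = node.one if query[node.bit] == "1" else node.zero
    # Steps 2 and 3: fetch the stored string and compare it in full.
    return node.key == query


# Bit 0 is shared by all keys, so the root tests bit 1 directly.
root = Inner(
    bit=1,
    zero=Inner(bit=5, zero=Leaf("101110"), one=Leaf("101111")),
    one=Inner(bit=4, zero=Leaf("110000"), one=Leaf("110011")),
)
print(search(root, "110011"), search(root, "111111"))
```

The final comparison is what rejects 111111: it follows the same branches as 110011 because the bits in which the two strings differ are never tested on the way down.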
34. Index Fabric - Balanced Trie
- the number of disk pages accessed per query is bounded by the number of layers in the layered index
- the idea is similar to that of a B-tree: the Patricia trie is decomposed into blocks, and an upper-layer trie is built to navigate between the blocks
35. Index Fabric - Balanced Trie
[Figure: a Patricia trie over the keys 101110, 101111, 110000, 110011, decomposed into blocks across Layer 0 and Layer 1]
36. Index Fabric - Balanced Trie
- There are 3 types of links in the balanced trie
  - far link: across layers, a result of branching
  - near link: within the same block, a result of branching
  - direct link: across layers, where the root nodes are the same
- Each query accesses 1 block in each layer
37. Index Fabric - Balanced Trie
- speeds up search by using traversals in the upper layers to skip nodes of the original trie
- the number of pages accessed is bounded
38. Index Fabric - Raw Path
- each data node is associated with a key
- key = path (encoded in designators) + value
- designators are special characters, each representing a name
- APE queries are translated into key prefixes and submitted to the index trie
39. Index Fabric - Raw Path
- Example
  - <invoice><buyer><name>HKU</name></buyer></invoice> is translated to IBNHKU (I, B, and N are the designators)
  - the query /invoice/buyer/name=HKU is translated to the query string IBNHKU
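The translation in this example, and the prefix search that follows it, can be sketched as below. Several things here are assumptions: the designator `S` for a `seller` element and the extra keys are invented sample data, a sorted list with `bisect` stands in for the Patricia-based index, and designators are plain letters even though a real encoding would have to keep them distinguishable from value characters.

```python
import bisect

# I, B, N follow the slide's example; S is a hypothetical extra designator.
DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N", "seller": "S"}


def encode_key(path: str, value: str = "") -> str:
    """Encode an absolute path as designators, then append the value."""
    steps = [step for step in path.split("/") if step]
    return "".join(DESIGNATORS[step] for step in steps) + value


def prefix_lookup(keys, prefix):
    """Return every key that starts with prefix (keys must be sorted)."""
    matches = []
    for key in keys[bisect.bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


keys = sorted([
    encode_key("/invoice/buyer/name", "HKU"),
    encode_key("/invoice/buyer/name", "MIT"),
    encode_key("/invoice/seller/name", "CUHK"),
])
print(prefix_lookup(keys, encode_key("/invoice/buyer/name", "HKU")))
print(prefix_lookup(keys, encode_key("/invoice/buyer/name")))
```

The second call omits the value, so the same prefix search returns every buyer name stored under that path.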
40. Index Fabric - Refined Path
- special designators can be assigned to special queries (which can be regular path expressions)
- e.g., if we define P as the path //buyer/name, then PHKU means the document contains a buyer/name with value HKU
- can answer any predefined RPE very quickly
41. Comparison
- XISS
  - can solve general RPEs
  - solves an APE by dividing it into steps
- Index Fabric
  - solves an RPE by compile-time expansion of the RPE or by a predefined Refined Path Index
  - solves an APE with a single index lookup