A Document-based Approach to Indexing XML Data

1
A Document-based Approach to Indexing XML Data
  • Ya-Hui Chang and Tsan-Lung Hsieh
  • Department of Computer Science
  • National Taiwan Ocean University
  • yahui_at_cs.ntou.edu.tw
  • Sept. 10th, 2002

2
Overview
  • XML introduction
  • Element block
  • Element tree
  • Two types of index structures
  • Document index
  • Element index
  • Experiment results
  • Conclusion

3
Element Block
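<Books>
  <Book>
    <Title>Principles of database systems</Title>
    <Author>
      <Lastname>Ullman</Lastname>
      <Firstname>Jeffrey</Firstname>
    </Author>
    <Publisher>Computer Science Press</Publisher>
    <Date>1999</Date>
    <Keyword>database</Keyword>
  </Book>
</Books>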
4
Element Tree
Example of Offset Blocks
5
The Query Processor
[Figure: a Query passes through three steps to produce the Result: Identifying Document (Document Index), Determining Position (Element Index), and Retrieving Data (XML Document)]
6
The Index Structures
  • Purpose
  • Providing efficient query processing over
    multiple XML documents
  • Two types
  • Document index
  • Representing the correspondence of document
    identifiers and element values
  • Element index
  • Representing the positions of elements

7
Document Index
  • Based on B-Tree
  • The size of each node is restricted by the order
  • The tree is balanced

[Figure: example B-tree with order = 5]
8
Document Index (cont)
  • Each node is represented as an XML document.
  • Search-key value is represented as the attribute
    key of the element Pointer, while the document
    identifier is represented as the content.



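[Figure: an index node stored as the XML file B3.bt, with Pointer elements holding identifiers such as B0001 and B0002, and its DTD, which declares the attribute key as CDATA #REQUIRED]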
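
The node representation on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the root element name `Node`, the helper names `build_node` and `lookup`, and the search-key values are hypothetical; the slide itself only names the `Pointer` element, its `key` attribute, and identifiers such as B0001.

```python
import xml.etree.ElementTree as ET

def build_node(entries):
    """Serialize one index node; entries is a list of (search_key, doc_id)."""
    node = ET.Element("Node")  # hypothetical root element name
    for search_key, doc_id in entries:
        # The search-key value goes into the attribute "key" ...
        pointer = ET.SubElement(node, "Pointer", {"key": search_key})
        # ... and the document identifier is the element content.
        pointer.text = doc_id
    return ET.tostring(node, encoding="unicode")

def lookup(node_xml, search_key):
    """Return the document identifiers stored under search_key in this node."""
    node = ET.fromstring(node_xml)
    return [p.text for p in node.findall("Pointer") if p.get("key") == search_key]

# Made-up search keys paired with the identifiers shown on the slide.
node_xml = build_node([("database", "B0001"), ("network", "B0002")])
print(lookup(node_xml, "database"))
```

Storing each node as a small XML document means the index can be read with the same parser as the data, at the cost of parsing a file per visited node.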
9
Element Index
  • The position information of elements is
    represented based on the order specified in DTD,
    or the element tree.
  • The element indexes are partitioned into offset
    blocks corresponding to element blocks to capture
    the nesting structures of elements.
  • It is named "offset" because we keep the relative
    positions of elements, which reduces the cost of
    maintenance.
  • Offset tuples constitute the offset block
  • the first component records the offset to the
    parent element
  • the last component records the pointer to the
    offset tuple for the next sibling element
  • the other components record the relative
    positions of sub-elements.
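
A small Python sketch of why relative positions keep maintenance cheap. The class `OffsetTuple` and the function `absolute_positions` are hypothetical names, the numbers are the Book offsets from the construction example later in the deck, and the nested Author component (which holds a pointer rather than an offset) is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OffsetTuple:
    parent_offset: int   # first component: offset from the parent element
    children: List[int]  # middle components: sub-element offsets relative to this element
    next_sibling: int    # last component: next sibling's tuple number (0 = none, an assumption)

def absolute_positions(parent_abs: int, t: OffsetTuple) -> List[int]:
    """Absolute start positions of the sub-elements of one element."""
    start = parent_abs + t.parent_offset
    return [start + c for c in t.children]

# Book starts 9 characters into Books; Title, Publisher, Date, Keyword follow.
book = OffsetTuple(parent_offset=9, children=[9, 145, 193, 213], next_sibling=0)
before = absolute_positions(0, book)

# Inserting 10 characters in front of Book changes one component only;
# the offsets of its sub-elements stay as they are.
book.parent_offset += 10
after = absolute_positions(0, book)
print(before, after)
```

With absolute positions, the same insertion would have required rewriting every stored position inside Book.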

10
Example of Offset Blocks
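  • Books: pointer (child link to Book1), null
  • Book1: Title1, pointer (child link to Author1),
    Publisher1, Date1, Keyword1, pointer (sibling
    link to Book2)
  • Author1: Lastname1, Firstname1, pointer (sibling
    link to Author2)
  • Author2: Lastname2, Firstname2, null
  • Book2: Title2, pointer (child link to Author3),
    Publisher2, Date2, Keyword2, null
  • Author3: Lastname3, Firstname3, null
[Figure: the corresponding element tree]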
11
Example of Retrieving Offsets
  • Suppose we plan to retrieve all the data
    corresponding to the path /Books/Book/Title.
  • Based on the element tree, Book is the first
    child of Books, and Title is the first child of
    Book.
  • This information tells us which components to
    retrieve in the offset tuples of Books and Book.

  • We also need to follow the sibling links.

12
Example of Retrieving Offsets (cont)
  • Suppose the input path is
    /Books/Book/Author/Lastname, where Book is the
    first child, Author is the second child and
    Lastname is the first child.
  • We need to process the sibling elements for both
    Author and Book.
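
The two retrieval examples can be traced with a short Python sketch. It is a reading of the slides, not the authors' code: the names `ELEMENT_TREE`, `OFFSET_TUPLES`, `sibling_chain` and `positions` are hypothetical, the tuple values are the final offset tuples from the construction example later in the deck, and a last component of 0 is assumed to mean "no next sibling".

```python
# Element tree (from the DTD): each internal element lists its children in order.
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}

# (offset to parent, one component per child ..., next-sibling tuple number)
OFFSET_TUPLES = [
    (0, 1, 0),                    # 0: Books
    (9, 9, 2, 145, 193, 213, 0),  # 1: Book
    (57, 12, 43, 0),              # 2: Author
]

def sibling_chain(first):
    """Yield tuple numbers along a sibling chain; 0 is treated as null."""
    t = first
    while True:
        yield t
        t = OFFSET_TUPLES[t][-1]
        if t == 0:
            return

def positions(path):
    """Absolute start-tag positions of all elements matching an absolute path."""
    steps = path.strip("/").split("/")
    # (tuple number, absolute position) of the elements matched so far
    current = [(0, OFFSET_TUPLES[0][0])]
    for parent, child in zip(steps, steps[1:]):
        # The element tree tells us which component of the parent tuple to read.
        component = ELEMENT_TREE[parent].index(child) + 1
        is_leaf = child not in ELEMENT_TREE
        matched = []
        for t, abs_pos in current:
            value = OFFSET_TUPLES[t][component]
            if is_leaf:
                matched.append((None, abs_pos + value))   # value is an offset
            else:
                for s in sibling_chain(value):            # value is a pointer
                    matched.append((s, abs_pos + OFFSET_TUPLES[s][0]))
        current = matched
    return [abs_pos for _, abs_pos in current]

print(positions("/Books/Book/Title"))
print(positions("/Books/Book/Author/Lastname"))
```

With a single Book and a single Author every sibling chain has length one here; with more siblings, `sibling_chain` would keep following the last component until it reaches the null value.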

13
Constructing Algorithm
  • Idea: perform a linear scan on the XML
    document, retrieving the absolute positions of
    all tags to calculate offsets.
  • Data structures used:
  • StartTagList: the sequence of start-tags and
    their absolute positions
  • EndTagList: the sequence of end-tags and their
    absolute positions
  • Stack: all unfinished elements; the top is the
    most recent one, which is also the parent of the
    current element
  • Each internal node of the element tree needs
    to record how many child nodes it has.
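
A minimal Python sketch of this construction, under stated assumptions. It merges the two tag lists by position instead of consuming them round by round as the slides do, treats 0 in the last component as "no next sibling", assumes the root element is an internal node, and records only the first occurrence of a repeated leaf element. The names `build_offset_tuples`, `START_TAGS`, `END_TAGS` and `ELEMENT_TREE` are hypothetical; the tag positions are the ones shown on the following slides.

```python
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}
START_TAGS = [("Books", 0), ("Book", 9), ("Title", 18), ("Author", 66),
              ("Lastname", 78), ("Firstname", 109), ("Publisher", 154),
              ("Date", 202), ("Keyword", 222)]
END_TAGS = [("Title", 62), ("Lastname", 104), ("Firstname", 138),
            ("Author", 150), ("Publisher", 198), ("Date", 218),
            ("Keyword", 248), ("Book", 257), ("Books", 266)]

def build_offset_tuples(start_tags, end_tags, tree):
    tuples = []
    stack = [("/", 0, -1)]  # (name, absolute position, tuple number)
    last_sibling = {}       # (parent tuple, name) -> latest tuple of that name
    events = sorted([(pos, "start", name) for name, pos in start_tags] +
                    [(pos, "end", name) for name, pos in end_tags])
    for pos, kind, name in events:
        if kind == "end":
            if name in tree:        # an internal element is finished
                _, _, t = stack.pop()
                tuples[t][-1] = 0   # null until a later sibling overwrites it
            continue
        parent_name, parent_pos, parent_t = stack[-1]
        offset = pos - parent_pos   # position relative to the parent
        slot = tree[parent_name].index(name) + 1 if parent_t >= 0 else None
        if name not in tree:        # leaf: store its offset in the parent tuple
            if tuples[parent_t][slot] is None:
                tuples[parent_t][slot] = offset
            continue
        # Internal element: one component per child, plus parent offset and sibling link.
        t = len(tuples)
        tuples.append([offset] + [None] * len(tree[name]) + [None])
        prev = last_sibling.get((parent_t, name))
        if prev is not None:
            tuples[prev][-1] = t         # sibling link from the previous element
        elif slot is not None:
            tuples[parent_t][slot] = t   # child link from the parent
        last_sibling[(parent_t, name)] = t
        stack.append((name, pos, t))
    return tuples

result = build_offset_tuples(START_TAGS, END_TAGS, ELEMENT_TREE)
print(result)
```

The child count recorded for each internal node is what sizes a new tuple (`len(tree[name])` components between the parent offset and the sibling link).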

14
Initial Data
  • StartTagList: ('Title', 18), ('Book', 9), ('Books', 0)
  • EndTagList: ('Firstname', 138), ('Lastname', 104),
    ('Title', 62)
  • Offset Tuples: (none)
  • Stack: (/, 0, -1)
15
Round 1
  • StartTagList: ('Title', 18), ('Book', 9), ('Books', 0)
  • EndTagList: ('Firstname', 138), ('Lastname', 104),
    ('Title', 62)
  • Offset Tuples: 0: (0, _, _)
  • Stack: ('Books', 0, 0), (/, 0, -1)
16
Round 2
  • StartTagList: ('Author', 66), ('Title', 18), ('Book', 9)
  • EndTagList: ('Firstname', 138), ('Lastname', 104),
    ('Title', 62)
  • Offset Tuples: 0: (0, 1, _); 1: (9, _, _, _, _, _, _)
  • Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
17
Round 3
  • StartTagList: ('Lastname', 78), ('Author', 66),
    ('Title', 18)
  • EndTagList: ('Firstname', 138), ('Lastname', 104),
    ('Title', 62)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, _, _, _, _, _)
  • Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
18
Round 4
  • StartTagList: ('Firstname', 109), ('Lastname', 78),
    ('Author', 66)
  • EndTagList: ('Author', 150), ('Firstname', 138),
    ('Lastname', 104)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, _, _, _, _);
    2: (57, _, _, _)
  • Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0),
    (/, 0, -1)
19
Round 5
  • StartTagList: ('Publisher', 154), ('Firstname', 109),
    ('Lastname', 78)
  • EndTagList: ('Author', 150), ('Firstname', 138),
    ('Lastname', 104)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, _, _, _, _);
    2: (57, 12, _, _)
  • Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0),
    (/, 0, -1)
20
Round 6
  • StartTagList: ('Date', 202), ('Publisher', 154),
    ('Firstname', 109)
  • EndTagList: ('Publisher', 198), ('Author', 150),
    ('Firstname', 138)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, _, _, _, _);
    2: (57, 12, 43, _)
  • Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0),
    (/, 0, -1)
21
Round 7
  • StartTagList: ('Keyword', 222), ('Date', 202),
    ('Publisher', 154)
  • EndTagList: ('Date', 218), ('Publisher', 198),
    ('Author', 150)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, _, _, _, _);
    2: (57, 12, 43, 0)
  • Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0),
    (/, 0, -1)
22
Round 8
  • StartTagList: ('Keyword', 222), ('Date', 202),
    ('Publisher', 154)
  • EndTagList: ('Keyword', 248), ('Date', 218),
    ('Publisher', 198)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, 145, _, _, _);
    2: (57, 12, 43, 0)
  • Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
23
Round 9
  • StartTagList: ('Keyword', 222), ('Date', 202)
  • EndTagList: ('Books', 266), ('Book', 257),
    ('Keyword', 248), ('Date', 218)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, 145, 193, _, _);
    2: (57, 12, 43, 0)
  • Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
24
Round 10
  • StartTagList: ('Keyword', 222)
  • EndTagList: ('Books', 266), ('Book', 257),
    ('Keyword', 248)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, 145, 193, 213, _);
    2: (57, 12, 43, 0)
  • Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
25
Round 11
  • StartTagList: (empty)
  • EndTagList: ('Books', 266), ('Book', 257)
  • Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, 145, 193, 213, 0);
    2: (57, 12, 43, 0)
  • Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
26
Round 12
  • StartTagList: (empty)
  • EndTagList: ('Books', 266)
  • Offset Tuples: 0: (0, 1, 0); 1: (9, 9, 2, 145, 193, 213, 0);
    2: (57, 12, 43, 0)
  • Stack: ('Books', 0, 0), (/, 0, -1)
27
Final Data
  • StartTagList: (empty)
  • EndTagList: (empty)
  • Offset Tuples: 0: (0, 1, 0); 1: (9, 9, 2, 145, 193, 213, 0);
    2: (57, 12, 43, 0)
  • Stack: (/, 0, -1)
28
Performance Evaluation
  • Comparison with DOM: showing the efficiency of
    utilizing the pre-built element index
  • DOM (Document Object Model): a tree-based parsing
    mechanism where each element is a node
  • Using the Microsoft MSXML 3.0 DOM API
  • Construction of the cost model: showing the
    scalability of our indexing scheme
  • Comparison with Lore: showing the performance of
    the whole query processor
  • Lore: a specialized database system for
    semi-structured/XML data

29
Comparison with DOM
30
Cost Model
  • The I/O cost consists of processing the following
    four portions of data
  • The internal nodes of the document index
  • The leaf nodes of the document index
  • The offset blocks
  • The XML files
  • The cost model is as follows

31
Experiment Setups
32
Experiment Data
33
Queries to Compare with Lore
34
Experiment Results
35
Conclusions
  • Summary
  • We construct a query processor to retrieve data
    from multiple XML documents, which utilizes two
    index structures
  • the document index could quickly identify the
    required document
  • the maintainable element index could quickly
    determine the precise location of desired data
  • Experiment results show the efficiency of our
    approach.
  • Future work
  • Supporting more complicated queries
  • Improving space utilization