Efficient XML Storage, Query, and Update

1
Efficient XML Storage, Query, and Update
  • Shi Xu
  • Heng Yuan
  • Spring 2004 CS240B
  • Prof. Zaniolo

2
XML Storage Methods
  • Flat Streams
  • Metamodeling
  • Mixed
  • Redundant
  • Hybrid

3
Method Covered
  • "Efficient Storage of XML Data" covers the hybrid
    method, using a custom-made storage system called
    Natix.
  • "Efficient Relational Storage and Retrieval of
    XML Documents" covers metamodeling, using the
    Monet database.

4
Natix Overview
  • Natix is an efficient, native repository for
    storing, retrieving and managing XML documents.
  • It supports tree-structured objects like XML
    documents at a low architectural level.

5
Natix architectural overview
6
Logical Model
  • Trees are often used as the logical model of
    semistructured data.
  • Each non-leaf node is labeled with a symbol taken
    from an alphabet Σ_DTD.
  • Leaf nodes can be labeled with the data itself.

7
A sample XML with its associated logical tree
Example XML:
<SPEECH>
  <SPEAKER>OTHELLO</SPEAKER>
  <LINE>Let me see your eyes</LINE>
  <LINE>Look in my face.</LINE>
</SPEECH>
8
Physical Model
  • Object Content
  • The terms node and object are used interchangeably.
  • A record contains a set of nodes/objects.
  • Aggregate nodes are inner nodes of the tree.
    They contain their respective child nodes.
  • Literal nodes are leaf nodes containing an
    uninterpreted stream of bytes, like text strings,
    graphics, etc.
  • Proxy nodes are nodes which point to different
    records.
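
As an illustration of the three node kinds above, the following Python sketch models aggregate, literal, and proxy nodes and reads a subtree that spans two records. The class names, the integer record ids, and the `resolve` and `text_of` helpers are assumptions made for this example; they are not Natix's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Literal:
    """Leaf node holding an uninterpreted stream of bytes."""
    data: bytes


@dataclass
class Proxy:
    """Node that points to a different record (hypothetical integer id)."""
    record_id: int


@dataclass
class Aggregate:
    """Inner node that contains its child nodes."""
    label: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Literal, Proxy, Aggregate]


@dataclass
class Record:
    """A record stores exactly one subtree, identified by its root node."""
    record_id: int
    root: Node


def resolve(node: Node, records: Dict[int, Record]) -> Node:
    """Follow proxies until a literal or aggregate node is reached."""
    while isinstance(node, Proxy):
        node = records[node.record_id].root
    return node


def text_of(node: Node, records: Dict[int, Record]) -> str:
    """Concatenate all literal data below a node, crossing record boundaries."""
    node = resolve(node, records)
    if isinstance(node, Literal):
        return node.data.decode("utf-8")
    return "".join(text_of(child, records) for child in node.children)


# The SPEECH subtree lives in record 1; its second LINE was moved to record 2.
records = {
    2: Record(2, Aggregate("LINE", [Literal(b"Look in my face.")])),
}
records[1] = Record(1, Aggregate("SPEECH", [
    Aggregate("SPEAKER", [Literal(b"OTHELLO")]),
    Aggregate("LINE", [Literal(b"Let me see your eyes")]),
    Proxy(2),
]))

speech_text = text_of(records[1].root, records)
```

Here the reader of the logical tree never sees the proxy: `text_of` crosses into record 2 transparently, which is the role proxies play in the physical model.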

9
Node Representation
  • Whole documents (or subtrees of documents) can be
    stored in one record.
  • Each record contains exactly one subtree.
  • The root node of each record's subtree is called
    a standalone object; the other nodes are called
    embedded objects.
  • The record size has an upper limit: the page size.

10
Large Trees
  • For a large tree, the physical model must provide
    a mechanism for distributing data trees over
    several pages.
  • Method 1: flat representation. It wastes the
    available structural information about the data.
  • Method 2: split large objects based on the
    underlying tree structure.
  • Use proxy objects to connect subtrees of the
    large object residing in other records.

11
A Sample Distribution of logical nodes on records
  • Proxies (p1, p2)
  • Helper aggregate objects (h1, h2)
  • Scaffolding objects include proxies and helper
    aggregates.
  • Facade objects (f_i)

12
Dynamic maintenance of an efficient storage
  • The principal problem is that a record containing
    a subtree can grow larger than a page if a node
    is added or grows.
  • The subtree contained in the record then has to
    be partitioned into several subtrees.
  • Scaffolding nodes link the new records together
    in the physical tree.

13
Multiway tree representation of records
14
Tree Growth Procedure
  • Step 1: Determine the record r into which the
    node has to be inserted.
  • Step 2: If there is not enough space on the page,
    try to move r. If the record still does not fit,
    split the record:
  • (a) Determine the separator by recursively
    descending into r's subtree.
  • (b) Distribute the resulting partitions onto
    records.
  • (c) Insert the separator into the parent record,
    recursively calling this procedure.
  • Step 3: Insert the new node.

15
Determining the Insertion Location
  • There are several possible locations at which to
    insert a new node f_n into the physical tree.
  • The choice is determined by a configuration
    parameter.

16
Determining the separator
  • Separator: a tree structure with proxies pointing
    to the new records, indicating which part of the
    old record was moved where.
  • It consists of all the nodes on the path from the
    node d to the subtree's root.
  • Partition the tree into left partition L, right
    partition R and Separator S.
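
The partitioning around d can be sketched in Python as follows. The tree is given as a hypothetical child map with unique node names. The slide does not spell out whether d itself belongs to S or to a partition, so this sketch assumes that d stays with its subtree in R and that S holds only the ancestors of d.

```python
def path_to(tree, root, target):
    """Return the node names from root down to target, or None if absent."""
    if root == target:
        return [root]
    for child in tree.get(root, []):
        sub = path_to(tree, child, target)
        if sub is not None:
            return [root] + sub
    return None


def subtree(tree, node):
    """Return all node names in the subtree rooted at node, in pre-order."""
    out = [node]
    for child in tree.get(node, []):
        out.extend(subtree(tree, child))
    return out


def partition(tree, root, d):
    """Split the tree into (L, S, R) around node d.

    S: ancestors of d (assumption: d itself is not part of S).
    L: subtrees hanging off the path to the left of it.
    R: everything else, i.e. d's subtree and subtrees right of the path.
    """
    path = path_to(tree, root, d)
    if path is None:
        raise ValueError(f"{d!r} is not in the tree")
    separator = set(path[:-1])
    left = set()
    for parent, on_path in zip(path, path[1:]):
        for sibling in tree[parent]:
            if sibling == on_path:
                break
            left.update(subtree(tree, sibling))
    order = subtree(tree, root)  # report each part in document order
    L = [n for n in order if n in left]
    S = [n for n in order if n in separator]
    R = [n for n in order if n not in left and n not in separator]
    return L, S, R


# Hypothetical record subtree: f1 has children f2, f3, f4; f3 has f5, f6, f7.
tree = {"f1": ["f2", "f3", "f4"], "f3": ["f5", "f6", "f7"]}
L, S, R = partition(tree, "f1", "f6")
```

With d = f6, the nodes f2 and f5 fall to the left of the path, f1 and f3 form the separator, and f6, f7, and f4 make up the right partition.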

17
A record's subtree before a split occurs
18
Splitting a Record
  • Distributing the nodes on records
  • After determining the partitioning, the contents
    of the record have to be distributed onto new
    records.
  • Each resulting subtree is then stored in its own
    record, called a partition record.
  • Inserting the separator
  • The separator is moved to the parent record.

19
Split Algorithm
  • Find a node d such that the resulting L and R
    have the desired size ratio.
  • The ratio between the sizes of L and R is
    determined by a configuration parameter (the
    split target).
  • Another configuration parameter, the split
    tolerance, specifies the minimum size for the
    subtree of d.
    It is used to prevent fragmentation.
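
One way the choice of d could work is sketched below in Python. Everything here is an assumption made for illustration: node sizes are given in bytes in a hypothetical `sizes` map, the search descends into the child whose subtree covers the byte position named by the split target, and it stops descending when the next candidate's subtree would be smaller than the split tolerance. The slides do not give the exact rule Natix uses.

```python
def subtree_size(tree, sizes, node):
    """Total size in bytes of the subtree rooted at node."""
    return sizes[node] + sum(subtree_size(tree, sizes, c)
                             for c in tree.get(node, []))


def child_containing(tree, sizes, node, before, goal):
    """Find the child of node whose subtree covers byte position goal.

    `before` is the number of bytes that precede node in document order.
    Returns (child, bytes_before_child), or None if no child covers goal.
    """
    before += sizes[node]  # the node itself precedes its children
    for child in tree.get(node, []):
        size = subtree_size(tree, sizes, child)
        if before + size > goal:
            return child, before
        before += size
    return None


def choose_separating_node(tree, sizes, root,
                           split_target=0.5, split_tolerance=0):
    """Pick d by descending toward the split target (illustrative rule)."""
    goal = split_target * subtree_size(tree, sizes, root)
    hit = child_containing(tree, sizes, root, 0, goal)
    if hit is None:
        return None  # the root has no children, so nothing can be split
    d, before = hit
    while True:
        hit = child_containing(tree, sizes, d, before, goal)
        # Stop at a leaf, or when going deeper would make d's subtree
        # smaller than the split tolerance.
        if hit is None or subtree_size(tree, sizes, hit[0]) < split_tolerance:
            return d
        d, before = hit


# Hypothetical record subtree with per-node sizes in bytes (160 in total).
tree = {"f1": ["f2", "f3", "f4"], "f3": ["f5", "f6", "f7"]}
sizes = {"f1": 10, "f2": 30, "f3": 10, "f4": 30,
         "f5": 20, "f6": 40, "f7": 20}

d_fine = choose_separating_node(tree, sizes, "f1")
d_coarse = choose_separating_node(tree, sizes, "f1", split_tolerance=50)
```

With a split target of 0.5 the byte midpoint (80) lies inside f6, so f6 is chosen when no tolerance applies. Raising the tolerance to 50 bytes rules out f6 (40 bytes) and the search stops one level higher at f3, which keeps small subtrees together and so limits fragmentation.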

20
Record assembly for the subtree from previous
figure
21
Physical storage of the tree represented inside
one record
22
Performance Test
  • XML markup version of Shakespeare's plays: 8 MB
    with 320,000 nodes.
  • Pentium II at 333 MHz with 128 MB of memory under
    Windows NT 4.0, with an IBM DCAS 34330 disk.
  • The implementation of the record and tree storage
    managers was done in C.

23
Test Conditions
  • Record:Node 1:1, indicating that smart record
    splitting is inhibited.
  • Record:Node 1:n, indicating that the algorithm
    has full control over the distribution of nodes
    onto records.
  • Incremental updates distributed over the whole
    document.
  • Updates in pre-order (append).

24
Insertion
25
Full tree traversal
26
Queries
  • Retrieve all speakers in the third act and second
    scene of every play, which means it accesses all
    leaf nodes of a certain type in one selected
    subtree of the document.
  • Recreate the textual representation of the
    complete first speech in every scene, hence
    reading a lot of small contiguous fragments of
    each document.
  • A simple path query was evaluated by reading only
    the opening speech of each play.
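
To make the three access patterns concrete, here is a small Python sketch that runs them against a synthetic play using the standard library's `xml.etree.ElementTree`. The `PLAY`, `ACT`, and `SCENE` element names and the generated speaker and line texts are assumptions for this example, and the code only illustrates what each query reads; it says nothing about how Natix evaluates them.

```python
import xml.etree.ElementTree as ET


def build_play(acts=3, scenes=2, speeches=2):
    """Build a tiny synthetic play: acts > scenes > speeches."""
    play = ET.Element("PLAY")
    for a in range(1, acts + 1):
        act = ET.SubElement(play, "ACT")
        for s in range(1, scenes + 1):
            scene = ET.SubElement(act, "SCENE")
            for k in range(1, speeches + 1):
                speech = ET.SubElement(scene, "SPEECH")
                ET.SubElement(speech, "SPEAKER").text = f"speaker-{a}-{s}-{k}"
                ET.SubElement(speech, "LINE").text = f"line-{a}-{s}-{k}"
    return play


play = build_play()

# Query 1: all speakers in the third act, second scene
# (all leaf nodes of one type inside one selected subtree).
scene = play.find("ACT[3]/SCENE[2]")
speakers = [sp.text for sp in scene.iter("SPEAKER")]

# Query 2: textual representation of the first speech in every scene
# (many small contiguous fragments).
first_speeches = [ET.tostring(sc.find("SPEECH"), encoding="unicode")
                  for sc in play.iter("SCENE")]

# Query 3: a single path to the opening speech of the play.
opening = play.find("ACT/SCENE/SPEECH")
```

Query 1 touches one subtree, query 2 touches a small fragment in each scene, and query 3 follows one root-to-node path, which is why the three stress a storage layout in different ways.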

27
Selection on leaf nodes of document subtree
28
Small contiguous fragments
29
Single path for each document
30
Space requirements
31
Monet Model
  • An XML document is decomposed into binary
    relations.
  • Efficient for storage and retrieval of XML
    documents in a relational database.
  • The database used is the authors' Monet database
    server, which supports the Monet model.

32
Some Definitions
  • An XML document is a rooted tree d = (V, E, r,
    label_E, label_A, rank) with nodes V, edges
    E ⊆ V × V, and a distinguished node r ∈ V.
  • The function label_E : V → string assigns labels
    to nodes.
  • label_A : V → string → string assigns pairs of
    strings, attributes and their values, to nodes.
  • rank : V → int establishes a ranking to allow for
    an order among nodes with the same parent node.

33
A sample XML document
<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>
34
Syntax Tree of the Previous XML Document
35
Monet Transform
  • Given an XML document d, the Monet transform is a
    quadruple M_t(d) = (r, R, A, T), where
  • R is the set of binary relations that contain all
    associations between nodes
  • A is the set of binary relations that contain all
    associations between nodes and their attribute
    values, including character data
  • T is the set of binary relations that contain all
    pairs of nodes and their rank
  • r is the root of the document
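
A minimal Python sketch of such a decomposition, applied to the sample bibliography document, is given below. Several details are assumptions for illustration only: relations are named by the path from the root, character data is stored under a hypothetical `/cdata` suffix, object ids are consecutive integers, and text that follows a child element (mixed content) is ignored. The actual Monet XML layout may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

DOC = """<bibliography>
<article key="BB88"><author>Ben Bit</author><title>How to Hack</title></article>
<article key="BK99"><editor>Ed Itor</editor><author>Bob Byte</author>
<author>Ken Key</author><title>Hacking RSI</title></article>
</bibliography>"""


def monet_transform(xml_text):
    """Decompose a document into (r, R, A, T) with path-named relations."""
    R = defaultdict(list)  # path -> [(parent oid, child oid)]
    A = defaultdict(list)  # path -> [(oid, attribute value or text)]
    T = defaultdict(list)  # path -> [(oid, rank among siblings)]
    ids = count(1)

    def visit(elem, oid, path):
        for name, value in elem.attrib.items():
            A[f"{path}/@{name}"].append((oid, value))
        text = (elem.text or "").strip()
        if text:
            A[f"{path}/cdata"].append((oid, text))
        for rank, child in enumerate(elem, start=1):
            child_oid = next(ids)
            child_path = f"{path}/{child.tag}"
            R[child_path].append((oid, child_oid))
            T[child_path].append((child_oid, rank))
            visit(child, child_oid, child_path)

    root = ET.fromstring(xml_text)
    r = next(ids)
    visit(root, r, root.tag)
    return r, dict(R), dict(A), dict(T)


r, R, A, T = monet_transform(DOC)

# Path query: names of the authors of the article with key "BK99".
article = next(o for o, k in A["bibliography/article/@key"] if k == "BK99")
author_ids = {c for p, c in R["bibliography/article/author"] if p == article}
names = [v for o, v in A["bibliography/article/author/cdata"]
         if o in author_ids]
```

Because every relation is keyed by a path, a path expression only needs to read the relations on that path; the query at the end touches three of them and never scans titles or editors.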

36
Monet Transform of the Example Document
37
OQL-like query
38
Query Handling
39
Assessment
  • Implemented within the Monet database server
  • Tested on 550 MHz Silicon Graphics 1400 Server
    with 1 GB main memory.
  • Also used Sun UltraSparc-IIi with 360 MHz and 256
    MB main memory to contrast with a related work.

40
Size of document collections in XML and Monet XML
format
41
Scaling of Document
  • Scaled the ACM Anthology from 30 to 3×10^6, which
    corresponds to XML source sizes between 10 KB and
    1 GB.
  • Ran four queries consisting of path expressions
    of length 1 through 4 for various sizes of the
    anthology.

42
Response Time vs. Result Size
43
Comparison of response times for the query set of
SYU, another method for storage and retrieval of
XML documents.
44
Compare/Contrast Natix and Monet
  • Natix uses a custom database, while Monet is
    built on top of a relational database.
  • Neither uses a DTD.
  • Natix focuses on XML query as well as update.
  • Monet focuses on XML storage and query.
  • Although no equivalent test is available, Monet
    appears to be faster than Natix on queries.
  • Monet seems to be more space efficient than Natix
    as well.

45
References
  • Efficient Storage of XML Data, by Carl-Christian
    Kanne, et al. ICDE 2000.
    http://citeseer.nj.nec.com/kanne99efficient.html
  • Efficient Relational Storage and Retrieval of
    XML Documents, by Albrecht Schmidt, et al. WebDB
    2000.
    http://www.research.att.com/conf/webdb2000/program.html