Anatomy of a Native XML Database

1
Anatomy of a Native XML Database
  • George Feinberg
  • Sleepycat Software
  • gmf@sleepycat.com
  • http://www.sleepycat.com

2
Anatomy 101
  • XML database processing model
  • Storage and retrieval
  • Querying
  • Indexes
  • Storage format details

3
XML Database Processing Model
  • Store XML
  • Query XML
  • Retrieve/process results
  • Driven by XML specifications
  • Driven by XML applications

4
XML Database Processing
  [Diagram: store, query, retrieve, and indexes]

5
What to Store?
  • XML documents
  • DOM nodes
  • XML document fragments
  • Another data model (XQuery)

6
Example Storage Options
<root> <a>x</a> <b>y</b> </root>
[Diagram: storage options for this document, with labels root, root, a, b, <a>x</a>, <b>y</b>, x, y]

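The three storage options can be sketched in a few lines. This is a minimal illustration, not the layout of any particular product: the dictionary and list layouts, the `doc1` identifier, and the `walk` helper are all assumptions made for the example, and the whitespace between elements is dropped to keep the node list short.

```python
import xml.etree.ElementTree as ET

DOC = "<root><a>x</a><b>y</b></root>"

# Option 1: whole document, stored as one opaque string.
whole_document = {"doc1": DOC}

# Option 2: document fragments, one serialized child element per record.
root = ET.fromstring(DOC)
fragments = {
    ("doc1", position): ET.tostring(child, encoding="unicode")
    for position, child in enumerate(root)
}

# Option 3: individual nodes (elements and text), one record per node.
nodes = []

def walk(elem, parent_id):
    """Append one record for elem, one for its text, then recurse."""
    node_id = len(nodes)
    nodes.append({"id": node_id, "parent": parent_id,
                  "kind": "element", "name": elem.tag})
    if elem.text:
        nodes.append({"id": len(nodes), "parent": node_id,
                      "kind": "text", "value": elem.text})
    for child in elem:
        walk(child, node_id)

walk(root, None)
```

The finer the granularity, the more records one document produces: one for option 1, two for option 2, five for option 3.
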
7
Storage depends on query processing
  • Query Processing Overview
    • Operates on some data model
      • DOM
      • Infoset
      • XQuery
      • Other
    • XML must be turned into a data model for
      processing (i.e. it must be parsed)
    • Uses indexes for speed
    • Indexes target documents or nodes?

8
Storage depends on retrieval and application
processing
  • Intact XML Documents
  • Document fragments
  • DOM
  • Pipelining
  • Read-only vs modification

9
Storage Choices
  • XML documents
  • Document fragments
  • Nodes
  • DOM
  • Other
  • Something else
  • What is the data model???

10
Whole Document Storage
  • Advantages
    • 100% round-tripping
    • Good if the whole document is what gets retrieved
    • Query overhead is reasonable for small documents

11
Whole Document Storage
  • Disadvantages
    • Parsing required for queries
    • Parsing (probably) materializes many nodes
      irrelevant to the query
    • Partial retrievals require parsing the entire
      document
    • Cannot perform partial updates
    • Huge documents may be difficult to store
      (requires streaming)
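A small sketch of the trade-off in whole-document storage, assuming a plain dictionary as the store and two hypothetical helper functions. Returning the document is a lookup, while returning one element means parsing everything first.

```python
import xml.etree.ElementTree as ET

# Hypothetical store: whole documents kept as strings.
store = {"doc1": "<root><a>x</a><b>y</b></root>"}

def retrieve_document(doc_id):
    """Whole-document retrieval: hand back the stored string unchanged."""
    return store[doc_id]

def retrieve_partial(doc_id, path):
    """Partial retrieval: the entire document is parsed (materializing
    every node) before the matching elements can be serialized."""
    root = ET.fromstring(store[doc_id])
    return [ET.tostring(elem, encoding="unicode")
            for elem in root.findall(path)]

partial = retrieve_partial("doc1", "b")
```
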

12
Materialize on Demand
13
Node Storage
  • Advantages
    • Query data model is available (no parsing)
    • Only nodes relevant to the query and to
      post-query processing need to be materialized
    • Indexes can point directly to target nodes
    • Efficient partial retrieval
    • Can be partially updated
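A minimal sketch of node storage, assuming node records keyed by (document id, node id) in a dictionary. The record fields, `name_index`, `text_of`, and `update_text` are hypothetical names chosen for the illustration. The index points straight at nodes, and one text node is updated without touching the rest of the document.

```python
# Hypothetical node records for <root><a>x</a><b>y</b></root>.
node_store = {
    ("doc1", 1): {"kind": "element", "name": "root", "parent": None},
    ("doc1", 2): {"kind": "element", "name": "a", "parent": 1},
    ("doc1", 3): {"kind": "text", "value": "x", "parent": 2},
    ("doc1", 4): {"kind": "element", "name": "b", "parent": 1},
    ("doc1", 5): {"kind": "text", "value": "y", "parent": 4},
}

# Index from element name directly to node keys (not to documents).
name_index = {}
for key, node in node_store.items():
    if node["kind"] == "element":
        name_index.setdefault(node["name"], []).append(key)

def text_of(doc_id, element_id):
    """Concatenate the text children of one element, in node-id order."""
    return "".join(
        node_store[key]["value"]
        for key in sorted(node_store)
        if key[0] == doc_id
        and node_store[key]["kind"] == "text"
        and node_store[key]["parent"] == element_id
    )

def update_text(doc_id, text_node_id, new_value):
    """Partial update: rewrite a single text node in place."""
    node_store[(doc_id, text_node_id)]["value"] = new_value

b_key = name_index["b"][0]          # index lookup lands on the node itself
before = text_of(*b_key)
update_text("doc1", 5, "z")
after = text_of(*b_key)
```
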

14
Node Storage
  • Disadvantages
    • Slower document insertion
    • Possible inflation in storage size
    • Slower to serialize entire document
    • Difficult to get 100%, byte-for-byte
      round-tripping

15
Node Storage
  • Issues
    • Data model choice
    • Format/granularity
    • Node numbering
    • Updates

16
Data Model
  • Infoset
    • Structural information, not typed
  • XQuery
    • Structure and type (schema-aware)
  • Semi-structured data
    • Mapping required to target query languages
  • Driven by query language(s)
  • Part of what makes an XML database native

17
Typed Data Model Issues
  • Type information and schema
    • Available during parse
    • Where does it go for node storage?
      • Reference/reload on query?
      • Store type info?
  • Loading/parsing type information can be expensive

18
Format and Granularity
  • Trade-off
    • Degree of addressability and update
    • Cost of storage/retrieval
  • Some options
    • DOM objects (too fine-grained)
    • Elements, attributes, text
    • Mixed (some parsing required)

19
Node Numbers
  • Required for index targets
  • Useful for comparisons
    • Document order
    • Implicit sibling, ancestor/descendant
      relationships
  • Issue for updates -- renumbering
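One common way to make node numbers carry document order and ancestor/descendant relationships is interval numbering: each element gets a (start, end) pair from a depth-first walk. The sketch below assumes that scheme, which is not necessarily the one the presenter had in mind, and keys the result by tag name on the assumption that tags are unique in the sample document.

```python
import xml.etree.ElementTree as ET

def number_nodes(root):
    """Assign each element a (start, end) interval from a depth-first walk."""
    intervals = {}
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        start = counter
        for child in elem:
            visit(child)
        counter += 1
        intervals[elem.tag] = (start, counter)  # assumes unique tag names

    visit(root)
    return intervals

def is_ancestor(a, b):
    """a is an ancestor of b when a's interval strictly contains b's."""
    return a[0] < b[0] and b[1] < a[1]

def precedes(a, b):
    """a comes before b in document order when it starts earlier."""
    return a[0] < b[0]

intervals = number_nodes(ET.fromstring("<root><a><c/></a><b/></root>"))
```

Both comparisons use only the numbers, so no tree traversal is needed at query time. The cost shows up on updates: inserting a node can shift every later interval.
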

20
Node Renumbering
[Diagram: node trees with nodes numbered 1 to 4 and one node marked "?"]
Intelligent node numbering allows for
insertions and deletions of nodes in document
order
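As an illustration of numbering that survives insertions, the sketch below assigns a new node the midpoint between its neighbours, using exact fractions so that a gap always exists. This is one possible scheme chosen for the example; the slides do not say which scheme the presenter means, and `number_between` is a hypothetical helper.

```python
from fractions import Fraction

def number_between(left, right):
    """Return a node number strictly between two existing numbers."""
    return (left + right) / 2

# Existing nodes numbered 1..4 in document order.
order = [Fraction(n) for n in (1, 2, 3, 4)]

# Insert a new node between nodes 2 and 3 without renumbering anything.
new_number = number_between(order[1], order[2])
order.append(new_number)
order.sort()
```

Deleting a node simply removes its number. The numbers of the remaining nodes never change, so index entries that point at them stay valid.
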
21
Indexes
  • Critical for performance
  • Can target documents or nodes
  • Types of indexes
    • Structural
    • Value

22
Structural Indexes
  • Path
    • Find all elements, /a/b/c
    • Existence vs value
  • Edge
    • Index all paths (partial paths) to c
    • Includes b/c, x/c
    • More general than path, more space
  • Some systems implicitly index based on structure
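The difference between the two structural indexes can be sketched as follows. The interpretation is an assumption: the path index is keyed by the full path from the root, and the edge index by every partial path (suffix) ending at the element. `build_indexes` and the sample document are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_indexes(root):
    path_index = defaultdict(list)  # full path from the root -> elements
    edge_index = defaultdict(list)  # every partial path (suffix) -> elements

    def visit(elem, steps):
        steps = steps + [elem.tag]
        path_index["/" + "/".join(steps)].append(elem)
        for i in range(len(steps)):
            edge_index["/".join(steps[i:])].append(elem)
        for child in elem:
            visit(child, steps)

    visit(root, [])
    return path_index, edge_index

doc = ET.fromstring("<a><b><c>1</c></b><x><c>2</c></x></a>")
path_index, edge_index = build_indexes(doc)

c_under_b = [e.text for e in path_index["/a/b/c"]]  # exact path only
all_c = [e.text for e in edge_index["c"]]           # any path ending in c
```

On this document the path index holds 5 keys and the edge index 10, which is the extra space the slide refers to.
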

23
Value Indexes
  • Typed (especially for XPath 2.0, XQuery)
  • Support value comparisons
    • Equality -- //color[.="green"]
    • Range -- //size[.<50]
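A minimal sketch of a typed value index, assuming a sorted list of (typed value, node id) pairs searched with binary search. The `ValueIndex` class and the node ids are hypothetical. The cast matters for range queries: compared as strings, "100" sorts before "50", while compared as numbers it does not.

```python
import bisect

class ValueIndex:
    def __init__(self, cast):
        self.cast = cast   # converts raw text to the indexed type
        self.entries = []  # sorted list of (typed value, node id)

    def add(self, raw_text, node_id):
        bisect.insort(self.entries, (self.cast(raw_text), node_id))

    def equal(self, value):
        """Node ids whose value equals the given value."""
        value = self.cast(value)
        start = bisect.bisect_left(self.entries, (value,))
        result = []
        for entry_value, node_id in self.entries[start:]:
            if entry_value != value:
                break
            result.append(node_id)
        return result

    def less_than(self, limit):
        """Node ids whose value is strictly below the limit."""
        end = bisect.bisect_left(self.entries, (self.cast(limit),))
        return [node_id for _, node_id in self.entries[:end]]

size_index = ValueIndex(float)
for raw, node_id in (("9", 1), ("50", 2), ("100", 3)):
    size_index.add(raw, node_id)

color_index = ValueIndex(str)
color_index.add("green", 7)
color_index.add("red", 8)

small = size_index.less_than(50)    # range lookup
green = color_index.equal("green")  # equality lookup
```
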

24
Index Issues
  • Cost in space
  • Cost in updates
    • Document modification
    • Document insertion
    • Document removal (often forgotten)
  • Indexes require careful consideration
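The update cost can be made concrete with a small sketch. It assumes a document-targeted index from element name to document ids; the `IndexedStore` class is hypothetical. Every insertion, modification, and removal has to touch the index as well as the document, and removal is the step that is easy to leave out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

class IndexedStore:
    def __init__(self):
        self.documents = {}
        self.name_index = defaultdict(set)  # element name -> document ids

    def _names(self, xml_text):
        return {elem.tag for elem in ET.fromstring(xml_text).iter()}

    def insert(self, doc_id, xml_text):
        self.documents[doc_id] = xml_text
        for name in self._names(xml_text):
            self.name_index[name].add(doc_id)

    def remove(self, doc_id):
        # Without this cleanup the index would keep pointing at a
        # document that no longer exists.
        xml_text = self.documents.pop(doc_id)
        for name in self._names(xml_text):
            self.name_index[name].discard(doc_id)
            if not self.name_index[name]:
                del self.name_index[name]

    def modify(self, doc_id, xml_text):
        # Simplest correct approach: drop the old entries, add the new.
        self.remove(doc_id)
        self.insert(doc_id, xml_text)

store = IndexedStore()
store.insert("d1", "<root><a/></root>")
store.insert("d2", "<root><b/></root>")
store.remove("d1")
```
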

25
Summary
  • Native XML databases store/query/retrieve XML
  • Many options in design and implementation
  • Native XML databases are driven by XML and XML
    applications
  • Even so, one size does not fit all