Anatomy of a Native XML Base Management System

1
Anatomy of a Native XML Base Management System
  • By Yaojun Wu

2
Outline
  • Background
  • Approaches based on traditional database
    management systems
  • Introduction of Natix
  • System Architecture and Major components of Natix
  • Architecture of the system
  • The storage engine
  • Transaction management
  • Query execution engine
  • Conclusion

3
Approaches Based on Traditional DBMS
  • Manage large XML document collections based on
    traditional database management systems
  • Storing XML in relational DBMSs or
    object-oriented DBMSs
  • Drawbacks of mapping XML documents onto other
    data models
  • Have to decide on the actual schema
  • For the document-centric view, all information
    of one document is retained in a single data
    item, so manipulating a fragment of a document
    requires reading and parsing the whole document
    each time
  • For the data-centric view, each document is
    broken down into small parts, so importing or
    exporting a whole document becomes a
    time-consuming task.

4
Introduction of Natix
  • Natix is a native XML base management system
    (XBMS) designed for storing and processing XML
    data.
  • Requirements for an XBMS
  • To store documents effectively and to support
    efficient retrieval and update of these documents
    or parts of them
  • To support standardized declarative query
    languages like XPath and XQuery
  • To support standardized application programming
    interfaces like SAX and DOM
  • To provide a safe multi-user environment
    including recovery and synchronization of
    concurrent access.

5
Architecture
  • Natix components form three layers
  • Storage Layer
  • Contains classes for efficient XML storage,
    indexes and metadata storage. It also manages the
    storage of the recovery log and controls the
    transfer of data between main and secondary
    storage.
  • Service Layer
  • Provides all DBMS functionality required in
    addition to simple storage and retrieval
  • The database services communicate with each other
    and with applications using the Natix Engine
    Interface
  • Binding Layer
  • Maps between the Natix Engine Interface and
    different application interfaces. Each such
    mapping is called a binding

6
Architectural Overview
7
Storage Engine
  • Storage Engine
  • Manages all persistent data structures and their
    transfer between main and secondary memory.
  • The system's overall speed, robustness, and
    capability are determined by the storage
    engine's design.
  • Architecture
  • Storage is organized into partitions
  • Disk pages are logically grouped into segments
  • Disk pages resident in main memory are managed by
    the buffer manager and their contents are
    accessed using page interpreters

8
Storage Engine: XML storage
  • Existing approaches to store XML documents
  • Flat streams: the document trees are serialized
    into byte streams.
  • Metamodeling: documents or data trees are
    modeled and stored using some conventional DBMS
  • Mixed: two redundant repositories, one flat and
    one metamodeled, which leads to slow updates and
    incurs significant overhead for consistency
    control.
  • Natix Native Storage Features
  • Subtrees of the original XML Document are stored
    together in a single record
  • The inner structure of the subtrees is retained
  • To satisfy special application requirements,
    clustering requirements can be specified with a
    split matrix

9
Storage Engine: Natix native storage
  • Logical data model
  • A set of trees; new nodes can be inserted as
    children or siblings of existing nodes.
  • Mapping between XML and the logical model
  • Elements are mapped one-to-one to tree nodes of
    the logical data model.
  • Attributes, PCDATA, CDATA nodes, and comments are
    stored as leaf nodes.
  • Three types of nodes in physical trees
  • Aggregate nodes: represent inner nodes of the
    logical tree
  • Literal nodes: represent leaf nodes of the
    logical tree and contain byte sequences such as
    text strings, graphics, or audio sequences
  • Proxy nodes: point to other records; they are
    used to link together trees that were
    partitioned into subtrees.
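
The three physical node types above can be sketched as small Python classes. This is only an illustrative model, not the Natix record layout: the class names, the `records` dictionary standing in for the record store, and the helper `logical_text` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Literal:
    """Leaf node holding an uninterpreted byte sequence (e.g. text)."""
    label: str
    data: bytes


@dataclass
class Proxy:
    """Placeholder pointing at the record that holds a split-off subtree."""
    record_id: int


@dataclass
class Aggregate:
    """Inner node; children may be aggregates, literals, or proxies."""
    label: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Aggregate, Literal, Proxy]


def logical_text(node: Node, records: Dict[int, Node]) -> str:
    """Concatenate literal data in document order, following proxies."""
    if isinstance(node, Literal):
        return node.data.decode("utf-8")
    if isinstance(node, Proxy):
        return logical_text(records[node.record_id], records)
    return "".join(logical_text(child, records) for child in node.children)


# Hypothetical record store: record 2 holds a subtree split off from record 1.
records: Dict[int, Node] = {
    2: Aggregate("SPEECH", [Literal("LINE", b"To be, or not to be")]),
}
records[1] = Aggregate("SCENE", [Literal("TITLE", b"Scene 1: "), Proxy(2)])
```

Following the proxy in record 1 reassembles the logical tree, so a reader of the logical model never sees where the physical split happened.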

10
A fragment of XML with its associated logical tree
11
Storage Engine: Splitting trees
  • XML segment mapping for large trees
  • Large documents are split semantically, based on
    their tree structure, by partitioning the tree
    into subtrees.
  • Updating documents
  • An update may cause a record containing a
    subtree to grow larger than a page.
  • Algorithm for tree growth
  • Determining the insertion location: this choice
    is determined by the split matrix
  • Splitting a record: once the insertion location
    has been decided, the record is split if it
    exceeds the net page capacity
  • To express preferences regarding the clustering
    of a node type with its parent node type, a
    split matrix was introduced as an additional
    parameter.
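
The role of the split matrix can be illustrated with a deliberately simplified sketch. It is not the Natix split algorithm: the preference values `SEPARATE`, `TOGETHER`, and `ANY`, the `PROXY_SIZE` constant, and the "evict the largest child first" heuristic are all assumptions chosen to keep the example short. It only shows how a per (parent label, child label) preference can steer which subtrees leave an oversized record.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical preference values for a (parent label, child label) pair.
SEPARATE = 0   # always store the child in its own record
TOGETHER = 1   # keep the child with its parent as long as possible
ANY = 2        # let the algorithm decide

PROXY_SIZE = 8  # assumed number of bytes taken by a proxy node


@dataclass
class Node:
    label: str
    size: int                                  # bytes used by this node alone
    children: List["Node"] = field(default_factory=list)
    proxy_for: int = -1                        # record id if this is a proxy

    def total(self) -> int:
        return self.size + sum(c.total() for c in self.children)


def split_record(root: Node, capacity: int,
                 matrix: Dict[Tuple[str, str], int],
                 records: Dict[int, Node]) -> Node:
    """Move child subtrees out of `root` until it fits into `capacity`."""

    def pref(child: Node) -> int:
        return matrix.get((root.label, child.label), ANY)

    def evict(index: int) -> None:
        record_id = len(records) + 1
        records[record_id] = root.children[index]
        root.children[index] = Node("proxy", PROXY_SIZE, proxy_for=record_id)

    # Step 1: honour "always separate" preferences.
    for i, child in enumerate(root.children):
        if child.proxy_for < 0 and pref(child) == SEPARATE:
            evict(i)

    # Step 2: while too large, evict the largest child not marked TOGETHER.
    while root.total() > capacity:
        candidates = [i for i, c in enumerate(root.children)
                      if c.proxy_for < 0 and pref(c) != TOGETHER
                      and c.total() > PROXY_SIZE]
        if not candidates:
            break  # a real system would have to relax the preferences here
        evict(max(candidates, key=lambda i: root.children[i].total()))
    return root


records: Dict[int, Node] = {}
chapter = Node("chapter", 10, [Node("title", 20), Node("section", 60),
                               Node("section", 50), Node("figure", 30)])
matrix = {("chapter", "title"): TOGETHER, ("chapter", "figure"): SEPARATE}
split_record(chapter, capacity=100, matrix=matrix, records=records)
```

In this run the figure is moved out because of its preference, and the larger section is moved out to make the record fit; the title stays with its chapter. The sketch does not recurse into evicted subtrees, which may themselves need splitting.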

12
Example of XML segment mapping for large trees
13
Example of splitting a tree when updating documents
14
(No Transcript)
15
Storage Engine: Index structures in Natix
  • Full text index framework
  • Enhances a traditional full text index by storing
    lists of document references that indicate in
    which documents search terms appear
  • The list implementation in Natix consists of four
    components: the index, the ListManager, the lists
    themselves, and the ContextDescription
  • Extended access support relation (XASR)
  • An index that preserves the parent/child,
    ancestor/descendant, and preceding/following
    relationships among nodes
  • For each node in the tree, a row with the
    relevant node information is stored in an XASR
    table
  • The XASR combined with a full text index provides
    a powerful method to search on the contents of
    nodes
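
One way to see how a single table can preserve these relationships is an interval numbering assigned during a depth-first traversal. The sketch below is an assumption-laden illustration, not the actual XASR schema: the column names `dmin`, `dmax`, and `parent_dmin` and the helper functions are chosen for the example, and only element nodes are indexed.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Optional


class XasrRow(NamedTuple):
    dmin: int                    # counter value when the node is entered
    dmax: int                    # counter value when the node is left
    tag: str
    parent_dmin: Optional[int]   # dmin of the parent, None for the root


def build_xasr(xml_text: str) -> List[XasrRow]:
    """Create one row per element, numbered by a depth-first traversal."""
    rows: List[XasrRow] = []
    counter = 0

    def visit(elem: ET.Element, parent_dmin: Optional[int]) -> None:
        nonlocal counter
        counter += 1
        dmin = counter
        for child in elem:
            visit(child, dmin)
        counter += 1
        rows.append(XasrRow(dmin, counter, elem.tag, parent_dmin))

    visit(ET.fromstring(xml_text), None)
    return sorted(rows)  # dmin is unique, so this is document order


def is_descendant(x: XasrRow, y: XasrRow) -> bool:
    return y.dmin < x.dmin and x.dmax < y.dmax


def is_child(x: XasrRow, y: XasrRow) -> bool:
    return x.parent_dmin == y.dmin


def is_following(x: XasrRow, y: XasrRow) -> bool:
    return x.dmin > y.dmax  # x starts after y has been left


rows = build_xasr("<play><act><scene/><scene/></act><epilogue/></play>")
```

With this numbering, each structural test is a comparison of integers, which is what makes the relationships answerable from table rows alone.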

16
Transaction Management
  • Two areas covered in Transaction Management
  • Recovery
  • Adapts the ARIES protocol and introduces the
    novel techniques of subsidiary logging,
    annihilator undo, and selective redo, which
    exploit certain opportunities to improve logging
    and recovery performance
  • Synchronization
  • An S2PL-based scheduler is introduced that
    provides lock modes and a protocol that are
    suitable for typical access patterns occurring
    for tree structured documents
  • Granularity hierarchies of arbitrary depth are
    supported
  • Contrary to existing tree locking protocols,
    jumps into the tree do not violate
    serializability.

17
Transaction Management: Recovery components
18
Transaction Management: Subsidiary logging
  • Existing logging-based recovery systems
  • Every modification operation is immediately
    preceded or followed by the creation of a log
    record for that operation
  • A conventional logging approach would not only
    create bulky log records for every single node
    insertion, but would also log all of the split
    operations.
  • Natix approach
  • Composing update operations into one large
    operation avoids the overhead of log record
    headers and reduces the number of calls to the
    log manager
  • The page contents are used as the subsidiary log
    instead of immediately creating a log record.
  • Effects of subsidiary logging
  • Log records are only created when necessary for
    recoverability. Only the final state of the
    document is logged upon commit
  • Reduces log size and increases concurrency,
    boosting overall performance

19
Transaction Management: Two types of subsidiary
logging
  • Page-level subsidiary logging
  • The page interpreters can reorder, modify, or use
    some optimized representation for these private
    log entries before they are published to the log
    manager
  • Abides by the write-ahead-logging rule
  • Forces the subsidiary logs to disk before a
    commit occurs
  • XML-page subsidiary logging
  • The log entries for the subsidiary log are not
    explicitly stored. Instead, the XML page
    interpreters reuse the data pages as a
    representation for log records before publishing
    them to the global log.
  • If a node in the data page is modified, only the
    final version is included in the log record

20
Transaction Management: Annihilator undo
  • Transaction undo often wastes CPU resources,
    because more operations than necessary are
    executed to recreate the desired result of a
    rollback. For example, any update operations to a
    record that has been created by the same
    transaction need not be undone when the
    transaction is aborted.
  • Annihilators
  • Undo operations that imply the undo of other
    operations following them in the log
  • If we know that undo for an operation is never
    required because an annihilator exists, the
    operation can be logged as a redo-only operation.
    No undo information has to be included in the log.
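
The example from this slide, updates to a record created by the same transaction, can be sketched in a few lines. The log format (`LogEntry` with `op`, `record`, `before`, `after`) and the `rollback` function are assumptions made for illustration; the sketch also assumes a record is created at most once per transaction.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogEntry:
    op: str            # "create" or "update"
    record: str
    before: str = ""   # undo information; may be empty for redo-only entries
    after: str = ""


def rollback(log: List[LogEntry], store: Dict[str, str]) -> int:
    """Undo one transaction's log backwards; return the undo steps executed."""
    created = {entry.record for entry in log if entry.op == "create"}
    steps = 0
    for entry in reversed(log):
        if entry.op == "create":
            # Annihilator: deleting the record undoes every later update to it.
            store.pop(entry.record, None)
            steps += 1
        elif entry.record not in created:
            store[entry.record] = entry.before
            steps += 1
        # Updates to records created by this transaction are skipped.
    return steps


store = {"r1": "new", "r2": "z"}   # state after the transaction's updates
txn_log = [
    LogEntry("update", "r1", before="old", after="new"),
    LogEntry("create", "r2", after="x"),
    LogEntry("update", "r2", after="y"),   # redo-only: no before image needed
    LogEntry("update", "r2", after="z"),   # redo-only: no before image needed
]
undo_steps = rollback(txn_log, store)
```

Four logged operations are rolled back with two undo steps, and the two updates to `r2` never needed undo information in the log.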

21
(No Transcript)
22
Transaction Management: Selective redo and
selective undo
  • The ARIES protocol is designed around the
    redo-history paradigm, meaning that the complete
    state of the cached database is restored after a
    crash, including updates of loser transactions.
  • If all uncommitted updates were only in the
    buffer at the time of the crash, no redo or undo
    of loser transactions is necessary at restart
  • Natix improves the restart performance of the
    recovery system by avoiding redo and undo when
    possible
  • By adding a transaction ID field to the
    main-memory page interpreter, it can easily be
    determined whether one or more transactions have
    updates on a dirty page. The information is
    included in the dirty page checkpoint log
    records and, as a result, is available after
    restart analysis
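
A rough sketch of the decision this enables is given below. It rests on a strong assumption taken from the case named above: none of the buffered updates on the listed dirty pages had reached disk before the crash. The data layout (`dirty_pages` mapping a page id to the set of transaction ids that updated it) and the function name are hypothetical.

```python
from typing import Dict, List, Set


def pages_to_redo(dirty_pages: Dict[int, Set[int]],
                  losers: Set[int]) -> List[int]:
    """Return the dirty pages that still need a redo pass at restart.

    dirty_pages: page id -> ids of transactions with updates on the page
                 (assumed to come from the dirty page checkpoint records).
    losers:      ids of transactions that had not committed at the crash.
    """
    redo = []
    for page_id, updaters in sorted(dirty_pages.items()):
        if updaters <= losers:
            # Only losers touched the buffered page and, by assumption,
            # none of their updates reached disk: the on-disk page is
            # already in the rolled-back state, so redo and undo are skipped.
            continue
        redo.append(page_id)
    return redo


dirty = {1: {10}, 2: {10, 20}, 3: {20}}
restart_plan = pages_to_redo(dirty, losers={20})
```

Page 3 is skipped entirely, while page 2 is still redone because a committed transaction also updated it.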

23
Transaction Management: Synchronization
components
  • Because XML documents are semi-structured,
    synchronization mechanisms used in traditional
    structured relational databases cannot be
    applied.
  • Lock protocol
  • Two cases
  • For operations on non-structural data, we acquire
    a lock on the node containing the data
  • For structural changes, we request a lock on each
    node in the vicinity of the affected node
  • Top-down navigation: the transaction holds locks
    on all nodes along the path from the root to the
    current node
  • Jumps: lock all nodes from the node N that the
    transaction jumped to up to the root node of the
    document
  • Special issues
  • Lock escalation: if a transaction holds an
    excessive number of locks, causing a lot of
    overhead, lock escalation is invoked.
  • Deadlock detection: if a transaction has waited
    for a lock request longer than a specified
    timeout value, we start deadlock detection
24
Natix Query Execution Engine
  • Design goals: efficiency, expressiveness, and
    flexibility
  • NPA: Natix Physical Algebra
  • Works on sequences of tuples.
  • A tuple consists of a sequence of attribute
    values.
  • Each value can be a number, a string, or a node
    handler. A node handler can be a handler to any
    XML node type, such as a text node, an element
    node, or an attribute node.
  • Natix Virtual Machine
  • Interprets commands on register sets. Each
    register set is capable of holding a tuple
  • Parameterizes algebraic operators in a very
    flexible way, so that different algebraic
    operators need not be implemented for the
    different result constructions, such as DOM,
    SAX, or a textual representation.
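
The combination of tuple-at-a-time operators and a small interpreter can be sketched as below. The instruction set (`copy`, `upper`, `tag`), the `run` interpreter, and the `map_op` operator are invented for this example and do not reflect the real NVM commands; the point is only that one operator can produce different results depending on the program it is given.

```python
from typing import Iterable, Iterator, List, Tuple

Registers = List[object]
Command = Tuple[str, int, int]   # (opcode, source register, target register)


def run(program: List[Command], regs: Registers) -> None:
    """Interpret a tiny, made-up instruction set on one register set."""
    for op, src, dst in program:
        if op == "copy":
            regs[dst] = regs[src]
        elif op == "upper":
            regs[dst] = str(regs[src]).upper()
        elif op == "tag":
            # Wrap the value in register dst in an element named by src.
            regs[dst] = f"<{regs[src]}>{regs[dst]}</{regs[src]}>"
        else:
            raise ValueError(f"unknown opcode {op!r}")


def map_op(tuples: Iterable[Registers],
           program: List[Command]) -> Iterator[Registers]:
    """One generic operator, parameterized by the program it runs per tuple."""
    for t in tuples:
        regs = list(t)   # copy the tuple into a fresh register set
        run(program, regs)
        yield regs


source = [["title", "Hamlet"], ["author", "Shakespeare"]]
as_markup = list(map_op(source, [("tag", 0, 1)]))
as_upper = list(map_op(source, [("upper", 1, 1)]))
```

The same `map_op` yields markup strings with one program and upper-cased text with another, so no separate operator is written per output form.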

25
Construction Plan of Query
26
Conclusion
  • Contribution
  • Introduced a storage format that clusters
    subtrees of an XML document tree into physical
    records of limited size (efficient retrieval of
    whole documents and document fragments).
  • To improve recovery in the XML context, novel
    techniques were introduced: subsidiary logging
    to reduce the log size, annihilator undo to
    accelerate undo, and selective redo to
    accelerate restart recovery.
  • To allow for high concurrency, a flexible
    multi-granularity locking protocol with an
    arbitrary number of granularity levels is
    presented.
  • Supports representations such as XML documents,
    a DOM tree, a string, or a stream of SAX events.
  • Potential drawbacks
  • A large volume of operations traversing the tree
    may lead to deadlock
  • Traversing a subtree that will be deleted by
    another transaction is dangerous
  • Computational complexity: inserting into or
    splitting a record needs a lot of computation.