1
XML Databases
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Database Information Systems
  • March 23, 2005

2
Administrivia
  • We're moving beyond simple databases now
  • For Monday: read and compare the focus of
  • Hanson: Scalable Trigger Processing
  • Stanford STREAM processor
  • For Wednesday:
  • Retrospective on Aurora

3
Today's Trivia Question

4
XML: What Makes It Hard?
  • It's not normalized
  • It conceptually centers around some origin,
    meaning that navigation becomes central
  • Contrast with E-R diagrams
  • How to store the hierarchy?
  • Complex navigation
  • Updates, locking
  • Optimization
  • Also, it's ordered
  • May restrict order of evaluation (or at least
    presentation)
  • Makes updates more complex
  • Many of these issues aren't unique to XML
  • Semistructured databases, esp. with ordered
    collections, were similar
  • But our efforts in that area basically failed

5
XML: What's It Good For?
  • Collections of text documents, e.g., the Web, doc
    DBs
  • How would we want to query those?
  • IR/text queries, path queries, XQueries?
  • Interchanging data
  • SOAP messages, RSS, XML streams
  • Perhaps subsets of data from RDBMSs
  • Storing native, database-like XML data
  • Caching
  • Logging of XML messages
  • ?

6
Lots of XML Research Out There
  • Text
  • Hybrids of database and IR techniques for search
  • (e.g., Amer-Yahia & Shanmugasundaram, Weikum &
    Ramakrishnan, …)
  • Interchange
  • Web service verification
  • XML stream processing
  • XML databases
  • Natix, TIMBER, …
  • Tamino, DB2 UDB, Oracle, …

7
The Main Focal Points
  • XML with documents
  • Inverted indices
  • Integration of ranking into DBMS
  • Interaction between structure and content
  • Streaming XML
  • RDBMS → XML export
  • Partitioning of computation between source and
    mediator
  • Streaming XPath engines
  • XML databases
  • Hierarchical storage and locking (Natix, TIMBER,
    BerkeleyDB, Tamino, …)
  • Query optimization

8
Text-Based XML
  • The fundamental questions
  • How should we model ranking in query processing?
  • Simply as another value (e.g., Amer-Yahia &
    Shanmugasundaram)
  • Using a probabilistic model or as an undefined
    metric
  • e.g., Weikum and Ramakrishnan work-in-progress
  • How does structure affect ranking?
  • PageRank-style (e.g., Shanmugasundaram et al.)
  • Query relaxation (FleXPath)
  • Other?
  • How do we achieve efficient pruning?
  • A* search [Cohen 98]
  • Fagin's Threshold Algorithm
  • Custom logic?
  • How do we integrate keyword indexing with
    structural indexing?
  • Multiple indices (e.g., Lore, Natix, …)
  • Integrated indices (e.g., ViST)
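
As one illustration of the pruning question above, here is a minimal sketch of Fagin's Threshold Algorithm in Python. The function name, the use of a plain sum as the aggregate, and the dict-per-attribute input are assumptions made for this example; a real engine would read sorted index lists rather than sort in memory.

```python
import heapq

def threshold_top_k(lists, k):
    """lists: one dict per ranking attribute, mapping object id -> score.
    The aggregate score is the sum over all lists (missing = 0).
    Returns (top-k as (score, id) pairs, best first; rounds of sorted access)."""
    # Sorted access order per list: highest score first.
    sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    seen = set()
    top = []  # min-heap holding the best k (aggregate, id) pairs so far
    depth = 0
    max_depth = max(len(s) for s in sorted_lists)
    while depth < max_depth:
        last_scores = []
        for s in sorted_lists:
            if depth >= len(s):
                last_scores.append(0)
                continue
            obj, score = s[depth]
            last_scores.append(score)
            if obj not in seen:
                seen.add(obj)
                # Random access: fetch this object's score from every list.
                total = sum(d.get(obj, 0) for d in lists)
                heapq.heappush(top, (total, obj))
                if len(top) > k:
                    heapq.heappop(top)
        depth += 1
        # No unseen object can score higher than this threshold.
        threshold = sum(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), depth
```

The point of the sketch is the stopping rule: once the k-th best aggregate reaches the sum of the scores last seen under sorted access, the remaining entries can be pruned without being read.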

9
XML as a Wire Format
  • RDBMS → XML export
  • SilkRoute and Xperanto, outer unions
  • Interaction with RDBMS optimization techniques
  • Updates [Tatarinov01]
  • Cascading updates are already possible in RDBMSs
  • Updating XML views
  • Streaming XML
  • SAX-based XPath-matching engines
    [Ives01] [AltinelFranklin00] [Green02]
    [DiaoFranklinChen]
  • Push-down of XPath matching as early as possible
  • Query decomposition (still in need of a standard
    means of pushing XQuery to a source)
  • Subsets of XQuery that are amenable to streaming
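
To make "SAX-based XPath matching" concrete, here is a small sketch using Python's standard `xml.sax` module. It is not any of the cited engines: the class and function names are made up for this example, and it handles only absolute paths of child steps with an optional `*` wildcard (no `//`, no predicates).

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Matches a simple absolute path such as /db/book/title in one pass."""

    def __init__(self, path):
        super().__init__()
        self.steps = path.strip("/").split("/")
        self.stack = []      # tags of the currently open elements
        self.buffer = None   # text pieces of the element being matched, if any
        self.matches = []

    def _on_path(self):
        return len(self.stack) == len(self.steps) and all(
            step in ("*", tag) for step, tag in zip(self.steps, self.stack))

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self._on_path():
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.buffer is not None and self._on_path():
            self.matches.append("".join(self.buffer))
            self.buffer = None
        self.stack.pop()

def stream_match(xml_text, path):
    handler = PathMatcher(path)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches
```

The only state kept is the stack of open tags plus the text of the current match, which is what makes this style of matching suitable for streams: the document itself is never materialized.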

10
XML in a Database
  • Use a legacy RDBMS
  • Shredding [Shanmugasundaram99] and many others
  • Path-based encodings [Cooper01]
  • Region-based encodings [Bruno02] [Chen04]
  • Order preservation in updates [Tatarinov02], …
  • What's novel here? How does this relate to
    materialized views and warehousing?
  • Native XML databases
  • Hierarchical storage (Natix, TIMBER, BerkeleyDB,
    Tamino, …)
  • Updates and locking
  • Query optimization (e.g., that on Galax)
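
A rough sketch of a region-based encoding stored in a relational engine, using Python's `sqlite3` and `xml.etree.ElementTree`. The table layout and the column names (`lft`, `rgt`, `depth`, `content`) are assumptions chosen for this example and are not taken from the cited papers.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Turn each element into a (tag, lft, rgt, depth, content) tuple.
    lft/rgt come from one counter bumped on entering and leaving an element."""
    rows, counter = [], 0

    def visit(elem, depth):
        nonlocal counter
        counter += 1
        lft = counter
        for child in elem:
            visit(child, depth + 1)
        counter += 1
        rows.append((elem.tag, lft, counter, depth, (elem.text or "").strip()))

    visit(ET.fromstring(xml_text), 0)
    return rows

def load(rows):
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE node(tag TEXT, lft INT, rgt INT, depth INT, content TEXT)")
    db.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)
    return db

# Ancestor-descendant (the XPath // axis) becomes an interval-containment join.
DESCENDANTS = """
SELECT d.content
  FROM node a JOIN node d ON d.lft > a.lft AND d.rgt < a.rgt
 WHERE a.tag = ? AND d.tag = ?
 ORDER BY d.lft
"""
```

Usage: `load(shred(doc)).execute(DESCENDANTS, ("book", "name"))` returns the text of every `name` element below some `book`. Ordering by `lft` recovers document order, which is also why insertions are awkward: new nodes need free interval values.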

11
Query Processing for XML
  • Why is optimization harder?
  • Hierarchy means many more joins (conceptually)
  • traverse, tree-match, x-scan, unnest,
    path, op
  • Though typically parent-child relationships
  • Often don't have a good measure of fan-out
  • More ways of optimizing this
  • Order preservation limits processing in many ways
  • Nested content is roughly a left outer join
  • Except that we need to cluster a collection with
    the parent
  • Relationship with NF² approach
  • Tags (don't really add much complexity except in
    trying to encode efficiently)
  • Complex functions and recursion
  • Few real DB systems implement these fully
  • Why is storage harder?
  • That's the focus of Natix, really
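
The "nested content as a left outer join, clustered with the parent" point can be sketched with a toy relational schema. The `book` and `author` tables and the `nest` helper are hypothetical, chosen only to show the shape of the computation.

```python
import sqlite3
from itertools import groupby

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE book(id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE author(book_id INT, name TEXT);
INSERT INTO book VALUES (1, 'A'), (2, 'B');
INSERT INTO author VALUES (1, 'X'), (1, 'Y');
""")

# The outer join keeps books without authors; the ORDER BY clusters each
# book's authors next to their parent row.
rows = db.execute("""
SELECT b.id, b.title, a.name
  FROM book b LEFT OUTER JOIN author a ON a.book_id = b.id
 ORDER BY b.id, a.name
""").fetchall()

def nest(rows):
    """Group the clustered join result back into nested XML text."""
    out = []
    for (_, title), group in groupby(rows, key=lambda r: (r[0], r[1])):
        authors = "".join(
            f"<author>{name}</author>" for _, _, name in group if name is not None)
        out.append(f"<book><title>{title}</title>{authors}</book>")
    return "".join(out)
```

Book 'B' has no authors, so an inner join would drop it; the outer join yields one row with a NULL name, which `nest` turns into a book element with no author children.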

12
The Natix System
  • In contrast to many pieces of work on XML,
    focuses on the bottom layers, equivalent to
    System R's RSS
  • Physical layout
  • Indexing
  • Locking/concurrency control
  • Logging/recovery

13
Physical Layout
  • What are our options in storing XML trees?
  • At some level, it's all smoke-and-mirrors
  • Need to map to flat byte sequences on disk
  • But several options
  • Shred completely, as in many RDBMS mappings
  • Each path may get its own contiguous set of pages
  • e.g., vectorized XML [Buneman et al.]
  • An element may be stored with its 1:1 children
  • e.g., shared inlining (Shanmugasundaram and
    Chen)
  • All content may be in one table
  • e.g., Florescu/Kossmann and most
    interval-encoded XML
  • We may embed a few items on the same page and
    overflow the rest
  • How collections are often stored in ORDBMS
  • We may try to cluster XML trees on the same page,
    as interpreted BLOBs
  • This is Natix's approach (and also IBM's DB2)
  • Pros and cons of these approaches?

14
Challenges of the Page-per-Tree Approach
  • How big of a tree?
  • What happens if the XML overflows the tree?
  • Natix claims an adaptive approach to choosing the
    tree's granularity
  • Primarily based on balancing the tree,
    constraints on children that must appear with a
    parent
  • What other possibilities make sense?
  • Natix uses a B+ Tree-like scheme for achieving
    balance and splitting a tree across pages
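
The following is a toy sketch of splitting a tree across fixed-capacity pages and leaving proxy nodes behind. It is not Natix's algorithm: the greedy "move out the largest child subtree" rule, the unit node sizes, and all names are assumptions for illustration, and it ignores balancing and parent/child placement constraints.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    tag: str
    size: int = 1
    children: list = field(default_factory=list)

@dataclass(eq=False)
class Proxy:
    page_id: int   # index of the page that now holds the moved subtree
    size: int = 1

def subtree_size(n):
    if isinstance(n, Proxy):
        return n.size
    return n.size + sum(subtree_size(c) for c in n.children)

def pack(node, capacity, pages):
    """Shrink node's subtree until it fits on one page, bottom-up."""
    node.children = [
        pack(c, capacity, pages) if isinstance(c, Node) else c
        for c in node.children]
    while subtree_size(node) > capacity:
        movable = [c for c in node.children if isinstance(c, Node)]
        if not movable:
            raise ValueError("node plus its proxies exceed the page capacity")
        victim = max(movable, key=subtree_size)
        pages.append(victim)                       # subtree gets its own page
        idx = node.children.index(victim)
        node.children[idx] = Proxy(page_id=len(pages) - 1)
    return node

def paginate(root, capacity):
    """Return the list of page contents; the root's page is the last entry."""
    pages = []
    pages.append(pack(root, capacity, pages))
    return pages
```

Even this crude version shows the two questions from the slide: the capacity decides how big a tree fragment can be, and every split costs an extra proxy hop at query time.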

15
Example
  • [Figure: a tree split across pages. Split point in parent page; note proxy nodes.]

16
That Was Simple, But What about Updates?
  • Clearly, insertions and deletions can affect
    things
  • Deletion may ultimately require us to rebalance
  • Ditto with insertion
  • But insertion may also make us run out of space:
    what to do?
  • Their approach: add another page; ultimately may
    need to split at multiple levels, as in a B+ Tree
  • Others have studied this problem and used integer
    encoding schemes (plus B+ Trees) for the order
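
A minimal sketch of the integer-encoding idea for sibling order: leave gaps between order keys, insert at the midpoint, and renumber only when a gap is used up. The class name, the default gap, and the sorted Python list (standing in for a B+ Tree on the key) are all assumptions for this example.

```python
class OrderedSiblings:
    """Keeps siblings in document order using sparse integer keys."""

    def __init__(self, gap=100):
        if gap < 2:
            raise ValueError("gap must leave room for at least one insertion")
        self.gap = gap
        self.keys = []        # sorted integer order keys
        self.items = []       # items[i] carries order key keys[i]
        self.renumberings = 0

    def insert(self, pos, item):
        """Insert item so that it becomes the pos-th sibling; return its key."""
        lo = self.keys[pos - 1] if pos > 0 else 0
        hi = self.keys[pos] if pos < len(self.keys) else lo + 2 * self.gap
        if hi - lo < 2:
            # No free integer between the neighbours: relabel everything.
            self._renumber()
            return self.insert(pos, item)
        key = (lo + hi) // 2
        self.keys.insert(pos, key)
        self.items.insert(pos, item)
        return key

    def _renumber(self):
        self.keys = [self.gap * (i + 1) for i in range(len(self.keys))]
        self.renumberings += 1
```

Most insertions touch a single key; the cost shows up only in the occasional renumbering, which is the trade-off such encodings make against storing explicit positions.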

17
Does this Help?
  • According to general lore, yes
  • The Natix experiments in this paper were limited
    in their query and adaptivity loads
  • But the IBM guys say their approach, which is
    similar, works significantly better than Oracle's
    shredded approach

18
There's More to Updates than the Pages
  • What about concurrency control and recovery?
  • We already have a notion of hierarchical locks,
    but they claim:
  • If we want to support IDREF traversal, and
    indexing directly to nodes, we need more
  • What's the idea behind SPP locking?
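
For reference, the classic hierarchical (multi-granularity) locking scheme mentioned above can be sketched as follows. This is the textbook IS/IX/S/SIX/X protocol, not SPP locking; the helper names and the dict of held locks are assumptions for this example.

```python
# COMPATIBLE[requested] = modes another transaction may already hold.
COMPATIBLE = {
    "IS": {"IS", "IX", "S", "SIX"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "SIX": {"IS"},
    "X": set(),
}

def locks_needed(path, mode):
    """Locks to take, root first, to access the last node on a
    root-to-node path in mode 'S' or 'X'."""
    intent = "IS" if mode == "S" else "IX"
    return [(node, intent) for node in path[:-1]] + [(path[-1], mode)]

def can_grant(held, requested):
    """held: dict node -> list of modes held by other transactions."""
    return all(
        other in COMPATIBLE[mode]
        for node, mode in requested
        for other in held.get(node, []))
```

The protocol assumes every access walks down from the root and takes intention locks on the way. An IDREF jump or an index lookup lands directly on a node without that walk, which is one way to read the slide's claim that more is needed.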

19
Logging
  • They claim ARIES needs some modifications: why?
  • Their changes:
  • Need to make subtree updates more efficient:
    don't want to write a log entry for each subtree
    insertion
  • Use (a copy of) the page itself as a means of
    tracking what was inserted, then batch-apply to
    WAL
  • Annihilators: if we undo a tree creation, then
    we probably don't need to worry about undoing
    later changes to that tree
  • A few minor tweaks to minimize undo/redo when
    only one transaction touches a page
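
One way to picture the annihilator idea is as a filter over the undo work of an aborted transaction. The sketch below reflects only the reading given in the bullets above, not the actual Natix recovery code; the log record format and the function name are made up.

```python
def undo_plan(log):
    """log: (op, tree_id) records of one aborted transaction, oldest first.
    Returns the undo actions to perform, newest first."""
    # Trees created inside this transaction; dropping them undoes everything
    # the transaction later did to them. (A real system would track this as
    # it logs, rather than with an extra pass.)
    created = {tree for op, tree in log if op == "create"}
    plan = []
    for op, tree in reversed(log):
        if op == "create":
            plan.append(("drop", tree))        # the annihilating undo
        elif tree in created:
            continue                           # covered by the drop
        else:
            plan.append(("undo_" + op, tree))
    return plan
```

For a log that creates tree `t2` and then inserts into it many times, the plan contains a single `drop` for `t2` instead of one undo per insertion.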

20
Annihilators

21
Assessment
  • Native XML storage isn't really all that
    different from other means of storage
  • There are probably some good reasons to make a
    few tweaks in locking
  • Optimization stays harder
  • A real solution to materialized view creation
    would probably make RDBMSs come close to
    delivering the same performance, modulo locking

22
Questions
  • Where are the main challenges of XML processing
    at this point?
  • Impact of BinaryXML?
  • Are we working on the right problems? What's XML
    going to be used for, anyway?