XML Data Management - PowerPoint PPT Presentation

Slides: 43
Provided by: www249
Learn more at: https://www2.cs.uh.edu

1
XML Data Management
  • Ning Zhang
  • University of Waterloo

2
What is XML?
  • XML documents have elements and attributes
  • Elements (indicated by begin and end tags)
  • can be nested but cannot interleave with each other
  • can have an arbitrary number of sub-elements
  • can have free text as values

<chap title="Introduction To XML">
  some free text
  <sect title="What is XML?"> </sect>
  <sect title="Elements"> </sect>
  <sect title="Why XML?"> </sect>
  possibly more free text
</chap>

(Figure labels on the example: begin element, attribute, end element)
  • Elements with the same name can be nested
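
As a rough illustration of how elements, attributes, and free text show up once such a document is parsed, the chap example can be loaded with Python's standard xml.etree.ElementTree module. This is only a sketch: the variable names are chosen here and are not part of the slides.

```python
import xml.etree.ElementTree as ET

# The chap example from this slide, written as well-formed XML.
DOC = """<chap title="Introduction To XML">
  some free text
  <sect title="What is XML?"></sect>
  <sect title="Elements"></sect>
  <sect title="Why XML?"></sect>
  possibly more free text
</chap>"""

root = ET.fromstring(DOC)

# Element name and attribute value of the root element.
print(root.tag, root.attrib["title"])

# Sub-elements are kept in document order.
section_titles = [sect.get("title") for sect in root]
print(section_titles)

# Free text before the first child is stored in .text; free text that
# follows a child is stored in that child's .tail.
leading_text = root.text.strip()
trailing_text = root[-1].tail.strip()
print(leading_text, "|", trailing_text)
```

The split between .text and .tail is a design choice of this particular library, not of XML itself.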
3
Why XML?
  • Database side: XML is a new way to organize data
  • Relational databases organize data in tables
  • XML documents organize data in ordered trees
  • Document side: XML is a semantic markup language
  • HTML focuses on presentation
  • XML focuses on semantics/structure in the data

(Figure: a tree with one chap node and six sect nodes)

<html> <h1> Chapter 1 </h1> some free text
<h2> Section 1 </h2> some more free text
<h3> Section 1.1 </h3> </html>

4
Data Management -- Relational vs. XML
  • Relational data are well organized: fully
    structured (more strict)
  • E-R modeling to model the data structures in the
    application
  • E-R diagram is converted to relational tables and
    integrity constraints (relational schemas)
  • XML data are semi-structured (more flexible)
  • Schemas may be unfixed or unknown (flexible:
    anyone can author a document)
  • Suitable for data integration (data on the web,
    data exchange between different enterprises).

5
More about Relational vs. XML
  • XML is not meant to replace relational database
    systems
  • RDBMSs are well suited to OLTP applications
    (e.g., electronic banking), which run 1000 small
    transactions per minute.
  • XML is suitable for data exchange across
    heterogeneous data sources (e.g., Web services),
    allowing them to talk to each other.

6
When should we use XML? (1)
  • Document representation language
  • XML can be transformed to other formats (e.g., by
    XSLT)
  • XML → HTML
  • XML → LaTeX, BibTeX
  • XML → PDF
  • DocBook (standard schema for authoring
    documents/books)

7
When should we use XML? (2)
  • Data integration and exchange language
  • Web services (SOAP, WSDL, UDDI)
  • Amazon.com, eBay, Microsoft MapPoint, ...
  • Domain-specific data exchange schemas (>1000)
  • legal document exchange language
  • business information exchange
  • RSS: XML news feed
  • CNN, Slashdot, blogs, ...

8
When should we use XML? (3)
  • Any data having hierarchical structure
  • Email
  • Header: from, to, cc, bcc
  • Body: my message, replied email
  • Network log file
  • IP address, time, request type, error code
  • Advantages of translating to XML
  • Exploit high-level declarative XML query languages

9
XML Databases
  • Advantages
  • Manage large volumes of XML data
  • Provide high-level declarative language
  • Efficiently evaluate complex queries
  • XML Data Management Issues
  • XML Data Model
  • XML Query Languages
  • XML Query Processing and Optimization

10
XML Data Model
  • Hierarchical data model
  • An XML document is an ordered tree
  • Nodes in the tree are labeled with element names.
  • Element nesting relationship corresponds to
    parent-child relationship

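(Figure: the example document drawn as an ordered tree. The root chap has
an @title attribute "Introduction to XML", a text child "some free text",
and sect children, each with its own @title such as "What is XML?")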

11
XML Schema Languages
  • Schema languages define the structure of XML documents
  • Document Type Definition (DTD)
  • Context-free grammar
  • Structurally richer than relational schema
    definition language because of recursion.
  • XML Schema
  • Also context-free
  • Richer than DTD because of data type definitions
    (integer, date, sequence).

12
XML Query Languages
  • XPath
  • 13 axes (navigation directions in the tree)
  • child (/), descendant (//), following-sibling,
    following
  • NameTest, predicates
  • E.g.,
  • doc("bib.xml")//book[title="Harry Potter"]/ISBN
  • XQuery (superset of XPath)
  • FLWOR expression
  • for $x in doc("bib.xml")//book[title="Harry Potter"]/ISBN,
  •     $y in doc("imdb.xml")//movie
  • where $y//novel/ISBN = $x
  • return $y//title
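
To make the two queries on this slide concrete, here is a minimal Python sketch that evaluates the same logic over two tiny in-memory documents. The contents of bib.xml and imdb.xml are not given in the slides, so the sample data below is entirely made up; only the element names (book, title, ISBN, movie, novel) come from the queries. The sketch relies on the limited XPath subset in the standard xml.etree.ElementTree module rather than a real XQuery engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for bib.xml and imdb.xml (illustrative data only).
BIB = """<bib>
  <book><title>Harry Potter</title><ISBN>111</ISBN></book>
  <book><title>Another Book</title><ISBN>222</ISBN></book>
</bib>"""
IMDB = """<imdb>
  <movie><title>Harry Potter (film)</title><novel><ISBN>111</ISBN></novel></movie>
  <movie><title>Unrelated Film</title><novel><ISBN>333</ISBN></novel></movie>
</imdb>"""

bib = ET.fromstring(BIB)
imdb = ET.fromstring(IMDB)

# XPath step: //book[title="Harry Potter"]/ISBN
isbns = [e.text for e in bib.findall(".//book[title='Harry Potter']/ISBN")]

# FLWOR logic: for every such ISBN x and every movie y, keep y when one of
# its novel/ISBN descendants equals x, then return the titles under y.
movie_titles = [
    t.text
    for x in isbns
    for y in imdb.findall(".//movie")
    if any(n.text == x for n in y.findall(".//novel/ISBN"))
    for t in y.findall(".//title")
]
print(isbns, movie_titles)
```

The nested comprehension mirrors the for/where/return clauses one to one, which is also why a naive evaluation behaves like a nested-loop join.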

13
Important Problems in XML Data Management
What follows is not covered in COSC 3480!!
  • How to store XML data?
  • How to efficiently evaluate XPath/XQuery
    queries?
  • Efficient physical operators
  • Query optimization
  • How to support XML update languages?
  • How to support transaction management?
  • Recovery management?

14
Agenda
  • XML Storage
  • XML Path Query Processing
  • XML Optimization

15
XML Storage
  • Extended Relational Storage
  • Convert XML documents to relational tables
  • Native Storage
  • Treat XML elements as first-class citizens
  • Hybrid of Relational and Native Storage
  • XML documents can be stored in columns of
    relational tables (XML typed column)

16
Extended Relational Storage
  • Edge-based Storage Scheme (Florescu and Kossmann
    99)
  • Each node has an ID
  • Each tuple in the edge table consists of
    (parentID, childID, type of data, reference to
    data)
  • Pro: easy to convert XML to relational tables
  • Con: path queries such as //a//b cannot be
    answered using SQL without a transitive closure
    (recursion) operator
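
The following sketch shows what shredding into an edge table and answering //a//b might look like. It is an illustration under assumptions, not the published scheme: the table is simplified to (parent_id, child_id, tag), the sample document is invented, and the transitive closure is expressed with a recursive common table expression, which SQLite (used here through Python's sqlite3 module) happens to support.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample document; node IDs are assigned in document order.
DOC = "<a><x><b/></x><b/><c><a><b/></a></c></a>"

def shred(root):
    """Return simplified edge tuples (parent_id, child_id, tag)."""
    rows, counter = [], [0]
    def visit(node, parent_id):
        counter[0] += 1
        node_id = counter[0]
        rows.append((parent_id, node_id, node.tag))
        for child in node:
            visit(child, node_id)
    visit(root, None)
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (parent_id INTEGER, child_id INTEGER, tag TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", shred(ET.fromstring(DOC)))

# //a//b: start from every a node, follow edges any number of times
# (the transitive closure), then keep the b nodes whose parent was reached.
QUERY = """
WITH RECURSIVE reach(node_id) AS (
    SELECT child_id FROM edge WHERE tag = 'a'
    UNION
    SELECT e.child_id FROM edge e JOIN reach r ON e.parent_id = r.node_id
)
SELECT DISTINCT e.child_id
FROM edge e JOIN reach r ON e.parent_id = r.node_id
WHERE e.tag = 'b'
ORDER BY e.child_id
"""
b_ids = [row[0] for row in conn.execute(QUERY)]
print(b_ids)
```

Without the recursive part, each additional level of nesting would need one more self-join of the edge table, which is the difficulty this slide points at.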

17
Extended Relational Storage
  • Path-based Storage Scheme: XRel (Yoshikawa et al.
    01)
  • Each node corresponds to a tuple in the table
  • Each tuple keeps a rooted path to the node (e.g.,
    /article/chap/sec/sec/@title)
  • Pro: also easy to convert XML to tables
  • Con: answering path queries, such as //a//b,
    needs expensive string pattern matching
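
A small sketch of the idea, assuming a single table of (node_id, rooted_path) tuples and Python's re module in place of the database's string matching. The sample document and the helper names rooted_paths and path_regex are hypothetical, and the real XRel schema is more elaborate.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample; the element named "ab" checks that "a" is matched as a
# whole step and not as a prefix of another name.
DOC = "<a><x><b/></x><b/><c><a><b/></a></c><ab><b/></ab></a>"

def rooted_paths(root):
    """Return one (node_id, rooted_path) tuple per element, in document order."""
    rows = []
    def visit(node, prefix):
        path = prefix + "/" + node.tag
        rows.append((len(rows) + 1, path))
        for child in node:
            visit(child, path)
    visit(root, "")
    return rows

def path_regex(query):
    """Translate a linear path such as //a//b into a regex over rooted paths."""
    pattern = ""
    for axis, name in re.findall(r"(//|/)([^/]+)", query):
        # '//' may skip any number of intermediate steps; '/' skips none.
        pattern += ("(/[^/]+)*" if axis == "//" else "") + "/" + re.escape(name)
    return re.compile("^" + pattern + "$")

table = rooted_paths(ET.fromstring(DOC))
descendant_matches = [p for _, p in table if path_regex("//a//b").match(p)]
child_matches = [p for _, p in table if path_regex("//a/b").match(p)]
print(descendant_matches)
print(child_matches)
```

Every tuple's path string has to be tested against the pattern, which is where the string matching cost named above comes from.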

18
Extended Relational Storage
  • Node-based Storage Scheme: Niagara, TIMBER, etc.
    (Zhang et al. 01)
  • Each node is encoded with begin and end
    integers.
  • Begin corresponds to the node's order in a
    pre-order traversal of the tree; end corresponds
    to its order in a post-order traversal.
  • Pro: checking parent-child/ancestor-descendant
    relationships is efficient (constant time using
    begin and end)
  • Con: inefficient for updating XML
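
The constant-time check can be sketched as follows, assuming begin is a node's pre-order rank and end its post-order rank. The level field used for the parent-child test is an extra assumption added here, since the two ranks alone cannot distinguish a parent from a more distant ancestor. The sample document uses unique tag names so that labels can be looked up by tag.

```python
import xml.etree.ElementTree as ET

# Invented sample document with unique tag names.
DOC = "<a><b><c/></b><d/></a>"

def label(root):
    """Assign (begin, end, level) to every element in one depth-first pass."""
    labels = {}
    pre, post = [0], [0]
    def visit(node, level):
        pre[0] += 1
        begin = pre[0]            # pre-order rank
        for child in node:
            visit(child, level + 1)
        post[0] += 1              # post-order rank
        labels[node.tag] = (begin, post[0], level)
    visit(root, 0)
    return labels

def is_ancestor(anc, desc):
    # An ancestor is visited earlier in pre-order and later in post-order.
    return anc[0] < desc[0] and anc[1] > desc[1]

def is_parent(par, child):
    # Assumption: a level number is stored next to begin and end.
    return is_ancestor(par, child) and child[2] == par[2] + 1

labels = label(ET.fromstring(DOC))
print(labels)
print(is_ancestor(labels["a"], labels["c"]), is_parent(labels["a"], labels["c"]))
```

Inserting a node shifts the ranks of many other nodes, which is one way to see why updates are expensive under this encoding.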

19
Native Storage
  • Subtree partition-based scheme: Natix (Kanne and
    Moerkotte 00)
  • A large XML tree is partitioned into small
    subtrees, each of which fits into one disk
    page
  • Proxy and aggregate nodes are introduced to
    connect different subtrees
  • Pro: easy to update and traverse
  • Con: complex update algorithm; frequent
    deletion/addition may deteriorate the page usage
    ratio

20
Native Storage
  • Binary tree-based scheme: Arb (Koch 03)
  • Convert a tree with an arbitrary number of
    children to a binary tree (first child translates
    to left child; next sibling translates to right
    child)
  • Tree nodes are stored in document order
  • Each node has 2 bits indicating whether it has a
    left child and a right child
  • Pro: easy to do depth-first search (DFS)
    traversal
  • Con: inefficient to do next_sibling navigation
    and hard to update
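
A sketch of the first-child/next-sibling encoding, assuming each stored node is a (tag, has_left, has_right) triple kept in document order. The function names and the sample document are hypothetical, and the actual Arb file layout is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
DOC = "<a><b><c/><f/></b><d/><e/></a>"

def encode(root):
    """List (tag, has_left, has_right) per node in document order.

    has_left: the node has a first child. has_right: it has a next sibling.
    """
    out = []
    def visit(node, has_next_sibling):
        children = list(node)
        out.append((node.tag, len(children) > 0, has_next_sibling))
        for i, child in enumerate(children):
            visit(child, i + 1 < len(children))
    visit(root, False)
    return out

def next_sibling(encoded, i):
    """Index of node i's next sibling, or None if it has none."""
    _, has_left, has_right = encoded[i]
    if not has_right:
        return None
    k = i + 1
    if has_left:
        # The right child is stored after the entire left subtree, so every
        # node of that subtree has to be scanned and skipped.
        need = 1
        while need > 0:
            _, left, right = encoded[k]
            need += left + right - 1
            k += 1
    return k

encoded = encode(ET.fromstring(DOC))
print(encoded)
print(next_sibling(encoded, 1))
```

The scan inside next_sibling is linear in the size of the skipped subtree, which matches the drawback listed above, while a plain DFS is just a left-to-right read of the list.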

21
Native Storage
  • String-based scheme: NoK (Zhang 04)
  • Convert a tree to a parenthesized string
  • E.g., a having b and c as children is converted
    to ab)c)), by DFS of the tree with )
    representing end-of-subtree
  • Tree can be reconstructed from the string
  • A long string can be cut into substrings that
    fit into disk pages
  • Page header can contain simple statistics to
    expedite next_sibling navigation
  • Pro: particularly optimized for DFS navigational
    evaluation plan
  • Con: inefficient for breadth-first search
    (BFS)
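
The conversion in both directions can be sketched in a few lines. This assumes single-character labels and represents a tree as a (label, children) pair; both are simplifications made here, and paging and page headers are left out.

```python
def to_string(tree):
    """Serialize (label, children) by DFS, writing ')' at each end-of-subtree."""
    label, children = tree
    return label + "".join(to_string(child) for child in children) + ")"

def from_string(s):
    """Rebuild the tree; assumes single-character labels other than ')'."""
    root, stack = None, []
    for ch in s:
        if ch == ")":
            node = stack.pop()              # this subtree is complete
            if stack:
                stack[-1][1].append(node)   # attach it to its parent
            else:
                root = node
        else:
            stack.append((ch, []))          # open a new subtree
    return root

tree = ("a", [("b", []), ("c", [])])
encoded = to_string(tree)
print(encoded)
print(from_string(encoded) == tree)
```

Reading the string left to right visits nodes in DFS order, whereas finding all nodes of one level means jumping across matching parentheses, which is why BFS is the weak case.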

22
Hybrid of Relational and Native Storage
  • All major commercial RDBMS vendors (IBM, Oracle,
    Microsoft and Sybase) support an XML type in their
    RDBMSs
  • A table can have a column whose type is XML
  • When inserting a tuple in the table, the XML
    field could be an XML document
  • XML documents are stored natively

23
Hybrid of Relational and Native Storage
  • IBM DB2 UDB
  • System RX: XML storage is similar to Natix
  • Microsoft SQL Server
  • Uses BLOB (binary large object) to represent XML
    documents
  • Oracle
  • Can use multiple formats
  • CLOB (character large object)
  • Serialized object
  • Shredded relational table

24
Agenda
  • XML Storage
  • XML Path Query Processing
  • XML Optimization

25
XML Path Processing
  • Extended Relational Approach
  • Translate XML queries to SQL statements
  • Native Approach (may be based on extended
    relational storage)
  • Join-based approach
  • Navigational approach
  • Hybrid approach

26
Extended Relational Query Processing
  • Regular expression based approach: XRel
    (Yoshikawa et al. 01)
  • Linear path expressions (without branches) are
    translated to regular expressions on strings
    (rooted paths)
  • Use the LIKE predicate in SQL to evaluate
    regular expressions
  • Pro: easy to implement
  • Con: cannot answer branching path queries

27
Extended Relational Query Processing
  • Dynamic Interval based approach: DI (DeHaan et
    al. 03)
  • Use the node labeling (begin,end) interval
    storage scheme
  • Dynamically calculate (begin,end) intervals for
    resulting nodes given a path/FLWOR expression
  • Pro: can handle all types of queries including
    FLWOR expressions
  • Con: inefficient for answering complex path
    queries

28
Native Path Query Processing
  • Merge-Join based approach: Multi-predicate Merge
    Join (MPMGJN) algorithm (Zhang et al. 01)
  • Modify the merge join algorithm to reduce
    unnecessary comparisons
  • Keep the position p of the last successful
    comparison in the right input stream
  • The next item from the left input stream starts
    scanning from position p.
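
The position-keeping idea can be sketched as below. This is a simplified illustration in the spirit of the merge join described on this slide, not the published algorithm verbatim. It assumes each node carries a (begin, end) region interval in which begin is taken when the element opens and end when it closes, so a descendant's interval nests inside its ancestor's, and that both inputs are sorted by begin. The sample intervals are invented.

```python
def mpmgjn(ancestors, descendants):
    """Join two (begin, end) lists sorted by begin; return (anc, desc) pairs."""
    result = []
    p = 0  # where the scan of `descendants` restarts for the next ancestor
    for a in ancestors:
        # Items that begin before `a` cannot lie inside `a` or inside any
        # later ancestor, so the restart position moves past them for good.
        while p < len(descendants) and descendants[p][0] < a[0]:
            p += 1
        q = p
        # Scan forward only while items can still fall inside a's interval.
        while q < len(descendants) and descendants[q][0] < a[1]:
            if descendants[q][1] < a[1]:
                result.append((a, descendants[q]))
            q += 1
    return result

# Invented intervals: the second ancestor is nested inside the first.
ancestors = [(1, 12), (4, 9), (13, 16)]
descendants = [(2, 3), (5, 6), (7, 8), (10, 11), (14, 15)]
pairs = mpmgjn(ancestors, descendants)
print(pairs)
```

Because the nested ancestor (4, 9) restarts from p rather than from where the previous scan ended, some descendants are read more than once.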

29
Native Path Query Processing
  • Stack-based Structural Join (Wu et al. 02)
  • Improve the MPMGJN algorithm
  • Do not look back but keep all ancestors in a
    stack
  • When comparing the new item, just compare it with
    the top of the stack
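
A minimal sketch of a stack-based ancestor-descendant join, assuming (begin, end) region intervals where begin is taken when an element opens and end when it closes, and both inputs sorted by begin. It follows the general idea on this slide and is not claimed to be the exact published algorithm. The sample intervals are invented.

```python
def stack_join(ancestors, descendants):
    """Return (anc, desc) pairs, reading each input list exactly once."""
    result, stack = [], []
    i = j = 0
    while j < len(descendants):
        d = descendants[j]
        if i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            # Drop stack entries that end before `a` begins; the entries
            # that remain form a chain of nested ancestors.
            while stack and stack[-1][1] < a[0]:
                stack.pop()
            stack.append(a)
            i += 1
        else:
            # Only the top needs checking: once it contains d, every entry
            # below it contains d as well.
            while stack and stack[-1][1] < d[0]:
                stack.pop()
            result.extend((a, d) for a in stack)
            j += 1
    return result

ancestors = [(1, 12), (4, 9), (13, 16)]
descendants = [(2, 3), (5, 6), (7, 8), (10, 11), (14, 15)]
pairs = stack_join(ancestors, descendants)
print(pairs)
```

Output comes out ordered by descendant, and neither input is rescanned, since the stack remembers the ancestors that are still open.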

30
Native Path Query Processing
  • Holistic Twig Join (Bruno et al. 02)
  • Improve the stack-based structural join algorithm
  • Use one join algorithm for the whole path
    expression instead of one join for one step
  • Reduce the overhead to produce and store
    intermediate results

31
Native Path Query Processing
  • Natix (Brantner et al. 05)
  • Translate each step into a logical navigational
    operator Unnest-Map
  • Each unnest-map operator is translated into a
    physical operator that performs tree traversal on
    the Natix storage
  • Physical optimization can be performed on the
    physical navigational operators to reduce
    cross-cluster I/O.

32
Native Path Query Processing
  • IBM DB2 XNav (Josifovski et al. 04)
  • XML path expressions are translated into automata
  • The automaton is constructed dynamically while
    traversing the XML tree in DFS
  • Physical I/O can be optimized by navigating to
    next_sibling without traversing the whole subtree

33
Native Path Query Processing
  • Tree automata (Koch 03)
  • The tree automaton needs two passes over the tree
  • The first traversal is a bottom-up deterministic
    tree automaton to determine which states are
    reachable
  • The second traversal is a top-down tree automaton
    to prune the reachable states and compute
    predicates.

34
Hybrid Processing
  • BlossomTree (Zhang 04, Zhang 05)
  • Navigational approach is efficient for
    parent-child navigation
  • Join-based approach is efficient for
    ancestor-descendant
  • BlossomTree approach identifies sub-expressions,
    Next-of-Kin (NoK), that are efficient for
    navigational approach.
  • Use navigational approach for NoK subexpressions
    and use structural joins to join intermediate
    results

35
XML Indexing
  • Structural Index
  • Clustering tree nodes by their structural
    similarity (e.g., bisimilarity and FB
    bisimilarity)
  • Index is a graph, in which each vertex is an
    equivalence class of similar XML tree nodes
  • Path query evaluation amounts to navigational
    evaluation on the graph

36
Agenda
  • XML Storage
  • XML Path Query Processing
  • XML Optimization

37
Overview of Cost-based Optimization
  • Query Optimization depends on
  • How much knowledge we have about the data
  • How intelligently we can make use of that
    knowledge (within a time constraint)
  • The cost of a plan is heavily dependent on
  • The cost model of each operator
  • The cardinality/selectivity of each operator

38
Cardinality Estimation
  • Full path summarization: DataGuide (Goldman 97)
    and PathTree (Aboulnaga 01)
  • Summarize all distinct paths in XML documents in
    a graph
  • Cardinality information is annotated on graph
    vertices
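
A full path summary can be approximated by counting elements per distinct rooted label path. The sketch below stores the summary as a flat counter keyed by path string instead of a graph, which is a simplification made here; the sample document is invented.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented sample document.
DOC = ("<bib><book><title/><author/><author/></book>"
       "<book><title/></book><article><title/></article></bib>")

def path_summary(root):
    """Count how many elements share each distinct rooted label path."""
    counts = Counter()
    def visit(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            visit(child, path)
    visit(root, "")
    return counts

summary = path_summary(ET.fromstring(DOC))

# Cardinality of //title: add up every summarized path ending in /title.
title_count = sum(n for path, n in summary.items() if path.endswith("/title"))
print(dict(summary))
print(title_count)
```

For simple paths without predicates the summary gives exact counts, at the cost of one entry per distinct path in the document.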

39
Cardinality Estimation
  • Partial path summarization: Markov Table
    (Aboulnaga 01)
  • Keep sub-paths and cardinality information in a
    table
  • Cardinalities for longer paths are calculated
    using partial paths.
  • Can use additional compression methods to
    accommodate Internet-scale databases
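
One way to combine partial paths is a first-order Markov assumption: the estimate for t1/t2/.../tn is f(t1/t2) multiplied by f(ti/ti+1) / f(ti) for each later step. The sketch below applies that formula to a hypothetical table holding counts for paths of length 1 and 2. The table contents and the function name estimate are made up for illustration, and the compression methods mentioned above are not modeled.

```python
def estimate(path, table):
    """Estimate the cardinality of a label path from length-1 and length-2 counts."""
    tags = path.split("/")
    if len(tags) == 1:
        return float(table.get(path, 0))
    # Start from the count of the first two-step sub-path.
    est = float(table.get(tags[0] + "/" + tags[1], 0))
    for i in range(1, len(tags) - 1):
        denom = table.get(tags[i], 0)
        if denom == 0:
            return 0.0
        # Fraction of tags[i] elements that contribute a tags[i+1] child,
        # assumed independent of how tags[i] was reached.
        est *= table.get(tags[i] + "/" + tags[i + 1], 0) / denom
    return est

# Hypothetical Markov table: counts of single tags and of two-step paths.
table = {
    "bib": 1, "book": 2, "article": 1, "title": 3, "author": 2,
    "bib/book": 2, "bib/article": 1,
    "book/title": 2, "book/author": 2, "article/title": 1,
}
print(estimate("bib/book/author", table))
print(estimate("bib/article/title", table))
```

The estimate is exact only when the independence assumption holds; otherwise it is an approximation whose error depends on the data.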

40
Cardinality Estimation
  • Structural clustered summarization: XSketch
    (Polyzotis 02) and TreeSketch (Polyzotis 04)
  • Similar idea to the clustering-based index
  • XSketch uses forward and backward stability, and
    TreeSketch uses count stability, as the
    similarity measure
  • Heuristics to reduce graph to fit memory budget

41
Cardinality Estimation
  • Decompression-based approach: XSEED (Zhang 06)
  • XML documents are compressed into a small kernel
    with edge cardinality labels
  • Kernel can be decompressed into an XML document
    with cardinality annotations
  • Navigational path operator can be reused on the
    decompressed XML document for cardinality
    estimation

42
Cost Modeling
  • Statistical Learning Cost Model: COMET (Zhang
    05)
  • Relational operator cost modeling is performed by
    analyzing the source code
  • XML operators are much more complex than
    relational operators; therefore an analytical
    approach is too time-consuming
  • Statistical learning approach needs a training
    set of queries and learns the cost model from the
    input parameters and real costs.