Title: XML Data Management
1XML Data Management
- Ning Zhang
- University of Waterloo
2What is XML?
- XML documents have elements and attributes
- Elements (indicated by begin end tags)
- can be nested but cannot interleave each other
- can have arbitrary number of sub-elements
- can have free text as values
- ltchap title Introduction To XMLgt
- some free text
- ltsect title What is XML?gt lt/sectgt
- ltsect title Elementsgt lt/sectgt
- ltsect title Why XML?gt lt/sectgt
- possibly more free text
- lt/chapgt
end element
attribute
begin element
Elements w/ same name can be nested
3Why XML?
- Database Side XML is a new way to organize data
- Relational databases organize data in tables
- XML documents organize data in ordered trees
- Document Side XML is a semantic markup language
- HTML focuses on presentation
- XML focuses on semantics/structure in the data
chap
sect
sect
sect
sect
sect
sect
lthtmlgt lth1gt Chapter 1 lt/h1gt some free
text lth2gt Section 1 lt/h2gt some more free
text lth3gt Section 1.1 lt/h3gt lt/htmlgt
4Data Management -- Relational vs. XML
- Relational data are well organized fully
structured (more strict) - E-R modeling to model the data structures in the
application - E-R diagram is converted to relational tables and
integrity constraints (relational schemas) - XML data are semi-structured (more flexible)
- Schemas may be unfixed, or unknown (flexible
anyone can author a document) - Suitable for data integration (data on the web,
data exchange between different enterprises).
5More about Relational vs. XML
- XML is not meant to replace relational database
systems - RDBMSs are well suited to OLTP applications
(e.g., electronic banking) which has 1000 small
transactions per minute. - XML is suitable data exchange over heterogeneous
data sources (e.g., Web services) that allow them
to talk.
6When should we use XML? (1)
- Document representation language
- XML can be transformed to other format (e.g., by
XSLT) - XML ? HTML
- XML ? LaTeX, bibTeX
- XML ? PDF
- DocBook (standard schema for authoring
document/book)
7When should we use XML? (2)
- Data integration and exchange language
- Web services (SOAP, WSDL, UDDI)
- Amazon.com, eBay, Microsoft MapPoint,
- Domain specific data exchange schemas (gt1000)
- legal document exchange language
- business information exchange
- RSS XML news feed
- CNN, slashdot, blogs,
8When should we use XML? (3)
- Any data having hierarchical structure
- Email
- Header from, to, cc, bcc
- Body my message, replied email
- Network log file
- IP address, time, request type, error code
- Advances of translating to XML
- Exploit high-level declarative XML query languages
9XML Databases
- Advantages
- Manage large volume of XML data
- Provide high-level declarative language
- Efficiently evaluate complex queries
- XML Data Management Issues
- XML Data Model
- XML Query Languages
- XML Query Processing and Optimization
10XML Data Model
- Hierarchical data model
- An XML document is an ordered tree
- Nodes in the tree are labeled with element names.
- Element nesting relationship corresponds to
parent-child relationship
chap
sect
_at_title
some free text
Introduction to XML
sect
_at_title
_at_title
What is XML?
11XML Schema Languages
- Schema languages defines the structure
- Document Type Definition (DTD)
- Context-free grammar
- Structurally richer than relational schema
definition language because of recursion. - XML Schema
- Also context-free
- Richer than DTD because of data types definition
(integer, date, sequence).
12XML Query Languages
- XPath
- 13 axes (navigation directions in the tree)
- child (/), descendant (//), following-sibling,
following - NameTest, predicates
- E.g,
- doc(bib.xml)//booktitleHarry Potter/ISBN
- XQuery (superset of XPath)
- FLWOR expression
- for x in doc(bib.xml)//booktitle
- Harry Potter/ISBN,
- y in doc(imdb.xml)//movie
- where y//novel/ISBN x
- return y//title
13Important Problems in XML Data Management
What follows is not covered in COSC 3480!!
- How to store XML data?
- How to efficiently evaluate XPath/XQuery
languages? - Efficient physical operators
- Query optimization
- How to support XML update languages?
- How to support transaction management?
- Recovery management?
14Agenda
- XML Storage
- XML Path Query Processing
- XML Optimization
15XML Storage
- Extended Relational Storage
- Convert XML documents to relational tables
- Native Storage
- Treat XML elements as first-class citizens
- Hybrid of Relational and Native Storage
- XML documents can be stored in columns of
relational tables (XML typed column)
16Extended Relational Storage
- Edge-based Storage Scheme (Florescu and Kossman
99) - Each node has an ID
- Each tuple in the edge table consists of
(parentID, childID, type of data, reference to
data) - Pro easy to convert XML to relational tables
- Con impossible to answer path queries such as
//a//b using SQL (needs transitive closure
operator)
17Extended Relational Storage
- Path-based Storage Scheme XRel (Yoshikawa et al.
01) - Each node corresponds to a tuple in the table
- Each tuple keeps a rooted path to the node (e.g.,
/article/chap/sec/sec/_at_title) - Pro also easy to convert XML to tables
- Con answering path queries, such as //a//b,
needs expensive string pattern matching
18Extended Relational Storage
- Node-based Storage Scheme Niagara, TIMBER etc.
(Zhang et al. 01) - Each node is encoded with a begin and end
integers. - Begin corresponds to the order of in-order
traversal of tree end corresponds to the order
in post-order traversal. - Pro checking parent-child/ancestor-descendant
relationships is efficient (constant time using
begin and end) - Con inefficient for updating XML
19Native Storage
- Subtree partition-based scheme Natix (Kanne and
Moerkotte 00) - A large XML tree is partitioned into small
subtrees, each of which can be fit into one disk
page - Introducing aproxy and aggregate nodes to connect
different subtrees - Pro easy to update and traversal
- Con complex update algorithm frequent
deletion/addition may deteriorate page usage ratio
20Native Storage
- Binary tree-based scheme Arb (Koch 03)
- Convert a tree with arbitrary number of children
to a binary tree (first child translates to left
child next sibling translate to right child) - Tree nodes are stored in document order
- Each node has 2 bits indicating whether it has a
left right child - Pro easy to do depth-first search (DFS)
traversal - Con inefficient to do next_sibling navigation
and hard to update
21Native Storage
- String-based scheme NoK (Zhang 04)
- Convert a tree to a parenthesized string
- E.g., a having b and c as children is converted
to ab)c)), by DFS of the tree and )
representing end-of-subtree - Tree can be reconstructed by the string
- A long string can be cut into substrings and fit
them into disk pages - Page header can contains simple statistics to
expedite next_sibling navigation - Pro particularly optimized for DFS navigational
evaluation plan - Con inefficient to do for breadth-first search
(BFS)
22Hybrid of Relational and Native Storage
- All major commercial RDBMS vendaors (IBM, Oracle,
Microsoft and Sybase) support XML type in their
RDBMS - A table can have a column whose type is XML
- When inserting a tuple in the table, the XML
field could be an XML document - XML documents are stored natively
23Hybrid of Relational and Native Storage
- IBM DB2 UDB
- System RX XML storage is similar to Natix
- Microsoft SQL Server
- Uses BLOB (binary large object) to represent XML
documents - Oracle
- Can use multiple format
- CLOB (character large object)
- Serialized object
- Shredded relational table
24Agenda
- XML Storage
- XML Path Query Processing
- XML Optimization
25XML Path Processing
- Extended Relational Approach
- Translate XML queries to SQL statements
- Native Approach (may be based on extended
relational storage) - Join-based approach
- Navigational approach
- Hybrid approach
26Extended Relational Query Processing
- Regular expression based approach XRel
(Yoshikawa et al. 01) - Linear path expression (without branches) are
translated to regular expressions on strings
(rooted paths) - Use the like predicate in SQL to evaluate
regular expressions - Pro easy to implement
- Con cannot answer branching path queries
27Extended Relational Query Processing
- Dynamic Interval based approach DI (DeHaan et
al. 03) - Use the node labeling (begin,end) interval
storage scheme - Dynamically calculate (begin,end) intervals for
resulting nodes give a path/FLWOR expression - Pro can handle all types of queries including
FLWOR expression - Con inefficient for answering complex path
queries
28Native Path Query Processing
- Merge-Join based approach Multi-predicate Merge
Join (MPMGJN) algorithm (Zhang et al. 01) - Modify the merge join algorithm to reduce
unnecessary comparisons - Keep to position p of the last successful
comparisons in the right input stream - The next item from the left input stream starts
scanning from position p.
29Native Path Query Processing
- Stack-based Structural Join (Wu et al. 02)
- Improve the MPMGJN algorithm
- Do not look back but keep all ancestors in a
stack - When comparing the new item, just compare it with
the top of the stack
30Native Path Query Processing
- Holistic Twig Join (Bruno et al. 02)
- Improve the stack-based structure algorithm
- Use one join algorithm for the whole path
expression instead of one join for one step - Reduce the overhead to produce and store
intermediate results
31Native Path Query Processing
- Natix (Brantner et al. 05)
- Translate each step into a logical navigational
operator Unnest-Map - Each unnest-map operator is translated into a
physical operator that performs tree traversal on
the Natix storage - Physical optimization can be performed on the
physical navigational operators to reduce
cross-cluster I/O.
32Native Path Query Processing
- IBM DB2 XNav (Josifovski et al. 04)
- XML path expressions are translated into automata
- The automaton is constructed dynamically while
traversing the XML tree in DFS - Physical I/O can be optimized by navigating to
next_sibling without traversing the whole subtree
33Native Path Query Processing
- Tree automata (Koch 03)
- The tree automaton needs two passes of tree
- The first traversal is a bottom-up deterministic
tree automaton to determine which states are
reachable - The second traversal is a top-down tree automaton
to prune the reachable states and compute
predicates.
34Hybrid Processing
- BlossomTree (Zhang 04, Zhang05)
- Navigational approach is efficient for
parent-child navigation - Join-based approach is efficient for
ancestor-descendant - BlossomTree approach identifies sub-expressions,
Next-of-Kin (NoK), that are efficient for
navigational approach. - Use navigational approach for NoK subexpressions
and use structural joins to join intermediate
results
35XML Indexing
- Structural Index
- Clustering tree nodes by their structural
similarity (e.g., bisimilarity and FB
bisimilarity) - Index is a graph, in which each vertex is an
equivalence class of similar XML tree nodes - Path query evaluation amounts to navigational
evaluation on the graph
36Agenda
- XML Storage
- XML Path Query Processing
- XML Optimization
37Overview of Cost-based Optimization
- Query Optimization depends on
- How much knowledge about the data we have?
- How intelligent we can make use of the knowledge
(within a time constraint)? - The cost of a plan is heavily dependent on
- The cost model of each operator
- The cardinality/selectivity of each operator
38Cardinality Estimation
- Full path summarization DataGuide (Goldman 97)
and PathTree (Aboulnaga 01) - Summarize all distinct paths in XML documents in
a graph - Cardinality information is annotated on graph
vertices
39Cardinality Estimation
- Partial path summarization Markov Table
(Aboulnaga 01) - Keep sub-paths and cardinality information in a
table - Cardinality for longer paths are calculated using
partial paths. - Can use additional compression methods to
accommodate Internet scale database
40Cardinality Estimation
- Structural clustered summarization XSketch
(Neoklis 02) and TreeSketch (Neoklis 04) - Similar idea as clustered-based index
- XSketch uses forward and backward stability, and
TreeSketch uses count stability as similarity
measurement - Heuristics to reduce graph to fit memory budget
41Cardinality Estimation
- Decompression-based approach XSEED (Zhang 06)
- XML documents are compressed into a small kernel
with edge cardinality labels - Kernel can be decompressed into XML document with
cardinality annotations - Navigational path operator can be reused on the
decompressed XML document for cardinality
estimation
42Cost Modeling
- Statistical Learning Cost Model COMET (Zhang
05) - Relational operator cost modeling is performed by
analyzing the source code - XML operators are much more complex than
relational operators therefore analytical
approach is too time-consuming - Statistical learning approach needs a training
set of queries and learn the cost model from the
input parameters and real cost.