XML Data Management - PowerPoint PPT Presentation

About This Presentation

Slides: 43

Provided by: www249

Learn more at: https://www2.cs.uh.edu

Transcript and Presenter's Notes

Title: XML Data Management

1
XML Data Management

Ning Zhang
University of Waterloo

2
What is XML?

XML documents have elements and attributes
Elements (indicated by begin and end tags)
can be nested but cannot interleave with each other
can have an arbitrary number of sub-elements
can have free text as values
<chap title="Introduction To XML">
some free text
<sect title="What is XML?"> </sect>
<sect title="Elements"> </sect>
<sect title="Why XML?"> </sect>
possibly more free text
</chap>

[Figure callouts on the example above: begin element, attribute, end element]
Elements with the same name can be nested
3
Why XML?

Database side: XML is a new way to organize data
Relational databases organize data in tables
XML documents organize data in ordered trees
Document side: XML is a semantic markup language
HTML focuses on presentation
XML focuses on semantics/structure in the data

[Figure: tree diagram with a chap root and nested sect nodes]
<html> <h1> Chapter 1 </h1> some free
text <h2> Section 1 </h2> some more free
text <h3> Section 1.1 </h3> </html>
4
Data Management -- Relational vs. XML

Relational data are well organized: fully
structured (more strict)
E-R modeling to model the data structures in the
application
E-R diagram is converted to relational tables and
integrity constraints (relational schemas)
XML data are semi-structured (more flexible)
Schemas may be unfixed or unknown (flexible:
anyone can author a document)
Suitable for data integration (data on the web,
data exchange between different enterprises).

5
More about Relational vs. XML

XML is not meant to replace relational database
systems
RDBMSs are well suited to OLTP applications
(e.g., electronic banking), which have 1000 small
transactions per minute.
XML is suitable for data exchange between
heterogeneous data sources (e.g., Web services),
allowing them to talk to each other.

6
When should we use XML? (1)

Document representation language
XML can be transformed to other formats (e.g., by
XSLT)
XML → HTML
XML → LaTeX, BibTeX
XML → PDF
DocBook (standard schema for authoring
document/book)

7
When should we use XML? (2)

Data integration and exchange language
Web services (SOAP, WSDL, UDDI)
Amazon.com, eBay, Microsoft MapPoint,
Domain-specific data exchange schemas (>1000)
legal document exchange language
business information exchange
RSS: XML news feed
CNN, slashdot, blogs,

8
When should we use XML? (3)

Any data having hierarchical structure
Email
Header: from, to, cc, bcc
Body: my message, replied email
Network log file
IP address, time, request type, error code
Advantages of translating to XML
Exploit high-level declarative XML query languages

9
XML Databases

Advantages
Manage large volumes of XML data
Provide high-level declarative language
Efficiently evaluate complex queries
XML Data Management Issues
XML Data Model
XML Query Languages
XML Query Processing and Optimization

10
XML Data Model

Hierarchical data model
An XML document is an ordered tree
Nodes in the tree are labeled with element names.
Element nesting relationship corresponds to
parent-child relationship

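[Figure: the example document as an ordered tree. The root chap has an @title attribute ("Introduction to XML"), a text child ("some free text"), and sect children, each with its own @title (e.g., "What is XML?")]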
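The ordered-tree view can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration using the chap/sect example from the earlier slide; the walk function is a hypothetical helper, and attributes are shown as a dictionary attached to each element node rather than as separate tree nodes.

```python
import xml.etree.ElementTree as ET

DOC = """<chap title="Introduction To XML">
some free text
<sect title="What is XML?"></sect>
<sect title="Elements"></sect>
<sect title="Why XML?"></sect>
possibly more free text
</chap>"""

def walk(elem, depth=0, out=None):
    """Pre-order (document-order) walk; returns (depth, tag, attributes) triples."""
    if out is None:
        out = []
    out.append((depth, elem.tag, dict(elem.attrib)))
    for child in elem:  # children are kept in document order
        walk(child, depth + 1, out)
    return out

root = ET.fromstring(DOC)
nodes = walk(root)
for depth, tag, attrs in nodes:
    # Indentation mirrors the parent-child (nesting) relationship.
    print("  " * depth + tag, attrs)
```

The nesting depth plays the role of the parent-child relationship, and the order of the children is preserved, which is what makes the tree "ordered".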

11
XML Schema Languages

Schema languages define the structure of XML documents
Document Type Definition (DTD)
Context-free grammar
Structurally richer than relational schema
definition language because of recursion.
XML Schema
Also context-free
Richer than DTD because of data type definitions
(integer, date, sequence).

12
XML Query Languages

XPath
13 axes (navigation directions in the tree)
child (/), descendant (//), following-sibling,
following
NameTest, predicates
E.g.,
doc("bib.xml")//book[title="Harry Potter"]/ISBN
XQuery (superset of XPath)
FLWOR expression
for $x in doc("bib.xml")//book[title="Harry Potter"]/ISBN,
    $y in doc("imdb.xml")//movie
where $y//novel/ISBN = $x
return $y//title
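To make the FLWOR example concrete, here is a rough Python emulation using the limited XPath subset of xml.etree.ElementTree. The contents of bib.xml and imdb.xml are not shown in the slides, so the two small documents below are invented purely for illustration, and movies_based_on is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the slides do not show bib.xml or imdb.xml.
BIB = ET.fromstring("""
<bib>
  <book><title>Harry Potter</title><ISBN>111</ISBN></book>
  <book><title>Other Book</title><ISBN>222</ISBN></book>
</bib>""")
IMDB = ET.fromstring("""
<imdb>
  <movie><title>HP Movie</title><novel><ISBN>111</ISBN></novel></movie>
  <movie><title>Unrelated Movie</title><novel><ISBN>333</ISBN></novel></movie>
</imdb>""")

def movies_based_on(book_title):
    """Nested-loop emulation of the for/where/return clauses."""
    titles = []
    # for $x in doc("bib.xml")//book[title="..."]/ISBN
    for x in BIB.findall(f".//book[title='{book_title}']/ISBN"):
        # $y in doc("imdb.xml")//movie
        for y in IMDB.findall(".//movie"):
            # where $y//novel/ISBN = $x
            if any(n.text == x.text for n in y.findall(".//novel/ISBN")):
                # return $y//title
                titles.extend(t.text for t in y.findall(".//title"))
    return titles

print(movies_based_on("Harry Potter"))
```

The nested loops are the naive evaluation strategy; a query processor is free to pick a better join order, which is the topic of the later slides.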

13
Important Problems in XML Data Management
What follows is not covered in COSC 3480!!

How to store XML data?
How to efficiently evaluate XPath/XQuery
queries?
Efficient physical operators
Query optimization
How to support XML update languages?
How to support transaction management?
Recovery management?

14
Agenda

XML Storage
XML Path Query Processing
XML Optimization

15
XML Storage

Extended Relational Storage
Convert XML documents to relational tables
Native Storage
Treat XML elements as first-class citizens
Hybrid of Relational and Native Storage
XML documents can be stored in columns of
relational tables (XML typed column)

16
Extended Relational Storage

Edge-based Storage Scheme (Florescu and Kossmann
99)
Each node has an ID
Each tuple in the edge table consists of
(parentID, childID, type of data, reference to
data)
Pro: easy to convert XML to relational tables
Con: path queries such as //a//b cannot be
answered in SQL without recursion (they need a
transitive closure operator)
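A small sketch of why //a//b needs transitive closure over an edge table. The table layout is simplified to (parent, child, name), which is an assumption and not the exact schema from the cited paper; the document and the helper a_descendant_b are hypothetical. The closure is computed with SQLite's WITH RECURSIVE, a recursive extension that goes beyond the basic select-project-join SQL the slide has in mind.

```python
import sqlite3

# Hypothetical document: r(1) -> a(2) -> x(3) -> b(4); a(2) -> b(5); r(1) -> b(6)
# Each tuple is (parent id, child id, child's element name); 0 marks "no parent".
EDGES = [(0, 1, "r"), (1, 2, "a"), (2, 3, "x"), (3, 4, "b"), (2, 5, "b"), (1, 6, "b")]

def a_descendant_b(edges):
    """Return ids of b nodes that have an a ancestor, i.e. the result of //a//b."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (parent INTEGER, child INTEGER, name TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", edges)
    rows = conn.execute("""
        WITH RECURSIVE reach(node) AS (
            -- children of every 'a' node
            SELECT e.child FROM edge e JOIN edge a ON a.child = e.parent
            WHERE a.name = 'a'
            UNION
            -- then children of anything already reached
            SELECT e.child FROM reach r JOIN edge e ON e.parent = r.node
        )
        SELECT e.child FROM reach r JOIN edge e ON e.child = r.node
        WHERE e.name = 'b' ORDER BY e.child
    """).fetchall()
    conn.close()
    return [child for (child,) in rows]

print(a_descendant_b(EDGES))
```

Without the recursive part, a fixed number of self-joins can only follow paths of a fixed length, so a descendant step of unknown depth cannot be expressed.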

17
Extended Relational Storage

Path-based Storage Scheme: XRel (Yoshikawa et al.
01)
Each node corresponds to a tuple in the table
Each tuple keeps a rooted path to the node (e.g.,
/article/chap/sec/sec/@title)
Pro: also easy to convert XML to tables
Con: answering path queries such as //a//b
needs expensive string pattern matching
18
Extended Relational Storage

Node-based Storage Scheme: Niagara, TIMBER, etc.
(Zhang et al. 01)
Each node is encoded with begin and end
integers.
Begin corresponds to the node's position in a
pre-order traversal of the tree; end corresponds
to its position in a post-order traversal.
Pro: checking parent-child/ancestor-descendant
relationships is efficient (constant time using
begin and end)
Con: inefficient for updating XML
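The constant-time check can be sketched as follows. This assumes one common variant of the encoding: a single counter is incremented at every start tag and every end tag, so sorting nodes by begin gives pre-order and sorting by end gives post-order. The helper names label and is_ancestor are hypothetical.

```python
import xml.etree.ElementTree as ET

def label(root):
    """Return (tag, begin, end) triples in document order.

    Assumption: one counter ticks at each start tag and each end tag.
    """
    labels = []
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        entry = [elem.tag, counter, None]  # begin assigned at the start tag
        labels.append(entry)
        for child in elem:
            visit(child)
        counter += 1
        entry[2] = counter                 # end assigned at the end tag

    visit(root)
    return [tuple(e) for e in labels]

def is_ancestor(a, d):
    """Constant-time test: d's interval lies strictly inside a's interval."""
    return a[1] < d[1] and d[2] < a[2]

a, b, c, d = label(ET.fromstring("<a><b><c/></b><d/></a>"))
print(a, b, c, d)
print(is_ancestor(a, c), is_ancestor(b, d))
```

The update problem is visible here too: inserting a new node shifts the counter for every node that follows it in document order.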

19
Native Storage

Subtree partition-based scheme: Natix (Kanne and
Moerkotte 00)
A large XML tree is partitioned into small
subtrees, each of which fits into one disk
page
Introduce proxy and aggregate nodes to connect
different subtrees
Pro: easy to update and traverse
Con: complex update algorithm; frequent
deletions/insertions may degrade the page usage ratio

20
Native Storage

Binary tree-based scheme: Arb (Koch 03)
Convert a tree with an arbitrary number of children
to a binary tree (first child translates to left
child; next sibling translates to right child)
Tree nodes are stored in document order
Each node has 2 bits indicating whether it has a
left and/or right child
Pro: easy to do depth-first search (DFS)
traversal
Con: inefficient to do next_sibling navigation
and hard to update
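The first-child/next-sibling conversion and the cost of next_sibling can be sketched as below. This only illustrates the idea on the slide and is not Arb's actual on-disk format; the record layout (tag, has_left, has_right) and the helper names are assumptions.

```python
import xml.etree.ElementTree as ET

def to_binary(root):
    """Nodes in document order as (tag, has_left, has_right).

    has_left  = node has a first child
    has_right = node has a next sibling
    """
    out = []

    def visit(elem, has_next_sibling):
        children = list(elem)
        out.append((elem.tag, bool(children), has_next_sibling))
        for i, child in enumerate(children):
            visit(child, i + 1 < len(children))

    visit(root, False)
    return out

def subtree_end(nodes, i):
    """Index just past the binary subtree rooted at nodes[i]."""
    _, has_left, has_right = nodes[i]
    j = i + 1
    if has_left:
        j = subtree_end(nodes, j)
    if has_right:
        j = subtree_end(nodes, j)
    return j

def next_sibling(nodes, i):
    """Index of the next sibling of nodes[i], or None if there is none."""
    _, has_left, has_right = nodes[i]
    if not has_right:
        return None
    # The whole left subtree (all descendants) must be skipped first.
    return subtree_end(nodes, i + 1) if has_left else i + 1

nodes = to_binary(ET.fromstring("<a><b><c/></b><d/></a>"))
print(nodes)
print(next_sibling(nodes, 1))
```

First-child navigation is a single step to the next record, while next_sibling has to scan past every descendant, which matches the pro and con listed on the slide.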

21
Native Storage

String-based scheme: NoK (Zhang 04)
Convert a tree to a parenthesized string
E.g., a having b and c as children is converted
to "ab)c))", by a DFS of the tree with ")"
representing end-of-subtree
The tree can be reconstructed from the string
A long string can be cut into substrings that
fit into disk pages
Page headers can contain simple statistics to
expedite next_sibling navigation
Pro: particularly optimized for DFS navigational
evaluation plans
Con: inefficient for breadth-first search
(BFS)
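The string encoding and its inverse can be sketched in a few lines. For brevity this assumes single-character element names and represents a tree as a (label, children) pair; both choices, and the function names, are illustrative assumptions rather than the NoK implementation.

```python
def to_string(tree):
    """DFS serialization: label, then children, then ')' for end-of-subtree."""
    label, children = tree
    return label + "".join(to_string(c) for c in children) + ")"

def from_string(s):
    """Rebuild the (label, children) tree from its parenthesized string."""
    stack = [("", [])]  # sentinel parent that collects the root
    for ch in s:
        if ch == ")":
            node = stack.pop()         # subtree finished
            stack[-1][1].append(node)  # attach it to its parent
        else:
            stack.append((ch, []))     # new node opens a subtree
    (root,) = stack[0][1]
    return root

tree = ("a", [("b", []), ("c", [])])
encoded = to_string(tree)
print(encoded)
print(from_string(encoded) == tree)
```

Because the string is laid out in DFS order, a depth-first evaluation reads it sequentially, whereas a breadth-first traversal has to jump back and forth across it.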

22
Hybrid of Relational and Native Storage

All major commercial RDBMS vendors (IBM, Oracle,
Microsoft and Sybase) support an XML type in their
RDBMS
A table can have a column whose type is XML
When inserting a tuple in the table, the XML
field could be an XML document
XML documents are stored natively

23
Hybrid of Relational and Native Storage

IBM DB2 UDB
System RX: XML storage is similar to Natix
Microsoft SQL Server
Uses BLOB (binary large object) to represent XML
documents
Oracle
Can use multiple formats
CLOB (character large object)
Serialized object
Shredded relational table

24
Agenda

XML Storage
XML Path Query Processing
XML Optimization

25
XML Path Processing

Extended Relational Approach
Translate XML queries to SQL statements
Native Approach (may be based on extended
relational storage)
Join-based approach
Navigational approach
Hybrid approach

26
Extended Relational Query Processing

Regular expression based approach: XRel
(Yoshikawa et al. 01)
Linear path expressions (without branches) are
translated to regular expressions on strings
(rooted paths)
Use the LIKE predicate in SQL to evaluate
regular expressions
Pro: easy to implement
Con: cannot answer branching path queries
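A minimal sketch of translating a linear path expression into a LIKE pattern over stored rooted paths. The "#/" step delimiter, the table layout, and the helper names are assumptions chosen so that a pattern cannot match part of an element name; they are not claimed to be XRel's exact encoding. Note that SQLite's LIKE is case-insensitive for ASCII by default, unlike XML names, so the sample data sticks to lowercase.

```python
import re
import sqlite3

def encode_path(path):
    """'/article/chap' -> '#/article#/chap' (each step prefixed with '#/')."""
    return "".join("#/" + step for step in path.strip("/").split("/"))

def like_pattern(query):
    """Translate a linear path such as '/article//sec/@title' into a LIKE pattern."""
    parts = []
    for axis, name in re.findall(r"(//|/)([^/]+)", query):
        name = name.replace("_", r"\_")  # '_' is a LIKE wildcard, so escape it
        parts.append(("#%/" if axis == "//" else "#/") + name)
    return "".join(parts)

def match_paths(paths, query):
    """Return the stored rooted paths selected by the linear path query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (path TEXT, encoded TEXT)")
    conn.executemany("INSERT INTO node VALUES (?, ?)",
                     [(p, encode_path(p)) for p in paths])
    rows = conn.execute(
        "SELECT path FROM node WHERE encoded LIKE ? ESCAPE '\\' ORDER BY rowid",
        (like_pattern(query),)).fetchall()
    conn.close()
    return [p for (p,) in rows]

PATHS = ["/article/chap/sec/sec/@title", "/article/chap/@title", "/article/chap/sec"]
print(like_pattern("//chap//sec"))
print(match_paths(PATHS, "/article//sec/@title"))
```

A branching query such as //chap[sec]/title relates two different nodes, so it cannot be decided by matching a single rooted-path string, which is the limitation noted on the slide.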

27
Extended Relational Query Processing

Dynamic Interval based approach: DI (DeHaan et
al. 03)
Use the node labeling (begin,end) interval
storage scheme
Dynamically calculate (begin,end) intervals for
resulting nodes given a path/FLWOR expression
Pro: can handle all types of queries including
FLWOR expressions
Con: inefficient for answering complex path
queries

28
Native Path Query Processing

Merge-Join based approach: Multi-predicate Merge
Join (MPMGJN) algorithm (Zhang et al. 01)
Modify the merge join algorithm to reduce
unnecessary comparisons
Keep the position p of the last successful
comparison in the right input stream
The next item from the left input stream starts
scanning from position p.
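The idea can be sketched as a simplified merge join over (begin, end) intervals. This is a loose reading of the slide and not the published algorithm: it assumes both inputs are sorted by begin, that begin and end come from one counter over start and end tags, and the function name is hypothetical.

```python
def mpmgjn_sketch(ancestors, descendants):
    """Join two begin-sorted lists of (begin, end) on the containment predicate.

    Returns (ancestor, descendant) pairs where the descendant's interval
    lies inside the ancestor's interval.
    """
    out = []
    p = 0  # saved position in the right (descendant) input
    for a in ancestors:
        # Items that begin before a can never match a or any later ancestor,
        # because later ancestors begin even later: skip them for good.
        while p < len(descendants) and descendants[p][0] <= a[0]:
            p += 1
        # Scan forward from p without moving p: a nested ancestor that comes
        # next may need to see the same descendants again.
        q = p
        while q < len(descendants) and descendants[q][0] < a[1]:
            out.append((a, descendants[q]))
            q += 1
    return out

# Hypothetical data for //a//b: a(1,10) contains a(2,7), b(3,4), b(5,6), b(8,9).
A = [(1, 10), (2, 7)]
B = [(3, 4), (5, 6), (8, 9)]
pairs = mpmgjn_sketch(A, B)
print(pairs)
```

The rescans from p for nested ancestors are the "looking back" that the stack-based algorithms avoid.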

29
Native Path Query Processing

Stack-based Structural Join (Wu et al. 02)
Improves on the MPMGJN algorithm
Do not look back but keep all ancestors in a
stack
When comparing the new item, just compare it with
the top of the stack
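A sketch of the stack-based idea, under the same assumptions as before: inputs are lists of (begin, end) intervals sorted by begin, with begin and end drawn from a single counter. The function name and the sample data are hypothetical, and the output order (grouped by descendant) is one of several possible choices.

```python
def stack_structural_join(ancestors, descendants):
    """Single forward pass over both inputs; the stack holds nested ancestors."""
    out = []
    stack = []
    ai = di = 0
    inf = float("inf")
    while di < len(descendants) and (ai < len(ancestors) or stack):
        next_a = ancestors[ai][0] if ai < len(ancestors) else inf
        next_begin = min(next_a, descendants[di][0])
        # Only the top needs checking: entries below it enclose it,
        # so if the top is still open, they are too.
        while stack and stack[-1][1] < next_begin:
            stack.pop()
        if next_a < descendants[di][0]:
            stack.append(ancestors[ai])  # nested inside the current top (if any)
            ai += 1
        else:
            d = descendants[di]
            out.extend((a, d) for a in stack)  # every stacked node contains d
            di += 1
    return out

# Hypothetical data for //a//b: a(1,10) contains a(2,7), b(3,4), b(5,6), b(8,9).
A = [(1, 10), (2, 7)]
B = [(3, 4), (5, 6), (8, 9)]
result = stack_structural_join(A, B)
print(result)
```

Each input item is read once and pushed or popped at most once, so no part of either list is rescanned.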

30
Native Path Query Processing

Holistic Twig Join (Bruno et al. 02)
Improves on the stack-based structural join algorithm
Use one join algorithm for the whole path
expression instead of one join for one step
Reduce the overhead to produce and store
intermediate results

31
Native Path Query Processing

Natix (Brantner et al. 05)
Translate each step into a logical navigational
operator, Unnest-Map
Each unnest-map operator is translated into a
physical operator that performs tree traversal on
the Natix storage
Physical optimization can be performed on the
physical navigational operators to reduce
cross-cluster I/O.

32
Native Path Query Processing

IBM DB2 XNav (Josifovski et al. 04)
XML path expressions are translated into automata
The automaton is constructed dynamically while
traversing the XML tree in DFS
Physical I/O can be optimized by navigating to
next_sibling without traversing the whole subtree

33
Native Path Query Processing

Tree automata (Koch 03)
The tree automaton needs two passes over the tree
The first traversal is a bottom-up deterministic
tree automaton to determine which states are
reachable
The second traversal is a top-down tree automaton
to prune the reachable states and compute
predicates.

34
Hybrid Processing

BlossomTree (Zhang 04, Zhang 05)
Navigational approach is efficient for
parent-child navigation
Join-based approach is efficient for
ancestor-descendant
The BlossomTree approach identifies sub-expressions,
called Next-of-Kin (NoK) expressions, that are
efficient for the navigational approach.
Use the navigational approach for NoK subexpressions
and use structural joins to join the intermediate
results

35
XML Indexing

Structural Index
Clustering tree nodes by their structural
similarity (e.g., bisimilarity and FB
bisimilarity)
Index is a graph, in which each vertex is an
equivalence class of similar XML tree nodes
Path query evaluation amounts to navigational
evaluation on the graph

36
Agenda

XML Storage
XML Path Query Processing
XML Optimization

37
Overview of Cost-based Optimization

Query Optimization depends on
How much knowledge about the data do we have?
How intelligently can we make use of that knowledge
(within a time constraint)?
The cost of a plan is heavily dependent on
The cost model of each operator
The cardinality/selectivity of each operator

38
Cardinality Estimation

Full path summarization: DataGuide (Goldman 97)
and PathTree (Aboulnaga 01)
Summarize all distinct paths in XML documents in
a graph
Cardinality information is annotated on graph
vertices
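A minimal sketch of a full path summary with cardinalities. For simplicity the summary is kept as a flat map from rooted label path to count instead of a graph with annotated vertices; that representation, the sample document, and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def path_summary(root):
    """Map every distinct rooted label path to the number of nodes reached by it."""
    counts = Counter()

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        counts[path] += 1
        for child in elem:
            visit(child, path)

    visit(root, "")
    return counts

# Hypothetical document used only to exercise the summary.
DOC = "<bib><book><title/><author/><author/></book><book><title/></book></bib>"
summary = path_summary(ET.fromstring(DOC))
print(summary["/bib/book/author"])
```

For a simple rooted path without predicates the lookup is exact; the summary can still grow large when the document has many distinct paths, which motivates the partial summaries on the next slide.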

39
Cardinality Estimation

Partial path summarization: Markov Table
(Aboulnaga 01)
Keep sub-paths and cardinality information in a
table
Cardinalities for longer paths are calculated using
partial paths.
Can use additional compression methods to
accommodate Internet-scale databases
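How longer paths are estimated from partial paths can be sketched with a table of paths of length 1 and 2. The estimate assumes that the count of a tag depends only on its parent tag, so f(t1/t2/t3) is approximated by f(t1/t2) * f(t2/t3) / f(t2). The sample document and function names are hypothetical, and no pruning or compression of the table is shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def markov_table(root):
    """Counts of all label paths of length 1 and 2."""
    table = Counter()

    def visit(elem, parent_tag):
        table[(elem.tag,)] += 1
        if parent_tag is not None:
            table[(parent_tag, elem.tag)] += 1
        for child in elem:
            visit(child, elem.tag)

    visit(root, None)
    return table

def estimate(table, path):
    """Estimate the cardinality of a child-step path such as 'a/b/c'."""
    tags = tuple(path.strip("/").split("/"))
    if len(tags) <= 2:
        return float(table[tags])  # stored exactly
    est = float(table[tags[:2]])
    for i in range(1, len(tags) - 1):
        denom = table[(tags[i],)]
        if denom == 0:
            return 0.0
        # Fraction of tags[i] nodes' children counts carried to the next step.
        est *= table[(tags[i], tags[i + 1])] / denom
    return est

# Hypothetical document: the true cardinality of a/b/c is 2.
DOC = "<r><a><b><c/><c/></b></a><x><b><c/></b></x></r>"
table = markov_table(ET.fromstring(DOC))
print(estimate(table, "a/b/c"))
```

The estimate for a/b/c differs from the true count of 2 because the table cannot tell which b nodes sit under a and which under x; that loss of correlation is the price of the smaller summary.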

40
Cardinality Estimation

Structural clustered summarization: XSketch
(Polyzotis 02) and TreeSketch (Polyzotis 04)
Similar idea to the clustering-based structural index
XSketch uses forward and backward stability, and
TreeSketch uses count stability as the similarity
measure
Heuristics to reduce the graph to fit a memory budget

41
Cardinality Estimation

Decompression-based approach: XSEED (Zhang 06)
XML documents are compressed into a small kernel
with edge cardinality labels
Kernel can be decompressed into XML document with
cardinality annotations
Navigational path operator can be reused on the
decompressed XML document for cardinality
estimation

42
Cost Modeling

Statistical Learning Cost Model: COMET (Zhang
05)
Relational operator cost modeling is performed by
analyzing the source code
XML operators are much more complex than
relational operators; therefore the analytical
approach is too time-consuming
The statistical learning approach needs a training
set of queries and learns the cost model from the
input parameters and real cost.