XML and Databases

About This Presentation

Title:

XML and Databases

Description:

IBM DB2 Viper. Native XML column storage. Truly ... IBM DB2 Viper (continued) Path specific XML indexing. Because indexing all nodes is prohibitive ... – PowerPoint PPT presentation

Number of Views:44

Avg rating:3.0/5.0

Slides: 61

Provided by: justin105

Category:

more less

Transcript and Presenter's Notes

Title: XML and Databases

1
XML and Databases

Justin Hong
Suhasini Ranganathan
Michael Suwandi

2
Outline

XML / Relational Model
XML in Relational Databases
Native XML Databases
References

3
XML / Relational Model

Suhasini Ranganathan

4
XML in One Slide

Hierarchical document format for information
exchange in WWW
Self describing data (tags)
Nested element structure having a root
Element data can have
Attributes
Sub-elements

5
XML and Databases

With a large amount of data represented as XML
documents, it becomes necessary to store and
query these XML documents.
Exciting database opportunity
Unlike HTML, tags are not only for presentation
Can capture semantics
Can query the web if we can query XML!!!
Two approaches to storing XML data
Using relational databases
Using native XML stores.

6
Why relational databases to store XML?

Can harness the sophisticated storage and query
capability already provided by existing
relational database systems.
No need to build native systems from scratch.
users can query seamlessly across XML documents
and other data stored in relational database
systems.

7
Storing XML Documents in a Relational Database
System
Automatic Translation Layer
Commercial RDBMS (DB2)
8
Issues

The main issues that must be addressed include
dealing with the complexity of DTD element
specifications
resolving the conflict between the two-level
nature of relational schemas (table and
attribute) vs. the arbitrary nesting of XML DTD
schemas
dealing with set-valued attributes and recursion.

9
Example DTD (will be used in further slides)

lt!ELEMENT book(booktitle,author)
lt!ELEMENT article(title,author,contactauthor)gt
lt!ELEMENT contactauthor EMPTYgt
lt!ATTLIST contactauthor authorID IDREF IMPLIEDgt
lt!ELEMENT monograph(title,author,editor)gt
lt!ELEMENT editor(monograph)gt
lt!ATTLIST editor name CDATA REQUIREDgt
lt!ELEMENT author(name, address)gt
lt!ATTLIST author id ID REQUIREDgt
lt!ELEMENT name(firstname?,lastname)gt
lt!ELEMENT firstname(PCDATA)gt
lt!ELEMENT lastname(PCDATA)gt
lt!ELEMENT address(ANY)gt

10
DTD Graph
11
Simplifying DTDs

DTDs can be complex and generating relational
schemas that capture this complexity would be
unwieldy
Use set of transformations that can be used to
simplify any arbitrary DTD without undermining
the effectiveness of queries over documents
conforming to that DTD.

12
Transformation of DTDs

The transformations are of three types
flattening transformations which convert a
nested
definition into flat representation (eg
(e1,e2)e1,e2)
simplification transformations, which reduce many
unary
operators to a single unary operator(ege1e1)
grouping transformations that group sub-elements
having the same name(eg a,.,a.. .a..
For example, applying these transformations can
transform an element such as
lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt to
lt!ELEMENT a (b, c?, e, f)

13
DTD to Relational Schema

Naïve Approach
Each Element gt Relation
Each Attribute of Element gt Column of Relation
Connect elements using foreign keys
Problem?
Fragmentation!

14
Fragmentation Example
lt!ELEMENT author (name, address)gt lt!ATTLIST
author id ID REQUIREDgt lt!ELEMENT name
(firstname?, lastname)gt lt!ELEMENT firstname
(PCDATA)gt lt!ELEMENT lastname (PCDATA)gt lt!ELEME
NT address ANYgt
author (authorID integer, id string) name
(nameID integer, authorID integer) firstname
(firstnameID integer, nameID integer, value
string) lastname (lastnameID integer, nameID
integer, value string) address (addressID
integer, authorID integer, value string)

Results in 5 relations
Just retrieving first and last names of an author
requires three joins!

15
The Basic Inlining Technique

solves the fragmentation problem by inlining as
many descendants of an element as possible into a
single relation.
Basic creates relations for every element because
an XML document can be rooted at any element in a
DTD.
create a graph structure called the element graph

16
Example Element Graph
article

?
title

contactauthor
author
authorid
name
authorid
address
?
firstname
lastname
17
Basic Inlining Technique (contd)

book (bookID integer, book.booktitle string,
book.author.name.firstname string,
book.author.name.lastname string,
book.author.address string, author.authorid
string)
booktitle (booktitleID integer, booktitle
string)
article (articleID integer, article.contactauthor
.authorid string, article.title string)
article.author (article.authorID integer,
article.author.parentID integer,
article.author.name.firstname string,
article.author.name.lastname string,
article.author.address string,
article.author.authorid string)
contactauthor (contactauthorID integer,
contactauthor.authorid string)
title (titleID integer, title string)
monograph (monographID integer,
monograph.parentID integer, monograph.title
string, monograph.editor.name string,
monograph.author.name.firstname string,
monograph.author.name.lastname string,
monograph.author.address string,
monograph.author.authorid string)
editor (editorID integer, editor.parentID
integer, editor.name string)
editor.monograph (editor.monographID integer,
editor.monograph.parentID integer,
editor.monograph.title string,
editor.monograph.author.name.firstname string,
editor.monograph.author.name.lastname string,
editor.monograph.author.address string,
editor.monograph.author.authorid string)
author (authorID integer, author.name.firstname
string, author.name.lastname string,
author.address string,
author.authorid string)
name (nameID integer, name.firstname string,
name.lastname string)
firstname (firstnameID integer, firstname
string)

18
Shared Inlining Technique

Intuition
Inline as many sub-elements as possible
Do not inline only if it is a shared, recursive
or set sub-element.
Technique
Relations are created for all elements in the
DTD graph whose nodes have an in-degree greater
than one / in-degree equal to zero.
Nodes with an in-degree of one are inlined.

19
Shared Relational Schema

book (bookID integer, book.booktitle.isroot
boolean, book.booktitle string)
article (articleID integer, article.contactauthor
.isroot boolean,
article.contactauthor.authorid
string)
monograph (monographID integer,
monograph.parentID integer,
monograph.parentCODE integer, monograph.editor.is
root boolean, monograph.editor.name string)
title (titleID integer, title.parentID integer,
title.parentCODE integer, title string)
author (authorID integer, author.parentID
integer, author.parentCODE integer,
author.name.isroot boolean, author.name.firstname
.isroot boolean,author.name.firstname string,
author.name.lastname.isroot boolean,
author.name.lastname string, author.address.isroo
t boolean, author.address string,
author.authorid string)

20
Semi-structured Queries to SQL Queries

Semi-structured queries have a lot more
flexibility than SQL
Allow path expressions with various operators and
wild cards
Discussion based on Shared approach

21
Queries with Simple Path Expressions Example

WHERE ltbookgt
ltbooktitlegt The Selfish
Gene lt/booktitlegt
ltauthorgt
ltnamegt
ltfirstnamegt f
lt/firstnamegt
ltlastnamegt l lt/lastnamegt
lt/namegt
lt/authorgt
lt/bookgt IN CONFORMING TO
pubs.dtd
CONSTRUCT ltresultgt f l lt/resultgt

22
Queries with Simple Path Expressions Translation
Select A.author.name.firstname,
A.author.name.lastname From author A, book
B Where B.bookID A.parentID AND
A.parentCODE 0 AND
B.book.booktitle The Selfish Gene
23
XML in Relational Databases

Michael Suwandi

24
Motivation

Need to support new breed of applications
Financial applications
Web services (SOAP and XML-RPC messages)
Web feed formats (RSS and Atom)
Want to combine SQL and XML
Preserve the long term investments in established
relational databases
SQL and Relational Databases are reliable,
dependable, understood, proven, tested, and
familiar
Introduce the power and flexibility of XML and
its related technologies
XML, XPath and XQuery are new, flexible, powerful
and intuitive
Somehow need to fuse XML content into SQL
databases

25
Solutions

Possible solution
Turn the XML elements into tables, columns and
rows in databases
Advantages minimal changes, only add
pre-processing
Disadvantages cannot handle unstructured data,
cannot handle data with evolving schemas, cannot
handle recursive structures
Proposed solution
Just insert the XML document
Add a new column type (XML blob)
Add a new index type (XPath fragment)
Combine query language SQL DML XQuery
Advantages XML data is stored as is (do not care
if the data is structured or not, if the schema
is static or changing, if structure is
recursive), no pre-processing, XQuery processing
is integrated with the query optimizer, etc.
Disadvantages dual type systems, new technology
(less studied and understood), new pitfalls and
gotchas (optimizing storage and queries)

26
SQL Table vs. XML Document

ltblogsgt
ltblog date12/2/06 userJakegt
ltentrygtI got accepted to UTlt/entrygt
ltkeywordsgtltkeywordgtschoollt/keywordgtltkeywordgtU
Tlt/keywordgtlt/keywordsgt
lt/bloggt
ltblog date12/5/06 userCindygt
ltentrygtI won lotterylt/entrygt
ltmoodsgtltmoodgtecstaticlt/moodgtltmoodgthappylt/mood
gtlt/moodsgt
lt/bloggt
lt/blogsgt

27
SQL Table XML Document
CREATE TABLE blogs (date DATE, user VARCHAR, blog
XML)

Entire XML documents or fragments are stored
inside a column to preserve the structure,
whitespaces, etc
Notice that blog column contents do not match in
structure

28
SQL Indexing vs. XPath Indexing

SQL Indexing
Builds B tree on the entire value of each cell
values cannot efficiently process large VARCHAR
nor BLOB (Binary Large OBject) columns
Most commercial databases add full text indexing
for efficiently processing LIKE operator
XPath indexing
Needs to evaluate both the explicit and implicit
paths for the XML nodes
Needs to preserve ordering between nodes
important to process XPath axis functions e.g.
last(), first(), following-sibling()
Needs to preserve node identities important to
support attribute id uniqueness

29
SQL Indexing XPath Indexing

Index for all the user names
CREATE INDEX blog_user ON blogs(user)
Index for all the blog keywords
CREATE INDEX blog_keyword ON blogs(blog)
USING XMLPATTERN '//keyword' AS VARCHAR
Index for all the blog moods
CREATE INDEX blog_mood ON blogs(blog)
USING XMLPATTERN '//mood' AS VARCHAR
Notice as VARCHAR clause at the end

30
SQL DML vs. XQuery

Notice the differences in clauses

SELECT ltprojected columnsgt
FROM lttables and/or viewsgt
WHERE ltpredicatesgt
GROUP BY ltgrouping columnsgt
ORDER BY ltsorting columnsgt
SELECT clause always produces tables (even if it
is only one row/one column)
Allow nesting in WHERE clause

for ltitems in XPath sequencegt
let ltvariable definitionsgt
where ltXPath conditionsgt
order by ltXPath expressiongt
return ltsequence of itemsgt
return clause produces sequences of XML documents
Allow nesting in for, let, where, and return
clauses
XPath usage for selecting, enumerating,
filtering, evaluating XML element and attributes

31
SQL DML XQuery

Return all of cindys happy entries
SELECT b.date,
XMLQuery('blog//entry passing b.blog
as blog")
as VARCHAR
FROM blog b
WHERE b.user cindy AND
XMLExists('blog//keywordtext() eq
happy
passing b.blog as blog)
Notice the use of XMLQuery() and XMLExists()
functions as well as the clause as VARCHAR

32
Differences in Type Systems

SQL data types date, varchar, blob, double, etc.
XML data types XML nodes (elements and
attributes) and text
Need some way to bridge the two
Hint!! Look at XMLExists(), XMLQuery() functions
and as VARCHAR clause in previous examples
Problems (like trying to convert one to a
double) are silenced by inserting empty nodes or
empty sequences.
Warning!! This behavior is generally correct
except in the presence of negation.

33
Commercial Implementations

IBM DB2 Viper
Microsoft SQL Server 2005

34
IBM DB2 Viper

Native XML column storage
Truly hierarchical XML storage
Flexible schema support
Schema assignment is per document not per column
Documents are allowed not to have schemas
Multiple query systems
Plain old SQL
Stand-alone XQuery
Use db2-fnxmlcolumn() to fetch XML documents
from the columns in tables
XQuery in SQL
Use xmlquery(), xmlexists(), xmltable() and
xmlcast() to convert from XML elements to SQL
native data types
SQL in XQuery XML in SQL
Use xmlelement() to construct column data to XML
documents

35
IBM DB2 Viper (continued)

Path specific XML indexing
Because indexing all nodes is prohibitive
Path can only contain navigation specifications
Implemented as B tree
Building the index
Perform XPath evaluation of the indexed column
if the evaluation succeeded, the node is added to
the index.
Tolerant behaviour if the node passes the XPath
evaluation but failed the coversion to the
desired data type that node is not added to the
index this is important to support evolving
schema definition
Using the index
Can be used to filter based on structural
restrictions
Can be used to locate specific nodes in specific
rows
Mainly used to perform context filtering on the
entire table this is to done to support the
millions of moderately sized documents workload
Algorithm to determine eligibility of indexes as
well as effectiveness in terms of processing the
queries

36
Microsoft SQL Server 2005

Native column type for XML storage
Binary XML to represent XML documents and
fragments
Flexible schema support
XML documents are either typed (associated with a
schema) or un-typed (not associated with any
schema)
Single query system
Support augmented SQL with XQuery
Support most (but not all) of the XQuery
functions
Tries to map as many operations from XPath and
XQuery to the relational model for faster
execution
Optimization of execution plan uses the costing
functions and histogram of data distributions
(same as the relational model) together the SQL
and the XML parts for the query are optimized as
a whole using rule based query tree rewrites

37
Microsoft SQL Server 2005 (cont.)

XML indexing
Primary XML indexes
Contains the data model content of the XML nodes
Secondary XML indexes
PATH indexes for path-based queries
PROPERTY indexes for property bag scenarios
VALUE indexes for value-based queries
Building the index
Implemented as B tree on the data model content
of the nodes
Also contains the reversed encoded path for each
node to the XML root
Also contains the OrdPath column for each node to
capture relative ordering of the nodes
Statistics are created for the key column of the
primary and secondary indexes
Using the index
Use the cost based selection for the secondary
indexes

38
Microsoft SQL Server 2005 (cont.)

XQuery processing diagram
(tries to map XQuery expression to Relational
query processor)

Example Relational operator tree for XDOC.query
(/BOOK_at_id123)
39
Conclusion

XML is here to stay. Need to integrate it into
the established framework (i.e. relational
databases)
Enhancing the type system, the index type, and
the query language allow for easier transition
and faster adoption of XML related technology
Commercial support for XML in relational
databases are available from IBM, MS, and Oracle
The level of XML support are different in each
offerings
Need to understand the impact of these additional
features to the storage and performance
requirements of existing applications

40
Native XML Databases

Justin Hong

41
Motivation

Advantages of XML
Interoperability
Ease of use
Extensibility
Integration of XML into existing DBMSs
Most DBMSs use tables as basic storage element
Hierarchical structure of XML needs to be mapped
into and out of tables when querying
High number of joins to recreate XML structure

42
Motivation (Continued)

Typical queries on XML documents are queries on
the hierarchical structure (relationships between
nodes)
Types of relationships
parent-child
ancestor-descendant
sibling (previous and following)
Processing these types of queries without native
XML data store support is costly

43
Native XML Data Stores

Provides a logical model for storing and
retrieving XML documents efficiently
XML interface for accessing XML data (e.g. XPath,
XQuery)
Advantages
Scalability
Data-access speed
Reliability

44
Native XML Database

Definition (XMLDB Initiative
www.xmldb-org/sourceforge.net)
XML document is the fundamental unit of logical
storage
Uses a logical model for the XML document itself
Requires a particular underlying physical storage
model

45
Abstract View of Components

Native XML Interface
Most popular choices are XQuery and XPath
Storage Manager
Performs any necessary transformations to
store/retrieve XML data

6
46
Storage Trees Collections

XML documents are trees, so storing them as trees
is beneficial for navigation-oriented
applications
Collection-based implementations are beneficial
for applications that query a collection of XML
documents

47
eXist

An Open Source Native XML Database

48
XML Indexing (Motivation)

Example find all the figures in a collection of
books
Access all the nodes under the book elements and
check if they are figure elements
Implies that a lot of non-figure elements are
accessed
Need index structures to efficiently perform
queries on large, unconstrained collections of
XML documents

49
XML Index by Numbering

A numbering scheme assigns a unique identifier to
each node in an XML document
Depending on the scheme, certain relationships
can be determined immediately

50
Base Numbering Scheme in eXist

XML document is modeled as a complete k-ary tree
Unique identifier assigned to each node by
traversing tree document in level-order

4
51
Base Scheme (Continued)

A nodes parent identifier can be computed with
parent7 (7-2) / 2 1 3.5 3

4
4
52
Base Scheme Disadvantage

Typical documents have a small number of top
level elements (e.g. chapters of a book), whereas
the deeper elements contain many more children
(e.g. paragraphs in a chapter)
Spare identifiers need to be inserted at the top
level elements (to be complete)
Assigned identifiers grows extremely fast even in
small documents

53
eXists Alternating Scheme

The maximum number of children (k) for an element
is a per-level attribute k(d) where d is the
depth of the node with respect to the root node
How to calculate parent identifier?
Not obvious and not presented in the paper, but
authors claim that it is still doable

54
Index Files

collections.dbx manages collection hierarchy
dom.dbx stores nodes in a paged file
associates them with their identifiers
elements.dbx indexes elements attributes

55
dom.dbx

Index is a multi-root B-Tree

4
56
elements.dbx

Indexed by ltcollection-id, name-idgt pairs
Value is ordered list of ltdocument-id, node-idgt
pairs, which are the nodes whose name matches the
key entry

57
Limitations

eXist is good for relatively static documents
Inserting a child element under a node that
already has the max number of children involves
re-assigning identifiers to all elements in the
level and subsequent levels

58
References

A. Balmin, K. S. Beyer, F. Özcan and M. Nicola.
On the Path to Efficient XML Queries. In
Proceedings of the 32nd international Conference
on Very Large Data Bases, Seoul, Korea, September
12 - 15, 2006.
A. Halverson, V. Josifovski, G. M. Lohman, H.
Pirahesh and M. Mörschel. ROX Relational Over
XML. In Proceedings of the Thirtieth
International Conference on Very Large Data
Bases, Toronto, Canada, August 31 September 3,
2004.
R. Krishnamurthy, R. Kaushi and J. F. Naughton.
Efficient XML-to-SQL Query Translation Where to
Add the Intelligence? In Proceedings of the
Thirtieth International Conference on Very Large
Data Bases, Toronto, Canada, August 31
September 3, 2004.

59
References

W. Meier. eXist An Open Source Native XML
Database. In Erhard Rahm B. Chaudri, Mario Jeckle
and Rainer Unland, editors, Web, Web-Services,
and Database Systems, 2593, Erfurt, Germany,
2002. Springer LNCS Series.
S. Pal, I. Cseri, O. Seeliger, M. Rys, G.
Schaller, W. Yu, D. Tomic, A. Baras, B. Berg, D.
Churin and E. Kogan. XQuery Implementation in a
Relational Database System. In Proceedings of the
31st international Conference on Very Large Data
Bases, Trondheim, Norway, August 30 - September
02, 2005.
A. Vakali, B. Catania and A. Maddalena. XML Data
Stores Emerging Practices. In IEEE Internet
Computing Journal, Vol. 9 Issue 2, pages 62-69,
March-April 2005.