Title: XML and Databases
1XML and Databases
- Justin Hong
- Suhasini Ranganathan
- Michael Suwandi
2Outline
- XML / Relational Model
- XML in Relational Databases
- Native XML Databases
- References
3XML / Relational Model
4XML in One Slide
- Hierarchical document format for information
exchange in WWW - Self describing data (tags)
- Nested element structure having a root
- Element data can have
- Attributes
- Sub-elements
5XML and Databases
- With a large amount of data represented as XML
documents, it becomes necessary to store and
query these XML documents. - Exciting database opportunity
- Unlike HTML, tags are not only for presentation
- Can capture semantics
- Can query the web if we can query XML!!!
- Two approaches to storing XML data
- Using relational databases
- Using native XML stores.
6Why relational databases to store XML?
- Can harness the sophisticated storage and query
capability already provided by existing
relational database systems. - No need to build native systems from scratch.
- users can query seamlessly across XML documents
and other data stored in relational database
systems.
7Storing XML Documents in a Relational Database
System
Automatic Translation Layer
Commercial RDBMS (DB2)
8Issues
- The main issues that must be addressed include
- dealing with the complexity of DTD element
specifications - resolving the conflict between the two-level
nature of relational schemas (table and
attribute) vs. the arbitrary nesting of XML DTD
schemas - dealing with set-valued attributes and recursion.
9Example DTD (will be used in further slides)
- lt!ELEMENT book(booktitle,author)
- lt!ELEMENT article(title,author,contactauthor)gt
- lt!ELEMENT contactauthor EMPTYgt
- lt!ATTLIST contactauthor authorID IDREF IMPLIEDgt
- lt!ELEMENT monograph(title,author,editor)gt
- lt!ELEMENT editor(monograph)gt
- lt!ATTLIST editor name CDATA REQUIREDgt
- lt!ELEMENT author(name, address)gt
- lt!ATTLIST author id ID REQUIREDgt
- lt!ELEMENT name(firstname?,lastname)gt
- lt!ELEMENT firstname(PCDATA)gt
- lt!ELEMENT lastname(PCDATA)gt
- lt!ELEMENT address(ANY)gt
10DTD Graph
11Simplifying DTDs
- DTDs can be complex and generating relational
- schemas that capture this complexity would be
- unwieldy
- Use set of transformations that can be used to
- simplify any arbitrary DTD without undermining
- the effectiveness of queries over documents
- conforming to that DTD.
12Transformation of DTDs
- The transformations are of three types
- flattening transformations which convert a
nested - definition into flat representation (eg
(e1,e2)e1,e2) - simplification transformations, which reduce many
unary - operators to a single unary operator(ege1e1)
- grouping transformations that group sub-elements
- having the same name(eg a,.,a.. .a..
- For example, applying these transformations can
- transform an element such as
- lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt to
- lt!ELEMENT a (b, c?, e, f)
13DTD to Relational Schema
- Naïve Approach
- Each Element gt Relation
- Each Attribute of Element gt Column of Relation
- Connect elements using foreign keys
- Problem?
- Fragmentation!
14Fragmentation Example
lt!ELEMENT author (name, address)gt lt!ATTLIST
author id ID REQUIREDgt lt!ELEMENT name
(firstname?, lastname)gt lt!ELEMENT firstname
(PCDATA)gt lt!ELEMENT lastname (PCDATA)gt lt!ELEME
NT address ANYgt
author (authorID integer, id string) name
(nameID integer, authorID integer) firstname
(firstnameID integer, nameID integer, value
string) lastname (lastnameID integer, nameID
integer, value string) address (addressID
integer, authorID integer, value string)
- Results in 5 relations
- Just retrieving first and last names of an author
requires three joins!
15The Basic Inlining Technique
- solves the fragmentation problem by inlining as
many descendants of an element as possible into a
single relation. - Basic creates relations for every element because
an XML document can be rooted at any element in a
DTD. - create a graph structure called the element graph
16Example Element Graph
article
?
title
contactauthor
author
authorid
name
authorid
address
?
firstname
lastname
17Basic Inlining Technique (contd)
- book (bookID integer, book.booktitle string,
book.author.name.firstname string,
book.author.name.lastname string, - book.author.address string, author.authorid
string) - booktitle (booktitleID integer, booktitle
string) - article (articleID integer, article.contactauthor
.authorid string, article.title string) - article.author (article.authorID integer,
article.author.parentID integer,
article.author.name.firstname string, - article.author.name.lastname string,
article.author.address string,
article.author.authorid string) - contactauthor (contactauthorID integer,
contactauthor.authorid string) - title (titleID integer, title string)
- monograph (monographID integer,
monograph.parentID integer, monograph.title
string, monograph.editor.name string, - monograph.author.name.firstname string,
monograph.author.name.lastname string, - monograph.author.address string,
monograph.author.authorid string) - editor (editorID integer, editor.parentID
integer, editor.name string) - editor.monograph (editor.monographID integer,
editor.monograph.parentID integer,
editor.monograph.title string, - editor.monograph.author.name.firstname string,
editor.monograph.author.name.lastname string, - editor.monograph.author.address string,
editor.monograph.author.authorid string) - author (authorID integer, author.name.firstname
string, author.name.lastname string,
author.address string, - author.authorid string)
- name (nameID integer, name.firstname string,
name.lastname string) - firstname (firstnameID integer, firstname
string)
18Shared Inlining Technique
- Intuition
- Inline as many sub-elements as possible
- Do not inline only if it is a shared, recursive
or set sub-element. - Technique
- Relations are created for all elements in the
DTD graph whose nodes have an in-degree greater
than one / in-degree equal to zero. - Nodes with an in-degree of one are inlined.
19Shared Relational Schema
- book (bookID integer, book.booktitle.isroot
boolean, book.booktitle string) - article (articleID integer, article.contactauthor
.isroot boolean, - article.contactauthor.authorid
string) - monograph (monographID integer,
monograph.parentID integer, - monograph.parentCODE integer, monograph.editor.is
root boolean, monograph.editor.name string) - title (titleID integer, title.parentID integer,
title.parentCODE integer, title string) - author (authorID integer, author.parentID
integer, author.parentCODE integer,
author.name.isroot boolean, author.name.firstname
.isroot boolean,author.name.firstname string,
author.name.lastname.isroot boolean,
author.name.lastname string, author.address.isroo
t boolean, author.address string,
author.authorid string)
20Semi-structured Queries to SQL Queries
- Semi-structured queries have a lot more
flexibility than SQL - Allow path expressions with various operators and
wild cards - Discussion based on Shared approach
21Queries with Simple Path Expressions Example
- WHERE ltbookgt
- ltbooktitlegt The Selfish
Gene lt/booktitlegt - ltauthorgt
- ltnamegt
- ltfirstnamegt f
lt/firstnamegt - ltlastnamegt l lt/lastnamegt
- lt/namegt
- lt/authorgt
- lt/bookgt IN CONFORMING TO
pubs.dtd - CONSTRUCT ltresultgt f l lt/resultgt
22Queries with Simple Path Expressions Translation
Select A.author.name.firstname,
A.author.name.lastname From author A, book
B Where B.bookID A.parentID AND
A.parentCODE 0 AND
B.book.booktitle The Selfish Gene
23XML in Relational Databases
24Motivation
- Need to support new breed of applications
- Financial applications
- Web services (SOAP and XML-RPC messages)
- Web feed formats (RSS and Atom)
- Want to combine SQL and XML
- Preserve the long term investments in established
relational databases - SQL and Relational Databases are reliable,
dependable, understood, proven, tested, and
familiar - Introduce the power and flexibility of XML and
its related technologies - XML, XPath and XQuery are new, flexible, powerful
and intuitive - Somehow need to fuse XML content into SQL
databases
25Solutions
- Possible solution
- Turn the XML elements into tables, columns and
rows in databases - Advantages minimal changes, only add
pre-processing - Disadvantages cannot handle unstructured data,
cannot handle data with evolving schemas, cannot
handle recursive structures - Proposed solution
- Just insert the XML document
- Add a new column type (XML blob)
- Add a new index type (XPath fragment)
- Combine query language SQL DML XQuery
- Advantages XML data is stored as is (do not care
if the data is structured or not, if the schema
is static or changing, if structure is
recursive), no pre-processing, XQuery processing
is integrated with the query optimizer, etc. - Disadvantages dual type systems, new technology
(less studied and understood), new pitfalls and
gotchas (optimizing storage and queries)
26SQL Table vs. XML Document
- ltblogsgt
- ltblog date12/2/06 userJakegt
- ltentrygtI got accepted to UTlt/entrygt
- ltkeywordsgtltkeywordgtschoollt/keywordgtltkeywordgtU
Tlt/keywordgtlt/keywordsgt - lt/bloggt
- ltblog date12/5/06 userCindygt
- ltentrygtI won lotterylt/entrygt
- ltmoodsgtltmoodgtecstaticlt/moodgtltmoodgthappylt/mood
gtlt/moodsgt - lt/bloggt
- lt/blogsgt
27SQL Table XML Document
CREATE TABLE blogs (date DATE, user VARCHAR, blog
XML)
- Entire XML documents or fragments are stored
inside a column to preserve the structure,
whitespaces, etc - Notice that blog column contents do not match in
structure
28SQL Indexing vs. XPath Indexing
- SQL Indexing
- Builds B tree on the entire value of each cell
values cannot efficiently process large VARCHAR
nor BLOB (Binary Large OBject) columns - Most commercial databases add full text indexing
for efficiently processing LIKE operator - XPath indexing
- Needs to evaluate both the explicit and implicit
paths for the XML nodes - Needs to preserve ordering between nodes
important to process XPath axis functions e.g.
last(), first(), following-sibling() - Needs to preserve node identities important to
support attribute id uniqueness
29SQL Indexing XPath Indexing
- Index for all the user names
- CREATE INDEX blog_user ON blogs(user)
- Index for all the blog keywords
- CREATE INDEX blog_keyword ON blogs(blog)
- USING XMLPATTERN '//keyword' AS VARCHAR
- Index for all the blog moods
- CREATE INDEX blog_mood ON blogs(blog)
- USING XMLPATTERN '//mood' AS VARCHAR
- Notice as VARCHAR clause at the end
30SQL DML vs. XQuery
- Notice the differences in clauses
- SELECT ltprojected columnsgt
- FROM lttables and/or viewsgt
- WHERE ltpredicatesgt
- GROUP BY ltgrouping columnsgt
- ORDER BY ltsorting columnsgt
- SELECT clause always produces tables (even if it
is only one row/one column) - Allow nesting in WHERE clause
- for ltitems in XPath sequencegt
- let ltvariable definitionsgt
- where ltXPath conditionsgt
- order by ltXPath expressiongt
- return ltsequence of itemsgt
- return clause produces sequences of XML documents
- Allow nesting in for, let, where, and return
clauses - XPath usage for selecting, enumerating,
filtering, evaluating XML element and attributes
31SQL DML XQuery
- Return all of cindys happy entries
- SELECT b.date,
- XMLQuery('blog//entry passing b.blog
as blog") - as VARCHAR
- FROM blog b
- WHERE b.user cindy AND
- XMLExists('blog//keywordtext() eq
happy - passing b.blog as blog)
- Notice the use of XMLQuery() and XMLExists()
functions as well as the clause as VARCHAR
32Differences in Type Systems
- SQL data types date, varchar, blob, double, etc.
- XML data types XML nodes (elements and
attributes) and text - Need some way to bridge the two
- Hint!! Look at XMLExists(), XMLQuery() functions
and as VARCHAR clause in previous examples - Problems (like trying to convert one to a
double) are silenced by inserting empty nodes or
empty sequences. - Warning!! This behavior is generally correct
except in the presence of negation.
33Commercial Implementations
- IBM DB2 Viper
- Microsoft SQL Server 2005
34IBM DB2 Viper
- Native XML column storage
- Truly hierarchical XML storage
- Flexible schema support
- Schema assignment is per document not per column
- Documents are allowed not to have schemas
- Multiple query systems
- Plain old SQL
- Stand-alone XQuery
- Use db2-fnxmlcolumn() to fetch XML documents
from the columns in tables - XQuery in SQL
- Use xmlquery(), xmlexists(), xmltable() and
xmlcast() to convert from XML elements to SQL
native data types - SQL in XQuery XML in SQL
- Use xmlelement() to construct column data to XML
documents
35IBM DB2 Viper (continued)
- Path specific XML indexing
- Because indexing all nodes is prohibitive
- Path can only contain navigation specifications
- Implemented as B tree
- Building the index
- Perform XPath evaluation of the indexed column
if the evaluation succeeded, the node is added to
the index. - Tolerant behaviour if the node passes the XPath
evaluation but failed the coversion to the
desired data type that node is not added to the
index this is important to support evolving
schema definition - Using the index
- Can be used to filter based on structural
restrictions - Can be used to locate specific nodes in specific
rows - Mainly used to perform context filtering on the
entire table this is to done to support the
millions of moderately sized documents workload - Algorithm to determine eligibility of indexes as
well as effectiveness in terms of processing the
queries
36Microsoft SQL Server 2005
- Native column type for XML storage
- Binary XML to represent XML documents and
fragments - Flexible schema support
- XML documents are either typed (associated with a
schema) or un-typed (not associated with any
schema) - Single query system
- Support augmented SQL with XQuery
- Support most (but not all) of the XQuery
functions - Tries to map as many operations from XPath and
XQuery to the relational model for faster
execution - Optimization of execution plan uses the costing
functions and histogram of data distributions
(same as the relational model) together the SQL
and the XML parts for the query are optimized as
a whole using rule based query tree rewrites
37Microsoft SQL Server 2005 (cont.)
- XML indexing
- Primary XML indexes
- Contains the data model content of the XML nodes
- Secondary XML indexes
- PATH indexes for path-based queries
- PROPERTY indexes for property bag scenarios
- VALUE indexes for value-based queries
- Building the index
- Implemented as B tree on the data model content
of the nodes - Also contains the reversed encoded path for each
node to the XML root - Also contains the OrdPath column for each node to
capture relative ordering of the nodes - Statistics are created for the key column of the
primary and secondary indexes - Using the index
- Use the cost based selection for the secondary
indexes
38Microsoft SQL Server 2005 (cont.)
- XQuery processing diagram
- (tries to map XQuery expression to Relational
query processor)
Example Relational operator tree for XDOC.query
(/BOOK_at_id123)
39Conclusion
- XML is here to stay. Need to integrate it into
the established framework (i.e. relational
databases) - Enhancing the type system, the index type, and
the query language allow for easier transition
and faster adoption of XML related technology - Commercial support for XML in relational
databases are available from IBM, MS, and Oracle - The level of XML support are different in each
offerings - Need to understand the impact of these additional
features to the storage and performance
requirements of existing applications
40Native XML Databases
41Motivation
- Advantages of XML
- Interoperability
- Ease of use
- Extensibility
- Integration of XML into existing DBMSs
- Most DBMSs use tables as basic storage element
- Hierarchical structure of XML needs to be mapped
into and out of tables when querying - High number of joins to recreate XML structure
42Motivation (Continued)
- Typical queries on XML documents are queries on
the hierarchical structure (relationships between
nodes) - Types of relationships
- parent-child
- ancestor-descendant
- sibling (previous and following)
- Processing these types of queries without native
XML data store support is costly
43Native XML Data Stores
- Provides a logical model for storing and
retrieving XML documents efficiently - XML interface for accessing XML data (e.g. XPath,
XQuery) - Advantages
- Scalability
- Data-access speed
- Reliability
44Native XML Database
- Definition (XMLDB Initiative
www.xmldb-org/sourceforge.net) - XML document is the fundamental unit of logical
storage - Uses a logical model for the XML document itself
- Requires a particular underlying physical storage
model
45Abstract View of Components
- Native XML Interface
- Most popular choices are XQuery and XPath
- Storage Manager
- Performs any necessary transformations to
store/retrieve XML data
6
46Storage Trees Collections
- XML documents are trees, so storing them as trees
is beneficial for navigation-oriented
applications - Collection-based implementations are beneficial
for applications that query a collection of XML
documents
47eXist
- An Open Source Native XML Database
48XML Indexing (Motivation)
- Example find all the figures in a collection of
books - Access all the nodes under the book elements and
check if they are figure elements - Implies that a lot of non-figure elements are
accessed - Need index structures to efficiently perform
queries on large, unconstrained collections of
XML documents
49XML Index by Numbering
- A numbering scheme assigns a unique identifier to
each node in an XML document - Depending on the scheme, certain relationships
can be determined immediately
50Base Numbering Scheme in eXist
- XML document is modeled as a complete k-ary tree
- Unique identifier assigned to each node by
traversing tree document in level-order
4
51Base Scheme (Continued)
- A nodes parent identifier can be computed with
- parent7 (7-2) / 2 1 3.5 3
4
4
52Base Scheme Disadvantage
- Typical documents have a small number of top
level elements (e.g. chapters of a book), whereas
the deeper elements contain many more children
(e.g. paragraphs in a chapter) - Spare identifiers need to be inserted at the top
level elements (to be complete) - Assigned identifiers grows extremely fast even in
small documents
53eXists Alternating Scheme
- The maximum number of children (k) for an element
is a per-level attribute k(d) where d is the
depth of the node with respect to the root node - How to calculate parent identifier?
- Not obvious and not presented in the paper, but
authors claim that it is still doable
54Index Files
- collections.dbx manages collection hierarchy
- dom.dbx stores nodes in a paged file
associates them with their identifiers - elements.dbx indexes elements attributes
55dom.dbx
- Index is a multi-root B-Tree
4
56elements.dbx
- Indexed by ltcollection-id, name-idgt pairs
- Value is ordered list of ltdocument-id, node-idgt
pairs, which are the nodes whose name matches the
key entry
57Limitations
- eXist is good for relatively static documents
- Inserting a child element under a node that
already has the max number of children involves
re-assigning identifiers to all elements in the
level and subsequent levels
58References
- A. Balmin, K. S. Beyer, F. Özcan and M. Nicola.
On the Path to Efficient XML Queries. In
Proceedings of the 32nd international Conference
on Very Large Data Bases, Seoul, Korea, September
12 - 15, 2006. - A. Halverson, V. Josifovski, G. M. Lohman, H.
Pirahesh and M. Mörschel. ROX Relational Over
XML. In Proceedings of the Thirtieth
International Conference on Very Large Data
Bases, Toronto, Canada, August 31 September 3,
2004. - R. Krishnamurthy, R. Kaushi and J. F. Naughton.
Efficient XML-to-SQL Query Translation Where to
Add the Intelligence? In Proceedings of the
Thirtieth International Conference on Very Large
Data Bases, Toronto, Canada, August 31
September 3, 2004.
59References
- W. Meier. eXist An Open Source Native XML
Database. In Erhard Rahm B. Chaudri, Mario Jeckle
and Rainer Unland, editors, Web, Web-Services,
and Database Systems, 2593, Erfurt, Germany,
2002. Springer LNCS Series. - S. Pal, I. Cseri, O. Seeliger, M. Rys, G.
Schaller, W. Yu, D. Tomic, A. Baras, B. Berg, D.
Churin and E. Kogan. XQuery Implementation in a
Relational Database System. In Proceedings of the
31st international Conference on Very Large Data
Bases, Trondheim, Norway, August 30 - September
02, 2005. - A. Vakali, B. Catania and A. Maddalena. XML Data
Stores Emerging Practices. In IEEE Internet
Computing Journal, Vol. 9 Issue 2, pages 62-69,
March-April 2005.
60Questions?