XML and Databases - PowerPoint PPT Presentation

1 / 60
About This Presentation
Title:

XML and Databases

Description:

IBM DB2 Viper. Native XML column storage. Truly ... IBM DB2 Viper (continued) Path specific XML indexing. Because indexing all nodes is prohibitive ... – PowerPoint PPT presentation

Number of Views:44
Avg rating:3.0/5.0
Slides: 61
Provided by: justin105
Category:
Tags: xml | databases | viper

less

Transcript and Presenter's Notes

Title: XML and Databases


1
XML and Databases
  • Justin Hong
  • Suhasini Ranganathan
  • Michael Suwandi

2
Outline
  • XML / Relational Model
  • XML in Relational Databases
  • Native XML Databases
  • References

3
XML / Relational Model
  • Suhasini Ranganathan

4
XML in One Slide
  • Hierarchical document format for information
    exchange in WWW
  • Self describing data (tags)
  • Nested element structure having a root
  • Element data can have
  • Attributes
  • Sub-elements

5
XML and Databases
  • With a large amount of data represented as XML
    documents, it becomes necessary to store and
    query these XML documents.
  • Exciting database opportunity
  • Unlike HTML, tags are not only for presentation
  • Can capture semantics
  • Can query the web if we can query XML!!!
  • Two approaches to storing XML data
  • Using relational databases
  • Using native XML stores.

6
Why relational databases to store XML?
  • Can harness the sophisticated storage and query
    capability already provided by existing
    relational database systems.
  • No need to build native systems from scratch.
  • users can query seamlessly across XML documents
    and other data stored in relational database
    systems.

7
Storing XML Documents in a Relational Database
System
Automatic Translation Layer
Commercial RDBMS (DB2)
8
Issues
  • The main issues that must be addressed include
  • dealing with the complexity of DTD element
    specifications
  • resolving the conflict between the two-level
    nature of relational schemas (table and
    attribute) vs. the arbitrary nesting of XML DTD
    schemas
  • dealing with set-valued attributes and recursion.

9
Example DTD (will be used in further slides)
  • lt!ELEMENT book(booktitle,author)
  • lt!ELEMENT article(title,author,contactauthor)gt
  • lt!ELEMENT contactauthor EMPTYgt
  • lt!ATTLIST contactauthor authorID IDREF IMPLIEDgt
  • lt!ELEMENT monograph(title,author,editor)gt
  • lt!ELEMENT editor(monograph)gt
  • lt!ATTLIST editor name CDATA REQUIREDgt
  • lt!ELEMENT author(name, address)gt
  • lt!ATTLIST author id ID REQUIREDgt
  • lt!ELEMENT name(firstname?,lastname)gt
  • lt!ELEMENT firstname(PCDATA)gt
  • lt!ELEMENT lastname(PCDATA)gt
  • lt!ELEMENT address(ANY)gt

10
DTD Graph
11
Simplifying DTDs
  • DTDs can be complex and generating relational
  • schemas that capture this complexity would be
  • unwieldy
  • Use set of transformations that can be used to
  • simplify any arbitrary DTD without undermining
  • the effectiveness of queries over documents
  • conforming to that DTD.

12
Transformation of DTDs
  • The transformations are of three types
  • flattening transformations which convert a
    nested
  • definition into flat representation (eg
    (e1,e2)e1,e2)
  • simplification transformations, which reduce many
    unary
  • operators to a single unary operator(ege1e1)
  • grouping transformations that group sub-elements
  • having the same name(eg a,.,a.. .a..
  • For example, applying these transformations can
  • transform an element such as
  • lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt to
  • lt!ELEMENT a (b, c?, e, f)

13
DTD to Relational Schema
  • Naïve Approach
  • Each Element gt Relation
  • Each Attribute of Element gt Column of Relation
  • Connect elements using foreign keys
  • Problem?
  • Fragmentation!

14
Fragmentation Example
lt!ELEMENT author (name, address)gt lt!ATTLIST
author id ID REQUIREDgt lt!ELEMENT name
(firstname?, lastname)gt lt!ELEMENT firstname
(PCDATA)gt lt!ELEMENT lastname (PCDATA)gt lt!ELEME
NT address ANYgt
author (authorID integer, id string) name
(nameID integer, authorID integer) firstname
(firstnameID integer, nameID integer, value
string) lastname (lastnameID integer, nameID
integer, value string) address (addressID
integer, authorID integer, value string)
  • Results in 5 relations
  • Just retrieving first and last names of an author
    requires three joins!

15
The Basic Inlining Technique
  • solves the fragmentation problem by inlining as
    many descendants of an element as possible into a
    single relation.
  • Basic creates relations for every element because
    an XML document can be rooted at any element in a
    DTD.
  • create a graph structure called the element graph

16
Example Element Graph
article

?
title

contactauthor
author
authorid
name
authorid
address
?
firstname
lastname
17
Basic Inlining Technique (contd)
  • book (bookID integer, book.booktitle string,
    book.author.name.firstname string,
    book.author.name.lastname string,
  • book.author.address string, author.authorid
    string)
  • booktitle (booktitleID integer, booktitle
    string)
  • article (articleID integer, article.contactauthor
    .authorid string, article.title string)
  • article.author (article.authorID integer,
    article.author.parentID integer,
    article.author.name.firstname string,
  • article.author.name.lastname string,
    article.author.address string,
    article.author.authorid string)
  • contactauthor (contactauthorID integer,
    contactauthor.authorid string)
  • title (titleID integer, title string)
  • monograph (monographID integer,
    monograph.parentID integer, monograph.title
    string, monograph.editor.name string,
  • monograph.author.name.firstname string,
    monograph.author.name.lastname string,
  • monograph.author.address string,
    monograph.author.authorid string)
  • editor (editorID integer, editor.parentID
    integer, editor.name string)
  • editor.monograph (editor.monographID integer,
    editor.monograph.parentID integer,
    editor.monograph.title string,
  • editor.monograph.author.name.firstname string,
    editor.monograph.author.name.lastname string,
  • editor.monograph.author.address string,
    editor.monograph.author.authorid string)
  • author (authorID integer, author.name.firstname
    string, author.name.lastname string,
    author.address string,
  • author.authorid string)
  • name (nameID integer, name.firstname string,
    name.lastname string)
  • firstname (firstnameID integer, firstname
    string)

18
Shared Inlining Technique
  • Intuition
  • Inline as many sub-elements as possible
  • Do not inline only if it is a shared, recursive
    or set sub-element.
  • Technique
  • Relations are created for all elements in the
    DTD graph whose nodes have an in-degree greater
    than one / in-degree equal to zero.
  • Nodes with an in-degree of one are inlined.

19
Shared Relational Schema
  • book (bookID integer, book.booktitle.isroot
    boolean, book.booktitle string)
  • article (articleID integer, article.contactauthor
    .isroot boolean,
  • article.contactauthor.authorid
    string)
  • monograph (monographID integer,
    monograph.parentID integer,
  • monograph.parentCODE integer, monograph.editor.is
    root boolean, monograph.editor.name string)
  • title (titleID integer, title.parentID integer,
    title.parentCODE integer, title string)
  • author (authorID integer, author.parentID
    integer, author.parentCODE integer,
    author.name.isroot boolean, author.name.firstname
    .isroot boolean,author.name.firstname string,
    author.name.lastname.isroot boolean,
    author.name.lastname string, author.address.isroo
    t boolean, author.address string,
    author.authorid string)

20
Semi-structured Queries to SQL Queries
  • Semi-structured queries have a lot more
    flexibility than SQL
  • Allow path expressions with various operators and
    wild cards
  • Discussion based on Shared approach

21
Queries with Simple Path Expressions Example
  • WHERE ltbookgt
  • ltbooktitlegt The Selfish
    Gene lt/booktitlegt
  • ltauthorgt
  • ltnamegt
  • ltfirstnamegt f
    lt/firstnamegt
  • ltlastnamegt l lt/lastnamegt
  • lt/namegt
  • lt/authorgt
  • lt/bookgt IN CONFORMING TO
    pubs.dtd
  • CONSTRUCT ltresultgt f l lt/resultgt

22
Queries with Simple Path Expressions Translation
Select A.author.name.firstname,
A.author.name.lastname From author A, book
B Where B.bookID A.parentID AND
A.parentCODE 0 AND
B.book.booktitle The Selfish Gene
23
XML in Relational Databases
  • Michael Suwandi

24
Motivation
  • Need to support new breed of applications
  • Financial applications
  • Web services (SOAP and XML-RPC messages)
  • Web feed formats (RSS and Atom)
  • Want to combine SQL and XML
  • Preserve the long term investments in established
    relational databases
  • SQL and Relational Databases are reliable,
    dependable, understood, proven, tested, and
    familiar
  • Introduce the power and flexibility of XML and
    its related technologies
  • XML, XPath and XQuery are new, flexible, powerful
    and intuitive
  • Somehow need to fuse XML content into SQL
    databases

25
Solutions
  • Possible solution
  • Turn the XML elements into tables, columns and
    rows in databases
  • Advantages minimal changes, only add
    pre-processing
  • Disadvantages cannot handle unstructured data,
    cannot handle data with evolving schemas, cannot
    handle recursive structures
  • Proposed solution
  • Just insert the XML document
  • Add a new column type (XML blob)
  • Add a new index type (XPath fragment)
  • Combine query language SQL DML XQuery
  • Advantages XML data is stored as is (do not care
    if the data is structured or not, if the schema
    is static or changing, if structure is
    recursive), no pre-processing, XQuery processing
    is integrated with the query optimizer, etc.
  • Disadvantages dual type systems, new technology
    (less studied and understood), new pitfalls and
    gotchas (optimizing storage and queries)

26
SQL Table vs. XML Document
  • ltblogsgt
  • ltblog date12/2/06 userJakegt
  • ltentrygtI got accepted to UTlt/entrygt
  • ltkeywordsgtltkeywordgtschoollt/keywordgtltkeywordgtU
    Tlt/keywordgtlt/keywordsgt
  • lt/bloggt
  • ltblog date12/5/06 userCindygt
  • ltentrygtI won lotterylt/entrygt
  • ltmoodsgtltmoodgtecstaticlt/moodgtltmoodgthappylt/mood
    gtlt/moodsgt
  • lt/bloggt
  • lt/blogsgt

27
SQL Table XML Document
CREATE TABLE blogs (date DATE, user VARCHAR, blog
XML)
  • Entire XML documents or fragments are stored
    inside a column to preserve the structure,
    whitespaces, etc
  • Notice that blog column contents do not match in
    structure

28
SQL Indexing vs. XPath Indexing
  • SQL Indexing
  • Builds B tree on the entire value of each cell
    values cannot efficiently process large VARCHAR
    nor BLOB (Binary Large OBject) columns
  • Most commercial databases add full text indexing
    for efficiently processing LIKE operator
  • XPath indexing
  • Needs to evaluate both the explicit and implicit
    paths for the XML nodes
  • Needs to preserve ordering between nodes
    important to process XPath axis functions e.g.
    last(), first(), following-sibling()
  • Needs to preserve node identities important to
    support attribute id uniqueness

29
SQL Indexing XPath Indexing
  • Index for all the user names
  • CREATE INDEX blog_user ON blogs(user)
  • Index for all the blog keywords
  • CREATE INDEX blog_keyword ON blogs(blog)
  • USING XMLPATTERN '//keyword' AS VARCHAR
  • Index for all the blog moods
  • CREATE INDEX blog_mood ON blogs(blog)
  • USING XMLPATTERN '//mood' AS VARCHAR
  • Notice as VARCHAR clause at the end

30
SQL DML vs. XQuery
  • Notice the differences in clauses
  • SELECT ltprojected columnsgt
  • FROM lttables and/or viewsgt
  • WHERE ltpredicatesgt
  • GROUP BY ltgrouping columnsgt
  • ORDER BY ltsorting columnsgt
  • SELECT clause always produces tables (even if it
    is only one row/one column)
  • Allow nesting in WHERE clause
  • for ltitems in XPath sequencegt
  • let ltvariable definitionsgt
  • where ltXPath conditionsgt
  • order by ltXPath expressiongt
  • return ltsequence of itemsgt
  • return clause produces sequences of XML documents
  • Allow nesting in for, let, where, and return
    clauses
  • XPath usage for selecting, enumerating,
    filtering, evaluating XML element and attributes

31
SQL DML XQuery
  • Return all of cindys happy entries
  • SELECT b.date,
  • XMLQuery('blog//entry passing b.blog
    as blog")
  • as VARCHAR
  • FROM blog b
  • WHERE b.user cindy AND
  • XMLExists('blog//keywordtext() eq
    happy
  • passing b.blog as blog)
  • Notice the use of XMLQuery() and XMLExists()
    functions as well as the clause as VARCHAR

32
Differences in Type Systems
  • SQL data types date, varchar, blob, double, etc.
  • XML data types XML nodes (elements and
    attributes) and text
  • Need some way to bridge the two
  • Hint!! Look at XMLExists(), XMLQuery() functions
    and as VARCHAR clause in previous examples
  • Problems (like trying to convert one to a
    double) are silenced by inserting empty nodes or
    empty sequences.
  • Warning!! This behavior is generally correct
    except in the presence of negation.

33
Commercial Implementations
  • IBM DB2 Viper
  • Microsoft SQL Server 2005

34
IBM DB2 Viper
  • Native XML column storage
  • Truly hierarchical XML storage
  • Flexible schema support
  • Schema assignment is per document not per column
  • Documents are allowed not to have schemas
  • Multiple query systems
  • Plain old SQL
  • Stand-alone XQuery
  • Use db2-fnxmlcolumn() to fetch XML documents
    from the columns in tables
  • XQuery in SQL
  • Use xmlquery(), xmlexists(), xmltable() and
    xmlcast() to convert from XML elements to SQL
    native data types
  • SQL in XQuery XML in SQL
  • Use xmlelement() to construct column data to XML
    documents

35
IBM DB2 Viper (continued)
  • Path specific XML indexing
  • Because indexing all nodes is prohibitive
  • Path can only contain navigation specifications
  • Implemented as B tree
  • Building the index
  • Perform XPath evaluation of the indexed column
    if the evaluation succeeded, the node is added to
    the index.
  • Tolerant behaviour if the node passes the XPath
    evaluation but failed the coversion to the
    desired data type that node is not added to the
    index this is important to support evolving
    schema definition
  • Using the index
  • Can be used to filter based on structural
    restrictions
  • Can be used to locate specific nodes in specific
    rows
  • Mainly used to perform context filtering on the
    entire table this is to done to support the
    millions of moderately sized documents workload
  • Algorithm to determine eligibility of indexes as
    well as effectiveness in terms of processing the
    queries

36
Microsoft SQL Server 2005
  • Native column type for XML storage
  • Binary XML to represent XML documents and
    fragments
  • Flexible schema support
  • XML documents are either typed (associated with a
    schema) or un-typed (not associated with any
    schema)
  • Single query system
  • Support augmented SQL with XQuery
  • Support most (but not all) of the XQuery
    functions
  • Tries to map as many operations from XPath and
    XQuery to the relational model for faster
    execution
  • Optimization of execution plan uses the costing
    functions and histogram of data distributions
    (same as the relational model) together the SQL
    and the XML parts for the query are optimized as
    a whole using rule based query tree rewrites

37
Microsoft SQL Server 2005 (cont.)
  • XML indexing
  • Primary XML indexes
  • Contains the data model content of the XML nodes
  • Secondary XML indexes
  • PATH indexes for path-based queries
  • PROPERTY indexes for property bag scenarios
  • VALUE indexes for value-based queries
  • Building the index
  • Implemented as B tree on the data model content
    of the nodes
  • Also contains the reversed encoded path for each
    node to the XML root
  • Also contains the OrdPath column for each node to
    capture relative ordering of the nodes
  • Statistics are created for the key column of the
    primary and secondary indexes
  • Using the index
  • Use the cost based selection for the secondary
    indexes

38
Microsoft SQL Server 2005 (cont.)
  • XQuery processing diagram
  • (tries to map XQuery expression to Relational
    query processor)

Example Relational operator tree for XDOC.query
(/BOOK_at_id123)
39
Conclusion
  • XML is here to stay. Need to integrate it into
    the established framework (i.e. relational
    databases)
  • Enhancing the type system, the index type, and
    the query language allow for easier transition
    and faster adoption of XML related technology
  • Commercial support for XML in relational
    databases are available from IBM, MS, and Oracle
  • The level of XML support are different in each
    offerings
  • Need to understand the impact of these additional
    features to the storage and performance
    requirements of existing applications

40
Native XML Databases
  • Justin Hong

41
Motivation
  • Advantages of XML
  • Interoperability
  • Ease of use
  • Extensibility
  • Integration of XML into existing DBMSs
  • Most DBMSs use tables as basic storage element
  • Hierarchical structure of XML needs to be mapped
    into and out of tables when querying
  • High number of joins to recreate XML structure

42
Motivation (Continued)
  • Typical queries on XML documents are queries on
    the hierarchical structure (relationships between
    nodes)
  • Types of relationships
  • parent-child
  • ancestor-descendant
  • sibling (previous and following)
  • Processing these types of queries without native
    XML data store support is costly

43
Native XML Data Stores
  • Provides a logical model for storing and
    retrieving XML documents efficiently
  • XML interface for accessing XML data (e.g. XPath,
    XQuery)
  • Advantages
  • Scalability
  • Data-access speed
  • Reliability

44
Native XML Database
  • Definition (XMLDB Initiative
    www.xmldb-org/sourceforge.net)
  • XML document is the fundamental unit of logical
    storage
  • Uses a logical model for the XML document itself
  • Requires a particular underlying physical storage
    model

45
Abstract View of Components
  • Native XML Interface
  • Most popular choices are XQuery and XPath
  • Storage Manager
  • Performs any necessary transformations to
    store/retrieve XML data

6
46
Storage Trees Collections
  • XML documents are trees, so storing them as trees
    is beneficial for navigation-oriented
    applications
  • Collection-based implementations are beneficial
    for applications that query a collection of XML
    documents

47
eXist
  • An Open Source Native XML Database

48
XML Indexing (Motivation)
  • Example find all the figures in a collection of
    books
  • Access all the nodes under the book elements and
    check if they are figure elements
  • Implies that a lot of non-figure elements are
    accessed
  • Need index structures to efficiently perform
    queries on large, unconstrained collections of
    XML documents

49
XML Index by Numbering
  • A numbering scheme assigns a unique identifier to
    each node in an XML document
  • Depending on the scheme, certain relationships
    can be determined immediately

50
Base Numbering Scheme in eXist
  • XML document is modeled as a complete k-ary tree
  • Unique identifier assigned to each node by
    traversing tree document in level-order

4
51
Base Scheme (Continued)
  • A nodes parent identifier can be computed with
  • parent7 (7-2) / 2 1 3.5 3

4
4
52
Base Scheme Disadvantage
  • Typical documents have a small number of top
    level elements (e.g. chapters of a book), whereas
    the deeper elements contain many more children
    (e.g. paragraphs in a chapter)
  • Spare identifiers need to be inserted at the top
    level elements (to be complete)
  • Assigned identifiers grows extremely fast even in
    small documents

53
eXists Alternating Scheme
  • The maximum number of children (k) for an element
    is a per-level attribute k(d) where d is the
    depth of the node with respect to the root node
  • How to calculate parent identifier?
  • Not obvious and not presented in the paper, but
    authors claim that it is still doable

54
Index Files
  • collections.dbx manages collection hierarchy
  • dom.dbx stores nodes in a paged file
    associates them with their identifiers
  • elements.dbx indexes elements attributes

55
dom.dbx
  • Index is a multi-root B-Tree

4
56
elements.dbx
  • Indexed by ltcollection-id, name-idgt pairs
  • Value is ordered list of ltdocument-id, node-idgt
    pairs, which are the nodes whose name matches the
    key entry

57
Limitations
  • eXist is good for relatively static documents
  • Inserting a child element under a node that
    already has the max number of children involves
    re-assigning identifiers to all elements in the
    level and subsequent levels

58
References
  • A. Balmin, K. S. Beyer, F. Özcan and M. Nicola.
    On the Path to Efficient XML Queries. In
    Proceedings of the 32nd international Conference
    on Very Large Data Bases, Seoul, Korea, September
    12 - 15, 2006.
  • A. Halverson, V. Josifovski, G. M. Lohman, H.
    Pirahesh and M. Mörschel. ROX Relational Over
    XML. In Proceedings of the Thirtieth
    International Conference on Very Large Data
    Bases, Toronto, Canada, August 31 September 3,
    2004.
  • R. Krishnamurthy, R. Kaushi and J. F. Naughton.
    Efficient XML-to-SQL Query Translation Where to
    Add the Intelligence? In Proceedings of the
    Thirtieth International Conference on Very Large
    Data Bases, Toronto, Canada, August 31
    September 3, 2004.

59
References
  • W. Meier. eXist An Open Source Native XML
    Database. In Erhard Rahm B. Chaudri, Mario Jeckle
    and Rainer Unland, editors, Web, Web-Services,
    and Database Systems, 2593, Erfurt, Germany,
    2002. Springer LNCS Series.
  • S. Pal, I. Cseri, O. Seeliger, M. Rys, G.
    Schaller, W. Yu, D. Tomic, A. Baras, B. Berg, D.
    Churin and E. Kogan. XQuery Implementation in a
    Relational Database System. In Proceedings of the
    31st international Conference on Very Large Data
    Bases, Trondheim, Norway, August 30 - September
    02, 2005.
  • A. Vakali, B. Catania and A. Maddalena. XML Data
    Stores Emerging Practices. In IEEE Internet
    Computing Journal, Vol. 9 Issue 2, pages 62-69,
    March-April 2005.

60
Questions?
Write a Comment
User Comments (0)
About PowerShow.com