Title: Relational Databases for Querying XML Documents: Limitations and Opportunities
1 Relational Databases for Querying XML Documents: Limitations and Opportunities
2 Introduction
- XML is fast emerging as the dominant standard for representing data on the World Wide Web.
- The initial purpose of XML was to make it easier to exchange data over the Internet.
- This raises the problem of how to query the contents of XML documents.
3 Approaches for querying XML documents
- Use semi-structured query languages and query evaluation techniques.
- Use a relational database to store and query XML documents.
- Use native XML repositories, e.g., Software AG's Tamino and eXcelon's XIS. (Summarized in Lu's paper.)
4 Processes used in the relational approach
- Process an XML DTD to generate a relational schema.
- Parse XML documents conforming to the DTD and load them into tuples of relational tables in a standard commercial DBMS (DB2, Oracle).
- Translate semi-structured queries over XML documents into SQL queries over the relational database.
- Convert the results back to XML format.
5 Outline of the talk
- XML background
- Mapping XML DTD to relational schema
- Basic inlining techniques
- Shared inlining techniques
- Hybrid inlining techniques
- Some experiments
- Translating semi-structured queries into SQL queries
- Converting results to XML format
- Conclusions and future work
6 XML background
- Extensible Markup Language
- DTDs and other XML schemas
- DCD (Document Content Descriptor)
- XML Schemas
- Semi-structured query languages
- XML-QL, Lorel, UnQL, XQL
- They share the notion of path expressions for navigating the nested structure of XML
7 XML Query Languages
- XML-QL: uses a nested, XML-like structure
8 XML Query Languages
- Lorel: more like SQL
- In the paper, a combination of XML-QL and Lorel is used for the examples
10 Mapping DTDs to relational schemas
- Simplifying DTDs
- Creating and inlining DTD graphs
- Generating relational schemas
11 Simplifying DTDs
- Why?
- DTDs can be complex
- Generating relational schemas directly from them would be unwieldy
- A DTD can be simplified and still yield a relational schema that can store and query documents conforming to the original DTD.
- The transformations preserve the semantics of
- one or many
- null or not null
- They lose some information about the relative order of the elements
- can this be captured when a specific XML document is loaded into the relational schema?
12 Simplifying DTDs (cont.)
- Flattening transformations
- Simplification transformations
13 Simplifying DTDs (cont.)
- Grouping transformations
- All "+" operators are transformed to "*"
- A DTD simplification example
- <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>
- Transformed to <!ELEMENT a (b*, c?, e*, f*)>
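The combined effect of the flattening, simplification and grouping transformations can be sketched in Python. This is a minimal sketch, not the paper's algorithm: the tuple encoding of content models and the names `cardinalities` and `simplify` are assumptions made for illustration, "+" is handled like "*", and the example input assumes the content model of `a` reads ((b|c|e)?,(e?|(f?,(b,b)*))*). Every child element ends up with one of three cardinalities: exactly one, optional (?), or zero or more (*).

```python
# Content models are nested tuples (an illustrative encoding, not a DTD parser):
#   ("name", "b"), ("seq", [...]), ("choice", [...]),
#   ("opt", x), ("star", x), ("plus", x)

def cardinalities(model):
    """Map each child element name to (optional, many) flags."""
    kind, arg = model
    if kind == "name":
        return {arg: (False, False)}
    if kind == "opt":
        # (e1, e2)? -> e1?, e2?
        return {n: (True, many) for n, (_, many) in cardinalities(arg).items()}
    if kind in ("star", "plus"):
        # "+" is handled like "*"; (e1, e2)* -> e1*, e2*
        return {n: (True, True) for n in cardinalities(arg)}
    merged = {}
    for part in arg:  # kind is "seq" or "choice"
        for name, (opt, many) in cardinalities(part).items():
            if kind == "choice":
                opt = True  # (e1 | e2) -> e1?, e2?
            if name in merged:
                # Grouping: repeated occurrences of a name collapse to name*
                merged[name] = (True, True)
            else:
                merged[name] = (opt, many)
    return merged

def simplify(model):
    """Render the simplified content model as a comma-separated string."""
    parts = []
    for name, (opt, many) in cardinalities(model).items():
        suffix = "*" if many else "?" if opt else ""
        parts.append(name + suffix)
    return ", ".join(parts)

b, c, e, f = (("name", n) for n in "bcef")
example = ("seq", [
    ("opt", ("choice", [b, c, e])),
    ("star", ("choice", [
        ("opt", e),
        ("seq", [("opt", f), ("star", ("seq", [b, b]))]),
    ])),
])
print(simplify(example))  # b*, c?, e*, f*
```

The sketch discards the relative order of children, which is the information loss the simplification accepts.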
14 Motivation for special schema conversion
- Relational schemas
- are usually derived from a data model such as the ER model
- with a clear separation between entities and attributes
- Mapping DTD elements and attributes directly to ER entities and attributes
- finds no clean correspondence
- leads to excessive fragmentation of the document
15 DTD graph
- Represents the structure of a DTD
- Nodes
- Elements appear exactly once
- Attributes appear as many times as they appear in the DTD
- Operators appear as many times as they appear in the DTD
- Cycles in the DTD graph indicate the presence of recursion
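A DTD graph of this kind can be held in a small adjacency structure. The sketch below is an illustration under assumptions: the element names, the `DTD_GRAPH` dictionary and the helper names are hypothetical, operator nodes are folded into a per-edge flag rather than stored as separate nodes, and attributes are left out. It computes in-degrees and finds the elements that lie on a cycle, i.e. the recursive ones.

```python
# Hypothetical simplified DTD: element -> list of (child, below_star) edges.
# below_star=True means the child is reached through a "*" operator node.
DTD_GRAPH = {
    "book": [("booktitle", False), ("author", False)],
    "monograph": [("title", False), ("author", False), ("editor", False)],
    "editor": [("monograph", True)],
    "author": [("name", False)],
    "booktitle": [], "title": [], "name": [],
}

def in_degrees(graph):
    """Count incoming edges for every element node."""
    degree = {node: 0 for node in graph}
    for edges in graph.values():
        for child, _ in edges:
            degree[child] += 1
    return degree

def recursive_elements(graph):
    """Return the elements that can reach themselves (they lie on a cycle)."""
    def reachable(start):
        seen, stack = set(), [c for c, _ in graph[start]]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(c for c, _ in graph[node])
        return seen
    return {node for node in graph if node in reachable(node)}

print(in_degrees(DTD_GRAPH)["author"])         # 2
print(sorted(recursive_elements(DTD_GRAPH)))   # ['editor', 'monograph']
```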
16 An example used in the paper
17 DTD graph
18 The Basic inlining technique
- Create a relation for an element from the DTD graph
- All of the element's descendants are inlined into that relation
- Exceptions
- children directly below a "*" node are made into separate relations
- each node with a backpointer edge pointing to it is made into a separate relation
19 The Basic inlining technique
- Attributes are named by the path from the root
- Each relation has an ID field
- the key of the relation
- All relations corresponding to element nodes that have a parent also have a parentID field
- a foreign key
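The naming and key conventions above can be illustrated with SQLite. This is a sketch under assumptions: the `book` and `author` relations, their columns and the sample rows are hypothetical and are not the schema derived in the paper. Column names follow the path from the relation's root element, each relation has an ID key, and the child relation carries a parentID foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (
    bookID             INTEGER PRIMARY KEY,   -- ID field: key of the relation
    "book.booktitle"   TEXT                   -- inlined child, named by its path
);
CREATE TABLE author (
    authorID                  INTEGER PRIMARY KEY,
    parentID                  INTEGER REFERENCES book(bookID),  -- foreign key
    "author.name.firstname"   TEXT,
    "author.name.lastname"    TEXT
);
INSERT INTO book VALUES (1, 'An Example Title');
INSERT INTO author VALUES (10, 1, 'Ada', 'Smith');
INSERT INTO author VALUES (11, 1, 'Bo', 'Jones');
""")

# Reassembling a book with its authors costs one join on parentID.
rows = conn.execute("""
    SELECT b."book.booktitle", a."author.name.lastname"
    FROM book AS b JOIN author AS a ON a.parentID = b.bookID
    ORDER BY a.authorID
""").fetchall()
print(rows)
```

Here `author` is assumed to be set-valued (below a "*" node), which is why it sits in its own relation instead of being inlined into `book`.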
20 The Basic inlining technique
21 The Shared inlining technique
- Attempts to avoid the drawbacks of Basic
- Principal idea
- identify element nodes that are represented in multiple relations
- share them by creating separate relations
22 The Shared inlining technique
- Create a separate relation for every element whose node in the DTD graph has in-degree greater than one
- nodes with in-degree 1 are inlined
- nodes with in-degree 0 get a separate relation
- Elements directly below a "*" node are made into separate relations
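The in-degree rules can be sketched as a small decision function. This is a simplified illustration, not the paper's procedure: the graph, the element names and the functions `shared_relations` and `inlined_into` are hypothetical, "below a star node" is modelled as a flag on the edge, and recursive DTDs are not handled (the example graph is acyclic).

```python
# Hypothetical simplified DTD graph: element -> list of (child, below_star).
GRAPH = {
    "book": [("booktitle", False), ("author", False)],
    "article": [("title", False), ("author", True)],
    "author": [("name", False)],
    "booktitle": [], "title": [], "name": [],
}

def shared_relations(graph):
    """Elements that get their own relation under the rules listed above."""
    in_degree = {node: 0 for node in graph}
    below_star = set()
    for edges in graph.values():
        for child, starred in edges:
            in_degree[child] += 1
            if starred:
                below_star.add(child)
    return {
        node for node in graph
        if in_degree[node] == 0      # cannot be reached from another element
        or in_degree[node] > 1       # shared by several parents
        or node in below_star        # set-valued child
    }

def inlined_into(graph, relations, root):
    """Elements whose values are stored as columns of root's relation."""
    columns, stack = [], [root]
    while stack:
        for child, _ in graph[stack.pop()]:
            if child not in relations:
                columns.append(child)
                stack.append(child)
    return columns

relations = shared_relations(GRAPH)
print(sorted(relations))                          # ['article', 'author', 'book']
print(inlined_into(GRAPH, relations, "author"))   # ['name']
```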
23 The Shared inlining technique
24 The Hybrid inlining technique
- Inlines some elements that are not inlined in Shared
- Inlines elements with in-degree greater than one, provided they are
- not recursive
- not reached through a "*" node
25 The Hybrid inlining technique
26 A qualitative evaluation of the Basic, Shared and Hybrid techniques
- Evaluation metric
- Major concern: efficiency of query processing
- Average number of SQL joins required to process path expressions of a certain length N
- Measurements
- The average number of SQL queries generated for path expressions of length N
- The average number of joins in each SQL query for path expressions of length N
- The total average number of joins needed to process path expressions of length N
- The comparison concentrates on Shared and Hybrid
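The three measurements are related by simple arithmetic, which the following sketch makes explicit. The function name `join_metrics`, the input layout and the sample numbers are assumptions for illustration only; they are not results from the paper.

```python
from statistics import mean

def join_metrics(paths):
    """paths: one entry per path expression of length N; each entry lists
    the number of joins in every SQL query generated for that expression."""
    queries_per_path = mean(len(joins) for joins in paths)
    joins_per_query = mean(j for joins in paths for j in joins)
    total_joins_per_path = mean(sum(joins) for joins in paths)
    return queries_per_path, joins_per_query, total_joins_per_path

# Made-up numbers for three path expressions of the same length.
sample = [[2], [1, 1], [0, 3, 3]]
queries, per_query, total = join_metrics(sample)
print(queries, per_query, total)
```

When the joins-per-query average is taken over all generated queries, as here, the total average equals the product of the other two, so a schema can lose on one measurement and still win on the total.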
27 Evaluation results
- Hybrid eliminates a large number of joins for some DTDs
- Hybrid requires more SQL queries than Shared for some DTDs
- Shared always produces at least as many joins per SQL query as Hybrid
- Hybrid always produces at least as many SQL queries as Shared
29 Translating semi-structured queries to SQL queries
- Semi-structured query languages are more flexible than SQL: they allow path expressions with various operators and wildcards
- Converting queries with simple path expressions to SQL
- Converting simple recursive path expressions to SQL
- Converting arbitrary path expressions to simple recursive path expressions
30 Converting queries with simple path expressions to SQL
- The relations corresponding to the start of each root path expression are identified
- They are added to the FROM clause of the SQL query
- Path expressions are translated into joins among relations
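The translation of a simple path expression into joins can be sketched as follows. This is a minimal illustration, not the paper's translator: the function `path_to_sql`, the `book` and `author` tables and the convention that every relation has an `id` key with children pointing back through `parentID` are all assumptions.

```python
import sqlite3

def path_to_sql(path, select_column):
    """Translate a simple path over relations into one SQL join query.

    Assumes (hypothetically) that every relation has an `id` key and that
    each child relation stores its parent's key in `parentID`.
    """
    from_clause = [path[0]]                       # start of the path
    for parent, child in zip(path, path[1:]):     # one join per path step
        from_clause.append(
            f"JOIN {child} ON {child}.parentID = {parent}.id"
        )
    return f"SELECT {path[-1]}.{select_column} FROM " + " ".join(from_clause)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE author (id INTEGER PRIMARY KEY, parentID INTEGER, lastname TEXT);
INSERT INTO book VALUES (1, 'First'), (2, 'Second');
INSERT INTO author VALUES (10, 1, 'Smith'), (11, 2, 'Jones');
""")

sql = path_to_sql(["book", "author"], "lastname")
print(sql)
result = conn.execute(sql + " ORDER BY author.id").fetchall()
print(result)
```

Path steps that stay inside one relation because the element was inlined need no join at all, which is why the choice of inlining technique changes the join counts.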
31 Converting simple recursive path expressions to SQL
- Determine the initialization of the recursion and the actual recursive path expression
- Example: ask for the names of all editors reachable directly or indirectly from the monograph with the title "Subclass Cirripedia"
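The split into an initialization and a recursive step maps naturally onto a recursive SQL query. The sketch below uses SQLite's `WITH RECURSIVE` as a stand-in for whatever recursion support the target DBMS offers; the `monograph` and `editor` tables, their columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- A monograph may be nested inside an editor, and an editor inside a monograph.
CREATE TABLE monograph (id INTEGER PRIMARY KEY, parentID INTEGER, title TEXT);
CREATE TABLE editor (id INTEGER PRIMARY KEY, parentID INTEGER, name TEXT);
INSERT INTO monograph VALUES (1, NULL, 'Subclass Cirripedia');
INSERT INTO editor VALUES (10, 1, 'Editor A');
INSERT INTO monograph VALUES (2, 10, 'Other Work');
INSERT INTO editor VALUES (11, 2, 'Editor B');
INSERT INTO monograph VALUES (3, NULL, 'Unrelated');
INSERT INTO editor VALUES (12, 3, 'Editor C');
""")

names = [row[0] for row in conn.execute("""
    WITH RECURSIVE reach(monograph_id) AS (
        -- initialization: the monograph the query starts from
        SELECT id FROM monograph WHERE title = 'Subclass Cirripedia'
        UNION
        -- recursive step: monographs nested under editors of reached monographs
        SELECT m.id
        FROM reach AS r
        JOIN editor AS e ON e.parentID = r.monograph_id
        JOIN monograph AS m ON m.parentID = e.id
    )
    SELECT e.name
    FROM reach AS r JOIN editor AS e ON e.parentID = r.monograph_id
    ORDER BY e.name
""")]
print(names)  # ['Editor A', 'Editor B']
```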
32 Converting arbitrary path expressions to simple recursive path expressions
- Path expressions can be of arbitrary complexity
- e.g., ask for all the name elements reachable directly or indirectly through monograph
- Method
- take the path expressions appearing in such a query
- translate them into (possibly many) simple path expressions
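One way to picture the method is to enumerate, from the schema, every concrete path that a wildcard-style expression can match. The sketch below does this for an acyclic schema only; the `SCHEMA` dictionary, the element names and the function `expand` are hypothetical, and a recursive schema would instead yield simple recursive path expressions rather than a finite list.

```python
# Hypothetical acyclic schema graph: element -> child elements.
SCHEMA = {
    "monograph": ["title", "author", "editor"],
    "author": ["name"],
    "editor": ["name"],
    "title": [], "name": [],
}

def expand(start, target, graph):
    """List every simple path from start down to an element named target."""
    paths = []
    def walk(node, prefix):
        for child in graph[node]:
            path = prefix + [child]
            if child == target:
                paths.append(".".join(path))
            walk(child, path)
    walk(start, [start])
    return paths

print(expand("monograph", "name", SCHEMA))
# ['monograph.author.name', 'monograph.editor.name']
```

Each resulting simple path can then be translated to SQL on its own and the per-path results combined.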
34 Converting relational results to XML
- Explore how the results of SQL queries can be converted into XML documents
- Constructing arbitrary XML results is difficult, and this is the main drawback of the current relational approach
- XML-QL is used as the illustrative query language because it provides XML structuring constructs
35 Simple structuring
- Example: ask for the first and last names of all the authors of books.
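For simple structuring, each result tuple can be wrapped in tags as it is read. The sketch below assumes the SQL query has already returned (firstname, lastname) tuples; the sample rows, the `result` wrapper tag and the function `rows_to_xml` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the tuples returned by the SQL query (made-up data).
rows = [("Ada", "Smith"), ("Bo", "Jones")]

def rows_to_xml(rows):
    """Wrap each (firstname, lastname) tuple in an <author> element."""
    result = ET.Element("result")          # hypothetical wrapper tag
    for first, last in rows:
        author = ET.SubElement(result, "author")
        ET.SubElement(author, "firstname").text = first
        ET.SubElement(author, "lastname").text = last
    return ET.tostring(result, encoding="unicode")

xml_text = rows_to_xml(rows)
print(xml_text)
```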
36 Tag variables
- Generate a relational query that contains the tag value as an element of the result tuple.
- Example: ask for the names of the authors of all publications, nested under a tag specifying the type of publication
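Carrying the tag value inside the result tuple can be sketched with SQLite and ElementTree. The two author tables, the `tag` column, the wrapper element and the sample rows are hypothetical; the point is only that the publication type travels as ordinary data and becomes an element name when the XML is built.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book_author (lastname TEXT);
CREATE TABLE article_author (lastname TEXT);
INSERT INTO book_author VALUES ('Smith');
INSERT INTO article_author VALUES ('Jones');
""")

# The tag value is a constant column in each branch of the union.
rows = conn.execute("""
    SELECT 'book' AS tag, lastname FROM book_author
    UNION ALL
    SELECT 'article' AS tag, lastname FROM article_author
    ORDER BY tag, lastname
""").fetchall()

result = ET.Element("result")                   # hypothetical wrapper tag
for tag, lastname in rows:
    publication = ET.SubElement(result, tag)    # element name comes from the tuple
    ET.SubElement(publication, "author").text = lastname

xml_text = ET.tostring(result, encoding="unicode")
print(xml_text)
```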
37 Grouping
- Example: ask for all the publications of an author to be grouped together and, within this structure, for the titles of the publications to be grouped by the type of publication.
- Two approaches can be used
- The relational database orders the result tuples by last name and then by publication type, and the result is scanned to construct the XML document. This is shown in the figure.
- Get an unordered set of tuples and do the grouping outside the database engine, by last name and by type.
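The first approach, a single scan over tuples the database has already sorted, can be sketched with `itertools.groupby`. The sample rows, the `lastname` attribute and the function `group_to_xml` are assumptions for illustration; the rows stand in for the output of a query ordered by last name and then publication type.

```python
from itertools import groupby
import xml.etree.ElementTree as ET

# Tuples as an ORDER BY lastname, pubtype query would return them (made-up data).
rows = [
    ("Jones", "article", "Title C"),
    ("Smith", "article", "Title B"),
    ("Smith", "book", "Title A"),
    ("Smith", "book", "Title D"),
]

def group_to_xml(sorted_rows):
    """Single scan over sorted tuples, opening a new element when a key changes."""
    result = ET.Element("result")
    for lastname, by_author in groupby(sorted_rows, key=lambda r: r[0]):
        author = ET.SubElement(result, "author", lastname=lastname)
        for pubtype, by_type in groupby(by_author, key=lambda r: r[1]):
            group = ET.SubElement(author, pubtype)   # tag variable as element name
            for _, _, title in by_type:
                ET.SubElement(group, "title").text = title
    return result

root = group_to_xml(rows)
print(ET.tostring(root, encoding="unicode"))
```

For the second approach the same function could be called on `sorted(rows)`, which moves the ordering work out of the database engine.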
38 Grouping
- Treating tag variables as attributes in the result relation provides a uniform way of treating the contents of the result XML document
- Some relational database functionality is either not fully exploited or is duplicated outside the database.
39 Converting other element types
- Complex element construction
- mainly concerns set values
- Heterogeneous results
- the different queries can be handled in different ways, and the results can then be merged together
- Nested queries
- use an outer join to construct the association between a query and a sub-query
40 Conclusions
- The paper studies the virtues and limitations of the relational model for processing queries over XML
- The advantage is the reuse of high-performance relational database technology. The paper shows that most queries on XML documents can be handled using a relational database
- Limitations
- Awkward: complex XML constructs in query results are hard to build
- Inefficient: fragmentation causes too many joins in the evaluation of simple queries
41 Future work
- Support for sets
- Untyped/variable-type references
- Information-retrieval-style indices
- Flexible comparison operators
- Multiple-query optimization/execution
- More powerful recursion
42 Questions?