Title: A New Inlining Algorithm for Mapping XML DTDs to Relational Schema
1A New Inlining Algorithm for Mapping XML DTDs to
Relational Schema
- Speaker: Shiyong Lu
- Email: shiyong_at_cs.wayne.edu
- Wayne State University
- Joint work with Yezhou, Mustafa and Farshad
2Introduction
- XML is rapidly emerging on the World Wide Web as a standard for representing and exchanging data.
- The number of XML documents is increasing each day.
- It is critical to store and query XML documents efficiently and effectively.
3Current approaches of storing and querying XML
documents
- Native XML repositories, e.g., Software AG's Tamino [2], eXcelon's XIS [1].
- XML support enabled by commercial database systems such as SQL Server, Oracle, and DB2, in which XMLType is introduced.
- Use RDBMS/ODBMS to store and query XML documents [8, 10, 16, 11].
4Issues of the relational approach
- The XML data model needs to be mapped into the relational model.
- XML queries need to be translated into SQL queries.
- Query results need to be tagged to XML format.
5Our contributions
- We proposed a new inlining algorithm to map DTDs to relational schemas.
- Improvements over the shared-inlining algorithm [16]:
- Completeness
- Redundancy elimination for shared nodes
- Optimizations
- Efficiency
6Outline of the talk
- Introduction of XML DTDs
- Mapping DTDs to relational schemas
- Simplifying DTDs
- Creating and inlining DTD graphs
- Generating relational schemas
- An example
- Conclusions and future work
7An overview of DTDs: A DTD example
- <!DOCTYPE memo
- <!ELEMENT memo (to, from, date, subject?, body)>
- <!ATTLIST memo security CDATA>
- <!ATTLIST memo lang CDATA>
- <!ELEMENT to (#PCDATA)>
- <!ELEMENT from (#PCDATA)>
- <!ELEMENT date (#PCDATA)>
- <!ELEMENT subject (#PCDATA)>
- <!ELEMENT body (para)>
- <!ELEMENT para (#PCDATA)>
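As an illustration, a document shaped like the memo DTD above could be read as follows. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the element text and attribute values are invented for the example, the helper name `child_tags` is hypothetical, and ElementTree does not validate the document against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the memo DTD; all values are invented for illustration.
MEMO_XML = """
<memo security="internal" lang="en">
  <to>Alice</to>
  <from>Bob</from>
  <date>2003-01-15</date>
  <subject>Schema mapping</subject>
  <body><para>See the attached DTD.</para></body>
</memo>
"""

def child_tags(xml_text):
    """Return the root tag, its attributes, and the ordered tags of its children."""
    root = ET.fromstring(xml_text.strip())
    return root.tag, dict(root.attrib), [child.tag for child in root]

tag, attrs, children = child_tags(MEMO_XML)
print(tag, attrs, children)
```

The ordered list of child tags mirrors the content model of memo, which is the parent-child information a relational mapping has to preserve.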
8DTD Document Type Definition
- <!DOCTYPE root-element [doctype-declaration...]>
- <!ELEMENT element-name content-model>, with content-model operators |, ,, *, +, ?
- <!ATTLIST element-name attr-name attr-type attr-default ...>
9DTD Document Type Definition (cont)
- <!ATTLIST element-name attr-name attr-type attr-default ...> declares which attributes are allowed or required in which elements
- attribute types:
- CDATA: any value is allowed (the default)
- (value|...): enumeration of allowed values
- ID, IDREF, IDREFS: ID attribute values must be unique (contain "element identity"); IDREF attribute values must match some ID (reference to an element)
- ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION: just forget these... (consider them deprecated)
- attribute defaults:
- #REQUIRED: the attribute must be explicitly provided
- #IMPLIED: attribute is optional, no default provided
- "value": if not explicitly provided, this value is inserted by default
- #FIXED "value": as above, but only this value is allowed
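The four attribute defaults can be summarized in a small sketch. The function name `resolve_attribute` and the tuple encoding of the default declaration are assumptions made for this illustration; they are not part of any XML library.

```python
def resolve_attribute(default_decl, provided=None):
    """Return the effective attribute value under a DTD default declaration.

    default_decl is one of (an encoding assumed for this sketch):
      ("#REQUIRED",), ("#IMPLIED",), ("default", value), ("#FIXED", value)
    provided is the value written in the document, or None if absent.
    """
    kind = default_decl[0]
    if kind == "#REQUIRED":
        if provided is None:
            raise ValueError("required attribute is missing")
        return provided
    if kind == "#IMPLIED":
        return provided  # optional, no default: may stay None
    if kind == "default":
        return default_decl[1] if provided is None else provided
    if kind == "#FIXED":
        if provided is not None and provided != default_decl[1]:
            raise ValueError("attribute must equal the fixed value")
        return default_decl[1]
    raise ValueError(f"unknown default declaration: {kind}")

print(resolve_attribute(("default", "en")))        # value inserted by default
print(resolve_attribute(("default", "en"), "fr"))  # explicit value wins
```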
10Mapping DTDs to relational schemas
- Simplifying DTDs
- Creating and inlining DTD graphs
- Generating relational schemas
11Simplifying DTDs
- A DTD might be very complex due to nesting, e.g., <!ELEMENT a ((b, c, d?)?, (e?, f, (g, h?))?)>
- An XML query language is concerned with:
- The parent-child relationships between XML elements
- The relative order relationships between siblings (add an ordinal attribute to each relation)
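12DTD simplification rules
- 1. e+ → e*
- 2. e? → e
- 3. (e1 | ... | en) → (e1, ..., en)
- 4. (a) (e1, ..., en)* → (e1*, ..., en*)
- (b) e** → e*
- 5. (a) ..., e, ..., e, ... → ..., e*, ..., ...
- (b) ..., e, ..., e*, ... → ..., e*, ..., ...
- (c) ..., e*, ..., e, ... → ..., e*, ..., ...
- (d) ..., e*, ..., e*, ... → ..., e*, ..., ...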
13Example of simplifying a DTD
- <!ELEMENT a ((b, c, d?)?, (e?, f, (g, h?))?)>
- simplified to
- <!ELEMENT a (b, c, d, e, f, g, h)>
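The simplification of element a can be reproduced with a short sketch. The nested-tuple encoding of content models and the names `flatten` and `simplify` are assumptions made for this illustration. Only the `?` operator appears in the example as printed; the handling of `+`, `*`, `|` and repeated elements follows the usual DTD simplification conventions and is likewise an assumption here.

```python
def flatten(model, starred=False):
    """Yield (element_name, starred) pairs in document order.

    A model is an element name (str) or a pair (op, arg) with op in
    "seq", "choice", "opt", "star", "plus" (encoding assumed for this sketch).
    """
    if isinstance(model, str):
        yield model, starred
        return
    op, arg = model
    if op in ("seq", "choice"):      # nesting and choice both flatten to a sequence
        for child in arg:
            yield from flatten(child, starred)
    elif op == "opt":                # e? becomes e
        yield from flatten(arg, starred)
    elif op in ("star", "plus"):     # e+ and e* both become e*; stars distribute inward
        yield from flatten(arg, True)
    else:
        raise ValueError(f"unknown operator: {op}")

def simplify(model):
    """Return the simplified content model as an ordered list such as ['b', 'c*']."""
    starred_by_name = {}             # dicts preserve insertion order
    for name, starred in flatten(model):
        if name in starred_by_name:
            starred_by_name[name] = True   # a repeated element collapses to e*
        else:
            starred_by_name[name] = starred
    return [name + ("*" if s else "") for name, s in starred_by_name.items()]

# Content model of element a, as written in the example.
a_model = ("seq", [
    ("opt", ("seq", ["b", "c", ("opt", "d")])),
    ("opt", ("seq", [("opt", "e"), "f", ("seq", ["g", ("opt", "h")])])),
])
print(simplify(a_model))  # ['b', 'c', 'd', 'e', 'f', 'g', 'h']
```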
14Creating and inlining DTD graphs
- We create a DTD graph based on the simplified DTD. In the graph, nodes represent XML elements, and edges represent operators.
- Idea: inline a child c into its parent p if p can contain at most one occurrence of c.
- Rationale: inlined elements will produce a relation.
15Inlining DTD graphs
16Inlining
- Case 1: Element a is connected to b by a ,-edge and b has no other incoming edges: inline b into a.
- Case 2: Element a is connected to b by a ,-edge but b has other incoming edges: b is a shared node, no inlining.
- Case 3: Element a is connected to b by a *-edge: no inlining.
17Inlinable node
- Definition 2: Given a DTD graph, a node is inlinable if and only if it has exactly one incoming edge and that edge is a ,-edge.
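The definition translates directly into a check over the edge list. In this sketch the graph is assumed to be a list of `(parent, child, label)` triples, with the label `","` for a ,-edge and `"*"` for the other edge kind; this encoding and the name `inlinable_nodes` are hypothetical.

```python
from collections import defaultdict

def inlinable_nodes(edges):
    """Return the nodes with exactly one incoming edge, that edge being a ','-edge."""
    incoming = defaultdict(list)
    for parent, child, label in edges:
        incoming[child].append(label)
    return {node for node, labels in incoming.items() if labels == [","]}

edges = [
    ("a", "b", ","),   # b: single incoming ,-edge, so inlinable (Case 1)
    ("a", "c", ","),   # c: also reached from d, so shared and not inlinable (Case 2)
    ("d", "c", ","),
    ("a", "e", "*"),   # e: reached by a *-edge, so not inlinable (Case 3)
]
print(inlinable_nodes(edges))  # {'b'}
```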
18Inlinable tree
- Given a DTD graph and a node e in the graph, node e and all other inlinable nodes that are reachable from e by ,-edges constitute a tree. This tree is called the inlinable tree for node e (rooted at e).
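One way to collect the inlinable tree of a node is a plain traversal that follows ,-edges and stops at any non-inlinable node. This is a sketch under assumptions: the `(parent, child, label)` edge encoding and the name `inlinable_tree` are hypothetical, and "reachable" is read as reachable through inlinable nodes only, so that the result stays connected.

```python
from collections import defaultdict

def inlinable_tree(edges, root):
    """Return root followed by the inlinable nodes reachable from it via ','-edges."""
    incoming = defaultdict(list)
    comma_children = defaultdict(list)
    for parent, child, label in edges:
        incoming[child].append(label)
        if label == ",":
            comma_children[parent].append(child)

    tree, stack, seen = [root], [root], {root}
    while stack:
        node = stack.pop()
        for child in comma_children[node]:
            # Inlinable: exactly one incoming edge, and it is a ','-edge.
            if child not in seen and incoming[child] == [","]:
                seen.add(child)
                tree.append(child)
                stack.append(child)
    return tree

edges = [
    ("a", "b", ","), ("b", "c", ","),   # b and c are inlinable under a
    ("a", "d", "*"), ("d", "f", ","),   # d is not inlinable; f is inlinable under d
    ("a", "s", ","), ("g", "s", ","),   # s is shared
]
print(inlinable_tree(edges, "a"))  # ['a', 'b', 'c']
print(inlinable_tree(edges, "d"))  # ['d', 'f']
```

Each edge is inspected a constant number of times, so this traversal is linear in the size of the graph.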
19Complexity of inlining
- Theorem 2 (Complexity)
- Our inlining algorithm runs in O(n) time, where n is the number of elements in the input DTD.
20The inlining procedure
21The inlining procedure (cont)
22Generating relational schema
- For each node e, a relation e is generated with the following attributes:
- ID is the primary key, and for each XML attribute A of e, a corresponding relational attribute A is generated with the same name.
- If |e.inlinedSet| > 2, introduce attribute nodetype to indicate the type of the XML element.
- The names of all the terminal XML elements in e.inlinedSet.
- If there is a ,-edge from e to node c, then introduce c.ID as a foreign key of e referencing relation c.
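The attribute list of one generated relation can be sketched as follows. The function name `relation_columns`, its parameters, and the sample inputs are hypothetical; the caller is assumed to already know the node's inlined set and which of its members are terminal elements. The nodetype threshold is taken as printed on the slide.

```python
def relation_columns(xml_attributes, inlined_set, terminals, comma_children):
    """Return the column names of the relation generated for one node.

    xml_attributes : XML attribute names of the node
    inlined_set    : element names inlined into the node (assumed given)
    terminals      : names of terminal elements (assumed given)
    comma_children : nodes reached from this node by a ','-edge
    """
    columns = ["ID"]                                  # primary key
    columns += list(xml_attributes)                   # one column per XML attribute
    if len(inlined_set) > 2:                          # threshold as printed on the slide
        columns.append("nodetype")
    columns += [name for name in inlined_set if name in terminals]
    columns += [f"{child}.ID" for child in comma_children]  # foreign keys
    return columns

# Hypothetical node a with XML attribute x, inlined set {a, b, c}, and a ,-edge to d.
print(relation_columns(["x"], ["a", "b", "c"], {"b", "c"}, ["d"]))
# ['ID', 'x', 'nodetype', 'b', 'c', 'd.ID']
```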
23Generating relational schema (cont)
- If there are at least two relations t_1(ID) and t_2(ID) generated by step 1, then we combine all the relations of the form t(ID) into one single relation table1(ID, nodetype).
- If there are at least two relations t_1(ID, t_1) and t_2(ID, t_2) generated by step 1, then we combine all the relations of the form t(ID, t) into one single relation table2(ID, nodetype, pcdata).
- If there is at least one edge in the inlined DTD graph, then we introduce relation edge(parentID, childID, parentType, childType).
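The combination steps above can be sketched over a dictionary of relations. The function name `combine_relations`, the dictionary encoding (relation name mapped to its column list), and the sample relations are assumptions made for this illustration.

```python
def combine_relations(relations, has_edges):
    """Merge trivial relations into table1/table2 and add the edge relation.

    relations : dict mapping relation name -> list of column names
    has_edges : True if the inlined DTD graph has at least one edge
    """
    id_only = [t for t, cols in relations.items() if cols == ["ID"]]
    id_text = [t for t, cols in relations.items() if cols == ["ID", t]]
    result = dict(relations)
    if len(id_only) >= 2:            # relations of the form t(ID)
        for t in id_only:
            del result[t]
        result["table1"] = ["ID", "nodetype"]
    if len(id_text) >= 2:            # relations of the form t(ID, t)
        for t in id_text:
            del result[t]
        result["table2"] = ["ID", "nodetype", "pcdata"]
    if has_edges:
        result["edge"] = ["parentID", "childID", "parentType", "childType"]
    return result

# Hypothetical output of step 1.
step1 = {
    "a": ["ID", "x", "b"],
    "p": ["ID"], "q": ["ID"],
    "r": ["ID", "r"], "s": ["ID", "s"],
}
for name, cols in combine_relations(step1, has_edges=True).items():
    print(name, cols)
```

In the combined tables, nodetype records which original relation a row came from, so no information is lost by merging.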
24Improvement over the shared-inlining algorithm
- Completeness
- Redundancy elimination for shared nodes
- Optimizations
- Efficiency
- See the next slide for examples.
25Examples
26A complete example
27DTD graph
28Inlined DTD graph
29Generated relational schema
30Conclusions
- We have developed a new inlining algorithm that maps a given input DTD to a relational schema.
- We made several improvements over the shared-inlining algorithm. Experimental results will be presented in an upcoming paper.
31Future work
- Lossless schema mapping: how to maintain the sibling order relationship as well, so that the original XML document can be reconstructed.
- Maintain ID/IDREF/IDREFS in terms of key and foreign key constraints.