L09: Introduction to XML Data Management

About This Presentation

Title:

L09: Introduction to XML Data Management

Description:

L09: Introduction to XML Data Management XML and XML Query Languages Structural Summary and Coding Scheme Managing XML Data in Relational Systems – PowerPoint PPT presentation

Number of Views:207

Avg rating:3.0/5.0

Slides: 90

Provided by: Dom197

Category:

more less

Transcript and Presenter's Notes

Title: L09: Introduction to XML Data Management

1
L09 Introduction to XML Data Management

XML and XML Query Languages
Structural Summary and Coding Scheme
Managing XML Data in Relational Systems

2
XML and XML Query Languages

XML and XML Query Languages
Structural Summary and Coding Scheme
Managing XML Data in Relational Systems

3
XML

Extensible Markup Language for data
A W3C standard to complement HTML
http//www.w3.org/TR/2000/REC-xml-20001006
(version 2, 10/2000)
Standard for publishing and interchange
Origins structured text SGML
Cleaner SGML for the Internet
Motivation
HTML describes presentation
XML describes content

4
XML Describing the Content

ltprojectgt
lttalk gt
lttitlegt XML Query Processing Optimization
lt/titlegt
ltdategt March 18, 2004 lt/dategt
ltinstructorgt Instructor
ltnamegt Lu Hongjun lt/namegt
ltaffiliationgt HKUST lt/
affiliation gt
ltemailgt luhj_at_cs.ust.hk lt/emailgt
ltnamegt Jeffrey X. Yu lt/namegt
ltaffiliationgt CUHK lt/
affiliation gt
ltemailgt yu_at_se.cuhk.edu.hk
lt/emailgt
lt/ instructor gt
lt/talkgt
lt/projectgt

5
XML Document/Data

Hierarchical document format for information
exchange in WWW
Self describing data (tags)
Nested element structures having a root
Element data can have
Attributes
Sub-elements

6
Basic XML Structures

Elements lttitlegt lt/titlegt,ltnamegt lt/namegt
Open close tags or empty tag
Ordered, nestable
an element can be empty
Attributes
PCDATA/CDATA
An XML document single root element
well formed XML document if it has matching tags

7
Basic XML Structures Attributes

Single-valued, ordered
ltproject proj_id P1234 budget 1000000gt
lttitlegt XML Data Management lt/titlegt
ltyeargt 2003-2004 lt/yeargt
lt/projectgt
Special types ID, IDREF, IDREFS
ltmember idm007gt ltnamegt James lt/namegt
lt/membergt
ltproject idp123gt lttitlegt XML Data Management
lt/titlegt
ltmember
idrefm007 m008/gt
lt/projectgt

8
Other XML Structures

Processing instructions instructions for
applications
lt?xml version1.0?gt
CDATA sections treat content as char data
lt!CDATAlttaggtWhatever!!!lt/taggtltwhatevergtgt
Comments just like HTML
lt!-- Comments --gt
Entities external resources and macros
my-entity (non-parameter entity)
param-entity (parameter entity for DTD
declarations)

9
Data Centric vs. Document centric

ltprojectgt
ltpnamegt XML lt/pnamegt
ltmember ID3, age 50 gt
ltnamegtH. Lu lt/namegt
ltemailgt luhj_at_cs.ust.hk lt/emailgt
ltpublication author H. Lugt
lttitlegt Managing XML data using RDBMS lt/titlegt
ltyeargt 2001 lt/yeargt
lt/publicationgt
lt/membergt
ltmember ID24, age 35 gt
ltnamegt J.X. Yu lt/namegt
ltprojectgt
ltpnamegt Data mining lt/pnamegt
lt/projectgt
lt/membergt
lt/projectgt

ltbiogt ltpgt Dr Lu is a professor at ltbgt
HKUST. lt/bgt He worked at ltbgt NUSgt lt/bgt
before 1998. lt/pgt lt/biogt
10
XML Data Model

Several competing models
Document Object Model (DOM)
a platform- and language-neutral interface that
will allow programs and scripts to dynamically
access and update the content, structure and
style of documents
http//www.w3.org/DOM/

11
DOM Core Interface Node

DOM tree a tree-like structure of Node objects
the root of the tree is a document object.
Node Object (nodeName, nodeValue, nodeType,
parentNode, childnodes, firstChild, lastChild,
previousSibling, nextSibling, attributes,
ownerDocument)
nodeType ELEMENT_NODE, ATTRIBUTE_NODE,
TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE,
PROCESSING_INSTRUCTION_NODE, COMMENT_NODE,
DOCUMENT_NODE, DOCUMENT_TYPE_NODE,
DOCUMENT_FRAGMENT_NODE, NOTATION_NODE

12
DOM Interface

Each node of the document tree may have a number
of child nodes, contained in a NodeList object.
Two ways of accessing a node object
Based on the location of an object in the
document tree
Based on the name of an object

13
A Sample DOM Tree
Project Node NodeTypeELEMENT_NODE tagName
project NodeValue nill
1
project
2
3
24
pname
member
member
XML
28
26
27
65
66
publication
project
name
email
name
H. Lu
69
luhj_at_cs.ust.hk
J.X. Yu
70
294
71
author
title
pname
year
Managing
2001
Data mining
name Node NodeTypeTEXT_NODE tagName name
NodeValue H. Lu
publicatiom Node NodeTypeELEMENT_NODE tagName
publication NodeValue nill
14
Data Graph

Similar to DOM tree, but may have different
notations that represent an XML document

1
project
2
3
24
pname
member
member
28
26
65
27
66
author
XML
publication
project
name
email
name
50
69
70
294
71
50
J.X. Yu
luhj_at_cs.ust.hk
H. Lu
title
year
pname
Managing
2001
Data mining
15
Document Type Definition

Inherited from SGML DTD standard
BNF grammar establishing constraints on element
structure and content
Specification of attributes and their types
Definitions of entities

16
A Sample DTD
project
1

publication
member
pname

3
4
2
author

?
5
6
7
9
8

lt?xml version"1.0" standalone"yes"?gt
lt!DOCTYPE Research gt
lt!ELEMENT project(pname,member,publication)gt
lt!ELEMENT pname(PCDATA)gt
lt!ELEMENT member (name,email?, publication,
project)gt
lt!ATTLIST member ID ID REQUIREDgt
lt!ELEMENT name(PCDATA)gt
lt!ELEMENT email(PCDATA)gt
lt!ELEMENT publication(title,year)gt
lt!ATTLIST publication author IDREF IMPLIED)gt
lt!ELEMENT title(PCDATA)gt
lt!ELEMENT year(PCDATA)gt

ID
name
email
title
year
17
XML Query Languages

There have been a large number of proposals
during the past few years
XPath Clark, DeRose, W3C 1999
XQuery Boag, Chamberlin et al, W3C 2003
XML-QLDeutsch, Fernandez et al, QL99
XQL Robie, Lapp, QL99
XML_GL Ceri, Comai et al, WWW99
Quilt Chamberlin, Robie et al, 2000
From W3C
XQuery 1.0 (W3C Working Draft, 12 November 2003)
http//www.w3.org/TR/xquery/
XPath 2.0 (W3C Working Draft 12 November 2003)
http//www.w3.org/TR/xpath20/

18
XPath XML Path Language

The purpose
To address the node of an XML tree using a path
notation for navigating through the hierarchical
structure of an XML document.
Uses a compact, non-XML syntax
Designed to be embedded in a host language (e.g.,
XSLT, XQuery)
XPath Expressions
String of characters
Value of an expression is always an ordered
collection of zero or more items (atomic value,
node)

19
XPath Steps

An XPath expression has following syntax
Path/Step1/Step2//Stepn,
where each Xpath step is defined as follows
StepAxisNode-test Predicate
Axis specifies the direction in which the
document should be navigated.
For example, childtitleposition() 2
There are 12 axes child, descendant,
descendant-or-self, parent, ancestor,
ancestor-or-self, following, preceding,
following-sibling, preceding-sibling, attribute,
self, namespace

20
XPath Path Expressions

project matches a project element
matches any element
/ matches the root element
/project matches a project element under root
project/member matches a member in project
project//name matches a name in project, at any
depth
//title matches a title at any depth
memberpublciation matches a member or a
publication
_at_age matches an age attribute
project/member/_at_age matches age attribute in
member, in project
project/member/_at_agelt45 matches a member with
age lt 45

21
XPath Query Examples
/project/member/name matches a name of member in
project
Result ltnamegt H. Lu lt/namegt ltnamegt J.X. Yu
lt/namegt
/project/publication/venue
Result empty there was no venue element
//pname matches a pname at any depth
Result ltpnamegt XML lt/pnamegt ltpnamegt Data
mining lt/pnamegt

/project/member/name/text() text of name elements

Result H. Lu J.X. Yu
22
More XPath Queries

/project/memberpublication
ltmember ID3, age 50 gt
ltnamegtH. Lu lt/namegt
ltemailgt luhj_at_cs.ust.hk lt/emailgt
ltpublication author H. Lugt
lttitlegt Managing XML data using RDBMS lt/titlegt
ltyeargt 2001 lt/yeargt
lt/publicationgt
lt/membergt
/project/member_at_age lt 45
ltmember ID24, age 35 gt
ltnamegt J.X. Yu lt/namegt
ltprojectgt
ltpnamegt Data mining lt/pnamegt
lt/projectgt
lt/membergt
/project member/_at_age lt 25
No element returned
/project/memberemail/text()

23
XQuery

XQuery 1.0 An XML Query Language
W3C Working Draft 12 November 2003
http//www.w3.org/TR/xquery/
XPath expressions are still the basic building
block

24
XQuery

XQuery 1.0 An XML Query Language
W3C Working Draft 12 November 2003
http//www.w3.org/TR/xquery/
FLWR Expressions FOR-LET-WHERE-RETURN

25
XQuery Examples

ltresultgt
FOR x in /project/member/publication
WHERE x/year gt 2000
RETURN ltrecentpubgt x/title
lt/ recentpub gt
lt/resultgt

distinct a function that eliminates duplicates
ltactive_membersgt FOR m IN distinct(document(pro
ject.xml")//member) LET p document(project.x
ml")//publicationauthor m WHERE count(p) gt
10 RETURN m lt/ active_members gt
count a (aggregate) function that returns the
number of elements
26
Structural Summary and Coding Scheme

XML and XML Query Languages
Structural Summary and Coding Scheme
Managing XMLData in Relational Systems

27
Structural Summary

A structural summary for a data graph GD(VD, ED )
is another labeled graph GI (VI, EI ).
Each node vi ? GI represents a set of nodes,
extent(vi ), and extent(vi ) ? VD.
An edge ed (vi , vi) ? GI exists if there is an
edge ed (vd , vd) ? GD vd ? extent(vi ), vd ?
extent(vi ).
The summary preserves all the paths in the data
graph. A path expression query can be executed on
GI instead of GD, which is most likely more
efficient since size of GI is much smaller than
GD.

28
Structural Summary

Basically, nodes in the data graph is grouped
based on certain criteria, each group of nodes is
represented by one node in the summary.
The size of summary will be determined by the
grouping criteria.
Desired properties in supporting evaluating path
expression queries using summary
The results are safe (no false negatives)
If not safe, only approximate answers can be
obtained
The results are precise contains no false
positives
If not precise, need validate results using the
data graph

29
Structural Summary
R
r
A
a1,a2,a3
B
b1,b2,b3
C
c1,c2,c3
Data Graph
Structural summary
30
Sample Structural Summaries

Query workload independent summaries
Data Guide
1-index Milo, Suciu, ICDT99
A(k) index Kaushik, Shenoy, ICDE02
Query workload dependent summaries
APEX Chung, Min et al, SIGMOD02
D(k)-index Chen, Lim et al, SIGMOD03

31
Data Guides

DataGuide dynamic structural summary of current
database
Each label path in database appears once in
DataGuide
No extraneous paths in DataGuide
Maintained incrementally as database evolves
Serves role of schema

C1 is duplicated to achieve determinism in
DataGuides
32
Bisimilarity and 1-Index

Most existing structural summary are based on
graph bisimilarity, defined as follows
Two data nodes u and v are bisimilar (u ? v) if
u and v have the same label
if u is a parent of u, then there is a parent
v of v such that u? v, and vice versa
Intuitively, the set of paths coming into them is
the same if two nodes are bisimilar
Tova Milo and Dan Suciu. Index structures for
path expressions. In ICDT99. 277-295, January
1999.

33
1-Index

1-index Each index node represents an
equivalence class, in which data nodes are
mutually bisimilar.
Evaluating path expression query using 1-index
safe the result always contains the result of
evaluating on the data graph
precise its result contains no false data node

34
K-bisimilarity

1-index can be big
Formally, based on the notion of k-bisimilarity
(?k ) which is defined inductively
Node u ?k v iff u ?k-1 v, and for every parent
u of u, there is a parent v of v such that u
?k-1 v, and vice versa
For any two nodes, u and v, u ?0 v iff u and v
have the same label
Intuitively, if two data nodes are k-bisimilar,
the set of paths coming into them with length (?
k) is the same

35
A(k)-Index

A(k)-Index group nodes based on their local
structure paths of length up to k, instead of
the global path information
data nodes in each index nodes of A(k) index are
mutually k-bisimilar
Evaluation path expression query using
A(k)-index
safe its result always contains the result of
evaluating on the data graph
precision its result contains no false data
node
Raghav Kaushik, Pradeep Shenoy, Philip Bohannon
and Ehud Gudes. Exploiting local similarity for
indexing paths in graph-structured data. ICDE02,
129-140.

36
A(2)-Index
C2 and C3 can be grouped because their length-2
incoming paths are the same
37
APEX Adaptive Path Index

1-index, A(k)-index and FB index are all
workload independent
APEX Adaptive Path index
Maintains two types of paths in the summary
All paths of length two so that all queries can
be answered using APEX
Full paths are maintained for those paths that
frequently appear in query workload so that
frequently asked queries can be answered
efficiently
A hash table is included in the index so that
partial matching queries with the
self-or-descendent axis (//) can be processed
efficiently
C-W Chung, J-K Min, K. Shim, APEX An Adaptive
Path Index for XML Data, SIGMOD 02

38
D(k)-Index

A generalization of 1-Index and A(k)-Index.
Assigning different local bisimilarites to index
nodes in the summary structure according to the
query load to optimize its structure.
for any two index nodes ni and nj, k(ni) ?
k(nj)-1 if there is an edge from ni to nj, in
which k(ni) and k(ni) are ni and njs local
bisimilarities, respectively.
Advantage over 1-Index and A(k)-Index
workload-sensitive
can be more efficiently updated
Qun Chen, Andrew Lim and Kian Win Ong.
D(k)-index An adaptive structural summary for
graph-structured data. SIGMOD 03, 134-144.

39
Node (Edge) Encoding

Structural relationships
Is node u an ancestor of node v?
Is node u the parent of node v?
Assigning a unique code to a node (edge) in the
data graph so that the above question can be
answered by looking at the codes rather than the
original data graphs.
Issues
Length of the code.
Complexity for computing the structural
relationship. between two nodes from their codes.
Efficient code generation and code maintenance.

40
XML Data Coding Scheme

Region-based
XML document is ordered
Codes are assigned based on the lexicographical
location of an element in the original document
Path-based
XML document is nested
Codes are assigned based on the nesting structure
of the document, or the path that reaches and
element from the root.
There are quite a number of variants for both
categories of coding schemes

41
XML Region Based Coding

Region code (start, end, level)
u is an ancestor of v iff u.start lt v.start lt
u.end
u is the parent of v, additionally, u.level
v.level-1
Only a depth-first traversal for code generation
Property strictly nesting
Completely disjoint (case 1,4) or containing
(case 2,3)
Formally, a.start lt b.start lt a.end, if a is an
ancestor of b

42
Sample of Region Codes

The order of start values is also the document
order
The region can also be interpreted as an interval

43
Dewey
1

ltcontactgt
ltnamegtblahlt/namegt
ltphonegt
ltofficegt1234lt/officegt
lthomegt5678lt/homegt
ltmobilegt0000lt/mobilegt
lt/phonegt
lt/contactgt

contact
1.2
1.1
phone
name
1.2.1
1.2.3
1.2.2
office
blah
home
mobile
1.1.1
0000
5678
1234
1.2.1.1
1.2.2.1
1.2.3.1
a.Dewey is a prefix of d.Dewey
Igor Tatarinov, Stratis D. Viglas, Kevin Beyer,
Jayavel Shanmugasundaram, Eugene Shekita, and
Chun Zhang. Storing and querying ordered XML
using a relational database system. SIGMOD 2002.
44
Managing XML Data in Relational Systems

XML and XML Query Languages
XML Coding Scheme and Structural Summary
Managing XMLData in Relational Systems

45
XML-Enabled DB Systems

IBM DB2 XML Extender
XML column support, XML Collection, File liked
from the DBMS, or Character Large Objects
(CLOBs).
Side Tables server as XML indexes
Oracle 9i
CLOB, OracleText Cartridge, XMLType, and XML SQL
Utility
Microsoft SQL Server
CLOBs, Generic Edge technique and user-defined
decomposition (from XML to tables), XML views.

46
Storing XML Data in RDBMSs

RDBMS a matured technology
RDBMS widely available
Less investment to adopt the new technology
Easy to be integrated with other existing
applications
Impedance mismatch
Two level nature of relational schema (tuples and
attributes) vs. arbitrary nesting of XML DTD
Flat structure vs. recursion
Structure-based and content-based query

47
XQuery vs SQL Different Culture

Data Characteristics
Relational data regular, homogeneous, flat
structure in nature, and no order among tuples.
XML data irregular, heterogeneous, unpredictable
structure, order sensitive.
Query Languages
SQL
Select-from-where
With capability to support some fix-point
operation
XQuery
FLWOR (pronounced flower) For-let-where-order-r
eturn
Simple/Regular Path expressions

48
Storing XML Data in RDBMSs Architecture
Automatic Schema/Data Mapping
SQL Query
Tuples
Commercial RDBMS
49
Storing XML Data in RDBMSs Issues

Schema/Data mapping
Automate storage of XML in RDBMS
Query mapping
Provide XML views of relational sources
Result construction
Export existing data as XML

50
XML-Relational Mapping

Model mapping
Database schemas represent constructs of the XML
document model.
DTD Independent
Florescu Kossmann 99, Yoshikawa, et. al.
TOIT01
Structure mapping
Database schemas represent the logical structure
of target XML documents
DTD Dependent
Shanmugasundaram et. al. VDLB 99

51
A Simple XML Document

ltprojectgt
ltpnamegt XML lt/pnamegt
ltmember ID3gt
ltnamegtH. Lu lt/namegt
ltemailgt luhj_at_cs.ust.hk lt/emailgt
ltpublication author H. Lugt
lttitlegt Managing XML data using RDBMS lt/titlegt
ltyeargt 2001 lt/yeargt
lt/publicationgt
lt/membergt
ltmember ID24gt
ltnamegt J.X. Yu lt/namegt
ltprojectgt
ltpnamegt Data mining lt/pnamegt
lt/projectgt
lt/membergt
lt/projectgt

52
A Sample DOM Tree
1
project
2
3
24
pname
member
member
XML
28
26
27
65
66
publication
project
name
email
name
H. Lu
69
luhj_at_cs.ust.hk
J.X. Yu
70
294
71
author
title
pname
year
Managing
2001
Data mining
53
Model Mapping Document Model to Relation

Database schema represents the constructs of XML
documents
Fixed database schema for all XML documents
Data graph tree (may contain cycles)
Relational schema represents a tree
Pros and cons
DTD is not required. Documents may not conform
to DTD
Fixed schema no schema evolution issue
Large collection of documents with various DTDs
Semantics get (totally) lost

54
Model Mapping Edge/Monet Approach

Edge oriented approach
Single table schema Florescu Kossmann 99
Edge (source, ordinal, target, label, flag,
value)
Monet Schmidt et. al. WebDB00
multiple tables, horizontal partitions of edge
table on label-path

Note Document ID is omitted here
55
Querying with Edge
/DBGroup/MemberAgegt20/Name

select name.Value
from Edge dbgroup, Edge member, Edge age, Edge
name
where dbgroup.Label DBGroup'
and member.Label Member'
and age.Label Age'
and name.Label Name'
and dbgroup.Source 0
and dbgroup.Target member.Source
and member.Target age.Source
and member.Target name.Source
and cast (age.Value as int) gt 20

56
Model Mapping Node Approach

XRel Yoshikawa et. al. TOIT 2001
Four table schema
Element(pathID, start, end, ordinal)
Attribute(pathID, start, end, value)
Text(pathID, start, end, value)
Path(pathID, pathexp)

57
Querying with XRel
/DBGroup/MemberAgegt20/Name

select v2.Value
from Element e1, Path p1, Path p2, Path p3, Text
v1, Text v2
where p1.Pathexp \/DBGroup\/Member'
and p2.Pathexp \/DBGroup\/Member\/Age'
and p3.Pathexp \/DBGroup\/Member\/Name'
and e1.PathID p1.PathID
and v1.PathID p2.PathID
and v2.PathID p3.PathID
/ containment testing /
and e1.Start lt v1.Start and e1.End gt v1.End
and e1.Start lt v2.Start and e1.End gt v2.End
and cast(v1.Value as int ) gt 20

58
Structural Mapping Simplifying DTDs

DTD element specifications can be of arbitrary
complexity
lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt is
valid!
Simple DTD for translation purposes
Key observations not necessary to regenerate DTD
from relational schema
XML queries query the position of an element,
relative to its siblings, and the parent/child
relationships.

59
DTD Simplification Transformations
Simplification Transformations
Flattening Transformations
(e1, e2) ? e1, e2 (e1, e2)? ? e1?, e2? (e1e2)
? e1?, e2?
e1 ? e1 e1? ? e1 e1? ? e1 e1?? ? e1?
Grouping Transformations
..., a, ..., a, ... ? a, ... ..., a, ...,
a?, ... ? a, ... ..., a?, ..., a, ... ? a,
... ..., a?, ..., a?, ... ? a, , ...a, ,
a, ? a,
lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt
lt!ELEMENT a (b, c?, e, f)gt
Deutsch, Fernandez, and Suciu,
SIGMOD99 Shanmugasundaram, Tufte, He, Zhang,
DeWitt, and Naughton, VLDB99
60
A Sample DTD
lt!ELEMENT book (booktitle, author) lt!ELEMENT
booktitle (PCDATA)gt lt!ELEMENT author (name,
address)gt lt!ATTLIST author id ID REQUIREDgt
lt!ELEMENT name (firstname?, lastname)gt
lt!ELEMENT firstname (PCDATA)gt lt!ELEMENT
lastname (PCDATA)gt lt!ELEMENT address
ANYgt lt!ELEMENT article (title, author,
contactauthor)gt lt!ELEMENT title (PCDATA)gt
lt!ELEMENT contactauthor EMPTYgt
lt!ATTLIST contactauthor authorID IDREF
IMPLIEDgt lt!ELEMENT monograph (title, author,
editor)gt lt!ELEMENT editor (monograph)gt
lt!ATTLIST editor name CDATA REQUIREDgt
Shanmugasundaram et. al. VDLB 99
61
DTD to Relational Schema Naïve Approach

Each Element gt Relation
Each Attribute of Element gt Column of Relation
Connect elements using foreign keys

lt!ELEMENT author (name, address)gt lt!ATTLIST
author id ID REQUIREDgt lt!ELEMENT name
(firstname?, lastname)gt lt!ELEMENT firstname
(PCDATA)gt lt!ELEMENT lastname (PCDATA)gt lt!ELEMENT
address ANYgt
author (authorID integer, id string) name
(nameID integer, authorID integer) firstname
(firstnameID integer, nameID integer, value
string) lastname (lastnameID integer, nameID
integer, value string) address (addressID
integer, authorID integer, value string)
62
Basic Inlining Technique

Problem of the naïve approach fragmentation
too many tables
Results in 5 relations in the previous example
retrieving first and last names of an author
Intuition
Inline as many sub-elements as possible
Do not inline only if it is a set sub-element
RDBMSs do not all support set-valued columns.
Connect relations using foreign keys
Can handle recursions
A document can be rooted at any element
Create separate a relation for each root

63
Basic Inlining Technique Relation Schemas
article

article (articleID integer, article.contactauthor
.authorid string, article.title string)
article.author (article.authorID integer,
article.author.parentID integer,
article.author.name.firstname string,
article.author.name.lastname string,
article.author.address string,
article.author.authorid string)

?
contactauthor
title

authorid
author
name
address
authorid
?
firstname
lastname
64
Basic Inlining Technique Pros Cons

Reduces number of joins for queries like get the
first and last names of a book author
Efficient for queries such as list all authors
of books
Queries like list all authors with name Ullman
Union of 5 queries!
Large number of relations
Unrolling recursive strongly connected components
(major)
Separate relational schema for each element as
root (minor)

65
Shared Inlining Technique

Intuition
Inline as many sub-elements as possible.
Do not inline only if it is a shared, recursive
or set sub-element.
An element node is represented in exactly one
relation.
Technique
Mapping the following nodes into relations
Shared In-degree gt 2 in DTD graph
Root elements In-degree 0

66
Issues with Sharing Elements

Parent of elements not fixed at schema level
Need to store type and ids of parents (or if
there are no parents)
parentCODE field (type of parent)
parentID field (id of parent)
Not foreign key relationship

67
Shared Relational Schema

book (bookID integer, book.booktitle.isroot
boolean, book.booktitle string)
article (articleID integer, article.contactauthor
.isroot boolean,
article.contactauthor.authorid
string)
monograph (monographID integer,
monograph.parentID integer,
monograph.parentCODE
integer, monograph.editor.isroot boolean,
monograph.editor.name
string)
title (titleID integer, title.parentID integer,
title.parentCODE integer, title string)
author (authorID integer, author.parentID
integer, author.parentCODE integer,
author.name.isroot boolean,
author.name.firstname.isroot boolean,
author.name.firstname string,
author.name.lastname.isroot boolean,
author.name.lastname string,
author.address.isroot boolean,
author.address string,
author.authorid string)

68
Shared Inlining Techniques Pros Cons

Reduces number of joins for queries like get the
first and last names of an author
Efficient for queries such as list all authors
with name Ullman
Sharing whenever possible implies extra joins for
path expressions
Article with a given title name

69
Hybrid Inlining Technique

Inlines some elements that are shared in Shared
Elements with in-degree gt 2 that are not set
sub-elements or recursive
Handles set and recursive sub-elements as in
Shared

70
Hybrid Relational Schema

book (bookID integer, book.booktitle.isroot
boolean, book.booktitle string,
author.name.firstname string,
author.name.lastname string,
author.address string,
author.authorid string)
article (articleID integer, article.contactauthor
.isroot boolean,
article.contactauthor.authorid string,
article.title.isroot boolean, article.title
string)
monograph (monographID integer,
monograph.parentID integer,
monograph.parentCODE integer,
monograph.title string,
monograph.editor.isroot boolean,
monograph.editor.name string,
author.name.firstname string,
author.name.lastname string, author.address
string,
author.authorid string)
author (authorID integer, author.parentID
integer, author.parentCODE integer,
author.name.isroot boolean,
author.name.firstname.isroot boolean,
author.name.firstname string,
author.name.lastname.isroot boolean,
author.name.lastname string,
author.address.isroot boolean,
author.address string,
author.authorid string)

71
Hybrid Inlining Technique Pros Cons

Reduces joins through shared elements (that are
not set or recursive elements)
Shares some strengths of Shared
Reduces joins for queries like get first and
last names of a book author
Requires more SQL sub-queries to retrieve all
authors with name Ullman.
Tradeoff between reducing number of queries and
reducing number of joins
Shared and Hybrid target query- and
join-reduction respectively

72
More on Shared and Hybrid

Shared and Hybrid have pros and cons
In many cases, Shared and Hybrid are nearly
identical
Number of joins per SQL query path length
Mainly due to large number of set nodes
Problem as join processing is expensive!

73
Regular Expressions

Path expression queries can be represented by
regular expressions.
Considering path expressions in the following
from
r (r) (r) (r)? r1/r2 r1r2 r1//r2
name.
0 or more occurrences
1 or more occurrences
? 0 or 1 occurrences
r1/r2 form a path from r1 to r2 (child)
r1//r2 form a path from r1 to r2 (descendant)
disjunction.

74
SPE to SQL
Find the name of the authors for all members
publications

/member/publication/author/name

select m2.name
from member m1, publication, member m2
where publication.perantid m1.ID
and publication.author m2.ID

75
RPE Expansion
List the title of publications for all projects

project//publication

Substitute //

project/member/(project.member)/publication
project/(member.project)/publication

76
RPE Expansion
List the title of publications for all projects

project/member/(project/member)/publication/title
project/(member/project)/publication/title

Expanding
select project.publication.title
union select project.member.publication.title
union select project.member.project.publication.ti
tle
77
Recursive Path Expression Queries to SQL

Some DBMS supports least-fixed point computation.
E.g., WITH statement in DB2

project/(member/project)/publication
WITH R(PARENTID, ID) AS ( select m.PARENTID,
p1.ID from member m, project p1
where m.IDp1.PARENTID UNION ALL
select R.PARENTID, p1.ID from R, member
m, project p1 where R.IDm.PARENTID and
m.IDp1.PARENTID) select p3. from project p2,
R, publicaton p3 where p2.IDR.PARENTID and
R.IDp3.PARENTID
78
Expanding Recursive Path Expression Queries

Expanding wild cards before sending to DBMS
Transitive closure operation is not always
supported by RDBMS
Transitive closure with arbitrary nesting seems
not supported
Can handle nested recursive queries (though DB2
does not support it)
How many SQL statements are required?
Executing SQL until empty result returned
VXMLR approach keep statistics
Zhou et. al. VLDB 2001

79
Query Translation for Structural Mapping

Translating XML-QL into SQL Shanmugasundaram, et
al, VLDB99
Simple Path Expressions to SQL
Simple Recursive Path Expressions to SQL
Arbitrary Path Expressions to Simple Recursive
Path Expressions
Discussion based on Shared approach

80
Queries with Simple Path Expressions

WHERE ltbookgt
ltbooktitlegt The Selfish Gene lt/booktitlegt
ltauthorgt
ltnamegt
ltfirstnamegt f lt/firstnamegt
ltlastnamegt l lt/lastnamegt
lt/namegt
lt/authorgt
lt/bookgt IN CONFORMING TO pubs.dtd
CONSTRUCT ltresultgt f l lt/resultgt

Select A.author.name.firstname,
A.author.name.lastname From author A, book
B Where B.bookID A.parentID AND A.parentCODE
0 AND B.book.booktitle The Selfish Gene
81
Queries with Recursive Path Expressions

WHERE lt.monographgt
lteditor.(monograph.editor)gt
ltnamegt n lt/namegt
lt/gt
lttitlegt Subclass Cirripedia lt/titlegt
lt/gt IN CONFORMING TO pubs.dtd
CONSTRUCT ltresultgt n lt/resultgt

With Q1 (monographID, name) AS (Select
X.monographID, X.editor.name From
monograph X Where X.title Subclass
Cirripedia UNION ALL Select Z.monographID,
Z.editor.name From Q1 Y, monograph Z
Where Y.monographID Z.parentID AND Z.parentCODE
0 ) Select A.name From Q1 A
82
Queries with Arbitrary Path Expressions
WHERE lt(articlemonograph)..namegt n
lt/gt CONSTRUCT ltnamegt n lt/gt

Split complex path expression to (possibly many)
simple recursive path expressions
Has effect of splitting a single XML-QL query to
(possibly many) SQL queries
Can handle nested recursive queries

83
References (1)

Aboulnaga, Alameldeen et al, VLDB01
Ashraf Aboulnaga, Alaa R. Alameldeen, and
Jeffrey F. Naughton. Estimating the selectivity
of XML path expressions for Internet scale
applications. VLDB 2001.
Bohannon et al, ICDE 2002
P. Bohannon, J. Freire, P. Roy, and J. Simeon.
From XML schema to relations A cost-based
approach to XML storage. In Proceedings of ICDE,
2002.
Boag, Chamberlin et al, W3C 2003
Scott Boag, Don Chamberlin, Mary F.
Fernández, Daniela Florescu, Jonathan Robie,
Jérôme Siméon, XQuery 1.0 An XML Query Language,
http//www.w3.org/TR/xquery
Bruno et al, SIGMOD02
N. Bruno, N. Koudas, D. Srivastava. Holistic
twig joins Optimal XML pattern matching. In
SIGMOD Int'l Conf. on Management of Data,
310-311, 2002.
Chen, Jagadish et al, ICDE01
Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S.
Muthukrishnan, R. T. Ng, and D. Srivastava.
Counting twig matches in a tree. In Proceedings
of the IEEE International Conference on Data
Engineering, pages 595-604, 2001.
Cohen, Kaplen et al, PODS02
E. Cohen, H. Kaplan, T. Milo. Labeling dynamic
XML trees. In Symposium on Principles of Database
Systems (PODS), 271-281, 2002.
Cark, DeRose, W3C 1999
James Clark, and Steven DeRose, XML Path
Language (XPath) Version 1.0
http//www.w3.org/TR/xpath

84
References (2)

Ceri, Comai et al, WWW99
S. Ceri, S. Comai, E. Damiani, P. Fraternali, S.
Paraboschi, and L. Tanca. XMLGL a graphical
language for querying and restructuring WWW data.
In International World Wide Web Conference (WWW),
Toronto, Canada, May 1999.
Chamberlin, Robie et al, 2000
Don Chamberlin, Jonathan Robie, and Daniela
Florescu. Quilt An XML query language for
heterogeneous data source. In Proceedings of the
Third International Workshop on the Web and
Databases, May 2000.
Chamberlin, Draper et al, 2003
Don Chamberlin, Denie Draper, Mary Fernandez,
Michael Kay, Jonathan Robie, Michael Rys, Jerome
Simeon, Jim Tivy, Philip Wadler. Editor Howard
Katz. XQuery from the Experts A Guide to the W3C
XML Query Language. Addison-Wesley Press, 2003
Chaudhri, Rashid et al, 2003
Akmal B. Chaudhri , Awais Rashid , Roberto
Zicari. XML Data Management Native XML and
XML-Enabled Database Systems. Addison-Wesley
Press, 2003
Chen, Lim et al, SIGMOD03
Qun Chen, Andrew Lim and Kian Win Ong.
D(k)-index An adaptive structural summary for
graph-structured data. In SIGMOD'03, 134-144.
Chien, Vagena and Zhang et al, VLDB02
S.-Y. Chien, Z. Vagena, D. Zhang, V. Tsotras,
and C. Zaniolo. Efficient structural joins on
indexed XML documents. In VLDB02, pages 263--274,
2002.
Chung, Min et al, SIGMOD02
C-W Chung, J-K Min, K. Shim, APEX An Adaptive
Path Index for XML Data, In SIGMOD'02, 2002

85
References (3)

Deutsch, Fernandez et al, QL98
A. Deutsch, M. Fernandez, D. Florescu,
A. Levy, and D. Suciu. XML-QL A query language
for XML. In M. Marchiori, editor. QL'98--The
Query Languages Workshop. W3C, Dec. 1998.
http//www.w3.org/TR/1998/NOTE-xml-ql-19980819/
Deutsch, Fernandez, and Suciu, SIGMOD99
A. Deutsch, M. Fernandez, and D. Suciu. Storing
Semistructured Data with STORED. In Proc. of the
ACM SIGMOD Conference on Management of Data, June
1999.
Dietz STOC 82
Paul F. Dietz. Maintaining order in a linked
list. STOC 1982.
Grust SIGMOD02
Torsten Grust. Accelerating XPath Location
Steps. In Proc. of the 21st ACM SIGMOD
Conference, pages 109--120, Madison, Wisconsin,
USA, June 2002. ACM Press.
Jiang, Lu, Wang and Ooi, ICDE03
Haifeng Jiang, Hongjun Lu, Wei Wang, Beng Chin
Ooi, XR-Tree Indexing XML Data for Efficient
Structural Joins, The 19th International
Conference on Data Engineering (ICDE 2003), page
253-264, Bangalore, India, March 5-8, 2003.
Jiang, Wang, Lu and Yu, VLDB03
Haifeng Jiang, Wei Wang, Hongjun Lu, Jeffrey Xu
Yu, Holistic Twig Joins on Indexed XML Documents,
The 29th International Conference on Very Large
Data Bases (VLDB 2003), pages 273-284, Berlin,
Germany, September 9-12, 2003.

86
References (4)

Kaushik, Shenoy, ICDE02
Raghav Kaushik, Pradeep Shenoy, Philip Bohannon
and Ehud Gudes. Exploiting local similarity for
indexing paths in graph-structured data. In
ICDE02, 129-140.
Kha et al, ICDE01
Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke
Uemura. An XML indexing structure with relative
region coordinate. ICDE 2001.
Krishnamurthy et al, 2003
R. Krishnamurthy, R. Kaushik, J. Naughto XML,
XML-to-SQL Query Translation Literature The
State of the Art and Open Problems, Symposium
(XSym), Sep 2003.
Li and Moon, VLDB01
Quanzhong Li and Bongki Moon. Indexing and
querying XML data for regular path expressions.
VLDB 2001.
Milo, Suciu, ICDT99
Tova Milo and Dan Suciu. Index structures for
path expressions. In ICDT99. 277-295, January
1999.
Lee, Srivastava DASFAA04
Dongwon Lee and Divesh Srivastava. Counting
relaxed twig matches in a tree. DASFAA 2004.
Lim, Wang et al, VLDB02
Lipyeow Lim, Min Wang, Sriram Padmanabhan,
Jeffrey Scott Vitter, and Ronald Parr.
XPathLearner An on-line self-tuning Markov
histogram for XML path selectivity estimation.
VLDB 2002.
Lee, Yoo et al, 1996
Yong Kyu Lee, Seong-Joon Yoo, Kyoungro Yoon, and
P. Bruce Berra. Index structures forstructured
documents. In Proceedings of the ACM Conference
on Digital Libraries, 1996.

87
References (5)

Manolescu, Florescu et al, 2001
I. Manolescu, D. Florescu, and D. Kossmann.
Pushing XML queries inside relational databases.
Tech. Report no. 4112, INRIA, 2001
Manolescu, Florescu et al, VLDB01
I. Manolescu, D. Florescu, and D. Kossmann.
Answering xml queries over heterogeneous data
sources. In proceedings on the International
Conference on Very Large Data Bases (VLDB), Rome,
Italy, September 2001.
Meier, 2002
Wolfgang Meier. eXist An open source native XML
database. In Web, Web-Services, and Database
Systems 2002, 2002.
McHugh, Widom, VLDB99
Jason McHugh and Jennifer Widom. Query
optimization for XML. VLDB 1999.
Polyzotis, Garofalakis SIGMOD02
Neoklis Polyzotis and Minos N. Garofalakis.
Statistical synopses for graph-structured XML
databases. SIGMOD 2002.
Polyzotis, Garofalakis VLDB02
Neoklis Polyzotis and Minos N. Garofalakis.
Structure and value synopses for XML data graphs.
VLDB 2002.
Robie, Lapp, QL98
J. Robie, J. Lapp, and D. Schach. XML query
language (XQL). In M. Marchiori, editor.
QL'98--The Query Languages Workshop. W3C, Dec.
1998.
http//www.w3.org/TandS/QL/QL98/pp/xql.html

88
References (6)

Schmidt et. al. WebDB00
A. Schmidt, M. L. Kersten, M. Windhouwer, and F.
Waas. Efficient relational storage and retrieval
of XML documents. In WebDB (Informal
Proceedings), pages 47--52, 2000.
Shanmugasundaram, Tufte, He, Zhang, DeWitt, and
Naughton, VLDB99
Jayavel Shanmugasundaram, Kristin Tufte, Chun
Zhang, Gang He, David J. DeWitt, and Jeffrey F.
Naughton. Relational databases for querying XML
documents Limitations and oppor- tunities. In
Proceedings of 25th International Conference on
Very Large Data Bases (VLDB'99), pages 79-90.
Morgan Kaufmann, 1999.
Shanmugasundaram et. al. VDLB 99
Jayavel Shanmugasundaram, Kristin Tufte, Chun
Zhang, Gang He, David J. DeWitt, Jeffrey F.
Naughton Relational Databases for Querying XML
Documents Limitations and Opportunities. VLDB
1999 302-314
Srivastava, Al-Khalifa et al, ICDE02
D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N.
Koudas, J. M. Patel, and Y. Wu. Structural joins
A primitive for efficient XML query pattern
matching. In ICDE, pages 141-- 152, 2002.
Tatarinov, Viglas et al, SIGMOD02
Igor Tatarinov, Stratis D. Viglas, Kevin Beyer,
Jayavel Shanmugasundaram, Eugene Shekita, and
Chun Zhang. Storing and querying ordered XML
using a relational database system. SIGMOD 2002.
Wang, Jiang et al, SIGMOD03
Wei Wang, Haifeng Jiang, Hongjun Lu, Jeffrey Xu
Yu. Containment Join Size Estimation Models and
Methods. The 2003 ACM SIGMOD International
Conference on Management of Data(SIGMOD03), San
Diego, California, June 9 - June 12, 2003. Pages
145-156

89
References (7)

Wang, Jiang et al, ICDE03
Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey
Xu Yu. PBiTree coding and efficient processing of
containment joins. ICDE 2003.
Wu et al., EDBT02
Yuqing Wu, Jignesh Patel, H. V. Jagadish. Using
Histograms to Estimate Answer Size for XML
Queries. Information Systems 28 (1-2) 33-59
(2003) -- Special Issue Best Papers from EDBT
2002.
Wu et al, ICDE03
Yuqing Wu, Jignesh Patel and H.V. Jagadish,
Structural Join Order Selection for XML Query
Optimization. ICDE 2003.
Yoshikawa, et. al. TOIT01
Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki
Shimura, Shunsuke Uemura XRel a path-based
approach to storage and retrieval of XML
documents using relational databases. ACM Trans.
Internet Techn. 1(1) 110-141 (2001)
Zhou et. al. VLDB 2001
Aoying Zhou, Hongjun Lu, Shihui Zheng, Yuqi
Liang, Long Zhang, Wenyun Ji, Zengping Tian
VXMLR A Visual XML-Relational Database System.
VLDB 2001 719-720
Zhang, Naughton SIGMOD01
Chun Zhang, Jeffrey F. Naughton, David J.
DeWitt, Qiong Luo, and Guy M. Lohman. On
supporting containment queries in relational
database management systems. SIGMOD 2001.