Title: L09: Introduction to XML Data Management
1L09 Introduction to XML Data Management
- XML and XML Query Languages
- Structural Summary and Coding Scheme
- Managing XML Data in Relational Systems
2XML and XML Query Languages
-
- XML and XML Query Languages
- Structural Summary and Coding Scheme
- Managing XML Data in Relational Systems
3XML
- Extensible Markup Language for data
- A W3C standard to complement HTML
- http//www.w3.org/TR/2000/REC-xml-20001006
(version 2, 10/2000) - Standard for publishing and interchange
- Origins structured text SGML
- Cleaner SGML for the Internet
- Motivation
- HTML describes presentation
- XML describes content
4XML Describing the Content
- ltprojectgt
- lttalk gt
- lttitlegt XML Query Processing Optimization
lt/titlegt - ltdategt March 18, 2004 lt/dategt
- ltinstructorgt Instructor
- ltnamegt Lu Hongjun lt/namegt
- ltaffiliationgt HKUST lt/
affiliation gt - ltemailgt luhj_at_cs.ust.hk lt/emailgt
- ltnamegt Jeffrey X. Yu lt/namegt
- ltaffiliationgt CUHK lt/
affiliation gt - ltemailgt yu_at_se.cuhk.edu.hk
lt/emailgt - lt/ instructor gt
- lt/talkgt
- lt/projectgt
5XML Document/Data
- Hierarchical document format for information
exchange in WWW - Self describing data (tags)
- Nested element structures having a root
- Element data can have
- Attributes
- Sub-elements
6Basic XML Structures
- Elements lttitlegt lt/titlegt,ltnamegt lt/namegt
- Open close tags or empty tag
- Ordered, nestable
- an element can be empty
- Attributes
- PCDATA/CDATA
- An XML document single root element
- well formed XML document if it has matching tags
7Basic XML Structures Attributes
- Single-valued, ordered
- ltproject proj_id P1234 budget 1000000gt
- lttitlegt XML Data Management lt/titlegt
-
- ltyeargt 2003-2004 lt/yeargt
- lt/projectgt
- Special types ID, IDREF, IDREFS
- ltmember idm007gt ltnamegt James lt/namegt
lt/membergt - ltproject idp123gt lttitlegt XML Data Management
lt/titlegt - ltmember
idrefm007 m008/gt - lt/projectgt
8Other XML Structures
- Processing instructions instructions for
applications - lt?xml version1.0?gt
- CDATA sections treat content as char data
- lt!CDATAlttaggtWhatever!!!lt/taggtltwhatevergtgt
- Comments just like HTML
- lt!-- Comments --gt
- Entities external resources and macros
- my-entity (non-parameter entity)
- param-entity (parameter entity for DTD
declarations)
9Data Centric vs. Document centric
- ltprojectgt
- ltpnamegt XML lt/pnamegt
- ltmember ID3, age 50 gt
- ltnamegtH. Lu lt/namegt
- ltemailgt luhj_at_cs.ust.hk lt/emailgt
- ltpublication author H. Lugt
- lttitlegt Managing XML data using RDBMS lt/titlegt
- ltyeargt 2001 lt/yeargt
- lt/publicationgt
-
- lt/membergt
- ltmember ID24, age 35 gt
- ltnamegt J.X. Yu lt/namegt
- ltprojectgt
- ltpnamegt Data mining lt/pnamegt
- lt/projectgt
- lt/membergt
- lt/projectgt
ltbiogt ltpgt Dr Lu is a professor at ltbgt
HKUST. lt/bgt He worked at ltbgt NUSgt lt/bgt
before 1998. lt/pgt lt/biogt
10XML Data Model
- Several competing models
- Document Object Model (DOM)
- a platform- and language-neutral interface that
will allow programs and scripts to dynamically
access and update the content, structure and
style of documents - http//www.w3.org/DOM/
11DOM Core Interface Node
- DOM tree a tree-like structure of Node objects
the root of the tree is a document object. - Node Object (nodeName, nodeValue, nodeType,
parentNode, childnodes, firstChild, lastChild,
previousSibling, nextSibling, attributes,
ownerDocument) - nodeType ELEMENT_NODE, ATTRIBUTE_NODE,
TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE,
PROCESSING_INSTRUCTION_NODE, COMMENT_NODE,
DOCUMENT_NODE, DOCUMENT_TYPE_NODE,
DOCUMENT_FRAGMENT_NODE, NOTATION_NODE
12DOM Interface
- Each node of the document tree may have a number
of child nodes, contained in a NodeList object. - Two ways of accessing a node object
- Based on the location of an object in the
document tree - Based on the name of an object
13A Sample DOM Tree
Project Node NodeTypeELEMENT_NODE tagName
project NodeValue nill
1
project
2
3
24
pname
member
member
XML
28
26
27
65
66
publication
project
name
email
name
H. Lu
69
luhj_at_cs.ust.hk
J.X. Yu
70
294
71
author
title
pname
year
Managing
2001
Data mining
name Node NodeTypeTEXT_NODE tagName name
NodeValue H. Lu
publicatiom Node NodeTypeELEMENT_NODE tagName
publication NodeValue nill
14Data Graph
- Similar to DOM tree, but may have different
notations that represent an XML document
1
project
2
3
24
pname
member
member
28
26
65
27
66
author
XML
publication
project
name
email
name
50
69
70
294
71
50
J.X. Yu
luhj_at_cs.ust.hk
H. Lu
title
year
pname
Managing
2001
Data mining
15Document Type Definition
- Inherited from SGML DTD standard
- BNF grammar establishing constraints on element
structure and content - Specification of attributes and their types
- Definitions of entities
16A Sample DTD
project
1
publication
member
pname
3
4
2
author
?
5
6
7
9
8
- lt?xml version"1.0" standalone"yes"?gt
- lt!DOCTYPE Research gt
- lt!ELEMENT project(pname,member,publication)gt
- lt!ELEMENT pname(PCDATA)gt
- lt!ELEMENT member (name,email?, publication,
project)gt - lt!ATTLIST member ID ID REQUIREDgt
- lt!ELEMENT name(PCDATA)gt
- lt!ELEMENT email(PCDATA)gt
- lt!ELEMENT publication(title,year)gt
- lt!ATTLIST publication author IDREF IMPLIED)gt
- lt!ELEMENT title(PCDATA)gt
- lt!ELEMENT year(PCDATA)gt
ID
name
email
title
year
17XML Query Languages
- There have been a large number of proposals
during the past few years - XPath Clark, DeRose, W3C 1999
- XQuery Boag, Chamberlin et al, W3C 2003
- XML-QLDeutsch, Fernandez et al, QL99
- XQL Robie, Lapp, QL99
- XML_GL Ceri, Comai et al, WWW99
- Quilt Chamberlin, Robie et al, 2000
- From W3C
- XQuery 1.0 (W3C Working Draft, 12 November 2003)
- http//www.w3.org/TR/xquery/
- XPath 2.0 (W3C Working Draft 12 November 2003)
- http//www.w3.org/TR/xpath20/
18XPath XML Path Language
- The purpose
- To address the node of an XML tree using a path
notation for navigating through the hierarchical
structure of an XML document. - Uses a compact, non-XML syntax
- Designed to be embedded in a host language (e.g.,
XSLT, XQuery) - XPath Expressions
- String of characters
- Value of an expression is always an ordered
collection of zero or more items (atomic value,
node)
19XPath Steps
- An XPath expression has following syntax
- Path/Step1/Step2//Stepn,
- where each Xpath step is defined as follows
- StepAxisNode-test Predicate
- Axis specifies the direction in which the
document should be navigated. - For example, childtitleposition() 2
- There are 12 axes child, descendant,
descendant-or-self, parent, ancestor,
ancestor-or-self, following, preceding,
following-sibling, preceding-sibling, attribute,
self, namespace
20XPath Path Expressions
- project matches a project element
- matches any element
- / matches the root element
- /project matches a project element under root
- project/member matches a member in project
- project//name matches a name in project, at any
depth - //title matches a title at any depth
- memberpublciation matches a member or a
publication - _at_age matches an age attribute
- project/member/_at_age matches age attribute in
member, in project - project/member/_at_agelt45 matches a member with
age lt 45
21XPath Query Examples
/project/member/name matches a name of member in
project
Result ltnamegt H. Lu lt/namegt ltnamegt J.X. Yu
lt/namegt
/project/publication/venue
Result empty there was no venue element
//pname matches a pname at any depth
Result ltpnamegt XML lt/pnamegt ltpnamegt Data
mining lt/pnamegt
- /project/member/name/text() text of name elements
Result H. Lu J.X. Yu
22More XPath Queries
- /project/memberpublication
- ltmember ID3, age 50 gt
- ltnamegtH. Lu lt/namegt
- ltemailgt luhj_at_cs.ust.hk lt/emailgt
- ltpublication author H. Lugt
- lttitlegt Managing XML data using RDBMS lt/titlegt
- ltyeargt 2001 lt/yeargt
- lt/publicationgt
- lt/membergt
- /project/member_at_age lt 45
- ltmember ID24, age 35 gt
- ltnamegt J.X. Yu lt/namegt
- ltprojectgt
- ltpnamegt Data mining lt/pnamegt
- lt/projectgt
- lt/membergt
- /project member/_at_age lt 25
- No element returned
- /project/memberemail/text()
23XQuery
- XQuery 1.0 An XML Query Language
- W3C Working Draft 12 November 2003
- http//www.w3.org/TR/xquery/
- XPath expressions are still the basic building
block
24XQuery
- XQuery 1.0 An XML Query Language
- W3C Working Draft 12 November 2003
- http//www.w3.org/TR/xquery/
- FLWR Expressions FOR-LET-WHERE-RETURN
25XQuery Examples
- ltresultgt
- FOR x in /project/member/publication
- WHERE x/year gt 2000
- RETURN ltrecentpubgt x/title
- lt/ recentpub gt
- lt/resultgt
distinct a function that eliminates duplicates
ltactive_membersgt FOR m IN distinct(document(pro
ject.xml")//member) LET p document(project.x
ml")//publicationauthor m WHERE count(p) gt
10 RETURN m lt/ active_members gt
count a (aggregate) function that returns the
number of elements
26Structural Summary and Coding Scheme
- XML and XML Query Languages
- Structural Summary and Coding Scheme
- Managing XMLData in Relational Systems
27Structural Summary
- A structural summary for a data graph GD(VD, ED )
is another labeled graph GI (VI, EI ). - Each node vi ? GI represents a set of nodes,
extent(vi ), and extent(vi ) ? VD. - An edge ed (vi , vi) ? GI exists if there is an
edge ed (vd , vd) ? GD vd ? extent(vi ), vd ?
extent(vi ). - The summary preserves all the paths in the data
graph. A path expression query can be executed on
GI instead of GD, which is most likely more
efficient since size of GI is much smaller than
GD.
28Structural Summary
- Basically, nodes in the data graph is grouped
based on certain criteria, each group of nodes is
represented by one node in the summary. - The size of summary will be determined by the
grouping criteria. - Desired properties in supporting evaluating path
expression queries using summary - The results are safe (no false negatives)
- If not safe, only approximate answers can be
obtained - The results are precise contains no false
positives - If not precise, need validate results using the
data graph
29Structural Summary
R
r
A
a1,a2,a3
B
b1,b2,b3
C
c1,c2,c3
Data Graph
Structural summary
30Sample Structural Summaries
- Query workload independent summaries
- Data Guide
- 1-index Milo, Suciu, ICDT99
- A(k) index Kaushik, Shenoy, ICDE02
- Query workload dependent summaries
- APEX Chung, Min et al, SIGMOD02
- D(k)-index Chen, Lim et al, SIGMOD03
31Data Guides
- DataGuide dynamic structural summary of current
database - Each label path in database appears once in
DataGuide - No extraneous paths in DataGuide
- Maintained incrementally as database evolves
- Serves role of schema
C1 is duplicated to achieve determinism in
DataGuides
32Bisimilarity and 1-Index
- Most existing structural summary are based on
graph bisimilarity, defined as follows - Two data nodes u and v are bisimilar (u ? v) if
- u and v have the same label
- if u is a parent of u, then there is a parent
v of v such that u? v, and vice versa - Intuitively, the set of paths coming into them is
the same if two nodes are bisimilar - Tova Milo and Dan Suciu. Index structures for
path expressions. In ICDT99. 277-295, January
1999.
331-Index
- 1-index Each index node represents an
equivalence class, in which data nodes are
mutually bisimilar. - Evaluating path expression query using 1-index
- safe the result always contains the result of
evaluating on the data graph - precise its result contains no false data node
34K-bisimilarity
- 1-index can be big
- Formally, based on the notion of k-bisimilarity
(?k ) which is defined inductively - Node u ?k v iff u ?k-1 v, and for every parent
u of u, there is a parent v of v such that u
?k-1 v, and vice versa - For any two nodes, u and v, u ?0 v iff u and v
have the same label - Intuitively, if two data nodes are k-bisimilar,
the set of paths coming into them with length (?
k) is the same
35A(k)-Index
- A(k)-Index group nodes based on their local
structure paths of length up to k, instead of
the global path information - data nodes in each index nodes of A(k) index are
mutually k-bisimilar - Evaluation path expression query using
A(k)-index - safe its result always contains the result of
evaluating on the data graph - precision its result contains no false data
node - Raghav Kaushik, Pradeep Shenoy, Philip Bohannon
and Ehud Gudes. Exploiting local similarity for
indexing paths in graph-structured data. ICDE02,
129-140.
36A(2)-Index
C2 and C3 can be grouped because their length-2
incoming paths are the same
37APEX Adaptive Path Index
- 1-index, A(k)-index and FB index are all
workload independent - APEX Adaptive Path index
- Maintains two types of paths in the summary
- All paths of length two so that all queries can
be answered using APEX - Full paths are maintained for those paths that
frequently appear in query workload so that
frequently asked queries can be answered
efficiently - A hash table is included in the index so that
partial matching queries with the
self-or-descendent axis (//) can be processed
efficiently - C-W Chung, J-K Min, K. Shim, APEX An Adaptive
Path Index for XML Data, SIGMOD 02
38D(k)-Index
- A generalization of 1-Index and A(k)-Index.
- Assigning different local bisimilarites to index
nodes in the summary structure according to the
query load to optimize its structure. - for any two index nodes ni and nj, k(ni) ?
k(nj)-1 if there is an edge from ni to nj, in
which k(ni) and k(ni) are ni and njs local
bisimilarities, respectively. - Advantage over 1-Index and A(k)-Index
- workload-sensitive
- can be more efficiently updated
- Qun Chen, Andrew Lim and Kian Win Ong.
D(k)-index An adaptive structural summary for
graph-structured data. SIGMOD 03, 134-144.
39Node (Edge) Encoding
- Structural relationships
- Is node u an ancestor of node v?
- Is node u the parent of node v?
- Assigning a unique code to a node (edge) in the
data graph so that the above question can be
answered by looking at the codes rather than the
original data graphs. - Issues
- Length of the code.
- Complexity for computing the structural
relationship. between two nodes from their codes. - Efficient code generation and code maintenance.
40XML Data Coding Scheme
- Region-based
- XML document is ordered
- Codes are assigned based on the lexicographical
location of an element in the original document - Path-based
- XML document is nested
- Codes are assigned based on the nesting structure
of the document, or the path that reaches and
element from the root. - There are quite a number of variants for both
categories of coding schemes
41XML Region Based Coding
- Region code (start, end, level)
- u is an ancestor of v iff u.start lt v.start lt
u.end - u is the parent of v, additionally, u.level
v.level-1 - Only a depth-first traversal for code generation
- Property strictly nesting
- Completely disjoint (case 1,4) or containing
(case 2,3) - Formally, a.start lt b.start lt a.end, if a is an
ancestor of b
42Sample of Region Codes
- The order of start values is also the document
order - The region can also be interpreted as an interval
43Dewey
1
- ltcontactgt
- ltnamegtblahlt/namegt
- ltphonegt
- ltofficegt1234lt/officegt
- lthomegt5678lt/homegt
- ltmobilegt0000lt/mobilegt
- lt/phonegt
- lt/contactgt
contact
1.2
1.1
phone
name
1.2.1
1.2.3
1.2.2
office
blah
home
mobile
1.1.1
0000
5678
1234
1.2.1.1
1.2.2.1
1.2.3.1
a.Dewey is a prefix of d.Dewey
Igor Tatarinov, Stratis D. Viglas, Kevin Beyer,
Jayavel Shanmugasundaram, Eugene Shekita, and
Chun Zhang. Storing and querying ordered XML
using a relational database system. SIGMOD 2002.
44Managing XML Data in Relational Systems
- XML and XML Query Languages
- XML Coding Scheme and Structural Summary
- Managing XMLData in Relational Systems
45XML-Enabled DB Systems
- IBM DB2 XML Extender
- XML column support, XML Collection, File liked
from the DBMS, or Character Large Objects
(CLOBs). - Side Tables server as XML indexes
- Oracle 9i
- CLOB, OracleText Cartridge, XMLType, and XML SQL
Utility - Microsoft SQL Server
- CLOBs, Generic Edge technique and user-defined
decomposition (from XML to tables), XML views.
46Storing XML Data in RDBMSs
- RDBMS a matured technology
- RDBMS widely available
- Less investment to adopt the new technology
- Easy to be integrated with other existing
applications - Impedance mismatch
- Two level nature of relational schema (tuples and
attributes) vs. arbitrary nesting of XML DTD - Flat structure vs. recursion
- Structure-based and content-based query
47XQuery vs SQL Different Culture
- Data Characteristics
- Relational data regular, homogeneous, flat
structure in nature, and no order among tuples. - XML data irregular, heterogeneous, unpredictable
structure, order sensitive. - Query Languages
- SQL
- Select-from-where
- With capability to support some fix-point
operation - XQuery
- FLWOR (pronounced flower) For-let-where-order-r
eturn - Simple/Regular Path expressions
48Storing XML Data in RDBMSs Architecture
Automatic Schema/Data Mapping
SQL Query
Tuples
Commercial RDBMS
49Storing XML Data in RDBMSs Issues
- Schema/Data mapping
- Automate storage of XML in RDBMS
- Query mapping
- Provide XML views of relational sources
- Result construction
- Export existing data as XML
50XML-Relational Mapping
- Model mapping
- Database schemas represent constructs of the XML
document model. - DTD Independent
- Florescu Kossmann 99, Yoshikawa, et. al.
TOIT01 - Structure mapping
- Database schemas represent the logical structure
of target XML documents - DTD Dependent
- Shanmugasundaram et. al. VDLB 99
51A Simple XML Document
- ltprojectgt
- ltpnamegt XML lt/pnamegt
- ltmember ID3gt
- ltnamegtH. Lu lt/namegt
- ltemailgt luhj_at_cs.ust.hk lt/emailgt
- ltpublication author H. Lugt
- lttitlegt Managing XML data using RDBMS lt/titlegt
- ltyeargt 2001 lt/yeargt
- lt/publicationgt
-
- lt/membergt
- ltmember ID24gt
- ltnamegt J.X. Yu lt/namegt
- ltprojectgt
- ltpnamegt Data mining lt/pnamegt
- lt/projectgt
- lt/membergt
- lt/projectgt
52A Sample DOM Tree
1
project
2
3
24
pname
member
member
XML
28
26
27
65
66
publication
project
name
email
name
H. Lu
69
luhj_at_cs.ust.hk
J.X. Yu
70
294
71
author
title
pname
year
Managing
2001
Data mining
53Model Mapping Document Model to Relation
- Database schema represents the constructs of XML
documents - Fixed database schema for all XML documents
- Data graph tree (may contain cycles)
- Relational schema represents a tree
- Pros and cons
- DTD is not required. Documents may not conform
to DTD - Fixed schema no schema evolution issue
- Large collection of documents with various DTDs
- Semantics get (totally) lost
54Model Mapping Edge/Monet Approach
- Edge oriented approach
- Single table schema Florescu Kossmann 99
- Edge (source, ordinal, target, label, flag,
value) - Monet Schmidt et. al. WebDB00
- multiple tables, horizontal partitions of edge
table on label-path
Note Document ID is omitted here
55Querying with Edge
/DBGroup/MemberAgegt20/Name
- select name.Value
- from Edge dbgroup, Edge member, Edge age, Edge
name - where dbgroup.Label DBGroup'
- and member.Label Member'
- and age.Label Age'
- and name.Label Name'
- and dbgroup.Source 0
- and dbgroup.Target member.Source
- and member.Target age.Source
- and member.Target name.Source
- and cast (age.Value as int) gt 20
56Model Mapping Node Approach
- XRel Yoshikawa et. al. TOIT 2001
- Four table schema
- Element(pathID, start, end, ordinal)
- Attribute(pathID, start, end, value)
- Text(pathID, start, end, value)
- Path(pathID, pathexp)
57Querying with XRel
/DBGroup/MemberAgegt20/Name
- select v2.Value
- from Element e1, Path p1, Path p2, Path p3, Text
v1, Text v2 - where p1.Pathexp \/DBGroup\/Member'
- and p2.Pathexp \/DBGroup\/Member\/Age'
- and p3.Pathexp \/DBGroup\/Member\/Name'
- and e1.PathID p1.PathID
- and v1.PathID p2.PathID
- and v2.PathID p3.PathID
- / containment testing /
- and e1.Start lt v1.Start and e1.End gt v1.End
- and e1.Start lt v2.Start and e1.End gt v2.End
- and cast(v1.Value as int ) gt 20
58Structural Mapping Simplifying DTDs
- DTD element specifications can be of arbitrary
complexity - lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt is
valid! - Simple DTD for translation purposes
- Key observations not necessary to regenerate DTD
from relational schema - XML queries query the position of an element,
relative to its siblings, and the parent/child
relationships.
59DTD Simplification Transformations
Simplification Transformations
Flattening Transformations
(e1, e2) ? e1, e2 (e1, e2)? ? e1?, e2? (e1e2)
? e1?, e2?
e1 ? e1 e1? ? e1 e1? ? e1 e1?? ? e1?
Grouping Transformations
..., a, ..., a, ... ? a, ... ..., a, ...,
a?, ... ? a, ... ..., a?, ..., a, ... ? a,
... ..., a?, ..., a?, ... ? a, , ...a, ,
a, ? a,
lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt
lt!ELEMENT a (b, c?, e, f)gt
Deutsch, Fernandez, and Suciu,
SIGMOD99 Shanmugasundaram, Tufte, He, Zhang,
DeWitt, and Naughton, VLDB99
60A Sample DTD
lt!ELEMENT book (booktitle, author) lt!ELEMENT
booktitle (PCDATA)gt lt!ELEMENT author (name,
address)gt lt!ATTLIST author id ID REQUIREDgt
lt!ELEMENT name (firstname?, lastname)gt
lt!ELEMENT firstname (PCDATA)gt lt!ELEMENT
lastname (PCDATA)gt lt!ELEMENT address
ANYgt lt!ELEMENT article (title, author,
contactauthor)gt lt!ELEMENT title (PCDATA)gt
lt!ELEMENT contactauthor EMPTYgt
lt!ATTLIST contactauthor authorID IDREF
IMPLIEDgt lt!ELEMENT monograph (title, author,
editor)gt lt!ELEMENT editor (monograph)gt
lt!ATTLIST editor name CDATA REQUIREDgt
Shanmugasundaram et. al. VDLB 99
61DTD to Relational Schema Naïve Approach
- Each Element gt Relation
- Each Attribute of Element gt Column of Relation
- Connect elements using foreign keys
lt!ELEMENT author (name, address)gt lt!ATTLIST
author id ID REQUIREDgt lt!ELEMENT name
(firstname?, lastname)gt lt!ELEMENT firstname
(PCDATA)gt lt!ELEMENT lastname (PCDATA)gt lt!ELEMENT
address ANYgt
author (authorID integer, id string) name
(nameID integer, authorID integer) firstname
(firstnameID integer, nameID integer, value
string) lastname (lastnameID integer, nameID
integer, value string) address (addressID
integer, authorID integer, value string)
62Basic Inlining Technique
- Problem of the naïve approach fragmentation
too many tables - Results in 5 relations in the previous example
retrieving first and last names of an author - Intuition
- Inline as many sub-elements as possible
- Do not inline only if it is a set sub-element
- RDBMSs do not all support set-valued columns.
- Connect relations using foreign keys
- Can handle recursions
- A document can be rooted at any element
- Create separate a relation for each root
63Basic Inlining Technique Relation Schemas
article
- article (articleID integer, article.contactauthor
.authorid string, article.title string) - article.author (article.authorID integer,
article.author.parentID integer,
article.author.name.firstname string, - article.author.name.lastname string,
article.author.address string,
article.author.authorid string)
?
contactauthor
title
authorid
author
name
address
authorid
?
firstname
lastname
64Basic Inlining Technique Pros Cons
- Reduces number of joins for queries like get the
first and last names of a book author - Efficient for queries such as list all authors
of books - Queries like list all authors with name Ullman
- Union of 5 queries!
- Large number of relations
- Unrolling recursive strongly connected components
(major) - Separate relational schema for each element as
root (minor)
65Shared Inlining Technique
- Intuition
- Inline as many sub-elements as possible.
- Do not inline only if it is a shared, recursive
or set sub-element. - An element node is represented in exactly one
relation. - Technique
- Mapping the following nodes into relations
- Shared In-degree gt 2 in DTD graph
- Root elements In-degree 0
66Issues with Sharing Elements
- Parent of elements not fixed at schema level
- Need to store type and ids of parents (or if
there are no parents) - parentCODE field (type of parent)
- parentID field (id of parent)
- Not foreign key relationship
67Shared Relational Schema
- book (bookID integer, book.booktitle.isroot
boolean, book.booktitle string) - article (articleID integer, article.contactauthor
.isroot boolean, - article.contactauthor.authorid
string) - monograph (monographID integer,
monograph.parentID integer, - monograph.parentCODE
integer, monograph.editor.isroot boolean, - monograph.editor.name
string) - title (titleID integer, title.parentID integer,
title.parentCODE integer, title string) - author (authorID integer, author.parentID
integer, author.parentCODE integer, - author.name.isroot boolean,
author.name.firstname.isroot boolean, - author.name.firstname string,
author.name.lastname.isroot boolean, - author.name.lastname string,
author.address.isroot boolean, - author.address string,
author.authorid string)
68Shared Inlining Techniques Pros Cons
- Reduces number of joins for queries like get the
first and last names of an author - Efficient for queries such as list all authors
with name Ullman - Sharing whenever possible implies extra joins for
path expressions - Article with a given title name
69Hybrid Inlining Technique
- Inlines some elements that are shared in Shared
- Elements with in-degree gt 2 that are not set
sub-elements or recursive - Handles set and recursive sub-elements as in
Shared
70Hybrid Relational Schema
- book (bookID integer, book.booktitle.isroot
boolean, book.booktitle string, - author.name.firstname string,
author.name.lastname string, - author.address string,
author.authorid string) - article (articleID integer, article.contactauthor
.isroot boolean, - article.contactauthor.authorid string,
article.title.isroot boolean, article.title
string) - monograph (monographID integer,
monograph.parentID integer, - monograph.parentCODE integer,
monograph.title string, - monograph.editor.isroot boolean,
monograph.editor.name string, - author.name.firstname string,
author.name.lastname string, author.address
string, - author.authorid string)
- author (authorID integer, author.parentID
integer, author.parentCODE integer, - author.name.isroot boolean,
author.name.firstname.isroot boolean, - author.name.firstname string,
author.name.lastname.isroot boolean, - author.name.lastname string,
author.address.isroot boolean, - author.address string,
author.authorid string)
71Hybrid Inlining Technique Pros Cons
- Reduces joins through shared elements (that are
not set or recursive elements) - Shares some strengths of Shared
- Reduces joins for queries like get first and
last names of a book author - Requires more SQL sub-queries to retrieve all
authors with name Ullman. - Tradeoff between reducing number of queries and
reducing number of joins - Shared and Hybrid target query- and
join-reduction respectively
72More on Shared and Hybrid
- Shared and Hybrid have pros and cons
- In many cases, Shared and Hybrid are nearly
identical - Number of joins per SQL query path length
- Mainly due to large number of set nodes
- Problem as join processing is expensive!
73Regular Expressions
- Path expression queries can be represented by
regular expressions. - Considering path expressions in the following
from - r (r) (r) (r)? r1/r2 r1r2 r1//r2
name. - 0 or more occurrences
- 1 or more occurrences
- ? 0 or 1 occurrences
- r1/r2 form a path from r1 to r2 (child)
- r1//r2 form a path from r1 to r2 (descendant)
- disjunction.
74SPE to SQL
Find the name of the authors for all members
publications
- /member/publication/author/name
- select m2.name
- from member m1, publication, member m2
- where publication.perantid m1.ID
- and publication.author m2.ID
75RPE Expansion
List the title of publications for all projects
Substitute //
- project/member/(project.member)/publication
- project/(member.project)/publication
76RPE Expansion
List the title of publications for all projects
- project/member/(project/member)/publication/title
- project/(member/project)/publication/title
Expanding
select project.publication.title
union select project.member.publication.title
union select project.member.project.publication.ti
tle
77Recursive Path Expression Queries to SQL
- Some DBMS supports least-fixed point computation.
E.g., WITH statement in DB2
project/(member/project)/publication
WITH R(PARENTID, ID) AS ( select m.PARENTID,
p1.ID from member m, project p1
where m.IDp1.PARENTID UNION ALL
select R.PARENTID, p1.ID from R, member
m, project p1 where R.IDm.PARENTID and
m.IDp1.PARENTID) select p3. from project p2,
R, publicaton p3 where p2.IDR.PARENTID and
R.IDp3.PARENTID
78Expanding Recursive Path Expression Queries
- Expanding wild cards before sending to DBMS
- Transitive closure operation is not always
supported by RDBMS - Transitive closure with arbitrary nesting seems
not supported - Can handle nested recursive queries (though DB2
does not support it) - How many SQL statements are required?
- Executing SQL until empty result returned
- VXMLR approach keep statistics
- Zhou et. al. VLDB 2001
79Query Translation for Structural Mapping
- Translating XML-QL into SQL Shanmugasundaram, et
al, VLDB99 - Simple Path Expressions to SQL
- Simple Recursive Path Expressions to SQL
- Arbitrary Path Expressions to Simple Recursive
Path Expressions - Discussion based on Shared approach
80Queries with Simple Path Expressions
- WHERE ltbookgt
- ltbooktitlegt The Selfish Gene lt/booktitlegt
- ltauthorgt
- ltnamegt
- ltfirstnamegt f lt/firstnamegt
- ltlastnamegt l lt/lastnamegt
- lt/namegt
- lt/authorgt
- lt/bookgt IN CONFORMING TO pubs.dtd
- CONSTRUCT ltresultgt f l lt/resultgt
Select A.author.name.firstname,
A.author.name.lastname From author A, book
B Where B.bookID A.parentID AND A.parentCODE
0 AND B.book.booktitle The Selfish Gene
81Queries with Recursive Path Expressions
- WHERE lt.monographgt
- lteditor.(monograph.editor)gt
- ltnamegt n lt/namegt
- lt/gt
- lttitlegt Subclass Cirripedia lt/titlegt
- lt/gt IN CONFORMING TO pubs.dtd
- CONSTRUCT ltresultgt n lt/resultgt
With Q1 (monographID, name) AS (Select
X.monographID, X.editor.name From
monograph X Where X.title Subclass
Cirripedia UNION ALL Select Z.monographID,
Z.editor.name From Q1 Y, monograph Z
Where Y.monographID Z.parentID AND Z.parentCODE
0 ) Select A.name From Q1 A
82Queries with Arbitrary Path Expressions
WHERE lt(articlemonograph)..namegt n
lt/gt CONSTRUCT ltnamegt n lt/gt
- Split complex path expression to (possibly many)
simple recursive path expressions - Has effect of splitting a single XML-QL query to
(possibly many) SQL queries - Can handle nested recursive queries
83References (1)
- Aboulnaga, Alameldeen et al, VLDB01
- Ashraf Aboulnaga, Alaa R. Alameldeen, and
Jeffrey F. Naughton. Estimating the selectivity
of XML path expressions for Internet scale
applications. VLDB 2001. - Bohannon et al, ICDE 2002
- P. Bohannon, J. Freire, P. Roy, and J. Simeon.
From XML schema to relations A cost-based
approach to XML storage. In Proceedings of ICDE,
2002. - Boag, Chamberlin et al, W3C 2003
- Scott Boag, Don Chamberlin, Mary F.
Fernández, Daniela Florescu, Jonathan Robie,
Jérôme Siméon, XQuery 1.0 An XML Query Language,
http//www.w3.org/TR/xquery - Bruno et al, SIGMOD02
- N. Bruno, N. Koudas, D. Srivastava. Holistic
twig joins Optimal XML pattern matching. In
SIGMOD Int'l Conf. on Management of Data,
310-311, 2002. - Chen, Jagadish et al, ICDE01
- Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S.
Muthukrishnan, R. T. Ng, and D. Srivastava.
Counting twig matches in a tree. In Proceedings
of the IEEE International Conference on Data
Engineering, pages 595-604, 2001. - Cohen, Kaplen et al, PODS02
- E. Cohen, H. Kaplan, T. Milo. Labeling dynamic
XML trees. In Symposium on Principles of Database
Systems (PODS), 271-281, 2002. - Cark, DeRose, W3C 1999
- James Clark, and Steven DeRose, XML Path
Language (XPath) Version 1.0 - http//www.w3.org/TR/xpath
84References (2)
- Ceri, Comai et al, WWW99
- S. Ceri, S. Comai, E. Damiani, P. Fraternali, S.
Paraboschi, and L. Tanca. XMLGL a graphical
language for querying and restructuring WWW data.
In International World Wide Web Conference (WWW),
Toronto, Canada, May 1999. - Chamberlin, Robie et al, 2000
- Don Chamberlin, Jonathan Robie, and Daniela
Florescu. Quilt An XML query language for
heterogeneous data source. In Proceedings of the
Third International Workshop on the Web and
Databases, May 2000. - Chamberlin, Draper et al, 2003
- Don Chamberlin, Denie Draper, Mary Fernandez,
Michael Kay, Jonathan Robie, Michael Rys, Jerome
Simeon, Jim Tivy, Philip Wadler. Editor Howard
Katz. XQuery from the Experts A Guide to the W3C
XML Query Language. Addison-Wesley Press, 2003 - Chaudhri, Rashid et al, 2003
- Akmal B. Chaudhri , Awais Rashid , Roberto
Zicari. XML Data Management Native XML and
XML-Enabled Database Systems. Addison-Wesley
Press, 2003 - Chen, Lim et al, SIGMOD03
- Qun Chen, Andrew Lim and Kian Win Ong.
D(k)-index An adaptive structural summary for
graph-structured data. In SIGMOD'03, 134-144. - Chien, Vagena and Zhang et al, VLDB02
- S.-Y. Chien, Z. Vagena, D. Zhang, V. Tsotras,
and C. Zaniolo. Efficient structural joins on
indexed XML documents. In VLDB02, pages 263--274,
2002. - Chung, Min et al, SIGMOD02
- C-W Chung, J-K Min, K. Shim, APEX An Adaptive
Path Index for XML Data, In SIGMOD'02, 2002
85References (3)
- Deutsch, Fernandez et al, QL98
- A. Deutsch, M. Fernandez, D. Florescu,
A. Levy, and D. Suciu. XML-QL A query language
for XML. In M. Marchiori, editor. QL'98--The
Query Languages Workshop. W3C, Dec. 1998. - http//www.w3.org/TR/1998/NOTE-xml-ql-19980819/
- Deutsch, Fernandez, and Suciu, SIGMOD99
- A. Deutsch, M. Fernandez, and D. Suciu. Storing
Semistructured Data with STORED. In Proc. of the
ACM SIGMOD Conference on Management of Data, June
1999. - Dietz STOC 82
- Paul F. Dietz. Maintaining order in a linked
list. STOC 1982. - Grust SIGMOD02
- Torsten Grust. Accelerating XPath Location
Steps. In Proc. of the 21st ACM SIGMOD
Conference, pages 109--120, Madison, Wisconsin,
USA, June 2002. ACM Press. - Jiang, Lu, Wang and Ooi, ICDE03
- Haifeng Jiang, Hongjun Lu, Wei Wang, Beng Chin
Ooi, XR-Tree Indexing XML Data for Efficient
Structural Joins, The 19th International
Conference on Data Engineering (ICDE 2003), page
253-264, Bangalore, India, March 5-8, 2003. - Jiang, Wang, Lu and Yu, VLDB03
- Haifeng Jiang, Wei Wang, Hongjun Lu, Jeffrey Xu
Yu, Holistic Twig Joins on Indexed XML Documents,
The 29th International Conference on Very Large
Data Bases (VLDB 2003), pages 273-284, Berlin,
Germany, September 9-12, 2003.
86References (4)
- Kaushik, Shenoy, ICDE02
- Raghav Kaushik, Pradeep Shenoy, Philip Bohannon
and Ehud Gudes. Exploiting local similarity for
indexing paths in graph-structured data. In
ICDE02, 129-140. - Kha et al, ICDE01
- Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke
Uemura. An XML indexing structure with relative
region coordinate. ICDE 2001. - Krishnamurthy et al, 2003
- R. Krishnamurthy, R. Kaushik, J. Naughto XML,
XML-to-SQL Query Translation Literature The
State of the Art and Open Problems, Symposium
(XSym), Sep 2003. - Li and Moon, VLDB01
- Quanzhong Li and Bongki Moon. Indexing and
querying XML data for regular path expressions.
VLDB 2001. - Milo, Suciu, ICDT99
- Tova Milo and Dan Suciu. Index structures for
path expressions. In ICDT99. 277-295, January
1999. - Lee, Srivastava DASFAA04
- Dongwon Lee and Divesh Srivastava. Counting
relaxed twig matches in a tree. DASFAA 2004. - Lim, Wang et al, VLDB02
- Lipyeow Lim, Min Wang, Sriram Padmanabhan,
Jeffrey Scott Vitter, and Ronald Parr.
XPathLearner An on-line self-tuning Markov
histogram for XML path selectivity estimation.
VLDB 2002. - Lee, Yoo et al, 1996
- Yong Kyu Lee, Seong-Joon Yoo, Kyoungro Yoon, and
P. Bruce Berra. Index structures forstructured
documents. In Proceedings of the ACM Conference
on Digital Libraries, 1996.
87References (5)
- Manolescu, Florescu et al, 2001
- I. Manolescu, D. Florescu, and D. Kossmann.
Pushing XML queries inside relational databases.
Tech. Report no. 4112, INRIA, 2001 - Manolescu, Florescu et al, VLDB01
- I. Manolescu, D. Florescu, and D. Kossmann.
Answering xml queries over heterogeneous data
sources. In proceedings on the International
Conference on Very Large Data Bases (VLDB), Rome,
Italy, September 2001. - Meier, 2002
- Wolfgang Meier. eXist An open source native XML
database. In Web, Web-Services, and Database
Systems 2002, 2002. - McHugh, Widom, VLDB99
- Jason McHugh and Jennifer Widom. Query
optimization for XML. VLDB 1999. - Polyzotis, Garofalakis SIGMOD02
- Neoklis Polyzotis and Minos N. Garofalakis.
Statistical synopses for graph-structured XML
databases. SIGMOD 2002. - Polyzotis, Garofalakis VLDB02
- Neoklis Polyzotis and Minos N. Garofalakis.
Structure and value synopses for XML data graphs.
VLDB 2002. - Robie, Lapp, QL98
- J. Robie, J. Lapp, and D. Schach. XML query
language (XQL). In M. Marchiori, editor.
QL'98--The Query Languages Workshop. W3C, Dec.
1998. - http//www.w3.org/TandS/QL/QL98/pp/xql.html
88References (6)
- Schmidt et. al. WebDB00
- A. Schmidt, M. L. Kersten, M. Windhouwer, and F.
Waas. Efficient relational storage and retrieval
of XML documents. In WebDB (Informal
Proceedings), pages 47--52, 2000. - Shanmugasundaram, Tufte, He, Zhang, DeWitt, and
Naughton, VLDB99 - Jayavel Shanmugasundaram, Kristin Tufte, Chun
Zhang, Gang He, David J. DeWitt, and Jeffrey F.
Naughton. Relational databases for querying XML
documents Limitations and oppor- tunities. In
Proceedings of 25th International Conference on
Very Large Data Bases (VLDB'99), pages 79-90.
Morgan Kaufmann, 1999. - Shanmugasundaram et. al. VDLB 99
- Jayavel Shanmugasundaram, Kristin Tufte, Chun
Zhang, Gang He, David J. DeWitt, Jeffrey F.
Naughton Relational Databases for Querying XML
Documents Limitations and Opportunities. VLDB
1999 302-314 - Srivastava, Al-Khalifa et al, ICDE02
- D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N.
Koudas, J. M. Patel, and Y. Wu. Structural joins
A primitive for efficient XML query pattern
matching. In ICDE, pages 141-- 152, 2002. - Tatarinov, Viglas et al, SIGMOD02
- Igor Tatarinov, Stratis D. Viglas, Kevin Beyer,
Jayavel Shanmugasundaram, Eugene Shekita, and
Chun Zhang. Storing and querying ordered XML
using a relational database system. SIGMOD 2002. - Wang, Jiang et al, SIGMOD03
- Wei Wang, Haifeng Jiang, Hongjun Lu, Jeffrey Xu
Yu. Containment Join Size Estimation Models and
Methods. The 2003 ACM SIGMOD International
Conference on Management of Data(SIGMOD03), San
Diego, California, June 9 - June 12, 2003. Pages
145-156
89References (7)
- Wang, Jiang et al, ICDE03
- Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey
Xu Yu. PBiTree coding and efficient processing of
containment joins. ICDE 2003. - Wu et al., EDBT02
- Yuqing Wu, Jignesh Patel, H. V. Jagadish. Using
Histograms to Estimate Answer Size for XML
Queries. Information Systems 28 (1-2) 33-59
(2003) -- Special Issue Best Papers from EDBT
2002. - Wu et al, ICDE03
- Yuqing Wu, Jignesh Patel and H.V. Jagadish,
Structural Join Order Selection for XML Query
Optimization. ICDE 2003. - Yoshikawa, et. al. TOIT01
- Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki
Shimura, Shunsuke Uemura XRel a path-based
approach to storage and retrieval of XML
documents using relational databases. ACM Trans.
Internet Techn. 1(1) 110-141 (2001) - Zhou et. al. VLDB 2001
- Aoying Zhou, Hongjun Lu, Shihui Zheng, Yuqi
Liang, Long Zhang, Wenyun Ji, Zengping Tian
VXMLR A Visual XML-Relational Database System.
VLDB 2001 719-720 - Zhang, Naughton SIGMOD01
- Chun Zhang, Jeffrey F. Naughton, David J.
DeWitt, Qiong Luo, and Guy M. Lohman. On
supporting containment queries in relational
database management systems. SIGMOD 2001.