Title: XML, XML Schema, XPath and XQuery
1XML, XML Schema, XPath and XQuery
Slides collated from various sources, many from
Dan Suciu at Univ. of Washington
2XML
- W3C standard to complement HTML
- origins: structured text (SGML)
- motivation:
  - HTML describes presentation
  - XML describes content
- http://www.w3.org/TR/2000/REC-xml-20001006 (second edition, 10/2000)
3From HTML to XML
HTML describes the presentation
4HTML
  <h1> Bibliography </h1>
  <p> <i> Foundations of Databases </i>
      Abiteboul, Hull, Vianu
      <br> Addison Wesley, 1995
  <p> <i> Data on the Web </i>
      Abiteboul, Buneman, Suciu
      <br> Morgan Kaufmann, 1999
5XML
  <bibliography>
    <book> <title> Foundations </title>
           <author> Abiteboul </author>
           <author> Hull </author>
           <author> Vianu </author>
           <publisher> Addison Wesley </publisher>
           <year> 1995 </year>
    </book>
    ...
  </bibliography>
XML describes the content
6XML Terminology
- tags: book, title, author, ...
- start tag: <book>, end tag: </book>
- elements: <book>...</book>, <author>...</author>
- elements are nested
- empty element: <red></red>, abbreviated <red/>
- an XML document has a single root element
An XML document is well formed if it has matching tags.
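A minimal sketch of the well-formedness rule, assuming Python's standard xml.etree.ElementTree parser as the checker (the helper name is_well_formed is made up for illustration). The parser rejects crossed tags and documents with more than one root element.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as one properly nested XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Matching tags, single root, one empty element.
print(is_well_formed("<book><title>Foundations</title><red/></book>"))
# End tags in the wrong order: not well formed.
print(is_well_formed("<book><title>Foundations</book></title>"))
# Two root elements: not well formed.
print(is_well_formed("<book/><book/>"))
```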
7More XML: Attributes
  <book price="55" currency="USD">
    <title> Foundations of Databases </title>
    <author> Abiteboul </author>
    ...
    <year> 1995 </year>
  </book>
Attributes are an alternative way to represent data.
8More XML: Oids and References
  <person id="o555"> <name> Jane </name> </person>
  <person id="o456"> <name> Mary </name>
                     <children idref="o123 o555"/>
  </person>
  <person id="o123" mother="o456"> <name> John </name>
  </person>
Oids and references in XML are just syntax.
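Because oids and references are just syntax, following one is left to the application. A small sketch, assuming the three persons sit under a hypothetical <people> root (added here only so the fragment parses) and using a plain dictionary as the oid index:

```python
import xml.etree.ElementTree as ET

# The <people> wrapper is an assumption; the slide shows only the persons.
DOC = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name>
    <children idref="o123 o555"/>
  </person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(DOC)

# Index every person by its oid so a reference can be followed by lookup.
by_id = {p.get("id"): p for p in root.iter("person")}

def name_of(oid: str) -> str:
    return by_id[oid].findtext("name")

mary = by_id["o456"]
# idref holds a space-separated list of oids.
children_of_mary = [name_of(oid) for oid in mary.find("children").get("idref").split()]
mother_of_john = name_of(by_id["o123"].get("mother"))

print(children_of_mary, mother_of_john)
```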
9More XML: CDATA Section
- Syntax: <![CDATA[ .....any text here... ]]>
- Example:
  <example> <![CDATA[ some text here </notAtag> <>]]>
  </example>
10XML Namespaces
- http://www.w3.org/TR/REC-xml-names (1/99)
- qualified name: prefix:localpart
  <book xmlns:isbn="www.isbn-org.org/def">
    <title> ... </title>
    <number> 15 </number>
    <isbn:number> ... </isbn:number>
  </book>
11XML Namespaces
- syntactic: <number>, <isbn:number>
- semantic: provide URL for schema
  <tag xmlns:mystyle="http://...">
    <mystyle:title> ... </mystyle:title>
    <mystyle:number> ...
  </tag>
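A sketch of the syntactic side of namespaces, assuming Python's xml.etree.ElementTree. The element text values are placeholders, not data from the slides. The parser expands a prefixed name to {namespace-URI}localpart, so <number> and <isbn:number> are different names.

```python
import xml.etree.ElementTree as ET

ISBN_NS = "www.isbn-org.org/def"

# Element contents are placeholders for illustration only.
DOC = f"""<book xmlns:isbn="{ISBN_NS}">
  <title>placeholder title</title>
  <number>15</number>
  <isbn:number>placeholder isbn</isbn:number>
</book>"""

book = ET.fromstring(DOC)

# Unprefixed name: no namespace.
plain_number = book.find("number")
# Prefixed name: the prefix is resolved through an explicit prefix-to-URI map.
isbn_number = book.find("isbn:number", {"isbn": ISBN_NS})

print(plain_number.tag)  # local name only
print(isbn_number.tag)   # expanded name with the namespace URI in braces
```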
12XML Data Model
- Several competing models
- Document Object Model (DOM)
  - http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001)
  - class hierarchy (node, element, attribute, ...)
  - objects have behavior
  - defines API to inspect/modify the document
- Infoset
- PSV (post schema validation)
- XML Query data model
13XML Schemas
- http://www.w3.org/TR/xmlschema-1/ (10/2000)
- generalizes DTDs
- uses XML syntax
- two documents: structure and datatypes
  - http://www.w3.org/TR/xmlschema-1
  - http://www.w3.org/TR/xmlschema-2
- XML Schema is complex
14XML Schemas
  <xsd:element name="paper" type="papertype"/>
  <xsd:complexType name="papertype">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="year"/>
      <xsd:choice> <xsd:element name="journal"/>
                   <xsd:element name="conference"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>
DTD: <!ELEMENT paper (title,author*,year,(journal|conference))>
15Elements vs. Types in XML Schema
  <xsd:element name="person">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="address" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="person" type="ttt"/>
  <xsd:complexType name="ttt">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="address" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

DTD: <!ELEMENT person (name,address)>
16Elements vs. Types in XML Schema
- Types
- Simple types (integers, strings, ...)
- Complex types (regular expressions, like in DTDs)
- Element-type-element alternation
- Root element has a complex type
- That type is a regular expression of elements
- Those elements have their complex types...
- ...
- On the leaves we have simple types
17Local and Global Types in XML Schema
- Local type:
  <xsd:element name="person">
    [define locally the person's type]
  </xsd:element>
- Global type:
  <xsd:element name="person" type="ttt"/>
  <xsd:complexType name="ttt">
    [define here the type ttt]
  </xsd:complexType>
Global types can be reused in other elements.
18Local vs. Global Elements in XML Schema
- Local element:
  <xsd:complexType name="ttt">
    <xsd:sequence>
      <xsd:element name="address" type="..."/>
      ...
    </xsd:sequence>
  </xsd:complexType>
- Global element:
  <xsd:element name="address" type="..."/>
  <xsd:complexType name="ttt">
    <xsd:sequence>
      <xsd:element ref="address"/>
      ...
    </xsd:sequence>
  </xsd:complexType>
Global elements: like in DTDs.
19Regular Expressions in XML Schema
- Recall the element-type-element alternation:
  <xsd:complexType name="....">
    [regular expression on elements]
  </xsd:complexType>
- Regular expressions:
  - <xsd:sequence> A B C </...>  =  A B C
  - <xsd:choice> A B C </...>  =  A | B | C
  - <xsd:group> A B C </...>  =  (A B C)
  - <xsd:... minOccurs="0" maxOccurs="unbounded"> .. </...>  =  (...)*
  - <xsd:... minOccurs="0" maxOccurs="1"> .. </...>  =  (...)?
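The "regular expression on elements" reading can be made concrete. A rough sketch, assuming the content model (title, author*, year, (journal | conference)) and encoding it by hand as a Python regular expression over the sequence of child names. This is an illustration of the idea, not how a schema validator is implemented, and the names PAPER_MODEL and matches_model are made up.

```python
import re
import xml.etree.ElementTree as ET

# sequence -> concatenation, choice -> |, minOccurs=0/maxOccurs=unbounded -> *
# Each child name is followed by one space so names cannot run together.
PAPER_MODEL = re.compile(r"title (author )*year (journal|conference) ")

def matches_model(element: ET.Element) -> bool:
    """Check the children of `element` against the hand-written content model."""
    child_names = "".join(child.tag + " " for child in element)
    return PAPER_MODEL.fullmatch(child_names) is not None

ok = ET.fromstring("<paper><title/><author/><author/><year/><journal/></paper>")
no_authors = ET.fromstring("<paper><title/><year/><conference/></paper>")
missing_venue = ET.fromstring("<paper><title/><year/></paper>")
wrong_order = ET.fromstring("<paper><year/><title/><journal/></paper>")

print([matches_model(p) for p in (ok, no_authors, missing_venue, wrong_order)])
```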
20Attributes in XML Schema
  <xsd:element name="paper" type="papertype"/>
  <xsd:complexType name="papertype">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      . . . . . .
    </xsd:sequence>
    <xsd:attribute name="language" type="xsd:NMTOKEN" fixed="English"/>
  </xsd:complexType>
Attributes are associated with the type, not with the element.
Only with complex types: more trouble if we want to add attributes to simple types.
21Mixed Content, Any Type
  <xsd:complexType mixed="true"> . . . .
- Better than in DTDs: can still enforce the type, but now may have text between any elements
  <xsd:element name="anything" type="xsd:anyType"/> . . . .
- Means anything is permitted there
22Derived Types by Extensions
  <complexType name="Address">
    <sequence>
      <element name="street" type="string"/>
      <element name="city" type="string"/>
    </sequence>
  </complexType>
  <complexType name="USAddress">
    <complexContent>
      <extension base="ipo:Address">
        <sequence>
          <element name="state" type="ipo:USState"/>
          <element name="zip" type="positiveInteger"/>
        </sequence>
      </extension>
    </complexContent>
  </complexType>
Corresponds to inheritance.
23Derived Types by Restrictions
- may restrict cardinalities, e.g. (0,infty) to (1,1)
- may restrict choices
- other restrictions
  <complexContent>
    <restriction base="ipo:Items">
      [rewrite the entire content, with restrictions...]
    </restriction>
  </complexContent>
Corresponds to set inclusion.
24Keys in XML Schema
XML:
  <purchaseReport>
    <regions>
      <zip code="95819">
        <part number="872-AA" quantity="1"/>
        <part number="926-AA" quantity="1"/>
        <part number="833-AA" quantity="1"/>
        <part number="455-BX" quantity="1"/>
      </zip>
      <zip code="63143">
        <part number="455-BX" quantity="4"/>
      </zip>
    </regions>
    <parts>
      <part number="872-AA">Lawnmower</part>
      <part number="926-AA">Baby Monitor</part>
      <part number="833-AA">Lapis Necklace</part>
      <part number="455-BX">Sturdy Shelves</part>
    </parts>
  </purchaseReport>
XML Schema:
  <key name="NumKey">
    <selector xpath="parts/part"/>
    <field xpath="@number"/>
  </key>
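What the NumKey constraint asks for can be sketched as a direct check, assuming Python's xml.etree.ElementTree and a hypothetical helper check_key that handles one attribute field only: select the nodes, read the field, and require that every value exists and no value repeats.

```python
import xml.etree.ElementTree as ET

REPORT = """<purchaseReport>
  <regions>
    <zip code="95819">
      <part number="872-AA" quantity="1"/>
      <part number="926-AA" quantity="1"/>
      <part number="833-AA" quantity="1"/>
      <part number="455-BX" quantity="1"/>
    </zip>
    <zip code="63143">
      <part number="455-BX" quantity="4"/>
    </zip>
  </regions>
  <parts>
    <part number="872-AA">Lawnmower</part>
    <part number="926-AA">Baby Monitor</part>
    <part number="833-AA">Lapis Necklace</part>
    <part number="455-BX">Sturdy Shelves</part>
  </parts>
</purchaseReport>"""

def check_key(context: ET.Element, selector: str, attribute: str) -> bool:
    """Key = existence + uniqueness of the field over the selected nodes."""
    values = [node.get(attribute) for node in context.findall(selector)]
    exists = all(v is not None for v in values)
    unique = len(values) == len(set(values))
    return exists and unique

report = ET.fromstring(REPORT)
key_holds_on_parts = check_key(report, "parts/part", "number")
# The same field is not a key over the regions: 455-BX occurs in two zips.
key_holds_on_regions = check_key(report, "regions/zip/part", "number")
print(key_holds_on_parts, key_holds_on_regions)
```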
25Keys in XML Schema
  <key name="someDummyNameHere">
    <selector xpath="p"/>
    <field xpath="p1"/>
    <field xpath="p2"/>
    . . .
    <field xpath="pk"/>
  </key>
  <unique name="someDummyNameHere">
    <selector xpath="p"/>
    <field xpath="p1"/>
    <field xpath="p2"/>
    . . .
    <field xpath="pk"/>
  </unique>
Note: all XPath expressions start at the element currently being defined.
The fields must identify a single node.
26Keys in XML Schema
- Unique: guarantees uniqueness
- Key: guarantees uniqueness and existence
- All XPath expressions are restricted:
  - /a/b | /a/c OK for selector
  - //a/b//c OK for field
- Note: better than DTDs' ID mechanism
27Keys in XML Schema
  <key name="fullName">
    <selector xpath=".//person"/>
    <field xpath="forename"/>
    <field xpath="surname"/>
  </key>
  <unique name="nearlyID">
    <selector xpath=".//*"/>
    <field xpath="@id"/>
  </unique>
Recall: each person must have a single forename and a single surname.
28Foreign Keys in XML Schema
  <keyref name="personRef" refer="fullName">
    <selector xpath=".//personPointer"/>
    <field xpath="@first"/>
    <field xpath="@last"/>
  </keyref>
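A sketch of what the personRef foreign key requires, assuming Python's xml.etree.ElementTree and a made-up instance document (the <staff> root and all names in it are hypothetical): every (first, last) pair on a personPointer must appear as a (forename, surname) pair of some person.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document, invented for this illustration.
DOC = """<staff>
  <person><forename>Ann</forename><surname>Lee</surname></person>
  <person><forename>Bob</forename><surname>Lee</surname></person>
  <personPointer first="Ann" last="Lee"/>
  <personPointer first="Cy" last="Lee"/>
</staff>"""

staff = ET.fromstring(DOC)

# The referenced key: selector .//person, fields forename and surname.
full_names = {
    (p.findtext("forename"), p.findtext("surname"))
    for p in staff.findall(".//person")
}

# The keyref: selector .//personPointer, fields @first and @last.
pointers = [
    (r.get("first"), r.get("last"))
    for r in staff.findall(".//personPointer")
]

# A pointer violates the keyref if no person carries that full name.
dangling = [pair for pair in pointers if pair not in full_names]
print(dangling)
```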
29XPATH
30XPath
- Goal: permit access to some nodes of the document
- XPath main construct: axis navigation
- An XPath path consists of one or more navigation steps, separated by /
- Navigation step: axis + node-test + predicates
- Examples:
  - /descendant::node()/child::author
  - /descendant::node()/child::author[parent/attribute::booktitle = "XML"][2]
- XPath also offers shortcuts:
  - no axis means child
  - // ≡ /descendant-or-self::node()/
31XPath- Child axis navigation
- author is shorthand for child::author. Examples:
  - aaa -- all the child nodes labeled aaa (1,3)
  - aaa/bbb -- all the bbb grandchildren of aaa children (4)
  - */bbb -- all the bbb grandchildren of any child (4,6)
  - . -- the context node
  - / -- the root node
32XPath- child axis navigation
- /doc -- all the doc children of the root
- ./aaa -- all the aaa children of the context node (equivalent to aaa)
- text() -- all the text children of the context node
- node() -- all the children of the context node (includes text nodes; attribute nodes are not children and are reached through the attribute axis)
- .. -- parent of the context node
- .// -- the context node and all its descendants
- // -- the root node and all its descendants
- //text() -- all the text nodes in the document
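A few of these paths can be tried with Python's xml.etree.ElementTree, which implements only a small subset of XPath (no text() or node() tests, for instance). The document below is made up; the context node is its <doc> element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document for illustration.
DOC = "<doc><aaa><bbb>x</bbb></aaa><ccc><bbb>y</bbb></ccc><aaa/></doc>"
context = ET.fromstring(DOC)

aaa_children = context.findall("aaa")                        # aaa
same_with_dot = context.findall("./aaa")                     # ./aaa, equivalent to aaa
bbb_under_aaa = [e.text for e in context.findall("aaa/bbb")]  # aaa/bbb
bbb_under_any = [e.text for e in context.findall("*/bbb")]    # */bbb
bbb_anywhere = [e.text for e in context.findall(".//bbb")]    # .//bbb

# ElementTree has no text() node test; itertext() walks all descendant text.
all_text = "".join(context.itertext())

print(len(aaa_children), bbb_under_aaa, bbb_under_any, bbb_anywhere, all_text)
```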
33Predicates
- *[2] -- the second child node of the context node
- chapter[5] -- the fifth chapter child of the context node
- *[last()] -- the last child node of the context node
- chapter[title="introduction"] -- the chapter children of the context node that have one or more title children whose string-value is "introduction" (the string-value is the concatenation of all the text on descendant text nodes)
- person[.//firstname = "joe"] -- the person children of the context node that have in their descendants a firstname element with string-value "joe"
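A sketch of predicates in Python's xml.etree.ElementTree, on a made-up document. ElementTree supports position predicates, last() and child-value tests; it does not support a nested path such as .//firstname inside a predicate, so that case is written as an ordinary Python filter here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document for illustration.
DOC = """<doc>
  <chapter><title>introduction</title></chapter>
  <chapter><title>syntax</title></chapter>
  <chapter><title>semantics</title></chapter>
  <person><name><firstname>joe</firstname></name></person>
  <person><name><firstname>ann</firstname></name></person>
</doc>"""
context = ET.fromstring(DOC)

second_chapter = context.find("chapter[2]").findtext("title")
last_chapter = context.find("chapter[last()]").findtext("title")
intro_chapters = context.findall("chapter[title='introduction']")

# person[.//firstname = "joe"], spelled out as a filter over descendants.
joes = [
    p for p in context.findall("person")
    if any(f.text == "joe" for f in p.iter("firstname"))
]

print(second_chapter, last_chapter, len(intro_chapters), len(joes))
```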
34Axis navigation
- So far, nearly all our expressions have moved us down by moving to child nodes. Exceptions were:
  - . -- stay where you are
  - / -- go to the root
  - // -- all descendants of the root
  - .// -- all descendants of the context node
- XPath has several axes: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self
- Some of these (self, parent) describe single nodes; others describe sequences of nodes.
35XPath Navigation Axes
[Figure: the navigation axes drawn around the context node (self): ancestor, preceding-sibling, following-sibling, child, attribute, namespace, descendant, preceding, following]
36XPath abbreviated syntax
  (nothing)   child::
  @           attribute::
  //          /descendant-or-self::node()/
  .           self::node()
  .//         descendant-or-self::node()/
  ..          parent::node()
  /           (document root)
37XPath
- Reasonably widely adopted -- in XML Schema and query languages.
- Neither more expressive nor less expressive than regular path expressions
38Quilt
- proposed by Chamberlin, Robie and Florescu
- (from the authors' slides)
  - Leverage the most effective features of several existing and proposed query languages
  - Design a small, clean, implementable language
  - Cover the functionality required by all the XML Query use cases in a single language
  - Write queries that fit on a slide
  - Design a quilt, not a camel
39Quilt/Kweelt URLs
Quilt (the language): http://www.almaden.ibm.com/cs/people/chamberlin/quilt_lncs.pdf
Kweelt (the implementation): http://db.cis.upenn.edu/Kweelt/
http://db.cis.upenn.edu/Kweelt/useCases (examples in these notes)
40Quilt = XPath + comprehension
41Examples of Quilt (http://db.cis.upenn.edu/Kweelt/useCases/R/Q1.qlt)
  <?xml version="1.0" ?>
  <!DOCTYPE items [
    <!ELEMENT items (item_tuple*)>
    <!ELEMENT item_tuple (itemno, description, offered_by,
                          start_date?, end_date?, reserve_price?)>
    <!ELEMENT itemno (#PCDATA)>
    <!ELEMENT description (#PCDATA)>
    <!ELEMENT offered_by (#PCDATA)>
    <!ELEMENT start_date (#PCDATA)>
    <!ELEMENT end_date (#PCDATA)>
    <!ELEMENT reserve_price (#PCDATA)>
  ]>

  <?xml version="1.0" ?>
  <!DOCTYPE bids [
    <!ELEMENT bids (bid_tuple*)>
    <!ELEMENT bid_tuple (userid, itemno, bid, bid_date)>
    <!ELEMENT userid (#PCDATA)>
    <!ELEMENT itemno (#PCDATA)>
    <!ELEMENT bid (#PCDATA)>
    <!ELEMENT bid_date (#PCDATA)>
  ]>
42The data
  <items>
    <item_tuple>
      <itemno>1001</itemno>
      <description>Red Bicycle</description>
      <offered_by>U01</offered_by>
      <start_date>1999-01-05</start_date>
      <end_date>1999-01-20</end_date>
      <reserve_price>40</reserve_price>
    </item_tuple>
    <item_tuple>
      <itemno>1002</itemno>
      <description>Motorcycle</description>
      <offered_by>U02</offered_by>
      <start_date>1999-02-11</start_date>
      <end_date>1999-03-15</end_date>
      <reserve_price>500</reserve_price>
    </item_tuple>
  </items>

  <bids>
    <bid_tuple>
      <userid>U02</userid>
      <itemno>1001</itemno>
      <bid>35</bid>
      <bid_date>99-01-07</bid_date>
    </bid_tuple>
    <bid_tuple>
      <userid>U04</userid>
      <itemno>1001</itemno>
      <bid>40</bid>
      <bid_date>99-01-08</bid_date>
    </bid_tuple>
  </bids>
43Query 1
  FUNCTION date() { "1999-02-01" }

  <result>
    (
      FOR $i IN document("items.xml")//item_tuple
      WHERE $i/start_date LEQ date()
        AND $i/end_date GEQ date()
        AND contains($i/description, "Bicycle")
      RETURN
        <item_tuple>
          $i/itemno ,
          $i/description
        </item_tuple>
      SORTBY (itemno)
    )
  </result>
- simple function definitions
- dates are formatted so that lexicographic ordering gives the right result
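The same selection can be sketched outside Quilt. A rough Python equivalent over the two sample items from the data slide, assuming every item carries both dates (the DTD makes them optional) and passing the date as a parameter instead of a function. The helper name bicycles_on_offer is made up. Dates are compared as strings, which is what the lexicographic-ordering remark relies on.

```python
import xml.etree.ElementTree as ET

ITEMS = """<items>
  <item_tuple><itemno>1001</itemno><description>Red Bicycle</description>
    <offered_by>U01</offered_by><start_date>1999-01-05</start_date>
    <end_date>1999-01-20</end_date><reserve_price>40</reserve_price></item_tuple>
  <item_tuple><itemno>1002</itemno><description>Motorcycle</description>
    <offered_by>U02</offered_by><start_date>1999-02-11</start_date>
    <end_date>1999-03-15</end_date><reserve_price>500</reserve_price></item_tuple>
</items>"""

def bicycles_on_offer(items_root, today):
    """Items on offer at `today` whose description contains "Bicycle"."""
    rows = [
        (i.findtext("itemno"), i.findtext("description"))
        for i in items_root.iter("item_tuple")                            # FOR
        if i.findtext("start_date") <= today <= i.findtext("end_date")   # WHERE (string compare)
        and "Bicycle" in i.findtext("description")
    ]
    return sorted(rows, key=lambda row: row[0])                           # SORTBY (itemno)

items = ET.fromstring(ITEMS)
print(bicycles_on_offer(items, "1999-01-10"))
print(bicycles_on_offer(items, "1999-02-01"))
```

On these two sample items the query date from the slide, 1999-02-01, selects nothing, because the Red Bicycle offer ended on 1999-01-20.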
44Output from Q1
  <?xml version="1.0" ?>
  <result>
    <item_tuple>
      <itemno> 1003 </itemno>
      <description> Old Bicycle </description>
    </item_tuple>
    <item_tuple>
      <itemno> 1007 </itemno>
      <description> Racing Bicycle </description>
    </item_tuple>
  </result>
45Query Q2
For all bicycles, list the item number, description, and highest bid (if any), ordered by item number.
  <result>
    (
      FOR $i IN document("items.xml")//item_tuple
      LET $b := document("bids.xml")//bid_tuple[itemno = $i/itemno]
      WHERE contains($i/description, "Bicycle")
      RETURN
        <item_tuple>
          $i/itemno ,
          $i/description ,
          IF ($b)
            THEN <high_bid> NumFormat(".", max(-1, $b/bid)) </high_bid>
            ELSE ""
        </item_tuple>
      SORTBY (itemno)
    )
  </result>
- use of variable in XPath
- lots of coercion
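A sketch of the join and the aggregate in Python, assuming trimmed-down sample data: item 1001 and its two bids come from the data slide, item 1008 is taken from the Q2 output slide with its other fields omitted. The coercion the slide mentions shows up as the explicit float() call. The helper name high_bids is made up.

```python
import xml.etree.ElementTree as ET

# Fields not needed by the query are left out of this sample.
ITEMS = """<items>
  <item_tuple><itemno>1001</itemno><description>Red Bicycle</description></item_tuple>
  <item_tuple><itemno>1002</itemno><description>Motorcycle</description></item_tuple>
  <item_tuple><itemno>1008</itemno><description>Broken Bicycle</description></item_tuple>
</items>"""

BIDS = """<bids>
  <bid_tuple><userid>U02</userid><itemno>1001</itemno><bid>35</bid></bid_tuple>
  <bid_tuple><userid>U04</userid><itemno>1001</itemno><bid>40</bid></bid_tuple>
</bids>"""

def high_bids(items_root, bids_root):
    rows = []
    for i in items_root.iter("item_tuple"):                  # FOR $i
        itemno = i.findtext("itemno")
        b = [                                                # LET $b: all bids on this item
            float(t.findtext("bid"))                         # text must be coerced to a number
            for t in bids_root.iter("bid_tuple")
            if t.findtext("itemno") == itemno
        ]
        if "Bicycle" in i.findtext("description"):           # WHERE
            rows.append((itemno, i.findtext("description"), max(b) if b else None))
    return sorted(rows, key=lambda row: row[0])              # SORTBY (itemno)

result = high_bids(ET.fromstring(ITEMS), ET.fromstring(BIDS))
print(result)
```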
46Output from Q2
  <result>
    <item_tuple>
      <itemno> 1001 </itemno>
      <description> Red Bicycle </description>
      <high_bid> 55 </high_bid>
    </item_tuple>
    <item_tuple>
      <itemno> 1003 </itemno>
      <description> Old Bicycle </description>
      <high_bid> 20 </high_bid>
    </item_tuple>
    <item_tuple>
      <itemno> 1007 </itemno>
      <description> Racing Bicycle </description>
      <high_bid> 225 </high_bid>
    </item_tuple>
    <item_tuple>
      <itemno> 1008 </itemno>
      <description> Broken Bicycle </description>
    </item_tuple>
  </result>
47Query Q3
Find cases where a user with a rating worse (alphabetically greater) than "C" offers an item with a reserve price of more than 1000.
  <result>
    (
      FOR $u IN document("users.xml")//user_tuple,
          $i IN document("items.xml")//item_tuple
      WHERE $u/rating GT 'C'
        AND $i/reserve_price GT 1000
        AND $i/offered_by = $u/userid
      RETURN
        <warning>
          <user_name> $u/name/text() </user_name>,
          <user_rating> $u/rating/text() </user_rating>,
          <item_description> $i/description/text() </item_description>,
          $i/reserve_price
        </warning>
    )
  </result>
- Comparing sets with singletons: same rules as in XPath?
- In this case the DTD gives uniqueness
48Quilt -- Attributes and IDs
  <census>
    <person name="Bill" job="Teacher">
      <person name="Joe" job="Painter" spouse="Martha">
        <person name="Sam" job="Nurse">
          <person name="Fred" job="Senator" spouse="Jane">
          </person>
        </person>
        <person name="Karen" job="Doctor" spouse="Steve">
        </person>
      </person>
      <person name="Mary" job="Pilot">
        <person name="Susan" job="Pilot" spouse="Dave">
        </person>
      </person>
    </person>
    <person name="Frank" job="Writer">
      <person name="Martha" job="Programmer" spouse="Joe">
        <person name="Dave" job="Athlete" spouse="Susan">
        </person>
      </person>
      ...
    </person>
  </census>

  <?xml version="1.0" ?>
  <!DOCTYPE census [
    <!ELEMENT census (person*)>
    <!ELEMENT person (person*)>
    <!ATTLIST person
      name   ID    #REQUIRED
      spouse IDREF #IMPLIED
      job    CDATA #IMPLIED >
  ]>
49Query Q1
Find Martha's spouse:
  FOR $m IN document("census.xml")//person[@name="Martha"]
  RETURN shallow($m/@spouse->{person@name})
- Dereferencing: the -> operator. A hack, because Kweelt does not read the DTD.
- The shallow function strips an element of its subelements.
50Query Q6
Find Bill's grandchildren.
  <result>
    (
      FOR $b IN document("census.xml")//person[@name = "Bill"],
          $c IN $b/person | $b/@spouse->{person@name}/person,
          $g IN $c/person | $c/@spouse->{person@name}/person
      RETURN shallow($g)
    )
  </result>
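Both census queries can be sketched in Python, assuming the census data from the slide with the elided part under Frank left out. Dereferencing is a dictionary lookup on the name attribute, and shallow is approximated by returning only the attributes. The helper names spouse_of, shallow and children_of are made up.

```python
import xml.etree.ElementTree as ET

# Census data from the slide; the "..." under Frank is omitted.
CENSUS = """<census>
  <person name="Bill" job="Teacher">
    <person name="Joe" job="Painter" spouse="Martha">
      <person name="Sam" job="Nurse">
        <person name="Fred" job="Senator" spouse="Jane"/>
      </person>
      <person name="Karen" job="Doctor" spouse="Steve"/>
    </person>
    <person name="Mary" job="Pilot">
      <person name="Susan" job="Pilot" spouse="Dave"/>
    </person>
  </person>
  <person name="Frank" job="Writer">
    <person name="Martha" job="Programmer" spouse="Joe">
      <person name="Dave" job="Athlete" spouse="Susan"/>
    </person>
  </person>
</census>"""

census = ET.fromstring(CENSUS)
by_name = {p.get("name"): p for p in census.iter("person")}

def spouse_of(person):
    """Dereference @spouse; None if absent or not present in the document."""
    return by_name.get(person.get("spouse"))

def shallow(person):
    """Keep the element's attributes, drop its subelements."""
    return dict(person.attrib)

def children_of(person):
    """Own child persons plus the child persons of the spouse, if any."""
    kids = list(person.findall("person"))
    partner = spouse_of(person)
    if partner is not None:
        kids += partner.findall("person")
    return kids

martha_spouse = shallow(spouse_of(by_name["Martha"]))
grandchildren = [
    g.get("name")
    for c in children_of(by_name["Bill"])
    for g in children_of(c)
]
print(martha_spouse, grandchildren)
```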
51Query Languages - XQuery
52Summary of XQuery
- FLWR expressions
- FOR and LET expressions
- Collections and sorting
- Resources:
  - XQuery: A Query Language for XML, Chamberlin, Florescu, et al.
  - W3C recommendation: www.w3.org/TR/xquery/
53XQuery
- Based on Quilt (which is based on XML-QL)
- http://www.w3.org/TR/xquery/ (2/2001)
- XML Query data model (ordered)
54FLWR (Flower) Expressions
- FOR ... LET... FOR... LET...
- WHERE...
- RETURN...
55XQuery
- Find all book titles published after 1995:
  FOR $x IN document("bib.xml")/bib/book
  WHERE $x/year > 1995
  RETURN $x/title
Result:
  <title> abc </title>
  <title> def </title>
  <title> ghi </title>
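A FOR/WHERE/RETURN query reads much like a list comprehension. A small sketch in Python, assuming a made-up bib.xml fragment (titles and years are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of bib.xml.
BIB = """<bib>
  <book><title>abc</title><year>1997</year></book>
  <book><title>def</title><year>1999</year></book>
  <book><title>old</title><year>1990</year></book>
</bib>"""

bib = ET.fromstring(BIB)

titles = [
    x.findtext("title")                    # RETURN $x/title
    for x in bib.findall("book")           # FOR $x IN .../bib/book
    if int(x.findtext("year")) > 1995      # WHERE $x/year > 1995
]
print(titles)
```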
56XQuery
- For each author of a book by Morgan Kaufmann, list all books she published:
  FOR $a IN distinct(document("bib.xml")/bib/book[publisher="Morgan Kaufmann"]/author)
  RETURN <result>
           $a,
           FOR $t IN /bib/book[author=$a]/title
           RETURN $t
         </result>
distinct = a function that eliminates duplicates
57XQuery
- Result:
  <result>
    <author> Jones </author>
    <title> abc </title>
    <title> def </title>
  </result>
  <result>
    <author> Smith </author>
    <title> ghi </title>
  </result>
58XQuery
- FOR $x IN expr -- binds $x to each element in the list expr
- LET $x := expr -- binds $x to the entire list expr
  - Useful for common subexpressions and for aggregations
59XQuery
  <big_publishers>
    FOR $p IN distinct(document("bib.xml")//publisher)
    LET $b := document("bib.xml")/bib/book[publisher = $p]
    WHERE count($b) > 100
    RETURN $p
  </big_publishers>
count = an (aggregate) function that returns the number of elements
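The FOR/LET split in this query can be mirrored step by step. A sketch in Python on made-up data, with the threshold turned into a parameter so a tiny sample can exercise it (the slide uses 100). The function name big_publishers and the sample contents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of bib.xml.
BIB = """<bib>
  <book><title>abc</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>def</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>ghi</title><publisher>Addison Wesley</publisher></book>
</bib>"""

def big_publishers(bib_root, threshold):
    result = []
    # FOR $p IN distinct(//publisher): iterate once per distinct publisher.
    for p in sorted({e.text for e in bib_root.iter("publisher")}):
        # LET $b := book[publisher = $p]: bind the whole collection at once.
        b = [bk for bk in bib_root.findall("book") if bk.findtext("publisher") == p]
        # WHERE count($b) > threshold
        if len(b) > threshold:
            result.append(p)
    return result

bib = ET.fromstring(BIB)
print(big_publishers(bib, 1))
```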
60XQuery
- Find books whose price is larger than average:
  LET $a := avg(document("bib.xml")/bib/book/@price)
  FOR $b IN document("bib.xml")/bib/book
  WHERE $b/@price > $a
  RETURN $b
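Here LET computes one value over the whole collection before FOR iterates. A sketch in Python, with invented prices and titles:

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of bib.xml.
BIB = """<bib>
  <book price="30"><title>abc</title></book>
  <book price="50"><title>def</title></book>
  <book price="100"><title>ghi</title></book>
</bib>"""

bib = ET.fromstring(BIB)
books = bib.findall("book")

# LET $a := avg(.../book/@price): a single value for the whole collection.
a = sum(float(b.get("price")) for b in books) / len(books)

# FOR $b ... WHERE $b/@price > $a RETURN $b
above_average = [b.findtext("title") for b in books if float(b.get("price")) > a]
print(a, above_average)
```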
61XQuery
- Summary:
  - FOR-LET-WHERE-RETURN = FLWR
  FOR/LET clauses
    -> list of tuples
  WHERE clause
    -> list of tuples
  RETURN clause
    -> instance of XQuery data model
62FOR vs. LET
- FOR
  - Binds node variables -> iteration
- LET
  - Binds collection variables -> one value
63FOR vs. LET
  FOR $x IN document("bib.xml")/bib/book
  RETURN <result> $x </result>
Returns:
  <result> <book>...</book> </result>
  <result> <book>...</book> </result>
  <result> <book>...</book> </result>
  ...

  LET $x := document("bib.xml")/bib/book
  RETURN <result> $x </result>
Returns:
  <result> <book>...</book>
           <book>...</book>
           <book>...</book>
           ...
  </result>
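The difference in output shape can be sketched with plain strings in Python, using three placeholder books:

```python
# Placeholder books standing in for document("bib.xml")/bib/book.
books = ["<book>a</book>", "<book>b</book>", "<book>c</book>"]

# FOR $x IN books RETURN <result> $x </result>
# x is bound to one book at a time, so there is one <result> per book.
for_results = [f"<result>{x}</result>" for x in books]

# LET $x := books RETURN <result> $x </result>
# x is bound to the whole list, so there is a single <result> holding all books.
let_result = "<result>" + "".join(books) + "</result>"

print(for_results)
print(let_result)
```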
64Collections in XQuery
- Ordered and unordered collections
  - /bib/book/author = an ordered collection
  - distinct(/bib/book/author) = an unordered collection
- LET $a := /bib/book -> $a is a collection
- $b/author -> a collection (several authors...)
  RETURN <result> $b/author </result>
Returns:
  <result> <author>...</author>
           <author>...</author>
           <author>...</author>
           ...
  </result>
65Sorting in XQuery
  <publisher_list>
    FOR $p IN distinct(document("bib.xml")//publisher)
    RETURN <publisher>
             <name> $p/text() </name> ,
             FOR $b IN document("bib.xml")//book[publisher = $p]
             RETURN <book>
                      $b/title ,
                      $b/@price
                    </book> SORTBY(price DESCENDING)
           </publisher> SORTBY(name)
  </publisher_list>
66Sorting in XQuery
- Sorting arguments refer to the name space of the RETURN clause, not the FOR clause
- To sort on an element you don't want to display, first return it, then remove it with an additional query.
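The two-step pattern (return the sort key, sort, then strip it) can be sketched in Python with invented titles and prices:

```python
# Hypothetical (title, price) pairs as returned by a first query.
returned = [("abc", 30.0), ("def", 80.0), ("ghi", 55.0)]

# Step 1: the sort key (price) is part of what was returned, so it can be sorted on.
by_price_descending = sorted(returned, key=lambda row: row[1], reverse=True)

# Step 2: an additional pass removes the price, keeping only what should be displayed.
titles_only = [title for title, _price in by_price_descending]

print(titles_only)
```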
67If-Then-Else
  FOR $h IN //holding
  RETURN <holding>
           $h/title,
           IF $h/@type = "Journal"
             THEN $h/editor
             ELSE $h/author
         </holding> SORTBY (title)
68Existential Quantifiers
  FOR $b IN //book
  WHERE SOME $p IN $b//para SATISFIES
        contains($p, "sailing") AND contains($p, "windsurfing")
  RETURN $b/title
69Universal Quantifiers
  FOR $b IN //book
  WHERE EVERY $p IN $b//para SATISFIES
        contains($p, "sailing")
  RETURN $b/title
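SOME and EVERY correspond closely to Python's any() and all(). A sketch on a made-up document (book titles and paragraph text are invented); note that EVERY, like all(), is true for a book with no para at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical document for illustration.
BOOKS = """<books>
  <book><title>Wind and Water</title>
    <para>sailing and windsurfing</para><para>sailing only</para></book>
  <book><title>Knots</title><para>sailing knots</para></book>
  <book><title>Gardening</title><para>roses</para></book>
</books>"""

def text_of(element):
    """String-value: all descendant text concatenated."""
    return "".join(element.itertext())

def titles_some(root):
    # SOME $p IN $b//para SATISFIES contains(sailing) AND contains(windsurfing)
    return [
        b.findtext("title") for b in root.iter("book")
        if any("sailing" in text_of(p) and "windsurfing" in text_of(p)
               for p in b.iter("para"))
    ]

def titles_every(root):
    # EVERY $p IN $b//para SATISFIES contains(sailing)
    return [
        b.findtext("title") for b in root.iter("book")
        if all("sailing" in text_of(p) for p in b.iter("para"))
    ]

books_root = ET.fromstring(BOOKS)
print(titles_some(books_root), titles_every(books_root))
```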