Title: XML, Schemas, and XPath
1. XML, Schemas, and XPath
- Zachary G. Ives
- University of Pennsylvania
- CIS 550: Database & Information Systems
- October 14, 2004
Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
2. Announcements
- Homework 3 due today 
- Homework 4 handed out 
- Midterm Thursday 10/28
3. Why We're Interested in XML
- Can get data from all sorts of sources
- Allows us to touch data we don't own!
- This was actually a huge change in the DB community
- Used for sharing data
- Interesting relationships with DB techniques
- Useful to do relational-style operations
- Leverages ideas from object-oriented, semistructured data
- Blends schema and data into one format
- Unlike relational model, where we need schema first
- But too little schema can be a drawback, too!
4. Basic XML Anatomy
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <dblp>
      <mastersthesis mdate="2002-01-03" key="ms/Brown92">
        <author>Kurt P. Brown</author>
        <title>PRPL: A Database Workload Specification Language</title>
        <year>1992</year>
        <school>Univ. of Wisconsin-Madison</school>
      </mastersthesis>
      <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
        <editor>Paul R. McJones</editor>
        <title>The 1995 SQL Reunion</title>
        <journal>Digital System Research Center Report</journal>
        <volume>SRC1997-018</volume>
        <year>1997</year>
        <ee>db/labs/dec/SRC1997-018.html</ee>
        <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
      </article>
    </dblp>
Labels on the slide: processing instruction (the <?xml ... ?> line), open-tag, element, attribute, close-tag.
5. Well-Formed XML
- A legal XML document: fully parsable by an XML parser
- All open-tags have matching close-tags (unlike so many HTML documents!), or a special <tag/> shortcut for empty tags (equivalent to <tag></tag>)
- Attributes (which are unordered, in contrast to elements) only appear once in an element
- There's a single root element
- XML is case-sensitive
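The well-formedness rules above can be checked with any XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` and the sample strings are made up for illustration and are not from the lecture.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the string parses as one well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical test documents, one per rule listed on the slide.
samples = {
    "matched tags": "<a><b>text</b></a>",
    "empty-tag shortcut": "<a><b/></a>",
    "missing close-tag": "<a><b>text</a>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "two root elements": "<a/><b/>",
    "case mismatch": "<a></A>",
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

Only the first two samples parse; the parser rejects the rest, one for each rule that is violated.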
6. XML as a Data Model
- XML information set includes 7 types of nodes 
- Document (root) 
- Element 
- Attribute 
- Processing instruction 
- Text (content) 
- Namespace 
- Comment 
- XML data model includes this, plus typing info, 
 plus order info and a few other things
7. XML Data Model Visualized (and simplified!)
[Figure: tree rendering of the sample DBLP document. Legend: root, p-i (processing instruction), element, attribute, text. Root has the ?xml processing instruction and the dblp element beneath it; dblp has mastersthesis and article children, each with mdate and key attributes; their child elements (author, title, year, school; editor, title, year, journal, volume, ee, ee) end in text leaves.]
8. What Does XML Do?
- Serves as a document format (super-HTML)
- Allows custom tags (e.g., used by MS Word, OpenOffice)
- Supplement it with stylesheets (XSL) to define 
 formatting
- Data exchange format (must agree on terminology) 
- Marshalling and unmarshalling data in SOAP and 
 Web Services
9. XML as a Super-HTML (MS Word)
    <h1 class="Section1"><a name="_top" />CIS 550: Database and Information Systems</h1>
    <h2 class="Section1">Fall 2004</h2>
    <p class="MsoNormal">
      <place>311 Towne</place>, Tuesday/Thursday
      <time Hour="13" Minute="30">1:30PM - 3:00PM</time>
    </p>
10. XML Easily Encodes Relations
Student-course-grade
    sid   serno    exp-grade
    1     570103   B
    23    550103   A

    <student-course-grade>
      <tuple><sid>1</sid><serno>570103</serno><exp-grade>B</exp-grade></tuple>
      <tuple><sid>23</sid><serno>550103</serno><exp-grade>A</exp-grade></tuple>
    </student-course-grade>
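The relation-to-XML encoding on this slide is mechanical: one wrapper element per relation, one tuple element per row, one child element per column. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the function name `relation_to_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            ET.SubElement(tup, col).text = str(value)
    return root

# The Student-course-grade relation from the slide.
columns = ["sid", "serno", "exp-grade"]
rows = [(1, 570103, "B"), (23, 550103, "A")]

root = relation_to_xml("student-course-grade", columns, rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The serialized text has no whitespace between elements but otherwise has the same structure as the XML shown on the slide.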
11. But XML is More Flexible: Non-First-Normal-Form (NF2)
    <parents>
      <parent name="Jean">
        <son>John</son>
        <daughter>Joan</daughter>
        <daughter>Jill</daughter>
      </parent>
      <parent name="Feng">
        <daughter>Felicity</daughter>
      </parent>
    </parents>
Coincides with semi-structured data, invented by DB people at Penn and Stanford
12. XML and Code
- Web Services (.NET, recent Java web service 
 toolkits) are using XML to pass parameters and
 make function calls
- Why? 
- Easy to be forwards-compatible 
- Easy to read over and validate (?) 
- Generally firewall-compatible 
- Drawbacks? XML is a verbose and inefficient 
 encoding!
- XML is used to represent:
- SOAP: the envelope that data is marshalled into
- XML Schema gives some typing info about structures being passed
- WSDL: the IDL (interface definition language)
- UDDI provides an interface for querying about web services
13. Integrating XML: What If We Have Multiple Sources with the Same Tags?
- Namespaces allow us to specify a context for different tags
- Two parts:
- Binding of namespace to URI
- Qualified names
    <root xmlns="http://www.first.com/aspace" xmlns:otherns="...">
      <tag xmlns:myns="http://www.fictitious.com/mypath">
        <thistag>is in the default namespace (aspace)</thistag>
        <myns:thistag>is in myns</myns:thistag>
        <otherns:thistag>is a different tag in otherns</otherns:thistag>
      </tag>
    </root>
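To see how a namespace-aware parser tells the three thistag elements apart, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which expands each qualified name to `{URI}localname`. The otherns URI is not given on the slide, so `http://www.example.com/otherns` below is a placeholder assumption.

```python
import xml.etree.ElementTree as ET

# The otherns URI is a placeholder; the slide does not give one.
doc = """
<root xmlns="http://www.first.com/aspace"
      xmlns:otherns="http://www.example.com/otherns">
  <tag xmlns:myns="http://www.fictitious.com/mypath">
    <thistag>is in the default namespace (aspace)</thistag>
    <myns:thistag>is in myns</myns:thistag>
    <otherns:thistag>is a different tag in otherns</otherns:thistag>
  </tag>
</root>
"""

root = ET.fromstring(doc)
tag = root[0]

# Each child has the same local name but a different expanded name.
qualified_names = [child.tag for child in tag]
for name in qualified_names:
    print(name)

# Queries bind their own prefixes to URIs; the prefix "m" is arbitrary.
in_myns = root.findall(".//m:thistag", {"m": "http://www.fictitious.com/mypath"})
print([e.text for e in in_myns])
```

The prefix used in a query does not have to match the prefix used in the document; only the URI it is bound to matters.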
14. XML Isn't Enough on Its Own
- It's too unconstrained for many cases!
- How will we know when we're getting garbage?
- How will we query?
- How will we understand what we got?
- We also need:
- Some idea of the structure
- Our focus next
- Presentation, in some cases: XSL(T)
- We'll talk about this soon
- Some way of interpreting the tags?
- We'll talk about this later in the semester
15. Structural Constraints: Document Type Definitions (DTDs)
- The DTD is an EBNF grammar defining XML structure
- XML document specifies an associated DTD, plus the root element
- DTD specifies children of the root (and so on)
- DTD defines special significance for attributes:
- IDs: special attributes that are analogous to keys for elements
- IDREFs: references to IDs
- IDREFS: a nasty hack that represents a list of IDREFs
16. An Example DTD
- Example DTD:
    <!ELEMENT dblp (mastersthesis | article)*>
    <!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
    <!ATTLIST mastersthesis mdate   CDATA #REQUIRED
                            key     ID    #REQUIRED
                            advisor CDATA #IMPLIED>
    <!ELEMENT author (#PCDATA)>
    ...
- Example use of DTD in XML file:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <!DOCTYPE dblp SYSTEM "my.dtd">
    <dblp>
    ...
17. Representing Graphs and Links in XML
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <!DOCTYPE graph SYSTEM "special.dtd">
    <graph>
      <author id="author1">
        <name>John Smith</name>
      </author>
      <article>
        <author ref="author1" /> <title>Paper1</title>
      </article>
      <article>
        <author ref="author1" /> <title>Paper2</title>
      </article>
    </graph>
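The id/ref attributes turn the tree into a graph only once something follows the references. A minimal sketch of that dereferencing step, assuming Python's standard `xml.etree.ElementTree`; the DOCTYPE line is left out because special.dtd is not available, so id and ref are treated here as plain attributes by convention rather than as DTD-declared ID/IDREF.

```python
import xml.etree.ElementTree as ET

doc = """
<graph>
  <author id="author1">
    <name>John Smith</name>
  </author>
  <article>
    <author ref="author1" /> <title>Paper1</title>
  </article>
  <article>
    <author ref="author1" /> <title>Paper2</title>
  </article>
</graph>
"""

root = ET.fromstring(doc)

# Index every element that carries an id attribute (the "key" side).
by_id = {elem.get("id"): elem for elem in root.iter() if elem.get("id") is not None}

# Follow each article's ref to the author element it points at.
papers_by_author = {}
for article in root.findall("article"):
    target = by_id[article.find("author").get("ref")]
    name = target.findtext("name")
    papers_by_author.setdefault(name, []).append(article.findtext("title"))

print(papers_by_author)
```

Both articles resolve to the same author element, which is what makes the structure a graph rather than a tree.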
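18. Graph Data Model
[Figure: tree of the graph document. Root has children ?xml, !DOCTYPE, and graph; graph contains one author element (attribute id = author1, child name = John Smith) and two article elements, each with a title (Paper1, Paper2) and an author child whose ref attribute holds the value author1.]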
19. Graph Data Model
[Figure: the same nodes as on the previous slide, with the two ref attributes shown without their author1 text values.]
20. DTDs Aren't Expressive Enough
- DTDs capture grammatical structure, but have some drawbacks:
- Not themselves in XML: inconvenient to build tools for them
- Don't capture database datatypes' domains
- IDs aren't a good implementation of keys
- Why not?
- No way of defining OO-like inheritance
21. XML Schema
- Aims to address the shortcomings of DTDs
- XML syntax
- Can define keys using XPaths
- Type subclassing that's more complex than in a programming language
- Programming languages don't consider order of member variables!
- Subclassing by extension and by restriction
- And, of course, domains and built-in datatypes
22. Basics of XML Schema
- Need to use the XML Schema namespace (generally named xsd)
- simpleTypes are a way of restricting domains on 
 scalars
- Can define a simpleType based on integer, with 
 values within a particular range
- complexTypes are a way of defining 
 element/attribute structures
- Basically equivalent to <!ELEMENT>, but more powerful
- Specify sequence, choice between child elements 
- Specify minOccurs and maxOccurs (default 1) 
- Must associate an element/attribute with a 
 simpleType, or an element with a complexType
23. Simple Schema Example
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="mastersthesis" type="ThesisType"/>
      <xsd:complexType name="ThesisType">
        <xsd:sequence>
          <xsd:element name="author" type="xsd:string"/>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="year" type="xsd:integer"/>
          <xsd:element name="school" type="xsd:string"/>
          <!-- CommitteeType is not defined in this fragment -->
          <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
        </xsd:sequence>
        <!-- attribute declarations must follow the content model -->
        <xsd:attribute name="mdate" type="xsd:date"/>
        <xsd:attribute name="key" type="xsd:string"/>
        <xsd:attribute name="advisor" type="xsd:string"/>
      </xsd:complexType>
    </xsd:schema>
24. Designing an XML Schema/DTD
- Not as formalized as relational data design
- We can still use ER diagrams to break into entity, relationship sets
- ER diagrams have extensions for aggregation (treating smaller diagrams as entities) and for composite attributes
- Note that often we already have our data in relations and need to design the XML schema to export them!
- Generally orient the XML tree around the central objects
- Big decision: element vs. attribute
- Element if it has its own properties, or if you might have more than one of them
- Attribute if it is a single property, or perhaps not!
25. XML as a Data Model
- XML is a non-first-normal-form (NF2) representation
- Can represent documents, data
- Standard data exchange format
- Several competing schema formats (esp. DTD and XML Schema) provide typing information
- Next: basics of querying XML
26. Querying XML
- How do you query a directed graph? a tree?
- The standard approach used by many XML, semistructured-data, and object query languages:
- Define some sort of a template describing traversals from the root of the directed graph
- In XML, the basis of this template is called an XPath
27. XPaths
- In its simplest form, an XPath is like a path in a file system:
- /mypath/subpath//morepath
- The XPath returns a node set representing the XML nodes (and their subtrees) at the end of the path
- XPaths can have node tests at the end, returning only particular node types, e.g., text(), processing-instruction(), comment(), element(), attribute()
- XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order
28. Sample XML
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <dblp>
      <mastersthesis mdate="2002-01-03" key="ms/Brown92">
        <author>Kurt P. Brown</author>
        <title>PRPL: A Database Workload Specification Language</title>
        <year>1992</year>
        <school>Univ. of Wisconsin-Madison</school>
      </mastersthesis>
      <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
        <editor>Paul R. McJones</editor>
        <title>The 1995 SQL Reunion</title>
        <journal>Digital System Research Center Report</journal>
        <volume>SRC1997-018</volume>
        <year>1997</year>
        <ee>db/labs/dec/SRC1997-018.html</ee>
        <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
      </article>
    </dblp>
29. XML Data Model Visualized
[Figure: tree rendering of the sample XML. Legend: root, p-i (processing instruction), element, attribute, text. Root has ?xml and dblp beneath it; dblp has mastersthesis and article children with their mdate and key attributes, child elements, and text leaves.]
30. Some Example XPath Queries
- /dblp/mastersthesis/title 
- /dblp//editor 
- //title 
- //title/text()
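These four queries can be tried against the sample XML. The sketch below assumes Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath: paths are evaluated relative to an element (here the dblp root element) rather than from the document root, and there is no text() node test, so the text is read from each element's `.text` instead.

```python
import xml.etree.ElementTree as ET

doc = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
</dblp>"""

dblp = ET.fromstring(doc)  # the dblp element; paths below are relative to it

# /dblp/mastersthesis/title
thesis_titles = [t.text for t in dblp.findall("mastersthesis/title")]

# /dblp//editor
editors = [e.text for e in dblp.findall(".//editor")]

# //title  (element nodes, returned in document order)
title_elements = dblp.findall(".//title")

# //title/text()  (no text() test in ElementTree; read .text instead)
title_texts = [t.text for t in title_elements]

print(thesis_titles)
print(editors)
print(title_texts)
```

A full XPath engine would accept the absolute paths exactly as written on the slide; the relative forms here are a workaround for the standard library's subset.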
31. Context Nodes and Relative Paths
- XPath has a notion of a context node: it's analogous to a current directory
- "." represents this context node
- ".." represents the parent node
- We can express relative paths
- subpath/sub-subpath/../.. gets us back to the context node
- By default, the document root is the context node
32. Predicates: Selection Operations
- A predicate allows us to filter the node set based on selection-like conditions over sub-XPaths:
- /dblp/article[title = "Paper1"]
- which is equivalent to:
- /dblp/article[./title/text() = "Paper1"]
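A predicate behaves like a selection: it keeps only the nodes for which the condition holds. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (which supports the `[child='text']` form of predicate) and a made-up two-article document; the key values a1 and a2 are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical dblp fragment, invented for this example.
doc = """<dblp>
  <article key="a1"><title>Paper1</title><year>2003</year></article>
  <article key="a2"><title>Paper2</title><year>2004</year></article>
</dblp>"""

dblp = ET.fromstring(doc)

# /dblp/article[title = "Paper1"], evaluated relative to the dblp element
matches = dblp.findall("article[title='Paper1']")
matched_keys = [a.get("key") for a in matches]

# The same selection written out by hand: keep an article if any of its
# title children has the text "Paper1".
manual_keys = [
    a.get("key")
    for a in dblp.findall("article")
    if any(t.text == "Paper1" for t in a.findall("title"))
]

print(matched_keys, manual_keys)
```

The hand-written filter spells out what the predicate means: the comparison succeeds if at least one node in the sub-path's result matches.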
33. Axes: More Complex Traversals
- Thus far, we've seen XPath expressions that go down the tree (and up one step)
- But we might want to go up, left, right, etc.
- These are expressed with so-called axes:
- self::path-step
- child::path-step, parent::path-step
- descendant::path-step, ancestor::path-step
- descendant-or-self::path-step, ancestor-or-self::path-step
- preceding-sibling::path-step, following-sibling::path-step
- preceding::path-step, following::path-step
- The previous XPaths we saw were in abbreviated form
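To make the axes concrete, the sketch below computes a few of them by hand over a tiny made-up tree. It assumes Python's standard `xml.etree.ElementTree`, which does not implement XPath axes itself; the helper functions are hypothetical and only illustrate which nodes each axis selects.

```python
import xml.etree.ElementTree as ET

# A tiny made-up tree:  a -> (b -> (d, e), c -> (f))
root = ET.fromstring("<a><b><d/><e/></b><c><f/></c></a>")

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """ancestor::*, nearest ancestor first."""
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node)
    return found

def descendants(node):
    """descendant::*, in document order."""
    return [n for n in node.iter() if n is not node]

def following_siblings(node):
    """following-sibling::*, in document order."""
    if node not in parent_of:
        return []
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]

def preceding_siblings(node):
    """preceding-sibling::*, in document order."""
    if node not in parent_of:
        return []
    siblings = list(parent_of[node])
    return siblings[:siblings.index(node)]

d = root.find("b/d")
e = root.find("b/e")
print([n.tag for n in ancestors(d)])
print([n.tag for n in descendants(root)])
print([n.tag for n in following_siblings(d)])
print([n.tag for n in preceding_siblings(e)])
```

The child and parent axes are the one-step special cases of descendant and ancestor, which is why the abbreviated syntax covers them with / and ..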
34. Querying Order
- We saw in the previous slide that we could query for preceding or following siblings or nodes
- We can also query a node for its position according to some index:
- fn:last() returns the index of the last element matching the last step; indices start at 1, and standard XPath defines no fn:first()
- fn:position() gives the relative count of the current node
- child::article[fn:position() = fn:last()]
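Positional predicates can be tried with Python's standard `xml.etree.ElementTree`, which accepts the abbreviated forms `[1]`, `[last()]`, and `[last()-1]` but not the explicit position() function. A minimal sketch over a made-up three-article document:

```python
import xml.etree.ElementTree as ET

# A hypothetical document with three articles, invented for this example.
doc = """<dblp>
  <article><title>Paper1</title></article>
  <article><title>Paper2</title></article>
  <article><title>Paper3</title></article>
</dblp>"""

dblp = ET.fromstring(doc)

# child::article[position() = 1]; positions are 1-based
first = dblp.findall("article[1]")

# child::article[position() = last()]
last = dblp.findall("article[last()]")

# child::article[position() = last() - 1]
second_to_last = dblp.findall("article[last()-1]")

first_titles = [a.findtext("title") for a in first]
last_titles = [a.findtext("title") for a in last]
second_to_last_titles = [a.findtext("title") for a in second_to_last]
print(first_titles, last_titles, second_to_last_titles)
```

Position is counted among the nodes matched by the step (here, the article children of dblp), not among all children of the parent.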
35. Users of XPath
- XML Schema uses simple XPaths in defining keys and uniqueness constraints
- XQuery
- XSLT
- XLink and XPointer, hyperlinks for XML
- Next time we'll focus on XQuery, the nearly-complete SQL of XML
- And we'll briefly discuss XSLT, a different attempt to manipulate XML data