Title: XML, Schemas, and Queries
1XML, Schemas, and Queries
- Zachary G. Ives
- University of Pennsylvania
- CIS 455 / 555 Internet and Web Systems
- December 8, 2015
2Readings & Reminders
- Reminder: Homework 1 Milestone 2 due 2/15 at 11:59PM
- XML, DTD, Schema
- XPath
- XSLT
- For next week: Altinel & Franklin paper on XFilter
3Kinds of Content
- Keyword search and inverted indices are great for locating text documents
- But what if we want to index and/or share other kinds of content?
- Spreadsheets
- Maps
- Purchase records
- Objects
- etc.
- Let's talk about structured data representation and transport, then later indexing and retrieval
4Sending Data
- How do we send data within a program?
- What is the implicit model?
- How does this change when we need to make the data persistent?
- What happens when we are coupling systems?
- How do we send data between programs on the same machine?
- Between different machines?
5Marshalling
- Converting from an in-memory data structure to something that can be sent elsewhere
- Pointers -> something else
- Specific byte orderings
- Metadata
- Note that the same logical data gets a different physical encoding
- A specific case of Codd's idea of logical-physical separation
- Data model vs. data
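As a minimal sketch of the byte-ordering point, Python's `struct` module can marshal one logical record into two different physical encodings. The record layout (a 4-byte id plus a 1-byte grade) is an illustrative assumption, not something from the lecture.

```python
import struct

# Hypothetical logical record: (student id, grade as one ASCII byte).
record = (23, b"A")

# Same logical data, two physical encodings.
big = struct.pack(">Ic", *record)     # big-endian ("network") byte order
little = struct.pack("<Ic", *record)  # little-endian byte order

# Unmarshalling only works if the receiver assumes the sender's layout.
assert struct.unpack(">Ic", big) == record
assert struct.unpack("<Ic", little) == record
```

The two byte strings differ even though they carry the same record, which is the logical-physical separation in miniature.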
6Communication and Streams
- When storing data to disk, we have a combination of sequential and random access
- When sending data on the wire, data is only sequential
- Stream-based communication based on packets
- What are the implications here?
- Pipelining, incremental evaluation, ...
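One implication of sequential, packet-based delivery is that a receiver can start producing results before the whole message has arrived. The sketch below assumes a made-up framing (each record prefixed by a 2-byte big-endian length); the framing and the function name `records` are hypothetical.

```python
import struct

def records(chunks):
    """Yield length-prefixed records as soon as each one is complete."""
    buf = b""
    for chunk in chunks:              # chunks arrive strictly in order
        buf += chunk
        while len(buf) >= 2:
            (n,) = struct.unpack(">H", buf[:2])  # 2-byte length prefix
            if len(buf) < 2 + n:
                break                 # record not fully received yet
            yield buf[2:2 + n]
            buf = buf[2 + n:]

# Two records split across three "packets" at arbitrary boundaries.
packets = [b"\x00\x03ab", b"c\x00\x02", b"xy"]
received = list(records(packets))
```

Because `records` is a generator, a downstream consumer can be pipelined with it and process each record while later packets are still in flight.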
7Why Data Interchange Is Hard
- Need to be able to understand
- Data encoding (physical data model)
- May have syntactic heterogeneity
- Endian-ness, marshalling issues
- Impedance mismatches
- Data representation (logical data model)
- May have semantic heterogeneity
- Imprecise and ambiguous values/descriptions
8Examples
- MP3 ID3 format: record at end of file
offset  length  description
0       3       "TAG" identifier string.
3       30      Song title string.
33      30      Artist string.
63      30      Album string.
93      4       Year string.
97      28      Comment string.
125     1       Zero byte separator.
126     1       Track byte.
127     1       Genre byte.
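The table above maps directly onto a fixed-size record, so a reader only needs the offsets and lengths. A minimal sketch of decoding it follows; the function name `parse_id3v1`, the Latin-1 text encoding, and the stripping of padding bytes are assumptions made for illustration.

```python
import struct

# Field lengths follow the table above: 3+30+30+30+4+28+1+1+1 = 128 bytes.
ID3V1 = struct.Struct("<3s30s30s30s4s28sBBB")

def parse_id3v1(tag: bytes) -> dict:
    """Decode one 128-byte record; ValueError if the identifier is wrong."""
    ident, title, artist, album, year, comment, _zero, track, genre = ID3V1.unpack(tag)
    if ident != b"TAG":
        raise ValueError("missing TAG identifier")

    def text(raw: bytes) -> str:
        # Assumption: fields are padded with NULs or spaces, Latin-1 encoded.
        return raw.rstrip(b"\x00 ").decode("latin-1")

    return {"title": text(title), "artist": text(artist), "album": text(album),
            "year": text(year), "comment": text(comment),
            "track": track, "genre": genre}

# Build a sample record in memory instead of reading a real MP3 file.
sample = ID3V1.pack(b"TAG", b"Song", b"Artist", b"Album", b"1999", b"", 0, 7, 17)
info = parse_id3v1(sample)
```

With a real file, the record would be the last 128 bytes, but nothing in the bytes themselves tells a program that: the layout has to come from a manual.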
9Examples
- JPEG JFIF header
- Start of Image (SOI) marker -- two bytes (FFD8)
- JFIF marker (FFE0)
- length -- two bytes
- identifier -- five bytes: 4A, 46, 49, 46, 00 (the ASCII code equivalent of a zero-terminated "JFIF" string)
- version -- two bytes: often 01, 02
- the most significant byte is used for major revisions
- the least significant byte for minor revisions
- units -- one byte: units for the X and Y densities
- 0 => no units, X and Y specify the pixel aspect ratio
- 1 => X and Y are dots per inch
- 2 => X and Y are dots per cm
- Xdensity -- two bytes
- Ydensity -- two bytes
- Xthumbnail -- one byte: 0 = no thumbnail
- Ythumbnail -- one byte: 0 = no thumbnail
- (RGB)n -- 3n bytes: packed (24-bit) RGB values for the thumbnail pixels, n = Xthumbnail * Ythumbnail
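The same approach works for the JFIF header, with one extra piece of out-of-band knowledge: JPEG stores multi-byte fields in big-endian order. The sketch below covers only the fixed 20-byte prefix listed above and ignores the thumbnail pixels; `parse_jfif` and the sample values are illustrative assumptions.

```python
import struct

# SOI, APP0 marker, length, identifier, version (major, minor),
# units, Xdensity, Ydensity, Xthumbnail, Ythumbnail: 20 bytes, big-endian.
JFIF = struct.Struct(">2s2sH5sBBBHHBB")

def parse_jfif(data: bytes) -> dict:
    """Decode the fixed part of a JFIF header from the start of a file."""
    soi, app0, _length, ident, major, minor, units, xd, yd, xt, yt = \
        JFIF.unpack(data[:JFIF.size])
    if soi != b"\xff\xd8" or app0 != b"\xff\xe0" or ident != b"JFIF\x00":
        raise ValueError("not a JFIF header")
    return {"version": (major, minor), "units": units,
            "density": (xd, yd), "thumbnail": (xt, yt)}

# Sample header built in memory: version 1.02, 72 x 72 dots per inch.
sample = JFIF.pack(b"\xff\xd8", b"\xff\xe0", 16, b"JFIF\x00", 1, 2, 1, 72, 72, 0, 0)
header = parse_jfif(sample)
```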
10Finding File Formats
- http://www.wikipedia.org/
- http://www.wotsit.org/
- etc.
11The Problem
- You need to look into a manual to find file formats
- (At best, e.g., MS .DOC file format)
- The Web is about making data exchange easier. Maybe we can do better!
- The mother of all file formats
12Desiderata for Data Interchange
- Ability to represent many kinds of information
- Different data structures
- Hardware-independent encoding
- Endian-ness, UTF vs. ASCII vs. EBCDIC
- Standard tools and interfaces
- Ability to define shape of expected data
- With forwards- and backwards-compatibility!
- That's XML
13Consumers of XML
- A myriad of tools and interfaces, including
- DOM: document object model
- Standard OO representation of an XML tree
- SAX: simple API for XML
- An event-driven parser interface for XML
- startElement, endElement, etc.
- Ant: Java-based make tool with XML makefile
- XPath, XQuery, XSL, XSLT
- Web service standards
- Anything AJAX (mash-ups)
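To make the DOM/SAX contrast concrete, here is a minimal sketch using Python's standard-library `xml.dom.minidom` and `xml.sax` (the slides have Java in mind; the one-element document and the `TitleHandler` class are assumptions for illustration). DOM builds the whole tree first, while SAX delivers startElement/endElement events as parsing proceeds.

```python
import xml.sax
from xml.dom import minidom

DOC = "<dblp><article><title>The 1995 SQL Reunion</title></article></dblp>"

# DOM: parse the whole document into a tree, then navigate it.
dom = minidom.parseString(DOC)
dom_titles = [t.firstChild.data for t in dom.getElementsByTagName("title")]

# SAX: react to events while the parser runs; no tree is kept.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None          # collects text while inside <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._buf = None

handler = TitleHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_titles = handler.titles
```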
14XML as a Data Model
- XML information set includes 7 types of nodes
- Document (root)
- Element
- Attribute
- Processing instruction
- Text (content)
- Namespace
- Comment
- XML data model includes this, plus typing info,
plus order info and a few other things
15Example XML Document
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
(Slide callouts label the processing instruction, an open-tag, an element, an attribute, and a close-tag.)
16XML Data Model Visualized (Document Object Model)
[Figure: the example document drawn as a tree. Legend: root, p-i, element, attribute, text. Root has children ?xml (p-i) and dblp; dblp has element children mastersthesis and article, each with mdate and key attributes; their child elements (author, title, year, school; editor, title, year, journal, volume, ee, ee) end in text leaves.]
17A Few Common Uses of XML
- Serves as an extensible HTML
- Allows custom tags (e.g., used by MS Word, OpenOffice)
- Supplement it with stylesheets (XSL) to define formatting
- Provides an exchange format for data (still need to agree on terminology)
- Tables, objects, etc.
- Format for marshalling and unmarshalling data in Web Services
18XML as a Super-HTML (MS Word)
<h1 class="Section1"><a name="_top" />CIS 550: Database and Information Systems</h1>
<h2 class="Section1">Fall 2003</h2>
<p class="MsoNormal">
  <place>311 Towne</place>, Tuesday/Thursday
  <time Hour="13" Minute="30">1:30PM - 3:00PM</time>
</p>
...
19XML Easily Encodes Relations
Student-course-grade
id   course    grade
1    330-f03   B
23   455-s04   A
<student-course-grade>
  <tuple><sid>1</sid><course>330-f03</course><grade>B</grade></tuple>
  <tuple><sid>23</sid><course>455-s04</course><grade>A</grade></tuple>
</student-course-grade>
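The encoding is mechanical: one element per relation, one per tuple, one per attribute value. A minimal sketch with Python's `xml.etree.ElementTree`, using the two rows and the tag names from the slide (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

columns = ("sid", "course", "grade")
rows = [("1", "330-f03", "B"), ("23", "455-s04", "A")]

# One <tuple> per row, one child element per column.
relation = ET.Element("student-course-grade")
for row in rows:
    tup = ET.SubElement(relation, "tuple")
    for name, value in zip(columns, row):
        ET.SubElement(tup, name).text = value

xml_text = ET.tostring(relation, encoding="unicode")

# Decoding is the reverse walk over the tree.
decoded = [tuple(field.text for field in tup) for tup in ET.fromstring(xml_text)]
```

Note that every value comes back as a string; the relation's column types are not carried by the XML itself.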
20It Also Encodes Objects (with Pointers Represented as IDs)
<projects>
  <project class="cse455">
    <type>Programming</type>
    <memberList>
      <teamMember>Joan</teamMember>
      <teamMember>Jill</teamMember>
    </memberList>
    <codeURL>www.</codeURL>
    <incorporatesProjectFrom class="cse330" />
  </project>
21XML and Code
- Web Services (.NET, Java web service toolkits) are using XML to pass parameters and make function calls: marshalling as part of remote procedure calls
- SOAP and WSDL
- Why?
- Easy to be forwards-compatible
- Easy to read over and validate (?)
- Generally firewall-compatible
- Drawbacks? XML is a verbose and inefficient encoding!
- But if the calls are only sending a few 100s of bytes, who cares?
22XML: When Tags Are Used by Different Sources
- Namespaces allow us to specify a context for different tags
- Two parts
- Binding of namespace to URI
- Qualified names
<tag xmlns:myns="http://www.fictitious.com/mypath"
     xmlns="http://www.default/mypath">
  <thistag>is in default namespace</thistag>
  <myns:thistag>this is a different tag</myns:thistag>
</tag>
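A parser resolves each qualified name to a (namespace URI, local name) pair, so the two thistag elements really are different tags. A minimal sketch with Python's `xml.etree.ElementTree`, which writes resolved names as `{uri}local`; the prefix `m` in the lookup is an arbitrary choice made here.

```python
import xml.etree.ElementTree as ET

DOC = """<tag xmlns:myns="http://www.fictitious.com/mypath"
     xmlns="http://www.default/mypath">
  <thistag>is in default namespace</thistag>
  <myns:thistag>this is a different tag</myns:thistag>
</tag>"""

root = ET.fromstring(DOC)

# Resolved names of the two children: same local name, different URIs.
names = [child.tag for child in root]

# Lookups bind their own prefix to the URI; the document's prefix is irrelevant.
other = root.find("m:thistag", {"m": "http://www.fictitious.com/mypath"})
```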
23XML Isn't Enough on Its Own
- It's too unconstrained for many cases!
- How will we know when we're getting garbage?
- How will we query?
- How will we understand what we got?
24Document Type Definitions (DTDs)
- DTD is an EBNF grammar defining XML structure
- XML document specifies an associated DTD, plus the root element
- DTD specifies children of the root (and so on)
- DTD defines special significance for attributes
- IDs: special attributes that are analogous to keys for elements
- IDREFs: references to IDs
- IDREFS: space-delimited list of IDREFs
25An Example DTD
- Example DTD
<!ELEMENT dblp ((mastersthesis | article)*)>
<!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
<!ATTLIST mastersthesis mdate   CDATA #REQUIRED
                        key     ID    #REQUIRED
                        advisor CDATA #IMPLIED>
<!ELEMENT author (#PCDATA)>
...
- Example use of DTD in XML file
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE dblp SYSTEM "my.dtd">
<dblp>
26DTDs Are Very Limited
- DTDs capture grammatical structure, but have some drawbacks
- Only string scalar types
- Global ID/reference space is inconvenient
- No way of defining OO-like inheritance
27XML Schema: DTDs Rethought
- Features
- XML syntax
- Better way of defining keys using XPaths
- Type subclassing
- And, of course, built-in datatypes
28Basic Constructs of Schema
- Separation of elements (and attributes) from types
- complexType is a structured type
- It can have sequences or choices
- element and attribute have name and type
- Elements may also have minOccurs and maxOccurs
- Subtyping, most commonly using
- <complexContent> <extension base="prevType"> ... </...>
29Simple Schema Example
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
(Attribute declarations are listed after the sequence, as XML Schema requires within a complexType.)
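To show what typed constraints of this kind buy us, here is a hand-written check for the mastersthesis shape. This is a sketch only, not XML Schema validation: `check_thesis` and `THESIS_SEQUENCE` are hypothetical names, committeemember is treated as opaque because CommitteeType is not defined in the example, and Python's `int` and `date.fromisoformat` are only approximations of xsd:integer and xsd:date.

```python
import xml.etree.ElementTree as ET
from datetime import date

# (child name, converter, required), in the order the sequence demands.
THESIS_SEQUENCE = [("author", str, True), ("title", str, True),
                   ("year", int, True), ("school", str, True),
                   ("committeemember", str, False)]

def check_thesis(elem):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if "mdate" in elem.attrib:            # attributes are optional here
        try:
            date.fromisoformat(elem.attrib["mdate"])
        except ValueError:
            problems.append("mdate is not a date")
    children = list(elem)
    i = 0
    for name, convert, required in THESIS_SEQUENCE:
        if i < len(children) and children[i].tag == name:
            try:
                convert(children[i].text or "")
            except ValueError:
                problems.append(f"{name} has the wrong type")
            i += 1
        elif required:
            problems.append(f"missing {name}")
    if i < len(children):
        problems.append(f"unexpected {children[i].tag}")
    return problems

good = ET.fromstring('<mastersthesis mdate="2002-01-03"><author>Kurt P. Brown</author>'
                     '<title>PRPL</title><year>1992</year><school>UW</school></mastersthesis>')
bad = ET.fromstring('<mastersthesis mdate="yesterday"><author>X</author>'
                    '<title>T</title><year>abc</year></mastersthesis>')
```

A DTD could express the element order but not the integer or date checks, which is the "only string scalar types" limitation in practice.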
30Embedding XML Schema
<root xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="s1.xsd">
  <grade>a</grade>
</root>
<s1:root xmlns:s1="http://www.schemaValid.com/s1ns"
         xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
         xsi:schemaLocation="http://www.schemaValid.com/s1ns s1ns.xsd">
  <s1:grade>a</s1:grade>
</s1:root>
- But the XML parser is actually free to ignore this: the schema is typically specified from outside the document
31Designing an XML Schema/DTD
- Often we are given a DTD/Schema; if not, we need to design one
- We orient the XML tree around the central objects in a particular application
32Manipulating XML
- Sometimes
- Need to restructure an XML document
- Or simply need to retrieve certain parts that satisfy a constraint, e.g.
- All books
- All books by author XYZ
33Document Object Model (DOM) vs. Queries
- Build a DOM tree (as we saw earlier) and access via Java (etc.) DOMNode object
- DOM objects have methods like getFirstChild(), getNextSibling()
- Common way of traversing the tree
- Can also modify the DOM tree (alter the XML) via insertBefore(), etc.
- Alternate approach: a query language
- Define some sort of a template describing traversals from the root of the directed graph
- In XML, the basis of this template is called an XPath
- Can also declare some constraints on the values you want
- The XPath returns a node set of matches
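The two approaches can be put side by side in a short sketch. It uses Python's `xml.dom.minidom`, whose `firstChild` and `nextSibling` attributes play the role of the Java getters, and the path support in `xml.etree.ElementTree`; the two-article document is made up for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

DOC = ("<dblp><article><title>A</title></article>"
       "<article><title>B</title></article></dblp>")

# DOM style: walk child and sibling pointers by hand.
dom = minidom.parseString(DOC)
dom_titles = []
node = dom.documentElement.firstChild
while node is not None:
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "article":
        for title in node.getElementsByTagName("title"):
            dom_titles.append(title.firstChild.data)
    node = node.nextSibling

# Query style: state the path and let the library do the traversal.
query_titles = [t.text for t in ET.fromstring(DOC).findall("article/title")]
```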
34XPaths
- In its simplest form, an XPath is like a path in a file system
- /mypath/subpath//morepath
- The XPath returns a node set representing the XML nodes (and their subtrees) at the end of the path
- XPaths can have node tests at the end, returning only particular node types, e.g., text(), processing-instruction(), comment(), element(), attribute()
- XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order
35Sample XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
36XML Data Model Visualized
[Figure: the sample document drawn as a tree. Legend: root, p-i, element, attribute, text. Root has children ?xml (p-i) and dblp; dblp has element children mastersthesis and article, each with mdate and key attributes; their child elements end in text leaves.]
37Some Example XPath Queries
- /dblp/mastersthesis/title
- /dblp//editor
- //title
- //title/text()
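These queries can be tried with the limited XPath subset in Python's `xml.etree.ElementTree`. Two caveats make this a sketch rather than a faithful XPath evaluation: paths are written relative to the dblp element instead of from the document root, and `text()` is not supported, so `.text` is read from the returned elements. The document is a shortened copy of the sample.

```python
import xml.etree.ElementTree as ET

DOC = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
  </article>
</dblp>"""

dblp = ET.fromstring(DOC)   # context node: the <dblp> element

# /dblp/mastersthesis/title
thesis_titles = [e.text for e in dblp.findall("mastersthesis/title")]
# /dblp//editor
editors = [e.text for e in dblp.findall(".//editor")]
# //title (and, via .text, //title/text())
all_titles = [e.text for e in dblp.findall(".//title")]
```

The results come back in document order, matching the ordered semantics described earlier.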
38Context Nodes and Relative Paths
- XPath has a notion of a context node: it's analogous to a current directory
- . represents this context node
- .. represents the parent node
- We can express relative paths
- subpath/sub-subpath/../.. gets us back to the context node
- By default, the document root is the context node
39Predicates: Filtering Operations
- A predicate allows us to filter the node set based on selection-like conditions over sub-XPaths
- /dblp/article[title = "Paper1"]
- which is equivalent to
- /dblp/article[./title/text() = "Paper1"]
- because of type coercion. What does this do?
- /dblp/article[@key = "123" and ./title/text() = "Paper1" and ./author//element()]
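A rough way to experiment with predicates is Python's `xml.etree.ElementTree`, with caveats: its XPath subset has no `and`, so the conjunction is written as chained predicates, and the last condition (author has some element descendant) is checked in plain Python. The three-article document is invented for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<dblp>
  <article key="123"><title>Paper1</title><author><name>A. Author</name></author></article>
  <article key="456"><title>Paper1</title></article>
  <article key="123"><title>Paper2</title></article>
</dblp>"""
dblp = ET.fromstring(DOC)

# /dblp/article[title = "Paper1"]
by_title = dblp.findall("article[title='Paper1']")

# @key = "123" and title = "Paper1", written as two chained predicates
by_key_and_title = dblp.findall("article[@key='123'][title='Paper1']")

# ... and ./author//element(): keep articles whose author has child elements
# (any element descendant implies at least one element child).
full_match = [a for a in by_key_and_title
              if any(len(author) > 0 for author in a.findall("author"))]
```

The final query therefore selects articles with key 123, titled Paper1, that have an author containing at least one nested element.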
40Axes: More Complex Traversals
- Thus far, we've seen XPath expressions that go down the tree (and up one step)
- But we might want to go up, left, right, etc.
- These are expressed with so-called axes
- self::path-step
- child::path-step, parent::path-step
- descendant::path-step, ancestor::path-step
- descendant-or-self::path-step, ancestor-or-self::path-step
- preceding-sibling::path-step, following-sibling::path-step
- preceding::path-step, following::path-step
- The previous XPaths we saw were in abbreviated form
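What the upward and sideways axes compute can be sketched by hand. Python's `xml.etree.ElementTree` keeps no parent pointers, so the sketch first builds a child-to-parent map; the helper names and the tiny document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DOC = "<dblp><article><editor/><title/><year/></article></dblp>"
root = ET.fromstring(DOC)

# ElementTree has no parent links, so record them once.
parent = {child: p for p in root.iter() for child in p}

def ancestors(node):
    """ancestor::* (nearest ancestor first)."""
    while node in parent:
        node = parent[node]
        yield node

def siblings(node):
    return list(parent[node]) if node in parent else [node]

def following_siblings(node):
    """following-sibling::*"""
    sibs = siblings(node)
    return sibs[sibs.index(node) + 1:]

def preceding_siblings(node):
    """preceding-sibling::* (listed here in document order)."""
    sibs = siblings(node)
    return sibs[:sibs.index(node)]

title = root.find("article/title")
up = [n.tag for n in ancestors(title)]
right = [n.tag for n in following_siblings(title)]
left = [n.tag for n in preceding_siblings(title)]
```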
41Users of XPath
- XML Schema uses simple XPaths in defining keys and uniqueness constraints
- XLink and XPointer, hyperlinks for XML
- XSLT: useful for converting from XML to other representations (e.g., HTML, PDF, SVG)
- XQuery: useful for restructuring an XML document or combining multiple documents
- Might well turn into the glue between Web Services, etc.
42A Functional Language for XML
- XSLT is based on a series of templates that match different parts of an XML document
- There's a policy for what rule or template is applied if more than one matches (it's not what you'd think!)
- XSLT templates can invoke other templates
- XSLT templates can be nonterminating (beware!)
- XSLT templates are based on XPath matches, and we can also apply other templates (potentially to selected XPaths)
- Within each template, directly describe what should be output
43An XSLT Template
- An XML document itself
- XML tags create output OR are XSL operations
- All XSL tags are prefixed with xsl namespace
- All non-XSL tags are part of the XML output
- Common XSL operations
- template with a match XPath
- Recursive call to apply-templates, which may also select where it should be applied
- Attach to XML document with a processing-instruction
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="http://www.com/my.xsl" ?>
44An Example XSLT Stylesheet
<xsl:stylesheet version="1.1"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/dblp">
    <html><head>This is DBLP</head>
    <body>
      <xsl:apply-templates />
    </body>
    </html>
  </xsl:template>
  <xsl:template match="inproceedings">
    <h2><xsl:apply-templates select="title" /></h2>
    <p><xsl:apply-templates select="author"/></p>
  </xsl:template>
  ...
</xsl:stylesheet>
45XSLT Processing Model
- List of source nodes -> result tree fragment(s)
- Start with root
- Find all template rules with matching patterns from root
- Find best match according to some heuristics
- Set the current node list to be the set of things it matches
- Iterate over each node in the current node list
- Apply the operations of the template
- Append the results of the matching template rule to the result tree structure
- Repeat recursively if specified to by apply-templates
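The recursion in this processing model can be imitated in a few lines. This is a toy, not an XSLT processor: templates are plain Python functions looked up by tag name only (no XPath patterns, no priorities), output is built as a string, and `default_rule` is a simplified stand-in for the built-in rules.

```python
import xml.etree.ElementTree as ET

def apply_templates(nodes, templates):
    """Process each node in the current node list with its matching rule."""
    out = []
    for node in nodes:
        rule = templates.get(node.tag, default_rule)
        out.append(rule(node, templates))   # append to the result
    return "".join(out)

def default_rule(node, templates):
    # Emit the node's text, then recurse into its children.
    return (node.text or "").strip() + apply_templates(list(node), templates)

templates = {
    # match="dblp": wrap whatever the children produce.
    "dblp": lambda n, t: "<html><body>" + apply_templates(list(n), t) + "</body></html>",
    # match="inproceedings": apply templates to selected children only.
    "inproceedings": lambda n, t: (
        "<h2>" + apply_templates(n.findall("title"), t) + "</h2>"
        + "<p>" + apply_templates(n.findall("author"), t) + "</p>"
    ),
}

doc = ET.fromstring("<dblp><inproceedings><author>A. Author</author>"
                    "<title>T</title></inproceedings></dblp>")
html = apply_templates([doc], templates)
```

Even in this toy, the output order is decided by the template (title before author), not by the order of the source document.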
46What If There's More than One Match?
- Eliminate rules of lower precedence due to importing
- Break a rule into any | branches and consider separately
- Choose rule with highest computed or specified priority
- Simple rules for computing priority based on precision
- QName preceded by child/attribute axis specifier: priority 0
- NCName:* preceded by child/attribute axis specifier: priority -0.25
- NodeTest preceded by child/attribute axis specifier: priority -0.5
- else priority 0.5
47Other Common Operations
- Iteration
- <xsl:for-each select="path">...</xsl:for-each>
- Conditionals
- <xsl:if test="./text() &lt; 'abc'">...</xsl:if>
- Copying current node and children to the result set
- <xsl:copy> <xsl:apply-templates /></xsl:copy>
48Creating Output Nodes
- Return text/attribute data (this is a default rule)
- <xsl:template match="text()|@*"> <xsl:value-of select="."/></xsl:template>
- Create an element from text (attribute is similar)
- <xsl:element name="{text()}"> <xsl:apply-templates /></xsl:element>
- Copy nodes matching a path
- <xsl:copy-of select="*"/>
49Embedding Stylesheets
- You can import or include one stylesheet from another
- <xsl:import href="http://www.com/my.xsl"/>
- <xsl:include href="http://www.com/my.xsl"/>
- Include: the rules get same precedence as in including template
- Import: the rules are given lower precedence
50XSLT Summary
- A very powerful, template-based transformation language for XML document -> other structured document
- Commonly used to convert XML -> PDF, SVG, GraphViz DOT format, HTML, WML, ...
- Primarily useful for presentation of XML or for very simple conversions
- But sometimes we need more complex operations when converting data from one source to another
- Joins: combining and correlating information from multiple sources
- Aggregation: computing averages, counts, etc.
51XSLT and Alternatives
- XSLT is focused on reformatting documents
- Stylesheets are focused around one XML file
- XML file must reference the stylesheet
- What if we want to
- Manage and combine collections of XML documents?
- Make Web service requests for XML?
- Glue together different Web service requests?
- Query for keywords within documents, with ranked answers
- This is where XQuery plays a role: see CIS 330 / 550 for details