Title: XML Basics
1XML Basics
- Susan B. Davidson
- University of Pennsylvania
- CIS330 Database Management Systems
- October 30, 2008
Some slide content courtesy of Zack Ives and Raghu Ramakrishnan
2Why XML?
- XML was motivated by several needs:
  - Documents needed a mechanism for extended tags (XML, like HTML, came from the document community)
  - The Web needed a machine-friendly format for data (otherwise applications do screen-scraping)
  - Database people needed a more flexible data interchange format (part of the evolution Rel -> OODB -> semistructured)
3How XML? (1)
- XML and related technologies (DTD, XML Schema, XPath, XQuery, etc.) have been standardized mainly by the World Wide Web Consortium (W3C)
  - http://www.w3.org
- More resources at http://www.xml.com
- Java-XML (and web services) info at http://java.sun.com/javaee/technologies
- .NET-XML (via web services) info at http://www.microsoft.com/net/TechnicalResources
4How XML? (2)
- Original expectation: behind every HTML page on the Web there would be an XML page and a stylesheet (CSS, XSL).
- Today's reality: not quite so, but XML is used in many places "under the covers".
- Example: the Ajax technology. Small-volume browser-server communication in XML supports more interactive Web pages.
- Example: Web services. Marshalling and unmarshalling data in SOAP uses XML. Service descriptions use XML.
5How XML? (3)
- Example: data exchange formats. (Applications must agree on a common meaning for tags.)
  - Older data exchange formats have been redesigned as instances of XML, e.g., HL7 in health informatics, FIX in the financial industry, etc. Even proprietary formats like MS Word now have open XML versions.
- Example: software development configuration files, e.g., in W3C, Apache, Java EE, and .NET frameworks.
- (All this may be geek paradise, but it's awfully verbose and the scarcity of visual editors is puzzling.)
6Why DB People Like XML (1)
- Can get data from all sorts of sources
  - Allows us to touch data we don't own!
  - Can integrate various data sources as if they were databases (almost)
- We can publish some of the data in our databases on the Web conveniently
- This was actually a huge change in the DB community
7Why DB People Like XML (2)
- Modeling and technologies
  - Much more flexible than the relational model
  - Independently of XML, there was already a push toward escaping the relational straitjacket in DB: complex values, OODBs, semistructured data
  - Can query for some XML data with only partial (or no) knowledge about the schema (but too little schema can be a drawback, too!)
  - Is supported by a nice query language, XQuery, distinctly cleaner than SQL
8XML Anatomy
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <dblp>
      <mastersthesis mdate="2002-01-03" key="ms/Brown92">
        <author>Kurt P. Brown</author>
        <title>PRPL: A Database Workload Specification Language</title>
        <year>1992</year>
        <school>Univ. of Wisconsin-Madison</school>
      </mastersthesis>
      <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
        <editor>Paul R. McJones</editor>
        <title>The 1995 SQL Reunion</title>
        <journal>Digital System Research Center Report</journal>
        <volume>SRC1997-018</volume>
        <year>1997</year>
        <ee>db/labs/dec/SRC1997-018.html</ee>
        <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
      </article>
    </dblp>
Callouts on the slide: processing instruction (the <?xml ... ?> line), open-tag, element, attribute, close-tag.
9Well-Formed XML
- A legal XML document: fully parsable by an XML parser
  - All open-tags have matching close-tags (unlike so many HTML documents!), or a special <tag/> shortcut for empty tags (equivalent to <tag></tag>)
  - Attributes (which are unordered, in contrast to elements) only appear once in an element
  - There's a single root element
  - XML is case-sensitive
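The well-formedness rules above are exactly what a parser enforces. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative and not from the slides:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, roughly one per rule on the slide.
samples = {
    "matching tags": "<a><b>hi</b></a>",
    "empty-tag shortcut": "<a><b/></a>",
    "missing close-tag": "<a><b>hi</a>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "two root elements": "<a/><b/>",
    "case mismatch": "<a></A>",
}

for label, doc in samples.items():
    print(f"{label}: {is_well_formed(doc)}")
```

Only the first two samples should be accepted; each of the others breaks one of the rules listed on the slide.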
10XML as a Data Model
- XML information set includes 7 types of nodes
- Document (root)
- Element
- Attribute
- Processing instruction
- Text (content)
- Namespace
- Comment
- XML data model includes this, plus order info and
a few other things
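11XML Data Model Visualized (and simplified!)
[Figure: tree view of the dblp document. The Root node has a ?xml processing-instruction child and a dblp element child. dblp contains a mastersthesis element and an article element, each with mdate and key attributes. The mastersthesis has author, title, year, and school children; the article has editor, title, year, journal, volume, and two ee children. The leaves are text nodes. Legend: root, p-i, element, attribute, text.]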
12XML Easily Encodes Relations
Student-course-grade
    <student-course-grade>
      <tuple><sid>1</sid><serno>570103</serno><exp-grade>B</exp-grade></tuple>
      <tuple><sid>23</sid><serno>550103</serno><exp-grade>A</exp-grade></tuple>
    </student-course-grade>
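The encoding above is mechanical: one element per relation, one tuple element per row, one child per column. A rough sketch of the round trip, assuming Python's standard xml.etree.ElementTree; the function names relation_to_xml and xml_to_relation are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Rows of the student-course-grade relation from the slide.
COLUMNS = ("sid", "serno", "exp-grade")
ROWS = [("1", "570103", "B"), ("23", "550103", "A")]


def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            ET.SubElement(tup, col).text = value
    return root


def xml_to_relation(root):
    """Decode the same encoding back into a list of row tuples."""
    return [tuple(field.text for field in tup) for tup in root.findall("tuple")]


doc = relation_to_xml("student-course-grade", COLUMNS, ROWS)
print(ET.tostring(doc, encoding="unicode"))
print(xml_to_relation(doc))
```

All values are kept as strings here; the encoding itself carries no datatype information.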
13But XML is More Flexible (Non-First-Normal-Form (NF2))
    <parents>
      <parent name="Jean">
        <son>John</son>
        <daughter>Joan</daughter>
        <daughter>Jill</daughter>
      </parent>
      <parent name="Feng">
        <daughter>Felicity</daughter>
      </parent>
    </parents>
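To see what "non-first-normal-form" buys us, it helps to unnest the document into flat rows, which repeats the parent name once per child. A small sketch, assuming Python's standard xml.etree.ElementTree; the function name flatten and the row layout are illustrative choices:

```python
import xml.etree.ElementTree as ET

NESTED = """
<parents>
  <parent name="Jean">
    <son>John</son>
    <daughter>Joan</daughter>
    <daughter>Jill</daughter>
  </parent>
  <parent name="Feng">
    <daughter>Felicity</daughter>
  </parent>
</parents>
"""


def flatten(xml_text):
    """Unnest each parent into (parent, relationship, child) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for parent in root.findall("parent"):
        for child in parent:
            rows.append((parent.get("name"), child.tag, child.text))
    return rows


for row in flatten(NESTED):
    print(row)
```

The nested form stores "Jean" once; the flat form needs it in three rows.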
14Integrating XML: What If We Have Multiple Sources with the Same Tags?
- Namespaces allow us to specify a context for different tags
- Two parts:
  - Binding of namespace to URI
  - Qualified names
    <tag xmlns:myns="http://www.fictitious.com/mypath">
      <thistag>is in namespace myns</thistag>
      <myns:thistag>is the same</myns:thistag>
      <otherns:thistag>is a different tag</otherns:thistag>
    </tag>
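A parser resolves each qualified name to a (URI, local name) pair, and two tags are "the same" only if both parts agree. The sketch below uses Python's standard xml.etree.ElementTree on a variant of the slide's snippet. Two things in it are assumptions rather than slide content: the URI is also declared as the default namespace (under the Namespaces in XML rules an unprefixed element is in a namespace only when a default namespace is in scope), and otherns is bound to a made-up URI so that the document parses.

```python
import xml.etree.ElementTree as ET

# Variant of the slide's snippet. Assumptions: the URI is also declared as the
# default namespace, and `otherns` is bound to a made-up URI.
DOC = """
<tag xmlns="http://www.fictitious.com/mypath"
     xmlns:myns="http://www.fictitious.com/mypath"
     xmlns:otherns="http://www.fictitious.com/otherpath">
  <thistag>is in namespace myns</thistag>
  <myns:thistag>is the same</myns:thistag>
  <otherns:thistag>is a different tag</otherns:thistag>
</tag>
"""

root = ET.fromstring(DOC)

# ElementTree expands every qualified name to "{URI}localname".
expanded = [child.tag for child in root]
for name in expanded:
    print(name)
```

The first two children expand to the same name; the third differs only in its URI part.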
15XML Isn't Enough on Its Own
- It's too unconstrained for many cases!
  - How will we know when we're getting garbage?
  - How will we query?
  - How will we understand what we got?
- We also need:
  - Some idea of the structure
    - Our focus next
  - Presentation, in some cases: CSS, XSL
    - You can read about this separately at http://www.w3.org/Style
  - Some way of interpreting the tags?
    - We'll talk about this later in the semester
16Structural Constraints: Document Type Definitions (DTDs)
- The DTD is an EBNF grammar defining XML structure
  - XML document specifies an associated DTD, plus the root element
  - DTD specifies children of the root (and so on)
- DTD defines special significance for attributes:
  - IDs: special attributes that are analogous to keys for elements
  - IDREFs: references to IDs
  - IDREFS: a nasty hack that represents a list of IDREFs
17An Example DTD
- Example DTD:
    <!ELEMENT dblp ((mastersthesis | article)*)>
    <!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
    <!ATTLIST mastersthesis mdate   CDATA #REQUIRED
                            key     ID    #REQUIRED
                            advisor CDATA #IMPLIED>
    <!ELEMENT author (#PCDATA)>
    ...
- Example use of DTD in XML file:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <!DOCTYPE dblp SYSTEM "my.dtd">
    <dblp>
    ...
18Representing Graphs in XML
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <!DOCTYPE graph SYSTEM "special.dtd">
    <graph>
      <author id="author1">
        <name>John Smith</name>
      </author>
      <article>
        <author ref="author1" /> <title>Paper1</title>
      </article>
      <article>
        <author ref="author1" /> <title>Paper2</title>
      </article>
    </graph>
Callouts on the slide: the DTD declares the id attribute as an ID and the ref attribute as an IDREF.
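ID and IDREF attributes turn the tree into a graph: an application follows a reference by looking up the element whose ID has the same value. A sketch of that lookup, assuming Python's standard xml.etree.ElementTree. That module does not read the DTD, so the code hard-codes which attribute plays the ID role (id) and which plays the IDREF role (ref); the DOCTYPE line is left out for the same reason.

```python
import xml.etree.ElementTree as ET

GRAPH = """
<graph>
  <author id="author1">
    <name>John Smith</name>
  </author>
  <article>
    <author ref="author1" /> <title>Paper1</title>
  </article>
  <article>
    <author ref="author1" /> <title>Paper2</title>
  </article>
</graph>
"""

root = ET.fromstring(GRAPH)

# Index every element that carries an id attribute (the ID side).
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def authors_of(article):
    """Follow each ref attribute (the IDREF side) to the element it names."""
    return [by_id[a.get("ref")].findtext("name") for a in article.findall("author")]


papers = {a.findtext("title"): authors_of(a) for a in root.findall("article")}
print(papers)
```

Both articles resolve to the same author element, which is what makes the structure a graph rather than a tree.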
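19Graph Data Model
[Figure: tree view of the graph document. Root has ?xml, !DOCTYPE, and graph children. graph contains one author element (id attribute with value author1, name child with text "John Smith") and two article elements. Each article has a title (Paper1, Paper2) and an author child whose ref attribute holds the text value author1.]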
20Graph Data Model
[Figure: the same tree as on the previous slide, except that the two ref attributes no longer carry an author1 text value; only the id attribute of the author element does.]
21Two other ways of representing a DB
22Project and Employee relations in XML
Projects and employees are intermixed
    <db>
      <project>
        <title> Pattern recognition </title>
        <budget> 10000 </budget>
        <managedBy> Joe </managedBy>
      </project>
      <employee>
        <name> Joe </name>
        <ssn> 344556 </ssn>
        <age> 34 </age>
      </employee>
      <employee>
        <name> Sandra </name>
        <ssn> 2234 </ssn>
        <age> 35 </age>
      </employee>
      <project>
        <title> Auto guided vehicle </title>
        <budget> 70000 </budget>
        <managedBy> Sandra </managedBy>
      </project>
    </db>
23Project and Employee relations in XML (cont'd)
Employees follow projects
    <db>
      <projects>
        <project>
          <title> Pattern recognition </title>
          <budget> 10000 </budget>
          <managedBy> Joe </managedBy>
        </project>
        <project>
          <title> Auto guided vehicles </title>
          <budget> 70000 </budget>
          <managedBy> Sandra </managedBy>
        </project>
      </projects>
      <employees>
        <employee>
          <name> Joe </name>
          <ssn> 344556 </ssn>
          <age> 34 </age>
        </employee>
        <employee>
          <name> Sandra </name>
          <ssn> 2234 </ssn>
          <age> 35 </age>
        </employee>
      </employees>
    </db>
24Project and Employee relations in XML (cont'd)
Or without "separator" tags
    <db>
      <projects>
        <title> Pattern recognition </title>
        <budget> 10000 </budget>
        <managedBy> Joe </managedBy>
        <title> Auto guided vehicles </title>
        <budget> 70000 </budget>
        <managedBy> Sandra </managedBy>
      </projects>
      <employees>
        <name> Joe </name>
        <ssn> 344556 </ssn>
        <age> 34 </age>
        <name> Sandra </name>
        <ssn> 2234 </ssn>
        <age> 35 </age>
      </employees>
    </db>
25More on attributes
- An (opening) tag may contain attributes. These are typically used to describe the content of an element
    <entry>
      <word language="en"> cheese </word>
      <word language="fr"> fromage </word>
      <word language="ro"> branza </word>
      <meaning> A food made </meaning>
    </entry>
26Attributes (cont'd)
- Another common use for attributes is to express dimension or type
    <picture>
      <height dim="cm"> 2400 </height>
      <width dim="in"> 96 </width>
      <data encoding="gif" compression="zip">
        M05-.C@02!G96YE&lt;FEC ...
      </data>
    </picture>
27When to use attributes
- The choice between representing data as attributes or as elements is sometimes unclear; taste applies.
    <person ssno="123 45 6789">
      <name> F. MacNiel </name>
      <email> fmacn@dcs.barra.ac.sc </email>
      ...
    </person>

    <person>
      <ssno> 123 45 6789 </ssno>
      <name> F. MacNiel </name>
      <email> fmacn@dcs.barra.ac.sc </email>
      ...
    </person>
28Using IDs
    <family>
      <person id="jane" mother="mary" father="john">
        <name> Jane Doe </name>
      </person>
      <person id="john" children="jane jack">
        <name> John Doe </name> <mother/>
      </person>
      <person id="mary" children="jane jack">
        <name> Mary Doe </name>
      </person>
      <person id="jack" mother="mary" father="john">
        <name> Jack Doe </name>
      </person>
    </family>
Callouts on the slide: the DTD declares id as an ID, mother and father as IDREF, and children as IDREFS.
29ODL (OO) schema
    class Movie
      ( extent Movies, key title )
    {
      attribute string title;
      attribute string director;
      relationship set<Actor> casts
        inverse Actor::acted_In;
      attribute int budget;
    };

    class Actor
      ( extent Actors, key name )
    {
      attribute string name;
      relationship set<Movie> acted_In
        inverse Movie::casts;
      attribute int age;
      attribute set<string> directed;
    };
30Encoding of OO instance into XML
    <db>
      <movie id="m1">
        <title>Waking Ned Divine</title>
        <director>Kirk Jones III</director>
        <cast idrefs="a1 a3"></cast>
        <budget>100,000</budget>
      </movie>
      <movie id="m2">
        <title>Dragonheart</title>
        <director>Rob Cohen</director>
        <cast idrefs="a2 a9 a21"></cast>
        <budget>110,000</budget>
      </movie>
      <movie id="m3">
        <title>Moondance</title>
        <director>Dagmar Hirtz</director>
        <cast idrefs="a1 a8"></cast>
        <budget>90,000</budget>
      </movie>
      ...
      <actor id="a1">
        <name>David Kelly</name>
        <acted_In idrefs="m1 m3 m78"> </acted_In>
      </actor>
      <actor id="a2">
        <name>Sean Connery</name>
        <acted_In idrefs="m2 m9 m11"> </acted_In>
        <age>68</age>
      </actor>
      <actor id="a3">
        <name>Ian Bannen</name>
        <acted_In idrefs="m1 m35"> </acted_In>
      </actor>
    </db>
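The ODL schema declares casts and acted_In as inverses, but nothing in the XML encoding enforces that; an application has to check it. A sketch of such a check over an abridged copy of the instance, assuming Python's standard xml.etree.ElementTree; the helper names refs and inverse_violations are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Abridged from the slide: two movies and two actors.
DB = """
<db>
  <movie id="m1"><title>Waking Ned Divine</title><cast idrefs="a1 a3"/></movie>
  <movie id="m3"><title>Moondance</title><cast idrefs="a1 a8"/></movie>
  <actor id="a1"><name>David Kelly</name><acted_In idrefs="m1 m3 m78"/></actor>
  <actor id="a3"><name>Ian Bannen</name><acted_In idrefs="m1 m35"/></actor>
</db>
"""


def refs(root, tag, child):
    """Map each <tag> id to the set of ids listed in its <child idrefs="...">."""
    return {el.get("id"): set(el.find(child).get("idrefs").split())
            for el in root.findall(tag)}


def inverse_violations(xml_text):
    """List (movie, actor) pairs present in one direction but not the other.

    Only ids that appear in the fragment are checked; a8, m35 and m78 are skipped.
    """
    root = ET.fromstring(xml_text)
    casts = refs(root, "movie", "cast")
    acted = refs(root, "actor", "acted_In")
    bad = []
    for m, actors in casts.items():
        bad += [(m, a) for a in actors if a in acted and m not in acted[a]]
    for a, movies in acted.items():
        bad += [(m, a) for m in movies if m in casts and a not in casts[m]]
    return sorted(bad)


print(inverse_violations(DB))
```

For the fragment as given the list is empty; dropping m1 from Ian Bannen's acted_In would make the pair (m1, a3) show up.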
31Back to DTDs
- A DTD describes some of the information one would expect to find in a schema (based on what we saw with SQL DDL)
- But a DTD is a structural specification.
- There is a need for additional typing systems (XML Schema, coming soon)
32Example: The Address Book
    <person>
      <name> MacNiel, John </name>
      <greet> Dr. John MacNiel </greet>
      <addr> 1234 Huron Street </addr>
      <addr> Rome, OH 98765 </addr>
      <tel> (321) 786 2543 </tel>
      <fax> (321) 786 2543 </fax>
      <tel> (321) 786 2543 </tel>
      <email> jm@abc.com </email>
    </person>
Callouts on the slide:
- name: exactly one name
- greet: at most one greeting
- addr: as many address lines as needed (in order)
- tel, fax: mixed telephones and faxes
- email: as many as needed
33Specifying the structure
- The structure of a person entry can be specified by
    name, greet?, addr*, (tel | fax)*, email*
- This is known as a regular expression, an important device for specifying what strings look like.
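Because a content model is a regular expression over tag names, an ordinary regex engine can check it. The sketch below assumes the model reads name, greet?, addr*, (tel | fax)*, email* (the operators ?, * and | are standard DTD notation) and uses Python's re module on the sequence of child tags. The encoding of tags as space-terminated words and the name matches_model are choices made here for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The content model name, greet?, addr*, (tel | fax)*, email*
# written as a regex over space-terminated child tag names.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")


def matches_model(person):
    """Check the sequence of child tags of `person` against the content model."""
    tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(tags) is not None


good = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/><email/></person>"
)
two_greets = ET.fromstring("<person><name/><greet/><greet/></person>")
addr_after_tel = ET.fromstring("<person><name/><tel/><addr/></person>")

print(matches_model(good), matches_model(two_greets), matches_model(addr_after_tel))
```

The first entry has the same shape as the address book example; the other two break the "at most one greeting" and "addresses before telephones" constraints.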
34Regular Expressions
- Each regular expression determines a corresponding finite state automaton. Let's start with a simpler example:
    name, addr*, email
- This suggests a simple parsing program
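One way such a parsing program might look, taking the simpler expression to be name, addr*, email: a table-driven automaton with three states. The state names and the function accepts are made up for this sketch.

```python
# States of the automaton for: name, addr*, email
START, AFTER_NAME, DONE = range(3)

# Transition table: (state, tag) -> next state. Missing entries mean "reject".
TRANSITIONS = {
    (START, "name"): AFTER_NAME,
    (AFTER_NAME, "addr"): AFTER_NAME,   # loop: any number of address lines
    (AFTER_NAME, "email"): DONE,
}


def accepts(tags):
    """Run the automaton over a sequence of tag names."""
    state = START
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:
            return False
    return state == DONE


print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

The parser reads one tag at a time and never backtracks, which is the practical payoff of the automaton view.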
35Another example
- name, address*, (tel | fax)*, email*
[Figure: finite state automaton for this expression, with transitions labeled name, address, tel, fax, and email.]
Adding in the optional greet further complicates things
36 A DTD for the address book
    <!DOCTYPE addressbook [
      <!ELEMENT addressbook (person*)>
      <!ELEMENT person
         (name, greet?, address*, (fax | tel)*, email*)>
      <!ELEMENT name    (#PCDATA)>
      <!ELEMENT greet   (#PCDATA)>
      <!ELEMENT address (#PCDATA)>
      <!ELEMENT tel     (#PCDATA)>
      <!ELEMENT fax     (#PCDATA)>
      <!ELEMENT email   (#PCDATA)>
    ]>
37Our relational DB revisited
38Two DTDs for the relational DB
    <!DOCTYPE db [
      <!ELEMENT db (projects, employees)>
      <!ELEMENT projects (project*)>
      <!ELEMENT employees (employee*)>
      <!ELEMENT project (title, budget, managedBy)>
      <!ELEMENT employee (name, ssn, age)>
      ...
    ]>

    <!DOCTYPE db [
      <!ELEMENT db (project | employee)*>
      <!ELEMENT project (title, budget, managedBy)>
      <!ELEMENT employee (name, ssn, age)>
      ...
    ]>
39Recursive DTDs
    <!DOCTYPE genealogy [
      <!ELEMENT genealogy (person*)>
      <!ELEMENT person (
          name,
          dateOfBirth,
          person,       -- mother
          person )>     -- father
      ...
    ]>
- What is the problem with this?
40Recursive DTDs (cont'd)
    <!DOCTYPE genealogy [
      <!ELEMENT genealogy (person*)>
      <!ELEMENT person (
          name,
          dateOfBirth,
          person?,      -- mother
          person? )>    -- father
      ...
    ]>
- What is now the problem with this?
41Some things are hard to specify
- Each employee element is to contain name, age and ssn elements in some order.
    <!ELEMENT employee
      ( (name, age, ssn) | (age, ssn, name) |
        (ssn, name, age) | ...
      )>
- Very awkward, even infeasible with many fields!
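The awkwardness is easy to quantify: allowing n fields in any order takes one alternative per permutation, that is, n! of them. A short sketch that generates the declaration, assuming the alternatives are joined with the DTD choice operator |; the function names are illustrative.

```python
from itertools import permutations
from math import factorial


def any_order_model(fields):
    """Build the DTD content model that allows `fields` in any order."""
    alternatives = ("(" + ", ".join(p) + ")" for p in permutations(fields))
    return "(" + " | ".join(alternatives) + ")"


def employee_declaration(fields):
    """Wrap the content model in an <!ELEMENT employee ...> declaration."""
    return "<!ELEMENT employee " + any_order_model(fields) + ">"


print(employee_declaration(["name", "age", "ssn"]))
for n in (3, 5, 8):
    print(n, "fields ->", factorial(n), "alternatives")
```

Three fields need 6 alternatives, five need 120, and eight already need 40320.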
42DTDs Aren't Enough
- DTDs capture grammatical structure, but have some drawbacks:
  - Not themselves in XML: inconvenient to build tools for them
  - Don't capture database datatypes' domains
  - IDs aren't a good implementation of keys
    - "No element type may have more than one ID attribute specified. The value of an ID attribute must be unique between all values of all ID attributes." Why is this insufficient?
  - No way of defining OO-like inheritance
  - No way of expressing FDs
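One possible answer to the question about IDs: all ID values share a single document-wide value space, whereas relational keys only need to be unique within their own relation. The sketch below contrasts the two checks on a made-up document, assuming Python's standard xml.etree.ElementTree; the id attribute is treated as the ID by convention, since that module does not read DTDs.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: keys are unique per relation, as in a relational DB.
DOC = """
<db>
  <project id="k1"/><project id="k2"/>
  <employee id="k1"/><employee id="k3"/>
</db>
"""

root = ET.fromstring(DOC)
elements = [el for el in root.iter() if el.get("id") is not None]


def duplicates(items):
    """Return the sorted items that occur more than once."""
    return sorted(item for item, n in Counter(items).items() if n > 1)


# DTD ID semantics: one value space shared by every element in the document.
id_clashes = duplicates(el.get("id") for el in elements)
# Relational key semantics: uniqueness is checked per element type.
key_clashes = duplicates((el.tag, el.get("id")) for el in elements)

print("ID clashes:", id_clashes)
print("key clashes:", key_clashes)
```

The document is fine under key semantics but invalid under ID semantics, because a project and an employee both use k1. The single-ID-attribute rule quoted above also rules out composite keys.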
43XML Schema
- Aims to address the shortcomings of DTDs
- Features:
  - XML syntax
  - Can define keys using XPaths
  - Type subclassing that's more complex than in a programming language
    - Programming languages don't consider order of member variables!
    - Subclassing "by extension" and "by restriction"
  - And, of course, domains and built-in datatypes
44Simple Schema Example
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="mastersthesis" type="ThesisType"/>
      <xsd:complexType name="ThesisType">
        <xsd:sequence>
          <xsd:element name="author" type="xsd:string"/>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="year" type="xsd:integer"/>
          <xsd:element name="school" type="xsd:string"/>
          <xsd:element name="committeemember"
                       type="CommitteeType" minOccurs="0"/>
        </xsd:sequence>
        <xsd:attribute name="mdate" type="xsd:date"/>
        <xsd:attribute name="key" type="xsd:string"/>
        <xsd:attribute name="advisor" type="xsd:string"/>
      </xsd:complexType>
      ...
    </xsd:schema>
45Designing an XML Schema/DTD
- Not as formalized as relational data design
  - We can still use ER diagrams to break into entity, relationship sets
  - ER diagrams have extensions for "aggregation" (treating smaller diagrams as entities) and for composite attributes
  - Note that often we already have our data in relations and need to design the XML schema to export them!
- Generally orient the XML tree around the "central" objects
- Big decision: element vs. attribute
  - Element if it has its own properties, or if you might have more than one of them
  - Attribute if it is a single property (or perhaps not!)
46XML as a Data Model
- XML is a flexible representation mechanism
  - Can represent documents, data
  - Standard data exchange format
  - Several competing schema formats (esp. DTD and XML Schema) provide typing information
- Next time: querying XML