1
XML Basics
  • Susan B. Davidson
  • University of Pennsylvania
  • CIS330 Database Management Systems
  • October 30, 2008

Some slide content courtesy of Zack Ives and Raghu
Ramakrishnan
2
Why XML?
  • XML was motivated by several needs
  • Documents needed a mechanism for extended tags
  • (XML, like HTML, came from the document
    community)
  • The Web needed a machine-friendly format for data
  • (otherwise applications resort to
    screen-scraping)
  • Database people needed a more flexible data
    interchange format
    (part of
    the evolution Rel -> OODB -> semistructured)

3
How XML? (1)
  • XML and related technologies (DTD, XML Schema,
    XPath, XQuery, etc.) have been standardized
    mainly by the World Wide Web Consortium (W3C)
  • http://www.w3.org
  • More resources at http://www.xml.com
  • Java-XML (and web services) info at
    http://java.sun.com/javaee/technologies
  • .NET-XML (via web services) info at
    http://www.microsoft.com/net/TechnicalResources

4
How XML? (2)
  • Original expectation: behind every HTML page on
    the Web there would be an XML page and a
    stylesheet (CSS, XSL).
  • Today's reality: not quite so, but XML is used
    in many places under the covers
  • Example: the Ajax technology. Small-volume
    browser-server communication in XML supports more
    interactive Web pages.
  • Example: Web services. Marshalling and
    unmarshalling data in SOAP uses XML. Service
    descriptions use XML.

5
How XML? (3)
  • Example: Data exchange formats. (Applications
    must agree on common meaning for tags.)
  • Older data exchange formats have been redesigned
    as instances of XML, e.g., HL7 in health
    informatics, FIX in the financial industry, etc.
    Even proprietary formats like MS Word now have
    open XML versions.
  • Example: Software development configuration
    files, e.g., in W3C, Apache, Java EE, .NET
    frameworks.
  • (All this may be geek paradise, but it's awfully
    verbose and the scarcity of visual editors is
    puzzling.)

6
Why DB People Like XML (1)
  • Can get data from all sorts of sources
  • Allows us to touch data we don't own!
  • Can integrate various data sources as if they
    were databases (almost)
  • We can publish some of the data in our databases
    on the Web conveniently
  • This was actually a huge change in the DB
    community

7
Why DB People Like XML (2)
  • Modeling and technologies
  • Much more flexible than the relational model
  • Independently of XML, there was already a push
    toward escaping the relational straitjacket in
    DB: complex values, OODBs, semistructured data
  • Can query for some XML data with only partial (or
    no) knowledge about the schema (but too little
    schema can be a drawback, too!)
  • Is supported by a nice query language, XQuery,
    distinctly cleaner than SQL

8
XML Anatomy
  • <?xml version="1.0" encoding="ISO-8859-1" ?>
  • <dblp>
  • <mastersthesis mdate="2002-01-03" key="ms/Brown92">
  •   <author>Kurt P. Brown</author>
  •   <title>PRPL: A Database Workload Specification Language</title>
  •   <year>1992</year>
  •   <school>Univ. of Wisconsin-Madison</school>
  • </mastersthesis>
  • <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
  •   <editor>Paul R. McJones</editor>
  •   <title>The 1995 SQL Reunion</title>
  •   <journal>Digital System Research Center Report</journal>
  •   <volume>SRC1997-018</volume>
  •   <year>1997</year>
  •   <ee>db/labs/dec/SRC1997-018.html</ee>
  •   <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  • </article>

Callouts in the figure: processing instruction (the <?xml ... ?> line), open-tag, element, attribute, close-tag.
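
The parts named in the callouts can be inspected with a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the fragment is shortened, the closing </dblp> is added here as an assumption so that it parses on its own, and the helper name describe is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened dblp fragment from the slide. The closing </dblp> is an
# assumption added so the fragment is a complete document.
DOC = b"""<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
</dblp>"""

root = ET.fromstring(DOC)


def describe(elem):
    """Return (tag, attributes, stripped text) for one element."""
    return elem.tag, dict(elem.attrib), (elem.text or "").strip()


# Walk the two levels below the root element.
for record in root:
    tag, attributes, _ = describe(record)
    print(tag, attributes)
    for field in record:
        print("   ", describe(field))
```

Each element exposes its tag name, an unordered attribute dictionary, and its text content, which mirrors the element/attribute/text distinction in the figure.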
9
Well-Formed XML
  • A legal XML document: fully parsable by an XML
    parser
  • All open-tags have matching close-tags (unlike so
    many HTML documents!), or a special
    <tag/> shortcut for empty tags (equivalent to
    <tag></tag>)
  • Attributes (which are unordered, in contrast to
    elements) only appear once in an element
  • There's a single root element
  • XML is case-sensitive
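
These rules can be tried out directly against a parser. The sketch below assumes Python's standard xml.etree.ElementTree (which wraps the expat parser); the function name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One sample per rule on the slide.
samples = {
    "matched tags": "<a><b>x</b></a>",
    "empty-tag shortcut": "<a><b/></a>",
    "missing close-tag": "<a><b>x</a>",
    "repeated attribute": '<a k="1" k="2"/>',
    "two root elements": "<a/><b/>",
    "case mismatch": "<a></A>",
}

for label, text in samples.items():
    print(f"{label:20} {is_well_formed(text)}")
```

Only the first two samples are accepted; the others each violate one of the rules listed above.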

10
XML as a Data Model
  • XML information set includes 7 types of nodes
  • Document (root)
  • Element
  • Attribute
  • Processing instruction
  • Text (content)
  • Namespace
  • Comment
  • XML data model includes this, plus order info and
    a few other things

11
XML Data Model Visualized (and simplified!)
[Figure: the dblp document drawn as a tree. Legend: root, p-i, element, attribute, text. The Root node has the ?xml processing instruction and the dblp element below it; dblp contains the mastersthesis and article elements, each with mdate and key attributes, child elements (author, title, year, school; editor, title, year, journal, volume, ee, ee), and their text values as leaves.]
12
XML Easily Encodes Relations
Student-course-grade
  • <student-course-grade>
  •   <tuple><sid>1</sid><serno>570103</serno><exp-grade>B</exp-grade></tuple>
  •   <tuple><sid>23</sid><serno>550103</serno><exp-grade>A</exp-grade></tuple>
  • </student-course-grade>
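
The encoding is mechanical: one element for the relation, one tuple element per row, one child per column. A minimal sketch in Python's standard xml.etree.ElementTree, where the function name relation_to_xml is a hypothetical helper and not part of the lecture:

```python
import xml.etree.ElementTree as ET


def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    relation = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(relation, "tuple")
        for column, value in zip(columns, row):
            ET.SubElement(tup, column).text = str(value)
    return relation


rel = relation_to_xml(
    "student-course-grade",
    ["sid", "serno", "exp-grade"],
    [(1, 570103, "B"), (23, 550103, "A")],
)
xml_text = ET.tostring(rel, encoding="unicode")
print(xml_text)
```

Column names become tag names, so they must be legal XML names; values are stored as text, which is why typing information has to come from a schema.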

13
But XML is More Flexible (Non-First-Normal-Form
(NF2))
  • <parents>
  •   <parent name="Jean">
  •     <son>John</son>
  •     <daughter>Joan</daughter>
  •     <daughter>Jill</daughter>
  •   </parent>
  •   <parent name="Feng">
  •     <daughter>Felicity</daughter>
  •   </parent>
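
To see what the nesting buys, it helps to flatten the same data into first-normal-form rows, where the parent's name has to be repeated for every child. This is a sketch only: the closing </parents> tag is added as an assumption, and the function name flatten is hypothetical.

```python
import xml.etree.ElementTree as ET

# The nested document from the slide; </parents> is added (assumption)
# so that it parses.
DOC = """<parents>
  <parent name="Jean">
    <son>John</son>
    <daughter>Joan</daughter>
    <daughter>Jill</daughter>
  </parent>
  <parent name="Feng">
    <daughter>Felicity</daughter>
  </parent>
</parents>"""


def flatten(doc):
    """Return one (parent, relationship, child) row per nested child."""
    rows = []
    for parent in ET.fromstring(doc).iter("parent"):
        for child in parent:
            rows.append((parent.get("name"), child.tag, child.text))
    return rows


for row in flatten(DOC):
    print(row)
```

The XML form stores "Jean" once with a set of children; the flat form needs three rows that each repeat it.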

14
Integrating XML: What If We Have Multiple
Sources with the Same Tags?
  • Namespaces allow us to specify a context for
    different tags
  • Two parts
  • Binding of namespace to URI
  • Qualified names
  • <tag xmlns:myns="http://www.fictitious.com/mypath">
  •   <thistag>is in namespace myns</thistag>
  •   <myns:thistag>is the same</myns:thistag>
  •   <otherns:thistag>is a different tag</otherns:thistag>
  • </tag>
  • (Strictly, the unprefixed thistag is in myns only if the
    same URI is also declared as the default namespace,
    xmlns="http://www.fictitious.com/mypath"; otherns needs
    its own binding to a URI.)
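
How a parser resolves qualified names can be checked with a short experiment. The sketch below uses Python's standard xml.etree.ElementTree, which reports each tag as {URI}localname. The URI bound to otherns is a hypothetical placeholder, since the slide does not show that binding.

```python
import xml.etree.ElementTree as ET

MYNS = "http://www.fictitious.com/mypath"
OTHERNS = "http://www.example.org/other"  # hypothetical URI, not from the slide

# Prefix bindings only: the unprefixed child is in no namespace.
DOC = f"""<tag xmlns:myns="{MYNS}" xmlns:otherns="{OTHERNS}">
  <thistag>unprefixed</thistag>
  <myns:thistag>bound to myns</myns:thistag>
  <otherns:thistag>bound to otherns</otherns:thistag>
</tag>"""

root = ET.fromstring(DOC)
names = [child.tag for child in root]
print(names)

# Lookups by qualified name take a prefix-to-URI mapping.
in_myns = root.findall("myns:thistag", {"myns": MYNS})
print(len(in_myns))

# With the same URI also declared as the default namespace, the
# unprefixed and the myns-prefixed tags resolve to the same name.
DEFAULTED = f'<tag xmlns="{MYNS}" xmlns:myns="{MYNS}"><thistag/><myns:thistag/></tag>'
same = [child.tag for child in ET.fromstring(DEFAULTED)]
print(same)
```

The prefix itself carries no meaning; two tags are the same only when their namespace URI and local name both match.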

15
XML Isn't Enough on Its Own
  • It's too unconstrained for many cases!
  • How will we know when we're getting garbage?
  • How will we query?
  • How will we understand what we got?
  • We also need
  • Some idea of the structure
  • Our focus next
  • Presentation, in some cases: CSS, XSL
  • You can read about this separately at
    http://www.w3.org/Style
  • Some way of interpreting the tags?
  • We'll talk about this later in the semester

16
Structural Constraints: Document Type Definitions
(DTDs)
  • The DTD is an EBNF grammar defining XML structure
  • XML document specifies an associated DTD, plus
    the root element
  • DTD specifies children of the root (and so on)
  • DTD defines special significance for attributes
  • IDs: special attributes that are analogous to
    keys for elements
  • IDREFs: references to IDs
  • IDREFS: a nasty hack that represents a list of
    IDREFs

17
An Example DTD
  • Example DTD
  • <!ELEMENT dblp ((mastersthesis | article)*)>
  • <!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
  • <!ATTLIST mastersthesis mdate CDATA #REQUIRED
  •           key ID #REQUIRED
  •           advisor CDATA #IMPLIED>
  • <!ELEMENT author (#PCDATA)>
  • Example use of DTD in XML file
  • <?xml version="1.0" encoding="ISO-8859-1" ?>
  • <!DOCTYPE dblp SYSTEM "my.dtd">
  • <dblp>

18
Representing Graphs in XML
  • <?xml version="1.0" encoding="ISO-8859-1" ?>
  • <!DOCTYPE graph SYSTEM "special.dtd">
  • <graph>
  •   <author id="author1">
  •     <name>John Smith</name>
  •   </author>
  •   <article>
  •     <author ref="author1" /> <title>Paper1</title>
  •   </article>
  •   <article>
  •     <author ref="author1" /> <title>Paper2</title>
  •   </article>

Callouts in the figure: DTD declares ID (the id attribute); DTD declares IDREF (the ref attribute).
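
Following a reference turns the tree back into a graph. The sketch below resolves each ref to the element carrying the matching id, using Python's standard xml.etree.ElementTree. That parser does not read the DTD, so treating id as an ID and ref as an IDREF is a convention assumed here; the DOCTYPE line is left out, </graph> is added so the fragment parses, and authors_of is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DOC = """<graph>
  <author id="author1">
    <name>John Smith</name>
  </author>
  <article>
    <author ref="author1" /> <title>Paper1</title>
  </article>
  <article>
    <author ref="author1" /> <title>Paper2</title>
  </article>
</graph>"""

root = ET.fromstring(DOC)

# Index every element that carries an id attribute.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}


def authors_of(article):
    """Follow each author ref inside an article to the author's name."""
    return [by_id[a.get("ref")].findtext("name") for a in article.findall("author")]


papers = {art.findtext("title"): authors_of(art) for art in root.findall("article")}
print(papers)
```

Both articles point at the same author element, which a pure tree (with the author copied into each article) could not express without duplication.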
19
Graph Data Model
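[Figure: the graph document drawn as a tree. Root has ?xml, !DOCTYPE, and the graph element below it; graph contains one author element (attribute id = author1, child name = John Smith) and two article elements, each with a title (Paper1, Paper2) and an author child whose ref attribute has the value author1.]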
20
Graph Data Model
[Figure: the same diagram as on the previous slide, without the author1 values under the two ref attributes.]
21
Two other ways of representing a DB

22
Project and Employee relations in XML
Projects and employees are intermixed
  • <db>
  •   <project>
  •     <title> Pattern recognition </title>
  •     <budget> 10000 </budget>
  •     <managedBy> Joe </managedBy>
  •   </project>
  •   <employee>
  •     <name> Joe </name>
  •     <ssn> 344556 </ssn>
  •     <age> 34 </age>
  •   </employee>
  •   <employee>
  •     <name> Sandra </name>
  •     <ssn> 2234 </ssn>
  •     <age> 35 </age>
  •   </employee>
  •   <project>
  •     <title> Auto guided vehicle </title>
  •     <budget> 70000 </budget>
  •     <managedBy> Sandra </managedBy>
  •   </project>
  • </db>
23
Project and Employee relations in XML (contd)
Employees follow projects
  • <db>
  •   <projects>
  •     <project>
  •       <title> Pattern recognition </title>
  •       <budget> 10000 </budget>
  •       <managedBy> Joe </managedBy>
  •     </project>
  •     <project>
  •       <title> Auto guided vehicles </title>
  •       <budget> 70000 </budget>
  •       <managedBy> Sandra </managedBy>
  •     </project>
  •   </projects>
  •   <employees>
  •     <employee>
  •       <name> Joe </name>
  •       <ssn> 344556 </ssn>
  •       <age> 34 </age>
  •     </employee>
  •     <employee>
  •       <name> Sandra </name>
  •       <ssn> 2234 </ssn>
  •       <age> 35 </age>
  •     </employee>
  •   </employees>
  • </db>
24
Project and Employee relations in XML (contd)
Or without separator tags
  • <db>
  •   <projects>
  •     <title> Pattern recognition </title>
  •     <budget> 10000 </budget>
  •     <managedBy> Joe </managedBy>
  •     <title> Auto guided vehicles </title>
  •     <budget> 70000 </budget>
  •     <managedBy> Sandra </managedBy>
  •   </projects>
  •   <employees>
  •     <name> Joe </name>
  •     <ssn> 344556 </ssn>
  •     <age> 34 </age>
  •     <name> Sandra </name>
  •     <ssn> 2234 </ssn>
  •     <age> 35 </age>
  •   </employees>
  • </db>
25
More on attributes
  • An (opening) tag may contain attributes. These
    are typically used to describe the content of
    an element
  • <entry>
  •   <word language="en"> cheese </word>
  •   <word language="fr"> fromage </word>
  •   <word language="ro"> branza </word>
  •   <meaning> A food made ... </meaning>
  • </entry>

26
Attributes (contd)
  • Another common use for attributes is to express
    dimension or type
  • <picture>
  •   <height dim="cm"> 2400 </height>
  •   <width dim="in"> 96 </width>
  •   <data encoding="gif" compression="zip">
  •     M05-.C@02!G96YE&lt;FEC ...
  •   </data>
  • </picture>

27
When to use attributes
  • The choice between representing data as
    attributes or as elements is sometimes unclear;
    it is often a matter of taste.

  • <person ssno="123 45 6789">
  •   <name> F. MacNiel </name>
  •   <email> fmacn@dcs.barra.ac.sc </email>
  •   ...
  • </person>

  • <person>
  •   <ssno> 123 45 6789 </ssno>
  •   <name> F. MacNiel </name>
  •   <email> fmacn@dcs.barra.ac.sc </email>
  •   ...
  • </person>
28
Using IDs
  • <family>
  •   <person id="jane" mother="mary" father="john">
  •     <name> Jane Doe </name>
  •   </person>
  •   <person id="john" children="jane jack">
  •     <name> John Doe </name> <mother/>
  •   </person>
  •   <person id="mary" children="jane jack">
  •     <name> Mary Doe </name>
  •   </person>
  •   <person id="jack" mother="mary" father="john">
  •     <name> Jack Doe </name>
  •   </person>
  • </family>

Callouts in the figure: DTD declares ID (id), IDREF (mother, father), and IDREFS (children).
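
A validating parser would enforce two things for this document: every ID value is unique, and every IDREF or IDREFS token names an existing ID. The sketch below checks both by hand with Python's standard xml.etree.ElementTree, which does not validate against a DTD. Which attributes count as ID, IDREF, or IDREFS is hard-coded here as an assumption taken from the callouts, and check_references is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DOC = """<family>
  <person id="jane" mother="mary" father="john">
    <name> Jane Doe </name>
  </person>
  <person id="john" children="jane jack">
    <name> John Doe </name> <mother/>
  </person>
  <person id="mary" children="jane jack">
    <name> Mary Doe </name>
  </person>
  <person id="jack" mother="mary" father="john">
    <name> Jack Doe </name>
  </person>
</family>"""

# Attribute roles, assumed from the slide's callouts.
ID_ATTR = "id"
IDREF_ATTRS = ("mother", "father")
IDREFS_ATTRS = ("children",)


def check_references(root):
    """Return (duplicate ids, dangling references) for the whole document."""
    seen, duplicates = set(), []
    for elem in root.iter():
        ident = elem.get(ID_ATTR)
        if ident is None:
            continue
        if ident in seen:
            duplicates.append(ident)
        seen.add(ident)

    dangling = []
    for elem in root.iter():
        refs = [elem.get(a) for a in IDREF_ATTRS if elem.get(a) is not None]
        for attr in IDREFS_ATTRS:
            # An IDREFS value is a whitespace-separated list of IDREFs.
            refs.extend((elem.get(attr) or "").split())
        dangling.extend(r for r in refs if r not in seen)
    return duplicates, dangling


print(check_references(ET.fromstring(DOC)))
```

Note that nothing here checks that a mother reference actually points at a person, or that children and mother/father agree with each other; the ID mechanism has no way to say so.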
29
ODL (OO) schema
  • class Movie
  • ( extent Movies, key title )
  • {
  •   attribute string title;
  •   attribute string director;
  •   relationship set<Actor> casts
  •     inverse Actor::acted_In;
  •   attribute int budget;
  • };

  • class Actor
  • ( extent Actors, key name )
  • {
  •   attribute string name;
  •   relationship set<Movie> acted_In
  •     inverse Movie::casts;
  •   attribute int age;
  •   attribute set<string> directed;
  • };
30
Encoding of OO instance into XML
  • <db>
  •   <movie id="m1">
  •     <title>Waking Ned Divine</title>
  •     <director>Kirk Jones III</director>
  •     <cast idrefs="a1 a3"></cast>
  •     <budget>100,000</budget>
  •   </movie>
  •   <movie id="m2">
  •     <title>Dragonheart</title>
  •     <director>Rob Cohen</director>
  •     <cast idrefs="a2 a9 a21"></cast>
  •     <budget>110,000</budget>
  •   </movie>
  •   <movie id="m3">
  •     <title>Moondance</title>
  •     <director>Dagmar Hirtz</director>
  •     <cast idrefs="a1 a8"></cast>
  •     <budget>90,000</budget>
  •   </movie>
  •   ...
  •   <actor id="a1">
  •     <name>David Kelly</name>
  •     <acted_In idrefs="m1 m3 m78"></acted_In>
  •   </actor>
  •   <actor id="a2">
  •     <name>Sean Connery</name>
  •     <acted_In idrefs="m2 m9 m11"></acted_In>
  •     <age>68</age>
  •   </actor>
  •   <actor id="a3">
  •     <name>Ian Bannen</name>
  •     <acted_In idrefs="m1 m35"></acted_In>
  •   </actor>
  • </db>
31
Back to DTDs
  • A DTD describes some of the information one would
    expect to find in a schema (based on what we saw
    with SQL DDL)
  • But a DTD is a structural specification.
  • There is a need for additional typing systems
    (XML Schema coming soon)

32
Example: The Address Book
  • <person>
  •   <name> MacNiel, John </name>
  •   <greet> Dr. John MacNiel </greet>
  •   <addr> 1234 Huron Street </addr>
  •   <addr> Rome, OH 98765 </addr>
  •   <tel> (321) 786 2543 </tel>
  •   <fax> (321) 786 2543 </fax>
  •   <tel> (321) 786 2543 </tel>
  •   <email> jm@abc.com </email>
  • </person>

Exactly one name
At most one greeting
As many address lines as needed (in order)
Mixed telephones and faxes
As many emails as needed
33
Specifying the structure
  • The structure of a person entry can be specified
    by
  • name, greet?, addr*, (tel | fax)*, email*
  • This is known as a regular expression, an
    important device for specifying what strings look
    like.
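
Because a content model is a regular expression over tag names, an ordinary regex engine can check it. The following is a sketch under that reading: it writes the child tags of a person as a space-terminated string and matches it against name, greet?, addr*, (tel | fax)*, email* using Python's re module. The names CONTENT_MODEL and matches_person are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*
# Each tag is followed by a space so that tag names cannot run together.
CONTENT_MODEL = re.compile(r"(name )(greet )?(addr )*((tel|fax) )*(email )*")


def matches_person(elem):
    """Check the sequence of child tags against the content model."""
    tags = "".join(child.tag + " " for child in elem)
    return CONTENT_MODEL.fullmatch(tags) is not None


good = ET.fromstring(
    "<person><name/><greet/><addr/><addr/>"
    "<tel/><fax/><tel/><email/></person>"
)
no_name = ET.fromstring("<person><greet/><addr/></person>")
wrong_order = ET.fromstring("<person><name/><email/><tel/></person>")

for label, elem in [("good", good), ("no name", no_name), ("wrong order", wrong_order)]:
    print(label, matches_person(elem))
```

A real validator compiles the expression into an automaton rather than building a string, but the accepted sequences are the same.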

34
Regular Expressions
  • Each regular expression determines a
    corresponding finite state automaton. Let's start
    with a simpler example
  • name, addr*, email
  • This suggests a simple parsing program

35
Another example
  • name, address*, (tel | fax)*, email*

[Figure: finite state automaton for this expression, with transitions labeled name, address, tel, fax, and email.]
Adding in the optional greet further complicates
things
36
A DTD for the address book
  • <!DOCTYPE addressbook [
  • <!ELEMENT addressbook (person*)>
  • <!ELEMENT person
  •   (name, greet?, address*, (fax | tel)*,
  •    email*)>
  • <!ELEMENT name (#PCDATA)>
  • <!ELEMENT greet (#PCDATA)>
  • <!ELEMENT address (#PCDATA)>
  • <!ELEMENT tel (#PCDATA)>
  • <!ELEMENT fax (#PCDATA)>
  • <!ELEMENT email (#PCDATA)>
  • ]>

37
Our relational DB revisited

38
Two DTDs for the relational DB
  • <!DOCTYPE db [
  • <!ELEMENT db (projects, employees)>
  • <!ELEMENT projects (project*)>
  • <!ELEMENT employees (employee*)>
  • <!ELEMENT project (title, budget, managedBy)>
  • <!ELEMENT employee (name, ssn, age)>
  • ...
  • ]>

  • <!DOCTYPE db [
  • <!ELEMENT db (project | employee)*>
  • <!ELEMENT project (title, budget, managedBy)>
  • <!ELEMENT employee (name, ssn, age)>
  • ...
  • ]>
39
Recursive DTDs
  • <!DOCTYPE genealogy [
  • <!ELEMENT genealogy (person*)>
  • <!ELEMENT person (
  •   name,
  •   dateOfBirth,
  •   person, -- mother
  •   person )> -- father
  • ...
  • ]>
  • What is the problem with this?

40
Recursive DTDs contd.
  • <!DOCTYPE genealogy [
  • <!ELEMENT genealogy (person*)>
  • <!ELEMENT person (
  •   name,
  •   dateOfBirth,
  •   person?, -- mother
  •   person? )> -- father
  • ...
  • ]>
  • What is now the problem with this?

41
Some things are hard to specify
  • Each employee element is to contain name, age and
    ssn elements in some order.
  • <!ELEMENT employee
  •   ( (name, age, ssn) | (age, ssn, name) |
  •     (ssn, name, age) | ...
  •   )>
  • Very awkward, even infeasible with many fields!
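
The awkwardness is easy to quantify: an "any order" content model needs one alternative per permutation, so n fields need n! alternatives. A small sketch that generates the declaration mechanically (the function name any_order_model is hypothetical):

```python
from itertools import permutations
from math import factorial


def any_order_model(fields):
    """Build a DTD content model listing every ordering of fields."""
    alternatives = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "( " + " | ".join(alternatives) + " )"


model = any_order_model(["name", "age", "ssn"])
print(f"<!ELEMENT employee {model}>")

# Number of alternatives needed as the field count grows.
for n in (3, 5, 10):
    print(n, "fields ->", factorial(n), "alternatives")
```

Three fields already need six alternatives; ten fields would need 3,628,800, which is why this is not a practical way to say "in any order".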

42
DTDs Aren't Enough
  • DTDs capture grammatical structure, but have some
    drawbacks
  • Not themselves in XML: inconvenient to build
    tools for them
  • Don't capture database datatypes' domains
  • IDs aren't a good implementation of keys
  • No element type may have more than one ID
    attribute specified. The value of an ID attribute
    must be unique among all values of all ID
    attributes. Why is this insufficient?
  • No way of defining OO-like inheritance
  • No way of expressing FDs

43
XML Schema
  • Aims to address the shortcomings of DTDs
  • Features
  • XML syntax
  • Can define keys using XPaths
  • Type subclassing that's more complex than in a
    programming language
  • Programming languages don't consider order of
    member variables!
  • Subclassing by extension and by restriction
  • And, of course, domains and built-in datatypes

44
Simple Schema Example
  • <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  • <xsd:element name="mastersthesis" type="ThesisType"/>
  • <xsd:complexType name="ThesisType">
  •   <xsd:sequence>
  •     <xsd:element name="author" type="xsd:string"/>
  •     <xsd:element name="title" type="xsd:string"/>
  •     <xsd:element name="year" type="xsd:integer"/>
  •     <xsd:element name="school" type="xsd:string"/>
  •     <xsd:element name="committeemember"
  •       type="CommitteeType" minOccurs="0"/>
  •   </xsd:sequence>
  •   <xsd:attribute name="mdate" type="xsd:date"/>
  •   <xsd:attribute name="key" type="xsd:string"/>
  •   <xsd:attribute name="advisor" type="xsd:string"/>
  • </xsd:complexType>

45
Designing an XML Schema/DTD
  • Not as formalized as relational data design
  • We can still use ER diagrams to break the data
    into entity sets and relationship sets
  • ER diagrams have extensions for aggregation
    (treating smaller diagrams as entities) and for
    composite attributes
  • Note that often we already have our data in
    relations and need to design the XML schema to
    export them!
  • Generally orient the XML tree around the
    central objects
  • Big decision: element vs. attribute
  • Element: if it has its own properties, or if you
    might have more than one of them
  • Attribute: if it is a single property (or perhaps
    not!)

46
XML as a Data Model
  • XML is a flexible representation mechanism
  • Can represent documents, data
  • Standard data exchange format
  • Several competing schema formats (esp. DTD and
    XML Schema) provide typing information
  • Next time: querying XML