XML and Beyond: Parts I and II - PowerPoint PPT Presentation

Slides: 60
Provided by: lambd
Learn more at: https://lambda.uta.edu

1
XML and Beyond: Parts I and II
  • http://db.cis.upenn.edu
  • http://www.w3c.org

2
Outline
  • Background: documents (SGML/HTML) and databases
    (structured and semistructured data)
  • XML Basics and Document Type Descriptors
  • XML APIs: Document Object Model (DOM), SAX (not
    covered in this course)
  • XML query languages: XML-QL, XSL, Quilt.

3
Part I: Background
  • What's the difference between the world of
    documents and information retrieval and the world
    of databases and query interfaces?

4
Documents vs Databases
  • Document world
      • plenty of small documents
      • usually static
      • implicit structure: section, paragraph, toc
      • tagging
      • human friendly
      • content: form/layout, annotation
      • paradigms: "Save as"
      • meta-data: author name, date, subject
  • Database world
      • a few large databases
      • usually dynamic
      • explicit structure (schema)
      • records
      • machine friendly
      • content: schema, data, methods
      • paradigms: ACID (Atomicity, Consistency,
        Isolation, Durability)
      • meta-data: schema description

5
What to do with them
  • Documents
      • editing
      • printing
      • spell-checking
      • counting words
      • retrieving (IR)
      • searching
  • Database
      • updating
      • cleaning
      • querying
      • composing/transforming

6
HTML
  • Lingua franca for publishing hypertext on the
    World Wide Web
  • Designed to describe how a Web browser should
    arrange text, images and push-buttons on a page.
  • Easy to learn, but does not convey structure.
  • Fixed tag set.

<HTML>
  <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
  <BODY>
    <H1>Introduction</H1>
    <IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
  </BODY>
</HTML>
Labels in the figure: text (PCDATA), opening tag, closing
tag, bachelor tag, attribute name, attribute value.
7
Thin red line
  • The line between the document world and the
    database world is not clear.
  • In some cases, both approaches are legitimate.
  • An interesting middle ground is data formats --
    of which XML is an example
  • Example: a personal address book

8
Personal address book over 20 years
1977
  N Achison, Malcolm
  F Dr. M.P. Achison
  A Dept. of Computer Science
  A University of Edinburgh
  A Kings Buildings
  A Edinburgh E12 8QQ
  A Scotland
  T 031-123-8855 ext. 4359 (work)
  T 031-345-7570 (home)
  N Albani, Paolo
  F Prof. Paolo Albani
  A Dip. Informatica e Sistemistica
  A Universita di Roma La Sapienza
  ...
1980
  N Achison, Malcolm
  F Dr. M.P. Achison
  A Dept. of Computer Science
  ....
  T 031-667-7570 (home)
  C mpa@uk.ac.ed.cs
1990
  N Achison, Malcolm
  F Prof. M.P. Achison
  A Dept. of Computing Science
  A University of Glasgow
  A Lilybank Gardens
  A Glasgow G12 8QQ
  A Scotland
  T 041-339-8855 ext. 4359
  T 041-357-3787 (private)
  T 031-667-7570 (home)
  X 041-339-0090
  C mpa@uk.ac.gla.cs
  N Achison, Malcolm
  F Prof. M.P. Achison
  A 34 Inverness Place
  A Edinburgh, EH3 8UV
1997
  N Achison, Malcolm
  F Prof. M.P. Achison
  A Department of Computing Science
  ...
  T 031-667-7570 (home)
  X 041-339-0090
  C mpa@dcs.gla.ac.uk
  W http://www.dcs.gla.ac.uk/mpa
2000
  ?
9
The Structure of XML
  • XML consists of tags and text
  • Tags come in pairs: <date> ... </date>
  • They must be properly nested
  • <date> <day> ... </day> ... </date> --- good
  • <date> <day> ... </date> ... </day> --- bad
  • (You can't do <i> ... <b> ... </i> ... </b> in
    HTML)
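
The nesting rule is something a parser enforces directly. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the two sample strings are made up for illustration and simply fill in the good and bad patterns above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<date><day>12</day><month>May</month></date>"
bad = "<date><day>12</date></day>"   # closing tags in the wrong order

print(is_well_formed(good))
print(is_well_formed(bad))
```

The parser stops at the first mismatched closing tag, so the badly nested string never produces a tree at all.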

10
XML text
  • XML has only one basic type -- text.
  • It is bounded by tags, e.g.
  • <title> The Big Sleep </title>
  • <year> 1935 </year> --- 1935 is still text
  • XML text is called PCDATA (for parsed character
    data). It uses Unicode, and any character can be
    written as a numeric character reference, e.g.
    &#x05DE; for the Hebrew letter Mem
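
To see that element content really is just text, here is a small sketch using Python's standard xml.etree.ElementTree. The letter element is a made-up wrapper used only to show a character reference (U+05DE is the Hebrew letter Mem).

```python
import xml.etree.ElementTree as ET

year = ET.fromstring("<year> 1935 </year>")
print(repr(year.text))        # the content is a str, whitespace included
as_number = int(year.text)    # converting to a number is the application's job

# A numeric character reference is replaced by the character it names.
letter = ET.fromstring("<letter>&#x05DE;</letter>")
print(letter.text == "\u05de")
```

Nothing in the document itself says that 1935 is a number; that interpretation lives in the program that reads it.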

11
XML structure
  • Nesting tags can be used to express various
    structures, e.g. a tuple (record):

<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>
12
XML structure
  • We can represent a list by using the same tag
    repeatedly:

<addresses>
  <person> ... </person>
  <person> ... </person>
  <person> ... </person>
  ...
</addresses>
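
Read back into a program, the tuple becomes a record and the repeated tag becomes a list. A minimal sketch with Python's standard xml.etree.ElementTree; the function name person_to_record is hypothetical and the second person entry is invented to make the list visible.

```python
import xml.etree.ElementTree as ET

doc = """
<addresses>
  <person>
    <name> Malcolm Atchison </name>
    <tel> (215) 898 4321 </tel>
    <email> mp@dcs.gla.ac.sc </email>
  </person>
  <person>
    <name> Paolo Albani </name>
  </person>
</addresses>
"""


def person_to_record(person):
    """Turn one <person> element into a dict keyed by sub-element tag."""
    return {child.tag: child.text.strip() for child in person}


root = ET.fromstring(doc)
records = [person_to_record(p) for p in root.findall("person")]
print(records)
```

A dict keeps only one value per tag, so a person with several tel elements would need a list per key instead.
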
13
Terminology
  • The segment of an XML document between an opening
    and a corresponding closing tag is called an
    element.

<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>
Labels in the figure: "element"; "element, a sub-element
of ..."; "not an element".
14
XML is tree-like
[Figure: the person element drawn as a tree, with leaves
Malcolm Atchison, (215) 898 4321, (215) 898 4321 and
mp@dcs.gla.ac.sc]
Semistructured data models typically put the labels on
the edges.
15
Mixed Content
  • An element may contain a mixture of sub-elements
    and PCDATA:

<airline>
  <name> British Airways </name>
  <motto>
    World's <dubious> favorite</dubious> airline
  </motto>
</airline>

  • Data of this form is not typically generated from
    databases. It is needed for consistency with HTML.
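
Mixed content is awkward for record-oriented code because the text is split around the sub-elements. A short sketch of how Python's standard xml.etree.ElementTree exposes it (the motto is trimmed to one line here for readability):

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto>World's <dubious>favorite</dubious> airline</motto>"
)
dubious = motto[0]

print(repr(motto.text))     # text before the first sub-element
print(repr(dubious.text))   # text inside the sub-element
print(repr(dubious.tail))   # text after the sub-element's closing tag
print("".join(motto.itertext()))
```

The text after a sub-element is stored on that sub-element (its tail), not on the parent, which is one reason mixed content does not map cleanly onto records.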

16
A Complete XML Document
<?xml version="1.0"?>
<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>

17
Representing relational DBs: Two ways

18
Project and Employee relations in XML
Projects and employees are intermixed:
<db>
  <project>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
  </project>
  <employee>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
  </employee>
  <employee>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employee>
  <project>
    <title> Auto guided vehicle </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </project>
</db>
19
Project and Employee relations in XML (contd)
Employees follow projects:
<db>
  <projects>
    <project>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
    </project>
    <project>
      <title> Auto guided vehicles </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
    </project>
  </projects>
  <employees>
    <employee>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
    </employee>
    <employee>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
    </employee>
  </employees>
</db>
20
Project and Employee relations in XML (contd)
Or without separator tags:
<db>
  <projects>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
    <title> Auto guided vehicles </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </projects>
  <employees>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employees>
</db>
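
Without separator tags the tuple boundaries are only implied by position, so a reader has to know the width of each relation and regroup the fields. A rough sketch in Python (standard xml.etree.ElementTree); the function regroup and the fixed width of 3 are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<db>
  <projects>
    <title>Pattern recognition</title><budget>10000</budget><managedBy>Joe</managedBy>
    <title>Auto guided vehicles</title><budget>70000</budget><managedBy>Sandra</managedBy>
  </projects>
  <employees>
    <name>Joe</name><ssn>344556</ssn><age>34</age>
    <name>Sandra</name><ssn>2234</ssn><age>35</age>
  </employees>
</db>
"""


def regroup(parent, width):
    """Cut the flat child list of `parent` into tuples of `width` fields."""
    fields = [child.text.strip() for child in parent]
    if len(fields) % width != 0:
        raise ValueError("child count is not a multiple of the tuple width")
    return [tuple(fields[i:i + width]) for i in range(0, len(fields), width)]


root = ET.fromstring(doc)
projects = regroup(root.find("projects"), 3)
employees = regroup(root.find("employees"), 3)
print(projects)
print(employees)
```

With the project and employee separator tags of the previous form, no such positional bookkeeping is needed.
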
21
Attributes
  • An (opening) tag may contain attributes. These
    are typically used to describe the content of
    an element

<entry>
  <word language="en"> cheese </word>
  <word language="fr"> fromage </word>
  <word language="ro"> branza </word>
  <meaning> A food made </meaning>
</entry>

22
Attributes (contd)
  • Another common use for attributes is to express
    dimension or type

<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
  <data encoding="gif" compression="zip">
    M05-.C@02!G96YE<FEC ...
  </data>
</picture>

  • A document that obeys the nested tags rule and
    does not repeat an attribute within a tag is said
    to be well-formed.
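
Both halves of this slide can be tried out with Python's standard xml.etree.ElementTree: reading an attribute, and seeing a repeated attribute rejected. The variable duplicate_accepted is a made-up name for the sketch.

```python
import xml.etree.ElementTree as ET

height = ET.fromstring('<height dim="cm"> 2400 </height>')
print(height.get("dim"), height.text.strip())

# Repeating an attribute inside one tag breaks well-formedness,
# so the parser refuses the document.
try:
    ET.fromstring('<height dim="cm" dim="in"> 2400 </height>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```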

23
When to use attributes
  • It's not always clear when to use attributes

<person ssno="123 45 6789">
  <name> F. MacNiel </name>
  <email> fmacn@dcs.barra.ac.sc </email>
  ...
</person>

OR

<person>
  <ssno>123 45 6789</ssno>
  <name> F. MacNiel </name>
  <email> fmacn@dcs.barra.ac.sc </email>
  ...
</person>
24
Using IDs
<family>
  <person id="jane" mother="mary" father="john">
    <name> Jane Doe </name>
  </person>
  <person id="john" children="jane jack">
    <name> John Doe </name> <mother/>
  </person>
  <person id="mary" children="jane jack">
    <name> Mary Doe </name>
  </person>
  <person id="jack" mother="mary" father="john">
    <name> Jack Doe </name>
  </person>
</family>

25
An object-oriented schema
class Movie
  ( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set<Actor> casts
    inverse Actor::acted_In;
  attribute int budget;
};

class Actor
  ( extent Actors, key name )
{
  attribute string name;
  relationship set<Movie> acted_In
    inverse Movie::casts;
  attribute int age;
  attribute set<string> directed;
};
26
An example
<db>
  <movie id="m1">
    <title>Waking Ned Divine</title>
    <director>Kirk Jones III</director>
    <cast idrefs="a1 a3"></cast>
    <budget>100,000</budget>
  </movie>
  <movie id="m2">
    <title>Dragonheart</title>
    <director>Rob Cohen</director>
    <cast idrefs="a2 a9 a21"></cast>
    <budget>110,000</budget>
  </movie>
  <movie id="m3">
    <title>Moondance</title>
    <director>Dagmar Hirtz</director>
    <cast idrefs="a1 a8"></cast>
    <budget>90,000</budget>
  </movie>
  <actor id="a1">
    <name>David Kelly</name>
    <acted_In idrefs="m1 m3 m78"> </acted_In>
  </actor>
  <actor id="a2">
    <name>Sean Connery</name>
    <acted_In idrefs="m2 m9 m11"> </acted_In>
    <age>68</age>
  </actor>
  <actor id="a3">
    <name>Ian Bannen</name>
    <acted_In idrefs="m1 m35"> </acted_In>
  </actor>
</db>
27
Part II: Document Type Descriptors
  • Imposing structure on XML documents

28
Document Type Descriptors
  • Document Type Descriptors (DTDs) impose structure
    on an XML document. (The XML specification calls
    a DTD a document type definition.)
  • There is some relationship between a DTD and a
    schema, but it is not close: there is still a
    need for additional typing systems.
  • The DTD is a syntactic specification.

29
Example An Address Book
<person>
  <name> MacNiel, John </name>
  <greet> Dr. John MacNiel </greet>
  <addr>1234 Huron Street </addr>
  <addr> Rome, OH 98765 </addr>
  <tel> (321) 786 2543 </tel>
  <fax> (321) 786 2543 </fax>
  <tel> (321) 786 2543 </tel>
  <email> jm@abc.com </email>
</person>

  • name: exactly one name
  • greet: at most one greeting
  • addr: as many address lines as needed (in order)
  • tel, fax: mixed telephones and faxes
  • email: as many as needed
30
Specifying the structure
  • name to specify a name element
  • greet? to specify an optional (0 or 1) greet
    element
  • name,greet? to specify a name followed by an
    optional greet

31
Specifying the structure (cont)
  • addr* to specify 0 or more address lines
  • tel | fax a tel or a fax element
  • (tel | fax)* 0 or more repeats of tel or fax
  • email* 0 or more email elements

32
Specifying the structure (cont)
  • So the whole structure of a person entry is
    specified by
  • name, greet?, addr*, (tel | fax)*, email*
  • This is known as a regular expression. Why is it
    important?

33
Regular Expressions
  • Each regular expression determines a
    corresponding finite state automaton. Let's start
    with a simpler example:
  • name, addr*, email
  • This suggests a simple parsing program

[Figure: finite state automaton with transitions labeled
name, addr and email]
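
One way such a parsing program could look, as a sketch: a hand-written automaton for name, addr*, email that reads the sequence of child tag names. The state names and the function accepts are assumptions for illustration, not part of the slides.

```python
# States of the automaton for: name, addr*, email
START, AFTER_NAME, DONE = "start", "after_name", "done"

TRANSITIONS = {
    (START, "name"): AFTER_NAME,
    (AFTER_NAME, "addr"): AFTER_NAME,   # loop: any number of addr elements
    (AFTER_NAME, "email"): DONE,
}


def accepts(tags):
    """Run the automaton over a sequence of child tag names."""
    state = START
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:          # no transition: reject immediately
            return False
    return state == DONE


print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "email"]))
print(accepts(["addr", "name", "email"]))
```

Each tag is looked at once and no backtracking is needed, which is why content models based on regular expressions are cheap to check while parsing.
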
34
Another example
  • name,address*,(tel | fax)*,email*

[Figure: finite state automaton with transitions labeled
name, address, tel, fax and email]
Adding in the optional greet further complicates
things.
35
A DTD for the address book
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person
    (name, greet?, address*, (fax | tel)*, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
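
Python's standard library parses XML but does not validate against a DTD, so the sketch below imitates the content-model check by hand. Each content model is turned into an ordinary regular expression over the sequence of child tags. The table CONTENT_MODELS is copied from the declarations above (with "" standing for text-only elements); the function names are hypothetical, and attributes and text content are not checked.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the address book DTD; "" means "no child elements".
CONTENT_MODELS = {
    "addressbook": "(person*)",
    "person": "(name, greet?, address*, (fax | tel)*, email*)",
    "name": "", "greet": "", "address": "",
    "tel": "", "fax": "", "email": "",
}

NAME = re.compile(r"[A-Za-z_][\w.-]*")


def model_to_regex(model):
    """Translate a content model into a regex over 'tag;' sequences."""
    grouped = NAME.sub(lambda m: "(?:" + m.group(0) + ";)", model)
    return re.compile(grouped.replace(",", "").replace(" ", ""))


def validate(element):
    """Check the children of `element`, and recursively their children."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:                      # undeclared element
        return False
    children = "".join(child.tag + ";" for child in element)
    if model_to_regex(model).fullmatch(children) is None:
        return False
    return all(validate(child) for child in element)


good = ET.fromstring("""
<addressbook>
  <person>
    <name>MacNiel, John</name>
    <greet>Dr. John MacNiel</greet>
    <address>1234 Huron Street</address>
    <address>Rome, OH 98765</address>
    <tel>(321) 786 2543</tel>
    <fax>(321) 786 2543</fax>
    <tel>(321) 786 2543</tel>
    <email>jm@abc.com</email>
  </person>
</addressbook>
""")
bad = ET.fromstring(
    "<addressbook><person><email>jm@abc.com</email>"
    "<name>MacNiel, John</name></person></addressbook>"
)

print(validate(good))
print(validate(bad))
```

A validating parser does the same job from the DTD itself; this version only shows that the check reduces to regular-expression matching on child tags.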

36
Two DTDs for the relational DB
<!DOCTYPE db [
  <!ELEMENT db (projects, employees)>
  <!ELEMENT projects (project*)>
  <!ELEMENT employees (employee*)>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>

<!DOCTYPE db [
  <!ELEMENT db (project | employee)*>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>
37
Recursive DTDs
<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (
      name,
      dateOfBirth,
      person,      -- mother
      person )>    -- father
  ...
]>
  • What is the problem with this?

38
Recursive DTDs contd.
<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (
      name,
      dateOfBirth,
      person?,     -- mother
      person? )>   -- father
  ...
]>
  • What is now the problem with this?

39
Some things are hard to specify
  • Each employee element is to contain name, age and
    ssn elements in some order.

<!ELEMENT employee
  ( (name, age, ssn) | (age, ssn, name)
  | (ssn, name, age) | ...
  )>

  • Suppose there were many more fields!
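
The blow-up is easy to quantify: "in any order" needs one alternative per permutation of the fields. A small sketch in Python; the function name any_order_model is made up for illustration.

```python
from itertools import permutations
from math import factorial


def any_order_model(fields):
    """Build a DTD content model that accepts the fields in every order."""
    alternatives = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "(" + " | ".join(alternatives) + ")"


model = any_order_model(["name", "age", "ssn"])
print("<!ELEMENT employee " + model + ">")

# Number of alternatives needed for ten fields.
print(factorial(10))
```

Three fields already need 6 alternatives; ten fields would need 3,628,800, so spelling out every order is not practical in a DTD.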

40
Summary of XML regular expressions
  • A        The tag A occurs
  • e1,e2    The expression e1 followed by e2
  • e*       0 or more occurrences of e
  • e?       Optional -- 0 or 1 occurrences
  • e+       1 or more occurrences
  • e1 | e2  either e1 or e2
  • (e)      grouping

41
It's easy to get confused
  • <!ELEMENT PARTNER (NAME?, ONETIME?, PARTNRID?,
    PARTNRTYPE?, SYNCIND?, ACTIVE?, CURRENCY?,
    DESCRIPTN?, DUNSNUMBER?, GLENTITYS?, NAME*,
    PARENTID?, PARTNRIDX?, PARTNRRATG?, PARTNRROLE?,
    PAYMETHOD?, TAXEXEMPT?, TAXID?, TERMID?,
    USERAREA?, ADDRESS*, CONTACT*)>
  • Cited from oagis_segments.dtd (one of the files
    in the Novell Developer Kit,
    http://developer.novell.com/ndk/indexexe.htm)
  • <PARTNER> <NAME> Ben Franklin </NAME> </PARTNER>
  • Q. Which NAME is it?

42
Specifying attributes in the DTD
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
    dimension CDATA #REQUIRED
    accuracy  CDATA #IMPLIED>

  • The dimension attribute is required; the accuracy
    attribute is optional.
  • CDATA is the type of the attribute -- it means
    string.

43
Specifying ID and IDREF attributes
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST person
      id       ID     #REQUIRED
      mother   IDREF  #IMPLIED
      father   IDREF  #IMPLIED
      children IDREFS #IMPLIED>
]>

44
Some conforming data
<family>
  <person id="jane" mother="mary" father="john">
    <name> Jane Doe </name>
  </person>
  <person id="john" children="jane jack">
    <name> John Doe </name>
  </person>
  <person id="mary" children="jane jack">
    <name> Mary Doe </name>
  </person>
  <person id="jack" mother="mary" father="john">
    <name> Jack Doe </name>
  </person>
</family>

45
Consistency of ID and IDREF attribute values
  • If an attribute is declared as ID
      • the associated values must all be distinct (no
        confusion)
  • If an attribute is declared as IDREF
      • the associated value must exist as the value of
        some ID attribute (no dangling pointers)
  • Similarly for all the values of an IDREFS
    attribute
  • ID and IDREF attributes are not typed
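
These two rules can be checked with a single pass over the document. The sketch below uses Python's standard xml.etree.ElementTree, which does not read the DTD, so the roles of the attributes are listed by hand in three assumed sets. The sample deliberately leaves out the person jack so that one dangling reference shows up.

```python
import xml.etree.ElementTree as ET

# Which attributes play which role; in a real DTD this comes from <!ATTLIST>.
ID_ATTRS = {"id"}
IDREF_ATTRS = {"mother", "father"}
IDREFS_ATTRS = {"children"}


def check_references(root):
    """Return (duplicate IDs, dangling references) found under root."""
    seen, duplicates, referenced = set(), set(), set()
    for element in root.iter():
        for name, value in element.attrib.items():
            if name in ID_ATTRS:
                if value in seen:
                    duplicates.add(value)
                seen.add(value)
            elif name in IDREF_ATTRS:
                referenced.add(value)
            elif name in IDREFS_ATTRS:
                referenced.update(value.split())
    return duplicates, referenced - seen


family = ET.fromstring("""
<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane jack"><name>John Doe</name></person>
  <person id="mary" children="jane jack"><name>Mary Doe</name></person>
</family>
""")
print(check_references(family))
```

The check can say that jack is missing, but not that a mother reference ought to point at a person: the references carry no type.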

46
An alternative specification
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (mother?, father?, children,
    name)>
  <!ATTLIST person id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT mother EMPTY>
  <!ATTLIST mother idref IDREF #REQUIRED>
  <!ELEMENT father EMPTY>
  <!ATTLIST father idref IDREF #REQUIRED>
  <!ELEMENT children EMPTY>
  <!ATTLIST children idrefs IDREFS #REQUIRED>
]>

47
The revised data
<family>
  <person id="jane">
    <name> Jane Doe </name>
    <mother idref="mary"></mother>
    <father idref="john"></father>
  </person>
  <person id="john">
    <name> John Doe </name>
    <children idrefs="jane jack"> </children>
  </person>
  ...
</family>

48
A useful abbreviation
  • When an element has empty content we can use
  • <tag blahblahbla/> for <tag blahblahbla></tag>
  • For example

<family>
  <person id="jane">
    <name> Jane Doe </name>
    <mother idref="mary"/>
    <father idref="john"/>
  </person>
  ...
</family>
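
A parser treats the two spellings as the same element. A quick sketch with Python's standard xml.etree.ElementTree; the variable names are made up.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<mother idref="mary"></mother>')
short_form = ET.fromstring('<mother idref="mary"/>')

same = (
    long_form.tag == short_form.tag
    and long_form.attrib == short_form.attrib
    and long_form.text == short_form.text
    and len(long_form) == len(short_form) == 0
)
print(same)
```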

49
Back to the object-oriented schema
class Movie
  ( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set<Actor> casts
    inverse Actor::acted_In;
  attribute int budget;
};

class Actor
  ( extent Actors, key name )
{
  attribute string name;
  relationship set<Movie> acted_In
    inverse Movie::casts;
  attribute int age;
  attribute set<string> directed;
};
50
Schema.dtd
<!DOCTYPE db [
  <!ELEMENT db (movie+, actor+)>
  <!ELEMENT movie (title,director,casts,budget)>
  <!ATTLIST movie id ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT casts EMPTY>
  <!ATTLIST casts idrefs IDREFS #REQUIRED>
  <!ELEMENT budget (#PCDATA)>

51
Schema.dtd (contd)
  <!ELEMENT actor (name, acted_In,age?, directed*)>
  <!ATTLIST actor id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT acted_In EMPTY>
  <!ATTLIST acted_In idrefs IDREFS #REQUIRED>
  <!ELEMENT age (#PCDATA)>
  <!ELEMENT directed (#PCDATA)>
]>

52
More on ODL and DTD
  • In May 2000, the Object Data Management Group
    (ODMG) proposed OIFML, an XML document type for
    its Object Interchange Format.
  • http://www.odmg.org/library/readingroom/oifml.pdf

53
Constraints on IDs and IDREFs
  • ID stands for identifier. No two ID attributes in
    a document may have the same value, whatever the
    attributes are called (values must be XML names)
  • IDREF stands for identifier reference. Every
    value associated with an IDREF attribute must
    exist as an ID attribute value
  • IDREFS specifies several (1 or more) identifiers,
    separated by spaces

54
Connecting the document with its DTD
  • In line
      <?xml version="1.0"?>
      <!DOCTYPE db [ <!ELEMENT ...> ]>
      <db> ... </db>
  • Another file
      <!DOCTYPE db SYSTEM "schema.dtd">
  • A URL
      <!DOCTYPE db SYSTEM
        "http://www.schemaauthority.com/schema.dtd">

55
Well-formed and Valid Documents
  • Well-formed: applies to any document (with or
    without a DTD). Requires proper nesting of tags
    and unique attributes within each tag.
  • Valid: the document conforms to its DTD, that is,
    it conforms to the regular expression grammar,
    the types of attributes are correct, and the
    constraints on references are satisfied.

56
DTDs vs. Schemas (or Types)
  • By database (or programming language) standards
    DTDs are rather weak specifications.
      • Only one base type -- PCDATA
      • No useful abstractions, e.g., sets
      • IDREFs are untyped. You point to something, but
        you don't know what!
      • No constraints, e.g., child is inverse of parent
      • No methods
      • Tag definitions are global
  • Some of the XML extensions impose something like
    a schema or type on an XML document. We'll see
    these later.
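
Since a DTD cannot state that child is the inverse of parent, that constraint has to be checked by application code. A hedged sketch in Python (standard xml.etree.ElementTree), reusing the mother, father and children attributes from the family example. The function name inverse_violations is made up, and the sample data omits jack from mary's children on purpose.

```python
import xml.etree.ElementTree as ET

family = ET.fromstring("""
<family>
  <person id="jane" mother="mary" father="john"/>
  <person id="jack" mother="mary" father="john"/>
  <person id="john" children="jane jack"/>
  <person id="mary" children="jane"/>
</family>
""")


def inverse_violations(root):
    """List (child, parent) pairs where the parent does not list the child."""
    people = {p.get("id"): p for p in root.findall("person")}
    violations = []
    for child_id, person in people.items():
        for role in ("mother", "father"):
            parent_id = person.get(role)
            if parent_id is None:
                continue
            parent = people.get(parent_id)
            listed = parent.get("children", "").split() if parent is not None else []
            if child_id not in listed:
                violations.append((child_id, parent_id))
    return violations


print(inverse_violations(family))
```

The document above would pass DTD validation with ID and IDREF attributes, yet the check still reports the pair (jack, mary).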

57
Lots of possibilities for schemas
  • XML Schema (under W3C's spotlight)
  • XDR (Microsoft's BizTalk)
  • SOX (Schema for Object-Oriented XML)
  • Schematron
  • DSD (AT&T Labs and BRICS)
  • and more.

58
Some tools
  • XML Authority: http://www.extensibility.com/tibco/solutions/xml_authority/index.htm
  • XML Spy: http://www.xmlspy.com/download.html

59
Summary
  • XML is a new data format. Its main virtues are
    widespread acceptance and the (important) ability
    to handle semistructured data (data without
    schema).
  • DTDs provide some useful syntactic constraints on
    documents. As schemas they are weak.
  • Next slides: XML programming, XML querying.