Title: XML and Beyond: Parts I and II
1XML and Beyond: Parts I and II
- http://db.cis.upenn.edu
- http://www.w3c.org
2Outline
- Background: documents (SGML/HTML) and databases (structured and semistructured data)
- XML Basics and Document Type Descriptors
- XML APIs: Document Object Model (DOM), SAX (not covered in this course)
- XML query languages: XML-QL, XSL, Quilt.
3Part I: Background
- What's the difference between the world of documents and information retrieval and the world of databases and query interfaces?
4Documents vs Databases
- Document world
  - plenty of small documents
  - usually static
  - implicit structure: section, paragraph, toc, ...
  - tagging
  - human friendly
  - content: form/layout, annotation
  - Paradigms: "Save as"
  - meta-data: author name, date, subject
- Database world
  - a few large databases
  - usually dynamic
  - explicit structure (schema)
  - records
  - machine friendly
  - content: schema, data, methods
  - Paradigms: Atomicity, Consistency, Isolation, Durability
  - meta-data: schema description
5What to do with them
- Documents
  - editing
  - printing
  - spell-checking
  - counting words
  - retrieving (IR)
  - searching
- Database
  - updating
  - cleaning
  - querying
  - composing/transforming
6HTML
- Lingua franca for publishing hypertext on the World Wide Web
- Designed to describe how a Web browser should arrange text, images and push-buttons on a page.
- Easy to learn, but does not convey structure.
- Fixed tag set.
<HTML>
  <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
  <BODY>
    <H1>Introduction</H1>
    <IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
  </BODY>
</HTML>
Callouts in the figure mark an opening tag, a closing tag, a bachelor tag (one with no closing tag, such as IMG), an attribute name, an attribute value, and text (PCDATA).
7Thin red line
- The line between the document world and the database world is not clear.
- In some cases, both approaches are legitimate.
- An interesting middle ground is data formats -- of which XML is an example
- Examples:
  - Personal address book
8Personal address book over 20 years
1977
N Achison, Malcolm
F Dr. M.P. Achison
A Dept. of Computer Science
A University of Edinburgh
A Kings Buildings
A Edinburgh E12 8QQ
A Scotland
T 031-123-8855 ext. 4359 (work)
T 031-345-7570 (home)
N Albani, Paolo
F Prof. Paolo Albani
A Dip. Informatica e Sistemistica
A Universita di Roma La Sapienza
...
1980
N Achison, Malcolm
F Dr. M.P. Achison
A Dept. of Computer Science
....
T 031-667-7570 (home)
C mpa@uk.ac.ed.cs
1990
N Achison, Malcolm
F Prof. M.P. Achison
A Dept. of Computing Science
A University of Glasgow
A Lilybank Gardens
A Glasgow G12 8QQ
A Scotland
T 041-339-8855 ext. 4359
T 041-357-3787 (private)
T 031-667-7570 (home)
X 041-339-0090
C mpa@uk.ac.gla.cs
N Achison, Malcolm
F Prof. M.P. Achison
A 34 Inverness Place
A Edinburgh, EH3 8UV
1997
N Achison, Malcolm
F Prof. M.P. Achison
A Department of Computing Science
...
T 031-667-7570 (home)
X 041-339-0090
C mpa@dcs.gla.ac.uk
W http://www.dcs.gla.ac.uk/mpa
2000
?
9The Structure of XML
- XML consists of tags and text
- Tags come in pairs: <date> ... </date>
- They must be properly nested
- <date> <day> ... </day> ... </date> --- good
- <date> <day> ... </date> ... </day> --- bad
- (You can't do <i> ... <b> ... </i> ... </b> in HTML)
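The nesting rule can be checked mechanically. The sketch below is only an illustration: it uses Python's standard `xml.etree.ElementTree` parser, and the helper name `is_well_formed` and the sample values (14, July) are assumptions that do not come from the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as properly nested XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Same tags as on the slide, with made-up content in place of "...".
good = "<date> <day> 14 </day> July </date>"
bad = "<date> <day> 14 </date> July </day>"

print(is_well_formed(good), is_well_formed(bad))
```

The parser rejects the second string because `</date>` arrives while `<day>` is still open.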
10XML text
- XML has only one basic type -- text.
- It is bounded by tags, e.g.
- <title> The Big Sleep </title>
- <year> 1935 </year> --- 1935 is still text
- XML text is called PCDATA (for parsed character data). It is Unicode text, and any character can be written as a numeric character reference, e.g. &#x05DE; for the Hebrew letter Mem
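A small sketch of both points, assuming Python's standard `xml.etree.ElementTree` parser (the element name `letter` is made up for the illustration): the content of `<year>` comes back as a string, and a numeric character reference is decoded into the character it names.

```python
import xml.etree.ElementTree as ET

# The parser hands back text; turning it into a number is the application's job.
year_text = ET.fromstring("<year> 1935 </year>").text
year_number = int(year_text)  # int() tolerates the surrounding spaces

# A numeric character reference is replaced by the character itself.
letter = ET.fromstring("<letter>&#x05DE;</letter>").text

print(repr(year_text), year_number, hex(ord(letter)))
```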
11XML structure
- Nesting tags can be used to express various structures. E.g. a tuple (record):
<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>
12XML structure
- We can represent a list by using the same tag repeatedly
<addresses>
  <person> ... </person>
  <person> ... </person>
  <person> ... </person>
  ...
</addresses>
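Read back into a program, a tuple becomes a record and a repeated tag becomes a list. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper `to_record` and the contents of the second `<person>` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

doc = """
<addresses>
  <person>
    <name> Malcolm Atchison </name>
    <tel> (215) 898 4321 </tel>
    <email> mp@dcs.gla.ac.sc </email>
  </person>
  <person>
    <name> Paolo Albani </name>
  </person>
</addresses>
"""


def to_record(person):
    """Turn one <person> element into a dict keyed by sub-element tag."""
    return {child.tag: child.text.strip() for child in person}


# The repeated <person> tag becomes a Python list of records.
records = [to_record(p) for p in ET.fromstring(doc).findall("person")]
print(records)
```

A dict keeps only one value per tag, so this mapping would need a list per key for records that repeat a field such as `<tel>`.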
13Terminology
- The segment of an XML document between an opening and a corresponding closing tag is called an element.
<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>
- Callouts in the figure mark an element, an element that is a sub-element of another, and a span that is not an element.
14XML is tree-like
[Figure: the person element drawn as a tree whose leaves are Malcolm Atchison, (215) 898 4321, (215) 898 4321 and mp@dcs.gla.ac.sc]
Semistructured data models typically put the labels on the edges
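The tree view can be reproduced in a few lines. This is a sketch under the assumption that Python's standard `xml.etree.ElementTree` is used; `outline` is a hypothetical helper that prints element tags as inner nodes and text as leaves.

```python
import xml.etree.ElementTree as ET


def outline(elem, depth=0):
    """Return one line per node: the tag, then its text (if any) one level down."""
    lines = ["  " * depth + elem.tag]
    if elem.text and elem.text.strip():
        lines.append("  " * (depth + 1) + repr(elem.text.strip()))
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


person = ET.fromstring(
    "<person><name>Malcolm Atchison</name>"
    "<tel>(215) 898 4321</tel><tel>(215) 898 4321</tel>"
    "<email>mp@dcs.gla.ac.sc</email></person>"
)
lines = outline(person)
print("\n".join(lines))
```

The helper ignores text that follows a sub-element, which is enough for record-like data such as this.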
15Mixed Content
- An element may contain a mixture of sub-elements and PCDATA
<airline>
  <name> British Airways </name>
  <motto>
    World's <dubious> favorite </dubious> airline
  </motto>
</airline>
- Data of this form is not typically generated from databases. It is needed for consistency with HTML
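Mixed content is where tree APIs get slightly awkward. As an illustration only, Python's standard `xml.etree.ElementTree` stores the text before the first sub-element in `.text` and the text after a sub-element in that sub-element's `.tail`:

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto>World's <dubious>favorite</dubious> airline</motto>"
)
dubious = motto.find("dubious")

# Three separate pieces of PCDATA surround and fill the sub-element.
pieces = [motto.text, dubious.text, dubious.tail]

# itertext() walks the element in document order and yields every piece.
full_text = "".join(motto.itertext())
print(pieces, full_text)
```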
16A Complete XML Document
<?xml version="1.0"?>
<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>
17Representing relational DBs: Two ways
18Project and Employee relations in XML
Projects and employees are intermixed
<db>
  <project>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
  </project>
  <employee>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
  </employee>
  <employee>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employee>
  <project>
    <title> Auto guided vehicle </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </project>
</db>
19Project and Employee relations in XML (contd)
Employees follow projects
<db>
  <projects>
    <project>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
    </project>
    <project>
      <title> Auto guided vehicles </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
    </project>
  </projects>
  <employees>
    <employee>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
    </employee>
    <employee>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
    </employee>
  </employees>
</db>
20Project and Employee relations in XML (contd)
Or without separator tags
<db>
  <projects>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
    <title> Auto guided vehicles </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </projects>
  <employees>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employees>
</db>
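The choice of layout changes how a program recovers the rows. The sketch below, which assumes Python's standard `xml.etree.ElementTree` and two hypothetical helpers, reads the Employee relation from the layout with `<employee>` separators and from the layout without them; in the second case the reader has to know the arity of the relation.

```python
import xml.etree.ElementTree as ET

separated = ET.fromstring("""
<employees>
  <employee><name>Joe</name><ssn>344556</ssn><age>34</age></employee>
  <employee><name>Sandra</name><ssn>2234</ssn><age>35</age></employee>
</employees>""")

flat = ET.fromstring("""
<employees>
  <name>Joe</name><ssn>344556</ssn><age>34</age>
  <name>Sandra</name><ssn>2234</ssn><age>35</age>
</employees>""")


def rows_separated(root):
    """Each <employee> element is one row."""
    return [tuple(field.text for field in emp) for emp in root]


def rows_flat(root, arity=3):
    """Without separator tags, rows are recovered by counting fields."""
    values = [field.text for field in root]
    return [tuple(values[i:i + arity]) for i in range(0, len(values), arity)]


print(rows_separated(separated))
print(rows_flat(flat))
```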
21Attributes
- An (opening) tag may contain attributes. These are typically used to describe the content of an element
<entry>
  <word language="en"> cheese </word>
  <word language="fr"> fromage </word>
  <word language="ro"> branza </word>
  <meaning> A food made ... </meaning>
</entry>
22Attributes (contd)
- Another common use for attributes is to express dimension or type
<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
  <data encoding="gif" compression="zip">
    M05-.C@02!G96YE&lt;FEC ...
  </data>
</picture>
- A document that obeys the nested tags rule and does not repeat an attribute within a tag is said to be well-formed.
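A brief sketch of both attribute rules, assuming Python's standard `xml.etree.ElementTree` (the helper `is_well_formed` and the two-language entry are illustrative): attributes are read with `get`, and a tag that repeats an attribute is rejected as not well-formed.

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring(
    '<entry><word language="en">cheese</word>'
    '<word language="fr">fromage</word></entry>'
)
# Attributes describe the content, so they make natural lookup keys.
by_language = {w.get("language"): w.text for w in entry.findall("word")}


def is_well_formed(text):
    """Return True if the parser accepts `text`."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The same attribute appears twice in one tag.
repeated = '<height dim="cm" dim="in">2400</height>'
print(by_language, is_well_formed(repeated))
```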
23When to use attributes
- It's not always clear when to use attributes
<person ssno="123 45 6789">
  <name> F. MacNiel </name>
  <email> fmacn@dcs.barra.ac.sc </email>
  ...
</person>
OR
<person>
  <ssno>123 45 6789</ssno>
  <name> F. MacNiel </name>
  <email> fmacn@dcs.barra.ac.sc </email>
  ...
</person>
24Using IDs
<family>
  <person id="jane" mother="mary" father="john">
    <name> Jane Doe </name>
  </person>
  <person id="john" children="jane jack">
    <name> John Doe </name> <mother/>
  </person>
  <person id="mary" children="jane jack">
    <name> Mary Doe </name>
  </person>
  <person id="jack" mother="mary" father="john">
    <name> Jack Doe </name>
  </person>
</family>
25An object-oriented schema
class Movie
  ( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set<Actor> casts
    inverse Actor::acted_In;
  attribute int budget;
};
class Actor
  ( extent Actors, key name )
{
  attribute string name;
  relationship set<Movie> acted_In
    inverse Movie::casts;
  attribute int age;
  attribute set<string> directed;
};
26An example
<db>
  <movie id="m1">
    <title>Waking Ned Divine</title>
    <director>Kirk Jones III</director>
    <cast idrefs="a1 a3"></cast>
    <budget>100,000</budget>
  </movie>
  <movie id="m2">
    <title>Dragonheart</title>
    <director>Rob Cohen</director>
    <cast idrefs="a2 a9 a21"></cast>
    <budget>110,000</budget>
  </movie>
  <movie id="m3">
    <title>Moondance</title>
    <director>Dagmar Hirtz</director>
    <cast idrefs="a1 a8"></cast>
    <budget>90,000</budget>
  </movie>
  <actor id="a1">
    <name>David Kelly</name>
    <acted_In idrefs="m1 m3 m78"> </acted_In>
  </actor>
  <actor id="a2">
    <name>Sean Connery</name>
    <acted_In idrefs="m2 m9 m11"> </acted_In>
    <age>68</age>
  </actor>
  <actor id="a3">
    <name>Ian Bannen</name>
    <acted_In idrefs="m1 m35"> </acted_In>
  </actor>
</db>
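Following a reference means looking up the element whose `id` matches. The sketch below is one way to do that with Python's standard `xml.etree.ElementTree`, on a cut-down copy of the data; `by_id` and `cast_names` are hypothetical names, and without a DTD the parser treats `id` and `idrefs` as ordinary attributes, so their meaning here is a convention of the example.

```python
import xml.etree.ElementTree as ET

db = ET.fromstring("""
<db>
  <movie id="m1">
    <title>Waking Ned Divine</title>
    <cast idrefs="a1 a3"></cast>
  </movie>
  <actor id="a1"><name>David Kelly</name></actor>
  <actor id="a3"><name>Ian Bannen</name></actor>
</db>""")

# Index every element that carries an id attribute.
by_id = {e.get("id"): e for e in db.iter() if e.get("id") is not None}


def cast_names(movie):
    """Follow the idrefs on <cast> to the actor elements they name."""
    refs = movie.find("cast").get("idrefs").split()
    return [by_id[r].find("name").text for r in refs]


print(cast_names(by_id["m1"]))
```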
27Part II: Document Type Descriptors
- Imposing structure on XML documents
28Document Type Descriptors
- Document Type Descriptors (DTDs) impose structure on an XML document.
- There is some relationship between a DTD and a schema, but it is not close: there is still a need for additional typing systems.
- The DTD is a syntactic specification.
29Example: An Address Book
<person>
  <name> MacNiel, John </name>
  <greet> Dr. John MacNiel </greet>
  <addr> 1234 Huron Street </addr>
  <addr> Rome, OH 98765 </addr>
  <tel> (321) 786 2543 </tel>
  <fax> (321) 786 2543 </fax>
  <tel> (321) 786 2543 </tel>
  <email> jm@abc.com </email>
</person>
- name: exactly one name
- greet: at most one greeting
- addr: as many address lines as needed (in order)
- tel, fax: mixed telephones and faxes
- email: as many as needed
30Specifying the structure
- name: to specify a name element
- greet?: to specify an optional (0 or 1) greet element
- name, greet?: to specify a name followed by an optional greet
31Specifying the structure (cont)
- addr*: to specify 0 or more address lines
- tel | fax: a tel or a fax element
- (tel | fax)*: 0 or more repeats of tel or fax
- email*: 0 or more email elements
32Specifying the structure (cont)
- So the whole structure of a person entry is specified by
- name, greet?, addr*, (tel | fax)*, email*
- This is known as a regular expression. Why is it important?
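One reason it matters is that checking a content model becomes ordinary regular-expression matching over the sequence of child tags. The sketch below illustrates this with Python's `re` module; writing each tag name followed by a space is an encoding chosen for this example, and `PERSON_MODEL` and `conforms` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*  over space-terminated tag names
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")


def conforms(person):
    """Check the sequence of child tags against the content model."""
    tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(tags) is not None


ok = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/><email/></person>"
)
out_of_order = ET.fromstring("<person><greet/><name/></person>")
print(conforms(ok), conforms(out_of_order))
```

This checks only the order and number of sub-elements, which is all a DTD content model constrains.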
33Regular Expressions
- Each regular expression determines a corresponding finite state automaton. Let's start with a simpler example
- name, addr*, email
- This suggests a simple parsing program
[Figure: finite state automaton with transitions labelled name, addr and email]
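A parsing program of the kind suggested here can be written directly from the automaton. The following is a hand-coded sketch for `name, addr*, email`; the state names are invented for the illustration and the input is simply a list of tag names.

```python
def accepts(tags):
    """Hand-coded automaton for the content model: name, addr*, email."""
    state = "start"
    for tag in tags:
        if state == "start" and tag == "name":
            state = "after_name"
        elif state == "after_name" and tag == "addr":
            state = "after_name"  # loop: any number of addr elements
        elif state == "after_name" and tag == "email":
            state = "done"
        else:
            return False  # no transition for this tag in this state
    return state == "done"


print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```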
34Another example
- name, address*, (tel | fax)*, email*
[Figure: finite state automaton with transitions labelled name, address, tel, fax and email]
Adding in the optional greet further complicates things
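For larger expressions a transition table is easier to maintain than nested conditionals. The sketch below encodes one deterministic automaton for `name, address*, (tel | fax)*, email*`; the state numbering is an assumption of this example and need not match the automaton drawn on the slide.

```python
# States: 0 = start, 1 = after name (address run),
#         2 = in the tel/fax run, 3 = in the email run.
TRANSITIONS = {
    (0, "name"): 1,
    (1, "address"): 1,
    (1, "tel"): 2, (1, "fax"): 2, (1, "email"): 3,
    (2, "tel"): 2, (2, "fax"): 2, (2, "email"): 3,
    (3, "email"): 3,
}
ACCEPTING = {1, 2, 3}  # every part after name is optional


def accepts(tags):
    """Run the table-driven automaton over a list of tag names."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:
            return False  # no transition defined
    return state in ACCEPTING


print(accepts(["name", "address", "tel", "fax", "email", "email"]))
```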
35 A DTD for the address book
<!DOCTYPE addressbook [
<!ELEMENT addressbook (person*)>
<!ELEMENT person
    (name, greet?, address*, (fax | tel)*, email*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT greet (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
36Two DTDs for the relational DB
<!DOCTYPE db [
<!ELEMENT db (projects, employees)>
<!ELEMENT projects (project*)>
<!ELEMENT employees (employee*)>
<!ELEMENT project (title, budget, managedBy)>
<!ELEMENT employee (name, ssn, age)>
...
]>
<!DOCTYPE db [
<!ELEMENT db (project | employee)*>
<!ELEMENT project (title, budget, managedBy)>
<!ELEMENT employee (name, ssn, age)>
...
]>
37Recursive DTDs
<!DOCTYPE genealogy [
<!ELEMENT genealogy (person*)>
<!ELEMENT person (
    name,
    dateOfBirth,
    person,      -- mother
    person )>    -- father
...
]>
- What is the problem with this?
38Recursive DTDs contd.
<!DOCTYPE genealogy [
<!ELEMENT genealogy (person*)>
<!ELEMENT person (
    name,
    dateOfBirth,
    person?,     -- mother
    person? )>   -- father
...
]>
- What is now the problem with this?
39Some things are hard to specify
- Each employee element is to contain name, age and ssn elements in some order.
<!ELEMENT employee
    ( (name, age, ssn) | (age, ssn, name) |
      (ssn, name, age) | ...
    )>
- Suppose there were many more fields!
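To see how quickly this grows, the sketch below generates the "any order" content model for a list of fields. It is only an illustration; `any_order_model` is a hypothetical helper, and the point is that n fields need n! alternatives.

```python
from itertools import permutations
from math import factorial


def any_order_model(fields):
    """Spell out 'these fields in any order' as a DTD content model."""
    alternatives = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "(" + " | ".join(alternatives) + ")"


model = any_order_model(["name", "age", "ssn"])
print(model)

# Number of alternatives needed for 3, 5 and 10 fields.
print([factorial(n) for n in (3, 5, 10)])
```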
40Summary of XML regular expressions
- A: the tag A occurs
- e1,e2: the expression e1 followed by e2
- e*: 0 or more occurrences of e
- e?: optional -- 0 or 1 occurrences
- e+: 1 or more occurrences
- e1 | e2: either e1 or e2
- (e): grouping
41It's easy to get confused
- <!ELEMENT PARTNER (NAME?, ONETIME?, PARTNRID?, PARTNRTYPE?, SYNCIND?, ACTIVE?, CURRENCY?, DESCRIPTN?, DUNSNUMBER?, GLENTITYS?, NAME, PARENTID?, PARTNRIDX?, PARTNRRATG?, PARTNRROLE?, PAYMETHOD?, TAXEXEMPT?, TAXID?, TERMID?, USERAREA?, ADDRESS, CONTACT)>
- Cited from oagis_segments.dtd (one of the files in the Novell Developer Kit, http://developer.novell.com/ndk/indexexe.htm)
- <PARTNER> <NAME> Ben Franklin </NAME> </PARTNER>
- Q. Which NAME is it?
42Specifying attributes in the DTD
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
    dimension CDATA #REQUIRED
    accuracy CDATA #IMPLIED >
- The dimension attribute is required; the accuracy attribute is optional.
- CDATA is the type of the attribute -- it means string.
43Specifying ID and IDREF attributes
<!DOCTYPE family [
<!ELEMENT family (person*)>
<!ELEMENT person (name)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST person
    id ID #REQUIRED
    mother IDREF #IMPLIED
    father IDREF #IMPLIED
    children IDREFS #IMPLIED>
]>
44Some conforming data
<family>
  <person id="jane" mother="mary" father="john">
    <name> Jane Doe </name>
  </person>
  <person id="john" children="jane jack">
    <name> John Doe </name>
  </person>
  <person id="mary" children="jane jack">
    <name> Mary Doe </name>
  </person>
  <person id="jack" mother="mary" father="john">
    <name> Jack Doe </name>
  </person>
</family>
45Consistency of ID and IDREF attribute values
- If an attribute is declared as ID
  - the associated values must all be distinct (no confusion)
- If an attribute is declared as IDREF
  - the associated value must exist as the value of some ID attribute (no dangling pointers)
- Similarly for all the values of an IDREFS attribute
- ID and IDREF attributes are not typed
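Both conditions are easy to check once the attribute types are known. A validating parser learns them from the DTD; the sketch below hard-codes them as function parameters instead, which is an assumption of this example, as are the helper name and the deliberately incomplete family document.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_references(root, id_attr="id", idref_attrs=("mother", "father"),
                     idrefs_attrs=("children",)):
    """Return (duplicate ids, dangling references) for one document."""
    ids = Counter(e.get(id_attr) for e in root.iter() if e.get(id_attr))
    duplicates = sorted(v for v, n in ids.items() if n > 1)
    refs = []
    for e in root.iter():
        refs += [e.get(a) for a in idref_attrs if e.get(a)]
        for a in idrefs_attrs:
            refs += (e.get(a) or "").split()  # IDREFS: space-separated list
    dangling = sorted(set(r for r in refs if r not in ids))
    return duplicates, dangling


# mary and jack are referenced but never declared.
family = ET.fromstring("""
<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane jack"><name>John Doe</name></person>
</family>""")

print(check_references(family))
```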
46An alternative specification
<!DOCTYPE family [
<!ELEMENT family (person*)>
<!ELEMENT person (mother?, father?, children, name)>
<!ATTLIST person id ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT mother EMPTY>
<!ATTLIST mother idref IDREF #REQUIRED>
<!ELEMENT father EMPTY>
<!ATTLIST father idref IDREF #REQUIRED>
<!ELEMENT children EMPTY>
<!ATTLIST children idrefs IDREFS #REQUIRED>
]>
47The revised data
<family>
  <person id="jane">
    <name> Jane Doe </name>
    <mother idref="mary"></mother>
    <father idref="john"></father>
  </person>
  <person id="john">
    <name> John Doe </name>
    <children idrefs="jane jack"> </children>
  </person>
  ...
</family>
48A useful abbreviation
- When an element has empty content we can use
- <tag blahblahbla/> for <tag blahblahbla></tag>
- For example
<family>
  <person id="jane">
    <name> Jane Doe </name>
    <mother idref="mary"/>
    <father idref="john"/>
  </person>
  ...
</family>
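The two spellings are the same element as far as a parser is concerned. A quick check, assuming Python's standard `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

short = ET.fromstring('<mother idref="mary"/>')
long = ET.fromstring('<mother idref="mary"></mother>')

# Same tag, same attributes, and no text content in either form.
same = (short.tag, short.attrib, short.text) == (long.tag, long.attrib, long.text)

# Both serialize to the same bytes as well.
same_output = ET.tostring(short) == ET.tostring(long)
print(same, same_output)
```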
49Back to the object-oriented schema
class Movie
  ( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set<Actor> casts
    inverse Actor::acted_In;
  attribute int budget;
};
class Actor
  ( extent Actors, key name )
{
  attribute string name;
  relationship set<Movie> acted_In
    inverse Movie::casts;
  attribute int age;
  attribute set<string> directed;
};
50Schema.dtd
<!DOCTYPE db [
<!ELEMENT db (movie*, actor*)>
<!ELEMENT movie (title, director, casts, budget)>
<!ATTLIST movie id ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT director (#PCDATA)>
<!ELEMENT casts EMPTY>
<!ATTLIST casts idrefs IDREFS #REQUIRED>
<!ELEMENT budget (#PCDATA)>
51Schema.dtd (contd)
<!ELEMENT actor (name, acted_In, age?, directed*)>
<!ATTLIST actor id ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT acted_In EMPTY>
<!ATTLIST acted_In idrefs IDREFS #REQUIRED>
<!ELEMENT age (#PCDATA)>
<!ELEMENT directed (#PCDATA)>
]>
52More on ODL and DTD
- In May 2000 the Object Data Management Group (ODMG) suggested OIFML, an XML document type for the Object Interchange Format.
- http://www.odmg.org/library/readingroom/oifml.pdf
53Constraints on IDs and IDREFs
- ID stands for identifier. No two ID attributes in a document may have the same value, whatever the attributes are called (each value must be an XML name)
- IDREF stands for identifier reference. Every value associated with an IDREF attribute must exist as an ID attribute value
- IDREFS specifies several (one or more) identifiers, separated by spaces
54Connecting the document with its DTD
- In line
<?xml version="1.0"?>
<!DOCTYPE db [ <!ELEMENT ...> ]>
<db> ... </db>
- Another file
<!DOCTYPE db SYSTEM "schema.dtd">
- A URL
<!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">
55Well-formed and Valid Documents
- Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes
- Valid specifies that the document conforms to the DTD: conforms to regular expression grammar, types of attributes correct, and constraints on references satisfied
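The difference shows up as soon as a non-validating parser is used. In the sketch below (Python's standard `xml.etree.ElementTree`, which reads an in-line DTD but does not validate against it), a document that is well-formed but not valid is accepted without complaint; the document itself is made up for the illustration.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!DOCTYPE person [
<!ELEMENT person (name)>
<!ELEMENT name (#PCDATA)>
]>
<person><tel>(215) 898 4321</tel></person>"""

# Accepted: tags nest properly, so the document is well-formed.
root = ET.fromstring(doc)

# Not valid: the DTD requires a <name> child and declares no <tel> element.
child_tags = [child.tag for child in root]
print(root.tag, child_tags)
```

Checking validity needs a validating parser, which is not part of the Python standard library.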
56DTDs vs. Schemas (or Types)
- By database (or programming language) standards DTDs are rather weak specifications.
  - Only one base type -- PCDATA
  - No useful abstractions, e.g., sets
  - IDREFs are untyped. You point to something, but you don't know what!
  - No constraints, e.g., child is inverse of parent
  - No methods
  - Tag definitions are global
- Some of the XML extensions impose something like a schema or type on an XML document. We'll see these later
57Lots of possibilities for schemas
- XML Schema (under W3C's spotlight)
- XDR (Microsoft's BizTalk)
- SOX (Schema for Object-Oriented XML)
- Schematron
- DSD (AT&T Labs and BRICS)
- and more.
58Some tools
- XML Authority: http://www.extensibility.com/tibco/solutions/xml_authority/index.htm
- XML Spy: http://www.xmlspy.com/download.html
59Summary
- XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema).
- DTDs provide some useful syntactic constraints on documents. As schemas they are weak.
- Next slides: XML programming, XML querying.