Title: Using XML to Describe Hierarchically Structured Documents
1. Using XML to Describe Hierarchically Structured Documents
- Miles Efron
- School of Information
- UT Austin
2. The Idea of Metadata
- Assuming we know how we want to represent our documents, metadata provides us with a suitable medium.
- Metadata is structured data about information.
- Metadata typically adheres to some agreed-upon conventions. This consistency promotes interoperability, so your web browser always knows how to represent a properly structured document.
- This structure is usually expressed as attribute-value pairs:
- Element = attribute + value
- where an attribute is a feature (who, what, when, etc.) and a value assigns a specific measurement (or answer) to the feature.
3. Elements: The Attribute-Value Model
- Title: Sense and Sensibility
- Author: Jane Austen
- Year: 1811
- Chapter Heading: 1
- Text: The family of ... The End
Each document is composed of a group of elements, each of which is made up of an attribute and a value.
4. Encoding: The Grammar of Metadata
Title: Sense and Sensibility
Creator: Jane Austen
Year: 1811
Number: 1
Body: The family of ... The End
- Title: Sense and Sensibility
- Author: Jane Austen
- Year: 1811
- Chapter Heading: 1
- Text: The family of ... The End
<title>Sense and Sensibility</title>
<author>Jane Austen</author>
<year>1811</year>
<chapterHeading>1</chapterHeading>
<text>The family of ... The End</text>
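The move from attribute-value pairs to tagged elements can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree module. This is only an illustration: the wrapper element name "novel" and the helper function encode are assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Attribute-value pairs taken from the slide.
pairs = [
    ("title", "Sense and Sensibility"),
    ("author", "Jane Austen"),
    ("year", "1811"),
    ("chapterHeading", "1"),
]

def encode(pairs, wrapper="novel"):
    """Wrap each (attribute, value) pair in an element named after the attribute.

    The wrapper name is a hypothetical choice made for this sketch.
    """
    root = ET.Element(wrapper)
    for name, value in pairs:
        child = ET.SubElement(root, name)
        child.text = value
    return root

record = encode(pairs)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Each pair becomes one element: the attribute supplies the tag name and the value becomes the element's text content.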
5. SGML-Based Metadata
SGML
XML
HTML
XHTML
"A document is composed of pieces called elements. The elements nest inside each other like small boxes inside larger boxes, shaping and labeling the content of the document." (Ray 4)
6. The Idea of Markup Languages
- A markup language defines metadata that we add to a document.
- Typically markup is interspersed with data from the document itself in order to communicate the structure inherent in the document in a machine-readable format.
7. HTML: Representing Form
<html>
<body bgcolor="white">
  <b>10 September 2008</b>
  <p>Dear Dean Dillon,</p>
  <p>I would like very much to teach another section of INF384C this year. Would you please let me know if any opportunities become available?</p>
  <p>Sincerely,<br/><em>Miles</em></p>
</body>
</html>
9. HTML: a Familiar Markup Language
- Markup:
  <b>Hello, World</b>
  <ul>Hello, World</ul>
- Displayed as: Hello, World Hello, World
11. Anatomy of an Element
<example>This is an example</example>
- Start tag: <example>
- Content: This is an example
- End tag: </example>
HTML samples:
<body>here is some text</body>
<b>This is bold text</b>
12. Anatomy of an Element
<e1>This <e2>is</e2> an example</e1>
This element contains another element, in addition to its textual content.
HTML sample:
<body>This is <b>bold</b> text</body>
13. Anatomy of an Element
<example3 interesting="no">an example</example3>
Some elements contain one or more attributes. An attribute consists of a name and a value. Attributes modify the behavior of an element.
HTML sample:
<a href="miles.html">A link to my home page</a>
14. Anatomy of an Element
<example4 exampleAttribute="value" />
Some elements ONLY contain attributes. These are called empty elements.
HTML sample:
<img src="miles.jpg" alt="a terrible picture" />
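The parts of an element described on the preceding slides can be inspected programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# An element with text content and a nested child element.
e1 = ET.fromstring("<e1>This <e2>is</e2> an example</e1>")
print(e1.tag)    # the name shared by the start tag and end tag
print(e1.text)   # the text that appears before the first child
print(e1[0].tag, e1[0].text, e1[0].tail)  # child name, child text, text after the child

# An element with an attribute: a name paired with a value.
link = ET.fromstring('<a href="miles.html">A link to my home page</a>')
print(link.attrib)

# An empty element: attributes only, no content.
img = ET.fromstring('<img src="miles.jpg" alt="a terrible picture" />')
print(img.text, len(img))
```

Note that ElementTree stores the text following a child element in that child's tail, which is how mixed content such as "This is an example" is preserved.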
15. A really simple XML document
<?xml version="1.0"?>
<message>
  <exclamation>Hello, World!</exclamation>
  <paragraph>XML is <emphasis>fun</emphasis> and easy.
    <graphic fileref="image.jpg"/>
  </paragraph>
</message>
16. A really simple XML document
Elements in this document:
<?xml version="1.0"?>
<message>
  <exclamation>Hello, World!</exclamation>
  <paragraph>XML is <emphasis>fun</emphasis> and easy.
    <graphic fileref="image.jpg"/></paragraph>
</message>
message
exclamation
paragraph
graphic
emphasis
17. A really simple XML document
Elements in this document:
<?xml version="1.0"?>
<message>
  <exclamation>Hello, World!</exclamation>
  <paragraph>XML is <emphasis>fun</emphasis> and easy.
    <graphic fileref="image.jpg"/></paragraph>
</message>
These are all called nodes in the XML document tree.
message
exclamation
paragraph
graphic
emphasis
18. A really simple XML document
Elements in this document:
<?xml version="1.0"?>
<message>
  <exclamation>Hello, World!</exclamation>
  <paragraph>XML is <emphasis>fun</emphasis> and easy.
    <graphic fileref="image.jpg"/></paragraph>
</message>
This node is special. We call it the root node.
message
exclamation
paragraph
graphic
emphasis
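To see the document tree and its root node concretely, one can parse the message document and walk it. This is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name walk is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<message>
  <exclamation>Hello, World!</exclamation>
  <paragraph>XML is <emphasis>fun</emphasis> and easy.
    <graphic fileref="image.jpg"/>
  </paragraph>
</message>"""

# fromstring returns the root node of the document tree.
root = ET.fromstring(doc)

def walk(node, depth=0):
    """Return (depth, tag) for a node and all of its descendants, in document order."""
    rows = [(depth, node.tag)]
    for child in node:
        rows.extend(walk(child, depth + 1))
    return rows

tree_rows = walk(root)
for depth, tag in tree_rows:
    print("  " * depth + tag)
```

The indentation of the printed names mirrors the nesting of the elements: message is the root, and emphasis and graphic sit inside paragraph.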
19. Logical Markup and Markup for Formatting
- HTML is largely geared toward formatting documents for display in Web browsers.
- People usually use XML for describing the logical structure of documents.
- What, to your thinking, does this mean?
20. Some Virtues of XML (cf. Ray 11-13)
- Application-Specific Markup: Unlike HTML, where everyone uses the same set of tags, with XML user communities define their own element sets to suit their needs.
- Maximum Portability: Despite its expressiveness, XML is an open standard. Thus user communities can exchange metadata expressed in XML freely.
21. Some Virtues of XML (cf. Ray 11-13)
- Unambiguous Structure: Unlike HTML, XML documents are considered to be in error if they violate basic rules of syntax. While this makes XML a bit difficult to write, it makes manipulating XML easy.
22. Some Virtues of XML (cf. Ray 14-15)
- Separation of Format and Content: When written well, XML should reflect the logical structure of data, not its formatting. This allows us to change formatting as we see fit, and allows us to treat documents logically.
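One way to picture this separation: a single logically marked-up letter can be rendered in more than one format without touching the markup. The sketch below uses Python's standard library; the letter's element names and the function names fields, as_plain_text, and as_html are assumptions made for illustration.

```python
import html
import xml.etree.ElementTree as ET

# A small letter marked up by logical role, not by appearance.
letter = ET.fromstring(
    '<letter letterDate="2008-09-10">'
    "<greeting><salutation>Dear</salutation><recipient>Dean Dillon</recipient></greeting>"
    "<body>Would you please let me know if any opportunities become available?</body>"
    "<closing><signoff>Sincerely</signoff><sender>Miles</sender></closing>"
    "</letter>"
)

def fields(doc):
    """Pull the logical parts of the letter out of the tree."""
    return {
        "date": doc.get("letterDate"),
        "greeting": f"{doc.findtext('greeting/salutation')} {doc.findtext('greeting/recipient')},",
        "body": doc.findtext("body").strip(),
        "closing": f"{doc.findtext('closing/signoff')},",
        "sender": doc.findtext("closing/sender"),
    }

def as_plain_text(doc):
    """Render the letter as plain text, one logical part per line."""
    f = fields(doc)
    return "\n".join([f["date"], f["greeting"], f["body"], f["closing"], f["sender"]])

def as_html(doc):
    """Render the same letter as HTML; formatting decisions live here, not in the XML."""
    f = {key: html.escape(value) for key, value in fields(doc).items()}
    return (
        f"<b>{f['date']}</b>"
        f"<p>{f['greeting']}</p>"
        f"<p>{f['body']}</p>"
        f"<p>{f['closing']}<br/><em>{f['sender']}</em></p>"
    )

print(as_plain_text(letter))
print(as_html(letter))
```

Changing how the date or the signature looks means editing a rendering function only; the XML itself stays the same.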
23. Structure of an XML Document
<?xml version="1.0"?>
<!DOCTYPE letter SYSTEM "http://www.ibiblio.org/mefron/xml/dtd/letter.dtd">
<letter letterDate="2008-09-10">
  <greeting>
    <salutation>Dear</salutation>
    <recipient>Dean Dillon</recipient>
  </greeting>
  <body>
    I would like very much to teach another section of INF 384c this year. Would you please let me know if any opportunities become available?
  </body>
  <closing>
    <signoff>Sincerely</signoff>
    <sender>Miles</sender>
  </closing>
</letter>
24. Structure of an XML Document
PROLOGUE
<?xml version="1.0"?>
<!DOCTYPE letter SYSTEM "http://www.ibiblio.org/mefron/xml/dtd/letter.dtd">
BODY
<letter letterDate="2008-09-10">
  <greeting>
    <salutation>Dear</salutation>
    <recipient>Dean Dillon</recipient>
  </greeting>
  <body>
    I would like very much to teach another section of INF 384c this fall. Would you please let me know if any opportunities become available?
  </body>
  <closing>
    <signoff>Sincerely</signoff>
    <sender>Miles</sender>
  </closing>
</letter>
25. Structure of an XML Document
PROLOGUE
<?xml version="1.0"?>
<!DOCTYPE letter SYSTEM "http://www.ibiblio.org/mefron/xml/dtd/letter.dtd">
XML declaration: Tells the parser what language the document is expressed in. Additionally, may specify the character encoding of the document.
DOCTYPE declaration: This defines the tagset that will be used to mark up the document. DTDs may be defined locally or remotely.
26. Structure of an XML Document
BODY
<letter letterDate="2008-09-10">
  <greeting>
    <salutation>Dear</salutation>
    <recipient>Dean Dillon</recipient>
  </greeting>
  <body>
    I would like very much to teach another section of INF384C this year. Would you please let me know if any opportunities become available?
  </body>
  <closing>
    <signoff>Sincerely</signoff>
    <sender>Miles</sender>
  </closing>
</letter>
Well formed?
Valid?
27. Well formed XML
Well formed:
<list>
  <item>one</item>
  <item>two</item>
</list>
Not well formed:
<list> <item>one <item>two </list>
28. Well formed XML
Well formed:
<figure fileName="f.jpg" />
Not well formed:
<figure fileName="f.jpg" >
29. Well formed XML
Well formed:
<a>A good <b>nesting</b> example</a>
<math>2 &lt; 5</math>
Not well formed:
<a>A <b>bad</a> nesting</b>
<math>2 < 5</math>
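A parser can serve as a quick well-formedness check: if the text parses, it is well formed. The sketch below runs the examples from the preceding slides through Python's standard xml.etree.ElementTree; the function name is_well_formed and the labels in the samples dictionary are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "closed items": "<list><item>one</item><item>two</item></list>",
    "unclosed items": "<list><item>one<item>two</list>",
    "empty element": '<figure fileName="f.jpg" />',
    "unclosed empty element": '<figure fileName="f.jpg" >',
    "good nesting": "<a>A good <b>nesting</b> example</a>",
    "bad nesting": "<a>A <b>bad</a> nesting</b>",
    "escaped less-than": "<math>2 &lt; 5</math>",
    "raw less-than": "<math>2 < 5</math>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well formed' if ok else 'not well formed'}")
```

This checks syntax only. It says nothing about whether a document is valid against a DTD.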
31. Valid XML
- What does it mean for a document to be valid XML?
- It is well-formed.
- Its syntax also follows the rules specified in the document type definition (DTD) to which it refers.
- Instead of a DTD, the rules that define a valid document can be expressed using an XML schema (we'll focus more on DTDs).
33. Document Type Definitions (DTDs)
- In a markup language, a DTD serves the function of a grammar.
- It provides the rules governing how elements are expressed and combined in efforts to organize document content.
34. Document Type Definitions (DTDs)
- In the most literal sense, a DTD is a file on a computer system. Using special statements, this file defines what is legal behavior for a given markup language.
- In a more conceptual sense, a DTD operates as a document model, expressing the relationship among elements in a family of documents.
35. Document Type Definitions (DTDs)
doc1.xml:
<?xml version="1.0"?>
<!DOCTYPE rootElement SYSTEM "root.dtd">
<rootElement> </rootElement>
root.dtd:
<!ELEMENT rootElement (#PCDATA)>
36. What if I don't have a DTD?
- Without a DTD your XML can still be well-formed.
- Without a DTD your XML can still be very useful.
37. DTDs: Doctype Definitions
<!ELEMENT letter (greeting,body,closing)>
<!ELEMENT greeting (salutation?,recipient+)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT closing (signoff?,sender)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT recipient (#PCDATA)>
<!ELEMENT signoff (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ATTLIST letter letterDate CDATA #REQUIRED>
letter.dtd: file containing the DTD for the letter document.
38. DTDs: Doctype Definitions
A letter element contains 3 sub-elements: greeting, body, and closing.
<!ELEMENT letter (greeting,body,closing)>
<!ELEMENT greeting (salutation?,recipient+)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT closing (signoff?,sender)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT recipient (#PCDATA)>
<!ELEMENT signoff (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ATTLIST letter letterDate CDATA #REQUIRED>
It also has a mandatory attribute, letterDate.
letter.dtd: file containing the DTD for the letter document.
39. DTDs: Doctype Definitions
A greeting element contains 2 sub-elements: salutation (optional) and recipient (occurs at least once).
<!ELEMENT letter (greeting,body,closing)>
<!ELEMENT greeting (salutation?,recipient+)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT closing (signoff?,sender)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT recipient (#PCDATA)>
<!ELEMENT signoff (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ATTLIST letter letterDate CDATA #REQUIRED>
letter.dtd: file containing the DTD for the letter document.
40. DTDs: Doctype Definitions
<!ELEMENT letter (greeting,body,closing)>
<!ELEMENT greeting (salutation?,recipient+)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT closing (signoff?,sender)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT recipient (#PCDATA)>
<!ELEMENT signoff (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ATTLIST letter letterDate CDATA #REQUIRED>
A body element contains parsed character data, i.e. text.
letter.dtd: file containing the DTD for the letter document.
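To make the idea of a DTD as a grammar concrete, the content models for the letter can be hand-translated into regular expressions over each element's sequence of child names. This is only a sketch under stated assumptions: Python's standard library does not validate against DTDs, so the tables CONTENT_MODELS and REQUIRED_ATTRS and the function check are hypothetical stand-ins written for this example, not a DTD parser. The recipient model follows the slide text ("occurs at least once").

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: each regex matches the element's child tag
# names, written as a sequence of "name " tokens. An empty regex means the
# element holds text only (#PCDATA) and no child elements.
CONTENT_MODELS = {
    "letter": r"greeting body closing ",
    "greeting": r"(salutation )?(recipient )+",
    "closing": r"(signoff )?sender ",
    "body": r"",
    "salutation": r"",
    "recipient": r"",
    "signoff": r"",
    "sender": r"",
}
REQUIRED_ATTRS = {"letter": ["letterDate"]}

def check(element):
    """Return a list of problems found in an element and its descendants."""
    problems = []
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        problems.append(f"<{element.tag}> is not declared")
    else:
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(model, children):
            problems.append(f"<{element.tag}> has unexpected children: {children.strip()!r}")
    for name in REQUIRED_ATTRS.get(element.tag, []):
        if name not in element.attrib:
            problems.append(f"<{element.tag}> is missing attribute {name}")
    for child in element:
        problems.extend(check(child))
    return problems

good_letter = ET.fromstring(
    '<letter letterDate="2008-09-10">'
    "<greeting><salutation>Dear</salutation><recipient>Dean Dillon</recipient></greeting>"
    "<body>Would you please let me know?</body>"
    "<closing><signoff>Sincerely</signoff><sender>Miles</sender></closing>"
    "</letter>"
)
# Well formed, but missing the body element and the letterDate attribute.
bad_letter = ET.fromstring(
    "<letter>"
    "<greeting><recipient>Dean Dillon</recipient></greeting>"
    "<closing><sender>Miles</sender></closing>"
    "</letter>"
)

print(check(good_letter))
print(check(bad_letter))
```

The second letter parses without error, so it is well formed, yet it breaks two of the rules above. That is the difference between well-formedness and validity. In practice a validating parser would read the rules from the DTD itself.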
43. Creating a Simple DTD: a novel
- What are some of the elements we might keep track of in a novel?
chapter, author, year, title, text
46. Creating a Simple DTD: a novel
- Let's arrange these into a tree structure
novel
  title
  author
  year
  chapter
    text
47. Creating a Simple DTD: a novel
- Let's arrange these into a tree structure
novel
  title
  author
  year
  chapter
    text
Could some of these elements be attributes of the novel? What are the implications/motivations for making them elements or attributes?
48. Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
This just says, hey, we're defining an element now
49. Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
and the element we're defining is called novel.
50. Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
and a novel contains a title, an author, and 1 or more chapters
51. Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
<!ATTLIST novel year CDATA #IMPLIED>
Lastly, a novel has an attribute called year that contains unparsed text. Including a year is optional.
52. Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT chapter (text)>
<!ELEMENT text (#PCDATA)>
<!ATTLIST novel year CDATA #IMPLIED>
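A document that follows this structure (a title, an author, one or more chapters each holding a text element, and an optional year attribute) could be produced as follows. This is a minimal sketch using Python's standard xml.etree.ElementTree; the function name build_novel and its parameters are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def build_novel(title, author, chapters, year=None):
    """Build a <novel> element: title, author, then one <chapter><text> per chapter."""
    if not chapters:
        raise ValueError("a novel needs 1 or more chapters")
    # year is optional, so it is only set when a value is supplied.
    attrs = {"year": year} if year is not None else {}
    novel = ET.Element("novel", attrs)
    ET.SubElement(novel, "title").text = title
    ET.SubElement(novel, "author").text = author
    for body in chapters:
        chapter = ET.SubElement(novel, "chapter")
        ET.SubElement(chapter, "text").text = body
    return novel

novel = build_novel(
    "Sense and Sensibility",
    "Jane Austen",
    ["The family of ..."],
    year="1811",
)
novel_xml = ET.tostring(novel, encoding="unicode")
print(novel_xml)
```

The order in which the child elements are appended matters, because the content model lists title, author, and chapter in sequence.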
53. XML Namespaces: Combining Markup Languages
- As we read in Ray, XML isn't a markup language; it provides a way of defining your own markup language.
- Often we might want to combine two pre-existing sets of tags in a single document.
- To do this, we can use XML namespaces to clarify the relationships among our elements.
55. XML Namespaces: Combining Markup Languages
- Syntax for declaring a namespace: namespaces are declared as attributes of an element. The namespace is available to all elements below this one.
- <elementName xmlns:nsName="url">
We say that all elements below this element are in the scope of the namespace url.
56. XML Namespaces: Combining Markup Languages
- Example: combining Dublin Core and Vcard (a standard for representing business card information).
- Dublin Core elements are defined at http://purl.org/dc/elements/1.1/
- Vcard elements are defined at http://www.imc.org/rfc2426
57.
<?xml version="1.0"?>
<dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:vc="http://www.imc.org/rfc2426">
  <dc:creator>
    <vc:n>
      <vc:family>Efron</vc:family>
      <vc:given>Miles</vc:given>
    </vc:n>
  </dc:creator>
  <dc:title>Miles Efron's Home Page</dc:title>
</dc:dc>
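When a namespaced document like this one is parsed, each prefix is resolved to the namespace it was declared with. The sketch below uses Python's standard xml.etree.ElementTree, which writes qualified names as {namespace}localname. The prefix-to-URI mapping ns is supplied by the reader of the document, and the URIs are the ones named on the preceding slides, with their punctuation written out in full.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:vc="http://www.imc.org/rfc2426">
  <dc:creator>
    <vc:n>
      <vc:family>Efron</vc:family>
      <vc:given>Miles</vc:given>
    </vc:n>
  </dc:creator>
  <dc:title>Miles Efron's Home Page</dc:title>
</dc:dc>"""

# Prefixes used in queries; they need not match the prefixes used in the document.
ns = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "vc": "http://www.imc.org/rfc2426",
}

root = ET.fromstring(doc)
print(root.tag)  # the prefix has been replaced by the full namespace

family = root.find("dc:creator/vc:n/vc:family", ns).text
given = root.find("dc:creator/vc:n/vc:given", ns).text
title = root.find("dc:title", ns).text
print(given, family, "-", title)
```

Because both namespaces are declared on the outermost element, every element below it is in scope, so Dublin Core and Vcard names can be mixed freely without clashing.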
59. OAI-PMH
- Data structure standard ???
- Data communication standard ???
60. XML 1: Summary
- XML is a standard maintained by the W3C.
- XML imposes a tree structure on documents. Can we think of a document type that doesn't lend itself to a tree structure?
- The nodes of the document tree are the ______ of the document?
- (Some) types of XML elements (Ray p. 50):
  - Empty
  - Container
  - Character reference
61. XML 1: Summary
- In order to parse, an XML document must be:
  - Well formed (always)
  - Valid (under what condition?)
- A DTD defines a particular XML language (i.e. a data structure definition).
- Namespaces allow us to combine XML-encoded data structure definitions.