Title: CIS336 Website design, implementation and management also Semester 2 of CIS219, CIS221 and IT226
1 CIS336 Website design, implementation and management (also Semester 2 of CIS219, CIS221 and IT226)
- Lecture 2
- XML Documents
- (Based on Møller and Schwartzbach, 2006, Chapter 2)
David Meredith, d.meredith_at_gold.ac.uk, www.titanmusic.com/teaching/cis336-2006-7.html
2 What is XML?
- XML = Extensible Markup Language
- XML is a framework for defining markup languages
  - It is a subset of SGML
- Unlike HTML, XML has no fixed collection of tags
  - XML lets us define our own tags, designed for the kind of information we want to represent
- Each XML language is targeted at a particular application domain (e.g., chemical formulae, legal documents, religious texts, music notation)
- All XML languages use the same basic markup syntax and can benefit from a common set of generic tools for processing documents
- XML is intended to be the future of all structured information
  - Including all information previously stored in relational databases
  - Prompted development of a powerful query language, XQuery, which is designed to replace SQL
3 XML and HTML
- Despite its name, XML is not really an extensible markup language
  - It is not a single markup language at all!
  - It defines a class of markup languages and a common notation that any markup language can use
- XML is not an extension of or replacement for HTML
  - HTML should ideally be a particular application of XML, i.e., an XML language
  - HTML doesn't fit directly into the XML framework
  - So W3C designed XHTML, which is an XML-compliant variant of HTML
4 What XML doesn't do
- The XML specification says nothing about the semantics of the markup tags
  - These are specified by the individual XML languages
- XML says nothing about how an XML document should be rendered in a browser
  - Can specify an XML stylesheet (using XSL) that defines how each tag should be rendered in a browser
5 XML and interoperability
- XML is designed to be inherently internationalized and platform independent
- All XML documents must use the Unicode character set
  - Aims to cover the characters of all writing systems, past and present
- XML also deals with different line-break encodings on different platforms by normalising all such breaks to a single line-feed character (#xA)
- Defined by a public, free specification which can be viewed and implemented by anyone
6 Development of XML (see http://www.w3.org/xml/)
- XML development started in the mid 1990s
- Initial draft specification of XML produced in November 1996
  - Pure subset of SGML
- XML 1.0 became a W3C recommendation in February 1998
  - Latest version of XML 1.0 (at the time of this lecture) is the Fourth Edition, published in August 2006
  - Available here: http://www.w3.org/TR/xml/
- XML 1.1 became a W3C recommendation in February 2004
  - Latest version of XML 1.1 (at the time of this lecture) is the Second Edition, published in August 2006
  - Available here: http://www.w3.org/TR/xml11/
- XML 1.1 incorporates recent and future changes in the Unicode standard and introduces the idea of normalization of character encodings
- XML 1.1 is not fully compatible with XML 1.0
  - Many applications were written for XML 1.0, so many keep to this standard in preference to XML 1.1
- Perhaps the standard should have been simpler
  - But there is now a huge amount of technology and information that relies on the standard, so core features will probably not change
7 Limitations of HTML
- HTML tags used here to indicate structure of recipes
  - But no way of enforcing correct format for data
  - Cannot easily sort recipes or select a subset with particular features
  - Cannot easily perform computations on the recipe data
- HTML is not a good language for making a database
  - HTML is designed specifically for the hypertext domain, not other domains (like recipes)
  - In HTML, syntax and semantics (or structure and layout) are intertwined, even if we use cascading style sheets
  - For data storage, we want to store data with logical structure only so that it can be processed and formatted in all sorts of different ways
8 Recipes in XML
- Can define our own recipe markup language in XML, RecipeML
- Tags in RecipeML directly correspond to concepts in the recipe domain
  - e.g., recipe, ingredient, preparation step, etc.
  - Similar to identifying key domain abstractions in OO software engineering
- XML-ification is the process of developing an XML representation for a particular domain
- Essential information is in attributes and text between tags
- Tags indicate structure only, not layout
  - Tags provide meta-information
- For any domain, there are usually many possible markup designs, e.g.,
  - Could break up date into day, month and year
  - Could enclose ingredient list in an <ingredients> tag
- XML is semi-structured
  - Can choose level of detail at which to mark up text
- Often have to choose between using attributes or elements, e.g.,
  - name, amount, unit attributes could be tags
9 XML Language Syntax, Semantics & Use as a Database
- Define syntax of an XML language (like RecipeML, XHTML) using XML Schema
  - i.e., what tags are allowed and where they can appear in the XML document
  - e.g., preparation tag can only contain step tags and step tag can only contain text
- Define semantics of an XML language using XSLT
  - Transforms XML into an appropriate XHTML file that can be displayed in a web browser
- Use XQuery to search recipe collection and extract all sorts of information from it
- For more specialized applications can use a general-purpose programming language like Java
  - e.g., to write a web-based recipe editor, might need to use Servlets and JSP
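
As a rough illustration of searching a recipe collection from a general-purpose language, the sketch below uses Python's standard xml.etree.ElementTree module. It is only a stand-in for the kind of query XQuery is meant for, not XQuery itself. The mini collection and its element and attribute names (collection, recipe, title, ingredient, name) are assumptions made for this example and are not taken from the actual RecipeML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini collection; not the real RecipeML vocabulary.
COLLECTION = """
<collection>
  <recipe id="r1">
    <title>Rhubarb Cobbler</title>
    <ingredient name="sugar" amount="4" unit="tablespoon"/>
    <ingredient name="rhubarb" amount="4" unit="cup"/>
  </recipe>
  <recipe id="r2">
    <title>Garden Quiche</title>
    <ingredient name="egg" amount="4"/>
  </recipe>
</collection>
"""


def recipes_using(collection, ingredient_name):
    """Return the titles of recipes that list the given ingredient."""
    # ElementTree supports a small XPath subset, including [@attr='value'].
    path = "ingredient[@name='%s']" % ingredient_name
    return [
        recipe.findtext("title")
        for recipe in collection.findall("recipe")
        if recipe.find(path) is not None
    ]


collection = ET.fromstring(COLLECTION)
print(recipes_using(collection, "sugar"))
```

Because the data carries logical structure only, the same document could equally be sorted, filtered on other attributes or transformed for display.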
10 XML Trees
- Each XML document represents a hierarchical structure called an XML tree
- Various ways of describing the structure of an XML tree, but here will adopt the XPath Data Model
- XML tree can be represented graphically with root node at the top (A in top diagram)
- Edges between nodes represent parent-child relationships
  - A is parent of B; B is child of A in top diagram
- Content of a node is the sequence of its child nodes
  - Sequence (B, C, D) is content of A in top diagram
- Leaf node is one with no children
  - E, F, C and D are leaf nodes in top diagram
- XML tree is ordered, so ordering of children of a node is important
  - Two trees at right are not equivalent in XML
- Siblings of a node are the other nodes that are children of the parent of the node
  - C and D are siblings of B in top diagram
- Ancestors of a node include its parent, its parent's parent, etc. back to root node
  - A and B are the ancestors of F in top diagram
- Descendants of a node include its children, its children's children and so on
  - Descendants of A are B, E, F, C and D in top diagram
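
The tree relationships described on this slide can be checked with a short Python sketch using the standard xml.etree.ElementTree module. The document string is an assumed reconstruction of the top diagram (A with children B, C, D; B with children E, F), and the helper names ancestors, descendants and siblings are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the top diagram: A -> (B, C, D), B -> (E, F).
root = ET.fromstring("<A><B><E/><F/></B><C/><D/></A>")

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Tag names from the node's parent back up to the root."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node.tag)
    return result


def descendants(node):
    """Tag names of all descendants, in document order."""
    return [n.tag for n in node.iter() if n is not node]


def siblings(node):
    """Tag names of the other children of the node's parent."""
    parent = parent_of.get(node)
    return [] if parent is None else [n.tag for n in parent if n is not node]


b = root.find("B")
f = root.find("B/F")
content_of_a = [child.tag for child in root]          # sequence of child nodes
leaves = [n.tag for n in root.iter() if len(n) == 0]  # nodes with no children
```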
11 XML Tree Node Types
- In XPath data model, XML tree is a special ordered tree in which each node is one of the following types:
- Text nodes
  - Plain text, not an element, raw data
  - Always leaf nodes (i.e., cannot have child nodes)
  - Cannot have two consecutive sibling text nodes
  - Node labelled with text
- Element nodes
  - Logical grouping of information represented by descendants
  - Node labelled with element name
- Attribute nodes
  - Parent is always an element node
  - Specify global properties of parent element
  - Each attribute is a name-value pair where value is always a text string
  - Names of attributes of a given element must be distinct
- Comment nodes
  - Always a leaf node
  - Always contains a text string
- Processing instruction nodes
  - Used to convey specialized meta-information to XML processing tools
12 Tree view of XML recipe
- Some subtleties
  - Parent of each attribute node is an element node, but children of an element node do not include attributes
  - Attributes of an element node form an unordered set, but children of an element form a sequence (i.e., they are ordered)
- Document ordering of nodes
  - Node x occurs before node y if its start tag occurs earlier in the textual representation of the document than that of y
  - Parent precedes children, siblings ordered left-to-right
- Tree-view conventions
  - Root node drawn as a circle
  - Element nodes drawn as rounded boxes
  - Text nodes drawn as parallelograms
  - Attribute nodes drawn as rectangles containing name-value pairs
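
A small Python sketch (standard library only) can illustrate these subtleties: attribute order is irrelevant, child order matters, and attribute names within one element must be distinct. The element and attribute names used here are made up for the example.

```python
import xml.etree.ElementTree as ET

# Same attributes in a different order; same children in a different order.
a = ET.fromstring('<ingredient name="sugar" amount="4"><x/><y/></ingredient>')
b = ET.fromstring('<ingredient amount="4" name="sugar"><y/><x/></ingredient>')

# Attributes are exposed as a dict, so their order does not affect equality.
same_attributes = a.attrib == b.attrib

# Children form a sequence, so swapping them gives a different tree.
same_children = [c.tag for c in a] == [c.tag for c in b]

# Children of an element do not include its attributes.
child_tags = [c.tag for c in a]


def is_rejected(text):
    """True if the parser refuses the text as not well-formed."""
    try:
        ET.fromstring(text)
        return False
    except ET.ParseError:
        return True


# Two attributes with the same name on one element are not allowed.
duplicate_rejected = is_rejected('<ingredient name="a" name="b"/>')
```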
13 Viewing tree structure in a browser
- If you load an XML file in a modern browser and the file has no associated style sheet, then the browser typically shows its tree structure
14 Other XML data models
- Foregoing is how an XML document is described in the XPath data model
- In DOM (Document Object Model) and JDOM (Java Document Object Model), an XML tree can contain other types of nodes such as
  - Document Type nodes corresponding to Document Type Definitions (DTDs)
  - Entity reference nodes which are references to XML fragments defined in the DTD
  - CDATA nodes which are a special type of text node
15 Issues in designing an XML language
- Text nodes usually contain the actual information or data
- Elements and their attributes used to convey logical structure and meta-information about the data
- Difference between information and meta-information not always obvious
  - Some languages use elements for everything
  - Others use attributes for everything so that all elements are empty
  - Most languages use a mixture of elements and attributes
16 Textual representation of XML documents
- XML document is a Unicode text with markup tags and other meta-information representing elements, attributes and other nodes
- Text nodes are written as the text they represent (character data)
- Element nodes delimited by start and end tags
  - <related ref="42">Garden Quiche is also yummy.</related>
  - Text in between start and end tags is the content
  - This constitutes descendants of the element node
- Attributes written inside element start tag and attribute values always written within double or single quotes
  - ref="42" or ref='42'
- Empty element is one without content (i.e., nothing between start and end tags)
  - <pineapple></pineapple> or <pineapple/>
- XML document must be well-formed
  - Nodes organised into a strictly nested tree structure
  - Every start tag must have an end tag (or use abbreviated form for empty element)
  - Elements must nest properly
    - Properly nested: <banana><orange></orange></banana>
    - Improperly nested: <banana><orange></banana></orange>
  - Cf. HTML, which allows certain tags (particularly many end tags) to be omitted and whose browsers also tolerate improper nesting
- XML is case-sensitive
  - <Tag></tag> is not well-formed because the end tag does not have the same name as the start tag
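
The well-formedness rules above can be tried out with a minimal Python sketch. It relies on the fact that the standard xml.etree.ElementTree parser raises ParseError for text that is not well-formed; the helper name is_well_formed is introduced here for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the parser accepts the text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    '<related ref="42">Garden Quiche is also yummy.</related>',
    "<related ref='42'>single quotes are fine too</related>",
    "<pineapple></pineapple>",
    "<pineapple/>",
    "<banana><orange></orange></banana>",
    "<banana><orange></banana></orange>",   # improperly nested
    "<Tag></tag>",                           # names differ in case
    "<related ref=42>unquoted value</related>",
]

for sample in samples:
    print(is_well_formed(sample), sample)
```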
17 Textual representation of XML documents
- XML document should begin with an XML declaration
  - <?xml version="1.0" encoding="UTF-8" ?>
- Version attribute indicates version of XML being used
  - Should be 1.0 or 1.1
- Encoding attribute indicates encoding used in file
  - All XML parsers required to understand Unicode encodings UTF-8 and UTF-16
  - Some parsers support other popular encodings like ISO-8859-1 but must then be able to convert from these encodings to Unicode code points
  - Best to use UTF-8 or UTF-16 if possible
- XML declaration followed by root element
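
To see the encoding attribute at work, the following Python sketch encodes the same small document in three encodings and parses the resulting bytes. It assumes the standard library parser (built on Expat), which accepts UTF-8, UTF-16 and ISO-8859-1; the element name word and the helper parse_with_encoding are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_with_encoding(encoding):
    """Encode a small document in the given encoding and parse the bytes."""
    text = '<?xml version="1.0" encoding="%s"?><word>caf\u00e9</word>' % encoding
    raw = text.encode(encoding)       # the bytes as they would be stored in a file
    return ET.fromstring(raw).text    # the parser converts back to Unicode


results = {
    encoding: parse_with_encoding(encoding)
    for encoding in ("UTF-8", "UTF-16", "ISO-8859-1")
}
print(results)
```

Whatever the byte encoding of the file, the application sees the same Unicode string.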
18 Character data and attribute values
- In character data (text nodes) and attribute values, special characters have to be escaped using Unicode character references
  - &#N; denotes Unicode character with code point N represented in decimal
  - &#xN; denotes Unicode character with code point N represented in hexadecimal
- Some characters are predefined entities in XML (see table above)
- Examples
  - < can be referenced as &lt;, &#x3C; or &#60;
  - & can be referenced as &amp;, &#x26; or &#38;
- < and & must be escaped in both character data and attribute values
- In attribute values that are enclosed by " or ', this character must also be escaped
- Also use character references to encode Unicode characters that are not accessible from the keyboard
  - e.g., sake in hiragana script is さけ, which is encoded as &#x3055;&#x3051;
- Complete list of Unicode character code points is available on the Unicode website at http://www.unicode.org/charts/
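
The character references above can be checked with a short Python sketch using the standard library. The element name note and attribute name text are assumptions made for the example; escape comes from xml.sax.saxutils and escapes &, < and > in character data.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Decimal, hexadecimal and predefined-entity references in text and attributes.
elem = ET.fromstring(
    '<note text="a &lt; b &quot;c&quot;">'
    "&#60; &#x3C; &lt; &amp; &#x3055;&#x3051;"
    "</note>"
)

decoded_text = elem.text          # references are resolved by the parser
decoded_attr = elem.get("text")

# Going the other way: escape special characters before writing character data.
escaped = escape("a<b & b>c")
```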
19 CDATA sections
- If you have some text that contains lots of characters that have to be escaped, then you can enclose the text within a CDATA section
- CDATA section corresponds to a CDATA node in the DOM and JDOM data models
- For example, in most situations, <![CDATA[a<b & b>c]]> is equivalent to a&lt;b &amp; b&gt;c
- Strange syntax for CDATA sections originates in SGML
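
A minimal Python sketch of this equivalence, using only the standard library: ElementTree (which, like the XPath data model, has no separate CDATA node) yields the same text for both forms, while xml.dom.minidom, a DOM implementation, keeps the CDATA section as its own node type. The wrapper element x is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom, Node

with_cdata = "<x><![CDATA[a<b & b>c]]></x>"
with_escapes = "<x>a&lt;b &amp; b&gt;c</x>"

# Both forms produce the same character data.
text_from_cdata = ET.fromstring(with_cdata).text
text_from_escapes = ET.fromstring(with_escapes).text

# In the DOM view the CDATA section is a distinct kind of text node.
first_child = minidom.parseString(with_cdata).documentElement.firstChild
is_cdata_node = first_child.nodeType == Node.CDATA_SECTION_NODE
```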
20 Comments, processing instructions and DTD information
- Comment nodes are encoded in the source in the same way as in HTML
    <!--This is a comment-->
- A processing instruction is a target-value pair delimited by <?...?>
    <?xml-stylesheet type="text/xsl" href="mystyle.xsl"?>
  in which xml-stylesheet is the target and the string type="text/xsl" href="mystyle.xsl" is the single value
- Document type nodes (recognized in DOM and JDOM) are encoded as follows
    <!DOCTYPE ...>
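
The target-value structure of a processing instruction, and the comment node, can be inspected with Python's xml.dom.minidom. This is a sketch only; the root element name features and its text content are placeholders chosen for the example.

```python
from xml.dom import minidom, Node

SOURCE = (
    '<?xml-stylesheet type="text/xsl" href="mystyle.xsl"?>'
    "<features><!--This is a comment-->some text</features>"
)

doc = minidom.parseString(SOURCE)

# The processing instruction precedes the root element in the document.
pi = doc.firstChild
pi_is_pi = pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE
pi_target = pi.target   # the part before the first white space
pi_value = pi.data      # everything after it, as one uninterpreted string

# The comment is the first child of the root element.
comment = doc.documentElement.firstChild
comment_is_comment = comment.nodeType == Node.COMMENT_NODE
comment_text = comment.data
```

Note that the value is a single string: the parser does not treat type and href as attributes.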
21 Example XML document
- Contains an XML declaration, followed by a document type definition and then a single root element named features
- The features element contains a processing instruction, some character data and a comment
22 White space in XML
- Often convenient to use white space (spaces, tabs and new lines) to format source and make it more readable
  - Usually this white space is not supposed to be included in the delivered version of the document
- However, sometimes we want the white space to be preserved
  - e.g., in poetry or computer program source code
- By default, the way that white space is handled in an XML document is decided by the application that is used to process the document
- If we definitely want white space within the content of an element to be preserved, then we assign the value "preserve" to the attribute xml:space in that element
  - Applies to all elements within content of element where xml:space attribute value specified, unless overridden by another instance of the xml:space attribute
- Typically, white space handling is defined in the DTD for the specific language
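
The inheritance rule for xml:space can be sketched in Python as follows. The parser itself does not act on xml:space; it only reports the attribute (under the reserved XML namespace URI), so the function space_modes below is a hypothetical helper showing how an application might work out the mode in force for each element. The document and its element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved "xml" namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

DOC = (
    '<doc><poem xml:space="preserve"><line>  Roses  </line>'
    '<note xml:space="default"><p> aside </p></note></poem>'
    "<p> x </p></doc>"
)


def space_modes(elem, inherited="default", modes=None):
    """List (tag, mode) pairs, where mode is the xml:space value in force."""
    if modes is None:
        modes = []
    mode = elem.get(XML_SPACE, inherited)   # own value overrides the inherited one
    modes.append((elem.tag, mode))
    for child in elem:
        space_modes(child, mode, modes)
    return modes


modes = space_modes(ET.fromstring(DOC))
```

Here "default" means that white space handling is left to the application.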
23 Is XML too verbose?
- Some argue that XML markup is more verbose than necessary
  - Same information can often be represented much more parsimoniously in a relational database
- Leads to (misguided) advice to use attributes in preference to elements and short names for both attributes and elements
  - This usually leads to inflexible and incomprehensible language designs!
- Better to disregard such considerations in the design phase and then compress files later using either a general purpose compression program or one that is optimized for XML, such as XMill: http://sourceforge.net/projects/xmill
- To represent structured information, all you really need are text and element nodes
- One simpler alternative is to use something like Lisp S-Expressions, which date back to 1958
- For example,
    (collection
      (recipe
        (title "Rhubarb Cobbler")
        (date "Wed, 14 Jun 95")
      ))
  represents the same as
    <collection>
      <recipe>
        <title>Rhubarb Cobbler</title>
        <date>Wed, 14 Jun 95</date>
      </recipe>
    </collection>
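
The correspondence between the two notations can be made concrete with a short Python sketch that converts element and text nodes to an S-expression. The function name to_sexpr is introduced here for illustration; as a simplifying assumption it ignores attributes and any text that follows a child element.

```python
import xml.etree.ElementTree as ET


def to_sexpr(elem):
    """Render an element as an S-expression (elements and leading text only)."""
    parts = [elem.tag]
    text = (elem.text or "").strip()
    if text:
        parts.append('"%s"' % text)
    parts.extend(to_sexpr(child) for child in elem)
    return "(" + " ".join(parts) + ")"


xml_source = (
    "<collection><recipe>"
    "<title>Rhubarb Cobbler</title>"
    "<date>Wed, 14 Jun 95</date>"
    "</recipe></collection>"
)
sexpr = to_sexpr(ET.fromstring(xml_source))
print(sexpr)
```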
24 Applications of XML
- Hundreds of XML applications have been developed for many different domains: http://xml.coverpages.org/xmlApplications.html
- XML languages can be roughly classified into
  - Data-oriented languages for describing data that would traditionally have been stored in databases
    - Usually have a flat, wide structure, with the root element containing many similar children, each with a simple structure
  - Document-oriented languages (e.g., XHTML) are for annotating the structure of natural language text
    - Elements often have mixed content (elements and character data)
    - Unlike documents in a data-oriented language, documents in a document-oriented XML language can usually be understood even if the markup tags are removed
  - Protocols and programming languages including, e.g., XML Schema and XSLT
    - Usually have most complex syntax
  - Hybrids often combine features of data- and document-oriented languages
    - Typically allow freeform text as content of certain elements
    - e.g., <comment> element in RecipeML
25 Examples of XML languages: XHTML
- XHTML 1.0 is W3C's XML-ification of HTML 4.01
- Apart from the XML declaration and the XHTML namespace declaration, XHTML is very similar to HTML 4.01
- However, XHTML document must be a well-formed XML document, therefore
  - Omitting end tags is forbidden in XHTML
  - Empty elements can be abbreviated using the <.../> notation, e.g., <br/>
  - XHTML element and attribute names must be lower case
  - Attribute values cannot be omitted and must be surrounded by double or single quotes
  - e.g., attribute checked in HTML must be written checked="checked"
26 XHTML Variants
- XHTML 1.0 Strict
  - Clean markup in which all layout is specified using CSS
- XHTML 1.0 Transitional
  - Additionally permits explicit layout markup like bgcolor attribute and font tag
- XHTML 1.0 Frameset
  - Allows use of frames
- XHTML 1.1
  - Modularization of XHTML 1.0 in which language partitioned by functionality into modules, e.g.,
    - Structure: includes html, head and body tags
    - Text: includes basic text markup
    - Hypertext: includes the anchor tag (<a>)
    - Lists: ul, ol, dl, ...
    - Forms: form, input, select, ...
    - ...
  - Each module defined using a separate DTD
  - Allows specific subsets of the XHTML language to be included in new languages
27 Other XML languages
- CML (Chemical Markup Language)
  - Data-oriented language for representing molecules and chemical reactions
  - One of the first XML applications
  - Supported by wide range of tools such as browsers and editors
- WML (Wireless Markup Language)
  - Document-oriented XML language that replaces HTML on mobile devices that typically have small displays, limited user-input facilities and low bandwidth
- ebXML (Electronic Business XML Initiative)
  - Worldwide initiative to use XML for exchanging electronic business data
  - Has provided comprehensive standards for business processes, data components, collaboration protocol agreements, messaging etc.
  - Complex language that belongs to protocols and programming languages category of XML languages
- ThML (Theological Markup Language)
  - Superset of XHTML for markup of theological texts
  - Supports references, annotations, glossaries
- MusicXML
  - For encoding Western musical staff notation
- Many other XML applications and initiatives listed here: http://xml.coverpages.org/xmlApplications.html
28 Namespaces in XML
- Not part of the XML specification
  - Defined separately from XML specification
  - For XML 1.0: http://www.w3.org/TR/xml-names/
  - For XML 1.1: http://www.w3.org/TR/xml-names11/
- "XML namespaces provide a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by IRI references" (http://www.w3.org/TR/xml-names/)
29 XML Namespaces: Motivating problem
- The (fictitious) XML language, WidgetML, is designed for describing widgets
- Can include explanatory text written in XHTML within a WidgetML document
  - i.e., WidgetML uses XHTML as a sublanguage
- Example above describes a widget called gadget with a medium-sized head and a big gizmo subwidget
  - XHTML message contained within info element
- Both XHTML and non-XHTML part of WidgetML use tags big and head
  - Means that big and head tags can mean different things within a WidgetML document, depending on context
- Demonstrates need to be able to avoid name clashes when combining languages that may use elements with the same names to mean different things
- Programming languages use namespaces and qualified names to avoid clashing names
- In XML we also use namespaces and each namespace is identified by a unique URI
  - E.g., XHTML namespace is associated with the URI http://www.w3.org/1999/xhtml
  - WidgetML developed by a company called Widget Inc. whose domain is www.widget.inc, so they assign a URI under their domain to the WidgetML namespace, such as http://www.widget.inc/widgetml/
30 Namespaces
- So now, instead of just writing <head></head> inside the info element, we could conceptually write
    <{http://www.w3.org/1999/xhtml}head>...</{http://www.w3.org/1999/xhtml}head>
  to specify that we mean the head tag from XHTML, not the head tag from WidgetML
- But prefixing every tag with a long URI would lead to an extremely verbose and incomprehensible document
- Instead, we assign a short name to the namespace we want to use within an element and declare this association between the short name and the namespace as an attribute in the start tag of the containing element
    <info xmlns:foo="http://www.w3.org/1999/xhtml">
      <foo:head></foo:head>
    </info>
  We can then prefix the tags within the containing element with the short name to indicate that they are from the declared namespace
- The attribute xmlns:foo="http://www.w3.org/1999/xhtml"
  - Declares the namespace named http://www.w3.org/1999/xhtml
  - Gives this namespace the prefix foo
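
How a prefix is resolved can be seen with a minimal Python sketch. Python's xml.etree.ElementTree reports each element name in the expanded form {namespace URI}local-name, so the prefix itself disappears after parsing. The example uses the XHTML namespace URI http://www.w3.org/1999/xhtml; the prefix bar is an arbitrary second choice made for the comparison.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# The same document written with two different prefixes for one namespace.
with_foo = ET.fromstring('<info xmlns:foo="%s"><foo:head/><head/></info>' % XHTML)
with_bar = ET.fromstring('<info xmlns:bar="%s"><bar:head/><head/></info>' % XHTML)

# Prefixed head is in the XHTML namespace; unprefixed head is in no namespace.
tags_foo = [child.tag for child in with_foo]
tags_bar = [child.tag for child in with_bar]
print(tags_foo)
```

The two documents yield identical names, which shows that only the URI matters and the prefix is just a local abbreviation.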
31 Namespaces
- Namespace declaration applies to all contents of element in whose start tag it occurs
- Can use any name as a prefix except one that contains a colon or one starting with the letters XML (in any combination of upper or lower case)
  - NCName (No Colon Name) is one that does not contain a colon
  - QName (Qualified name) may be an NCName or an NCName prefixed with a namespace prefix and a colon
- Unprefixed element names are assigned a default namespace
- Default namespace can be overridden by setting attribute xmlns to the URI for the namespace to be used for unprefixed tags
    <widget type="gadget" xmlns="http://www.widget.inc">
32 Default namespaces don't apply to attributes!
- 1 and 2 are equivalent
  - The size attribute does not come from the http://www.widget.inc namespace
- In 3, the size attribute does come from the http://www.widget.inc namespace
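
The numbered examples from the slide are not reproduced in this text, so the Python sketch below uses an assumed document of the same flavour: a widget with a default namespace, the same namespace also bound to a hypothetical prefix w, and three head elements that differ only in how the element and its size attribute are written.

```python
import xml.etree.ElementTree as ET

WIDGET = "http://www.widget.inc"

# Assumed example; the prefix w and the three head variants are illustrative.
doc = ET.fromstring(
    '<widget type="gadget" xmlns="%s" xmlns:w="%s">'
    '<head size="medium"/>'      # default namespace, unprefixed attribute
    '<w:head size="medium"/>'    # explicit prefix, unprefixed attribute
    '<head w:size="medium"/>'    # default namespace, prefixed attribute
    "</widget>" % (WIDGET, WIDGET)
)

first, second, third = list(doc)

element_names = [head.tag for head in (first, second, third)]
attribute_names = [list(head.attrib) for head in (first, second, third)]
print(element_names)
print(attribute_names)
```

All three elements end up in the widget namespace, but only the explicitly prefixed attribute does: an unprefixed attribute stays in no namespace, whatever the default namespace is.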
33 RecipeML
34 Example RecipeML document
- An example RecipeML document is available at http://www.brics.dk/ixwt/examples/recipes.xml
35 Summary
- XML is a framework for developing markup languages in any conceivable domain
- XML is just a notation for hierarchically structuring textual data
- Strength of XML is that it is a widely accepted standard supported by many generic languages and tools
  - Means you get lots of free infrastructure if you build on it
- Considered XML in the form of trees and in its textual representation
- Considered the namespace mechanism for resolving name conflicts