CIS336 Website design, implementation and management (also Semester 2 of CIS219, CIS221 and IT226)

1
CIS336: Website design, implementation and management
(also Semester 2 of CIS219, CIS221 and IT226)

Lecture 2
XML Documents
(Based on Møller and Schwartzbach, 2006, Chapter
2)

David Meredith d.meredith_at_gold.ac.uk
www.titanmusic.com/teaching/cis336-2006-7.html
2
What is XML?

XML: Extensible Markup Language
XML is a framework for defining markup languages
It is a subset of SGML
No fixed collection of tags like HTML
XML lets us define our own tags, designed for the
kind of information we want to represent
Each XML language is targeted at a particular
application domain (e.g., chemical formulae,
legal documents, religious texts, music notation)
All XML languages use the same basic markup
syntax and can benefit from a common set of
generic tools for processing documents
XML is intended to be the future of all
structured information
Including all information previously stored in
relational databases
Prompted development of a powerful query language,
XQuery, intended to do for XML data what SQL does
for relational data

3
XML and HTML

XML is not an extensible markup language
It is not a single markup language at all!
It defines a class of markup languages and a
common notation that any markup language can use
XML is not an extension of or replacement for
HTML
HTML should ideally be a particular application
of XML, i.e., an XML language
HTML doesn't fit directly into the XML framework
So W3C designed XHTML which is an XML-compliant
variant of HTML

4
What XML doesn't do

XML specification says nothing about the
semantics of the markup tags
Specified by the individual XML languages
XML says nothing about how an XML document should
be rendered in a browser
Can specify an XML stylesheet (using XSL) that
defines how each tag should be rendered in a
browser

5
XML and interoperability

XML is designed to be inherently
internationalized and platform independent
All XML documents must use the Unicode character
set
contains all international characters, past and
present
XML also deals with different line-break
encodings on different platforms by normalising
all such breaks to the same sequence of
characters
Defined by a public, free specification which can
be viewed and implemented by anyone

6
Development of XML (see http://www.w3.org/xml/)

XML development started in mid 1990s
Initial draft specification of XML produced in
November 1996
Pure subset of SGML
XML 1.0 became a W3C recommendation in February
1998
Latest version of XML 1.0 is the Fourth Edition,
published in August 2006
Available here: http://www.w3.org/TR/xml/
XML 1.1 became a W3C recommendation in February
2004
Latest version of XML 1.1 is the Second Edition,
published in August 2006
Available here: http://www.w3.org/TR/xml11/
XML 1.1 incorporates recent and future changes in
the Unicode standard and introduces the idea of
normalization of character encodings
XML 1.1 is not fully compatible with XML 1.0
Many applications were written for XML 1.0, so many
users stay with that standard in preference to XML 1.1
Perhaps standard should have been simpler
But now huge amount of technology and information
that relies on the standard so core features will
probably not change

7
Limitations of HTML

HTML tags used here to indicate structure of
recipes
But no way of enforcing correct format for data
Cannot easily sort recipes or select a subset
with particular features
Cannot easily perform computations on the recipe
data
HTML is not a good language for making a database
HTML is designed specifically for the hypertext
domain, not other domains (like recipes)
In HTML, syntax and semantics (or structure and
layout) are intertwined, even if we use cascading
style sheets
For data storage, we want to store data with
logical structure only so that it can be
processed and formatted in all sorts of different
ways

8
Recipes in XML

Can define our own recipe markup language in XML,
RecipeML
Tags in RecipeML directly correspond to concepts
in the recipe domain
e.g., recipe, ingredient, preparation step, etc.
Similar to identifying key domain abstractions in
OO software engineering
XML-ification is the process of developing an XML
representation for a particular domain
Essential information is in attributes and text
between tags
Tags indicate structure only, not layout
Tags provide meta-information
For any domain, usually many possible markup
designs, e.g.,
Could break up date into day, month and year
Could enclose ingredient list in <ingredients>
tag
XML is semi-structured
Can choose level of detail at which to mark up
text
Often have to choose between using attributes or
elements, e.g.,
name, amount, unit attributes could be tags

9
XML Language Syntax, Semantics and Use as a Database

Define syntax of an XML language (like RecipeML,
XHTML) using XML Schema
i.e., what tags are allowed and where they can
appear in the XML document
e.g., preparation tag can only contain step tags
and step tag can only contain text
Define semantics of an XML language using XSLT
Transforms XML into appropriate XHTML file that
can be displayed in a web browser
Use XQuery to search recipe collection and
extract all sorts of information from it
For more specialized applications can use a
general-purpose programming language like Java
e.g., to write a web-based recipe editor, might
need to use Servlets and JSP

10
XML Trees

Each XML document represents a hierarchical
structure called an XML tree
Various ways of describing the structure of an
XML tree, but here will adopt XPath Data Model
XML tree can be represented graphically with root
node at the top (A in top diagram)
Edges between nodes represent parent-child
relationships
A is parent of B; B is child of A in top diagram
Content of a node is sequence of child nodes
Sequence (B, C, D) is content of A in top diagram
Leaf node is one with no children
E, F, C and D are leaf nodes in top diagram
XML tree is ordered so ordering of children of a
node is important
Two trees at right are not equivalent in XML
Siblings of a node are the other nodes that are
children of the parent of the node
C and D are siblings of B in top diagram
Ancestors of a node include its parent, its
parent's parent, etc. back to root node
A and B are the ancestors of F in top diagram
Descendants of a node include its children, its
children's children and so on
Descendants of A are B, E, F, C and D in top
diagram
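
The tree vocabulary above can be checked with a small sketch. It assumes the top diagram is the tree A(B(E, F), C, D), which is what the bullets imply, and uses Python's standard xml.etree.ElementTree module; the helper function names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the top diagram: A has children B, C, D; B has children E, F.
root = ET.fromstring("<A><B><E/><F/></B><C/><D/></A>")

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def children(node):
    return [c.tag for c in node]

def descendants(node):
    # iter() yields the node itself first, then its descendants in document order.
    return [n.tag for n in node.iter()][1:]

def ancestors(node):
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node.tag)
    return result

def siblings(node):
    parent = parent_of.get(node)
    return [] if parent is None else [c.tag for c in parent if c is not node]

def leaves(node):
    return [n.tag for n in node.iter() if len(n) == 0]

b = root.find("B")
f = root.find("B/F")
```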

11
XML Tree Node Types

In XPath data model, XML tree is a special
ordered tree in which each node is one of the
following types
Text nodes
Plain text, not an element, raw data
Always leaf nodes (i.e., cannot have child nodes)
Cannot have two consecutive sibling text nodes
Node labelled with text
Element nodes
Logical grouping of information represented by
descendants
Node labelled with element name
Attribute nodes
Parent is always an element node
Specify global properties of parent element
Each attribute is a name-value pair where value
is always a text string
Names of attributes of a given element must be
distinct
Comment nodes
Always a leaf node
Always contains a text string
Processing instruction nodes
Used to convey specialized meta-information to
XML processing tools

12
Tree view of XML recipe

Some subtleties
Parent of each attribute node is an element node,
but children of an element node do not include
attributes
Attributes of an element node form an unordered
set but children of an element form an ordered
set (or sequence)
Document ordering of nodes
Node x occurs before node y if its start tag
occurs earlier in the textual representation of
the document than that of y
Parent precedes children, siblings ordered
left-to-right
Tree-view conventions
Root node drawn as a circle
Element nodes drawn as rounded boxes
Text nodes drawn as parallelograms
Attribute nodes drawn as rectangles containing
name-value pairs
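
A minimal sketch of the subtleties listed above, using Python's standard xml.etree.ElementTree module. The element and attribute names (step, time, tool, a, b) are hypothetical and chosen only for the illustration.

```python
import xml.etree.ElementTree as ET

# Same attributes in a different order, and same children in a different order.
e1 = ET.fromstring('<step time="5" tool="whisk"><a/><b/></step>')
e2 = ET.fromstring('<step tool="whisk" time="5"><b/><a/></step>')

# Attributes form an unordered set: the order in the source does not matter.
same_attributes = e1.attrib == e2.attrib

# Children form a sequence: swapping them gives a different tree.
same_child_order = [c.tag for c in e1] == [c.tag for c in e2]

# Attributes are reached through the element, not through its list of children.
child_tags = [c.tag for c in e1]
attribute_names = sorted(e1.attrib)
```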

13
Viewing tree structure in a browser

If you load an XML file in a modern browser and
the file has no associated style sheet, then its
tree structure is shown

14
Other XML data models

Foregoing is how XML document described in XPath
data model
In DOM (Document Object Model) and JDOM (Java
Document Object Model), an XML tree can contain
other types of nodes such as
Document Type nodes corresponding to Document
Type Definitions (DTDs)
Entity reference nodes which are references to
XML fragments defined in the DTD schema
CDATA nodes which are a special type of text node

15
Issues in designing an XML language

Text nodes usually contain the actual information
or data
Elements and their attributes used to convey
logical structure and meta-information about the
data
Difference between information and
meta-information not always obvious
Some languages use elements for everything
Others use attributes for everything so that all
elements are empty
Most languages use a mixture of elements and
attributes

16
Textual representation of XML documents

XML document is a Unicode text with markup tags
and other meta-information representing elements,
attributes and other nodes
Text nodes are written as the text they represent
(character data)
Element nodes delimited by start and end tags
<related ref="42">Garden Quiche is also yummy.</related>
Text in between start and end tags is the content
This constitutes descendants of the element node
Attributes written inside element start tag and
attribute values always written within double or
single quotes
ref="42" or ref='42'
Empty element is one without content (i.e.,
nothing between start and end tags)
<pineapple></pineapple> or <pineapple/>
XML document must be well-formed
Nodes organised into a strictly nested tree
structure
Every start tag must have an end tag (or use
abbreviated form for empty element)
Elements must nest properly
Properly nested: <banana><orange></orange></banana>
Improperly nested: <banana><orange></banana></orange>
Cf. HTML which allows certain tags (particularly
many end tags) to be omitted and also allows
improper nesting
XML is case-sensitive
<Tag></tag> is not well-formed because end tag
not the same name as start tag
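
The well-formedness rules above can be tried out with any XML parser. The sketch below assumes Python's standard xml.etree.ElementTree module, which raises ParseError on input that is not well-formed; the function name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

properly_nested = is_well_formed("<banana><orange></orange></banana>")
improperly_nested = is_well_formed("<banana><orange></banana></orange>")
case_mismatch = is_well_formed("<Tag></tag>")
empty_element = is_well_formed("<pineapple/>")
quoted_attribute = is_well_formed('<related ref="42">Garden Quiche is also yummy.</related>')
unquoted_attribute = is_well_formed("<related ref=42>Garden Quiche is also yummy.</related>")
```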

17
Textual representation of XML documents

XML document should begin with an XML
declaration
<?xml version="1.0" encoding="UTF-8" ?>
Version attribute indicates version of XML being
used
Should be 1.0 or 1.1
Encoding attribute indicates encoding used in
file
All XML parsers required to understand Unicode
encodings UTF-8 and UTF-16
Some parsers support other popular encodings like
ISO-8859-1 but must then be able to convert from
these encodings to Unicode code points
Best to use UTF-8 or UTF-16 if possible
XML declaration followed by root element
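
As a rough illustration of the encoding declaration, the sketch below stores the same document in three encodings and parses each one back to the same Unicode text. It assumes Python's standard xml.etree.ElementTree module (whose underlying parser happens to support ISO-8859-1 as well); the author element and the helper encode_document are invented for the example.

```python
import xml.etree.ElementTree as ET

def encode_document(name, encoding):
    """Build a tiny document and store it as bytes in the declared encoding."""
    text = f'<?xml version="1.0" encoding="{encoding}"?><author>{name}</author>'
    return text.encode(encoding)

# The parser reads the declaration and converts the bytes to Unicode code points.
decoded = {
    enc: ET.fromstring(encode_document("Møller", enc)).text
    for enc in ("UTF-8", "UTF-16", "ISO-8859-1")
}
```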

18
Character data and attribute values

In character data (text nodes) and attribute
values, special characters have to be escaped
using Unicode character references
&#N; denotes Unicode character with code point N
represented in decimal
&#xN; denotes Unicode character with code point N
represented in hexadecimal
Some characters are predefined entities in XML
(see table above)
Examples
< can be referenced as &lt;, &#x3C; or &#60;
& can be referenced as &amp;, &#x26; or &#38;
< and & must be escaped in both character data
and attribute values
In attribute values that are enclosed by " or '
this character must also be escaped
Also use character references to encode Unicode
characters that are not accessible from the
keyboard
e.g., sake in hiragana script is さけ,
which is encoded as &#x3055;&#x3051;
Complete list of Unicode character code points is
available on the Unicode website at
http://www.unicode.org/charts/
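
A short sketch of how a parser resolves these references, assuming Python's standard xml.etree.ElementTree and xml.sax.saxutils modules. The element name t and attribute name quote are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Entity, hexadecimal and decimal references to < and &, plus two hiragana characters.
doc = ET.fromstring(
    '<t quote="say &quot;hi&quot;">'
    "&lt; &#x3C; &#60; &amp; &#x26; &#38; &#x3055;&#x3051;"
    "</t>"
)
text = doc.text
quote = doc.get("quote")

# Going the other way: escape raw character data before writing it into a document.
escaped = escape("a<b & b>c")
```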

19
CDATA sections

If you have some text that contains lots of
characters that have to be escaped, then you can
enclose the text within a CDATA section
CDATA section corresponds to a CDATA node in the
DOM and JDOM data models
For example, in most situations, <![CDATA[a<b & b>c]]>
is equivalent to a&lt;b &amp; b&gt;c
Strange syntax for CDATA sections originates in
SGML
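
The equivalence can be checked with a small sketch, assuming Python's standard library parsers. ElementTree does not distinguish CDATA sections from ordinary text, while xml.dom.minidom (a DOM implementation) keeps a separate CDATA node. The element name expr is made up.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom, Node

cdata_source = "<expr><![CDATA[a<b & b>c]]></expr>"
escaped_source = "<expr>a&lt;b &amp; b&gt;c</expr>"

# Both forms yield the same character data.
with_cdata = ET.fromstring(cdata_source).text
with_escapes = ET.fromstring(escaped_source).text

# In the DOM view the CDATA section shows up as its own node type.
dom_child_type = minidom.parseString(cdata_source).documentElement.firstChild.nodeType
```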

20
Comments, processing instructions and DTD
information

Comment nodes are encoded in the source in the
same way as in HTML: <!--This is a comment-->
A processing instruction is a target-value pair
delimited by <?...?>, e.g.,
<?xml-stylesheet type="text/xsl" href="mystyle.xsl"?>
in which xml-stylesheet is the target and the string
type="text/xsl" href="mystyle.xsl" is the single value
Document type nodes (recognized in DOM and JDOM)
are encoded as follows: <!DOCTYPE ...>
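
A minimal sketch of how a DOM parser exposes comments and processing instructions, assuming Python's standard xml.dom.minidom module. The root element name is hypothetical.

```python
from xml.dom import minidom, Node

source = (
    '<?xml-stylesheet type="text/xsl" href="mystyle.xsl"?>'
    "<!--This is a comment-->"
    "<root/>"
)
doc = minidom.parseString(source)

# The document node has three children here: the instruction, the comment, the root element.
pi, comment, root = doc.childNodes

pi_target = pi.target      # the target of the processing instruction
pi_value = pi.data         # everything after the target, as one string
comment_text = comment.data
```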

21
Example XML document

Contains an XML declaration, followed by a
document type definition and then a single root
element named features
The features element contains a processing
instruction, some character data and a comment

22
White space in XML

Often convenient to use white space (spaces,
tabs and new lines) to format source and make it
more readable
Usually this white space is not supposed to be
included in the delivered version of the document
However, sometimes we want the white space to be
preserved
e.g., in poetry or computer program source code
By default, the way that white space is handled
in an XML document is decided by the application
that is used to process the document
If we definitely want white space within the
content of an element to be preserved, then we
assign the value "preserve" to the attribute
xml:space in that element
Applies to all elements within content of element
where xml:space attribute value specified, unless
overridden by another instance of the xml:space
attribute
Typically, white space handling is defined in the
DTD for the specific language
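
The scoping rule for xml:space can be sketched as below. This is only one way an application might track it, assuming Python's standard xml.etree.ElementTree module; the poem, line, note and p elements and the function whitespace_modes are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def whitespace_modes(elem, inherited="default"):
    """Yield (tag, mode) pairs; a mode is inherited unless the element overrides it."""
    mode = elem.get(XML_SPACE, inherited)
    yield elem.tag, mode
    for child in elem:
        yield from whitespace_modes(child, mode)

doc = ET.fromstring(
    '<poem xml:space="preserve">'
    "<line>  roses   are red  </line>"
    '<note xml:space="default"><p>  an  aside </p></note>'
    "</poem>"
)
modes = dict(whitespace_modes(doc))
```

What an application then does with text in "default" mode (for example, collapsing runs of spaces) is up to that application.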

23
Is XML too verbose?

Some argue that XML markup is more verbose than
necessary
Same information can often be represented much
more parsimoniously in a relational database
Leads to (misguided) advice to use attributes in
preference to elements and short names for both
attributes and elements
This usually leads to inflexible and
incomprehensible language designs!
Better to disregard such considerations in the
design phase and then compress files later using
either a general purpose compression program or
one that is optimized for XML, such as XMill
http://sourceforge.net/projects/xmill
To represent structured information, all you
really need are text and element nodes
One simpler alternative is to use something like
Lisp S-Expressions which date back to 1958
For example,
(collection (recipe (title "Rhubarb Cobbler") (date "Wed, 14 Jun 95")))
represents the same as
<collection> <recipe> <title>Rhubarb Cobbler</title>
<date>Wed, 14 Jun 95</date> </recipe> </collection>
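
The correspondence between the two notations can be sketched with a small converter. It assumes Python's standard xml.etree.ElementTree module and, as a simplification, ignores attributes and any text that follows a child element; the function name to_sexpr is made up.

```python
import xml.etree.ElementTree as ET

def to_sexpr(elem):
    """Render an element as (tag "text" child...), using only text and element nodes."""
    parts = [elem.tag]
    text = (elem.text or "").strip()
    if text:
        parts.append('"' + text.replace('"', '\\"') + '"')
    parts.extend(to_sexpr(child) for child in elem)
    return "(" + " ".join(parts) + ")"

xml_source = """<collection>
  <recipe>
    <title>Rhubarb Cobbler</title>
    <date>Wed, 14 Jun 95</date>
  </recipe>
</collection>"""

sexpr = to_sexpr(ET.fromstring(xml_source))
```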

24
Applications of XML

Hundreds of XML applications have been developed
for many different domains:
http://xml.coverpages.org/xmlApplications.html
XML languages can be roughly classified into
Data-oriented languages for describing data that
would traditionally have been stored in
databases.
Usually have a flat, wide structure, with the
root element containing many similar children,
each with a simple structure
Document-oriented languages (e.g., XHTML) are for
annotating the structure of natural language text
Elements often have mixed content (elements and
character data)
Unlike documents in a data-oriented language,
documents in a document-oriented XML language can
usually be understood even if the markup tags are
removed
Protocols and programming languages including,
e.g., XML Schema and XSLT
Usually have most complex syntax
Hybrids often combine features of data- and
document-oriented languages
Typically allow freeform text as content of
certain elements
e.g., <comment> element in RecipeML

25
Examples of XML languages: XHTML

XHTML 1.0 is W3C's XML-ification of HTML 4.01
Apart from the XML declaration and the XHTML
namespace declaration, XHTML is very similar to
HTML 4.01
However, XHTML document must be a well-formed XML
document, therefore
Omitting end tags is forbidden in XHTML
Empty elements can be abbreviated using the
empty-element tag notation, e.g., <br/>
XHTML element and attribute names must be lower
case
Attribute values cannot be omitted and must be
surrounded by double or single quotes
e.g., attribute checked in HTML must be written
checked="checked"

26
XHTML Variants

XHTML 1.0 Strict
Clean markup in which all layout is specified
using CSS
XHTML 1.0 Transitional
Additionally permits explicit layout markup like
bgcolor attribute and font tag
XHTML 1.0 Frameset
Allows use of frames
XHTML 1.1
Modularization of XHTML 1.0 in which language
partitioned by functionality into modules, e.g.,
Structure: includes html, head and body tags
Text: includes basic text markup
Hypertext: includes the anchor tag (<a>)
Lists: ul, ol, dl, ...
Forms: form, input, select, ...
Each module defined using a separate DTD
Allows specific subsets of the XHTML language to
be included in new languages

27
Other XML languages

CML (Chemical Markup Language)
Data-oriented language for representing molecules
and chemical reactions
One of the first XML applications
Supported by wide range of tools such as browsers
and editors
WML (Wireless Markup Language)
Document-oriented XML language that replaces HTML
on mobile devices that typically have small
displays, limited user-input facilities and low
bandwidth
ebXML (Electronic Business XML Initiative)
Worldwide initiative to use XML for exchanging
electronic business data
Has provided comprehensive standards for business
processes, data components, collaboration
protocol agreements, messaging etc.
Complex language that belongs to protocols and
programming languages category of XML languages
ThML (Theological Markup Language)
Superset of XHTML for markup of theological texts
Supports references, annotations, glossaries
MusicXML
For encoding Western musical staff notation
Many other XML applications and initiatives
listed here: http://xml.coverpages.org/xmlApplications.html

28
Namespaces in XML

Not part of the XML specification
Defined separately from XML specification
For XML 1.0:
http://www.w3.org/TR/xml-names/
For XML 1.1:
http://www.w3.org/TR/xml-names11/
XML namespaces provide a simple method for
qualifying element and attribute names used in
Extensible Markup Language documents by
associating them with namespaces identified by
IRI references (http://www.w3.org/TR/xml-names/)

29
XML Namespaces: Motivating problem

The (fictitious) XML language, WidgetML, is
designed for describing widgets
Can include explanatory text written in XHTML
within a WidgetML document
i.e., WidgetML uses XHTML as a sublanguage
Example above describes a widget called gadget
with a medium-sized head and a big gizmo
subwidget
XHTML message contained within info element
Both XHTML and non-XHTML part of WidgetML use
tags big and head
Means that big and head tags can mean different
things within a WidgetML document, depending on
context
Demonstrates need to be able to avoid name
clashes when combining languages that may use
elements with the same names to mean different
things
Programming languages use namespaces and
qualified names to avoid clashing names
In XML we also use namespaces and each namespace
is identified by a unique URI
E.g., XHTML namespace is associated with the URI
http://www.w3.org/1999/xhtml
WidgetML developed by a company called Widget
Inc. whose domain is www.widget.inc, so assigns a
URI under their domain to the WidgetML namespace,
such as http://www.widget.inc/widgetml/

30
Namespaces

So now, instead of just writing <head>...</head>
inside the info element, we can write
<http://www.w3.org/1999/xhtml:head>...</http://www.w3.org/1999/xhtml:head>
to specify that we mean the head tag from XHTML, not
the head tag from WidgetML
But prefixing every tag with a long URI would
lead to an extremely verbose and incomprehensible
document
Instead, we assign a short name to the namespace
we want to use within an element and declare this
association between the short name and the
namespace as an attribute in the start tag of the
containing element:
<info xmlns:foo="http://www.w3.org/TR/xhtml1">
  <foo:head>...</foo:head>
</info>
We can then prefix the tags within the
containing element with the short name to
indicate that they are from the declared
namespace
The attribute xmlns:foo="http://www.w3.org/TR/xhtml1"
Declares the namespace named http://www.w3.org/TR/xhtml1
Gives this namespace the prefix foo
31
Namespaces

Namespace declaration applies to all contents of
element in whose start tag it occurs
Can use any name as a prefix except one that
contains a colon or one starting with the letters
XML (in any combination of upper or lower case)
NCName (No Colon Name) is one that does not
contain a colon
QName (Qualified name) may be an NCName or an
NCName prefixed with a namespace prefix and a
colon
Unprefixed element names are assigned a default
namespace
Default namespace can be overridden by setting
attribute xmlns to the URI for the namespace to
be used for unprefixed tags:
<widget type="gadget" xmlns="http://www.widget.inc">

32
Default namespaces don't apply to attributes!

1 and 2 are equivalent
The size attribute does not come from the
http://www.widget.inc namespace
In 3, the size attribute does come from the
http://www.widget.inc namespace
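
The numbered examples on the slide are not reproduced in this transcript, so the three documents below are a guess at their general shape rather than the originals. The prefix w, the head element and the value medium are assumptions; Python's standard xml.etree.ElementTree module is used to show the resolved names.

```python
import xml.etree.ElementTree as ET

NS = "http://www.widget.inc"

# Default namespace on the element, unprefixed attribute.
default_ns = ET.fromstring(f'<head xmlns="{NS}" size="medium"/>')

# Explicit prefix on the element, unprefixed attribute.
prefixed_elem = ET.fromstring(f'<w:head xmlns:w="{NS}" size="medium"/>')

# Explicit prefix on both the element and the attribute.
prefixed_attr = ET.fromstring(f'<w:head xmlns:w="{NS}" w:size="medium"/>')

qualified_head = "{" + NS + "}head"
qualified_size = "{" + NS + "}size"
```

In the first two documents the element is in the namespace but the size attribute is not; only the explicitly prefixed attribute in the third is.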

33
RecipeML
34
Example RecipeML document

An example RecipeML document is available
at http://www.brics.dk/ixwt/examples/recipes.xml

35
Summary

XML is a framework for developing markup
languages in any conceivable domain
XML is just a notation for hierarchically
structuring textual data
Strength of XML is that it is a widely accepted
standard supported by many generic languages and
tools
Means you get lots of free infrastructure if you
build on it
Considered XML in the form of trees and in its
textual representation
Considered the namespace mechanism for resolving
name conflicts