CIS336 Website design, implementation and management also Semester 2 of CIS219, CIS221 and IT226 - PowerPoint PPT Presentation


1
CIS336 Website design, implementation and
management (also Semester 2 of CIS219, CIS221 and
IT226)
  • Lecture 2
  • XML Documents
  • (Based on Møller and Schwartzbach, 2006, Chapter
    2)

David Meredith d.meredith_at_gold.ac.uk
www.titanmusic.com/teaching/cis336-2006-7.html
2
What is XML?
  • XML stands for Extensible Markup Language
  • XML is a framework for defining markup languages
  • It is a subset of SGML
  • No fixed collection of tags like HTML
  • XML lets us define our own tags, designed for the
    kind of information we want to represent
  • Each XML language is targeted at a particular
    application domain (e.g., chemical formulae,
    legal documents, religious texts, music notation)
  • All XML languages use the same basic markup
    syntax and can benefit from a common set of
    generic tools for processing documents
  • XML is intended to be the future of all
    structured information
  • Including all information previously stored in
    relational databases
  • Prompted development of a powerful query language,
    XQuery, intended to do for XML data what SQL does
    for relational data

3
XML and HTML
  • XML is not an extensible markup language
  • It is not a single markup language at all!
  • It defines a class of markup languages and a
    common notation that any markup language can use
  • XML is not an extension of or replacement for
    HTML
  • HTML should ideally be a particular application
    of XML, i.e., an XML language
  • HTML doesn't fit directly into the XML framework
  • So W3C designed XHTML which is an XML-compliant
    variant of HTML

4
What XML doesn't do
  • XML specification says nothing about the
    semantics of the markup tags
  • Specified by the individual XML languages
  • XML says nothing about how an XML document should
    be rendered in a browser
  • Can specify an XML stylesheet (using XSL) that
    defines how each tag should be rendered in a
    browser

5
XML and interoperability
  • XML is designed to be inherently
    internationalized and platform independent
  • All XML documents must use the Unicode character
    set
  • contains all international characters, past and
    present
  • XML also deals with different line-break
    encodings on different platforms by normalising
    all such breaks to the same sequence of
    characters
  • Defined by a public, free specification which can
    be viewed and implemented by anyone

6
Development of XML (see http://www.w3.org/xml/)
  • XML development started in mid 1990s
  • Initial draft specification of XML produced in
    November 1996
  • Pure subset of SGML
  • XML 1.0 became a W3C recommendation in February
    1998
  • Latest version of XML 1.0 is the Fourth Edition,
    published in August 2006
  • Available here: http://www.w3.org/TR/xml/
  • XML 1.1 became a W3C recommendation in February
    2004
  • Latest version of XML 1.1 is the Second Edition,
    published in August 2006
  • Available here: http://www.w3.org/TR/xml11/
  • XML 1.1 incorporates recent and future changes in
    the Unicode standard and introduces the idea of
    normalization of character encodings
  • XML 1.1 is not fully compatible with XML 1.0
  • Many applications written in XML 1.0, so many
    keep with this standard in preference to XML 1.1
  • Perhaps standard should have been simpler
  • But now huge amount of technology and information
    that relies on the standard so core features will
    probably not change

7
Limitations of HTML
  • HTML tags used here to indicate structure of
    recipes
  • But no way of enforcing correct format for data
  • Cannot easily sort recipes or select a subset
    with particular features
  • Cannot easily perform computations on the recipe
    data
  • HTML is not a good language for making a database
  • HTML is designed specifically for the hypertext
    domain, not other domains (like recipes)
  • In HTML, syntax and semantics (or structure and
    layout) are intertwined, even if we use cascading
    style sheets
  • For data storage, we want to store data with
    logical structure only so that it can be
    processed and formatted in all sorts of different
    ways

8
Recipes in XML
  • Can define our own recipe markup language in XML,
    RecipeML
  • Tags in RecipeML directly correspond to concepts
    in the recipe domain
  • e.g., recipe, ingredient, preparation step, etc.
  • Similar to identifying key domain abstractions in
    OO software engineering
  • XML-ification is the process of developing an XML
    representation for a particular domain
  • Essential information is in attributes and text
    between tags
  • Tags indicate structure only, not layout
  • Tags provide meta-information
  • For any domain, usually many possible markup
    designs, e.g.,
  • could break up date into day, month and year
  • Could enclose ingredient list in <ingredients>
    tag
  • XML is semi-structured
  • Can choose level of detail at which to mark up
    text
  • Often have to choose between using attributes or
    elements, e.g.,
  • name, amount, unit attributes could be tags

9
XML Language Syntax, Semantics & Use as a Database
  • Define syntax of an XML language (like RecipeML,
    XHTML) using XML Schema
  • i.e., what tags are allowed and where they can
    appear in the XML document
  • e.g., preparation tag can only contain step tags
    and step tag can only contain text
  • Define semantics of an XML language using XSLT
  • Transforms XML into appropriate XHTML file that
    can be displayed in a web browser
  • Use XQuery to search recipe collection and
    extract all sorts of information from it
  • For more specialized applications can use a
    general-purpose programming language like Java
  • e.g., to write a web-based recipe editor, might
    need to use Servlets and JSP

10
XML Trees
  • Each XML document represents a hierarchical
    structure called an XML tree
  • Various ways of describing the structure of an
    XML tree, but here will adopt XPath Data Model
  • XML tree can be represented graphically with root
    node at the top (A in top diagram)
  • Edges between nodes represent parent-child
    relationships
  • A is parent of B; B is child of A in top diagram
  • Content of a node is sequence of child nodes
  • Sequence (B, C, D) is content of A in top diagram
  • Leaf node is one with no children
  • E, F, C and D are leaf nodes in top diagram
  • XML tree is ordered so ordering of children of a
    node is important
  • Two trees at right are not equivalent in XML
  • Siblings of a node are the other nodes that are
    children of the parent of the node
  • C and D are siblings of B in top diagram
  • Ancestors of a node include its parent, its
    parent's parent, etc. back to root node
  • A and B are the ancestors of F in top diagram
  • Descendants of a node include its children, its
    children's children and so on
  • Descendants of A are B, E, F, C and D in top
    diagram
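
The relationships above can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration: the document string is an assumption that mirrors the shape described for the top diagram (A with children B, C, D; B with children E, F), the helper names are made up for this sketch, and ElementTree's model is simpler than the XPath data model (for instance, text is not a separate node).

```python
import xml.etree.ElementTree as ET

# Assumed document with the shape described for the top diagram.
doc = "<A><B><E/><F/></B><C/><D/></A>"
root = ET.fromstring(doc)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """Tag names from the node's parent back to the root."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node.tag)
    return result

def descendants(node):
    """Tag names of all descendants, in document order."""
    return [n.tag for n in node.iter() if n is not node]

def siblings(node):
    """Tag names of the other children of the node's parent."""
    parent = parent_of.get(node)
    return [] if parent is None else [n.tag for n in parent if n is not node]

def leaves(node):
    """Tag names of nodes in the subtree that have no children."""
    return [n.tag for n in node.iter() if len(n) == 0]

b = root.find("B")
f = root.find("B/F")
print(ancestors(f), siblings(b), descendants(root), leaves(root))
```

Because children are stored as a sequence, swapping C and D in the source string would give a different tree, which matches the point that XML trees are ordered.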

11
XML Tree Node Types
  • In XPath data model, XML tree is a special
    ordered tree in which each node is one of the
    following types
  • Text nodes
  • Plain text, not an element, raw data
  • Always leaf nodes (i.e., cannot have child nodes)
  • Cannot have two consecutive sibling text nodes
  • Node labelled with text
  • Element nodes
  • Logical grouping of information represented by
    descendants
  • Node labelled with element name
  • Attribute nodes
  • Parent is always an element node
  • Specify global properties of parent element
  • Each attribute is a name-value pair where value
    is always a text string
  • Names of attributes of a given element must be
    distinct
  • Comment nodes
  • Always a leaf node
  • Always contains a text string
  • Processing instruction nodes
  • Used to convey specialized meta-information to
    XML processing tools

12
Tree view of XML recipe
  • Some subtleties
  • Parent of each attribute node is an element node,
    but children of an element node do not include
    attributes
  • Attributes of an element node form an unordered
    set but children of an element form an ordered
    set (or sequence)
  • Document ordering of nodes
  • Node x occurs before node y if its start tag
    occurs earlier in the textual representation of
    the document than that of y
  • Parent precedes children, siblings ordered
    left-to-right
  • Tree-view conventions
  • Root node drawn as a circle
  • Element nodes drawn as rounded boxes
  • Text nodes drawn as parallelograms
  • Attribute nodes drawn as rectangles containing
    name value pairs
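
A small Python sketch of the subtleties above, using xml.etree.ElementTree. The ingredient fragments are hypothetical and only loosely follow the recipe example; the point is that attributes compare as an unordered set of name-value pairs and are not among the children, while children form an ordered sequence.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; element and attribute names are illustrative only.
first = ET.fromstring(
    '<ingredient name="flour" amount="3" unit="cup">'
    "<note>sifted</note><note>plain</note></ingredient>"
)
# Same attributes written in a different order, children swapped.
second = ET.fromstring(
    '<ingredient unit="cup" amount="3" name="flour">'
    "<note>plain</note><note>sifted</note></ingredient>"
)

# Attributes form an unordered set, so the order they were written in is irrelevant.
same_attributes = first.attrib == second.attrib

# Children form a sequence, so swapping them changes the tree.
same_children = [c.text for c in first] == [c.text for c in second]

# Attributes are not children: only the two note elements are counted.
child_count = len(first)

# iter() visits elements in document order: parent first, then children left to right.
document_order = [node.tag for node in first.iter()]

print(same_attributes, same_children, child_count, document_order)
```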

13
Viewing tree structure in a browser
  • If you load an XML file in a modern browser and
    the file has no associated style sheet, then its
    tree structure is shown

14
Other XML data models
  • Foregoing is how XML document described in XPath
    data model
  • In DOM (Document Object Model) and JDOM (Java
    Document Object Model), an XML tree can contain
    other types of nodes such as
  • Document Type nodes corresponding to Document
    Type Definitions (DTDs)
  • Entity reference nodes which are references to
    XML fragments defined in the DTD schema
  • CDATA nodes which are a special type of text node

15
Issues in designing an XML language
  • Text nodes usually contain the actual information
    or data
  • Elements and their attributes used to convey
    logical structure and meta-information about the
    data
  • Difference between information and
    meta-information not always obvious
  • Some languages use elements for everything
  • Others use attributes for everything so that all
    elements are empty
  • Most languages use a mixture of elements and
    attributes

16
Textual representation of XML documents
  • XML document is a Unicode text with markup tags
    and other meta-information representing elements,
    attributes and other nodes
  • Text nodes are written as the text they represent
    (character data)
  • Element nodes delimited by start and end tags
  • <related ref="42">Garden Quiche is also
    yummy.</related>
  • Text in between start and end tags is the content
  • This constitutes descendants of the element node
  • Attributes written inside element start tag and
    attribute values always written within double or
    single quotes
  • ref="42" or ref='42'
  • Empty element is one without content (i.e.,
    nothing between start and end tags)
  • <pineapple></pineapple> or <pineapple/>
  • XML document must be well-formed
  • Nodes organised into a strictly nested tree
    structure
  • Every start tag must have an end tag (or use
    abbreviated form for empty element)
  • Elements must nest properly
  • Properly nested:
    <banana><orange></orange></banana>
  • Improperly nested:
    <banana><orange></banana></orange>
  • Cf. HTML which allows certain tags (particularly
    many end tags) to be omitted and also allows
    improper nesting
  • XML is case-sensitive
  • <Tag></tag> is not well-formed because end tag
    not the same name as start tag
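
The well-formedness rules above can be tried out with any XML parser. A minimal sketch in Python, assuming the standard library's xml.etree.ElementTree (which rejects input that is not well-formed by raising ParseError); the helper name is_well_formed is made up for this illustration and the samples are the ones from this slide plus one unquoted attribute value.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    '<related ref="42">Garden Quiche is also yummy.</related>',
    "<pineapple></pineapple>",
    "<pineapple/>",
    "<banana><orange></orange></banana>",   # properly nested
    "<banana><orange></banana></orange>",   # improperly nested
    "<Tag></tag>",                          # names differ in case
    "<related ref=42>text</related>",       # attribute value not quoted
]

for sample in samples:
    print(is_well_formed(sample), sample)
```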

17
Textual representation of XML documents
  • XML document should begin with an XML
    declaration
  • <?xml version="1.0" encoding="UTF-8" ?>
  • Version attribute indicates version of XML being
    used
  • Should be 1.0 or 1.1
  • Encoding attribute indicates encoding used in
    file
  • All XML parsers required to understand Unicode
    encodings UTF-8 and UTF-16
  • Some parsers support other popular encodings like
    ISO-8859-1 but must then be able to convert from
    these encodings to Unicode code points
  • Best to use UTF-8 or UTF-16 if possible
  • XML declaration followed by root element
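
As a rough illustration of the encoding attribute, the Python sketch below serialises the same (made-up) one-element document in three encodings and parses each with xml.etree.ElementTree. It assumes the underlying parser supports ISO-8859-1 in addition to the required UTF-8 and UTF-16, which is true of the parser bundled with Python but, as noted above, not guaranteed in general.

```python
import xml.etree.ElementTree as ET

body = "<word>café</word>"  # hypothetical content with one non-ASCII character

def encoded_document(encoding):
    """Build the document as bytes, declaring the encoding actually used."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    return (declaration + body).encode(encoding)

utf8_bytes = encoded_document("UTF-8")
utf16_bytes = encoded_document("UTF-16")
latin1_bytes = encoded_document("ISO-8859-1")

# The byte sequences differ, but each parses to the same Unicode text.
texts = [ET.fromstring(data).text for data in (utf8_bytes, utf16_bytes, latin1_bytes)]
print(texts)
```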

18
Character data and attribute values
  • In character data (text nodes) and attribute
    values, special characters have to be escaped
    using Unicode character references
  • &#N; denotes Unicode character with code point N
    represented in decimal
  • &#xN; denotes Unicode character with code point N
    represented in hexadecimal
  • Some characters are predefined entities in XML
    (see table above)
  • Examples
  • < can be referenced as &lt;, &#x3C; or &#60;
  • & can be referenced as &amp;, &#x26; or &#38;
  • < and & must be escaped in both character data
    and attribute values
  • In attribute values that are enclosed by " or '
    this character must also be escaped
  • Also use character references to encode Unicode
    characters that are not accessible from the
    keyboard
  • e.g., sake in hiragana script is さけ,
    which is encoded as &#x3055;&#x3051;
  • Complete list of Unicode character code points is
    available on the Unicode website at
    http://www.unicode.org/charts/
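
A short Python sketch of these escaping rules, assuming the standard library's xml.etree.ElementTree for parsing and xml.sax.saxutils.escape for producing escaped text. The wrapper element t and the helper text_of are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def text_of(fragment):
    """Parse fragment as the content of a wrapper element and return its text."""
    return ET.fromstring(f"<t>{fragment}</t>").text

# Three ways of writing the same character.
less_than = [text_of("&lt;"), text_of("&#x3C;"), text_of("&#60;")]
ampersand = [text_of("&amp;"), text_of("&#x26;"), text_of("&#38;")]

# Character references for characters not available on the keyboard.
sake = text_of("&#x3055;&#x3051;")

# Inside a double-quoted attribute value, the quote itself must be escaped.
attribute = ET.fromstring('<t a="say &quot;hi&quot; &amp; go"/>').get("a")

# Going the other way: escape raw text before placing it in a document.
escaped = escape("a<b & b>c")

print(less_than, ampersand, sake, attribute, escaped)
```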

19
CDATA sections
  • If you have some text that contains lots of
    characters that have to be escaped, then you can
    enclose the text within a CDATA section
  • CDATA section corresponds to a CDATA node in the
    DOM and JDOM data models
  • For example, in most situations, <![CDATA[a<b &
    b>c]]> is equivalent to a&lt;b &amp; b&gt;c
  • Strange syntax for CDATA sections originates in
    SGML

20
Comments, processing instructions and DTD
information
  • Comment nodes are encoded in the source in the
    same way as in HTML: <!--This is a comment-->
  • A processing instruction is a target-value pair
    delimited by <?...?>, e.g., <?xml-stylesheet
    type="text/xsl" href="mystyle.xsl"?>, in which
    xml-stylesheet is the target and the string
    type="text/xsl" href="mystyle.xsl" is the
    single value
  • Document type nodes (recognized in DOM and JDOM)
    are encoded as follows: <!DOCTYPE ... >
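
As an illustration, the Python sketch below parses a small made-up document with xml.dom.minidom and inspects the processing instruction and comment nodes. The features element and its content are assumptions for this example; the document type declaration is left out to keep the sketch short.

```python
from xml.dom import minidom, Node

source = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="mystyle.xsl"?>'
    "<features><!--This is a comment-->some text</features>"
)
doc = minidom.parseString(source)

# The XML declaration is not a node; the first child is the processing instruction.
pi = doc.firstChild
print(pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE)
print(pi.target)  # the target
print(pi.data)    # everything after the target, as a single string value

# Inside the root element: a comment node followed by a text node.
comment, text = doc.documentElement.childNodes
print(comment.nodeType == Node.COMMENT_NODE, comment.data)
print(text.nodeType == Node.TEXT_NODE, text.data)
```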

21
Example XML document
  • Contains an XML declaration, followed by a
    document type definition and then a single root
    element named features
  • The features element contains a processing
    instruction, some character data and a comment

22
White space in XML
  • Often convenient to use white space (spaces,
    tabs and new lines) to format source and make it
    more readable
  • Usually this white space is not supposed to be
    included in the delivered version of the document
  • However, sometimes we want the white space to be
    preserved
  • e.g., in poetry or computer programme source code
  • By default, the way that white space is handled
    in an XML document is decided by the application
    that is used to process the document
  • If we definitely want white space within the
    content of an element to be preserved, then we
    assign the value "preserve" to the attribute
    xml:space in that element
  • Applies to all elements within content of element
    where xml:space attribute value specified, unless
    overridden by another instance of the xml:space
    attribute
  • Typically, white space handling is defined in the
    DTD for the specific language
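
Since white space handling is left to the application, the following Python sketch shows one possible policy, not a standard one: collapse runs of white space everywhere except where xml:space="preserve" is in effect, with the setting inherited by descendants until overridden. The function name normalise and the sample document are assumptions for this illustration; ElementTree exposes the xml:space attribute under the reserved XML namespace URI.

```python
import xml.etree.ElementTree as ET

# ElementTree's name for the xml:space attribute.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def normalise(elem, preserve=False):
    """Collapse white space in text unless xml:space="preserve" is in effect."""
    setting = elem.get(XML_SPACE)
    if setting == "preserve":
        preserve = True
    elif setting == "default":
        preserve = False  # an inner declaration overrides the inherited one
    if not preserve and elem.text is not None:
        elem.text = " ".join(elem.text.split())
    for child in elem:
        normalise(child, preserve)
        # Text following a child belongs to this element's content.
        if not preserve and child.tail is not None:
            child.tail = " ".join(child.tail.split())

doc = ET.fromstring(
    "<doc><note>  loose   text  </note>"
    '<poem xml:space="preserve">roses\n  are red'
    '<aside xml:space="default">  not   kept  </aside></poem></doc>'
)
normalise(doc)
note, poem = doc
aside = poem[0]
print(repr(note.text), repr(poem.text), repr(aside.text))
```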

23
Is XML too verbose?
  • Some argue that XML markup is more verbose than
    necessary
  • Same information can often be represented much
    more parsimoniously in a relational database
  • Leads to (misguided) advice to use attributes in
    preference to elements and short names for both
    attributes and elements
  • This usually leads to inflexible and
    incomprehensible language designs!
  • Better to disregard such considerations in the
    design phase and then compress files later using
    either a general purpose compression program or
    one that is optimized for XML, such as XMill
    (http://sourceforge.net/projects/xmill)
  • To represent structured information, all you
    really need are text and element nodes
  • One simpler alternative is to use something like
    Lisp S-Expressions which date back to 1958
  • For example,
    (collection (recipe (title "Rhubarb Cobbler")
    (date "Wed, 14 Jun 95")))
    represents the same as
    <collection> <recipe> <title>Rhubarb Cobbler</title>
    <date>Wed, 14 Jun 95</date> </recipe> </collection>
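
The correspondence between the two notations can be sketched as a small Python function that converts an element tree into an S-expression string. This is a deliberately limited illustration: the function name to_sexpr is made up, and the sketch ignores attributes, mixed content and the escaping of quotes inside text.

```python
import xml.etree.ElementTree as ET

def to_sexpr(elem):
    """Render an element as (name "text" child child ...)."""
    parts = [elem.tag]
    if elem.text and elem.text.strip():
        parts.append('"' + elem.text.strip() + '"')
    for child in elem:
        parts.append(to_sexpr(child))
    return "(" + " ".join(parts) + ")"

xml_source = (
    "<collection> <recipe> <title>Rhubarb Cobbler</title> "
    "<date>Wed, 14 Jun 95</date> </recipe> </collection>"
)
sexpr = to_sexpr(ET.fromstring(xml_source))
print(sexpr)
```

The element name appears once per element in the S-expression and twice in the XML, which is where most of the difference in length comes from.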

24
Applications of XML
  • Hundreds of XML applications have been developed
    for many different domains:
    http://xml.coverpages.org/xmlApplications.html
  • XML languages can be roughly classified into
  • Data-oriented languages for describing data that
    would traditionally have been stored in
    databases.
  • Usually have a flat, wide structure, with the
    root element containing many similar children,
    each with a simple structure
  • Document-oriented languages (e.g., XHTML) are for
    annotating the structure of natural language text
  • Elements often have mixed content (elements and
    character data)
  • Unlike documents in a data-oriented language,
    documents in a document-oriented XML language can
    usually be understood even if the markup tags are
    removed
  • Protocols and programming languages including,
    e.g., XML Schema and XSLT
  • Usually have most complex syntax
  • Hybrids often combine features of data- and
    document-oriented languages
  • Typically allow freeform text as content of
    certain elements
  • e.g., <comment> element in RecipeML

25
Examples of XML languages: XHTML
  • XHTML 1.0 is W3C's XML-ification of HTML 4.01
  • Apart from the XML declaration and the XHTML
    namespace declaration, XHTML is very similar to
    HTML 4.01
  • However, XHTML document must be a well-formed XML
    document, therefore
  • Omitting end tags is forbidden in XHTML
  • Can abbreviate the start and end tags of an empty
    element using the <.../> notation, e.g., <br/>
  • XHTML element and attribute names must be lower
    case
  • Attribute values cannot be omitted and must be
    surrounded by double or single quotes
  • e.g., attribute checked in HTML must be written
    checked="checked"

26
XHTML Variants
  • XHTML 1.0 Strict
  • Clean markup in which all layout is specified
    using CSS
  • XHTML 1.0 Transitional
  • Additionally permits explicit layout markup like
    bgcolor attribute and font tag
  • XHTML 1.0 Frameset
  • Allows use of frames
  • XHTML 1.1
  • Modularization of XHTML 1.0 in which language
    partitioned by functionality into modules, e.g.,
  • Structure: includes html, head and body tags
  • Text: includes basic text markup
  • Hypertext: includes the anchor tag (<a>)
  • Lists: ul, ol, dl, ...
  • Forms: form, input, select, ...
  • Each module defined using a separate DTD
  • Allows specific subsets of the XHTML language to
    be included in new languages

27
Other XML languages
  • CML (Chemical Markup Language)
  • Data-oriented language for representing molecules
    and chemical reactions
  • One of the first XML applications
  • Supported by wide range of tools such as browsers
    and editors
  • WML (Wireless Markup Language)
  • Document-oriented XML language that replaces HTML
    on mobile devices that typically have small
    displays, limited user-input facilities and low
    bandwidth
  • ebXML (Electronic Business XML Initiative)
  • Worldwide initiative to use XML for exchanging
    electronic business data
  • Has provided comprehensive standards for business
    processes, data components, collaboration
    protocol agreements, messaging etc.
  • Complex language that belongs to protocols and
    programming languages category of XML languages
  • ThML (Theological Markup Language)
  • Superset of XHTML for markup of theological texts
  • Supports references, annotations, glossaries
  • MusicXML
  • For encoding Western musical staff notation
  • Many other XML applications and initiatives
    listed here:
    http://xml.coverpages.org/xmlApplications.html

28
Namespaces in XML
  • Not part of the XML specification
  • Defined separately from XML specification
  • For XML 1.0
  • http://www.w3.org/TR/xml-names/
  • For XML 1.1
  • http://www.w3.org/TR/xml-names11/
  • XML namespaces provide a simple method for
    qualifying element and attribute names used in
    Extensible Markup Language documents by
    associating them with namespaces identified by
    IRI references (http://www.w3.org/TR/xml-names/)

29
XML Namespaces: Motivating problem
  • The (fictitious) XML language, WidgetML, is
    designed for describing widgets
  • Can include explanatory text written in XHTML
    within a WidgetML document
  • i.e., WidgetML uses XHTML as a sublanguage
  • Example above describes a widget called gadget
    with a medium-sized head and a big gizmo
    subwidget
  • XHTML message contained within info element
  • Both XHTML and non-XHTML part of WidgetML use
    tags big and head
  • Means that big and head tags can mean different
    things within a WidgetML document, depending on
    context
  • Demonstrates need to be able to avoid name
    clashes when combining languages that may use
    elements with the same names to mean different
    things
  • Programming languages use namespaces and
    qualified names to avoid clashing names
  • In XML we also use namespaces and each namespace
    is identified by a unique URI
  • E.g., XHTML namespace is associated with the URI,
    http://www.w3.org/1999/xhtml
  • WidgetML developed by a company called Widget
    Inc. whose domain is www.widget.inc, so assigns a
    URI under their domain to the WidgetML namespace,
    such as http://www.widget.inc/widgetml/

30
Namespaces
  • So now, instead of just writing <head></head>
    inside the info element, we can (conceptually)
    write <{http://www.w3.org/1999/xhtml}head>
    </{http://www.w3.org/1999/xhtml}head> to specify
    that we mean the head tag from XHTML, not the
    head tag from WidgetML
  • But prefixing every tag with a long URI would
    lead to an extremely verbose and incomprehensible
    document
  • Instead, we assign a short name to the namespace
    we want to use within an element and declare this
    association between the short name and the
    namespace as an attribute in the start tag of the
    containing element:
    <info xmlns:foo="http://www.w3.org/1999/xhtml">
    <foo:head></foo:head> </info>
  • We can then prefix the tags within the
    containing element with the short name to
    indicate that they are from the declared
    namespace
  • The attribute
    xmlns:foo="http://www.w3.org/1999/xhtml"
  • Declares the namespace named
    http://www.w3.org/1999/xhtml
  • Gives this namespace the prefix foo
31
Namespaces
  • Namespace declaration applies to all contents of
    element in whose start tag it occurs
  • Can use any name as a prefix except one that
    contains a colon or one starting with the letters
    XML (in any combination of upper or lower case)
  • NCName (No Colon Name) is one that does not
    contain a colon
  • QName (Qualified name) may be an NCName or an
    NCName prefixed with a namespace prefix and a
    colon
  • Unprefixed element names are assigned a default
    namespace
  • Default namespace can be overridden by setting
    attribute xmlns to the URI for the namespace to
    be used for unprefixed tags, e.g., <widget
    type="gadget" xmlns="http://www.widget.inc">

32
Default namespaces don't apply to attributes!
  • 1 and 2 are equivalent
  • The size attribute does not come from the
    http://www.widget.inc namespace
  • In 3, the size attribute does come from the
    http://www.widget.inc namespace
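
The numbered examples from the slide are not reproduced in this transcript, so the Python sketch below uses three hypothetical start tags in the same spirit: a default namespace, a prefixed element with an unprefixed attribute, and a prefixed attribute. The prefix w and the element and attribute values are assumptions; the names are resolved with xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

W = "http://www.widget.inc"

# (1) Default namespace: applies to the element name only.
one = ET.fromstring(f'<head xmlns="{W}" size="medium"/>')
# (2) Explicit prefix on the element, attribute left unprefixed.
two = ET.fromstring(f'<w:head xmlns:w="{W}" size="medium"/>')
# (3) Explicit prefix on both the element and the attribute.
three = ET.fromstring(f'<w:head xmlns:w="{W}" w:size="medium"/>')

# All three elements have the same namespace-qualified name.
print(one.tag == two.tag == three.tag)

# In (1) and (2) the size attribute is in no namespace.
print(one.attrib == two.attrib, one.attrib)

# Only in (3) is the attribute in the widget namespace.
print(three.attrib)
```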

33
RecipeML
34
Example RecipeML document
  • An example RecipeML document is available
    at http://www.brics.dk/ixwt/examples/recipes.xml

35
Summary
  • XML is a framework for developing markup
    languages in any conceivable domain
  • XML is just a notation for hierarchically
    structuring textual data
  • Strength of XML is that it is a widely accepted
    standard supported by many generic languages and
    tools
  • Means you get lots of free infrastructure if you
    build on it
  • Considered XML in the form of trees and in its
    textual representation
  • Considered the namespace mechanism for resolving
    name conflicts