XML3 - PowerPoint PPT Presentation

Provided by: serv385

1
XML3
  • World Wide Web Technology

2
Physical Structure Entities
  • Entities can be used
  • to represent frequently repeated phrases
  • standard sets of contractual conditions
  • document sections
  • Internal entities (in DTD) can be used to ensure
    consistency and improve readability

3
Non-XML Resources
  • Binary external entities are provided to
    represent any non-XML resources that may be
    needed to form part of an XML document
  • Linking them to notations allows a framework
    that can cope with any type of digital
    resource that is invented

4
Storage for XML Entities
  • An entity's content can be, and frequently is, a
    complete file. It may also be
  • part of a file
  • a stream in memory
  • object in a database
  • result of a query
  • If you change your storage from a file system to
    an OO database, then change the DTD
  • XML applications need an entity manager
  • to find entities in documents
  • any resource that can be addressed by a URL

5
Entity Declarations
  • <!ENTITY disclaimer "No responsibility">
  • Delimiters: <!ENTITY and >
  • Entity Name: disclaimer
  • Entity Value: "No responsibility"
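
The declaration above can be tried out with a minimal sketch, assuming Python's standard xml.etree.ElementTree parser. The memo and p element names are made up for illustration; only the disclaimer entity comes from the slide.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares the entity once;
# the document body then references it twice.
DOC = """<!DOCTYPE memo [
  <!ENTITY disclaimer "No responsibility">
]>
<memo><p>&disclaimer;</p><p>&disclaimer; accepted.</p></memo>"""

root = ET.fromstring(DOC)

# The parser replaces each &disclaimer; reference with the entity value.
paragraphs = [p.text for p in root.findall("p")]
print(paragraphs)
```

Changing the entity value in the declaration changes every place it is referenced, which is the consistency benefit mentioned for internal entities.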

6
Predefined Entities
  • &lt; < (&#60;)
  • &gt; >
  • &amp; & (&#38;)
  • &apos; '
  • &quot; "
  • Remember that if you use these characters a lot
    as text rather than as markup, a CDATA section
    is easier
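
As a small sketch of how an application might produce and undo these references, assuming Python's standard xml.sax.saxutils helpers (the sample strings are invented):

```python
from xml.sax.saxutils import escape, unescape, quoteattr

raw = 'if a < b && name == "x"'

# escape() replaces &, < and > with &amp;, &lt; and &gt;
escaped = escape(raw)
print(escaped)

# unescape() reverses the same three references
print(unescape(escaped) == raw)

# quoteattr() also picks a quote character and returns a ready attribute value
attribute = quoteattr("Tom & Jerry's")
print(attribute)
```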

7
External Entities
  • Either parsed or unparsed
  • Parsed entities are XML text and are referenced
    by entity references
  • <!ENTITY disclaimer SYSTEM "http://www.ballarat.edu.au/jly/xml/disclaimer.xml">
  • <!ENTITY disclaimer PUBLIC "-//JLY//TEXT Disclaimer//EN" "http://www.ballarat.edu.au/jly/xml/disclaimer.xml">

8
External Entities
  • Refer in their declaration to a storage unit by
    means of a SYSTEM or PUBLIC identifier
  • The format for PUBLIC identifiers includes a
    PUBLIC name before the SYSTEM name
  • A PUBLIC identifier is simply a mechanism for
    describing an entity; it does not specify a
    particular storage location
  • The XML processor may be able to determine
    whether the entity is available locally

9
External Entities
  • Unparsed entities must have a notation associated
    with them
  • Unparsed entities are referenced by ENTITY
    attributes
  • <!ENTITY JlyImage SYSTEM "jly.jpg" NDATA JPEG>
  • NDATA indicates an unparsed entity and is always
    followed by a notation name
  • This declaration specifies the JPEG notation
    through the value given to NDATA
  • This notation in turn must be declared
  • <!NOTATION JPEG SYSTEM "http://www.graphics.com/jpgviewer.exe">

10
External Parsed Entities
  • Each external parsed entity may use a different
    encoding for its characters
  • May begin with a text declaration
  • <?xml encoding="UTF-8"?>

11
External Parsed Entities
  • Entities encoded in UTF-16 must begin with the
    Byte Order Mark (BOM)
  • The BOM (character U+FEFF, ZERO WIDTH NO-BREAK
    SPACE) is not part of either the markup or the
    character data of the document
  • One can tell from the BOM whether the document is
    in UTF-8, big-endian UTF-16 or little-endian
    UTF-16
  • Parsed entities in an encoding other than UTF-8
    or UTF-16 must begin with a text declaration
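
The BOM check described above can be sketched as follows, assuming Python and the BOM constants from its standard codecs module. The function names sniff_bom and decode_entity are hypothetical, and the sketch ignores UTF-32 and the text declaration.

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding, BOM length) guessed from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    return None, 0  # no BOM found

def decode_entity(data: bytes) -> str:
    """Drop the BOM, which is not part of the document, and decode the rest."""
    encoding, skip = sniff_bom(data)
    # Without a BOM this sketch simply assumes UTF-8.
    return data[skip:].decode(encoding or "utf-8")

sample = codecs.BOM_UTF16_LE + "<p>caf\u00e9</p>".encode("utf-16-le")
print(sniff_bom(sample), decode_entity(sample))
```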

12
Notations
  • For describing the data content notation of
    different things:
  • the definition of how the bits and bytes of an
    object should be interpreted
  • Declarations
  • <!NOTATION GIF SYSTEM "gif.spec">
  • Notation Declaration Delimiters: <!NOTATION and >
  • Notation Name: GIF
  • External Identifier: SYSTEM "gif.spec"

13
Notation Entities
  • We define a notation called GIF
  • <!NOTATION GIF SYSTEM "gif.spec">
  • We declare an entity to be an unparsed entity in
    this notation
  • <!ENTITY JlyImage SYSTEM "jly.gif" NDATA GIF>
  • We then declare an entity attribute that can take
    the name of an unparsed entity as value
  • <!ATTLIST PERSON PHOTO ENTITY #IMPLIED GENDER (MALE|FEMALE) #REQUIRED>

14
Notation Entities
  • We can then, in our document instance, reference
    this entity.
  • <PERSON GENDER="MALE" PHOTO="JlyImage">John Yearwood</PERSON>
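
The chain from attribute value to unparsed entity to notation can be followed with a minimal sketch, assuming Python's standard xml.parsers.expat module. The ELEMENT declaration and the three bookkeeping containers are additions made for illustration.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE PERSON [
  <!NOTATION GIF SYSTEM "gif.spec">
  <!ENTITY JlyImage SYSTEM "jly.gif" NDATA GIF>
  <!ELEMENT PERSON (#PCDATA)>
  <!ATTLIST PERSON PHOTO ENTITY #IMPLIED
                   GENDER (MALE|FEMALE) #REQUIRED>
]>
<PERSON GENDER="MALE" PHOTO="JlyImage">John Yearwood</PERSON>"""

notations = {}  # notation name -> system identifier
unparsed = {}   # entity name -> (system identifier, notation name)
photos = []     # (entity file, notation name, notation system identifier)

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id

def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
    if notation is not None:  # NDATA was given, so the entity is unparsed
        unparsed[name] = (system_id, notation)

def on_start(tag, attrs):
    # The parser does not fetch the image; the application resolves the name.
    if "PHOTO" in attrs:
        system_id, notation = unparsed[attrs["PHOTO"]]
        photos.append((system_id, notation, notations[notation]))

parser = expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.EntityDeclHandler = on_entity
parser.StartElementHandler = on_start
parser.Parse(DOC, True)
print(photos)
```

The parser only reports the declarations; deciding what to do with jly.gif and its GIF notation is left to the application.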

15
Notation Attributes
  • We define three notations
  • <!NOTATION USDATE SYSTEM "usdate.not">
  • <!NOTATION AUSDATE SYSTEM "ausdate.not">
  • <!NOTATION ISODATE SYSTEM "isodate.not">
  • We declare a notation attribute
  • <!ATTLIST DATE FORMAT NOTATION (USDATE|AUSDATE|ISODATE) "ISODATE">

16
Notation Attributes
  • We can then in our document instance, indicate
    the notation of our content
  • <DATE FORMAT="AUSDATE">30/4/99</DATE>
  • <DATE FORMAT="USDATE">4/30/99</DATE>
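
How an application might act on the FORMAT notation is sketched below in Python. The strptime patterns chosen for each notation name, the DATES wrapper element and the read_date helper are all assumptions; the slides do not say what the .not files contain.

```python
from datetime import date, datetime
import xml.etree.ElementTree as ET

# Hypothetical mapping from notation name to a strptime pattern.
DATE_PATTERNS = {
    "USDATE": "%m/%d/%y",
    "AUSDATE": "%d/%m/%y",
    "ISODATE": "%Y-%m-%d",
}

def read_date(element) -> date:
    # Fall back to ISODATE, the default value declared in the ATTLIST.
    notation = element.get("FORMAT", "ISODATE")
    return datetime.strptime(element.text, DATE_PATTERNS[notation]).date()

doc = ET.fromstring(
    "<DATES>"
    '<DATE FORMAT="AUSDATE">30/4/99</DATE>'
    '<DATE FORMAT="USDATE">4/30/99</DATE>'
    "<DATE>1999-04-30</DATE>"
    "</DATES>"
)
dates = [read_date(d) for d in doc.findall("DATE")]
print(dates)
```

All three elements describe the same day; only the notation tells the application how to read the text.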

17
Parameter Entities
  • Same as other entities except only used in the
    DTD
  • <!DOCTYPE BOOK [
    <!ENTITY % isolat1 SYSTEM "isolat1.pen">
    %isolat1;
    <!ENTITY % ParaContent "(#PCDATA|TERM|CITATION|QUOTE)*">
    ...
    <!ELEMENT PARA %ParaContent; >
    <!ELEMENT TABLECELL %ParaContent; >
    ...
    ]>

18
Conditional Sections
  • Conditional sections are allowed only within the
    external DTD subset. They cause the piece of
    markup that they contain to be included in the
    DTD or ignored depending on the value of their
    initial keyword
  • INCLUDE
  • IGNORE

19
Conditional Sections
  • <![INCLUDE[ <!ELEMENT book (comments, title, body, supplements?)> ]]>
    <![IGNORE[ <!ELEMENT book (title, body, supplements?)> ]]>

20
Conditional Sections and Parameter entities
  • Parameter entities provide a way of switching in
    and out various sections of a DTD
  • <!ENTITY % draft "INCLUDE">
    <!ENTITY % final "IGNORE">
    <![%draft;[ <!ELEMENT book (comments, title, body, supplements?)> ]]>
    <![%final;[ <!ELEMENT book (title, body, supplements?)> ]]>
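
The switching effect can be imitated with a toy resolver in Python. This is only a sketch built on regular expressions, not a DTD parser: it assumes quoted parameter entity values, no nested conditional sections, and the hypothetical function name resolve_conditionals.

```python
import re

PE_DECL = re.compile(r'<!ENTITY\s+%\s+(\w+)\s+"([^"]*)"\s*>')
COND = re.compile(r'<!\[\s*%(\w+);\s*\[(.*?)\]\]>', re.DOTALL)

def resolve_conditionals(dtd: str) -> str:
    """Keep the content of INCLUDE sections and drop IGNORE sections."""
    params = {}
    for name, value in PE_DECL.findall(dtd):
        params.setdefault(name, value)  # the first declaration is binding

    def choose(match):
        keyword = params[match.group(1)]
        return match.group(2).strip() if keyword == "INCLUDE" else ""

    body = PE_DECL.sub("", dtd)  # the entity declarations themselves are dropped
    return COND.sub(choose, body).strip()

DTD = """
<!ENTITY % draft "INCLUDE">
<!ENTITY % final "IGNORE">
<![%draft;[ <!ELEMENT book (comments, title, body, supplements?)> ]]>
<![%final;[ <!ELEMENT book (title, body, supplements?)> ]]>
"""
print(resolve_conditionals(DTD))
```

Swapping the two entity values switches the DTD from the draft content model to the final one without touching the conditional sections.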

21
HTML DTD
  • If you look at this you will see the operation of
    conditional sections with parameter entities
  • You can use just the recommended parts of the DTD
    if you want

22
Internal and External subset
  • In the internal subset
  • parameter entity references can only occur in
    place of entire markup declarations
  • not inside markup declarations
  • In the external subset
  • parameter entity references can occur inside
    markup declarations
  • can have conditional sections

23
Internal subset precedence
  • The internal subset is considered to occur before
    the external subset, so
  • entity and attribute list declarations in the
    internal subset take precedence
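
Precedence follows from the rule that the first declaration of an entity is binding. The sketch below, assuming Python's standard xml.etree.ElementTree, shows that rule with two declarations placed in the internal subset so that the example stays self-contained; the note element and status entity are invented.

```python
import xml.etree.ElementTree as ET

# Two declarations of the same entity: only the first one counts.
DOC = """<!DOCTYPE note [
  <!ENTITY status "draft">
  <!ENTITY status "final">
]>
<note>&status;</note>"""

status = ET.fromstring(DOC).text
print(status)
```

Because the internal subset is read first, a declaration placed there wins over one with the same name in the external subset in the same way.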

24
Standalone revisited
  • If there are no external markup declarations
    (either in the external subset or in external
    parameter entities referenced from the internal
    subset) then the value of standalone is
    meaningless
  • The standalone declaration must have the value
    "no" if any external markup declarations contain
    declarations of
  • attributes with default values used in doc
    instance
  • entities, if referenced in doc instance

25
Character sets: Unicode
  • Unicode is an encoding scheme that specifies a
    name and a numerical value for a set of
    characters.
  • ASCII uses 7 bits to encode its characters (128
    in number)
  • Unicode originally used 16 bits, allowing for an
    international repertoire from the major scripts
    of the world

26
ISO/IEC 10646
  • This allows for an even larger repertoire of 2^31
    characters by using 31 bit encoding
  • These code positions are divided into 128 groups
    of 256 planes.
  • At present characters have only been assigned to
    the very first plane, the Basic Multilingual
    Plane (BMP). This has been done to match with
    Unicode.

27
ISO/IEC 10646
  • Defines two forms of expressing a character
  • UCS-4 (Universal Character Set coded in 4 bytes)
    that uses the full 31 bit encoding
  • UCS-2 (Universal Character Set coded in 2 bytes)
    that uses 16 bits to identify characters from BMP
  • The character code values of UCS-2 match those of
    Unicode 1.1 and the values of ISO/IEC 10646-11
    match those of Unicode 2

28
Character Encoding
  • UTF-8 (UCS Transformation Format, 8-bit form)
    provides a variable-width encoding such that
  • Characters in the ASCII repertoire take up only
    one byte with the same value as the character
    would have in ASCII
  • If a byte in UTF-8 could be ASCII (0-127) then it
    is ASCII
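
A short Python sketch of the variable-width property, using the built-in str.encode. The sample characters and the helper name utf8_bytes are chosen only for illustration.

```python
def utf8_bytes(ch: str):
    """Return the UTF-8 encoding of a string as a list of byte values."""
    return list(ch.encode("utf-8"))

# One, two and three byte encodings.
for ch in ("A", "\u00e9", "\u20ac"):
    print(f"U+{ord(ch):04X}", [hex(b) for b in utf8_bytes(ch)])

# Every byte below 0x80 in the encoded form is an ASCII character of the text.
text = "caf\u00e9 \u20ac5"
ascii_from_bytes = bytes(b for b in text.encode("utf-8") if b < 0x80).decode("ascii")
ascii_from_text = "".join(c for c in text if ord(c) < 0x80)
print(ascii_from_bytes == ascii_from_text)
```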

29
UTF-16
  • UTF-16 (UCS Transformation Format for Planes of
    Group 00) provides access to an additional 16
    planes outside the BMP.
  • UTF-16 does this by using pairs of UCS-2
    characters, both in otherwise reserved ranges of
    UCS-2
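
The pairing can be worked through in a few lines of Python. The reserved ranges are D800 to DBFF for the first value and DC00 to DFFF for the second; the function name to_surrogates and the sample character are chosen for illustration.

```python
def to_surrogates(code_point: int):
    """Split a code point outside the BMP into its two 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only planes 1 to 16 need a surrogate pair")
    offset = code_point - 0x10000     # 20 significant bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF lives in plane 1.
high, low = to_surrogates(0x1D11E)
encoded = chr(0x1D11E).encode("utf-16-be")
print(hex(high), hex(low), encoded.hex())
```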

30
Character Classes
  • Characters -
  • tab, cr, lf, legal graphic characters of Unicode
  • Name characters
  • letters, digits, period, hyphen, underscore,
    colon, combining characters and extenders
  • NMTOKEN - a mixture of name characters
  • NAME - an NMTOKEN beginning with a letter,
    underscore or colon

31
Whitespace
  • A special attribute xml:space may be attached to
    an element to signal that white space within that
    element should be preserved by applications.
  • Possible values are default and preserve
  • Eg
  • <!ATTLIST poem xml:space (default|preserve) "preserve">

32
End-of-Line handling
  • Whenever an external parsed entity or the literal
    entity value of an internal parsed entity
    contains the literal 2 character sequence of cr
    lf or just cr, an XML processor must pass a
    single lf to the application
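
The rule amounts to two replacements, shown here as a Python sketch next to what the standard xml.etree.ElementTree parser delivers. The function name normalize_eol is hypothetical.

```python
import xml.etree.ElementTree as ET

def normalize_eol(text: str) -> str:
    """Turn cr lf pairs, then any remaining lone cr, into a single lf."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

by_hand = normalize_eol("a\r\nb\rc\nd")

# The parser applies the same normalization before handing text over.
by_parser = ET.fromstring(b"<p>a\r\nb\rc\nd</p>").text
print(repr(by_hand), repr(by_parser))
```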

33
Language Identification
  • A special attribute xml:lang can be specified on
    any element to indicate the language that the
    content of the element and its other attribute
    values are in.
  • <para xml:lang="en">This is English.</para>
  • The values of the attribute are defined by IETF
    RFC 1766, Tags for the Identification of
    Languages
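
Reading xml:lang from an application might look like the following sketch, assuming Python's standard xml.etree.ElementTree, which reports the attribute under the reserved XML namespace. The doc wrapper element, the French paragraph and the languages helper are invented; the sketch also applies the rule that a value holds for nested elements until it is overridden.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

def languages(root):
    """Map each element to its effective xml:lang, inherited from ancestors."""
    result = {}

    def walk(element, inherited):
        lang = element.get(f"{{{XML_NS}}}lang", inherited)
        result[element] = lang
        for child in element:
            walk(child, lang)

    walk(root, None)
    return result

doc = ET.fromstring(
    '<doc xml:lang="en"><para>This is English.</para>'
    '<para xml:lang="fr">Ceci est du fran\u00e7ais.</para></doc>'
)
langs = [languages(doc)[p] for p in doc.findall("para")]
print(langs)
```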

34
Language Code

35
XML Processors (i)
  • Protect you from the character encoding used in
    the XML document
  • Protect you from line-break differences in
    operating systems
  • Do all the parsing
  • Do all the entity replacement

36
XML Processors (ii)
  • Existing XML processors can be classified by a
    number of different criteria
  • Validating or non-validating
  • Implementation Language/Platform (Java, C,
    Python, Tcl)
  • Interface (Proprietary, ESIS, SAX, DOM)
  • Design goals (small, fast)

37
Interfaces
  • Event based vs tree based
  • ESIS (Element Structure Information Set)
  • Groves (SGML/HyTime/DSSSL)
  • SAX (Simple API for XML)
  • DOM (Document Object Model)
  • JAXP
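
The difference between event based and tree based interfaces can be seen in a small sketch using Python's standard xml.sax and xml.dom.minidom modules. The sample document and the NameCollector class are made up for illustration.

```python
import xml.sax
from xml.dom import minidom

DOC = b"<book><title>XML</title><body>Entities</body></book>"

class NameCollector(xml.sax.ContentHandler):
    """Event based: the parser calls back as each start tag is read."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

handler = NameCollector()
xml.sax.parseString(DOC, handler)
event_names = handler.names

# Tree based: the whole document is built first, then navigated.
tree = minidom.parseString(DOC)
tree_names = [tree.documentElement.tagName] + [
    node.tagName
    for node in tree.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]
print(event_names, tree_names)
```

The SAX handler never holds the whole document, which suits large inputs, while the DOM tree can be revisited in any order once parsing is done.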