Title: Character Matters


1
WELCOME
  • Character Matters

2
Character Matters
  • XML 2004, Marriott Wardman Park Hotel, Washington
    DC

Diederik Gerth van Wijk, Content Office
2004-11-16

3
Overview
  • What is a character
  • XML versus SGML, Unicode versus System Data
    entities
  • Character overload: the need for restriction
  • Character underload: the need for extension
  • How to validate the restriction with DSDL
  • How to validate an extension
  • The need for Bottom Up Constraint Languages
  • PCDATA considered harmful

4
What is a character
  • XML 1.0, 3rd ed.: [Definition: A character is an
    atomic unit of text as specified by ISO/IEC
    10646:2000 [ISO/IEC 10646].]
  • XML processors MUST accept any character in the
    range specified for Char
  • [2] Char ::= #x9 | #xA | #xD |
    [#x20-#xD7FF] | [#xE000-#xFFFD] |
    [#x10000-#x10FFFF] /* any Unicode character,
    excluding the surrogate blocks, FFFE, and
    FFFF. */
  • That's more than a million characters
  • That's more than I can process
  • But not all that I want to process
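
As a minimal sketch, the Char production above can be written as a table of code point ranges. The names XML_CHAR_RANGES, is_xml_char and count_xml_chars are illustrative and come from neither the slides nor any library. Counting the ranges supports the "more than a million" remark, counting code points rather than assigned characters.

```python
# Ranges of the XML 1.0 Char production, as (first, last) code points.
XML_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch matches the Char production."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in XML_CHAR_RANGES)


def count_xml_chars() -> int:
    """Count the code points admitted by the Char production."""
    return sum(last - first + 1 for first, last in XML_CHAR_RANGES)
```

The count includes unassigned code points and the Private Use Areas, so it is an upper bound on what a processor may meet, not a count of named characters.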

5
What is a character (2)
  • ISO/IEC 10646: Universal Multiple-Octet Coded
    Character Set (UCS)
  • character: A member of a set of elements used
    for the organisation, control, or representation
    of data
  • collection: A set of coded characters which is
    numbered and named and which consists of those
    coded characters whose code positions lie within
    one or more identified ranges
  • ISO/IEC TR 15285: An operational model for
    characters and glyphs
  • Information technology uses the term character
    (or coded character) for the information content,
    and the term glyph for the presentation image.
  • Since the standard does not specify which
    information the character represents, a user of
    the standard is free to choose.

6
What is a character (3)
  • Are these the same character?

7
XML versus SGML, Unicode versus SDATA entities
  • In SGML, the SGML declaration declares which
    character repertoire to use. By default:
    CHARSET -- DOCUMENT CHARACTER SET --
    BASESET "ISO 646-1983//CHARSET International
    Reference Version (IRV)//ESC 2/5 4/0"
  • If you want anything more, use SDATA entities:
    <!ENTITY eacute SDATA "[eacute]" -- e-accent aigue -->
  • In XML, the only repertoire is Unicode (AKA ISO
    10646)
  • And your entities must be general:
    <!ENTITY eacute "&#x00E9;"> <!-- e-accent aigue -->
  • The good news is that the eacute is now well
    defined
  • The bad news is that you lost control
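
To illustrate "well defined": an XML parser replaces the general entity with the Unicode character before the application sees anything. The sketch below uses Python's standard xml.etree.ElementTree. The para element and its sample text are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A tiny document that declares eacute in its internal DTD subset.
DOC = """<!DOCTYPE para [
  <!ENTITY eacute "&#x00E9;"> <!-- e-accent aigue -->
]>
<para>caf&eacute;</para>"""


def expanded_text(doc: str) -> str:
    """Parse doc and return the root element's text after entity expansion."""
    return ET.fromstring(doc).text
```

The returned text holds U+00E9 itself. The entity name is gone after parsing, which is one way to read "you lost control".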

8
Character overload: the need for restriction
  • It takes a big font to support all Unicode
    glyphs:
    19-06-2003 05:05 367.112 arial.ttf
    12-01-1999 10:59 24.131.012 ARIALUNI.TTF
  • Are you sure you know how to process all
    characters?
  • how to sort?
  • how to hyphenate?
  • how to pronounce?
  • how to render?
  • how to search?
  • In XML, do you need ¹ 1 ? ? ? ? ? ? What's wrong
    with <myElement>1</myElement> and letting
    <myElement>'s style decide?
  • Would this be correct?
    <para xml:lang="en">Enchant é, M?!</para>
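
The question about ¹ can be made concrete with the standard unicodedata module. Unicode itself records such characters as styled variants of a plain digit, through a compatibility decomposition. The helper names below are illustrative. The circled and subscript digits in the usage note are assumed examples, since the slide's own characters were lost in extraction.

```python
import unicodedata


def plain_equivalent(ch: str) -> str:
    """Return the NFKC (compatibility) normalization of ch."""
    return unicodedata.normalize("NFKC", ch)


def decomposition_tag(ch: str) -> str:
    """Return the styling tag of a compatibility decomposition, or ''."""
    decomp = unicodedata.decomposition(ch)
    return decomp.split()[0] if decomp.startswith("<") else ""
```

For U+00B9 (superscript one) the tag is `<super>`, for U+2081 it is `<sub>` and for U+2460 it is `<circle>`. All three normalize to the plain digit 1, which is the information a style on <myElement> could carry instead.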

9
Character underload: the need for extension
  • Potentially 1,000,000 characters, and still not
    satisfied?
  • If the replacement text is context or style sheet
    sensitive
  • questionmark: in Greek ";", in Latin "?"
  • some-dash: sometimes mdash, sometimes
    ndash
  • Topographical registration marks
  • Combinations
  • j-acute: j followed by &#x0301; (j&#x0301;)
  • min-2: superscript/subscript minus,
    &#x207B;/&#x208B;
  • Chinese / Japanese / Korean characters for names
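
The j-acute case can be checked with the standard unicodedata module. Canonical composition (NFC) merges e plus U+0301 into one code point, but Unicode has no precomposed j with acute, so that pair stays a two-code-point sequence. The function name is illustrative.

```python
import unicodedata


def composes_to_single(base: str, mark: str) -> bool:
    """Return True if base + combining mark has a precomposed NFC form."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1
```

Any software that assumes one character per code point therefore handles é and j-acute differently. That is the kind of "combination" the slide means.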

10
How to specify a restriction
  • Unicode characters have three characteristics:
  • Codepoint (the number) and name
  • Block (range of codepoints, like Latin-1
    Supplement)
  • General category (like Lu: Letter, uppercase)
  • XML Schema datatypes allow restrictions to block
    or category
  • But only of data content, not of mixed content
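
A short sketch of the three characteristics, using the standard unicodedata module for name and general category. That module has no block lookup, so BLOCKS is a hand-written two-entry table that covers only the blocks named in this talk. It is an assumption of the example, not a complete list.

```python
import unicodedata

# Illustrative block table: only the two blocks discussed in the talk.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
]


def describe(ch: str) -> dict:
    """Return code point, name, block and general category of ch."""
    cp = ord(ch)
    block = next(
        (name for first, last, name in BLOCKS if first <= cp <= last),
        "(not in table)",
    )
    return {
        "codepoint": f"U+{cp:04X}",
        "name": unicodedata.name(ch, "(unnamed)"),
        "block": block,
        "category": unicodedata.category(ch),
    }
```

Calling describe on É gives U+00C9, LATIN CAPITAL LETTER E WITH ACUTE, Latin-1 Supplement and Lu.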

11
Document Schema Definition Languages (DSDL)
  • New International Standard: IS 19757
  • Part 3: Rule-based validation - Schematron
    <sch:rule context="/*[@xml:lang='nl']">
    <sch:assert test="\p{IsBasicLatin}\p{IsLatin-Supplement}">
    Dutch text should only contain basic latin
    characters, some of which may have accents
    </sch:assert> </sch:rule>
  • "\p{IsBasicLatin}\p{IsLatin-Supplement}" is hard
    to read and to reuse
  • Part 7: Character Repertoire Description Language
    - CRDL. Defines reusable named collections of
    characters
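
The rule above can be approximated outside Schematron. This is a rough sketch, not a Schematron implementation. Python's standard re module has no \p{Is...} block escapes, so the two blocks are spelled out as code point ranges (Basic Latin U+0000..U+007F, Latin-1 Supplement U+0080..U+00FF). The function names are illustrative, and inheritance of xml:lang by child elements is deliberately ignored.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
DUTCH_RANGES = [(0x0000, 0x007F), (0x0080, 0x00FF)]


def offending_chars(text: str) -> list:
    """Return the characters of text that fall outside DUTCH_RANGES."""
    return [
        ch for ch in text
        if not any(first <= ord(ch) <= last for first, last in DUTCH_RANGES)
    ]


def check_dutch(doc: str) -> list:
    """Return (tag, offending characters) for each failing xml:lang='nl' element."""
    failures = []
    for elem in ET.fromstring(doc).iter():
        if elem.get(XML_LANG) == "nl":
            bad = offending_chars("".join(elem.itertext()))
            if bad:
                failures.append((elem.tag, bad))
    return failures
```

An empty result means every Dutch-tagged element passed. Each failure names the element and the characters that broke the restriction.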

12
What a CRDL definition might look like
  • "\p{IsBasicLatin}\p{IsLatin-Supplement}"
  • <collection name="dutch-chars">
      <union>
        <ref href="www.unicode.org/gencat/BasicLatin/"/>
        <ref href="www.unicode.org/gencat/Latin-Supplement/"/>
      </union>
    </collection>
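
The idea of reusable named collections can be sketched as a small registry in which a collection is either a list of ranges or a union of other named collections. This is not CRDL syntax or semantics. The COLLECTIONS table, its keys and the two functions are hypothetical stand-ins for the definition above.

```python
# Hypothetical registry: a collection is either explicit ranges or a union
# of references to other named collections.
COLLECTIONS = {
    "BasicLatin": {"ranges": [(0x0000, 0x007F)]},
    "Latin-1Supplement": {"ranges": [(0x0080, 0x00FF)]},
    "dutch-chars": {"union": ["BasicLatin", "Latin-1Supplement"]},
}


def resolve(name: str) -> list:
    """Flatten the named collection into a list of (first, last) ranges."""
    entry = COLLECTIONS[name]
    if "ranges" in entry:
        return list(entry["ranges"])
    ranges = []
    for ref in entry["union"]:
        ranges.extend(resolve(ref))
    return ranges


def in_collection(ch: str, name: str) -> bool:
    """Return True if ch belongs to the named collection."""
    return any(first <= ord(ch) <= last for first, last in resolve(name))
```

A schema can then refer to "dutch-chars" by name, and a change to one of the underlying collections reaches every collection that references it.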

13
Unicode's solution for extension: Private Use
Areas
  • E000..F8FF Private Use Area
  • F0000..FFFFF Supplementary Private Use Area-A
  • 100000..10FFFF Supplementary Private Use Area-B
  • General category for characters in these areas is
    Co: Other, Private Use
  • In the past, my eacute was probably the same
    character as your eacute
  • ISO lists of frequently used character entities
  • But my Private Use character U+E000 is probably
    not your character U+E000
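
A quick check with the standard unicodedata module shows what Unicode itself says about these code points: general category Co and no name. Everything else is left to private agreement. The helper names are illustrative.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if ch has general category Co (Other, Private Use)."""
    return unicodedata.category(ch) == "Co"


def standard_name(ch: str):
    """Return the Unicode character name of ch, or None if it has none."""
    return unicodedata.name(ch, None)
```

For U+E000 the category is Co and the name is None, so nothing in the standard tells your system what my U+E000 means.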

14
SGML and XML roundtrips using PU characters
  • We still use SGML
  • But sometimes XML tools are nice
  • So we do roundtrips
  • Then one-to-one mapping is handy
  • But our entities are many-to-many
  • Unless we use the private use area: from
    <!ENTITY o-umlaut "&#x00F6;"> <!-- oe -->
    <!ENTITY o-trema "&#x00F6;"> <!-- -o -->
    to
    <!ENTITY o-umlaut "&#xE000;"> <!-- oe -->
    <!ENTITY o-trema "&#x00F6;"> <!-- -o -->
  • But then, how do we define the processing of
    private use characters?
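
The roundtrip problem can be shown with two small tables that mirror the entity declarations above. When two entity names expand to the same character, the way back from XML to SGML is ambiguous. Moving one of them to a private use code point makes the mapping invertible. The table and function names are illustrative.

```python
# Entity name -> replacement character, before and after the change.
BEFORE = {"o-umlaut": "\u00F6", "o-trema": "\u00F6"}
AFTER = {"o-umlaut": "\uE000", "o-trema": "\u00F6"}


def invert(entities: dict) -> dict:
    """Map each character back to the entity names that produce it."""
    back = {}
    for name, char in entities.items():
        back.setdefault(char, []).append(name)
    return back


def is_roundtrippable(entities: dict) -> bool:
    """Return True if every character maps back to exactly one entity name."""
    return all(len(names) == 1 for names in invert(entities).values())
```

With BEFORE, U+00F6 maps back to both names, so the original entity cannot be restored. With AFTER, each character identifies exactly one entity.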

15
How to validate an extension
  • If I define private use area characters, how do I
    define their behaviour?
  • Or characteristics, like
  • allow my PU char U+E000 wherever Latin-1
    Supplement is allowed, or
  • treat my PU char U+E000 as if it were an
    uppercase letter
  • Processing is not part of DSDL, that's only
    validation
  • If I add a character, I only want to define it
    once
  • But I want to reuse public character collections
  • My DocBook DTD might specify public collection
    restrictions
  • So if DSDL 7 doesn't allow defining
    characteristics
  • My DSDL 7 schemas will have to be generated

16
Bottom Up Constraint Language
  • SGML and XML are hierarchical, top down
  • An element type definition defines its content,
    not where it may be used
  • A CSS might be used to validate a document
  • In a Bottom Up Constraint Language an element
    defines
  • which type its content is, and
  • which type itself is, and thereby a new element
    adds itself to every element that contains its
    type
  • its processing characteristics (CSS etc)
  • its documentation
  • The same goes for characters: if a PU character
    says it is like a Latin-1 upper case letter,
    that should be enough to allow it wherever
    Latin-1 upper case letters are allowed
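
The bottom-up idea for characters can be sketched as follows. The private use character declares its own general category once, and an existing rule written in terms of categories accepts it without being edited. PU_CATEGORY and both functions are hypothetical. The sketch covers only the general category and leaves out the block half of "Latin-1 upper case letter".

```python
import unicodedata

# Hypothetical bottom-up declaration: U+E002 says it behaves like an
# uppercase letter (Lu).
PU_CATEGORY = {"\uE002": "Lu"}


def effective_category(ch: str) -> str:
    """Return the declared category for PU characters, else Unicode's."""
    return PU_CATEGORY.get(ch, unicodedata.category(ch))


def allowed(text: str, categories: set) -> bool:
    """Return True if every character of text has an allowed category."""
    return all(effective_category(ch) in categories for ch in text)
```

The rule "only Lu" is never touched. Declaring U+E002 is enough to let it in, while an undeclared private use character such as U+E003 is still rejected.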

17
What a BUCL statement might look like
  • <CharProp CodePoint="E000" PU_Name="COMBINING
    UMLAUT" PU_General_Category="Mn"
    PU_Canonical_Combining_Class="230"
    PU_Bidi_Class="NSM" PU_Bidi_Mirrored="N"
    Render_As="0308" />
  • <CharProp CodePoint="E001" PU_Name="LATIN SMALL
    LETTER O WITH UMLAUT"
    PU_Decomposition_Mapping="006F E000"
    PU_General_Category="Ll"
    PU_Simple_Uppercase_Mapping="E002"
    Char_Entity="oumlaut" Order_As="6F 65" />
  • <CharProp CodePoint="E002" PU_Name="LATIN CAPITAL
    LETTER O WITH UMLAUT"
    PU_Decomposition_Mapping="004F E000"
    PU_General_Category="Lu"
    PU_Simple_Lowercase_Mapping="E001"
    Char_Entity="Oumlaut" Order_As="4F 45"/>
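
Such statements are easy to consume with ordinary XML tooling. The sketch below reads CharProp elements with the standard xml.etree.ElementTree module and uses one property, the simple uppercase mapping. The CharProps wrapper element, the shortened attribute lists and the function names are assumptions of the example. BUCL itself is only a proposal in this talk.

```python
import xml.etree.ElementTree as ET

# Two of the statements above, shortened, inside a hypothetical wrapper.
BUCL = """<CharProps>
  <CharProp CodePoint="E001" PU_General_Category="Ll"
            PU_Simple_Uppercase_Mapping="E002"
            Char_Entity="oumlaut" Order_As="6F 65"/>
  <CharProp CodePoint="E002" PU_General_Category="Lu"
            PU_Simple_Lowercase_Mapping="E001"
            Char_Entity="Oumlaut" Order_As="4F 45"/>
</CharProps>"""


def load_props(xml_text: str) -> dict:
    """Return {character: attribute dict} for every CharProp element."""
    return {
        chr(int(elem.get("CodePoint"), 16)): dict(elem.attrib)
        for elem in ET.fromstring(xml_text).iter("CharProp")
    }


def pu_upper(ch: str, props: dict) -> str:
    """Uppercase ch, using the declared PU mapping when there is one."""
    mapping = props.get(ch, {}).get("PU_Simple_Uppercase_Mapping")
    return chr(int(mapping, 16)) if mapping else ch.upper()
```

With these declarations, uppercasing U+E001 yields U+E002, while ordinary characters fall back to the built-in str.upper.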

18
What a BUCL system might do
  • BUCL might create a DTD for every purpose
  • For rendering or editing purposes:
    <!ENTITY oumlaut "o&#x0308;">
    <!ENTITY Oumlaut "O&#x0308;">
  • For sorting purposes:
    <!ENTITY oumlaut "oe">
    <!ENTITY Oumlaut "OE">
  • For roundtrip purposes:
    <!ENTITY oumlaut "&#xE001;">
    <!ENTITY Oumlaut "&#xE002;">
  • And CSS, FOSIs, Relax NG schemas, documentation,
    ...
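
A sketch of the generation step: one set of character properties, three sets of entity declarations. CHAR_PROPS is a hypothetical in-memory form of the CharProp statements, with the base letter and rendering mark already resolved. The purpose names "render", "sort" and "roundtrip" are also assumptions.

```python
# Hypothetical in-memory form of the CharProp records.
CHAR_PROPS = [
    {"entity": "oumlaut", "codepoint": 0xE001, "base": "o",
     "render_mark": 0x0308, "order_as": "oe"},
    {"entity": "Oumlaut", "codepoint": 0xE002, "base": "O",
     "render_mark": 0x0308, "order_as": "OE"},
]


def entity_decls(purpose: str) -> list:
    """Generate ENTITY declarations for 'render', 'sort' or 'roundtrip'."""
    decls = []
    for prop in CHAR_PROPS:
        if purpose == "render":
            # Base letter followed by a reference to the combining mark.
            value = f"{prop['base']}&#x{prop['render_mark']:04X};"
        elif purpose == "sort":
            value = prop["order_as"]
        elif purpose == "roundtrip":
            value = f"&#x{prop['codepoint']:04X};"
        else:
            raise ValueError(f"unknown purpose: {purpose}")
        decls.append(f'<!ENTITY {prop["entity"]} "{value}">')
    return decls
```

Each call reproduces one of the three declaration pairs listed above. A new character is described once and appears in every generated DTD.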

19
Bottom Up Constraint Language (2)

20
PCDATA considered harmful
  • A bulleted list should not specify to use ? as
    bullet, the style sheet should
  • In a <para xml:lang="en"> no Chinese characters
    should be allowed
  • But wouldn't spell checking do?
  • But if you're using a word list, why allow
    (English) characters?
  • Why not encode a paragraph as a sequence of
    sentences, and a sentence as a grammatical tree,
    and use references to a dictionary?
  • And let the value of xml:lang decide which
    dictionary and which transformation rules to
    apply!
  • Only dictionaries, <name> elements and stylesheets
    should use PCDATA

21
Conclusions
  • For quality control, you need restriction and
    validation
  • The world changes, new characters occur
  • Private Use characters can replace SDATA entities
  • You need to be able to specify their behavior
  • And characteristics
  • From which the validation rules should follow
  • BUCL UP!
  • If you do your data analysis well, the usage of
    PCDATA should be far more restricted than it is
    now

22
Questions
  • Do you really mean it?
  • Do you sell dictionaries?
  • I want my logo to be a character, how do I do
    that?