ISO/IEC 10646 and Unicode

Slides: 16
Provided by: abc759

1
ISO/IEC 10646 and Unicode
  • It is a coded character set (codeset)
  • Designed for text processing and exchange
  • Features
  • Universal: covers the characters in almost all
    national standards
  • Framework: the coding architecture is fixed, and
    code points can be filled in later
  • Uniform and efficient: fixed-width encoding, so
    there is no need to determine the code length of
    each character (as is necessary when mixing ASCII
    with Big5 or GB)
  • Unambiguous: any given 16-bit (or 32-bit) value
    always represents the same character

2
UCS-4 (Canonical form of ISO 10646)
  • Fixed 32-bit (actually 31-bit) code assignment
  • 00 00 00 00 to 7F FF FF FF
  • Each plane has 2^16 = 65,536 code points
  • BMP (the Basic Multilingual Plane)
  • Both Group No. and Plane No. are 00 (the first two
    bytes are zero)
  • Before ISO 10646 Part 2 came out (end of 2001),
    only the BMP contained characters

Four-byte layout of a UCS-4 code:
Group No. (total 128) | Plane No. (total 256) | High Byte (total 256) | Low Byte (total 256)
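
The four-field layout above can be illustrated by splitting a code value with shifts and masks. This is a minimal sketch, not part of the original slides; the function names split_ucs4 and in_bmp are hypothetical.

```python
def split_ucs4(code):
    """Split a 31-bit UCS-4 value into (group, plane, high byte, low byte)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values span 00000000 to 7FFFFFFF")
    group = (code >> 24) & 0x7F   # 7 bits: 128 groups
    plane = (code >> 16) & 0xFF   # 8 bits: 256 planes per group
    high = (code >> 8) & 0xFF     # 8 bits: 256 high-byte values
    low = code & 0xFF             # 8 bits: 256 low-byte values
    return group, plane, high, low


def in_bmp(code):
    """A code is in the BMP when both group and plane are 00."""
    group, plane, _, _ = split_ucs4(code)
    return group == 0 and plane == 0


print(split_ucs4(0x00004E2D))  # (0, 0, 78, 45)
print(split_ucs4(0x00020000))  # (0, 2, 0, 0)
print(in_bmp(0x00004E2D))      # True
print(in_bmp(0x00020000))      # False
```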
3
Code Architecture of UCS-4
[Figure: 128 groups (Group 0 to Group 127), each containing 256 planes; Plane 00 of Group 0 is the BMP]
4
  • UCS-2: 2-byte representation of UCS-4
  • Covers the Basic Multilingual Plane (BMP)
  • Switching mechanism that uses a code range of the
    BMP to access another 16 planes (surrogate pairs)
  • Zones of the BMP

A-Zone: Alphabets, symbols, CJK miscellaneous
I-Zone: CJK ideographs
O-Zone: Hangul
S-Zone: Surrogates
R-Zone: Private use, compatibility zone, Arabic presentation forms
5
Unicode
  • Unicode is the implementation of ISO 10646 with a
    16-bit representation using UCS-2
  • Defines the actions associated with certain
    characters
  • Control character behavior
  • Rendering behavior of combining characters
  • Examples
  • The control character bell <BEL> should cause a
    sound in the system
  • Typing U+0061 (a) followed by U+0300 (combining
    grave accent) is rendered as one symbol, à
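
The combining-character example can be checked with Python's standard unicodedata module. This is a small sketch; normalization form NFC is not mentioned in the slides and is used here only as an assumption, to show that the two-code sequence and the single precomposed character stand for the same rendered symbol.

```python
import unicodedata

decomposed = "\u0061\u0300"   # 'a' followed by COMBINING GRAVE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(unicodedata.name("\u0300"))  # COMBINING GRAVE ACCENT
print(len(decomposed))             # 2 (two code points)
print(len(composed))               # 1 (one precomposed code point)
print(composed == "\u00e0")        # True (U+00E0, a with grave)
```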

6
Extension of ISO 10646
  • Extension A (BMP) has 6,582 characters, published
    in 2000 in ISO/IEC 10646-1 Second Edition (2000)
  • Extension B
  • All characters in ????,?????, plus other
    characters such as those in the HK Supplementary
    Character Set
  • ISO/IEC 10646-2 (2001), a total of 43,253
    characters
  • In Plane 2 of UCS-4
  • How would Extension B be supported in UCS-2? =>
    Using some encoding scheme

7
Surrogate Pairs
  • Two UCS-2 codes, H followed by L, written <H, L>,
    where
  • H is in the range D800 - DBFF
  • L is in the range DC00 - DFFF
  • For a given UCS-2 code (or code pair) U, the
    corresponding UCS-4 code-point value N (scalar
    value) is
  • N = U if U is a single, non-surrogate value
  • N = (H - D800₁₆) × 400₁₆ + (L - DC00₁₆) + 10000₁₆
    where U is a surrogate pair <H, L>
  • Undefined for any other U in UCS-2
  • N is in the range 0 to 10FFFF₁₆
  • <D800, DC00> => N = 10000₁₆
  • <DBFF, DFFF> => ?
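
The surrogate-pair formula can be worked through in a few lines. This is a sketch under the definitions on this slide; the function names are hypothetical, and the inverse mapping is added only for illustration (it is the same mapping that UTF-16 specifies).

```python
def surrogates_to_scalar(h, l):
    """N = (H - D800) * 400 + (L - DC00) + 10000, all values in hex."""
    if not 0xD800 <= h <= 0xDBFF:
        raise ValueError("H must be in D800..DBFF")
    if not 0xDC00 <= l <= 0xDFFF:
        raise ValueError("L must be in DC00..DFFF")
    return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000


def scalar_to_surrogates(n):
    """Inverse mapping for scalar values outside the BMP."""
    if not 0x10000 <= n <= 0x10FFFF:
        raise ValueError("only 10000..10FFFF need a surrogate pair")
    offset = n - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(surrogates_to_scalar(0xD800, 0xDC00)))  # 0x10000
print(hex(surrogates_to_scalar(0xDBFF, 0xDFFF)))  # 0x10ffff
# First code point of Plane 2, where Extension B is located:
print([hex(u) for u in scalar_to_surrogates(0x20000)])  # ['0xd840', '0xdc00']
```

The second call gives the largest value the scheme can reach, which matches the stated upper bound of N.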

8
  • UTF: UCS Transformation Format
  • Allows a certain number of code values in UCS
    that correspond to some other coding standard
    (e.g. ASCII) to be transmitted exactly as they
    would be in that coding standard, a property
    known as transparency, while other code values
    are represented through an escape mechanism
  • Variable-length encoding to achieve greater
    efficiency

9
  • UTF-8: 8-bit encoding for the 8-bit UNIX environment
  • ASCII transparent
  • The first byte indicates the number of bytes in
    the encoded character
  • Shortest-encoding principle, for invertible (or
    bijective) encoding/decoding
  • Saves storage space for ASCII and non-ideographic
    characters
  • Example: Unicode A324 0430 0023 8A43
  • => UTF-8
  • Example: UTF-8 24 38 58 CE 82
  • => UCS-4
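
The two exercises above can be worked through as follows. This is a minimal sketch: utf8_encode is a hypothetical helper written for illustration, it is limited to the range 0 to 10FFFF (an assumption; the slides do not state an upper limit for UTF-8), and the decoding direction simply uses Python's built-in codec.

```python
def utf8_encode(cp):
    """Encode one code point using the shortest UTF-8 form."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not encoded on their own")
    if cp < 0x80:                       # 0xxxxxxx: ASCII passes through
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # 11110xxx plus three 10xxxxxx bytes
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


code_points = [0xA324, 0x0430, 0x0023, 0x8A43]
encoded = b"".join(utf8_encode(cp) for cp in code_points)
print(" ".join(f"{b:02X}" for b in encoded))
# EA 8C A4 D0 B0 23 E8 A9 83

decoded = bytes.fromhex("24 38 58 CE 82").decode("utf-8")
print([f"{ord(c):08X}" for c in decoded])
# ['00000024', '00000038', '00000058', '00000382']
```

The leading bits of each first byte (0, 110, 1110, 11110) give the length of the sequence, which is why a decoder never has to guess where a character ends.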

10
Character vs. glyph
  • Character: the smallest component of written
    language that has semantic value
  • Glyphs represent the shapes that characters can
    have when they are rendered or displayed
  • Example: A, A (the same letter in different
    typefaces) are the same character and have the
    same code. The concrete shapes can be very
    different, yet they are given one code point.
  • Coding of variants

11
ISO 10646/Unicode Features for Chinese
  • Han Unification (Chinese, Japanese and Korean)
  • Unification Problems
  • Different sources, non-cognate
  • Three-dimensional Conceptual Model
  • Semantics (x), abstract shape (y), actual shape (z)
  • Examples

12
Unification Rules
  • R1 (Source Separation Rule): If two ideographs
    are distinct in a primary source standard, then
    they are not unified. Why?
  • R2 (Non-cognate Rule): In general, if two
    ideographs are unrelated in historical
    derivation (non-cognate characters), then they
    are not unified.
  • R3: By means of a two-level classification, the
    abstract shape of each ideograph is determined.
    Any two ideographs that possess the same abstract
    shape are unified unless disallowed by R1 or R2.

13
  • Example
  • Component structure analysis

14
Sources of Unified Han Characters
15
Wide character vs. Multi-byte characters
  • Text information needs to be represented by the
    right data types.
  • Multi-byte character data are processed on a
    per-byte basis: Big5, GB, EUC, even UTF-8
  • Wide characters: fixed-width encoding, with no
    testing of the high bit needed
  • Processing representation for wide characters
  • Big endian vs. little endian
  • Data type dependent
  • System architecture dependent
  • Distinction: a leading 0xFEFF marks big-endian
    data and 0xFFFE marks little-endian data (the
    byte order mark, stored as bytes FE FF or FF FE)
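
The byte-order distinction can be sketched with the BOM constants from Python's standard codecs module. The function name detect_utf16_byte_order is hypothetical, and the sketch assumes 16-bit wide-character data that may or may not start with a byte order mark.

```python
import codecs


def detect_utf16_byte_order(data):
    """Return 'big', 'little', or 'unknown' from the leading two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"


text = "A\u4e2d"
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(big.hex())      # feff00414e2d
print(little.hex())   # fffe41002d4e
print(detect_utf16_byte_order(big))      # big
print(detect_utf16_byte_order(little))   # little
```

The same two characters produce different byte sequences on the two architectures, which is why the leading mark is needed when wide-character data is exchanged between systems.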