ISO/IEC 10646 and Unicode

Slides: 16

Provided by: abc759

Transcript and Presenter's Notes

1
ISO/IEC 10646 and Unicode

It is a coded character set (codeset)
Designed for text processing and exchange
Features
Universal: covers the characters in almost all national
standards
Framework: the coding architecture is fixed, and
code points can be filled in later
Uniform and efficient: fixed-width encoding, so there is no
need to determine the length of each code (as there is
when ASCII is mixed with Big5 or GB)
Unambiguous: any given 16-bit (32-bit) value
always represents the same character

2
UCS-4 (Canonical form of ISO 10646)

Fixed 32-bit (actually 31-bit) coding assignment
00 00 00 00 to 7F FF FF FF
Each plane: $2^{16}$ = 65,536 code points
BMP (the Basic Multilingual Plane)
Both Group No. and Plane No. are 00 (the first two
bytes are zero)
Before ISO 10646 part 2 came out (end of
2001), only the BMP contained characters

Four-byte layout of a UCS-4 code: Group No. (128 total) | Plane No. (256 total) | High Byte (256 total) | Low Byte (256 total)
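
As a rough illustration of the four-byte layout above, the following Python sketch splits a UCS-4 value into its group, plane, high-byte and low-byte fields. The function name `ucs4_fields` is made up for this example and does not come from the slides or from any library.

```python
def ucs4_fields(code: int) -> tuple:
    """Split a 31-bit UCS-4 value into (group, plane, high byte, low byte)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values lie between 00000000 and 7FFFFFFF")
    group = (code >> 24) & 0x7F   # 7 bits: 128 groups
    plane = (code >> 16) & 0xFF   # 8 bits: 256 planes per group
    high = (code >> 8) & 0xFF     # 8 bits: high byte within the plane
    low = code & 0xFF             # 8 bits: low byte within the plane
    return group, plane, high, low


# A BMP code has group 0 and plane 0; a Plane 2 code has plane 2.
print(ucs4_fields(0x00004E2D))
print(ucs4_fields(0x00020000))
```

A code belongs to the BMP exactly when the first two fields are both zero.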
3
Code Architecture of UCS-4
[Figure: 128 groups (Group 0 through Group 127), 256 planes per group; Plane 00 of Group 0 is the BMP]
4

UCS-2: 2-byte representation of UCS-4
Covers the Basic Multilingual Plane (BMP)
Switching mechanism that uses a reserved code range of the BMP to
access another 16 planes (surrogate pairs)
BMP zones

A-Zone: alphabets, symbols, CJK miscellaneous
I-Zone: CJK ideographs
O-Zone: Hangul
S-Zone: surrogates
R-Zone: private use, compatibility zone, Arabic
presentation forms
5
Unicode

Unicode is an implementation of ISO 10646 with a
16-bit representation using UCS-2
It also defines the behavior associated with certain
characters
Control character behavior
Rendering behavior of combining characters
Examples
The control character bell <BEL> should cause a sound
in the system
Typing U+0061 (a) followed by U+0300 (combining grave accent)
renders as one symbol, à
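
The combining-character example can be tried in Python. Rendering itself happens in the display layer, so this sketch uses NFC normalization from the standard `unicodedata` module as a stand-in: it shows that the two-code sequence corresponds to a single precomposed character. Normalization is not mentioned on the slide and is brought in here only for illustration.

```python
import unicodedata

decomposed = "\u0061\u0300"   # 'a' followed by COMBINING GRAVE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# Two code points before normalization, one after.
print(len(decomposed), len(composed))
print(unicodedata.name(composed))
```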

6
Extension of ISO 10646

Extension A (BMP) has 6,582 characters, published
in 2000 in ISO/IEC 10646-1 Second Edition (2000).
Extension B
All characters in ????,?????, plus other
characters such as those in the HK Supplementary
Character Set
ISO/IEC 10646-2 (2001), total of 43,253
characters
In Plane 2 of UCS-4
How would Extension B be supported in UCS-2? =>
Using some encoding scheme
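
One such encoding scheme is the surrogate-pair mechanism. As a minimal sketch, assuming the standard UTF-16 arithmetic, a Plane 2 code point can be split into two 16-bit codes as follows. The helper name `to_surrogate_pair` is hypothetical.

```python
def to_surrogate_pair(n: int) -> tuple:
    """Split a code point outside the BMP into a (high, low) pair of 16-bit codes."""
    if not 0x10000 <= n <= 0x10FFFF:
        raise ValueError("only code points 10000..10FFFF need a surrogate pair")
    offset = n - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# First code point of Plane 2, where Extension B is located.
high, low = to_surrogate_pair(0x20000)
print(f"{high:04X} {low:04X}")
```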

7
Surrogate Pairs

Two UCS-2 codes, H followed by L, written <H, L>, where
H is in the range D800 - DBFF
L is in the range DC00 - DFFF
For a given UCS-2 code (or code pair) U, the
corresponding UCS-4 code-point value N (scalar
value) is
$N = U$ if U is a single, non-surrogate value
$N = (H - \mathrm{D800}_{16}) \times 400_{16} + (L - \mathrm{DC00}_{16}) + 10000_{16}$
if U is a surrogate pair <H, L>
N is undefined for any other U in UCS-2.
N is in the range 0 to $\mathrm{10FFFF}_{16}$
<D800, DC00> => $N = 10000_{16}$
<DBFF, DFFF> => ?
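
The mapping from U to N above can be written out directly. The following is a minimal Python sketch of that definition; the function name `ucs2_to_scalar` and its argument convention (one code, or a high and low code) are assumptions made for this example.

```python
from typing import Optional


def ucs2_to_scalar(first: int, second: Optional[int] = None) -> int:
    """Return the scalar value N for a single UCS-2 code or a surrogate pair <H, L>."""
    if second is None:
        # Single code: valid only if it is not a surrogate value.
        if not 0 <= first <= 0xFFFF or 0xD800 <= first <= 0xDFFF:
            raise ValueError("not a single non-surrogate UCS-2 code")
        return first
    if not (0xD800 <= first <= 0xDBFF and 0xDC00 <= second <= 0xDFFF):
        raise ValueError("not a valid <H, L> surrogate pair")
    return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000


# Evaluate the two example pairs from the slide.
for h, l in [(0xD800, 0xDC00), (0xDBFF, 0xDFFF)]:
    print(f"<{h:04X}, {l:04X}> => {ucs2_to_scalar(h, l):X}")
```

Any other input, such as an unpaired surrogate, raises an error, which matches the "undefined" case in the definition.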

UTF: UCS Transformation Format
Allows a certain number of code values in UCS
that correspond to some other coding standard
(e.g. ASCII) to be transmitted just as they
would be in that coding standard, a property
known as transparency, while other code values are
represented through an escape mechanism
Variable-length encoding to achieve greater
efficiency

UTF-8: 8-bit encoding for the 8-bit UNIX environment
ASCII transparent
The first byte indicates the number of bytes in the encoded sequence
Shortest-encoding principle, for invertible (or
bijective) encoding/decoding
Saves storage space for ASCII and non-ideographic
characters
Example: Unicode A324 0430 0023 8A43
=> UTF-8
Example: UTF-8 24 38 58 CE 82
=> UCS-4
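
The two exercises can be worked through in Python. The sketch below hand-codes the UTF-8 bit layout for code points up to 10FFFF so that the role of the first byte is visible, then decodes the second example with Python's built-in codec. The function `utf8_encode` is written for this illustration; in practice `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the shortest UTF-8 form."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable scalar value")
    if cp < 0x80:        # 1 byte:  0xxxxxxx (ASCII transparent)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# First exercise: Unicode code values to UTF-8 bytes.
codes = [0xA324, 0x0430, 0x0023, 0x8A43]
encoded = b"".join(utf8_encode(c) for c in codes)
print(encoded.hex(" ").upper())

# Second exercise: UTF-8 bytes back to UCS-4 values (8 hex digits each).
decoded = bytes.fromhex("24 38 58 CE 82").decode("utf-8")
print([f"{ord(ch):08X}" for ch in decoded])
```

The number of leading 1 bits in the first byte (none, two, three or four) gives the length of the sequence, which is what lets a decoder find character boundaries without a separate length field.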

10
Character vs. glyph

Character: the smallest component of written language
that has semantic value
Glyph: a shape that a character can
have when it is rendered or displayed
Example: the letter A drawn in different typefaces is the same
character with the same code. The concrete shapes can be very
different, yet they are given one code point.
Coding of variants

11
ISO 10646/Unicode Features for Chinese

Han Unification (Chinese, Japanese and Korean)
Unification problems
Different sources, non-cognate characters
Three-dimensional conceptual model
Semantics (x), abstract shape (y), actual shape (z)
Examples

12
Unification Rules

R1, Source Separation Rule: if two ideographs
are distinct in a primary source standard, then
they are not unified. Why?
R2, Non-cognate Rule: in general, if two
ideographs are unrelated in historical
derivation (non-cognate characters), then they are
not unified.
R3: by means of a two-level classification, the
abstract shape of each ideograph is determined.
Any two ideographs that possess the same abstract
shape are unified unless disallowed by R1 or R2.

Example
Component structure analysis

14
Sources of Unified Han Characters
15
Wide Characters vs. Multi-byte Characters

Text information needs to be represented by the
right data types.
Multi-byte characters: data are processed on a
per-byte basis (Big5, GB, EUC, even UTF-8)
Wide characters: fixed-width encoding, so no
testing of the high bit is needed
Processing representation for wide characters
Big endian vs. little endian
Data type dependent
System architecture dependent
Distinction: the byte order mark U+FEFF appears as the bytes
FE FF in big-endian data and FF FE in little-endian data
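
A minimal Python sketch of the byte-order check, using the BOM constants from the standard `codecs` module. The function name `detect_utf16_byte_order` and its return strings are choices made for this example.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of 16-bit text from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"                           # no mark present


# The same text, starting with U+FEFF, serialized in both byte orders.
sample = "\ufeffA"
print(detect_utf16_byte_order(sample.encode("utf-16-be")))
print(detect_utf16_byte_order(sample.encode("utf-16-le")))
```

A reader that sees FF FE where it expected FE FF knows the data was written with the opposite byte order and can swap the bytes of each 16-bit unit before processing.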