From UCS2 to UTF16 - PowerPoint PPT Presentation

Slides: 27

Provided by: mar1313

Learn more at: http://www.unicode.org

Transcript and Presenter's Notes

1
From UCS-2 to UTF-16

Discussion and practical example for the
transition of a Unicode library from UCS-2 to
UTF-16

2
Why is this an issue?

The concept of the Unicode standard changed
during its first few years
Unicode 2.0 (1996) expanded the code point range
from 64k to 1.1M
APIs and libraries need to follow this change and
support the full range
Upcoming character assignments (Unicode 3.1,
2001) fall into the added range

3
Unicode is a 16-bit character set

Concept: 16-bit, fixed-width character set
Saving space by not including precomposed,
rarely-used, or obsolete characters
Compatibility, transition strategies, and
acceptance forced loosening of these principles
Unicode 3.1: >90k assigned characters

4
16-bit APIs

APIs developed for Unicode 1.1 used 16-bit
characters and strings: UCS-2
Assuming 1:1 character:code unit
Examples: Win32, Java, COM, ICU, Qt/KDE
Byte-based UTF-8 (1993) mostly for MBCS
compatibility and transfer protocols

5
Extending the range

Set aside two blocks of 1k 16-bit values,
surrogates, for extension
1k x 1k = 1M = 0x100000 additional code points
using a pair of code units
16-bit form now variable-width: UTF-16
Unicode scalar values: 0..0x10ffff
Proposed 1994, part of Unicode 2.0 (1996)
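
The pair arithmetic above can be sketched in C as follows. This is a minimal illustration, not code from the presentation: the function names are hypothetical, and the block positions (lead surrogates 0xd800..0xdbff, trail surrogates 0xdc00..0xdfff) come from the Unicode standard rather than from the slide.

```c
#include <stdint.h>

/* Hypothetical helper: split a supplementary code point
 * (0x10000..0x10ffff) into a lead/trail surrogate pair. */
static void encode_pair(uint32_t c, uint16_t *lead, uint16_t *trail) {
    c -= 0x10000;                               /* 20 bits remain */
    *lead  = (uint16_t)(0xd800 + (c >> 10));    /* upper 10 bits: 1k lead values */
    *trail = (uint16_t)(0xdc00 + (c & 0x3ff));  /* lower 10 bits: 1k trail values */
}

/* Hypothetical helper: combine a valid lead/trail pair into a code point. */
static uint32_t decode_pair(uint16_t lead, uint16_t trail) {
    return 0x10000
         + (((uint32_t)lead - 0xd800) << 10)
         + ((uint32_t)trail - 0xdc00);
}
```

Neither helper validates its input; a caller would first check that the values are in the expected ranges.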

6
Parallel with ISO-10646

ISO-10646 uses 31-bit codes: UCS-4
UCS-2: 16-bit codes for subset 0..0xffff
UTF-16: transformation of subset 0..0x10ffff
UTF-8 covers all 31 bits
Private Use areas above 0x10ffff slated for
removal from ISO-10646 for UTF interoperability
and synchronization with Unicode

7
21-bit code points

Code points (Unicode scalar values) up to
0x10ffff use 21 bits
16-bit code units still good for strings:
variable-width, like MBCS
Default string unit size not big enough for code
points
Dual types for programming?

8
C char/wchar_t dual types

C/C++ standards: dual types
Strings: mostly with char units (8 bits)
Code points: wchar_t, 8..32 bits
Typical use in I18N-ed programs: (8-bit) char
strings but (16/32-bit) wchar_t (or 32-bit int)
characters; code point type is
implementation-dependent

9
Unicode dual types, too?

Strings could continue with 16-bit units
Single code points could get 32-bit data type
Dual-type model like C/C++ MBCS
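
A minimal sketch of such a dual-type model, assuming two hypothetical typedefs (`unit16` for string units, `cp32` for single code points) that are not taken from any library. Unpaired surrogates are passed through unchanged, which is one possible policy among several.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unit16;  /* hypothetical 16-bit string unit type */
typedef uint32_t cp32;    /* hypothetical 32-bit code point type */

/* Read the code point at s[*i] and advance *i past it.
 * Precondition: *i < length. */
static cp32 next_cp(const unit16 *s, size_t length, size_t *i) {
    cp32 c = s[(*i)++];
    if (c >= 0xd800 && c <= 0xdbff && *i < length) {
        unit16 t = s[*i];
        if (t >= 0xdc00 && t <= 0xdfff) {
            ++*i;  /* consume the trail surrogate as well */
            c = 0x10000 + ((c - 0xd800) << 10) + (cp32)(t - 0xdc00);
        }
    }
    return c;
}

/* Count code points (not code units) in a 16-bit string. */
static size_t count_code_points(const unit16 *s, size_t length) {
    size_t i = 0, n = 0;
    while (i < length) {
        (void)next_cp(s, length, &i);
        ++n;
    }
    return n;
}
```

Strings stay arrays of 16-bit units, and only values handed out one at a time use the wider type.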

10
Alternatives to dual 16/32 types

UTF-32: all types 32 bits wide, fixed-width
UTF-8: same complexity after range extension
beyond just the BMP, closer to C/C++ model,
byte-based
Use pairs of 16-bit units
Use strings for everything
Make string unit size flexible: 8/16/32 bits

11
UCS-2 to UTF-32

Fixed-width, single base type for strings and
code points
UCS-2 programming assumptions mostly intact
Wastes at least 33% space, typically 50%
Performance bottleneck: CPU - memory
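
The "typically 50%" figure can be illustrated by comparing storage for the same text in both forms. The sketch below uses hypothetical helper names and only counts bytes; for text that stays within the BMP, UTF-16 needs exactly half the bytes of UTF-32.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store n code points as UTF-16:
 * one 16-bit unit below 0x10000, two units (a surrogate pair) above. */
static size_t utf16_bytes(const uint32_t *cps, size_t n) {
    size_t units = 0;
    for (size_t k = 0; k < n; ++k) {
        units += (cps[k] >= 0x10000) ? 2 : 1;
    }
    return units * 2;
}

/* Bytes needed to store n code points as UTF-32: always four each. */
static size_t utf32_bytes(size_t n) {
    return n * 4;
}
```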

12
UCS-2 to UTF-8

UCS-2 programming assumes many characters in
single code units
Breaks a lot of code
Same question of type for code points: follow C
model, 32-bit wchar_t?
More difficult transition than other choices
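
To show why UCS-2 assumptions break under UTF-8, the sketch below computes how many bytes one code point occupies. The function name is hypothetical, and the range is limited to 0..0x10ffff (the UTF-16-reachable subset) rather than the full 31-bit range; surrogate values are not rejected here.

```c
#include <stdint.h>

/* Number of UTF-8 bytes for a code point in 0..0x10ffff,
 * or 0 if the value is out of that range. */
static int utf8_length(uint32_t c) {
    if (c < 0x80)      return 1;  /* ASCII */
    if (c < 0x800)     return 2;
    if (c < 0x10000)   return 3;  /* rest of the BMP */
    if (c <= 0x10ffff) return 4;  /* supplementary code points */
    return 0;
}
```

Only ASCII remains one unit per character, so most BMP text that was fixed-width in UCS-2 becomes variable-width.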

13
Surrogate pairs for single chars

Caller avoids code point calculation
But caller and callee need to detect and handle
pairs: caller choosing argument values, callee
checking for errors
Harder to use with code point constants because
they are published as scalar values
Significant change for caller from using scalars

14
Strings for single chars

Always pass in string (and offset)
Most general, handles graphemes in addition to
code points
Harder to use with code point constants because
they are published as scalar values
Significant change for caller from using scalars

15
UTF-flexible

In principle, if the implementation can handle
variable-width, MBCS-style strings, could it
handle any UTF-size as a compile-time choice?
Adds interoperability with UTF-8/32 APIs
Almost no assumptions possible
Complexity of transition even higher than of
transition to pure UTF-8, performance?

16
Interoperability

Break existing API users no more than necessary
Interoperability with other APIs: Win32, Java,
COM, now also XML DOM
UTF-16 is Unicode default: good compromise
(speed/ease/space)
String units should stay 16 bits wide

17
Does everything need to change?

String operations (search, substring,
concatenation) work with any UTF without change
Character property lookup and similar need to
support the extended range
Formatting should handle more code points or
even graphemes
Careful evaluation of all public APIs
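
As an illustration of a string operation that needs no change, the sketch below searches by code unit only. The function name is hypothetical. It assumes both strings are well-formed UTF-16; because lead surrogates, trail surrogates, and BMP characters occupy disjoint value ranges, a well-formed needle cannot match starting in the middle of a surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the unit offset of the first occurrence of needle in hay,
 * or -1 if it does not occur. Compares 16-bit units only. */
static long find_units(const uint16_t *hay, size_t hay_len,
                       const uint16_t *needle, size_t needle_len) {
    if (needle_len == 0 || needle_len > hay_len) {
        return -1;
    }
    for (size_t i = 0; i + needle_len <= hay_len; ++i) {
        if (memcmp(hay + i, needle, needle_len * sizeof *needle) == 0) {
            return (long)i;
        }
    }
    return -1;
}
```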

18
ICU: some of all

Strings: UTF-16, UChar type remains 16-bit
New UChar32 for code points
Provide macros for C to deal with all UTFs:
iteration, random access, ...
C++ CharacterIterator: many new functions
Property lookup/low-level: UChar32
Formatting: strings for graphemes
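
A rough sketch of what C macros for iteration and random access over UTF-16 could look like. The `SK_` names are hypothetical and these are not ICU's definitions; the result variable `c` is assumed to be a 32-bit integer and `i` a modifiable index.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_IS_LEAD(u)  (((u) & 0xfc00) == 0xd800)
#define SK_IS_TRAIL(u) (((u) & 0xfc00) == 0xdc00)

/* Random access: if i points at the trail half of a pair,
 * move it back to the lead half. */
#define SK_SET_CP_START(s, start, i) \
    do { \
        if (SK_IS_TRAIL((s)[i]) && (i) > (start) && \
            SK_IS_LEAD((s)[(i) - 1])) { \
            --(i); \
        } \
    } while (0)

/* Iteration: read the code point starting at i into c, advance i. */
#define SK_NEXT(s, i, length, c) \
    do { \
        (c) = (s)[(i)++]; \
        if (SK_IS_LEAD(c) && (i) < (length) && SK_IS_TRAIL((s)[i])) { \
            (c) = 0x10000 + (((c) - 0xd800) << 10) \
                + ((s)[(i)++] - 0xdc00); \
        } \
    } while (0)
```

Macros keep the per-character overhead low in C, at the cost of evaluating their arguments more than once.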

19
Scalar code points: property lookup

Old, 16-bit: UChar u_tolower(UChar c)
two-stage lookup indexed by bits c15..7 and c6..0
New, 21-bit: UChar32 u_tolower(UChar32 c)
three-stage lookup indexed by bits c20..10,
c9..4, and c3..0
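
A toy sketch of a three-stage lookup that splits a 21-bit code point into the bit fields named above (c20..10, c9..4, c3..0). The table names, sizes, and contents are invented for illustration and do not reflect ICU's actual data layout; only a single mapping ('A' to 'a') is populated.

```c
#include <stdint.h>

/* Stage 1: indexed by c20..10; holds offsets into stage2.
 * 0x10ffff >> 10 == 0x43f, so 0x440 entries cover the whole range. */
static const uint16_t stage1[0x440] = { [0] = 64 };

/* Stage 2: blocks of 64, indexed by c9..4; holds offsets into stage3.
 * Block 0 is the shared all-default block; block 1 serves 0..0x3ff. */
static const uint16_t stage2[128] = { [64 + 4] = 16 };

/* Stage 3: blocks of 16, indexed by c3..0; holds lowercase deltas.
 * Block 0 is all zero (no mapping); block 1 serves 0x40..0x4f. */
static const int16_t stage3[32] = { [16 + 1] = 32 };

/* Hypothetical lowercase mapping via the three stages. */
static uint32_t sketch_tolower(uint32_t c) {
    if (c > 0x10ffff) {
        return c;  /* outside the code point range: no mapping */
    }
    uint16_t i2 = stage1[c >> 10];
    uint16_t i3 = stage2[i2 + ((c >> 4) & 0x3f)];
    return (uint32_t)((int32_t)c + stage3[i3 + (c & 0xf)]);
}
```

Sharing the default blocks is what keeps such tables compact: ranges with no data all point at the same zero-filled block.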

20
Formatting grapheme strings

Old: void setDecimalSymbol(UChar c)
New: void setDecimalSymbol(const UnicodeString
&s)

21
Codepage conversion

To Unicode: results are one or two UTF-16 code
units, surrogates stored directly in the
conversion table
From Unicode: triple-stage compact array access
from 21-bit code points, like property lookup
Single-character-conversion to Unicode now
returns UChar32 values

22
API first

Tools and basic functions and classes are in
place (property lookup, conversion, iterators,
BiDi)
Public APIs reviewed and changed (luxury of
early project stage) or deprecated and superseded
by new versions
Higher-level implementations to follow before
Unicode 3.1 is published

23
More implementations follow

Collation: need to prepare for >64k primary keys
Normalization and Transliteration
Word/Sentence break iteration
Etc.
No non-BMP data before Unicode 3.1 is stable

24
Other libraries

Java: planning stage for transition
Win32: rendering and UniScribe API largely
UTF-16-ready
Linux: standardizing on 32-bit Unicode wchar_t,
has UTF-8 locales like other Unixes for char
APIs
W3C: standards assume full UTF-16 range

25
Summary