From UCS2 to UTF16 - PowerPoint PPT Presentation

About This Presentation
Title:

From UCS2 to UTF16


Slides: 27
Provided by: mar1313
Learn more at: http://www.unicode.org

Transcript and Presenter's Notes



1
From UCS-2 to UTF-16
  • Discussion and practical example for the
    transition of a Unicode library from UCS-2 to
    UTF-16

2
Why is this an issue?
  • The concept of the Unicode standard changed
    during its first few years
  • Unicode 2.0 (1996) expanded the code point range
    from 64k to 1.1M
  • APIs and libraries need to follow this change and
    support the full range
  • Upcoming character assignments (Unicode 3.1,
    2001) fall into the added range

3
Unicode is a 16-bit character set
  • Concept: 16-bit, fixed-width character set
  • Saving space by not including precomposed,
    rarely used, or obsolete characters
  • Compatibility, transition strategies, and
    acceptance forced loosening of these principles
  • Unicode 3.1: >90k assigned characters

4
16-bit APIs
  • APIs developed for Unicode 1.1 used 16-bit
    characters and strings: UCS-2
  • Assuming 1:1 character:code unit
  • Examples: Win32, Java, COM, ICU, Qt/KDE
  • Byte-based UTF-8 (1993) mostly for MBCS
    compatibility and transfer protocols

5
Extending the range
  • Set aside two blocks of 1k 16-bit values,
    "surrogates", for extension
  • 1k x 1k = 1M = 0x100000 additional code points
    using a pair of code units
  • 16-bit form now variable-width: UTF-16
  • Unicode scalar values: 0..0x10ffff
  • Proposed 1994, part of Unicode 2.0 (1996)
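
The surrogate arithmetic above can be sketched in C as follows. This is a minimal illustration, assuming the standard UTF-16 ranges (lead surrogates 0xd800..0xdbff, trail surrogates 0xdc00..0xdfff); the function names utf16_encode and utf16_combine are hypothetical and not taken from any library.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Encode one code point as UTF-16.
 * Writes 1 or 2 code units to out and returns the count,
 * or 0 if cp is a surrogate value or above 0x10ffff. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff) {
            return 0; /* surrogate values are not scalar values */
        }
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10ffff) {
        return 0;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xd800 + (cp >> 10));   /* lead: top 10 bits */
    out[1] = (uint16_t)(0xdc00 + (cp & 0x3ff)); /* trail: low 10 bits */
    return 2;
}

/* Combine a lead and a trail surrogate back into a code point.
 * The caller is assumed to have checked that both are in range. */
static uint32_t utf16_combine(uint16_t lead, uint16_t trail)
{
    return 0x10000
         + (((uint32_t)lead - 0xd800) << 10)
         + ((uint32_t)trail - 0xdc00);
}
```

Each surrogate carries 10 bits, which is where the 1k x 1k = 1M additional code points come from.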

6
Parallel with ISO-10646
  • ISO-10646 uses 31-bit codes: UCS-4
  • UCS-2: 16-bit codes for subset 0..0xffff
  • UTF-16: transformation of subset 0..0x10ffff
  • UTF-8 covers all 31 bits
  • Private Use areas above 0x10ffff slated for
    removal from ISO-10646 for UTF interoperability
    and synchronization with Unicode

7
21-bit code points
  • Code points (Unicode scalar values) up to
    0x10ffff use 21 bits
  • 16-bit code units still good for strings:
    variable-width like MBCS
  • Default string unit size not big enough for code
    points
  • Dual types for programming?

8
C: char/wchar_t dual types
  • C/C++ standards: dual types
  • Strings mostly with char units (8 bits)
  • Code points: wchar_t, 8..32 bits
  • Typical use in I18N-ed programs: (8-bit) char
    strings but (16/32-bit) wchar_t (or 32-bit int)
    characters; code point type is
    implementation-dependent

9
Unicode dual types, too?
  • Strings could continue with 16-bit units
  • Single code points could get 32-bit data type
  • Dual-type model like C/C++ MBCS
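
A dual-type model of this kind might look like the sketch below: strings stay arrays of 16-bit units, while a function that returns a single character uses a 32-bit type. The type names Unit16 and CodePoint32 and the function code_point_at are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Unit16;      /* hypothetical: storage unit for strings */
typedef int32_t  CodePoint32; /* hypothetical: type for one code point  */

/* Return the code point that starts at s[i].
 * A lead surrogate followed by a trail surrogate yields a
 * supplementary code point; anything else, including an unpaired
 * surrogate, is returned as the single unit value. */
static CodePoint32 code_point_at(const Unit16 *s, size_t length, size_t i)
{
    Unit16 u = s[i];
    if ((u & 0xfc00) == 0xd800 && i + 1 < length) {
        Unit16 t = s[i + 1];
        if ((t & 0xfc00) == 0xdc00) {
            return 0x10000
                 + (((CodePoint32)u - 0xd800) << 10)
                 + ((CodePoint32)t - 0xdc00);
        }
    }
    return u;
}
```

The string itself needs no conversion; only the single-character interface widens.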

10
Alternatives to dual 16/32 types
  • UTF-32: all types 32 bits wide, fixed-width
  • UTF-8: same complexity after range extension
    beyond just the BMP, closer to C/C++ model,
    byte-based
  • Use pairs of 16-bit units
  • Use strings for everything
  • Make string unit size flexible 8/16/32 bits

11
UCS-2 to UTF-32
  • Fixed-width, single base type for strings and
    code points
  • UCS-2 programming assumptions mostly intact
  • Wastes at least 33% space, typically 50%
  • Performance bottleneck: CPU <-> memory
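
As a rough illustration of the space cost, the sketch below computes the storage needed for the same sequence of code points in UTF-16 and in UTF-32. The function names are hypothetical. For BMP-only text UTF-32 needs twice the bytes of UTF-16, which is one plausible reading of the "typically 50%" figure.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store n code points in UTF-16:
 * one 2-byte unit for BMP code points, two units above 0xffff. */
static size_t utf16_bytes(const uint32_t *cps, size_t n)
{
    size_t units = 0;
    for (size_t i = 0; i < n; ++i) {
        units += (cps[i] >= 0x10000) ? 2 : 1;
    }
    return units * 2;
}

/* Bytes needed to store n code points in UTF-32: always 4 each. */
static size_t utf32_bytes(size_t n)
{
    return n * 4;
}
```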

12
UCS-2 to UTF-8
  • UCS-2 programming assumes many characters in
    single code units
  • Breaks a lot of code
  • Same question of type for code points: follow C
    model, 32-bit wchar_t?
  • More difficult transition than other choices

13
Surrogate pairs for single chars
  • Caller avoids code point calculation
  • But caller and callee need to detect and handle
    pairs: caller choosing argument values, callee
    checking for errors
  • Harder to use with code point constants because
    they are published as scalar values
  • Significant change for caller from using scalars

14
Strings for single chars
  • Always pass in string (and offset)
  • Most general, handles graphemes in addition to
    code points
  • Harder to use with code point constants because
    they are published as scalar values
  • Significant change for caller from using scalars

15
UTF-flexible
  • In principle, if the implementation can handle
    variable-width, MBCS-style strings, could it
    handle any UTF-size as a compile-time choice?
  • Adds interoperability with UTF-8/32 APIs
  • Almost no assumptions possible
  • Complexity of transition even higher than for a
    transition to pure UTF-8; performance?

16
Interoperability
  • Break existing API users no more than necessary
  • Interoperability with other APIs: Win32, Java,
    COM, now also XML DOM
  • UTF-16 is Unicode default: good compromise
    (speed/ease/space)
  • String units should stay 16 bits wide

17
Does everything need to change?
  • String operations (search, substring,
    concatenation) work with any UTF without change
  • Character property lookup and similar need to
    support the extended range
  • Formatting should handle more code points or
    even graphemes
  • Careful evaluation of all public APIs
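
To illustrate why plain string operations survive the transition, here is a minimal sketch of a code-unit substring search. It never inspects surrogates, yet it finds a supplementary character in a UTF-16 string, because lead surrogates, trail surrogates, and BMP code units occupy disjoint value ranges. The name find_units is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive search for needle in hay, comparing 16-bit code units only.
 * Returns the index of the first match, or -1 if there is none. */
static long find_units(const uint16_t *hay, size_t hay_len,
                       const uint16_t *needle, size_t needle_len)
{
    if (needle_len == 0) {
        return 0;
    }
    if (needle_len > hay_len) {
        return -1;
    }
    for (size_t i = 0; i + needle_len <= hay_len; ++i) {
        size_t j = 0;
        while (j < needle_len && hay[i + j] == needle[j]) {
            ++j;
        }
        if (j == needle_len) {
            return (long)i;
        }
    }
    return -1;
}
```

The same loop would work unchanged on 8-bit or 32-bit units if the element type were changed, which is the sense in which such operations are independent of the UTF.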

18
ICU: some of all
  • Strings: UTF-16, UChar type remains 16-bit
  • New UChar32 for code points
  • Provide macros for C to deal with all UTFs:
    iteration, random access, ...
  • C++ CharacterIterator: many new functions
  • Property lookup/low-level: UChar32
  • Formatting: strings for graphemes
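
A random-access macro of the kind mentioned above could look like the sketch below. It adjusts an arbitrary index so that it points at the start of a code point rather than into the middle of a surrogate pair. The macro names here are hypothetical and written for this illustration; ICU's actual macros are defined in its own headers and may differ in name and behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macros, not ICU's own. */
#define IS_LEAD(u)  (((u) & 0xfc00) == 0xd800)
#define IS_TRAIL(u) (((u) & 0xfc00) == 0xdc00)

/* If s[i] is a trail surrogate preceded by a lead surrogate,
 * move i back by one so it addresses the whole pair.
 * start is the lowest valid index; i must be a modifiable lvalue. */
#define UTF16_SET_START(s, start, i) \
    do { \
        if (IS_TRAIL((s)[i]) && (i) > (start) && IS_LEAD((s)[(i) - 1])) { \
            --(i); \
        } \
    } while (0)
```

At most one step back is ever needed, which keeps random access into UTF-16 strings cheap.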

19
Scalar code points: property lookup
  • Old, 16-bit: UChar u_tolower(UChar c)
    u[v[c15..7] + c6..0]
  • New, 21-bit: UChar32 u_tolower(UChar32 c)
    u[v[w[c20..10] + c9..4] + c3..0]
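
The bit ranges given for the 21-bit case (20..10, 9..4, 3..0) suggest a three-stage table lookup. The sketch below shows that indexing scheme under this assumption. The tables, their contents, and the names stage1, stage2, stage3, demo_init, and lookup3 are hypothetical; the single stored value (40, the distance from U+10400 to its lowercase form U+10428) is only sample data, not ICU's actual property encoding.

```c
#include <stdint.h>

/* Hypothetical tables; zero-initialized, so every code point
 * maps to value 0 until demo_init() adds one entry. */
static uint16_t stage1[0x440];  /* indexed by bits 20..10 (0x10ffff >> 10 == 0x43f) */
static uint16_t stage2[2 * 64]; /* blocks of 64, indexed by bits 9..4 */
static uint16_t stage3[2 * 16]; /* blocks of 16, indexed by bits 3..0 */

/* Install one sample value for U+10400 in second blocks. */
static void demo_init(void)
{
    stage1[0x10400 >> 10] = 64;                   /* use stage2 block 1 */
    stage2[64 + ((0x10400 >> 4) & 0x3f)] = 16;    /* use stage3 block 1 */
    stage3[16 + (0x10400 & 0xf)] = 40;            /* sample data value  */
}

/* Three-stage lookup: stage3[stage2[stage1[c20..10] + c9..4] + c3..0]. */
static uint16_t lookup3(uint32_t c)
{
    if (c > 0x10ffff) {
        return 0; /* outside the code point range */
    }
    return stage3[stage2[stage1[c >> 10] + ((c >> 4) & 0x3f)] + (c & 0xf)];
}
```

Because unassigned ranges can share the same all-zero blocks, the tables stay small even though the index now spans 21 bits.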

20
Formatting grapheme strings
  • Old: void setDecimalSymbol(UChar c)
  • New: void setDecimalSymbol(const UnicodeString
    &s)

21
Codepage conversion
  • To Unicode: results are one or two UTF-16 code
    units; surrogates stored directly in the
    conversion table
  • From Unicode: triple-stage compact array access
    from 21-bit code points, like property lookup
  • Single-character-conversion to Unicode now
    returns UChar32 values

22
API first
  • Tools and basic functions and classes are in
    place (property lookup, conversion, iterators,
    BiDi)
  • Public APIs reviewed and changed ("luxury" of
    early project stage) or deprecated and superseded
    by new versions
  • Higher-level implementations to follow before
    Unicode 3.1 published

23
More implementations follow
  • Collation: need to prepare for >64k primary keys
  • Normalization and Transliteration
  • Word/Sentence break iteration
  • Etc.
  • No non-BMP data before Unicode 3.1 is stable

24
Other libraries
  • Java: planning stage for transition
  • Win32: rendering and UniScribe API largely
    UTF-16-ready
  • Linux: standardizing on 32-bit Unicode wchar_t,
    has UTF-8 locales like other Unixes for char
    APIs
  • W3C: standards assume full UTF-16 range

25
Summary
  • Transition from UCS-2 to UTF-16 gains importance
    after four years of standard
  • APIs for single characters need change or new
    versions
  • String APIs: no change
  • Implementations need to handle 21-bit code points
  • Range of options

26
Resources
  • Unicode FAQ: http://www.unicode.org/unicode/faq/
  • Unicode on IBM developerWorks:
    http://www.ibm.com/developer/unicode/
  • ICU: http://oss.software.ibm.com/icu/