Supplementary Character Support in Microsoft Products - PowerPoint PPT Presentation

1
Supplementary Character Support in Microsoft
Products
  • Michael S. Kaplan
  • Software Design Engineer
  • Microsoft

2
What are supplementary characters?
  • "a coded character representation for a single
    abstract character that consists of a sequence of
    two code units, where the first unit of the pair
    is a high surrogate and the second is a low
    surrogate"

3
High/low surrogate?
  • High: U+D800 - U+DBFF
  • Low: U+DC00 - U+DFFF
  • Terminology
  • "surrogate pair" preferred over "surrogate
    character"
  • See http://www.trigeminal.com/16to32AndBack.asp
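A minimal sketch of the two ranges as range checks on a single 16-bit code unit. The function names are illustrative, not taken from any Microsoft API.

```python
# Surrogate ranges as listed on the slide.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return LOW_START <= unit <= LOW_END


def is_surrogate_pair(first: int, second: int) -> bool:
    """A well-formed pair is a high surrogate followed by a low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

The order matters: a low surrogate followed by a high surrogate is not a pair.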

4
Conversion example 1
  • Example 1
  • The first character in the surrogate range (D800,
    DC00) as UTF-32
  • 1. D800 = binary 1101100000000000 (lower ten
    bits: 0000000000)
  • 2. DC00 = binary 1101110000000000 (lower ten
    bits: 0000000000)
  • 3. Concatenate: 00000000000000000000 = 0x00000
  • 4. Add 0x10000
  • Result: U+10000. This makes sense, since the
    first character in the surrogate range follows
    immediately after the last character in the
    16-bit Unicode range (U+FFFF)
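The four steps of Example 1 can be sketched as a small function. This is an illustration of the arithmetic only; the function name is made up for this example.

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one UTF-32 code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    high_bits = high & 0x3FF   # steps 1 and 2: keep the lower ten bits
    low_bits = low & 0x3FF
    combined = (high_bits << 10) | low_bits   # step 3: concatenate to 20 bits
    return combined + 0x10000                 # step 4: add 0x10000


print(hex(surrogate_pair_to_code_point(0xD800, 0xDC00)))  # 0x10000
```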

5
Conversion example 2
  • Example 2
  • You have a Unicode character such as U+2040A (a
    CJK character in Plane 2) and wish to encode it
    in UTF-16
  • 1. Subtract 0x10000. Result: 0x1040A
  • 2. Split into two ten-bit pieces: 0001000001
    0000001010
  • 3. Add 1101100000000000 (D800) to the high ten-bit
    piece (0001000001). Result:
    1101100001000001 (D841)
  • 4. Add 1101110000000000 (DC00) to the low ten-bit
    piece (0000001010). Result: 1101110000001010
    (DC0A)
  • Your surrogate pair: D841, DC0A
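The reverse direction from Example 2, again as a hedged sketch with an illustrative function name. The last line cross-checks the result against Python's built-in UTF-16 codec.

```python
def code_point_to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # step 1: subtract 0x10000
    high_bits = offset >> 10             # step 2: upper ten bits
    low_bits = offset & 0x3FF            #         lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits   # steps 3 and 4


high, low = code_point_to_surrogate_pair(0x2040A)
print(hex(high), hex(low))  # 0xd841 0xdc0a

# Cross-check against the standard library's UTF-16 encoder.
assert chr(0x2040A).encode("utf-16-be") == bytes([0xD8, 0x41, 0xDC, 0x0A])
```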

6
UTF-8 conversions
  • Illegal conversion: six-byte UTF-8 (the two
    surrogate code units of UTF-16, each converted
    separately)
  • Legal conversion: four-byte UTF-8 (one UTF-32
    code point)
  • CESU-8 is the inverse of the above: it uses the
    six-byte form instead of the four-byte form
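As an illustration using Python's standard codecs (not any Microsoft component), the sketch below builds both forms for U+2040A and shows that a strict UTF-8 decoder accepts only the four-byte one. The `surrogatepass` error handler is used here solely to produce the six-byte form for comparison; `is_valid_utf8` is a helper name invented for this example.

```python
ch = "\U0002040A"

# Legal: one code point encoded as four bytes.
four_byte = ch.encode("utf-8")

# Illegal as UTF-8: each surrogate (D841, DC0A) encoded separately as
# three bytes, giving six bytes in total. This is the CESU-8 style form.
six_byte = ("\ud841".encode("utf-8", "surrogatepass")
            + "\udc0a".encode("utf-8", "surrogatepass"))


def is_valid_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(four_byte.hex(), is_valid_utf8(four_byte))  # f0a0908a True
print(six_byte.hex(), is_valid_utf8(six_byte))    # eda181edb08a False
```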

7
UTF-8 example
  • Unicode surrogate pair:
  • aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
  • becomes incorrect UTF-8, total 6 bytes:
  • 1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy
    10xxxxxx
  • Instead, you should take a Unicode surrogate
    pair:
  • 110110wwwwzzzzyy, 110111yyyyxxxxxx
  • and convert it to UTF-8 totaling 4 bytes (below,
    uuuuu is defined as wwww + 1):
  • 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
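The bit layout above can be followed literally. The sketch below pulls the wwww, zzzz, yy, yyyy and xxxxxx fields out of the two code units and assembles the four bytes, with uuuuu = wwww + 1. The function name is illustrative; the final line compares the result with Python's own encoder.

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Encode a surrogate pair directly as four-byte UTF-8."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    # high = 110110 wwww zzzz yy
    wwww = (high >> 6) & 0xF
    zzzz = (high >> 2) & 0xF
    yy_top = high & 0x3
    # low = 110111 yyyy xxxxxx
    yyyy_bottom = (low >> 6) & 0xF
    xxxxxx = low & 0x3F

    uuuuu = wwww + 1                        # five-bit plane field
    yyyyyy = (yy_top << 4) | yyyy_bottom    # rejoin the split y bits

    return bytes([
        0xF0 | (uuuuu >> 2),                # 11110uuu
        0x80 | ((uuuuu & 0x3) << 4) | zzzz, # 10uuzzzz
        0x80 | yyyyyy,                      # 10yyyyyy
        0x80 | xxxxxx,                      # 10xxxxxx
    ])


assert surrogate_pair_to_utf8(0xD841, 0xDC0A) == "\U0002040A".encode("utf-8")
```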

8
Encoding choices for MS
  • UTF-16, mostly
  • Occasionally UTF-8
  • Even more occasionally, UTF-32
  • REASONS
  • There was obviously an existing, well-tested set
    of APIs that support UCS-2, which is a subset of
    UTF-16.
  • A completely new API set was not required.
  • A move to UTF-32 would require twice as much
    space for all characters.
  • A move to UTF-8 would require even more than
    twice as much space in many cases.

9
The products...
  • Mostly the new generation of products
  • Windows 2000/XP
  • Office XP (some support in Office 2000)
  • Visual Studio.NET
  • Most (all) of these products supported Unicode
    already
  • a little bit of extra work needed for
    supplementary characters
  • usually just UTF-8 changes were needed

10
Windows 2000
  • Uniscribe support for rendering
  • Each surrogate pair is a single grapheme
  • APIs like CharPrev/CharNext not changed
  • No specific surrogate font/IME
  • Must be turned on
  • http://msdn.microsoft.com/library/en-us/intl/unicode_192r.asp
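Treating each surrogate pair as a single unit means cursor movement has to step over both code units at once. The sketch below shows one way to do that over a list of UTF-16 code units. It is an illustration only: it does not reproduce CharNext, CharPrev, or Uniscribe, and all function names are hypothetical.

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def _is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def _is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def next_boundary(units: list[int], index: int) -> int:
    """Index after moving one character forward; a pair counts as one."""
    if index >= len(units):
        return len(units)
    if (_is_high(units[index]) and index + 1 < len(units)
            and _is_low(units[index + 1])):
        return index + 2
    return index + 1


def prev_boundary(units: list[int], index: int) -> int:
    """Index after moving one character backward; a pair counts as one."""
    if index <= 0:
        return 0
    if index >= 2 and _is_low(units[index - 1]) and _is_high(units[index - 2]):
        return index - 2
    return index - 1


units = utf16_units("a\U0002040Ab")   # [0x61, 0xD841, 0xDC0A, 0x62]
print(next_boundary(units, 1))        # 3: skips both halves of the pair
print(prev_boundary(units, 3))        # 1
```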

11
Windows XP
  • Everything from Windows 2000
  • Turned on by default!
  • GDI support for rendering
  • Font CMAP extensions
  • Lots of UTF-8 issues fixed
  • No specific surrogate font/IME (yet)
  • Extensions to fallback fonts limited
  • HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane1
  • HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane2
  • HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane3
  • (etc.)

12
Other system components
  • MLang
  • Internet Explorer: http://i18nWithVB.com/surrogate_ime/
  • IIS 5.0/6.0

13
The downlevel story
  • No good support for Unicode, let alone
    supplementary characters
  • Uniscribe/RichEdit does improve the downlevel
    story for display purposes
  • Officially, no support on Win9x

14
The Office suite
  • Word
  • FrontPage
  • Excel/Access
  • Outlook
  • RichEdit 4.0

15
Office - Specific Features
  • Insertion/Deletion of text - All
  • Cursor movement - All
  • Font linking/fallback - All (Word's is best)
  • UTF-8 issues fixed - All
  • Enhanced word breaking - All (Word/RichEdit)
  • Vertical text - Word/PowerPoint/Publisher/RichEdit
  • Direct entry (Alt+nnnnnn, hhhhh Alt+X) -
    Word/RichEdit

16
CHS/CHT/CHP Office
  • The product and the langpacks support an extended
    Unicode IME that handles supplementary characters
  • An Extension B font is also included

17
Visual Studio.NET
  • String class and globalization namespace
  • StringInfo
  • GetTextElementEnumerator
  • Handles supplementary characters
  • Also handles composite characters
  • GDI+
  • IDE support

18
SQL Server
  • Past - no support (for Unicode, even!)
  • Present - surrogate "safe" (neutral)
  • Future - surrogate aware

19
Items not currently supported
  • Character Map
  • Graph 10
  • Outlook 10 mail headers
  • Fonts/IMEs
  • Collations for supplementary characters

20
Collation plan for supplementary characters in
the UCA?
  • All Plane-1 (non-ideographic) characters sort
    after all the other non-ideographic scripts but
    before the ideographs.
  • All Plane 2 (ideographic) characters will be
    sorted after all the ideographs on the BMP.
  • All Plane 3-14 (currently not assigned) will be
    treated like any other unassigned characters.
  • Plane 14 language tags will be treated as if they
    were unassigned.
  • All characters encoded in Plane 15-16 (private
    use) will be sorted after all other characters.

21
Questions?

22
  • Supplementary Character Support in Microsoft
    Products

Don't forget to fill out your evals!