Title: Unicode Normalization


1
Unicode Normalization
  • Mark Davis
  • www.macchiato.com

2
Normalization
  • Uniqueness: two equivalent strings have precisely
    the same normalized form
  • Fast binary comparison, accurate digital
    signatures
  • Recommended for XML, JavaScript and other
    standards

3
Canonical Equivalence
  • Fundamental equivalence
  • Indistinguishable to users, when correctly
    rendered
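  • Includes:
  • Combining sequences
  • Hangul
  • Singletons

[Figure: canonical equivalence examples, e.g. Ç ≡ C + combining cedilla, and the ohm sign Ω ≡ Greek capital omega Ω]
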
4
Compatibility Equivalence
  • Formatting differences
  • Font variants (e.g. ℌ)
  • Breaking differences (e.g. the non-breaking hyphen)
  • Cursive forms
  • Circled
  • Width, size, rotated
  • Super/subscripts (e.g. ⁹, ₉)
  • Squared characters
  • Fractions
  • Others

[Figure: compatibility decomposition examples, including ㎏ → k g and ﬁ → f i]

5
UTR #15: Unicode Normalization Forms
  • Form D: Canonical Decomposition
  • Form KD: Compatibility Decomposition
  • Form C: Form D + Canonical Composition
  • Form KC: Form KD + Canonical Composition

6
Normalization Requirement
  • Uniqueness: two equivalent strings will have
    precisely the same normalized form
  • If two strings x and y are canonical equivalents,
    then
  • C(x) = C(y)
  • D(x) = D(y)
  • If two strings are compatibility equivalents,
    then
  • KC(x) = KC(y)
  • KD(x) = KD(y)
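
The uniqueness requirement can be checked with Python's standard `unicodedata` module, whose form names NFC, NFD, NFKC and NFKD correspond to C, D, KC and KD. This is a minimal sketch; the sample strings are illustrative choices, not taken from the slides.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Two canonically equivalent spellings: precomposed vs. A + combining ring.
x = "\u00C5"
y = "A\u030A"

assert x != y  # different code point sequences...
for form in FORMS:
    # ...but identical once both are normalized to the same form.
    assert unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

# Compatibility equivalents are only guaranteed to agree under KC and KD.
ligature, plain = "\uFB01", "fi"  # LATIN SMALL LIGATURE FI vs. "f" + "i"
same = {
    form: unicodedata.normalize(form, ligature) == unicodedata.normalize(form, plain)
    for form in FORMS
}
```

Here `same` records which forms make the ligature and the plain letters compare equal.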

7
Affected Characters
  • None of the forms affect text with only ASCII
    characters (U+0000 to U+007F)
  • None of the forms generate compatibility characters
    that were not in the source text.
  • Both KD and KC replace compatibility characters.
  • Both D and C maintain compatibility characters.
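
A short sketch of these properties using Python's `unicodedata`. The circled digit is an assumed example of a compatibility character, chosen for illustration.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Every ASCII character, U+0000 to U+007F, in one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))
ascii_unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text for form in FORMS
)

# CIRCLED DIGIT ONE is a compatibility character.
circled = "\u2460"
result = {form: unicodedata.normalize(form, circled) for form in FORMS}
```

D and C leave the circled digit alone, while KD and KC replace it with a plain "1".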

8
Cautions: Decomposition
  • Requires decomposition mappings from the Unicode
    Character Database
  • Those decomposition mappings must be applied
    recursively
  • The string must be put into canonical order
  • Either Canonical or Compatibility
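
The three steps above can be sketched in Python on top of the raw mappings exposed by `unicodedata.decomposition`. This is an illustration under assumptions, not a production implementation: the helper names are hypothetical, and Hangul syllables (which are decomposed arithmetically rather than through the mapping table) are not handled.

```python
import unicodedata

def decompose_char(ch, compat=False):
    """Apply the Character Database mapping for ch recursively."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    parts = mapping.split()
    if parts[0].startswith("<"):      # tagged mapping: compatibility only
        if not compat:
            return ch
        parts = parts[1:]
    return "".join(decompose_char(chr(int(p, 16)), compat) for p in parts)

def canonical_order(s):
    """Stable-sort each run of non-starters by canonical combining class."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def decompose(s, compat=False):
    """Form D (compat=False) or Form KD (compat=True), excluding Hangul."""
    return canonical_order("".join(decompose_char(ch, compat) for ch in s))
```

For example, `decompose("\u212B")` follows two mappings in a row (ANGSTROM SIGN to Å, then Å to A + ring), and `decompose("q\u0307\u0323")` swaps the two marks into canonical order.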

9
Cautions: Composition
  • Decomposition required first!
  • Then canonical composition
  • Composition data fixed at Unicode 3.0.0
  • Some characters are excluded from composition
  • Form C and Form KC can still have combining
    characters!
  • Required for Indic, Arabic, Hebrew, etc.
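
A small sketch showing that Form C output can still contain combining characters. Both sample strings are assumptions chosen for illustration.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

# There is no precomposed "q with dot below", so the mark survives in Form C.
q_dot = nfc("q\u0323")

# U+FB31 HEBREW LETTER BET WITH DAGESH is excluded from composition:
# Form C decomposes it and does not put it back together.
bet = nfc("\uFB31")

has_combining = any(unicodedata.combining(ch) != 0 for ch in q_dot + bet)
```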

10
Caution: Both C & D
  • None of the normalization forms is closed under
    string concatenation. Example:
  • NFC/D: "a??" + "?"
  • Not normalized: "a???"
  • NFC: "à??"
  • NFD: "a???"
  • Exceptions are easy to test for
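
The effect can be reproduced with Python's `unicodedata`. The strings below are assumed examples, not the ones from the slide, and the helper names are hypothetical. The final helper is a rough pre-check for the combining-mark case only; it does not cover every way two Form C strings can interact at the boundary.

```python
import unicodedata

def is_normalized(form, s):
    return unicodedata.normalize(form, s) == s

def safe_to_concatenate(form, left, right):
    """True if left + right is still normalized in the given form."""
    return is_normalized(form, left + right)

def starts_with_combining_mark(s):
    """Cheap warning sign: the right-hand string begins with a non-starter."""
    return bool(s) and unicodedata.combining(s[0]) != 0

# Form C: "a" and a lone combining grave are each normalized,
# but together they compose to a single precomposed letter.
nfc_left, nfc_right = "a", "\u0300"

# Form D: acute (class 230) followed by dot below (class 220)
# is out of canonical order once the strings are joined.
nfd_left, nfd_right = "a\u0301", "\u0323"
```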

11
Composition Process
  1. Decompose (D or KD)
  2. Combine unblocked characters with the previous
    starter, if possible
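
The two steps can be sketched in Python as follows. Several shortcuts are assumptions of this sketch rather than part of the slides: the pair table is derived from `unicodedata` itself (a character counts as composable if it has a two-part canonical mapping and survives NFC), the library's NFD is used for step 1, Hangul is not handled, and all function names are hypothetical.

```python
import sys
import unicodedata

def build_pairs():
    """Map (first, second) to its composite, using the library's own data."""
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue                  # no mapping, or compatibility only
        parts = mapping.split()
        if len(parts) != 2:
            continue                  # singletons never recompose
        if unicodedata.normalize("NFC", ch) != ch:
            continue                  # excluded from composition
        pairs[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch
    return pairs

def compose(decomposed, pairs):
    """Step 2: combine unblocked characters with the previous starter."""
    out = []
    starter_idx = None   # position in `out` of the most recent starter
    last_ccc = 0         # class of the last character kept after it
    for ch in decomposed:
        ccc = unicodedata.combining(ch)
        if starter_idx is not None:
            adjacent = starter_idx == len(out) - 1
            unblocked = adjacent or 0 < last_ccc < ccc
            composite = pairs.get((out[starter_idx], ch))
            if unblocked and composite is not None:
                out[starter_idx] = composite
                continue
        if ccc == 0:
            starter_idx = len(out)
        last_ccc = ccc
        out.append(ch)
    return "".join(out)

PAIRS = build_pairs()

def to_form_c(s):
    return compose(unicodedata.normalize("NFD", s), PAIRS)
```

In `to_form_c("a\u0327\u0300")` the grave accent still combines with "a" because the cedilla between them has a lower combining class, whereas in `to_form_c("a\u0305\u0300")` the overline has the same class as the grave and blocks it.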

12
Composition Exclusions
  • Script specifics
  • Futures
  • Singletons, e.g. Ω (ohm sign) → Ω (Greek capital omega)
  • Non-starter sequences
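
A quick way to observe exclusions is to check which characters do not survive Form C. The characters below are assumed examples for three of the categories, picked for illustration and not taken from the slide.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

examples = {
    "script specific": "\u0958",   # DEVANAGARI LETTER QA
    "singleton": "\u2126",         # OHM SIGN
    "non-starter": "\u0344",       # COMBINING GREEK DIALYTIKA TONOS
}

# Each excluded character is replaced by its decomposition in Form C.
form_c = {label: nfc(ch) for label, ch in examples.items()}

# For contrast, an ordinary precomposed letter is kept as is.
kept = nfc("\u00C5")
```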

13
Legacy Encoding
  • Legacy text is normalized if it maps 1:1 to
    normalized Unicode text
  • Legacy sets
  • Prenormalized, e.g. ISO 8859-1
  • Normalizable, e.g. ISO 2022 (ISO 5426/ISO
    8859-1/)
  • Unnormalizable, e.g. ISO 5426
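
As a small check of the "prenormalized" case, the sketch below decodes every ISO 8859-1 byte and tests the result. Reading "normalized" as Form C here is an assumption of this sketch.

```python
import unicodedata

# All 256 byte values, decoded with Python's built-in ISO 8859-1 codec.
latin1_text = bytes(range(256)).decode("iso-8859-1")

# ISO 8859-1 has no combining marks, so any decoded text is already in Form C.
already_form_c = unicodedata.normalize("NFC", latin1_text) == latin1_text

# The same text is not in Form D, because accented letters decompose.
already_form_d = unicodedata.normalize("NFD", latin1_text) == latin1_text
```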

14
Programming Identifiers
  • Closed under all Normalization Forms, if minor
    changes are incorporated
  • Modified syntax
  • identifier ::= start ( start | extend )*
  • start ::= [Lu Ll Lt Lm Lo Nl] - irregulars
    combining_like
  • extend ::= [Mn Mc Nd Pc Cf] - irregulars
    combining_like mid_dot
  • (Almost) closed under Case Mappings
  • see SpecialCasing.txt
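
A simplified sketch of the identifier syntax using general categories from `unicodedata`. It deliberately ignores the `irregulars`, `combining_like` and `mid_dot` adjustments, whose exact contents are not given on the slide, so it only illustrates the basic shape of the rule. The function name is hypothetical.

```python
import unicodedata

START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}

def is_identifier(s):
    """identifier ::= start ( start | extend )*, by general category only."""
    if not s or unicodedata.category(s[0]) not in START:
        return False
    return all(unicodedata.category(ch) in START | EXTEND for ch in s[1:])

name = "caf\u00E9_1"
# The identifier stays an identifier in each normalization form.
closed = all(
    is_identifier(unicodedata.normalize(form, name))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
```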

15
Resources
  • Reference version on Unicode Site
  • Production Version
  • http://oss.software.ibm.com/icu
  • ICU C/C++ and Java versions
  • Open Source, with IBM Public License
  • Free commercial use and distribution: not viral!
  • Panel: later today
  • Other companies also providing: ask!

16
Normalization
  • Uniqueness: two equivalent strings have precisely
    the same normalized form
  • Fast binary comparison, accurate digital
    signatures
  • Recommended for XML, JavaScript and other
    standards

17
Q & A

18
Backup Slides

19
Definition: Starter
  • S is a starter
  • Canonical class of zero in the Unicode Character
    Database
  • Can start a composition
  • Examples
  • Starters: spacing marks, some non-spacing marks
  • Non-starters: most non-spacing marks
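
In Python the canonical class is available through `unicodedata.combining`, so the definition can be sketched directly. The sample characters are assumed examples, not the ones shown on the slide.

```python
import unicodedata

def is_starter(ch):
    """A starter has canonical combining class zero."""
    return unicodedata.combining(ch) == 0

samples = [
    "a",        # LATIN SMALL LETTER A
    "\u0903",   # DEVANAGARI SIGN VISARGA, a spacing mark (Mc)
    "\u0941",   # DEVANAGARI VOWEL SIGN U, non-spacing (Mn) but class 0
    "\u0300",   # COMBINING GRAVE ACCENT, non-spacing, class 230
    "\u093C",   # DEVANAGARI SIGN NUKTA, non-spacing, class 7
]
starters = [is_starter(ch) for ch in samples]
```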

20
Definition: Blocked
  • C is blocked from S
  • There is some character B between S and C, and
    either
  • B is a starter or
  • B has the same canonical class as C
  • Examples
  • ABC: B blocks C from A
  • A??: ? blocks ? from A
  • A???: ?? doesn't block ? from A
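
The definition above translates almost literally into code. In this sketch the function name and the accented sample strings are assumptions chosen for illustration.

```python
import unicodedata

def is_blocked(s, starter_index, c_index):
    """True if s[c_index] is blocked from the starter at s[starter_index]."""
    ccc_c = unicodedata.combining(s[c_index])
    for b in s[starter_index + 1:c_index]:
        ccc_b = unicodedata.combining(b)
        # B blocks C if B is a starter or has the same canonical class as C.
        if ccc_b == 0 or ccc_b == ccc_c:
            return True
    return False

plain = is_blocked("ABC", 0, 2)                  # B is a starter
same_class = is_blocked("a\u0305\u0300", 0, 2)   # overline and grave: both 230
lower_class = is_blocked("a\u0327\u0300", 0, 2)  # cedilla is 202, grave is 230
```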

21
Testing Conformance: Canonical
For all Unicode characters X:
  • C(X) = C(D(X)); D(X) and C(X) are in canonical order
  • No CDM: X = D(X) and X = C(X)
  • CDM: X ≠ D(X), and no characters in D(X) have a CDM
  • CDM, X ∈ Exclusions: X ≠ C(D(X))
  • CDM, X ∉ Exclusions: X = C(D(X))

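
A per-character conformance sweep along these lines can be sketched against Python's `unicodedata`. Reading CDM as "canonical decomposition mapping" is an assumption here, as are the helper names. Hangul syllables are skipped because their decomposition is arithmetic and not listed in the mapping table, and surrogates are skipped as non-characters.

```python
import sys
import unicodedata

def has_cdm(ch):
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")

def in_canonical_order(s):
    classes = [unicodedata.combining(ch) for ch in s]
    # Out of order means a higher class directly before a lower non-zero one.
    return all(not (a > b > 0) for a, b in zip(classes, classes[1:]))

def check_all():
    failures, not_recomposed = [], 0
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF or 0xAC00 <= cp <= 0xD7A3:
            continue
        x = chr(cp)
        d = unicodedata.normalize("NFD", x)
        c = unicodedata.normalize("NFC", x)
        ok = c == unicodedata.normalize("NFC", d)
        ok = ok and in_canonical_order(d) and in_canonical_order(c)
        if has_cdm(x):
            ok = ok and x != d and not any(has_cdm(ch) for ch in d)
            if c != x:
                not_recomposed += 1   # X behaves as an excluded character
        else:
            ok = ok and x == d == c
        if not ok:
            failures.append(cp)
    return failures, not_recomposed

failures, not_recomposed = check_all()
```

An empty `failures` list means every checked character satisfied the conditions; `not_recomposed` counts characters with a mapping that Form C does not restore.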
22
Unicode Normalization
  • Introduction
  • Normalization forms
  • Design goals
  • Specification
  • Excluded characters
  • Versions
  • Legacy encodings
  • Applications

23
Characters and Encoding Forms
Abstract   Encoded   Serialized: UTF-16BE   Serialized: UTF-8
Å          C5        00 C5                  C3 85
Å          212B      21 2B                  E2 84 AB
           F0000     DB 80 DC 00            F3 B0 80 80
å          61 30A    00 61 03 0A            61 CC 8A
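
The byte sequences in this table can be reproduced with Python's built-in codecs. A minimal sketch; the helper name is hypothetical.

```python
def hex_bytes(text, encoding):
    """Serialize text and show the bytes as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

samples = ["\u00C5", "\u212B", "\U000F0000", "a\u030A"]

table = [
    (
        " ".join(f"{ord(ch):X}" for ch in s),   # encoded: code points
        hex_bytes(s, "utf-16-be"),              # serialized: UTF-16BE
        hex_bytes(s, "utf-8"),                  # serialized: UTF-8
    )
    for s in samples
]
```

Each tuple in `table` holds the code points and the two serialized forms for one sample string.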