Title: Unicode Normalization
1Unicode Normalization
- Mark Davis
- www.macchiato.com
2Normalization
- Uniqueness
- two equivalent strings have precisely the same normalized form
- Fast binary comparison, accurate digital signatures
- Recommended for XML, JavaScript and other standards
3Canonical Equivalence
- Fundamental equivalence
- Indistinguishable to users, when correctly rendered
- Includes
- Combining sequences
- Hangul
- Singletons
[Figure: examples of canonically equivalent characters, such as Ç; most glyphs lost in extraction]
4Compatibility Equivalence
- Formatting differences
- Font variants
- Breaking differences
- Cursive forms
- Circled
- Width, size, rotated
- Super/subscripts
- Squared characters
- Fractions
- Others
[Figure: compatibility equivalence examples; surviving fragments include "kg" and "fi"]
5UTR 15: Unicode Normalization Forms
- Form D: Canonical Decomposition
- Form KD: Compatibility Decomposition
- Form C: Form D + Canonical Composition
- Form KC: Form KD + Canonical Composition
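As a rough illustration of the four forms, the sketch below uses Python's standard `unicodedata` module. The sample string and the helper name `forms` are chosen here for illustration and do not come from the slides.

```python
import unicodedata

def forms(text):
    """Return the four normalization forms of text, keyed by form name."""
    return {name: unicodedata.normalize(name, text)
            for name in ("NFD", "NFKD", "NFC", "NFKC")}

# Sample: U+00C5 (A with ring above) followed by U+FB01 (the "fi" ligature).
sample = "\u00C5\uFB01"
result = forms(sample)

for name, value in result.items():
    # Show each form as a list of code points.
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

Only the K forms touch the ligature; only the D forms split the precomposed letter.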
6Normalization Requirement
- Uniqueness: two equivalent strings will have precisely the same normalized form
- If two strings x and y are canonical equivalents, then
- C(x) = C(y)
- D(x) = D(y)
- If two strings are compatibility equivalents, then
- KC(x) = KC(y)
- KD(x) = KD(y)
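The requirement can be turned into an equivalence test: two strings are equivalent exactly when their normalized forms match. The following is a minimal sketch using Python's `unicodedata`; the function names and the sample pairs are assumptions made for this example.

```python
import unicodedata

def canonically_equivalent(x, y):
    """True if x and y share the same Form D (and hence the same Form C)."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

def compatibility_equivalent(x, y):
    """True if x and y share the same Form KD (and hence the same Form KC)."""
    return unicodedata.normalize("NFKD", x) == unicodedata.normalize("NFKD", y)

canonical_pairs = [
    ("\u00C7", "C\u0327"),       # combining sequence: C with cedilla
    ("\uAC00", "\u1100\u1161"),  # Hangul syllable vs. its two jamo
    ("\u2126", "\u03A9"),        # singleton: OHM SIGN vs. Greek capital omega
]

for x, y in canonical_pairs:
    # Equivalent strings must agree in both C and D.
    assert canonically_equivalent(x, y)
    assert unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)

# The "fi" ligature is only a compatibility equivalent of "fi".
ligature_is_canonical = canonically_equivalent("\uFB01", "fi")
ligature_is_compat = compatibility_equivalent("\uFB01", "fi")
```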
7Affected Characters
- None of the forms affect text with only ASCII characters (U+0000 to U+007F)
- None of the forms generate compatibility characters that were not in the source text.
- Both KD and KC replace compatibility characters.
- Both D and C maintain compatibility characters.
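These properties are easy to spot-check. The sketch below uses Python's `unicodedata`; the "fi" ligature (U+FB01) is picked here as a representative compatibility character, and the variable names are illustrative only.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# All 128 ASCII characters in one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))
ascii_unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text for form in FORMS
)

# A compatibility character: U+FB01 LATIN SMALL LIGATURE FI.
ligature = "\uFB01"
kept_by_c_and_d = all(
    unicodedata.normalize(form, ligature) == ligature for form in ("NFC", "NFD")
)
replaced_by_kc_and_kd = all(
    unicodedata.normalize(form, ligature) == "fi" for form in ("NFKC", "NFKD")
)

print(ascii_unchanged, kept_by_c_and_d, replaced_by_kc_and_kd)
```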
8Cautions: Decomposition
- Requires decomposition mappings from the Unicode Character Database
- Those decomposition mappings must be applied recursively
- The string must be put into canonical order
- Either Canonical or Compatibility
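The three steps above can be sketched with the raw mappings that Python's `unicodedata` exposes. This is an illustrative sketch, not a production implementation: the function names are invented here, and Hangul syllables are not handled because their decomposition is algorithmic rather than stored as a mapping.

```python
import unicodedata

def decompose(ch, compat=False):
    """Recursively apply the decomposition mapping of a single character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return [ch]
    parts = mapping.split()
    if parts[0].startswith("<"):
        # A tag such as <compat> marks a compatibility mapping.
        if not compat:
            return [ch]
        parts = parts[1:]
    result = []
    for part in parts:
        # The mapped characters may themselves decompose further.
        result.extend(decompose(chr(int(part, 16)), compat))
    return result

def canonical_order(chars):
    """Stable-sort each run of non-starters by canonical combining class."""
    out = list(chars)
    i = 0
    while i < len(out):
        if unicodedata.combining(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and unicodedata.combining(out[j]) != 0:
            j += 1
        out[i:j] = sorted(out[i:j], key=unicodedata.combining)
        i = j
    return out

def decompose_string(text, compat=False):
    """Sketch of Form D (compat=False) or Form KD (compat=True)."""
    chars = []
    for ch in text:
        chars.extend(decompose(ch, compat))
    return "".join(canonical_order(chars))
```

For example, U+1E69 needs two rounds of mapping before it reaches `s`, dot below, dot above, and `"a\u0301\u0323"` needs reordering because the dot below has the lower combining class.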
9Cautions: Composition
- Decomposition required first!
- Then canonical composition
- Composition data fixed at Unicode 3.0.0
- Some characters are excluded from composition
- Form C and Form KC can still have combining characters!
- Required for Indic, Arabic, Hebrew, etc.
10Caution: Both C & D
- None of the normalization forms is closed under string concatenation.
- Example
- NFC/D: "a??" + "?"
- Not Norm. "a???"
- NFC "à??"
- NFD "a???"
- Exceptions easy to test for
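Because the marks in the slide's example were lost, here is a separate small illustration with characters chosen for this sketch. It uses Python's `unicodedata`; the helper name `concat_breaks` is hypothetical.

```python
import unicodedata

def is_normalized(form, text):
    """True if text is already in the given normalization form."""
    return unicodedata.normalize(form, text) == text

def concat_breaks(form, left, right):
    """True if both parts are normalized but their concatenation is not."""
    parts_ok = is_normalized(form, left) and is_normalized(form, right)
    return parts_ok and not is_normalized(form, left + right)

# NFC: "a" and a lone combining grave are each normalized,
# but together they compose to U+00E0.
nfc_case = concat_breaks("NFC", "a", "\u0300")

# NFD: the acute (class 230) must move after the dot below (class 220).
nfd_case = concat_breaks("NFD", "a\u0301", "\u0323")

# Plain letters concatenate safely.
safe_case = concat_breaks("NFC", "a", "b")
```

A simple way to stay safe under these assumptions is to renormalize the joined string, for example `unicodedata.normalize("NFC", left + right)`.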
11Composition Process
- Decompose (D or KD)
- Combine unblocked characters with the previous starter, if possible
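The combining step can be sketched as a single left-to-right pass over already decomposed text. The code below is an illustration under stated assumptions: the pair table is derived from Python's `unicodedata` by keeping only characters that survive NFC (a shortcut to skip excluded characters, not how a real implementation gets its data), Hangul composition is omitted, and all names are invented for this example.

```python
import unicodedata

def build_pairs():
    """Map (first, second) -> composite for characters that may be composed."""
    pairs = {}
    for cp in range(0x110000):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue  # no canonical mapping
        parts = mapping.split()
        # Keep two-character mappings whose composite is stable under NFC;
        # this drops singletons and excluded characters.
        if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
            pairs[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch
    return pairs

PAIRS = build_pairs()

def compose(decomposed):
    """Combine unblocked characters with the previous starter, if possible."""
    if not decomposed:
        return ""
    out = [decomposed[0]]
    starter_pos = 0
    last_class = unicodedata.combining(decomposed[0])
    if last_class != 0:
        last_class = 256  # text begins with a non-starter: nothing combines
    for ch in decomposed[1:]:
        ch_class = unicodedata.combining(ch)
        composite = PAIRS.get((out[starter_pos], ch))
        # Unblocked: ch directly follows the starter, or every character
        # kept since the starter has a lower combining class.
        if composite is not None and (last_class == 0 or last_class < ch_class):
            out[starter_pos] = composite
        else:
            if ch_class == 0:
                starter_pos = len(out)
            last_class = ch_class
            out.append(ch)
    return "".join(out)

def to_form_c(text):
    """Decompose with the library's NFD, then apply the sketch above."""
    return compose(unicodedata.normalize("NFD", text))
```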
12Composition Exclusions
- Script Specifics
- Futures
- Singletons
- Non-starter sequences
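One way to observe exclusions is to check which characters fail to survive Form C. The sketch below uses Python's `unicodedata`; the sample characters are picked here as likely representatives of three of the categories and are not taken from the slide, whose glyphs were lost.

```python
import unicodedata

def survives_form_c(ch):
    """True if the character is unchanged by Form C."""
    return unicodedata.normalize("NFC", ch) == ch

samples = {
    "script specific: U+0958 DEVANAGARI LETTER QA": "\u0958",
    "singleton: U+2126 OHM SIGN": "\u2126",
    "non-starter sequence: U+0344": "\u0344",
    "ordinary composite: U+00C5": "\u00C5",
}

for label, ch in samples.items():
    composed = unicodedata.normalize("NFC", ch)
    print(label, survives_form_c(ch), [f"U+{ord(c):04X}" for c in composed])
```

Only the last sample is kept as a single precomposed character; the others come out of Form C as their decompositions.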
13Legacy Encoding
- Legacy text is normalized if it maps 1:1 to normalized Unicode text
- Legacy sets
- Prenormalized, e.g. ISO 8859-1
- Normalizable, e.g. ISO 2022 (ISO 5426/ISO 8859-1/...)
- Unnormalizable, e.g. ISO 5426
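The "prenormalized" case can be checked directly for ISO 8859-1. The sketch below assumes the slide means normalized with respect to Form C, which the source does not state explicitly; it uses Python's built-in `latin-1` codec and `unicodedata`.

```python
import unicodedata

# Decode every possible ISO 8859-1 byte to its Unicode character.
latin1_text = bytes(range(256)).decode("latin-1")

# Every character maps 1:1 to Unicode text that is already in Form C...
is_form_c = unicodedata.normalize("NFC", latin1_text) == latin1_text

# ...but not in Form D, since letters such as U+00E9 decompose.
is_form_d = unicodedata.normalize("NFD", latin1_text) == latin1_text

print(is_form_c, is_form_d)
```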
14Programming Identifiers
- Closed under all Normalization Forms, if minor changes incorporated
- Modified syntax
- identifier ::= start ( start | extend )*
- start ::= Lu Ll Lt Lm Lo Nl - irregulars combining_like
- extend ::= Mn Mc Nd Pc Cf - irregulars combining_like mid_dot
- (Almost) closed under Case Mappings
- see SpecialCasing.txt
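A minimal sketch of the base syntax follows, using general categories from Python's `unicodedata`. It deliberately leaves out the modifications (irregulars, combining_like, mid_dot), whose exact membership is not given in the slides, so it illustrates the unmodified rule only. The function name is hypothetical.

```python
import unicodedata

START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}

def is_identifier(text):
    """start ( start | extend )* over general categories, unmodified."""
    if not text:
        return False
    if unicodedata.category(text[0]) not in START:
        return False
    return all(unicodedata.category(ch) in START | EXTEND for ch in text[1:])

# An identifier stays an identifier in all four forms in this example.
name = "r\u00E9sum\u00E9"
stable = all(
    is_identifier(unicodedata.normalize(form, name))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
```

In Form D the accents become category Mn characters, which the `extend` set accepts; that is what keeps this particular name valid after decomposition.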
15Resources
- Reference version on Unicode Site
- Production Version
- http://oss.software.ibm.com/icu
- ICU C/C++ and Java Versions
- Open Source, with IBM Public License
- Free commercial use and distribution: Not Viral!
- Panel: Later today
- Other companies also providing: ask!
16Normalization
- Uniqueness: two equivalent strings have precisely the same normalized form
- Fast binary comparison, accurate digital signatures
- Recommended for XML, JavaScript and other standards
17Q & A
18Backup Slides
19Definition: Starter
- S is a starter
- Canonical class of zero in the Unicode Character Database
- Can start a composition
- Examples
- Starters: spacing marks, some non-spacing
- a, ? T ? ? ??
- Non-starters: most non-spacing marks
- ?, ? ?? ??
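The definition reduces to a one-line test on the canonical combining class. The sketch below uses Python's `unicodedata.combining`; the sample characters are chosen here for illustration, since the slide's own examples were lost.

```python
import unicodedata

def is_starter(ch):
    """A starter has canonical combining class zero."""
    return unicodedata.combining(ch) == 0

samples = {
    "a": "a",                                   # ordinary letter
    "U+093E (spacing mark, Mc)": "\u093E",      # Devanagari vowel sign AA
    "U+0941 (non-spacing, Mn)": "\u0941",       # Devanagari vowel sign U
    "U+0301 (combining acute)": "\u0301",
    "U+094D (Devanagari virama)": "\u094D",
}

for label, ch in samples.items():
    print(label, unicodedata.category(ch), unicodedata.combining(ch), is_starter(ch))
```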
20Definition: Blocked
- C is blocked from S
- There is some character B between S and C, and either
- B is a starter or
- B has the same canonical class as C
- Examples
- ABC: B blocks C from A
- A??: ? blocks ? from A
- A???: ?? doesn't block ? from A
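The definition as worded on this slide can be written as a small predicate. This is a sketch using Python's `unicodedata.combining`; the function name, its index-based signature, and the sample strings are assumptions for illustration.

```python
import unicodedata

def is_blocked(text, s, c):
    """True if text[c] is blocked from the starter at text[s].

    Follows the slide's wording: some character B strictly between S and C
    is a starter, or has the same canonical class as C.
    """
    ccc = unicodedata.combining
    target = ccc(text[c])
    return any(ccc(b) == 0 or ccc(b) == target for b in text[s + 1:c])

# A starter in between blocks: "b" blocks the grave from "a".
starter_blocks = is_blocked("ab\u0300", 0, 2)

# Same class blocks: overline (230) blocks grave (230) from "a".
same_class_blocks = is_blocked("a\u0305\u0300", 0, 2)

# Different class does not block: dot below (220) vs. grave (230).
different_class = is_blocked("a\u0323\u0300", 0, 2)
```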
21Testing Conformance: Canonical
- For all Unicode characters X
- C(X) = C(D(X)); D(X) and C(X) are in canonical order
- If X has no CDM
- X = D(X) and X = C(X)
- If X has a CDM
- X ≠ D(X), and no characters in D(X) have a CDM
- If X is in Exclusions, X ≠ C(D(X))
- If X is not in Exclusions, X = C(D(X))
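Some of these per-character checks can be run against an existing implementation. The sketch below exercises Python's `unicodedata` over a range of code points. It covers only the C(X) = C(D(X)) identity and the X versus D(X) conditions, reads "CDM" as a canonical decomposition mapping (an assumption), and treats Hangul syllables as having one because they decompose algorithmically. All names are illustrative.

```python
import unicodedata

def form_c(text):
    return unicodedata.normalize("NFC", text)

def form_d(text):
    return unicodedata.normalize("NFD", text)

def has_canonical_mapping(ch):
    """True if ch has a canonical (untagged) decomposition."""
    if 0xAC00 <= ord(ch) <= 0xD7A3:
        return True  # Hangul syllables: algorithmic decomposition
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")

def check_canonical_conformance(code_points):
    """Return a list of (code point, message) for every failed check."""
    failures = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        x = chr(cp)
        c, d = form_c(x), form_d(x)
        if c != form_c(d):
            failures.append((cp, "C(X) != C(D(X))"))
        if has_canonical_mapping(x):
            if x == d:
                failures.append((cp, "has a mapping but X == D(X)"))
        elif x != d or x != c:
            failures.append((cp, "no mapping but X changed"))
    return failures

# Check the Basic Multilingual Plane; pass range(0x110000) for everything.
failures = check_canonical_conformance(range(0x10000))
print(len(failures), "failures")
```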
22Unicode Normalization
- Introduction
- Normalization forms
- Design goals
- Specification
- Excluded characters
- Versions
- Legacy encodings
- Applications
23Characters and Encoding Forms
Abstract  Encoded  Serialized (UTF-16BE)  Serialized (UTF-8)
Å         C5       00 C5                  C3 85
Å         212B     21 2B                  E2 84 AB
          F0000    DB 80 DC 00            F3 B0 80 80
A         61 30A   00 61 03 0A            61 CC 8A
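The serialized bytes for these code points can be reproduced with Python's built-in codecs. In this sketch, the helper name `hex_bytes` and the dictionary layout are chosen for illustration.

```python
def hex_bytes(text, encoding):
    """Encode text and format the bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

# Encoded code points, keyed by a readable label.
rows = {
    "U+00C5": "\u00C5",
    "U+212B": "\u212B",
    "U+F0000": "\U000F0000",
    "U+0061 U+030A": "a\u030A",
}

for label, text in rows.items():
    print(label, "|", hex_bytes(text, "utf-16-be"), "|", hex_bytes(text, "utf-8"))
```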