Unicode Normalization - PowerPoint PPT Presentation

1
Unicode Normalization

Mark Davis
www.macchiato.com

2
Normalization

Uniqueness
two equivalent strings have precisely the same
normalized form
Fast binary comparison, accurate digital
signatures
Recommended for XML, JavaScript and other
standards

3
Canonical Equivalence

Fundamental equivalence
Indistinguishable to users, when correctly
rendered
Includes
Combining sequences
Hangul
Singletons

[Figure: example characters for each case, such as Ç and C followed by a combining mark]
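
The three cases listed on this slide can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the sample pairs are chosen here for illustration and are not taken from the slide.

```python
import unicodedata

def codepoints(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Illustrative pairs (assumptions of this sketch, one per case on the slide).
examples = {
    "combining sequence": ("\u00C7", "C\u0327"),   # C WITH CEDILLA vs. C + COMBINING CEDILLA
    "Hangul": ("\uAC00", "\u1100\u1161"),          # syllable GA vs. conjoining jamo
    "singleton": ("\u2126", "\u03A9"),             # OHM SIGN vs. GREEK CAPITAL LETTER OMEGA
}

for kind, (x, y) in examples.items():
    nfd_x = unicodedata.normalize("NFD", x)
    nfd_y = unicodedata.normalize("NFD", y)
    print(f"{kind}: {codepoints(x)} and {codepoints(y)} -> NFD {codepoints(nfd_x)}")
```

In each pair the two strings differ code point by code point but share one decomposed form.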
4
Compatibility Equivalence

Formatting differences
Font variants
Breaking differences
Cursive forms
Circled
Width, size, rotated
Super/subscripts
Squared characters
Fractions
Others

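[Figure: compatibility characters and their decompositions, such as a squared unit mapping to k + g and a ligature mapping to f + i]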
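
To make the categories concrete, the following sketch runs a few compatibility characters through NFC and NFKC using Python's `unicodedata`. The sample characters are picked for this illustration and are not necessarily the ones shown on the slide.

```python
import unicodedata

# One illustrative character per category (assumptions of this sketch).
samples = {
    "font variant": "\u210C",   # BLACK-LETTER CAPITAL H
    "breaking": "\u2011",       # NON-BREAKING HYPHEN
    "circled": "\u2460",        # CIRCLED DIGIT ONE
    "superscript": "\u2079",    # SUPERSCRIPT NINE
    "squared": "\u338F",        # SQUARE KG
    "fraction": "\u00BC",       # VULGAR FRACTION ONE QUARTER
    "ligature": "\uFB01",       # LATIN SMALL LIGATURE FI
}

for kind, ch in samples.items():
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"{kind}: NFC keeps it: {nfc == ch}, NFKC gives {nfkc!r}")
```

The canonical form leaves each character alone, while the compatibility form replaces it and discards the formatting distinction.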
5
UTR #15: Unicode Normalization Forms
Form D: Canonical Decomposition
Form KD: Compatibility Decomposition
Form C: Form D + Canonical Composition
Form KC: Form KD + Canonical Composition
6
Normalization Requirement

Uniqueness: two equivalent strings will have
precisely the same normalized form
If two strings x and y are canonical equivalents,
then
C(x) = C(y)
D(x) = D(y)
If two strings are compatibility equivalents,
then
KC(x) = KC(y)
KD(x) = KD(y)
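
Read in the other direction, the requirement gives a practical equivalence test: normalize both strings and compare. A minimal sketch with Python's `unicodedata` follows; the function names are hypothetical helpers introduced here.

```python
import unicodedata

def canonically_equivalent(x: str, y: str) -> bool:
    """Compare the canonical decompositions, D(x) and D(y)."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

def compatibility_equivalent(x: str, y: str) -> bool:
    """Compare the compatibility decompositions, KD(x) and KD(y)."""
    return unicodedata.normalize("NFKD", x) == unicodedata.normalize("NFKD", y)

# ANGSTROM SIGN vs. A + COMBINING RING ABOVE: canonical equivalents.
print(canonically_equivalent("\u212B", "A\u030A"))
# LATIN SMALL LIGATURE FI vs. "fi": only compatibility equivalents.
print(canonically_equivalent("\uFB01", "fi"), compatibility_equivalent("\uFB01", "fi"))
```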

7
Affected Characters

None of the forms affects text containing only ASCII
characters (U+0000 to U+007F)
None of the forms generates compatibility characters
that were not in the source text.
Both KD and KC replace compatibility characters.
Both D and C maintain compatibility characters.
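
These properties are easy to spot-check. The sketch below uses Python's `unicodedata`, with CIRCLED DIGIT ONE as an arbitrarily chosen compatibility character.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# All 128 ASCII characters in one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))
ascii_unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text for form in FORMS
)

# A compatibility character (chosen for illustration): CIRCLED DIGIT ONE.
compat_char = "\u2460"
forms_that_keep_it = [
    form for form in FORMS if compat_char in unicodedata.normalize(form, compat_char)
]

print(ascii_unchanged)
print(forms_that_keep_it)
```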

8
Cautions: Decomposition

Requires decomposition mappings from the Unicode
Character Database
Those decomposition mappings must be applied
recursively
The string must be put into canonical order
Either Canonical or Compatibility
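
The steps on this slide can be sketched in a few lines. This is an illustration built on the mappings Python exposes through `unicodedata.decomposition`, not a production implementation. It does not handle Hangul syllables, whose decomposition is computed arithmetically rather than stored as a mapping. The function names are hypothetical.

```python
import unicodedata

def decompose_char(ch, compat=False):
    """Recursively apply the decomposition mapping of a single character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    parts = mapping.split()
    if parts[0].startswith("<"):      # a tag such as <compat> marks a compatibility mapping
        if not compat:
            return ch
        parts = parts[1:]
    return "".join(decompose_char(chr(int(p, 16)), compat) for p in parts)

def canonical_order(s):
    """Stable-sort each run of non-starters by canonical combining class."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def decompose(s, compat=False):
    """Form D (compat=False) or Form KD (compat=True), Hangul excluded."""
    return canonical_order("".join(decompose_char(ch, compat) for ch in s))

print([f"U+{ord(c):04X}" for c in decompose("\u1E69")])
```

The recursion matters for characters such as U+1E69, whose mapping points to another character that itself decomposes. The reordering matters whenever combining marks arrive in a non-canonical order.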

9
Cautions: Composition

Decomposition required first!
Then canonical composition
Composition data fixed at Unicode 3.0.0
Some characters are excluded from composition
Form C and Form KC can still have combining
characters!
Required for Indic, Arabic, Hebrew, etc.
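
A quick check of the point that composed text may still contain combining characters, using Python's `unicodedata`. The sample strings are chosen for this sketch.

```python
import unicodedata

# Illustrative sequences (assumptions of this sketch): none has a composite
# that Form C would produce, so the combining mark survives composition.
samples = {
    "Latin q + acute": "q\u0301",
    "Hebrew bet + dagesh": "\u05D1\u05BC",
    "Arabic beh + fatha": "\u0628\u064E",
}

for name, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    marks = [ch for ch in nfc if unicodedata.category(ch) == "Mn"]
    print(f"{name}: length {len(nfc)}, combining marks left: {len(marks)}")
```

Code that assumes one code point per user-perceived character is therefore unsafe even on Form C text.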

10
Caution: Both C & D

None of the normalization forms is closed under
string concatenation. Example:
NFC/D: "a??" + "?"
Not normalized: "a???"
NFC: "à??"
NFD: "a???"
Exceptions are easy to test for
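
The concatenation problem can be reproduced with strings chosen for this sketch (they are not the slide's own example). Each piece is normalized on its own, but the joined string is not.

```python
import unicodedata

def is_normalized(form, s):
    """True if s is already in the given normalization form."""
    return unicodedata.normalize(form, s) == s

# Case 1: the join creates a composable pair, so the result is not in Form C.
x, y = "a", "\u0300"                      # a, COMBINING GRAVE ACCENT
print(is_normalized("NFC", x), is_normalized("NFC", y), is_normalized("NFC", x + y))

# Case 2: the join puts combining marks out of canonical order,
# so the result is in neither Form C nor Form D.
p, q = "q\u0301", "\u0323"                # q + acute (class 230), dot below (class 220)
print(is_normalized("NFD", p), is_normalized("NFD", q), is_normalized("NFD", p + q))
```

A simple remedy is to renormalize after joining. Checking only the characters around the join is an optimization on top of that.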

11
Composition Process

Decompose (D or KD)
Combine unblocked characters with the previous
starter, if possible
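
The composition step can be sketched as follows. This is an illustration under stated assumptions, not the reference code. The pair table is rebuilt from Python's `unicodedata`, the library's own NFC output is used to decide which composites are excluded, and Hangul, which is composed arithmetically, is left out. `build_pairs` and `compose` are hypothetical names.

```python
import sys
import unicodedata

def build_pairs():
    """Map (first, second) -> composite for canonical pairs that Form C produces."""
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue                      # no mapping, or a compatibility mapping
        parts = mapping.split()
        # Assumption: a composite that NFC leaves unchanged is not excluded.
        if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
            pairs[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch
    return pairs

def compose(decomposed, pairs):
    """Combine each unblocked character with the previous starter, if possible."""
    out = []
    starter = None    # index in `out` of the most recent starter
    last_ccc = 0      # combining class of the last character kept after it
    for ch in decomposed:
        ccc = unicodedata.combining(ch)
        composite = pairs.get((out[starter], ch)) if starter is not None else None
        if composite is not None and (last_ccc == 0 or last_ccc < ccc):
            out[starter] = composite      # unblocked: fold into the starter
            continue
        if ccc == 0:
            starter = len(out)
        last_ccc = ccc
        out.append(ch)
    return "".join(out)

pairs = build_pairs()
text = unicodedata.normalize("NFD", "s\u0307\u0323")
print(f"U+{ord(compose(text, pairs)):04X}")
```

The input must already be decomposed and in canonical order, which is why the example normalizes to Form D first.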

12
Composition Exclusions

Script Specifics: ? ?? ? ?
Futures: G ?? ? G?
Singletons: O ? O
Non-starter sequences: ? ? ? ??
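
One character from each exclusion class, chosen for this sketch rather than taken from the slide, shows the effect: Form C returns the decomposed sequence instead of the original character.

```python
import unicodedata

def cps(s):
    """Render a string as space-separated hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)

# Illustrative characters (assumptions of this sketch).
excluded = {
    "script specific": "\u0958",       # DEVANAGARI LETTER QA
    "added later": "\U0001D15E",       # MUSICAL SYMBOL HALF NOTE
    "singleton": "\u2126",             # OHM SIGN
    "non-starter": "\u0344",           # COMBINING GREEK DIALYTIKA TONOS
}

for kind, ch in excluded.items():
    print(f"{kind}: {cps(ch)} -> NFC {cps(unicodedata.normalize('NFC', ch))}")
```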

13
Legacy Encoding

Legacy text is normalized if it maps 1:1 to
normalized Unicode text
Legacy sets
Prenormalized: e.g. ISO 8859-1
Normalizable: e.g. ISO 2022 (ISO 5426/ISO
8859-1/)
Unnormalizable: e.g. ISO 5426
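
The prenormalized case can be checked directly for ISO 8859-1. The sketch below decodes every byte value with Python's built-in codec and tests the result against Form C.

```python
import unicodedata

# Decode all 256 byte values as ISO 8859-1 (each byte maps to one code point).
latin1_text = bytes(range(256)).decode("iso-8859-1")

already_nfc = unicodedata.normalize("NFC", latin1_text) == latin1_text
already_nfd = unicodedata.normalize("NFD", latin1_text) == latin1_text

print(already_nfc, already_nfd)
```

The same text is not in Form D, since the accented Latin-1 letters decompose, so "prenormalized" here is relative to the composed form.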

14
Programming Identifiers

Closed under all normalization forms, if minor
changes are incorporated
Modified syntax
identifier = start ( start | extend )*
start = [Lu Ll Lt Lm Lo Nl] - irregulars
combining_like
extend = [Mn Mc Nd Pc Cf] - irregulars
combining_like mid_dot
(Almost) closed under case mappings
see SpecialCasing.txt

15
Resources

Reference version on Unicode Site
Production Version
http://oss.software.ibm.com/icu
ICU C/C++ and Java versions
Open source, with IBM Public License
Free commercial use and distribution: not viral!
Panel: later today
Other companies also providing: ask!

16
Normalization

Uniqueness: two equivalent strings have precisely
the same normalized form
Fast binary comparison, accurate digital
signatures
Recommended for XML, JavaScript and other
standards

17
Q & A
18
Backup Slides
19
Definition: Starter

S is a starter
Canonical class of zero in the Unicode Character
Database
Can start a composition
Examples
Starters: spacing marks, some non-spacing marks
a, ? T ? ? ??
Non-starters: most non-spacing marks
?, ? ?? ??

20
Definition: Blocked

C is blocked from S
There is some character B between S and C, and
either
B is a starter or
B has the same canonical class as C
Examples
ABC: B blocks C from A
A??: ? blocks ? from A
A???: ?? doesn't block ? from A
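
Both backup definitions translate directly into code. The sketch below follows the wording on these slides (a blocker is a starter or has the same canonical class as C) and uses Python's `unicodedata.combining` for the canonical class. The function names and sample strings are assumptions of this sketch.

```python
import unicodedata

def is_starter(ch):
    """A starter has canonical combining class zero."""
    return unicodedata.combining(ch) == 0

def is_blocked(text, s, c):
    """True if text[c] is blocked from text[s] by some character between them."""
    ccc_c = unicodedata.combining(text[c])
    return any(
        is_starter(b) or unicodedata.combining(b) == ccc_c
        for b in text[s + 1:c]
    )

print(is_blocked("ABC", 0, 2))              # B is a starter
print(is_blocked("A\u0300\u0301", 0, 2))    # grave and acute share class 230
print(is_blocked("A\u0323\u0301", 0, 2))    # dot below (220) vs. acute (230)
```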

21
Testing Conformance: Canonical
For all Unicode characters X:
C(X) = C(D(X)); D(X) and C(X) are in canonical order
No CDM: X = D(X), X = C(X)
CDM: X ≠ D(X); no characters in D(X) have a CDM
X in Exclusions: X ≠ C(D(X))
X not in Exclusions: X = C(D(X))
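
A per-character check along these lines can be run against any implementation. The sketch below exercises Python's `unicodedata` over the Basic Multilingual Plane and covers only some of the conditions: C(X) = C(D(X)), D stable under reapplication, and X = C(X) whenever X = D(X). The function name is hypothetical.

```python
import unicodedata

def check_canonical_conformance(code_points):
    """Return the code points that violate any of the checked conditions."""
    failures = []
    for cp in code_points:
        x = chr(cp)
        d = unicodedata.normalize("NFD", x)
        c = unicodedata.normalize("NFC", x)
        ok = (
            unicodedata.normalize("NFC", d) == c        # C(X) = C(D(X))
            and unicodedata.normalize("NFD", d) == d    # D(X) is stable
            and (d != x or c == x)                      # X = D(X) implies X = C(X)
        )
        if not ok:
            failures.append(cp)
    return failures

# The BMP without the surrogate range U+D800..U+DFFF.
bmp = list(range(0xD800)) + list(range(0xE000, 0x10000))
print(len(check_canonical_conformance(bmp)))
```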
22
Unicode Normalization

Introduction
Normalization forms
Design goals
Specification
Excluded characters
Versions
Legacy encodings
Applications

23
Characters and Encoding Forms
Abstract   Encoded    Serialized: UTF-16BE   Serialized: UTF-8
Å          C5         00 C5                  C3 85
Å          212B       21 2B                  E2 84 AB
           F0000      DB 80 DC 00            F3 B0 80 80
A          61 30A     00 61 03 0A            61 CC 8A
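
The byte values on this slide can be reproduced with Python's built-in codecs. This is a small sketch; `hex_bytes` is a hypothetical helper.

```python
def hex_bytes(text, encoding):
    """Serialize text and render the bytes as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

# The four encoded sequences from the slide.
sequences = ["\u00C5", "\u212B", "\U000F0000", "\u0061\u030A"]

for text in sequences:
    encoded = " ".join(f"{ord(ch):X}" for ch in text)
    print(f"{encoded}: UTF-16BE {hex_bytes(text, 'utf-16-be')}, UTF-8 {hex_bytes(text, 'utf-8')}")
```

U+F0000 lies outside the BMP, so UTF-16 serializes it as a surrogate pair and UTF-8 as four bytes.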
