Title: Optimizing the Usage of Normalization
1. Optimizing the Usage of Normalization
- Vladimir Weinstein
- vweinste_at_us.ibm.com
- Globalization Center of Competency, San Jose, CA
2. Introduction
- The Unicode Standard has multiple ways to encode equivalent strings
- Accents that don't interact are put into a unique order
3. Introduction (contd.)
- Normalization provides a way to transform a string to a unique form (NFD, NFC)
- Strings that can be transformed to the same form are called canonically equivalent
- Time-critical applications need to minimize the number of passes over the text
- ICU gives a number of tools to deal with this problem
- We will use collation (language-sensitive string comparison) as an example
4. Avoiding Normalization
- Force users to provide already-normalized data
- The performance problem does not go away
- When strings are processed many times, it can be beneficial to normalize them beforehand
- Forcing users to provide a specific form can be unpopular
5. Check for Normalized Text
- Most strings are already in normalized form
- A quick check is significantly faster than full normalization
- Needs canonical class data, plus additional data for checking the relation between a code point and a normalization form
- Algorithm in UAX #15, Annex 8 (http://www.unicode.org/unicode/reports/tr15/Annex8)
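As a sketch of the quick-check idea, Python's standard-library `unicodedata` module (used here as a stand-in for ICU's API; requires Python 3.8+ for `is_normalized`) can answer "is this already normalized?" without building the normalized copy:

```python
import unicodedata

# "resume" written with combining acute accents -- already in decomposed order
s = "re\u0301sume\u0301"

# Quick-check style tests: a yes/no answer, no normalized string is allocated
print(unicodedata.is_normalized("NFD", s))  # True: already fully decomposed
print(unicodedata.is_normalized("NFC", s))  # False: the accents would compose
```

Only strings for which the check fails need to go through full normalization.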
6. Normalize Incrementally
- Instead of normalizing the whole string at once, normalize one piece at a time
- This technique is usually combined with an incremental quick check
- Useful for procedures with an early exit, such as string comparison or scanning
- Normalizes up to the next safe point
7. Incremental Normalization Example
- Non-incremental normalization:
  - Initial string: résumé
  - Quick check
  - If normalized regularly, the whole string is processed by normalization
- Incremental normalization:
  - Normalize just the parts that fail the quick check
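A minimal sketch of the technique, using `unicodedata` in place of ICU. It treats any starter (combining class 0) as the next safe point, which is an approximation: ICU uses real normalization boundary data, and cases such as Hangul jamo need more care than this.

```python
import unicodedata

def incremental_nfc(s: str) -> str:
    """Sketch: NFC-normalize one piece at a time, quick-checking each piece."""
    out = []
    i = 0
    while i < len(s):
        # Extend the piece up to the next starter (combining class 0) --
        # an approximate "safe point".
        j = i + 1
        while j < len(s) and unicodedata.combining(s[j]) != 0:
            j += 1
        piece = s[i:j]
        # Quick check the piece; normalize only the parts that fail.
        if not unicodedata.is_normalized("NFC", piece):
            piece = unicodedata.normalize("NFC", piece)
        out.append(piece)
        i = j
    return "".join(out)

print(incremental_nfc("re\u0301sume\u0301"))  # résumé, composed piece by piece
```

In a scanning or comparison loop, the caller would stop requesting pieces as soon as a result is determined, which is where the savings come from.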
8. Optimized Concatenation
- Simple concatenation of two normalized strings can yield a string that is not normalized
- One option is to normalize the result
  - This unnecessarily duplicates normalization work
9. Optimized Concatenation Example
- It is enough to normalize the parts at the boundary
- Incremental normalization is used
- Much faster than renormalizing the whole resulting string
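A sketch of boundary-only concatenation, assuming both inputs are already NFC. Only the run between the last starter of the first string and the first starter of the second is renormalized; the function name and helper logic are illustrative, not ICU's API (ICU provides this as `normalizeSecondAndAppend` in `Normalizer2`).

```python
import unicodedata

def nfc_concat(a: str, b: str) -> str:
    """Concatenate two NFC strings, renormalizing only around the boundary."""
    if not a or not b:
        return a + b
    # Back up to the last starter (combining class 0) in a ...
    i = len(a) - 1
    while i > 0 and unicodedata.combining(a[i]) != 0:
        i -= 1
    # ... and go forward past the first starter in b, since that starter may
    # itself compose leftward (e.g. a Hangul vowel jamo after a leading jamo).
    j = 0
    while j < len(b) and unicodedata.combining(b[j]) != 0:
        j += 1
    j = min(j + 1, len(b))
    middle = unicodedata.normalize("NFC", a[i:] + b[:j])
    return a[:i] + middle + b[j:]

print(nfc_concat("re", "\u0301sum\u00e9"))  # résumé: only "e" + accent re-done
```

Everything outside the small `middle` slice is copied verbatim, which is what makes this much cheaper than renormalizing the whole result.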
10. Accepting the FCD Form
- "Fast Composed or Decomposed" (FCD) is a partially normalized form
- Not unique
- More lenient than the NFD or NFC forms
- Requires that the procedure support all canonically equivalent strings on input
- It is possible to quick check the FCD form
11. FCD Form Examples

Sequence          FCD   NFC   NFD
A-ring             Y     Y
Angstrom           Y
A ring             Y           Y
A grave            Y           Y
A-ring grave       Y
A cedilla ring     Y           Y
A ring cedilla
A-ring cedilla           Y
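The FCD condition can be tested from combining-class data alone: a string is FCD if, for every character, the lead combining class of its canonical decomposition is zero or is not smaller than the trail combining class of the previous character's decomposition. A minimal sketch using `unicodedata` (Hangul and other algorithmic decompositions are ignored for brevity):

```python
import unicodedata

def canonical_decomp(ch: str) -> str:
    """Recursive canonical decomposition (compatibility mappings excluded)."""
    d = unicodedata.decomposition(ch)
    if not d or d.startswith("<"):      # no mapping, or a compatibility mapping
        return ch
    return "".join(canonical_decomp(chr(int(cp, 16))) for cp in d.split())

def is_fcd(s: str) -> bool:
    """True if no combining-class order violation appears when each character
    is replaced by its canonical decomposition."""
    prev_tccc = 0                        # trail ccc of the previous character
    for ch in s:
        d = canonical_decomp(ch)
        lccc = unicodedata.combining(d[0])   # lead ccc of the decomposition
        if lccc != 0 and prev_tccc > lccc:
            return False
        prev_tccc = unicodedata.combining(d[-1])
    return True

print(is_fcd("\u00c5\u0327"))  # A-ring cedilla: False (trail 230 > lead 202)
print(is_fcd("\u212b"))        # Angstrom sign: True, as in the table above
```

Because the check never has to reorder or compose anything, it is a single cheap pass over the text.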
12. Canonical Closure
- Preprocessing of data to support the FCD form
- Ensures that if data is assigned to a sequence (or a code point), it is also assigned to all canonically equivalent FCD sequences
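A deliberately simplified sketch of the idea: real canonical closure (as in ICU's collation data build) enumerates every canonically equivalent sequence in FCD form, while this toy version only adds the NFC and NFD variants of each mapped key.

```python
import unicodedata

def canonical_closure(table: dict) -> dict:
    """Toy canonical closure: for every mapped sequence, also map its NFC and
    NFD variants, so canonically equivalent FCD input finds the same data.
    (Real closure enumerates all FCD-form equivalents, not just two forms.)"""
    closed = dict(table)
    for key, value in table.items():
        for form in ("NFC", "NFD"):
            closed.setdefault(unicodedata.normalize(form, key), value)
    return closed

weights = canonical_closure({"\u00c5": 0x5A})  # data assigned to precomposed Å
print("A\u030a" in weights)                    # the decomposed form is covered
```

The closure is computed once at build time; at runtime a lookup on any FCD-equivalent spelling then hits the table directly, with no normalization pass.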
13. Collation
- Locale-specific sorting of strings
- Relation between code points and collation elements
- Context sensitive:
  - Contractions: H < Z, but CZ < CH
  - Expansions: OE < Œ < OF
  - Both: ?? < ?? or ?? > ??
- See "Collation in ICU" by Mark Davis
14. Collation Implementation in ICU
- Two modes of operation:
  - Normalization OFF: expects the user to pass in FCD strings
  - Normalization ON: accepts any strings
- Some locales require normalization to be turned on
- Canonical closure is done for contractions and regular mappings
- Two important services:
  - Sort key generation
  - String compare function
- More about ICU at the end of the presentation
15. FCD Support in Collation
- Much higher performance
- Values assigned to a code point or a contraction are equal to those for its FCD canonically equivalent sequences
- This process is time consuming, but it is done at build time
- May increase the data size
16. Sort Key Generation
- Whole strings are processed
- Sort keys tend to get reused, so the emphasis is on producing sort keys that are as short as possible
- Two modes of operation:
  - Normalization ON: strings are quick checked, and normalization is performed if required
  - Normalization OFF: depends on strings being in FCD form; performance increases by 20% to 50%
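The two modes can be sketched as a thin wrapper around the key-building step. Here `key_func` is a hypothetical stand-in for the real collation-element-to-sort-key machinery, and `unicodedata` again stands in for ICU:

```python
import unicodedata

def sort_key(s: str, normalization_on: bool, key_func):
    """Sketch of the two sort-key modes (key_func is a placeholder for the
    real collation-element-to-key step)."""
    if normalization_on:
        # Quick check first; pay for normalization only when actually needed.
        if not unicodedata.is_normalized("NFD", s):
            s = unicodedata.normalize("NFD", s)
    # With normalization OFF, s is trusted to already be in FCD form.
    return key_func(s)
```

Since most real-world input passes the quick check, the ON mode usually costs only one scan more than the OFF mode.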
17. String Compare
- Very time critical
- The result is usually determined before both strings are fully processed
- The first step is a binary comparison for equality
- When it fails, comparison continues from a safe spot
18. String Compare (contd.)
- Normalization ON: incremental FCD check, plus incremental FCD normalization if required
- Normalization OFF: assumes that the source strings are FCD
- Most locales don't require normalization to be on, and are thus 20% faster using FCD
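The two-step structure can be sketched as an equality test under canonical equivalence. For brevity the fallback here normalizes the whole strings; ICU instead resumes from a safe spot and normalizes incrementally, so it rarely touches more text than the comparison needs:

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Equality under canonical equivalence, with a binary fast path."""
    if a == b:                 # step 1: plain binary comparison for equality
        return True
    # Step 2 (fallback): normalize and compare. ICU would restart from a safe
    # spot and normalize incrementally rather than process whole strings.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(canonical_equal("\u00c9", "E\u0301"))  # True: É vs. E + combining acute
```

The fast path wins whenever the inputs are byte-identical, which is the common case for real data.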
19. International Components for Unicode
- International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
- The ICU normalization engine supports the optimizations mentioned here
- Library services accept FCD strings as input
- Wide variety of supported platforms
- Open source (X license, non-viral)
- C/C++ and Java versions
- http://oss.software.ibm.com/icu/
20. Conclusion
- The presented techniques allow much faster string processing
- In the case of collation, sort key generation becomes up to 50% faster than when normalizing beforehand
- The string compare function becomes up to 3 times faster!
- May increase the data size
- Canonical closure preprocessing takes more time at build time, but pays off at runtime
21. Q & A
22. Summary
- Introduction
- Avoiding normalization
- Check for normalized text
- Normalize incrementally
- Concatenation of normalized strings
- Accepting the FCD form
- Implementation of collation in ICU