Title: Optimizing the Usage of Normalization
1. Optimizing the Usage of Normalization
- Vladimir Weinstein
- vweinste_at_us.ibm.com
- Globalization Center of Competency, San Jose, CA
2. Introduction
- The Unicode Standard has multiple ways to encode equivalent strings
- Accents that don't interact are put into a unique order
3. Introduction (contd.)
- Normalization provides a way to transform a string to a unique form (NFD, NFC)
- Strings that can be transformed to the same form are called canonically equivalent
- Time-critical applications need to minimize the number of passes over the text
- ICU gives a number of tools to deal with this problem
- We will use collation (language-sensitive string comparison) as an example
4. Avoiding Normalization
- Force users to provide already-normalized data
- The performance problem does not go away
- When strings are processed many times, it can be beneficial to normalize them beforehand
- Forcing users to provide a specific form can be unpopular
5. Check for Normalized Text
- Most strings are already in normalized form
- A quick check is significantly faster than full normalization
- Needs canonical class data, plus additional data for checking the relation between a code point and a normalization form
- Algorithm in UAX #15, Annex 8 (http://www.unicode.org/unicode/reports/tr15/Annex8)
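As a sketch of the quick-check idea, Python's standard-library `unicodedata` module (used here as a stand-in for ICU's API; requires Python 3.8+ for `is_normalized`) can answer "is this already normalized?" without building the normalized copy:

```python
import unicodedata

# "resume" written with combining acute accents -- already in decomposed order
s = "re\u0301sume\u0301"

# Quick-check style tests: a yes/no answer, no normalized string is allocated
print(unicodedata.is_normalized("NFD", s))  # True: already fully decomposed
print(unicodedata.is_normalized("NFC", s))  # False: the accents would compose
```

Only strings for which the check fails need to go through full normalization.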
6. Normalize Incrementally
- Instead of normalizing the whole string at once, normalize one piece at a time
- This technique is usually combined with an incremental quick check
- Useful for procedures with an early exit, such as string comparison or scanning
- Normalizes up to the next safe point
7. Incremental Normalization Example
- Non-incremental normalization:
  - Initial string: résumé
  - Quick check
  - If normalized regularly, the whole string is processed by normalization
- Incremental normalization:
  - Normalize just the parts that fail the quick check
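A minimal sketch of the technique, using `unicodedata` in place of ICU. It treats any starter (combining class 0) as the next safe point, which is an approximation: ICU uses real normalization boundary data, and cases such as Hangul jamo need more care than this.

```python
import unicodedata

def incremental_nfc(s: str) -> str:
    """Sketch: NFC-normalize one piece at a time, quick-checking each piece."""
    out = []
    i = 0
    while i < len(s):
        # Extend the piece up to the next starter (combining class 0) --
        # an approximate "safe point".
        j = i + 1
        while j < len(s) and unicodedata.combining(s[j]) != 0:
            j += 1
        piece = s[i:j]
        # Quick check the piece; normalize only the parts that fail.
        if not unicodedata.is_normalized("NFC", piece):
            piece = unicodedata.normalize("NFC", piece)
        out.append(piece)
        i = j
    return "".join(out)

print(incremental_nfc("re\u0301sume\u0301"))  # résumé, composed piece by piece
```

In a scanning or comparison loop, the caller would stop requesting pieces as soon as a result is determined, which is where the savings come from.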
8. Optimized Concatenation
- Simple concatenation of two normalized strings can yield a string that is not normalized
- One option is to normalize the result
  - This unnecessarily duplicates normalization work
9. Optimized Concatenation Example
- It is enough to normalize the parts at the boundary
- Incremental normalization is used
- Much faster than renormalizing the whole resulting string
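A sketch of boundary-only concatenation, assuming both inputs are already NFC. Only the run between the last starter of the first string and the first starter of the second is renormalized; the function name and helper logic are illustrative, not ICU's API (ICU provides this as `normalizeSecondAndAppend` in `Normalizer2`).

```python
import unicodedata

def nfc_concat(a: str, b: str) -> str:
    """Concatenate two NFC strings, renormalizing only around the boundary."""
    if not a or not b:
        return a + b
    # Back up to the last starter (combining class 0) in a ...
    i = len(a) - 1
    while i > 0 and unicodedata.combining(a[i]) != 0:
        i -= 1
    # ... and go forward past the first starter in b, since that starter may
    # itself compose leftward (e.g. a Hangul vowel jamo after a leading jamo).
    j = 0
    while j < len(b) and unicodedata.combining(b[j]) != 0:
        j += 1
    j = min(j + 1, len(b))
    middle = unicodedata.normalize("NFC", a[i:] + b[:j])
    return a[:i] + middle + b[j:]

print(nfc_concat("re", "\u0301sum\u00e9"))  # résumé: only "e" + accent re-done
```

Everything outside the small `middle` slice is copied verbatim, which is what makes this much cheaper than renormalizing the whole result.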
10. Accepting the FCD Form
- "Fast Composed or Decomposed" (FCD) is a partially normalized form
- Not unique
- More lenient than the NFD or NFC forms
- Requires that the procedure support all canonically equivalent strings on input
- It is possible to quick check the FCD form
11. FCD Form Examples

Sequence          FCD   NFC   NFD
A-ring             Y     Y
Angstrom           Y
A ring             Y           Y
A grave            Y           Y
A-ring grave       Y
A cedilla ring     Y           Y
A ring cedilla
A-ring cedilla           Y
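The FCD condition can be tested from combining-class data alone: a string is FCD if, for every character, the lead combining class of its canonical decomposition is zero or is not smaller than the trail combining class of the previous character's decomposition. A minimal sketch using `unicodedata` (Hangul and other algorithmic decompositions are ignored for brevity):

```python
import unicodedata

def canonical_decomp(ch: str) -> str:
    """Recursive canonical decomposition (compatibility mappings excluded)."""
    d = unicodedata.decomposition(ch)
    if not d or d.startswith("<"):      # no mapping, or a compatibility mapping
        return ch
    return "".join(canonical_decomp(chr(int(cp, 16))) for cp in d.split())

def is_fcd(s: str) -> bool:
    """True if no combining-class order violation appears when each character
    is replaced by its canonical decomposition."""
    prev_tccc = 0                        # trail ccc of the previous character
    for ch in s:
        d = canonical_decomp(ch)
        lccc = unicodedata.combining(d[0])   # lead ccc of the decomposition
        if lccc != 0 and prev_tccc > lccc:
            return False
        prev_tccc = unicodedata.combining(d[-1])
    return True

print(is_fcd("\u00c5\u0327"))  # A-ring cedilla: False (trail 230 > lead 202)
print(is_fcd("\u212b"))        # Angstrom sign: True, as in the table above
```

Because the check never has to reorder or compose anything, it is a single cheap pass over the text.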
12. Canonical Closure
- Preprocessing of data to support the FCD form
- Ensures that if data is assigned to a sequence (or a code point), it is also assigned to all canonically equivalent FCD sequences
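A deliberately simplified sketch of the idea: real canonical closure (as in ICU's collation data build) enumerates every canonically equivalent sequence in FCD form, while this toy version only adds the NFC and NFD variants of each mapped key.

```python
import unicodedata

def canonical_closure(table: dict) -> dict:
    """Toy canonical closure: for every mapped sequence, also map its NFC and
    NFD variants, so canonically equivalent FCD input finds the same data.
    (Real closure enumerates all FCD-form equivalents, not just two forms.)"""
    closed = dict(table)
    for key, value in table.items():
        for form in ("NFC", "NFD"):
            closed.setdefault(unicodedata.normalize(form, key), value)
    return closed

weights = canonical_closure({"\u00c5": 0x5A})  # data assigned to precomposed Å
print("A\u030a" in weights)                    # the decomposed form is covered
```

The closure is computed once at build time; at runtime a lookup on any FCD-equivalent spelling then hits the table directly, with no normalization pass.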
13. Collation
- Locale-specific sorting of strings
- Relation between code points and collation elements
- Context sensitive:
  - Contractions: H < Z, but CZ < CH
  - Expansions: OE < Œ < OF
  - Both: ?? < ?? or ?? > ??
- See "Collation in ICU" by Mark Davis
14. Collation Implementation in ICU
- Two modes of operation:
  - Normalization OFF: expects the user to pass in FCD strings
  - Normalization ON: accepts any strings
- Some locales require normalization to be turned on
- Canonical closure is done for contractions and regular mappings
- Two important services:
  - Sort key generation
  - String compare function
- More about ICU at the end of the presentation
15. FCD Support in Collation
- Much higher performance
- Values assigned to a code point or a contraction are equal to those for its FCD canonically equivalent sequences
- This process is time consuming, but it is done at build time
- May increase the data size
16. Sort Key Generation
- Whole strings are processed
- Sort keys tend to get reused, so the emphasis is on producing sort keys that are as short as possible
- Two modes of operation:
  - Normalization ON: strings are quick checked, and normalization is performed if required
  - Normalization OFF: depends on strings being in FCD form; performance increases by 20% to 50%
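The two modes can be sketched as a thin wrapper around the key-building step. Here `key_func` is a hypothetical stand-in for the real collation-element-to-sort-key machinery, and `unicodedata` again stands in for ICU:

```python
import unicodedata

def sort_key(s: str, normalization_on: bool, key_func):
    """Sketch of the two sort-key modes (key_func is a placeholder for the
    real collation-element-to-key step)."""
    if normalization_on:
        # Quick check first; pay for normalization only when actually needed.
        if not unicodedata.is_normalized("NFD", s):
            s = unicodedata.normalize("NFD", s)
    # With normalization OFF, s is trusted to already be in FCD form.
    return key_func(s)
```

Since most real-world input passes the quick check, the ON mode usually costs only one scan more than the OFF mode.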
17. String Compare
- Very time critical
- The result is usually determined before both strings are fully processed
- The first step is a binary comparison for equality
- When it fails, comparison continues from a safe spot
18. String Compare (contd.)
- Normalization ON: incremental FCD check, plus incremental FCD normalization if required
- Normalization OFF: assumes that the source strings are FCD
- Most locales don't require normalization to be on, and are thus 20% faster using FCD
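The two-step structure can be sketched as an equality test under canonical equivalence. For brevity the fallback here normalizes the whole strings; ICU instead resumes from a safe spot and normalizes incrementally, so it rarely touches more text than the comparison needs:

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Equality under canonical equivalence, with a binary fast path."""
    if a == b:                 # step 1: plain binary comparison for equality
        return True
    # Step 2 (fallback): normalize and compare. ICU would restart from a safe
    # spot and normalize incrementally rather than process whole strings.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(canonical_equal("\u00c9", "E\u0301"))  # True: É vs. E + combining acute
```

The fast path wins whenever the inputs are byte-identical, which is the common case for real data.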
19. International Components for Unicode
- International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
- The ICU normalization engine supports the optimizations mentioned here
- Library services accept FCD strings as input
- Wide variety of supported platforms
- Open source (X license, non-viral)
- C/C++ and Java versions
- http://oss.software.ibm.com/icu/
20. Conclusion
- The presented techniques allow much faster string processing
- In the case of collation, sort key generation becomes up to 50% faster than when normalizing beforehand
- The string compare function becomes up to 3 times faster!
- May increase the data size
- Canonical closure preprocessing takes more time at build time, but pays off at runtime
21. Q & A
22. Summary
- Introduction
- Avoiding normalization
- Check for normalized text
- Normalize incrementally
- Concatenation of normalized strings
- Accepting the FCD form
- Implementation of collation in ICU