1
Collation in ICU 1.8
  • Mark Davis
  • Chief SW Globalization Architect
  • IBM

2
Agenda
  • What is Collation?
  • Features
  • Mechanisms
  • Warnings
  • ICU 1.8 Collation
  • Note: Slides differ from printouts

3
Collation: Sorting Order
  • How hard can it be?
  • A < B < C < ...
  • Complications
  • Languages are complex and varied
  • Unicode is a big set of characters
  • Performance is crucial

4
Varies By
  • Language
  • Swedish: z < ö
  • German: ö < z
  • Usage
  • Dictionary: öf < of
  • Telephone: of < öf
  • Customizations
  • A < a
  • a < A
  • Versioning
  • Fixes
  • New government standards
  • New characters

5
Levels
  • Base characters: a < b
  • Accents: as < às < at
  • ignored if there is an L1 (base character) difference
  • Case: ao < Ao < aò
  • ignored if there is an L1 or L2 difference
  • Punctuation: ab < a-b < aB
  • ignored if there is an L1, L2, or L3 difference
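The level scheme above can be tried out with the JDK's java.text.Collator, whose design descends from the ICU/Java architecture these slides describe. This is only a sketch: it assumes the JDK's built-in English rules (not ICU 1.8 itself), and the class and method names are made up for the illustration. PRIMARY, SECONDARY, and TERTIARY correspond roughly to levels 1 to 3.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; uses the JDK collator as a stand-in for ICU.
final class StrengthDemo {
    /** Compares a and b with an English collator at the given strength; returns -1, 0, or 1. */
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        String accented = "\u00e0s"; // "as" with a grave accent on the a

        // Level 1 only: the accent is ignored, so the strings compare equal.
        System.out.println(compareAt(Collator.PRIMARY, "as", accented));
        // Level 2: the accent now breaks the tie.
        System.out.println(compareAt(Collator.SECONDARY, "as", accented));
        // A base-letter difference (s vs. t) outranks the accent difference.
        System.out.println(compareAt(Collator.TERTIARY, accented, "at"));
        // Case is a level-3 difference: invisible at SECONDARY, visible at TERTIARY.
        System.out.println(compareAt(Collator.SECONDARY, "a", "A"));
        System.out.println(compareAt(Collator.TERTIARY, "a", "A"));
    }
}
```

Note that the JDK collator has no separate punctuation level, so the fourth level on this slide is not covered by the sketch.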

6
Context Sensitivity
  • Contractions
  • H < Z, but CZ < CH
  • Expansions
  • OE < Œ < OF
  • Both
  • ?? < ??
  • ?? > ??
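A contraction can be sketched with the JDK's java.text.RuleBasedCollator. The rule strings below are toy alphabets invented for this illustration (only c, h, ch, and z are defined, in lowercase), not a real language tailoring. In the first collator, "ch" is a single collation element ordered after h.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class with toy rule strings.
final class ContractionDemo {
    /** Toy alphabet in which "ch" is one collation element ordered after h. */
    static RuleBasedCollator withContraction() {
        return build("< c < h < ch < z");
    }

    /** The same toy alphabet without the contraction. */
    static RuleBasedCollator withoutContraction() {
        return build("< c < h < z");
    }

    private static RuleBasedCollator build(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalStateException("bad rule string: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // h < z holds in both collators.
        System.out.println(withContraction().compare("h", "z") < 0);
        // With the contraction, "cz" sorts before "ch": c is compared with ch, and c comes first.
        System.out.println(withContraction().compare("cz", "ch") < 0);
        // Without it, the second letters decide, and h < z puts "ch" first.
        System.out.println(withoutContraction().compare("cz", "ch") > 0);
    }
}
```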

7
Canonical Equivalence
  • Å Å A º
  • x . x .
  • ? u u . ? u . u
    ? .

8
Oddities
  • Normal accents
  • cote < coté < côte < côté
  • first accent difference determines order
  • French accents
  • cote < côte < coté < côté
  • last accent difference determines order
  • Illogical order (Thai, Lao)
  • ? ? sorts like ? ?
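The difference between normal and French accent ordering can be sketched with a toy two-level comparison. This is an illustration written for these notes, not ICU's implementation: base letters are compared first, and only if they are equal are the per-letter accent weights compared, either from the front (normal) or from the back (French). The accent weight is simply the code point of the combining mark, which is an arbitrary choice made here.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy comparison; the weights are invented for the illustration.
final class AccentOrderDemo {
    /** Base letters only: decompose, then drop combining marks. */
    static String baseLetters(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");
    }

    /** One toy accent weight per base letter: 0 for none, else the mark's code point. */
    static int[] accentWeights(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        List<Integer> weights = new ArrayList<>();
        for (int i = 0; i < nfd.length(); i++) {
            char ch = nfd.charAt(i);
            if (Character.getType(ch) == Character.NON_SPACING_MARK) {
                if (!weights.isEmpty()) {
                    weights.set(weights.size() - 1, (int) ch); // attach to preceding base
                }
            } else {
                weights.add(0);
            }
        }
        return weights.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Level 1 on base letters, then level 2 on accents, scanned forwards or backwards. */
    static int compare(String a, String b, boolean backwards) {
        int primary = baseLetters(a).compareTo(baseLetters(b));
        if (primary != 0) {
            return primary;
        }
        int[] wa = accentWeights(a);
        int[] wb = accentWeights(b);
        // Equal base letters imply equal lengths here.
        for (int k = 0; k < wa.length; k++) {
            int i = backwards ? wa.length - 1 - k : k;
            if (wa[i] != wb[i]) {
                return Integer.compare(wa[i], wb[i]);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(
                List.of("c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"));
        words.sort((a, b) -> compare(a, b, false));
        System.out.println("normal: " + words);
        words.sort((a, b) -> compare(a, b, true));
        System.out.println("French: " + words);
    }
}
```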

9
Merging Database Fields
  • F1 LastName, F2 FirstName

10
Customizations
  • Parameters that change collation behavior
  • Choice of language (locale)
  • Runtime choices
  • Examples to follow

11
Parametric Customizations
  • Strength
  • Base
  • Base + Accent
  • Base + Accent + Case
  • Case
  • A < a
  • a < A
  • Punctuation
  • di Silva < diSilva
  • diSilva < di Silva

12
Punctuation (Alternates)
  • Base character: di silva, di Silva, Di silva, Di Silva, Dickens,
    disilva, diSilva, Disilva, DiSilva
  • Ignorable: Dickens, di silva, disilva, di Silva, diSilva, Di silva,
    Disilva, Di Silva, DiSilva

13
Extended Customizations
  • User-defined
  • ampersand
  • Merging tailorings
  • Iranian French
  • Script Order
  • b < ? < ß < ?
  • ß < b < ? < ?
  • Numbers
  • A-1 < A-234
  • A-234 < A-1

14
Collation also used for
  • Searching
  • ignore case, accent options
  • Selection
  • Return all records where
  • Jones ≤ name < Smith
  • Graphemes
  • What a user considers a character
  • Regular expressions (Level 3)
  • UTR 18

15
UCA
  • UTS #10: Unicode Collation Algorithm
  • Levels, Expansions, Contractions, Punctuation,
    Canonical Equivalence, etc.
  • Default ordering for all Unicode code points
  • Provides for tailoring to given languages
  • Also see The Unicode Standard, Section 5.17, Sorting
    and Searching
  • Aligned with ISO 14651

16
APIs
  • String Compare
  • Sort Keys
  • String Search

17
Sort Keys
  • Transform string into series of bytes which will
    binary-compare
  • a 06 C3 01 20 01 02 00
  • A 06 C3 01 20 01 08 00
  • á 06 C3 01 20 32 01 02 02 00
  • ab 06 C3 06 D7 01 20 20 01 02 02 00
  • b 06 D7 01 20 01 02 00
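The binary-compare property can be checked with the JDK's CollationKey, used here as a stand-in for ICU sort keys. The bytes it produces are not the ICU 1.8 bytes listed above; the sketch only shows that an unsigned byte-wise comparison of two keys agrees with comparing the strings directly. The class and helper names are assumptions made for this example, and Arrays.compareUnsigned requires Java 9 or later.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; JDK sort keys, not ICU's byte layout.
final class SortKeyDemo {
    /** Sort key bytes for s under the given collator. */
    static byte[] sortKey(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    /** Unsigned lexicographic comparison of two sort keys; returns -1, 0, or 1. */
    static int compareKeys(byte[] a, byte[] b) {
        return Integer.signum(Arrays.compareUnsigned(a, b));
    }

    /** Hex dump in the style of the slide. */
    static String hex(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        String[] samples = {"a", "A", "\u00e1", "ab", "b"};
        for (String s : samples) {
            System.out.println(s + "  " + hex(sortKey(collator, s)));
        }
        // Byte comparison of keys and direct string comparison agree in sign.
        int viaKeys = compareKeys(sortKey(collator, "ab"), sortKey(collator, "b"));
        int viaStrings = Integer.signum(collator.compare("ab", "b"));
        System.out.println(viaKeys == viaStrings);
    }
}
```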

18
String Compare vs. Sort Keys
  • Same results in either case
  • String compare is faster for single comparisons
  • on average 5 to 10 times!
  • Sort keys are faster for multiple comparisons
  • index once
  • binary compare many times
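The "index once, binary compare many times" pattern might look like the following with the JDK's Collator and CollationKey (a stand-in for the ICU API; the class and method names are illustrative). Each string is converted to a key once, and the sort then compares only keys. A direct string-compare sort is included to show that both routes give the same order.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class.
final class KeySortDemo {
    /** Builds one key per string, sorts the keys, and returns the strings in key order. */
    static List<String> sortWithKeys(Collator collator, List<String> input) {
        List<CollationKey> keys = new ArrayList<>();
        for (String s : input) {
            keys.add(collator.getCollationKey(s)); // the collation work happens once per string
        }
        Collections.sort(keys); // CollationKey is Comparable; this is a binary key compare
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    /** Sorts by calling the collator's string comparison on every pair the sort examines. */
    static List<String> sortWithCompare(Collator collator, List<String> input) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> input = List.of("b", "A", "\u00e1", "ab", "a");
        System.out.println(sortWithKeys(collator, input));
        System.out.println(sortWithCompare(collator, input));
    }
}
```

In a database setting the keys would typically be stored in an index rather than rebuilt for each sort.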

19
String Search
  • Naïve Approach
  • key matches in target at <x, y>
  • iff target.substring(x, y) = key
  • Boundary Complications
  • Ignorables: a matches in (a)?
  • at <0,2>, <1,2>, <0,3>, <1,3>?
  • Contractions: c matches in churo?
  • Normalization: å matches in a?
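The naïve definition can be written down directly: try every substring and ask the collator whether it is equal to the key. The sketch below uses the JDK's java.text.Collator as a stand-in, and the class and method names are invented for the example. It is quadratic and ignores the boundary complications listed above; when the collator treats some characters as ignorable, several overlapping <x, y> pairs can all qualify, which is exactly the ambiguity the slide raises.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; a naive search, not ICU's string search.
final class NaiveSearchDemo {
    /** Returns every {x, y} such that target.substring(x, y) is collation-equal to key. */
    static List<int[]> findAll(Collator collator, String target, String key) {
        List<int[]> matches = new ArrayList<>();
        for (int x = 0; x < target.length(); x++) {
            for (int y = x + 1; y <= target.length(); y++) {
                if (collator.compare(target.substring(x, y), key) == 0) {
                    matches.add(new int[] {x, y});
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY); // ignore case and accents
        for (int[] m : findAll(collator, "xAy", "a")) {
            System.out.println("<" + m[0] + "," + m[1] + ">");
        }
        // The ambiguous case from the slide; the result depends on the collator's ignorables.
        for (int[] m : findAll(collator, "(a)", "a")) {
            System.out.println("<" + m[0] + "," + m[1] + ">");
        }
    }
}
```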

20
WARNING 1: Basics
  • Not aligned with character set or repertoire
  • Latin-1: Swedish and German sorting differs
  • Not code point (binary) order
  • Binary: Z < a < v < w
  • English: Z > a
  • Swedish: v w
  • Not a property of strings
  • With same database
  • Swedish user view/select
  • German user view/select

21
WARNING 2: Operations
  • Order not preserved under concatenation /
    substringing
  • x < y ⇏ xz < yz
  • x < y ⇏ zx < zy
  • xz < yz ⇏ x < y
  • zx < zy ⇏ x < y
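A contraction gives a small counterexample. The sketch below builds a toy java.text.RuleBasedCollator whose rule string is invented for the illustration: only c, d, h, and the contraction ch are defined, with ch ordered after h. Under these rules c < d, yet appending h to both sides reverses the order.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class with a toy rule string.
final class ConcatenationDemo {
    /** Toy alphabet: c < d < h < ch, where "ch" is a single collation element. */
    static RuleBasedCollator toyCollator() {
        try {
            return new RuleBasedCollator("< c < d < h < ch");
        } catch (ParseException e) {
            throw new IllegalStateException("bad rule string", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = toyCollator();
        // x < y ...
        System.out.println(collator.compare("c", "d") < 0);
        // ... but xz > yz: "ch" is one element after h, while "dh" starts with d.
        System.out.println(collator.compare("ch", "dh") > 0);
    }
}
```

Read in the other direction, the same pair shows that an order between two strings says nothing reliable about the order of their substrings.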

22
WARNING 3: Dependence
  • Collation is a relation over strings
  • Sort keys embody part of that relation
  • Thus, comparing sort keys from different
    tailorings (or parameters) gives undefined
    results.
  • C < CH < D
  • May move binary value for D
23
WARNING 4: Stability
  • Stable Sort
  • Records with equal comparison come out in
    original order
  • Property of algorithm, not comparison
  • Semi-Stable Comparison
  • x ? y ? x ? y
  • Property of comparison, not algorithm
  • Degrades performance
  • Doesn't do what people think (or really want)!
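The two notions can be contrasted in a few lines. This sketch uses the JDK's java.text.Collator at PRIMARY strength (so "a" and "A" compare equal) together with List.sort, which the JDK documents as a stable sort. The semi-stable variant breaks ties with a binary comparison; that tie-break is one plausible reading of the slide, chosen for the illustration, and the class and method names are made up.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class.
final class StabilityDemo {
    private static Collator primaryCollator() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY); // case and accents are ignored
        return collator;
    }

    /** Stable sort: records that compare equal keep their input order. */
    static List<String> stableSort(List<String> input) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort(primaryCollator()); // List.sort is stable
        return sorted;
    }

    /** Semi-stable comparison: ties are broken by code unit order, so input order no longer matters. */
    static List<String> semiStableSort(List<String> input) {
        Collator collator = primaryCollator();
        List<String> sorted = new ArrayList<>(input);
        sorted.sort((a, b) -> {
            int c = collator.compare(a, b);
            return c != 0 ? c : a.compareTo(b);
        });
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(stableSort(List.of("b", "A", "a", "B")));     // [A, a, b, B]
        System.out.println(stableSort(List.of("b", "a", "A", "B")));     // [a, A, b, B]
        System.out.println(semiStableSort(List.of("b", "A", "a", "B"))); // [A, a, B, b]
        System.out.println(semiStableSort(List.of("b", "a", "A", "B"))); // [A, a, B, b]
    }
}
```

The stable sort's output depends on the order the records arrived in; the semi-stable comparison gives one fixed order, at the cost of extra comparison work on every tie.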

24
ICU (International Components for Unicode)
  • Open-source: C, C++, Java, JNI
  • Charset Conversions, Locales, Resources,
    Collation, Calendars, Time zones (daylight),
    Transliteration, Normalization, Boundaries
    (grapheme, word, line, sentence), Format/Parse
    (numbers, currencies, dates, times, messages)
  • Cross-platform: Windows, Unix, 390, ...
  • Architecture: Java
  • http://oss.software.ibm.com/icu/

25
ICU/Java Collation Architecture
  • L1-3, contractions, expansions,
  • Locale tailorings
  • Fully rule-based specification
  • Arbitrary runtime user customizations
  • ? question mark
  • dollar sign
  • z < george

26
ICU 1.8.1 Collation Revision
  • full UCA compliance
  • full supplementary character support
  • much better performance
  • much smaller sort keys
  • smaller memory footprint
  • smaller disk footprint
  • additional parametric control
  • additional tailoring control

27
Coding Style for Performance
  • Avoided unnecessary function calls.
  • Example: strlen too expensive!
  • Avoided use of objects
  • Rewrote core code in C
  • C++ API wraps the C core code.
  • Fast-pathed common cases
  • Used stack memory buffers
  • (with expansion if necessary)
  • Made inner loops as tight as possible

28
Fractional UCA
  • Fractional weights for compression
  • Gaps for tailoring, future UCA additions
  • Only stores differences in tailoring file
  • Reduces memory footprint

29
Flat File I
  • Flat-file (memory mapped)
  • speeds initialization
  • reduces memory footprint
  • (next slide)

30
Flat-File II
  • Old: separate allocations
  • New: offsets within mem-map

31
Delta Tailoring II

32
Processing Overview
  • Checks for identical prefixes
  • Tolerant of most unnormalized text
  • invokes normalization rarely
  • Uses exceptional values
  • Compresses sort keys
  • Incremental length/normalization

33
Identical Prefixes
  • Sorting / Searching Databases
  • Many comparisons to close strings
  • Check initial prefixes with binary compare
  • Drop into collation loop at first difference
  • Complication

34
Initial Prefix Complication
  • Need to backup if in bad position
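The identical-prefix shortcut, including the back-up step, can be sketched as follows. This is a simplified illustration built on the JDK's java.text.Collator, not the ICU 1.8 code: a "bad position" is taken here to mean only a combining mark or the second half of a surrogate pair, whereas a real implementation also has to back up over characters that can end a contraction. The class and method names are assumptions.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; a simplified version of the prefix optimization.
final class PrefixSkipDemo {
    /** True if position i in s would split a character sequence that must stay together. */
    static boolean isBadPosition(String s, int i) {
        if (i >= s.length()) {
            return false;
        }
        char ch = s.charAt(i);
        int type = Character.getType(ch);
        return Character.isLowSurrogate(ch)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Skips the binary-identical prefix, backs up if needed, then collates the rest. */
    static int compare(Collator collator, String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++; // cheap binary compare of the common prefix
        }
        while (i > 0 && (isBadPosition(a, i) || isBadPosition(b, i))) {
            i--; // back up to a safe boundary
        }
        return collator.compare(a.substring(i), b.substring(i));
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // Prefix "data" is skipped; only "base" and "store" go through the collator.
        System.out.println(Integer.signum(compare(collator, "database", "datastore")));
        // The common prefix ends before a combining mark, so the sketch backs up one character.
        System.out.println(Integer.signum(compare(collator, "cote\u0301", "cote\u0302")));
    }
}
```

The substring calls are there for clarity; an implementation tuned for speed would pass offsets instead of allocating new strings.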

35
Fast C or D (FCD)
  • Accepts all NFD, most NFC, without normalization

36
Exceptional Values
  • Normal weight storage
  • Special Weight Storage
  • NOT_FOUND, EXPANSION, CONTRACTION, THAI,

37
Sort Key Compression
  • Common weights are 1-byte
  • Primary, secondary, tertiary, quaternary
  • Sequences are compressed
  • UTF-16 Values for Märk Davis (22 bytes)
  • 004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073
    0000
  • Sort Key (L3, ignorable punctuation - 19 bytes)
  • 2F 17 39 2B 1D 17 41 27 3B 0177 96 0A 018F 80
    8F 07 00

38
ICU 1.8 vs. Windows, glibc
  • Full UCA
  • Warning: performance comparisons are approximate
  • Depends on data, parameters, features
  • glibc: UTF-8 locales
  • String comparison is comparable
  • 20% worse to 400% better
  • Sort keys are shorter
  • half as long

39
More Information
  • ICU
  • http://oss.software.ibm.com/icu/
  • Design Document
  • http://oss.software.ibm.com/cvs/icu/icuhtml/design/collation/
  • These Slides
  • http://www.macchiato.com
  • Q & A

40
Backup Slides

41
WARNING 5: Math. Relation
  • S = Unicode strings
  • Reflexive
  • ∀a ∈ S: a ≤ a
  • Antisymmetric
  • ∀a, b ∈ S: a ≤ b ∧ b ≤ a ⇒ a = b
  • Transitive
  • ∀a, b, c ∈ S: a ≤ b ∧ b ≤ c ⇒ a ≤ c
  • Total
  • ∀a, b ∈ S: a ≤ b ∨ b ≤ a