Unicode Transforms in ICU

San Jose, California 10/7/09
Learn more at: https://icu-project.org

1
Unicode Transforms in ICU
  • Mark Davis, Chief SW Globalization Architect
  • IBM

2
What is ICU?
  • Internationalization libraries for C, C++, Java
  • Open source, non-viral
  • Sponsored by IBM
  • Sun's Java licenses an earlier ICU version; ICU4J
    updates it.
  • Unicode standard compliant
  • full supplementary support
  • Cross-platform; extensible and customizable
  • High performance and thread-safe
  • Multiple locales in same thread simultaneously
  • http://oss.software.ibm.com/icu/

3
ICU Features
  • Unicode text handling
  • Character set conversions (700)
  • Collation & Searching
  • Locales (170)
  • Resource Bundles
  • Calendar & Time zones
  • Complex-text layout engine
  • Breaks: character, word, line, sentence
  • Formatting
      • Date & time
      • Messages
      • Numbers & currencies
  • Transforms
      • Normalization
      • Casing
      • Transliterations

4
ICU Transforms
  • Powerful, flexible mechanism
  • Uppercase, Lowercase, Titlecase, Full/Halfwidth
  • Normalization
  • Hex, Character Names
  • Script to Script conversion
  • Supports Styled Text, not just Plain Text
  • Chaining, Filters, Buffering
  • Customizable

5
Transform Examples
  • Any-Uppercase
      • a → A
  • Any-Hex/Java
      • a → \u0061
  • Greek-Latin
      • α → a

6
Filters
  • [aeiou] Latin-Greek
      • Latin is the source
      • [aeiou] is a filter; it restricts the application
        to only English vowels. Uses UnicodeSet.
      • Greek is the target
  • [^\u0000-\u007E] Any-Hex
      • A δ is… → A \u03B4 is\u2026
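
The effect of the second filter can be imitated in a few lines. This is only a sketch in plain Python (standard library, not the ICU API): the function name escape_outside_ascii is made up for illustration, and it assumes characters in the Basic Multilingual Plane.

```python
def escape_outside_ascii(text):
    """Hex-escape every character above U+007E; leave the rest untouched."""
    pieces = []
    for ch in text:
        if ord(ch) <= 0x7E:
            # Excluded by the filter: passes through unchanged.
            pieces.append(ch)
        else:
            # Selected by the filter: replaced by a Java-style escape.
            pieces.append(f"\\u{ord(ch):04X}")
    return "".join(pieces)


# Input is "A", Greek small delta, "is", horizontal ellipsis.
print(escape_outside_ascii("A \u03b4 is\u2026"))  # prints: A \u03B4 is\u2026
```

The filter only decides which characters the transform may touch; the transform itself is unchanged.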

7
UnicodeSet
  • Ranges: [ABC a-z]
  • Union: [[:Lu:][:P:]]
  • Intersection: [[:Lu:]&[\u0000-\u01FF]]
  • Set Difference: [[:Lu:]-[\u0000-\u01FF]]
  • Complement: [^aeiou]
  • Properties
      • Uppercase letters: [:Lu:]
      • Punctuation: [:P:]
      • Script: [:Greek:]
      • ICU 2.2: all enumerated Unicode 3.2 properties
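
The set operations listed above behave like ordinary set algebra over characters. As a rough illustration (plain Python sets built from the standard unicodedata module, not ICU's UnicodeSet class, and limited to the Basic Multilingual Plane as a simplifying assumption):

```python
import unicodedata

# Assumption for the sketch: only look at the Basic Multilingual Plane.
BMP = [chr(cp) for cp in range(0x0000, 0x10000)]


def with_category(prefix):
    """Characters whose General_Category starts with the given prefix."""
    return {ch for ch in BMP if unicodedata.category(ch).startswith(prefix)}


uppercase = with_category("Lu")           # uppercase letters
punctuation = with_category("P")          # all punctuation categories
low_range = {chr(cp) for cp in range(0x0000, 0x0200)}

union = uppercase | punctuation           # uppercase or punctuation
intersection = uppercase & low_range      # uppercase within U+0000..U+01FF
difference = uppercase - low_range        # uppercase outside that range
not_vowels = set(BMP) - set("aeiou")      # complement, relative to the BMP

print("A" in intersection, "\u03a9" in difference, "a" in not_vowels)
```

A real UnicodeSet stores ranges rather than individual characters, so it stays compact even for large properties.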

8
UnicodeSet Property Syntax
  • Either POSIX or Perl Style
      • \p{letter}
      • [:letter:]
  • Short or long form (UCD Property Aliases)
      • \p{general_category=uppercase_letter}
      • \p{gc=Lu}
  • Case-, Space-, Underbar-Insensitive

9
Example Filter
  • [:Lu:] Latin-Katakana; Latin-Hiragana
  • Converts all uppercase Latin characters to
    Katakana,
  • Then converts all other Latin characters to
    Hiragana.

10
Chaining Transforms
  • Kana-Latin; Any-Title
  • ???, ????
  • takeda, masayuki
  • Takeda, Masayuki
  • Any number of transforms can be chained
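
Chaining amounts to function composition: the output of one transform is the input of the next. The sketch below is plain Python, not ICU. The kana table is a deliberately tiny, hypothetical stand-in for a real Kana-Latin transform, and the hiragana input is an assumption, since the original Japanese text did not survive in this transcript.

```python
from functools import reduce

# Hypothetical mini-table; a real Kana-Latin transform covers far more.
KANA_TO_LATIN = {
    "た": "ta", "け": "ke", "だ": "da",
    "ま": "ma", "さ": "sa", "ゆ": "yu", "き": "ki",
}


def kana_to_latin(text):
    """Replace each known kana with its romanization; copy everything else."""
    return "".join(KANA_TO_LATIN.get(ch, ch) for ch in text)


def any_title(text):
    """Titlecase each word."""
    return text.title()


def chain(*transforms):
    """Build one transform that applies the given transforms left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), transforms, text)


kana_latin_title = chain(kana_to_latin, any_title)
print(kana_latin_title("たけだ, まさゆき"))  # prints: Takeda, Masayuki
```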

11
Filtering plus Chaining
  • NFD; [:Mark:] Remove; NFC
      • Decompose
      • Remove accents (Marks)
      • Recompose
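
The same three steps can be tried out with Python's standard unicodedata module. This is a sketch of the idea rather than the ICU implementation; the function name remove_accents is made up for illustration.

```python
import unicodedata


def remove_accents(text):
    # Step 1: decompose, so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every character whose general category is a Mark (Mn, Mc, Me).
    without_marks = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # Step 3: recompose whatever is left.
    return unicodedata.normalize("NFC", without_marks)


print(remove_accents("Ro\u00fatse, \u00c1nna"))  # prints: Routse, Anna
```

The filter matters here: without restricting Remove to marks, the middle step would delete the entire string.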

12
Built-in Transforms
  • Normalization
      • Å → Å
  • Casing
      • a → A
  • Full ↔ Halfwidth
      • ? ? ?
  • Character Names
      • a → LATIN SMALL LETTER A
  • Hex: XML, Java, C, Perl, … styles
      • a → \u0061, U+0061, …
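
Several of these built-in conversions have rough counterparts in Python's standard library, which makes them easy to experiment with. The snippet below is only an approximation for illustration: in particular, NFKC does more than width conversion, so it is a stand-in for a Fullwidth-Halfwidth transform rather than an equivalent.

```python
import unicodedata

# Normalization: "A" + COMBINING RING ABOVE composes to the single code point U+00C5.
decomposed = "A\u030a"
composed = unicodedata.normalize("NFC", decomposed)

# Casing.
upper = "a".upper()

# Width: fullwidth "Abc" (U+FF21 U+FF42 U+FF43) folded to ordinary letters.
# Assumption: NFKC is used as an approximation; it also folds other compatibility forms.
halfwidth = unicodedata.normalize("NFKC", "\uff21\uff42\uff43")

# Character names.
name = unicodedata.name("a")

print(len(decomposed), len(composed))  # prints: 2 1
print(upper, halfwidth, name)          # prints: A Abc LATIN SMALL LETTER A
```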

13
Script ↔ Script Conversions
  • General conversions, e.g. Greek-Latin
      • Source-Target Reversible: φ → ph → φ
      • Not Target-Source Reversible: f → φ → ph
  • Variants
      • By Language: Greek-German
      • By Standard: Greek-Latin/UNGEGN
  • Can build your own

14
Any-Latin Example
  • ?, ?? → Gim, Gugsam
  • ?, ?? → Gim, Myeonghyi
  • ?, ?? → Jeong, Byeongho
  • ???, ???? → Takeda, Masayuki
  • ???, ???? → Masuda, Yoshihiko
  • ????, ??? → Yamamoto, Noboru
  • ???ts?, ???a → Roútse, Ánna
  • ?a???d??, ???st?? → Kaloúdes, Chr?stos
  • Te?d???t??, ????? → Theodorátou, Eléne

15
Styled Text
  • Preserves individual styles on letters, where
    possible
  • apa → apa

16
When Buffering
  • Conversions are not performed if they may extend
    over boundaries

Key   Result
a     a
p     ap
a     apa
p     apap
h     apaf
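
The keystroke table can be reproduced with a small simulation. This is a sketch in plain Python, not ICU code: it assumes a single hypothetical rule, ph > f, and the names RULES, is_clipped and type_keys are invented for the example. A character is left unconverted while it could still turn out to be the beginning of a longer match.

```python
RULES = {"ph": "f"}  # hypothetical rule set for the sketch


def is_clipped(text, pos):
    """True if the text from pos to the end is a proper prefix of some rule source."""
    rest = text[pos:]
    return any(src.startswith(rest) and len(rest) < len(src) for src in RULES)


def type_keys(keys):
    """Return what the user sees after each keystroke."""
    text = ""
    start = 0  # everything before start is already final
    shown = []
    for key in keys:
        text += key
        while start < len(text):
            if is_clipped(text, start):
                break  # might extend over the boundary: wait for more input
            for src, dst in RULES.items():
                if text.startswith(src, start):
                    text = text[:start] + dst + text[start + len(src):]
                    start += len(dst)
                    break
            else:
                start += 1  # no rule applies here: move on
        shown.append(text)
    return shown


print(type_keys("apaph"))  # prints: ['a', 'ap', 'apa', 'apap', 'apaf']
```
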
17
Custom Rules
  • Similar to Regular Expressions
  • Variables
  • Property matches
  • Contextual matches
  • Rearrangement: $1, $2
  • Quantifiers: *, +, ?

18
Differences from Reg. Exps.
  • More Powerful
      • Buffered/Keyboard
      • Styled Text
      • Ordered Rules
      • Cursor Backup
  • Less Powerful
      • Only greedy quantifiers
      • No backup, so no (X | Y)
      • No input-side back references

19
Example of Custom Rules
  • UnixQuotes-RealQuotes
      • \`\` > “ (two graves → left-quote)
      • \'\' > ” (two generics → right-quote)
  • Example (SJ Mercury News online)
      • expertise'' → expertise”

20
Rule Ordering
  • Find first rule that matches at start
  • If no match, or (isBuffered clipped-Match)
      • advance start by 1
  • Else if match,
      • Substitute text
      • Move start as specified
  • Continue until start reaches limit

21
Rule Ordering Example
          Reg Exp.      Translit.
Rules     s/xy/c/g      xy > c
          s/yx/d/g      yx > d
Input     xyx-yxy-xyx   xyx-yxy-xyx
Result    cx-yc-cx      cx-dy-cx
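
The difference between the two columns comes from how the rules are applied. A regular-expression script runs each substitution over the whole string in turn, while the rule-ordering loop tries every rule at the current position before advancing. The sketch below is plain Python, not ICU: it builds a new output string and always places the cursor after the replacement, which is a simplification of "move start as specified".

```python
import re

RULES = [("xy", "c"), ("yx", "d")]


def transliterate(text, rules):
    """At each position, apply the first rule that matches; otherwise advance by 1."""
    out = []
    start, limit = 0, len(text)
    while start < limit:
        for source, target in rules:
            if text.startswith(source, start):
                out.append(target)
                start += len(source)
                break
        else:
            out.append(text[start])  # no rule matched at this position
            start += 1
    return "".join(out)


def regex_passes(text, rules):
    """Apply each rule as a separate global substitution, one pass per rule."""
    for source, target in rules:
        text = re.sub(re.escape(source), target, text)
    return text


print(regex_passes("xyx-yxy-xyx", RULES))   # prints: cx-yc-cx
print(transliterate("xyx-yxy-xyx", RULES))  # prints: cx-dy-cx
```

In the middle group the first pass of the regular-expression version consumes the "xy" inside "yxy", so the second pattern never gets a chance; the position-by-position loop sees "yx" first.
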
22
Context
  • Rules
      • γ } [Γ Κ Ξ Χ γ κ ξ χ] > n
      • γ > g
  • Meaning
      • Convert gamma into n
          • IF followed by Γ, Κ, Ξ, Χ, γ, κ, ξ, or χ
      • Otherwise into g
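
A trailing context behaves like a regular-expression lookahead: the following character is examined but not consumed. The sketch below uses Python's re module rather than ICU rule syntax. The context set (capital and small gamma, kappa, xi, chi) is an assumption, since the Greek letters are not legible in this transcript, and the function name gamma_to_latin is made up.

```python
import re

# Assumed context set: the letters before which gamma is read as "n".
FOLLOWERS = "ΓΚΞΧγκξχ"


def gamma_to_latin(text):
    # Rule 1: gamma followed by one of the context letters becomes n.
    # The lookahead checks the next letter without consuming it.
    text = re.sub(f"γ(?=[{FOLLOWERS}])", "n", text)
    # Rule 2: any remaining gamma becomes g.
    return text.replace("γ", "g")


print(gamma_to_latin("γγ"))  # prints: ng
```

Running the more specific rule first gives the same result as an ordered rule list here, because its output contains no gamma for the second rule to pick up.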

23
Cursor Backup
  • Allows text to be revisited
  • Reduces rule-count
  • Example Rules
      • BY > ? | Y
      • YO > ?
  • Example: BYO
      • Step 1: ?YO
      • Step 2: ??

24
Demonstration
  • Public Demo
  • http://oss.software.ibm.com/icu/demo
  • (local copy, samples)

25
More Information
  • http://oss.software.ibm.com/
      • User Guide: /icu/userguide/
      • C: /icu/apiref/utrans_h.html
      • C++: /icu/apiref/
      • Java API: /icu4j/doc/com/ibm/text/
  • Latest Version of these slides
      • http://www.macchiato.com

26
ICU Transforms
  • Powerful, flexible mechanism
  • Uppercase, Lowercase, Titlecase, Full/Halfwidth
  • Normalization
  • Hex, Character Names
  • Script to Script conversion
  • Supports Styled Text, not just Plain Text
  • Chaining, Filters, Buffering
  • Customizable

27
Q & A

28
Backup Slides
  • Not used in the presentation, except in response
    to questions

29
Buffered Usage
  • No conversion for clipped match
  • Fill buffer
  • Transliterate
  • May have left-overs
  • Copy left-overs to start
  • Fill rest of buffer
  • Transliterate
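
The fill, transliterate and copy-left-overs cycle can be sketched as follows. This is plain Python, not ICU code. The two rules (th > θ and t > τ) are hypothetical, the function names are invented, and the sketch assumes the buffer is larger than the longest rule source so that left-overs never fill it completely.

```python
RULES = [("th", "θ"), ("t", "τ")]  # hypothetical rules, longest source first


def transliterate_buffer(buf, more_input_coming):
    """Convert as much of buf as is safe; return (converted, left_over)."""
    out, pos = [], 0
    while pos < len(buf):
        rest = buf[pos:]
        clipped = any(
            src.startswith(rest) and len(src) > len(rest) for src, _ in RULES
        )
        if more_input_coming and clipped:
            break  # a match could continue in the next chunk: keep as left-over
        for src, dst in RULES:
            if buf.startswith(src, pos):
                out.append(dst)
                pos += len(src)
                break
        else:
            out.append(buf[pos])
            pos += 1
    return "".join(out), buf[pos:]


def transliterate_stream(text, buffer_size=3):
    result, left_over, read = [], "", 0
    while read < len(text) or left_over:
        # Copy left-overs to the start, then fill the rest of the buffer.
        chunk = text[read:read + buffer_size - len(left_over)]
        read += len(chunk)
        done, left_over = transliterate_buffer(
            left_over + chunk, more_input_coming=read < len(text)
        )
        result.append(done)
    return "".join(result)


print(transliterate_stream("atthe"))  # prints: aτθe
```

With a buffer of three characters the first chunk is "att"; the final "t" is held back because it might start "th", and it is converted together with the next chunk.
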
30
Styled Text Handling
  • Transforms operate on Replaceable, an
    interface/abstract class defined by ICU
  • In ICU4C, UnicodeString is a Replaceable subclass
    (with no out-of-band data -- no styles)
  • ICU4J defines ReplaceableString, a Replaceable
    subclass, also with no styles
  • Clients must define their own Replaceable
    subclass that implements their styled text.
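
To make the idea concrete, here is a minimal sketch of what such a client-defined class might look like. It is written in Python for brevity and is not ICU's Replaceable interface: the class name StyledText, its method names, and the policy of giving new characters the style of the first replaced character are all assumptions made for illustration.

```python
class StyledText:
    """Hypothetical styled-text container with one style entry per character."""

    def __init__(self, text, styles):
        if len(text) != len(styles):
            raise ValueError("need exactly one style per character")
        self.chars = list(text)
        self.styles = list(styles)

    def length(self):
        return len(self.chars)

    def char_at(self, index):
        return self.chars[index]

    def replace(self, start, limit, replacement):
        """Replace chars[start:limit], keeping styles in step with the text."""
        if start < limit:
            style = self.styles[start]       # style of the first replaced character
        elif start > 0:
            style = self.styles[start - 1]   # pure insertion: copy the preceding style
        else:
            style = None
        self.chars[start:limit] = list(replacement)
        self.styles[start:limit] = [style] * len(replacement)

    def __str__(self):
        return "".join(self.chars)


styled = StyledText("apaph", ["plain", "bold", "plain", "italic", "italic"])
styled.replace(3, 5, "f")          # what a transform would do for ph > f
print(str(styled), styled.styles)  # prints: apaf ['plain', 'bold', 'plain', 'italic']
```

Because the transform only ever edits the text through replace, the container decides how styles survive each substitution.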

31
Transliteration Sources
  • Søren Binks
  • http//homepage.mac.com/sirbinks/translit.html
  • UNGEGN
  • http//www.eki.ee/wgrs/

32
API Information
  • Like other ICU APIs, can get each of the
    available Transform IDs
      • count = Transliterator::countAvailableIDs();
      • myID = Transliterator::getAvailableID(n);
  • And get a localizable name for each
      • Transliterator::getDisplayName(myID, france,
        nameForUser);
  • Note: these are C++ APIs; C and Java are also
    available.

33
API Creation
  • Use an ID to create
  • myTrans = Transliterator::createInstance("Latin-Greek")

34
API Simple usage
  • Convert entire string
  • myTrans.transliterate(myString)

35
More Control
  • Specify Context
  • Use with Styled Text

[Diagram: the text abcdefghijklmnopqrstuvwxyz with the positions contextStart, contextLimit, start, and limit marked]