Unicode Transforms in ICU

Only greedy quantifiers. No backup: so no (X | Y) No 'input-side back references' ... http://www.macchiato.com. Dublin, Ireland 11/6/09. 21st International ... – PowerPoint PPT presentation

Slides: 33
Provided by: mark738
Learn more at: https://icu-project.org

1
Unicode Transforms in ICU
  • Mark Davis, Chief SW Globalization Architect
  • IBM

2
What is ICU?
  • The Premier Unicode-Enablement Library
  • Open-Source non-viral license
  • Full-featured, cross-platform
  • C, C++, Java APIs
  • Collation, Charset Conversion, Resources,
    Boundaries, Calendars, Transforms (case, norm.,
    translit., …), Format/Parse (dates, times, msgs,
    nums., curr., …), Unicode strings/props
  • Unicode Conformant
  • http://oss.software.ibm.com/icu/

3
ICU Transforms
  • Powerful, flexible mechanism
  • Uppercase, Lowercase, Titlecase, Full/Halfwidth
  • Normalization
  • Hex, Character Names
  • Script to Script conversion
  • Supports Styled Text, not just Plain Text
  • Chaining, Filters, Buffering
  • Customizable

4
Transform Examples
  • Any-Uppercase
  • a → A
  • Any-Hex/Java
  • a → \u0061
  • Greek-Latin
  • α → a

5
Filters
  • [aeiou] Latin-Greek
  • Latin is the source
  • [aeiou] is a filter; it restricts the application
    to only English vowels.
  • Greek is the target
  • [^\u0000-\u007E] Any-Hex
  • A δ is… → A \u03B4 is\u2026
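
The effect of a filter in front of a hex transform can be imitated in plain Python. This is only a sketch, not ICU code: it assumes the filter selects everything outside U+0000..U+007E and that the output uses the Java-style \uXXXX escape; the function name hex_escape_non_ascii is made up for illustration.

```python
def hex_escape_non_ascii(text: str) -> str:
    """Escape characters above U+007E as Java-style hex; leave the rest alone."""
    out = []
    for ch in text:
        if ord(ch) <= 0x7E:
            # Outside the (assumed) filter: pass through untouched.
            out.append(ch)
        else:
            # Inside the filter: emit one escape per UTF-16 code unit.
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                code_unit = int.from_bytes(units[i:i + 2], "big")
                out.append(f"\\u{code_unit:04X}")
    return "".join(out)


print(hex_escape_non_ascii("A \u03b4 is\u2026"))
```

Characters in the ASCII range, such as the letters and spaces in the sample, are never offered to the transform at all, which is the point of the filter.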

6
UnicodeSet Filters
  • Ranges: [ABC a-z]
  • Union: [[:Lu:][:P:]]
  • Intersection: [[:Lu:]&[\u0000-\u01FF]]
  • Set Difference: [[:Lu:]-[\u0000-\u01FF]]
  • Complement: [^aeiou]
  • Properties
  • Uppercase letters: [:Lu:]
  • Punctuation: [:P:]
  • Script: [:Greek:]
  • Other Unicode properties in ICU 2.2
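
The set operations listed above can be mimicked with predicates over characters. The sketch below uses Python's unicodedata module rather than ICU's UnicodeSet; the helper names are illustrative assumptions, and the range U+0000..U+01FF is taken from the slide.

```python
import unicodedata


def is_lu(ch: str) -> bool:
    """Uppercase letters (general category Lu)."""
    return unicodedata.category(ch) == "Lu"


def is_p(ch: str) -> bool:
    """Punctuation (any general category starting with P)."""
    return unicodedata.category(ch).startswith("P")


def in_low_range(ch: str) -> bool:
    """Code points U+0000..U+01FF."""
    return ord(ch) <= 0x01FF


def keep(text: str, member) -> str:
    """Return only the characters of text that belong to the set."""
    return "".join(ch for ch in text if member(ch))


sample = "Aa!\u03a9"  # A, a, !, GREEK CAPITAL LETTER OMEGA

print(keep(sample, lambda c: is_lu(c) or is_p(c)))                # union
print(keep(sample, lambda c: is_lu(c) and in_low_range(c)))       # intersection
print(keep(sample, lambda c: is_lu(c) and not in_low_range(c)))   # set difference
print(keep("education", lambda c: c not in "aeiou"))              # complement
```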

7
Example Filter
  • [:Lu:] Latin-Katakana; Latin-Hiragana
  • Converts all uppercase Latin characters to
    Katakana,
  • Then converts all other Latin characters to
    Hiragana.

8
Chaining Transforms
  • Hiragana-Latin; Any-Title
  • たけだ, まさゆき
  • takeda, masayuki
  • Takeda, Masayuki
  • Any number of transforms in chain
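
Chaining is function composition: the output of each transform feeds the next. The following sketch is plain Python, not ICU. The tiny kana table, the hiragana spelling of the sample name, and the helper names are assumptions made only to reproduce the takeda → Takeda example.

```python
from functools import reduce

# Hypothetical, deliberately tiny romanization table (just enough for the sample).
KANA = {"た": "ta", "け": "ke", "だ": "da", "ま": "ma", "さ": "sa", "ゆ": "yu", "き": "ki"}


def hiragana_to_latin(text: str) -> str:
    """Stand-in for Hiragana-Latin: map each known kana, pass everything else through."""
    return "".join(KANA.get(ch, ch) for ch in text)


def any_title(text: str) -> str:
    """Stand-in for Any-Title."""
    return text.title()


def chain(*steps):
    """Compose any number of transforms, applied left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


pipeline = chain(hiragana_to_latin, any_title)
print(pipeline("たけだ, まさゆき"))
```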

9
Filtering plus Chaining
  • NFD; [:M:] Remove; NFC
  • Decompose
  • Remove accents (Marks)
  • Recompose

10
Script ↔ Script Examples
  • 김, 국삼 → Gim, Gugsam
  • 김, 명희 → Gim, Myeonghyi
  • 정, 병호 → Jeong, Byeongho
  • たけだ, まさゆき → Takeda, Masayuki
  • ますだ, よしひこ → Masuda, Yoshihiko
  • やまもと, のぼる → Yamamoto, Noboru
  • Ρούτση, Άννα → Roútse, Ánna
  • Καλούδης, Χρήστος → Kaloúdes, Chrêstos
  • Θεοδωράτου, Ελένη → Theodorátou, Eléne
11
Script ↔ Script Conversions
  • General conversions: Greek-Latin
  • Source-Target Reversible: φ → ph → φ
  • Not Target-Source Reversible: f → φ → ph
  • Variants
  • By Language: Greek-German
  • By Standard: Greek-Latin/UNGEGN
  • Can build your own
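
The two reversibility claims can be checked with a toy rule set. The sketch below is plain Python with a made-up one-letter rule table; it is not the real Greek-Latin data, only enough to show why the round trip works from Greek and fails from Latin.

```python
# Hypothetical one-letter rule sets, ordered; the first matching rule wins.
GREEK_TO_LATIN = [("φ", "ph")]
LATIN_TO_GREEK = [("ph", "φ"), ("f", "φ")]


def apply_rules(text: str, rules) -> str:
    out, pos = [], 0
    while pos < len(text):
        for source, target in rules:
            if text.startswith(source, pos):
                out.append(target)
                pos += len(source)
                break
        else:
            # No rule matched here: copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


# Source-target round trip: Greek -> Latin -> Greek returns the original.
print(apply_rules(apply_rules("φ", GREEK_TO_LATIN), LATIN_TO_GREEK))
# Target-source round trip: Latin -> Greek -> Latin does not.
print(apply_rules(apply_rules("f", LATIN_TO_GREEK), GREEK_TO_LATIN))
```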

12
Styled Text
  • Preserves individual styles on letters, where
    possible
  • απα → apa

13
When Buffering
  • Conversions are not performed if they may extend
    over boundaries

Key   Result
a     α
p     αp
a     απα
p     απαp
h     απαφ

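
The keystroke table can be reproduced with a small simulation. This is a sketch in plain Python under stated assumptions: a three-rule Latin-to-Greek table (ph, a, p) chosen to fit the example, and a "clipped" test meaning that the pending text is a proper prefix of some rule's source. None of the names come from ICU.

```python
# Hypothetical rule table; order matters, the first matching rule wins.
RULES = [("ph", "φ"), ("a", "α"), ("p", "π")]


def is_clipped(pending: str) -> bool:
    """True if more input could still turn the pending text into a longer match."""
    return any(src.startswith(pending) and len(src) > len(pending) for src, _ in RULES)


def type_keys(keys: str) -> list:
    """Return what the user sees after each keystroke."""
    committed, pending, screens = "", "", []
    for key in keys:
        pending += key
        # Convert as much as possible, but hold back a possible partial match.
        while pending and not is_clipped(pending):
            for src, tgt in RULES:
                if pending.startswith(src):
                    committed += tgt
                    pending = pending[len(src):]
                    break
            else:
                committed += pending[0]
                pending = pending[1:]
        screens.append(committed + pending)
    return screens


print(type_keys("apaph"))
```

The p stays unconverted until the next key shows whether it begins ph.
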
14
Custom Rules
  • Similar to Regular Expressions
  • Variables
  • Property matches
  • Contextual matches
  • Rearrangement
  • $1, $2
  • Quantifiers
  • *, +, ?

15
Differences from Regular Expressions
  • More Powerful
  • Buffered/Keyboard
  • Styled Text
  • Ordered Rules
  • Cursor Backup
  • Less Powerful
  • Only greedy quantifiers
  • No backup: so no (X | Y)
  • No input-side back references

16
Example of Custom Rules
  • UnixQuotes-RealQuotes
  • \`\` > “ ;  # two graves → left-quote
  • \'\' > ” ;  # two generics → right-quote
  • Example (SJ Mercury News online)
  • expertise'' → expertise”
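
A rough Python equivalent of these two rules is shown below. It is a sketch rather than ICU rule syntax, and it assumes that two graves become the opening quote (U+201C) and two generic apostrophes become the closing quote (U+201D).

```python
def unix_to_real_quotes(text: str) -> str:
    # Two graves -> left double quotation mark.
    text = text.replace("``", "\u201C")
    # Two generic apostrophes -> right double quotation mark.
    return text.replace("''", "\u201D")


print(unix_to_real_quotes("expertise''"))
```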

17
Rule Ordering
  • Find first rule that matches at start
  • If no match, or (isBuffered && clipped-match)
  • advance start by 1
  • Else if match,
  • Substitute text
  • Move start as specified
  • Continue until start reaches limit
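
The loop above can be sketched in a few lines of Python and compared with applying the same substitutions as sequential global regex replacements. This is a simplified model, not ICU's implementation: it ignores buffering and always moves start past the replacement. The rules xy > c and yx > d are sample rules chosen for illustration.

```python
import re

RULES = [("xy", "c"), ("yx", "d")]


def transliterate(text: str, rules) -> str:
    """At each position, apply the first rule that matches, then move on."""
    out, start = [], 0
    while start < len(text):
        for source, target in rules:
            if text.startswith(source, start):
                out.append(target)
                start += len(source)
                break
        else:
            # No rule matches at start: copy one character and advance by 1.
            out.append(text[start])
            start += 1
    return "".join(out)


def sequential_regex(text: str, rules) -> str:
    """Run each substitution over the whole string, one rule after another."""
    for source, target in rules:
        text = re.sub(re.escape(source), target, text)
    return text


print(transliterate("xyx-yxy-xyx", RULES))     # ordered rules
print(sequential_regex("xyx-yxy-xyx", RULES))  # s/xy/c/g then s/yx/d/g
```

The results differ in the middle segment: the regex pass for xy consumes the x that the yx rule would have needed, while the ordered rules try every rule at each position.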

18
Rule Ordering Example
          Reg Exp.       Translit.
Rules     s/xy/c/g       xy > c
          s/yx/d/g       yx > d
Input     xyx-yxy-xyx    xyx-yxy-xyx
Output    cx-yc-cx       cx-dy-cx

19
Context
  • Rules
  • γ } [Γ Κ Ξ Χ γ κ ξ χ] > n
  • γ > g
  • Meaning
  • Convert gamma into n
  • IF followed by Γ, Κ, Ξ, Χ, γ, κ, ξ, or χ
  • Otherwise into g
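
A trailing-context rule behaves much like a regular-expression lookahead. The sketch below is plain Python, not ICU rule syntax, and it assumes the context set is the eight Greek letters gamma, kappa, xi, and chi in both cases.

```python
import re

# Assumed context set: capital and small gamma, kappa, xi, chi.
FOLLOWERS = "ΓΚΞΧγκξχ"


def gamma_to_latin(text: str) -> str:
    # Rule 1: gamma followed by one of the context letters becomes n.
    # The lookahead checks the context without consuming it.
    text = re.sub(f"γ(?=[{FOLLOWERS}])", "n", text)
    # Rule 2: any remaining gamma becomes g.
    return text.replace("γ", "g")


print(gamma_to_latin("γγ"))
print(gamma_to_latin("γα"))
```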

20
Cursor Backup
  • Allows text to be revisited
  • Reduces rule-count
  • Example Rules
  • BY > ビ | Y
  • YO > ョ

BYO → (rule 1) ビ|YO → (rule 2) ビョ

21
Demonstration
  • Public Demo
  • http://oss.software.ibm.com/icu/demo
  • (local copy, samples)

22
More Information
  • http://oss.software.ibm.com/
  • User Guide: /icu/userguide/
  • C: /icu/apiref/utrans_h.html
  • C++: /icu/apiref/
  • Java API: /icu4j/doc/com/ibm/text/
  • Latest Version of these slides
  • http://www.macchiato.com

23
ICU Transforms
  • Powerful, flexible mechanism
  • Uppercase, Lowercase, Titlecase, Full/Halfwidth
  • Normalization
  • Hex, Character Names
  • Script to Script conversion
  • Supports Styled Text, not just plaintext
  • Chaining Filters
  • Customizable

24
Q & A

25
Backup Slides
  • Not used in the presentation, except in response
    to questions

26
Buffered Usage
  • No conversion for clipped match

  • Fill buffer
  • Transliterate
  • May have left-overs
  • Copy left-overs to start
  • Fill rest of buffer
  • Transliterate

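
The fill, transliterate, and carry-over cycle can be sketched as follows. This is plain Python, not the ICU API; the two rules (th and t) and the function names are assumptions for illustration. A match that might continue past the end of the buffer is left unconverted and copied to the start of the next buffer.

```python
# Hypothetical rules; the first matching rule wins.
RULES = [("th", "θ"), ("t", "τ")]


def convert(buffer: str, final: bool):
    """Convert a buffer; return (converted text, unconverted left-overs)."""
    out, pos = [], 0
    while pos < len(buffer):
        rest = buffer[pos:]
        clipped = any(s.startswith(rest) and len(s) > len(rest) for s, _ in RULES)
        if clipped and not final:
            break  # a longer match may arrive with the next chunk
        for source, target in RULES:
            if rest.startswith(source):
                out.append(target)
                pos += len(source)
                break
        else:
            out.append(buffer[pos])
            pos += 1
    return "".join(out), buffer[pos:]


def transliterate_chunks(chunks) -> str:
    leftover, pieces = "", []
    for chunk in chunks:
        # Copy left-overs to the start, then fill the rest of the buffer.
        converted, leftover = convert(leftover + chunk, final=False)
        pieces.append(converted)
    # End of input: nothing more can arrive, so convert what is left.
    converted, _ = convert(leftover, final=True)
    pieces.append(converted)
    return "".join(pieces)


print(transliterate_chunks(["xt", "th", "t"]))
```

The chunked result matches converting the whole string in one pass, which is the property the buffering scheme is meant to preserve.
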
27
Styled Text Handling
  • Transforms operate on Replaceable, an
    interface/abstract class defined by ICU
  • In ICU4c, UnicodeString is a Replaceable subclass
    (with no out-of-band data -- no styles)
  • ICU4j defines ReplaceableString, a Replaceable
    subclass, also with no styles
  • Clients must define their own Replaceable
    subclass that implements their styled text.
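
As a rough illustration of what such a client class has to do, here is a Python sketch of styled text with a replace operation. It is not ICU's Replaceable interface: the class name, the method signature, and the policy that replacement text inherits the style of the first replaced character are all assumptions.

```python
class StyledText:
    """Text with one style label per character (hypothetical client class)."""

    def __init__(self, text, styles):
        assert len(text) == len(styles)
        self.chars = list(text)
        self.styles = list(styles)

    def replace(self, start, limit, new_text):
        """Replace chars[start:limit]; the new text inherits a nearby style."""
        if start < limit:
            style = self.styles[start]        # style of first replaced character
        elif start > 0:
            style = self.styles[start - 1]    # pure insertion: copy the neighbor
        else:
            style = None
        self.chars[start:limit] = list(new_text)
        self.styles[start:limit] = [style] * len(new_text)

    def __str__(self):
        return "".join(self.chars)


text = StyledText("απα", ["bold", "plain", "italic"])
for index, latin in enumerate("apa"):
    text.replace(index, index + 1, latin)
print(str(text), text.styles)
```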

28
Transliteration Sources
  • Søren Binks
  • http://homepage.mac.com/sirbinks/translit.html
  • UNGEGN
  • http://www.eki.ee/wgrs/

29
API Information
  • Like other ICU APIs, can get each of the
    available Transform IDs
  • count = Transliterator::countAvailableIDs()
  • myID = Transliterator::getAvailableID(n)
  • And get a localizable name for each
  • Transliterator::getDisplayName(myID, france,
    nameForUser)
  • Note: these are C++ APIs; C and Java are also
    available.

30
API Creation
  • Use an ID to create
  • myTrans = Transliterator::createInstance("Latin-Greek")

31
API Simple usage
  • Convert entire string
  • myTrans.transliterate(myString)

32
More Control
  • Specify Context
  • Use with Styled Text

[Diagram: the string abcdefghijklmnopqrstuvwxyz annotated with the positions contextStart, contextLimit, start, and limit]