Getting Started with ICU - PowerPoint PPT Presentation

Slides: 44
Provided by: vladimirw
Learn more at: https://icu-project.org
Transcript and Presenter's Notes

1
Getting Started with ICU
  • Vladimir Weinstein
  • Eric Mader
  • Steven R. Loomis

2
Agenda
  • Getting and setting up ICU4C
  • Using conversion engine
  • Using break iterator engine
  • Getting and setting up ICU4J
  • Using collation engine
  • Using message formats
  • Example analysis

3
Getting ICU4C
  • http://ibm.com/software/globalization/icu
  • Get the latest release
  • Get the binary package
  • Source download for modifying build options
  • CVS for bleeding edge: read instructions

4
Setting up ICU4C
  • Unpack binaries
  • If you need to build from source:
    • Windows:
      • MSVC .NET 2003 project,
      • Cygwin + MSVC 6,
      • just Cygwin
    • Unix: runConfigureICU
      • make install
      • make check

5
Testing ICU4C
  • Windows - run cintltst, intltest, iotest
  • Unix - make check (again)
  • See it for yourself

#include <stdio.h>
#include "unicode/utypes.h"
#include "unicode/ures.h"

int main(void) {
  UErrorCode status = U_ZERO_ERROR;
  UResourceBundle *res = ures_open(NULL, "", &status);
  if(U_SUCCESS(status)) {
    printf("everything is OK\n");
  } else {
    printf("error %s opening resource\n", u_errorName(status));
  }
  ures_close(res);
  return 0;
}
6
Conversion Engine - Opening
  • ICU4C uses open/use/close paradigm
  • Open a converter

UErrorCode status = U_ZERO_ERROR;
UConverter *cnv = ucnv_open(encoding, &status);
if(U_FAILURE(status)) {
  /* process the error situation, die gracefully */
}
  • Almost all APIs use UErrorCode for status
  • Check the error code!

7
What Converters are Available
  • ucnv_countAvailable(): get the number of
    available converters
  • ucnv_getAvailableName(): get the name of a particular
    converter
  • Many frameworks allow this kind of examination

8
Converting Text Chunk by Chunk
char buffer[DEFAULT_BUFFER_SIZE];
char *bufP = buffer;
len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE, source, sourceLen, &status);
if(U_FAILURE(status)) {
  if(status == U_BUFFER_OVERFLOW_ERROR) {
    /* len holds the required size: allocate and convert again */
    status = U_ZERO_ERROR;
    bufP = (char *)malloc((len + 1) * sizeof(char));
    len = ucnv_fromUChars(cnv, bufP, len + 1, source, sourceLen, &status);
  } else {
    /* other error, die gracefully */
  }
}
/* do interesting stuff with the converted text */
9
Converting Text Character by Character
UChar32 result;
const char *source = start;
const char *sourceLimit = start + len;
while(source < sourceLimit) {
  result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
  if(U_FAILURE(status)) {
    /* die gracefully */
  }
  /* do interesting stuff with the converted text */
}
  • Works only from code page to Unicode

10
Converting Text Piece by Piece
while((!feof(f)) && ((count=fread(inBuf, 1, BUFFER_SIZE, f)) > 0)) {
  source = inBuf;
  sourceLimit = inBuf + count;
  do {
    target = uBuf;
    targetLimit = uBuf + uBufSize;
    ucnv_toUnicode(conv, &target, targetLimit,
                   &source, sourceLimit, NULL,
                   feof(f)?TRUE:FALSE, /* pass 'flush' when eof */
                                       /* is true (when no more data will come) */
                   &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
      // simply ran out of space - we'll reset the
      // target ptr the next time through the loop.
      status = U_ZERO_ERROR;
    } else {
      // Check other errors here and act appropriately
    }
    text.append(uBuf, target-uBuf);
    count += target-uBuf;
  } while (source < sourceLimit); // while simply out of space
}
11
Clean up!
  • Whatever is opened needs to be closed
  • Converters use ucnv_close
  • Sample uses conversion to convert code page data
    from a file

12
Text Boundary Analysis
  • Process of locating linguistic boundaries while
    formatting and processing text
  • Many uses
  • Relatively straightforward for English
  • Hard for some other languages:
    • Chinese and Japanese
    • Thai
    • Hindi

13
Break Iteration - Introduction
  • Character boundaries: grapheme clusters
  • Word boundaries: word counting, double-click
    selection
  • Line break boundaries: where to break a line
  • Sentence break boundaries: sentence counting,
    triple-click selection
  • ICU class: BreakIterator

14
Break Iteration - Starting states
  • Points to a boundary between two characters
  • Index of character following the boundary
  • Use current() to get the boundary
  • Use first() to set iterator to start of text
  • Use last() to set iterator to end of text

15
Break Iteration - Navigation
  • Use next() to move to next boundary
  • Use previous() to move to previous boundary
  • Both return DONE if the iterator can't move any further

16
Break Iteration - Checking a position
  • Use isBoundary() to see if a position is a boundary
  • Use preceding() to find the last boundary before a position
  • Use following() to find the first boundary after a position
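
The navigation and position-checking calls from the last few slides can be tried out with a short Java sketch. It is only an illustration: it uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that ICU4J's com.ibm.icu.text.BreakIterator offers the same methods; the class name WordBoundarySketch and its helpers are made up for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Sketch only: java.text.BreakIterator is used as a stand-in (assumption)
// for ICU4J's com.ibm.icu.text.BreakIterator.
class WordBoundarySketch {

    // Counts the segments between word boundaries that contain a letter.
    static int countWords(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        int count = 0;
        int start = words.first(); // boundary before the first character
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            if (containsLetter(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    // True if any character in [start, end) is a letter.
    static boolean containsLetter(String text, int start, int end) {
        for (int i = start; i < end; i++) {
            if (Character.isLetter(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Hello, world!";
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        System.out.println(words.isBoundary(5)); // is there a boundary after "Hello"?
        System.out.println(words.following(0));  // first boundary after offset 0
        System.out.println(words.preceding(5));  // last boundary before offset 5
        System.out.println(countWords(text));    // segments that contain letters
    }
}
```

Punctuation and spaces form their own segments between word boundaries, which is why the sketch counts only segments that contain a letter.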

17
Break Iteration - Opening
  • Use the factory methods

Locale locale; // locale to use for break iterators
UErrorCode status = U_ZERO_ERROR;
BreakIterator *characterIterator =
  BreakIterator::createCharacterInstance(locale, status);
BreakIterator *wordIterator =
  BreakIterator::createWordInstance(locale, status);
BreakIterator *lineIterator =
  BreakIterator::createLineInstance(locale, status);
BreakIterator *sentenceIterator =
  BreakIterator::createSentenceInstance(locale, status);
  • Don't forget to check the status!

18
Set the text
  • We need to tell the iterator what text to use

UnicodeString text;
readFile(file, text);
wordIterator->setText(text);
  • Reuse iterators by calling setText() again.

19
Break Iteration - Counting words in a file
int32_t countWords(BreakIterator *wordIterator, UnicodeString &text) {
  UErrorCode status = U_ZERO_ERROR;
  UnicodeString word;
  UnicodeSet letters(UnicodeString("[:letter:]"), status);
  int32_t wordCount = 0;
  int32_t start = wordIterator->first();
  for(int32_t end = wordIterator->next();
      end != BreakIterator::DONE;
      start = end, end = wordIterator->next()) {
    text.extractBetween(start, end, word);
    // count only segments that contain at least one letter
    if(letters.containsSome(word)) {
      wordCount += 1;
    }
  }
  return wordCount;
}
20
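Break Iteration - Breaking lines
int32_t previousBreak(BreakIterator *breakIterator, UnicodeString &text,
                      int32_t location) {
  int32_t len = text.length();
  // skip over trailing whitespace and control characters
  while(location < len) {
    UChar c = text[location];
    if(!u_isWhitespace(c) && !u_iscntrl(c)) {
      break;
    }
    location += 1;
  }
  // preceding() takes a position; previous() takes no argument
  return breakIterator->preceding(location + 1);
}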
21
Break Iteration - Cleaning up
  • Use delete to delete the iterators

delete characterIterator;
delete wordIterator;
delete lineIterator;
delete sentenceIterator;
22
Useful Links
  • Homepage: http://ibm.com/software/globalization/icu
  • API documents and User Guide:
    http://ibm.com/software/globalization/icu/documents.jsp

23
Getting ICU4J
  • Easiest: pick a .jar file from the download section
    at http://ibm.com/software/globalization/icu
  • Use the latest version if possible
  • For sources, download the source .jar
  • For bleeding edge, use the latest CVS; see the site
    for instructions

24
Setting up ICU4J
  • Check that you have the appropriate JDK version
  • Try the test code (ICU4J 3.0 or later)

import com.ibm.icu.util.ULocale;
import com.ibm.icu.util.UResourceBundle;

public class TestICU {
  public static void main(String[] args) {
    UResourceBundle resourceBundle =
      UResourceBundle.getBundleInstance(null, ULocale.getDefault());
  }
}
  • Add ICU's jar to the classpath on the command line
  • Run the test suite

25
Building ICU4J
  • Need ant in addition to JDK
  • Use ant to build
  • We also like Eclipse

26
Collation Engine
  • More on collation tomorrow!
  • Used for comparing strings
  • Instantiation

ULocale locale = new ULocale("fr");
Collator coll = Collator.getInstance(locale);
// do useful things with the collator
  • Lives in com.ibm.icu.text.Collator

27
String Comparison
  • Works fast
  • You get the result as soon as it is ready
  • Use when you don't need to compare the same strings
    many times

int compare(String source, String target)
28
Sort Keys
  • Used when multiple comparisons are required
  • For example, indexes in databases
  • ICU4J has two classes
  • Compare only sort keys generated by the same type
    of a collator

29
CollationKey class
  • JDK compatible
  • Saves the original string
  • Compare keys with compareTo method
  • Get the bytes with toByteArray method
  • We used CollationKey as a key for a TreeMap
    structure
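
To illustrate the TreeMap idea, here is a minimal sketch of sorting strings through their collation keys. It assumes the JDK's java.text.Collator and java.text.CollationKey, which the slide describes ICU4J's CollationKey as compatible with; the class name SortKeySketch and the method sortWithKeys are hypothetical names chosen for this example.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Sketch only: JDK collation classes stand in (assumption) for
// com.ibm.icu.text.Collator and com.ibm.icu.text.CollationKey.
class SortKeySketch {

    // Sorts the words by generating one sort key per word and letting
    // a TreeMap order the keys with CollationKey.compareTo().
    static List<String> sortWithKeys(List<String> words, Locale locale) {
        Collator coll = Collator.getInstance(locale);
        TreeMap<CollationKey, String> sorted = new TreeMap<>();
        for (String word : words) {
            // Note: words whose keys compare as equal overwrite each other.
            sorted.put(coll.getCollationKey(word), word);
        }
        return new ArrayList<>(sorted.values());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "Banana", "apple");
        // Linguistic order, unlike String.compareTo(), which puts "Banana" first.
        System.out.println(sortWithKeys(words, Locale.ENGLISH));

        // One-off comparison without generating keys:
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        System.out.println(coll.compare("apple", "Banana") < 0);
    }
}
```

Each key is generated once and then compared many times while the map is built, which is the situation sort keys are meant for; a single comparison is cheaper with compare().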

30
RawCollationKey class
  • Does not store the original string
  • Get it by using getRawCollationKey method
  • Mutable class, can be reused
  • Simple and lightweight

31
Message Format - Introduction
  • Assembles a user message from parts
  • Some parts fixed, some supplied at runtime
  • Order different for different languages
    • English: My Aunt's pen is on the table.
    • French: The pen of my Aunt is on the table.
  • Pattern string defines how to assemble parts
    • English: {0}''s {2} is {1}.
    • French: {2} of {0} is {1}.
  • Get pattern string from resource bundle

32
Message Format - Example
String person = ...; // e.g. "My Aunt"
String place = ...;  // e.g. "on the table"
String thing = ...;  // e.g. "pen"
String pattern = resourceBundle.getString("personPlaceThing");
MessageFormat msgFmt = new MessageFormat(pattern);
Object[] arguments = {person, place, thing};
String message = msgFmt.format(arguments);
System.out.println(message);
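
A self-contained variant of this example may help show how the argument order follows the pattern. It is a sketch that assumes the JDK's java.text.MessageFormat in place of ICU4J's class and hard-codes the two pattern strings instead of reading them from a resource bundle; the class name AuntPenSketch and the method describe are invented for the illustration.

```java
import java.text.MessageFormat;

// Sketch only: java.text.MessageFormat stands in (assumption) for
// com.ibm.icu.text.MessageFormat; patterns are hard-coded, not loaded
// from a resource bundle.
class AuntPenSketch {

    // Fills {0} with the person, {1} with the place and {2} with the thing.
    static String describe(String pattern, String person, String place, String thing) {
        MessageFormat msgFmt = new MessageFormat(pattern);
        Object[] arguments = {person, place, thing};
        return msgFmt.format(arguments);
    }

    public static void main(String[] args) {
        String englishOrder = "{0}''s {2} is {1}."; // '' produces one apostrophe
        String frenchOrder = "{2} of {0} is {1}.";

        // prints: My Aunt's pen is on the table.
        System.out.println(describe(englishOrder, "My Aunt", "on the table", "pen"));
        // prints: pen of My Aunt is on the table.
        System.out.println(describe(frenchOrder, "My Aunt", "on the table", "pen"));
    }
}
```

The calling code stays the same for both languages; only the pattern string changes.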
33
Message Format - Different data types
  • We can also format other data types, like dates
  • We do this by adding a format type

String pattern = "On {0, date} at {0, time} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object[] args = {
  new Date(System.currentTimeMillis()), // 0
  "a power failure"                     // 1
};
System.out.println(fmt.format(args));
  • This will output

On Jul 17, 2004 at 2:15:08 PM there was a power
failure.
34
Message Format - Format styles
  • Add a format style

String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object[] args = {
  new Date(System.currentTimeMillis()), // 0
  "a power failure"                     // 1
};
System.out.println(fmt.format(args));
  • This will output

On Saturday, July 17, 2004 at 2:15:08 PM PDT
there was a power failure.
35
Message Format - Format style details
Format Type   Format Style   Sample Output
number        (none)         123,456.789
number        integer        123,457
number        currency       $123,456.79
number        percent        12%
date          (none)         Jul 17, 2004
date          short          7/17/04
date          medium         Jul 17, 2004
date          long           July 17, 2004
date          full           Saturday, July 17, 2004
time          (none)         2:15:08 PM
time          short          2:15 PM
time          medium         2:15:08 PM
time          long           2:15:08 PM PDT
time          full           2:15:08 PM PDT
36
Message Format - No format type
  • If there is no format type, data is formatted like this

Data Type   Sample Output
Number      123,456.789
Date        7/17/04 2:15 PM
String      on the table
others      output of toString() method
37
Message Format - Counting files
  • Pattern to display number of files

There are {1, number, integer} files in {0}.
  • Code to use the pattern

String pattern = resourceBundle.getString("fileCount");
MessageFormat fmt = new MessageFormat(pattern);
String directoryName = ...;
int fileCount = ...;
Object[] args = {directoryName, Integer.valueOf(fileCount)};
System.out.println(fmt.format(args));
  • This will output messages like

There are 1,234 files in myDirectory.
38
Message Format - Problems counting files
  • If there's only one file, we get

There are 1 files in myDirectory.
  • Could fix by testing for special case of one file
  • But some languages need other special cases:
    • Dual forms
    • Different form for no files
    • Etc.

39
Message Format - Choice format
  • Choice format handles all of this
  • Use special format element

There {1, choice, 0#are no files|1#is one file|1<are {1, number, integer} files} in {0}.
  • Using this pattern with the same code we get

There are no files in thisDirectory.
There is one file in thatDirectory.
There are 1,234 files in myDirectory.
40
Message Format - Choice format patterns
  • Selects a string based on a number
  • If the string is a format element, process it
  • Splits the real line into two or more ranges
  • Range specifiers separated by vertical bar (|)
  • Each specifier: lower limit, separator, string
  • Separator indicates type of lower limit

Separator   Lower Limit
#           inclusive
≤           inclusive
<           exclusive
41
Message Format - Choice pattern details
  • Here's our pattern again

There {1, choice, 0#are no files|1#is one file|1<are {1, number, integer} files} in {0}.
  • First range is [0..1)
    • Really [-∞..1)
  • Second range is [1..1]
  • Third range is (1..∞]
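
The three ranges can be exercised with a small Java sketch. It assumes the JDK's java.text.MessageFormat as a stand-in for ICU4J's class, hard-codes the pattern rather than loading it from a resource bundle, and pins the locale to Locale.US so the thousands separator is predictable; the names ChoiceSketch and describeFiles are hypothetical.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Sketch only: java.text.MessageFormat stands in (assumption) for
// com.ibm.icu.text.MessageFormat.
class ChoiceSketch {

    // 0# covers everything below 1, 1# covers exactly 1, 1< covers above 1.
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String describeFiles(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        Object[] args = {directoryName, Integer.valueOf(fileCount)};
        return fmt.format(args);
    }

    public static void main(String[] args) {
        // prints: There are no files in thisDirectory.
        System.out.println(describeFiles("thisDirectory", 0));
        // prints: There is one file in thatDirectory.
        System.out.println(describeFiles("thatDirectory", 1));
        // prints: There are 1,234 files in myDirectory.
        System.out.println(describeFiles("myDirectory", 1234));
    }
}
```

The string chosen for the third range is itself a pattern, so it is formatted again with the same arguments; that is how the nested {1,number,integer} gets filled in.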

42
Message Format - Other details
  • Format style can be a pattern string
    • Format type number: use DecimalFormat pattern
    • Format type date, time: use SimpleDateFormat
      pattern
  • Quoting in patterns
    • Enclose special characters in single quotes
    • Use two consecutive single quotes to represent
      one

The '' character, the '' character and the ''
character.
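
The quoting rules can be checked with a few lines of Java. This sketch assumes the quoting behaviour of the JDK's java.text.MessageFormat, which is what the rules above describe; ICU4J's own class may treat apostrophes differently depending on its settings. The class QuotingSketch and its render helper are invented for the example.

```java
import java.text.MessageFormat;

// Sketch only: shows JDK java.text.MessageFormat quoting (assumption:
// the slide's rules refer to this JDK-compatible behaviour).
class QuotingSketch {

    // Formats the pattern with the given arguments.
    static String render(String pattern, Object... args) {
        return new MessageFormat(pattern).format(args);
    }

    public static void main(String[] args) {
        // A quoted brace is literal; two single quotes give one apostrophe.
        // prints: The { character and the ' character.
        System.out.println(render("The '{' character and the '' character."));

        // A quoted format element is not replaced.
        // prints: {0} is replaced by x
        System.out.println(render("'{0}' is replaced by {0}", "x"));
    }
}
```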
43
Useful Links
  • Homepage: http://ibm.com/software/globalization/icu
  • API documents and User Guide:
    http://ibm.com/software/globalization/icu/documents.jsp