Title: Getting Started with ICU
1Getting Started with ICU
- Vladimir Weinstein
- Eric Mader
- Steven R. Loomis
2Agenda
- Getting and setting up ICU4C
- Using conversion engine
- Using break iterator engine
- Getting and setting up ICU4J
- Using collation engine
- Using message formats
- Example analysis
3Getting ICU4C
- http://ibm.com/software/globalization/icu
- Get the latest release
- Get the binary package
- Source download for modifying build options
- CVS for bleeding edge: read the instructions
4Setting up ICU4C
- Unpack binaries
- If you need to build from source:
- Windows: MSVC .NET 2003 project, Cygwin + MSVC 6, or just Cygwin
- Unix: runConfigureICU
- make install
- make check
5Testing ICU4C
- Windows - run cintltst, intltest, iotest
- Unix - make check (again)
- See it for yourself
#include <stdio.h>
#include "unicode/utypes.h"
#include "unicode/ures.h"

int main() {
    UErrorCode status = U_ZERO_ERROR;
    /* Open the root resource bundle of the default ICU data */
    UResourceBundle *res = ures_open(NULL, "", &status);
    if(U_SUCCESS(status)) {
        printf("everything is OK\n");
    } else {
        printf("error %s opening resource\n", u_errorName(status));
    }
    ures_close(res);
    return 0;
}
6Conversion Engine - Opening
- ICU4C uses open/use/close paradigm
- Open a converter
UErrorCode status = U_ZERO_ERROR;
UConverter *cnv = ucnv_open(encoding, &status);
if(U_FAILURE(status)) {
    /* process the error situation, die gracefully */
}
- Almost all APIs use UErrorCode for status
- Check the error code!
7What Converters are Available
- ucnv_countAvailable(): get the number of available converters
- ucnv_getAvailableName(): get the name of a particular converter
- A lot of frameworks allow this kind of examination
8Converting Text Chunk by Chunk
char buffer[DEFAULT_BUFFER_SIZE];
char *bufP = buffer;
/* First try the stack buffer; len is the length the full result needs */
len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE, source, sourceLen, &status);
if(U_FAILURE(status)) {
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        /* Too small: allocate exactly what is needed and convert again */
        status = U_ZERO_ERROR;
        bufP = (char *)malloc((len + 1) * sizeof(char));
        len = ucnv_fromUChars(cnv, bufP, len + 1, source, sourceLen, &status);
    } else {
        /* other error, die gracefully */
    }
}
/* do interesting stuff with the converted text */
9Converting Text Character by Character
UChar32 result;
const char *source = start;
const char *sourceLimit = start + len;
while(source < sourceLimit) {
    /* Advances source past the bytes of one code point */
    result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
    if(U_FAILURE(status)) {
        /* die gracefully */
    }
    /* do interesting stuff with the converted text */
}
- Works only from code page to Unicode
10Converting Text Piece by Piece
while((!feof(f)) && ((count = fread(inBuf, 1, BUFFER_SIZE, f)) > 0)) {
    source = inBuf;
    sourceLimit = inBuf + count;
    do {
        target = uBuf;
        targetLimit = uBuf + uBufSize;
        ucnv_toUnicode(conv, &target, targetLimit,
                       &source, sourceLimit, NULL,
                       feof(f) ? TRUE : FALSE, /* pass 'flush' when eof */
                                               /* is true (when no more data will come) */
                       &status);
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            // simply ran out of space - we'll reset the
            // target ptr the next time through the loop.
            status = U_ZERO_ERROR;
        } else {
            // Check other errors here and act appropriately
        }
        text.append(uBuf, target - uBuf);
        count += target - uBuf;
    } while (source < sourceLimit); // while simply out of space
}
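The same "convert until the output buffer overflows, drain it, continue" loop can be sketched in Java. This is only an analogy, not ICU code: it assumes the JDK's java.nio.charset.CharsetDecoder in place of ucnv_toUnicode, and the class name ChunkedDecode and the outSize parameter are made up for the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ChunkedDecode {
    // Decodes input using a deliberately small output buffer of outSize chars
    // (outSize should be at least 2 so a surrogate pair fits).
    static String decode(byte[] input, Charset charset, int outSize) {
        CharsetDecoder decoder = charset.newDecoder();
        ByteBuffer source = ByteBuffer.wrap(input);
        CharBuffer target = CharBuffer.allocate(outSize);
        StringBuilder text = new StringBuilder();
        CoderResult result;
        do {
            // 'true' plays the role of the flush flag: no more input will come
            result = decoder.decode(source, target, true);
            if (result.isError()) {
                throw new IllegalArgumentException("cannot decode: " + result);
            }
            // Drain the converted chars, then reuse the buffer from its start
            target.flip();
            text.append(target);
            target.clear();
        } while (result.isOverflow()); // overflow: simply ran out of space
        // Let the decoder emit any internally buffered output
        do {
            result = decoder.flush(target);
            target.flip();
            text.append(target);
            target.clear();
        } while (result.isOverflow());
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1, 4));
    }
}
```

As in the C loop, an overflow is not treated as a failure; it only means the target buffer has to be emptied before conversion resumes.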
11Clean up!
- Whatever is opened needs to be closed
- Converters use ucnv_close
- Sample uses conversion to convert code page data
from a file
12Text Boundary Analysis
- Process of locating linguistic boundaries while formatting and processing text
- Many uses
- Relatively straightforward for English
- Hard for some other languages
- Chinese and Japanese
- Thai
- Hindi
13Break Iteration - Introduction
- Character boundaries: grapheme clusters
- Word boundaries: word counting, double-click selection
- Line break boundaries: where to break a line
- Sentence break boundaries: sentence counting, triple-click selection
- ICU class: BreakIterator
14Break Iteration - Starting states
- Points to a boundary between two characters
- Index of character following the boundary
- Use current() to get the boundary
- Use first() to set iterator to start of text
- Use last() to set iterator to end of text
15Break Iteration - Navigation
- Use next() to move to next boundary
- Use previous() to move to previous boundary
- Both return DONE if the iterator can't move to another boundary
16Break Iteration - Checking a position
- Use isBoundary() to see if a position is a boundary
- Use preceding() to find the last boundary before a position
- Use following() to find the first boundary after a position
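The navigation and position-checking calls above can be tried out with a short sketch. It is written in Java and assumes the JDK's java.text.BreakIterator, which uses the same method names (first, next, isBoundary, following, preceding, DONE); the class name BoundaryWalk is made up for the illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryWalk {
    // Collects every word boundary, from first() until next() returns DONE.
    static List<Integer> wordBoundaries(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int pos = words.first(); pos != BreakIterator.DONE; pos = words.next()) {
            boundaries.add(pos);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(wordBoundaries(text));  // [0, 5, 6, 11]

        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        System.out.println(words.isBoundary(5));   // true
        System.out.println(words.following(5));    // 6: first boundary after 5
        System.out.println(words.preceding(5));    // 0: last boundary before 5
    }
}
```

Each value is the index of the character that follows the boundary, so 5 is the boundary between "Hello" and the space.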
17Break Iteration - Opening
Locale locale = ...; // locale to use for break iterators
UErrorCode status = U_ZERO_ERROR;

BreakIterator *characterIterator =
    BreakIterator::createCharacterInstance(locale, status);
BreakIterator *wordIterator =
    BreakIterator::createWordInstance(locale, status);
BreakIterator *lineIterator =
    BreakIterator::createLineInstance(locale, status);
BreakIterator *sentenceIterator =
    BreakIterator::createSentenceInstance(locale, status);
- Don't forget to check the status!
18Set the text
- We need to tell the iterator what text to use
UnicodeString text;
readFile(file, text);
wordIterator->setText(text);
- Reuse iterators by calling setText() again.
19Break Iteration - Counting words in a file
int32_t countWords(BreakIterator *wordIterator, UnicodeString &text)
{
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString word;
    UnicodeSet letters(UnicodeString("[:letter:]"), status);
    int32_t wordCount = 0;
    int32_t start = wordIterator->first();

    // Each pair of adjacent boundaries delimits one segment of text
    for(int32_t end = wordIterator->next();
        end != BreakIterator::DONE;
        start = end, end = wordIterator->next())
    {
        text.extractBetween(start, end, word);
        // Count only segments with a letter; skip spaces and punctuation
        if(letters.containsSome(word)) {
            wordCount += 1;
        }
    }
    return wordCount;
}
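For comparison, here is a rough Java counterpart of the word counter. It is a sketch under assumptions: it uses the JDK's java.text.BreakIterator, and Character.isLetter stands in for the UnicodeSet letter test, so it is not a line-for-line port. The class name WordCounter is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class WordCounter {
    // Counts the segments between word boundaries that contain a letter.
    static int countWords(BreakIterator wordIterator, String text) {
        wordIterator.setText(text);
        int wordCount = 0;
        int start = wordIterator.first();
        for (int end = wordIterator.next();
             end != BreakIterator.DONE;
             start = end, end = wordIterator.next()) {
            // Spaces and punctuation are segments too; skip them
            boolean hasLetter = text.substring(start, end)
                                    .codePoints()
                                    .anyMatch(Character::isLetter);
            if (hasLetter) {
                wordCount++;
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        System.out.println(countWords(words, "The quick, brown fox.")); // 4
    }
}
```

The same iterator can be passed to countWords repeatedly, because each call starts by setting the text again.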
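20Break Iteration - Breaking lines
int32_t previousBreak(BreakIterator *breakIterator,
                      UnicodeString &text, int32_t location)
{
    int32_t len = text.length();

    // Skip over any whitespace and control characters at the location
    while(location < len) {
        UChar c = text[location];
        if(!u_isWhitespace(c) && !u_iscntrl(c)) {
            break;
        }
        location += 1;
    }

    // preceding() takes an offset; previous() takes no argument
    return breakIterator->preceding(location + 1);
}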
21Break Iteration - Cleaning up
- Use delete to delete the iterators
delete characterIterator;
delete wordIterator;
delete lineIterator;
delete sentenceIterator;
22Useful Links
- Homepage: http://ibm.com/software/globalization/icu
- API documents and User guide: http://ibm.com/software/globalization/icu/documents.jsp
23Getting ICU4J
- Easiest: pick a .jar file off the download section on http://ibm.com/software/globalization/icu
- Use the latest version if possible
- For sources, download the source .jar
- For bleeding edge, use the latest CVS; see the site for instructions
24Setting up ICU4J
- Check that you have the appropriate JDK version
- Try the test code (ICU4J 3.0 or later)
import com.ibm.icu.util.ULocale;
import com.ibm.icu.util.UResourceBundle;

public class TestICU {
    public static void main(String[] args) {
        UResourceBundle resourceBundle =
            UResourceBundle.getBundleInstance(null, ULocale.getDefault());
    }
}
- Add ICU's jar to the classpath on the command line
- Run the test suite
25Building ICU4J
- Need ant in addition to JDK
- Use ant to build
- We also like Eclipse
26Collation Engine
- More on collation tomorrow!
- Used for comparing strings
- Instantiation
ULocale locale = new ULocale("fr");
Collator coll = Collator.getInstance(locale);
// do useful things with the collator
- Lives in com.ibm.icu.text.Collator
27String Comparison
- Works fast
- You get the result as soon as it is ready
- Use when you don't need to compare the same strings many times
int compare(String source, String target)
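A minimal sketch of direct comparison follows. To stay runnable without the ICU jar it assumes the JDK's java.text.Collator, which also offers compare(String, String) and can be used as a Comparator; the class name CompareDemo and the word list are made up for the illustration.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CompareDemo {
    // Returns a sorted copy of words, ordered by the French collator.
    static String[] sortFrench(String[] words) {
        Collator coll = Collator.getInstance(Locale.FRENCH);
        String[] sorted = words.clone();
        Arrays.sort(sorted, coll); // calls coll.compare() for each pair
        return sorted;
    }

    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.FRENCH);
        // Negative: "\u00e9cole" (ecole with an acute accent) sorts before "zone",
        // although a plain code point comparison would say the opposite.
        System.out.println(coll.compare("\u00e9cole", "zone") < 0);
        System.out.println(Arrays.toString(
            sortFrench(new String[] {"zone", "\u00e9cole", "eau"})));
    }
}
```

Each comparison does its work from scratch, which is fine for a one-off sort like this one.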
28Sort Keys
- Used when multiple comparisons are required
- Indexes in databases
- ICU4J has two sort key classes: CollationKey and RawCollationKey
- Compare only sort keys generated by the same type of collator
29CollationKey class
- JDK compatible
- Saves the original string
- Compare keys with compareTo method
- Get the bytes with toByteArray method
- We used CollationKey as a key for a TreeMap
structure
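Using sort keys as TreeMap keys might look like the sketch below. It assumes the JDK's java.text.CollationKey (the class the slide calls JDK compatible) rather than the ICU4J one; the class name SortKeyDemo and the sample words are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SortKeyDemo {
    // Sorts words by generating one sort key per word and letting a
    // TreeMap order the keys with CollationKey.compareTo().
    static List<String> sortWithKeys(List<String> words, Locale locale) {
        Collator coll = Collator.getInstance(locale);
        Map<CollationKey, Integer> lengthsByKey = new TreeMap<>();
        for (String word : words) {
            lengthsByKey.put(coll.getCollationKey(word), word.length());
        }
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : lengthsByKey.keySet()) {
            // The key keeps the string it was generated from
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(
            sortWithKeys(List.of("zone", "\u00e9cole", "eau"), Locale.FRENCH));
    }
}
```

All keys come from the same collator, which is what makes them comparable with each other.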
30RawCollationKey class
- Does not store the original string
- Get it by using getRawCollationKey method
- Mutable class, can be reused
- Simple and lightweight
31Message Format - Introduction
- Assembles a user message from parts
- Some parts fixed, some supplied at runtime
- Order different for different languages
- English: My Aunt's pen is on the table.
- French: The pen of my Aunt is on the table.
- Pattern string defines how to assemble parts
- English: {0}''s {2} is {1}.
- French: {2} of {0} is {1}.
- Get pattern string from resource bundle
32Message Format - Example
String person = ...; // e.g. "My Aunt"
String place = ...;  // e.g. "on the table"
String thing = ...;  // e.g. "pen"

String pattern = resourceBundle.getString("personPlaceThing");
MessageFormat msgFmt = new MessageFormat(pattern);
Object[] arguments = {person, place, thing};
String message = msgFmt.format(arguments);

System.out.println(message);
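A self-contained variant of this example is sketched below. It assumes the JDK's java.text.MessageFormat instead of the ICU4J class, and it hard-codes the two patterns ({0}''s {2} is {1}. and {2} of {0} is {1}.) where real code would read them from a resource bundle. The class name AuntDemo is made up for the illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

class AuntDemo {
    // Fills the pattern's {0}, {1} and {2} with person, place and thing.
    static String assemble(String pattern, String person, String place, String thing) {
        MessageFormat msgFmt = new MessageFormat(pattern, Locale.ROOT);
        Object[] arguments = {person, place, thing};
        return msgFmt.format(arguments);
    }

    public static void main(String[] args) {
        // '' in a pattern stands for one literal single quote
        String englishOrder = "{0}''s {2} is {1}.";
        String frenchOrder = "{2} of {0} is {1}.";
        System.out.println(assemble(englishOrder, "My Aunt", "on the table", "pen"));
        // prints: My Aunt's pen is on the table.
        System.out.println(assemble(frenchOrder, "My Aunt", "on the table", "pen"));
        // prints: pen of My Aunt is on the table.
    }
}
```

The argument array is the same in both calls; only the pattern decides the order in which the parts appear.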
33Message Format - Different data types
- We can also format other data types, like dates
- We do this by adding a format type
String pattern = "On {0, date} at {0, time} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object[] args = {
    new Date(System.currentTimeMillis()), // 0
    "a power failure"                     // 1
};

System.out.println(fmt.format(args));
On Jul 17, 2004 at 2:15:08 PM there was a power failure.
34Message Format - Format styles
String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object[] args = {
    new Date(System.currentTimeMillis()), // 0
    "a power failure"                     // 1
};

System.out.println(fmt.format(args));
On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.
35Message Format - Format style details
Format Type   Format Style   Sample Output
number        (none)         123,456.789
number        integer        123,457
number        currency       $123,456.79
number        percent        12%
date          (none)         Jul 17, 2004
date          short          7/17/04
date          medium         Jul 17, 2004
date          long           July 17, 2004
date          full           Saturday, July 17, 2004
time          (none)         2:15:08 PM
time          short          2:15 PM
time          medium         2:15:08 PM
time          long           2:15:08 PM PDT
time          full           2:15:08 PM PDT
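The number rows of this table can be reproduced with a small sketch. It assumes the JDK's java.text.MessageFormat with Locale.US; the class name NumberStyles is hypothetical. Date and time styles are left out because their output depends on the clock, the time zone and the locale data in use.

```java
import java.text.MessageFormat;
import java.util.Locale;

class NumberStyles {
    // Formats a single value with a one-argument pattern in the US locale.
    static String show(String pattern, Object value) {
        return new MessageFormat(pattern, Locale.US).format(new Object[] {value});
    }

    public static void main(String[] args) {
        System.out.println(show("{0,number}", 123456.789));          // 123,456.789
        System.out.println(show("{0,number,integer}", 123456.789));  // 123,457
        System.out.println(show("{0,number,currency}", 123456.789)); // $123,456.79
        System.out.println(show("{0,number,percent}", 0.12));        // 12%
        // A format style can also be a DecimalFormat pattern string
        System.out.println(show("{0,number,#,##0.00}", 123456.789)); // 123,456.79
    }
}
```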
36Message Format - No format type
- If there is no format type, data is formatted like this:
Data Type   Sample Output
Number      123,456.789
Date        7/17/04 2:15 PM
String      on the table
others      output of toString() method
37Message Format - Counting files
- Pattern to display number of files
There are {1, number, integer} files in {0}.
String pattern = resourceBundle.getString("fileCount");
MessageFormat fmt = new MessageFormat(pattern);
String directoryName = ...;
int fileCount = ...;
Object[] args = {directoryName, Integer.valueOf(fileCount)};

System.out.println(fmt.format(args));
- This will output messages like
There are 1,234 files in myDirectory.
38Message Format - Problems counting files
- If there's only one file, we get
There are 1 files in myDirectory.
- Could fix by testing for special case of one file
- But, some languages need other special cases
- Dual forms
- Different form for no files
- Etc.
39Message Format - Choice format
- Choice format handles all of this
- Use special format element
There {1, choice, 0#are no files|1#is one file|1<are {1, number, integer} files} in {0}.
- Using this pattern with the same code we get
There are no files in thisDirectory.
There is one file in thatDirectory.
There are 1,234 files in myDirectory.
40Message Format - Choice format patterns
- Selects a string based on number
- If string is a format element, process it
- Splits real line into two or more ranges
- Range specifiers separated by vertical bar (|)
- Lower limit, separator, string
- Separator indicates type of lower limit
Separator    Lower Limit
#            inclusive
≤ (\u2264)   inclusive
<            exclusive
41Message Format - Choice pattern details
There {1, choice, 0#are no files|1#is one file|1<are {1, number, integer} files} in {0}.
- First range is [0..1)
- Really [-∞..1)
- Second range is [1..1]
- Third range is (1..∞]
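The three ranges can be exercised with a short sketch. It assumes the JDK's java.text.MessageFormat with Locale.US and writes the choice pattern compactly as {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files}; the class name FileCount is made up for the illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

class FileCount {
    // Choice pattern from the slides, written compactly.
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String describe(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        Object[] args = {directoryName, Integer.valueOf(fileCount)};
        return fmt.format(args);
    }

    public static void main(String[] args) {
        System.out.println(describe("thisDirectory", 0));
        // prints: There are no files in thisDirectory.
        System.out.println(describe("thatDirectory", 1));
        // prints: There is one file in thatDirectory.
        System.out.println(describe("myDirectory", 1234));
        // prints: There are 1,234 files in myDirectory.
    }
}
```

The third alternative is itself a pattern containing a format element, so it is processed again and the count appears as 1,234.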
42Message Format - Other details
- Format style can be a pattern string
- Format type number: use DecimalFormat pattern
- Format type date, time: use SimpleDateFormat pattern
- Quoting in patterns
- Enclose special characters in single quotes
- Use two consecutive single quotes to represent one
The '{' character, the '#' character and the '' character.
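The quoting rules can be checked with a tiny sketch, again assuming the JDK's java.text.MessageFormat. The pattern and the class name QuoteDemo are made up for the illustration and are not taken from the slides.

```java
import java.text.MessageFormat;
import java.util.Locale;

class QuoteDemo {
    static String format(String pattern, Object... arguments) {
        return new MessageFormat(pattern, Locale.ROOT).format(arguments);
    }

    public static void main(String[] args) {
        // '{0}' is quoted and stays literal, the bare {0} is replaced,
        // and '' produces one single quote.
        String pattern = "'{0}' is literal, {0} is replaced, and '' is one quote";
        System.out.println(format(pattern, "x"));
        // prints: {0} is literal, x is replaced, and ' is one quote
    }
}
```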
43Useful Links
- Homepage: http://ibm.com/software/globalization/icu
- API documents and User guide: http://ibm.com/software/globalization/icu/documents.jsp