ICU Overview: The Open-Source Unicode Library, v3.0

Slides: 29
Provided by: hele46
Learn more at: https://icu-project.org
1
ICU Overview: The Open-Source Unicode Library, v3.0
  • Markus Scherer
  • ICU Manager
  • IBM Globalization Center of Competency

2
Agenda
  • Background
  • What is ICU?
  • Architecture Overview
  • ICU Features and recent additions
  • References
  • Q and A

3
Why Globalization?

4
Unicode
  • All world languages
  • Efficient and effective processing
  • Lossless data exchange
  • Enables single-binary global software
  • But all languages → large, complex standard
  • 1,400 pages + annexes + additional standards
  • 90,000 characters
  • Major update every 3 years
  • 70 character properties, many multi-valued
  • Affects many processes: display, line-break,
    regex, ...

5
Locales
  • Features vary widely across languages & countries
  • Sorting, line breaks, date/time/number/currency
    formatting, codepage conversion, ...
  • Performance is key: easy to do the right thing,
    hard to do it fast

6
What is ICU?
  • Globalization / Unicode / Locales
  • Mature, widely used set of C/C++ and Java
    libraries
  • Basis for Java 1.1 internationalization, but
    goes far beyond
  • Very portable: identical results on all
    platforms / programming languages
  • C/C++: 30 platforms/compilers
  • Java: IBM & Sun JDK
  • Full threading model, customizable, modular
  • Open source but not viral
  • ICU 3.0: 78 languages, 118 countries, 870
    codepages

7
Who uses ICU?
  • Products Within IBM
  • PSD Print Architecture, DB2, COBOL, Host Access
    Client, InfoPrint Manager, Informix GLS version
    4.0, iSeries, Lotus Notes, Lotus Extended Search,
    Lotus Workplace, MQ Integrator Endeavour, NUMA-Q,
    OTI, Pervasive Computing WECMS, SSS Websphere
    Banking Solutions, Tivoli Presentation Services,
    WBI Adapter/ Connect/Modeler and Monitor/
    Solution Technology Development/WBI-Financial
    TePI, Websphere Application Server/ Studio
    Workload Simulator/Transcoding Publisher, XML
    Parser
  • Other Companies and Organizations
  • Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump,
    Business Objects, caris, CERN, Cognos, Debian,
    Gentoo, HP, Inktomi, JD Edwards, Jikes,
    Macromedia, Mathworks, Mozilla, NCR, OpenOffice,
    Parrot, PayPal, Python, QNX, Rogue Wave, SAP,
    Siebel, SIL, Software AG, Sun Microsystems
    (Solaris, Java), SuSE, Sybase, Virage,
    webMethods, Wine, Leica Geosystems GIS Mapping,
    LLC.

8
ICU Features
  • Unicode text handling
  • Charset conversions (870)
  • Collation & Searching
  • Locales (170)
  • Resource Bundles
  • Calendar & Time zones
  • Complex-text layout engine
  • Unicode Regular Expressions
  • Breaks: word, line, ...
  • Formatting
  • Date & time
  • Messages
  • Numbers & currencies
  • Transforms
  • Normalization
  • Casing
  • Transliterations

9
Architecture Overview 1
  • Locale Based Services
  • Locale is an identifier, not a container
  • Keywords for variants: de_AT@collation=phonebook
  • Resource inheritance: shared resources

Resource inheritance tree (figure), by level:
root
Language: en, de, zh
Script (under zh): Hant, Hans
Country: US, IE, DE, CH, TW, CN
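The inheritance idea on this slide can be sketched as a lookup chain that walks from the most specific locale ID up to root. This is a simplified illustration only, not ICU's actual lookup code; the class name `LocaleFallbackSketch` and the method `chain` are hypothetical, and keywords are simply dropped here because they select a variant rather than a level in the tree.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified fallback chain, not ICU's real resource lookup.
final class LocaleFallbackSketch {

    // Returns the lookup order for a locale ID, most specific first, ending at "root".
    static List<String> chain(String localeId) {
        // Keywords such as "@collation=phonebook" pick a variant; strip them for the chain.
        int at = localeId.indexOf('@');
        String id = (at >= 0) ? localeId.substring(0, at) : localeId;

        List<String> result = new ArrayList<>();
        while (!id.isEmpty()) {
            result.add(id);
            int cut = id.lastIndexOf('_');
            id = (cut >= 0) ? id.substring(0, cut) : "";
        }
        result.add("root");
        return result;
    }

    public static void main(String[] args) {
        System.out.println(chain("de_AT@collation=phonebook")); // [de_AT, de, root]
        System.out.println(chain("zh_Hant_TW"));                // [zh_Hant_TW, zh_Hant, zh, root]
    }
}
```

A resource missing at one level would be looked up at the next entry in the chain, which is why shared data only needs to be stored once near the root.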
10
Architecture Overview 2
  • Open and Close Service Model
  • Better performance by avoiding setup costs per
    operation
  • Warning: use properly for maximum performance
  • ICU Threading Model
  • Multiple versions in use simultaneously
  • Large resources shared in read-only cache

11
Architecture Overview 3
  • Data Driven Services
  • Customize at build-time or run-time
  • Interchange with other platforms
  • same results on each
  • Rule-based
  • Collation, Word-breaks, Transforms
  • Pattern-based
  • Formats, UnicodeSet
  • Table-based
  • Character Conversion

12
Architecture Overview ICU4C
  • Simple Error Handling
  • C++ subset for portability
  • Support for multi-threaded environment
  • Version Management
  • Multiple versions at the same time
  • Data and library versioning
  • String Buffer Management
  • Preflighting and overflow protection
  • Misc.: Load/Unload ICU
  • Recent Additions
  • Runtime-settable memory allocation and mutex
    functions

13
Architecture Overview ICU4J
  • Supplement for Java
  • Core globalization (no char. conversion, no GUI
    components)
  • We do supply complex text support for Sun
  • Modularized: products may add just the needed
    functionality

14
ICU4J vs. JDK
  • CLDR 1.1 (Common Locale Data Repository)
  • Up-to-date globalization: standards-compliant,
    latest Unicode
  • Supplementary characters (GB 18030, JIS X 0213,
    HKSCS)
  • Full properties; JDK has only a fraction
  • Local calendars (Thailand, Japan, ...), ISO dates
  • Currencies, String Search, Intl Domain Names
  • Transforms: Case, Scripts, Normalization
  • Much faster turn-around on bug-fixes, enhancements

15
Unicode Text Handling
  • C
  • UChar: null-terminated or with length
  • C++
  • UnicodeString: full-featured string class
  • Java
  • Uses normal JDK String, adds utilities
  • All handle supplementary characters
  • Required for GB 18030/JIS X 0213/HKSCS repertoires
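As a rough illustration of what handling supplementary characters means for the Java case, the sketch below uses only the normal JDK String and Character APIs. The class name `SupplementarySketch` is hypothetical, and U+20000 is just one example of a character outside the Basic Multilingual Plane.

```java
// Illustrative only: code points versus UTF-16 code units in a JDK String.
final class SupplementarySketch {

    // Counts Unicode code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+20000 (a CJK Extension B ideograph) needs a surrogate pair in UTF-16.
        String s = "a" + new String(Character.toChars(0x20000)) + "b";
        System.out.println(s.length());     // 4 UTF-16 code units
        System.out.println(codePoints(s));  // 3 code points

        // Iterate by code point so the surrogate pair is never split.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}
```

Code that indexes by `char` alone would see the two halves of the surrogate pair as separate units, which is the failure mode the slide warns about for GB 18030, JIS X 0213 and HKSCS repertoires.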

16
Unicode Text Handling 2
  • All Unicode 4.0 properties
  • Direct API
  • Values, names, enumerations
  • UnicodeSet
  • Fast, compact set operations
  • Pattern-based (both Perl & POSIX syntax for
    properties)
  • \p{greek} vs. [:greek:]
  • All properties
  • [\p{lowercase}-[a-z]]
  • [\p{greek} & \p{uppercase}]
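To make the property and set operations above concrete, here is a minimal sketch using the JDK only: `Character` for direct property queries and `java.util.regex` character classes as a stand-in for set patterns. It is not UnicodeSet itself (the JDK regex syntax uses `&&` and `\p{IsGreek}`), and the class name `PropertySetSketch` and its helper `matches` are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: JDK property lookups and regex character classes,
// used here as a stand-in for ICU's UnicodeSet patterns.
final class PropertySetSketch {

    // Intersection: Greek-script characters that are also uppercase letters.
    static final Pattern GREEK_UPPER = Pattern.compile("[\\p{IsGreek}&&\\p{Lu}]");

    // Difference: lowercase letters minus ASCII a-z.
    static final Pattern LOWER_NON_ASCII = Pattern.compile("[\\p{Ll}&&[^a-z]]");

    // Tests whether a single code point is a member of the set.
    static boolean matches(Pattern set, int codePoint) {
        return set.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        int omega = 0x03A9; // GREEK CAPITAL LETTER OMEGA
        int alpha = 0x03B1; // GREEK SMALL LETTER ALPHA

        // Direct API: property values for one character.
        System.out.println(Character.UnicodeScript.of(omega));                       // GREEK
        System.out.println(Character.getType(omega) == Character.UPPERCASE_LETTER); // true

        // Set membership.
        System.out.println(matches(GREEK_UPPER, omega));    // true
        System.out.println(matches(GREEK_UPPER, alpha));    // false
        System.out.println(matches(LOWER_NON_ASCII, alpha)); // true
        System.out.println(matches(LOWER_NON_ASCII, 'a'));   // false
    }
}
```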

17
Data Recent Additions
  • Conforms to CLDR 1.1
  • 50% more data than CLDR 1.0, adding many
    translated terms for languages, scripts,
    countries, currencies, and time zones.
  • Improved collation for Eastern Europe, Chinese
    pinyin
  • Reduced multiplatform install image size
  • Improved XLIFF-ICU conversion tools
  • Locale canonicalization spec defined and
    implemented (C & J)
  • Provides interoperability with POSIX and .NET
    locale IDs, more RFC 3066 support

18
Character Set Conversion
  • Precise alias information
  • When you ask for SJIS, you can request the
    precise definition by platform
  • windows, ibm, solaris, ...
  • Buffer management
  • automatically handles characters that cross
    buffers
  • Customizations allowed for
  • illegal sequences
  • undefined characters
  • Unicode text compression: SCSU, BOCU
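The idea of customizing what happens on illegal sequences can be sketched with the JDK's own `java.nio.charset` decoder, which offers a comparable choice of policies. This is not the ICU converter API; the class name `DecodePolicySketch` and its `decode` helper are hypothetical, and UTF-8 is used only because it makes an illegal byte easy to construct.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: choosing an error policy for illegal input with the JDK decoder.
final class DecodePolicySketch {

    // Decodes UTF-8 bytes, applying the same policy to malformed and unmappable input.
    static String decode(byte[] bytes, CodingErrorAction onError) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(onError)
                .onUnmappableCharacter(onError);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Only reachable with CodingErrorAction.REPORT.
            throw new IllegalArgumentException("illegal byte sequence", e);
        }
    }

    public static void main(String[] args) {
        byte[] input = {'A', (byte) 0xFF, 'B'}; // 0xFF is never legal in UTF-8

        // REPLACE substitutes U+FFFD; IGNORE drops the bad byte.
        System.out.println(decode(input, CodingErrorAction.REPLACE).equals("A\uFFFDB")); // true
        System.out.println(decode(input, CodingErrorAction.IGNORE));                     // AB

        try {
            decode(input, CodingErrorAction.REPORT);
        } catch (IllegalArgumentException e) {
            System.out.println("REPORT policy rejected the input");
        }
    }
}
```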

19
Collation and Searching
  • Fast international comparison and string search,
    fully UCA compliant
  • Compressed sort keys, optimized string
    comparison, sublinear string search
  • Incremental sortkeys for radix-sort
  • Precise binary sortkey stability over time
  • Fully data driven
  • API / rule customizations
  • strength, normalization, upper vs. lowercase
    first, ignore punctuation, ...

20
Collation and Searching: Recent Additions
  • Numeric sorting: sequences of digits can be
    sorted numerically instead of alphabetically
  • e.g., filenames would sort "ab-2" < "ab-10"
  • without material performance cost
  • with reduced sortkey length.
  • Significantly improved sorting orders for many
    other languages
  • Data in separate tree, for easier modularization
    and maintenance
  • getFunctionalEquivalent API allows for better
    caching and UI support.
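The numeric-sorting behavior can be illustrated with a small hand-written comparator. This is a stand-in for the idea only: ICU implements it inside the collator, on collation elements and for all decimal digits, whereas this sketch compares raw chars and handles ASCII digits only. The class name `NumericOrderSketch` and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: digit runs compared by numeric value, other chars by code unit.
final class NumericOrderSketch {

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Index just past the run of digits starting at 'from'.
    static int endOfDigits(String s, int from) {
        int i = from;
        while (i < s.length() && isDigit(s.charAt(i))) i++;
        return i;
    }

    // Compares two digit strings by numeric value without parsing (no overflow).
    static int compareDigitRuns(String x, String y) {
        x = x.replaceFirst("^0+(?=\\d)", ""); // drop leading zeros, keep one digit
        y = y.replaceFirst("^0+(?=\\d)", "");
        if (x.length() != y.length()) return Integer.compare(x.length(), y.length());
        return x.compareTo(y);
    }

    static int compare(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                int ie = endOfDigits(a, i), je = endOfDigits(b, j);
                int cmp = compareDigitRuns(a.substring(i, ie), b.substring(j, je));
                if (cmp != 0) return cmp;
                i = ie;
                j = je;
            } else {
                if (ca != cb) return Character.compare(ca, cb);
                i++;
                j++;
            }
        }
        // The string with characters left over sorts last.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        System.out.println("ab-2".compareTo("ab-10") < 0); // false: plain code-unit order
        System.out.println(compare("ab-2", "ab-10") < 0);  // true: numeric order

        List<String> names = new ArrayList<>(Arrays.asList("ab-10", "ab-2", "ab-1"));
        names.sort(NumericOrderSketch::compare);
        System.out.println(names); // [ab-1, ab-2, ab-10]
    }
}
```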

21
Calendar & Time Zones
  • International Calendars: Arabic, Buddhist,
    Hebrew, Japanese
  • Required for correct presentation of dates in
    some countries
  • Olson timezone support, with localizations
  • Recent Additions
  • RFC822 time zone format support in DateFormat
    (C & J) for compatibility.
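Why local calendars matter for presentation can be seen from how the same day is numbered in different systems. The sketch below uses the JDK's `java.time.chrono` classes as a stand-in; it is not ICU4J's Calendar API, and the class name `LocalCalendarSketch` and its methods are hypothetical.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

// Illustrative only: one ISO date viewed through two local calendar systems.
final class LocalCalendarSketch {

    // Year in the Thai Buddhist calendar (ISO year + 543).
    static int buddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    // Year within the current Japanese imperial era.
    static int japaneseYearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2004, 6, 15);
        System.out.println(buddhistYear(date));      // 2547
        System.out.println(japaneseYearOfEra(date)); // 16 (Heisei 16)
    }
}
```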

22
Formatting
  • Date & time: 8 formats per locale
  • Messages
  • Completely localizable, Plural support
  • Numbers & currencies
  • Scientific Notation, Spelled-out (checks, etc.)
  • Full Orthogonal Currency support
  • INR In Hindi
  • INR in English: Rs. 1,234.57
  • INR in German: Rs. 1.234,57
  • Recent Additions
  • POSIX migration library
  • Allows parsing multiple currencies with one
    formatter
  • Short and stand-alone month/day names
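Orthogonal currency support means the currency is chosen independently of the locale that controls digit grouping and separators. A minimal sketch with the JDK's `java.text.NumberFormat` follows; it is a stand-in for the ICU formatter, the class name `OrthogonalCurrencySketch` is hypothetical, and the currency symbol the JDK prints may differ from the "Rs." shown on the slide, so only the digit layout is checked.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Illustrative only: the same currency formatted under two different locales.
final class OrthogonalCurrencySketch {

    // Formats an amount in the given currency using the locale's number conventions.
    static String format(double amount, String currencyCode, Locale locale) {
        NumberFormat formatter = NumberFormat.getCurrencyInstance(locale);
        formatter.setCurrency(Currency.getInstance(currencyCode));
        return formatter.format(amount);
    }

    public static void main(String[] args) {
        String english = format(1234.567, "INR", Locale.US);
        String german = format(1234.567, "INR", Locale.GERMANY);

        System.out.println(english);
        System.out.println(german);

        // The separators follow the locale, not the currency.
        System.out.println(english.contains("1,234.57")); // true
        System.out.println(german.contains("1.234,57"));  // true
    }
}
```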

23
Transforms
  • Unicode Normalization
  • Highly optimized for performance
  • Performance utilities: concatenation, detection,
    comparison
  • Casing (upper, lower, title, folding)
  • General Transforms
  • Script transliterations
  • Half-width/Full-width, Hex, etc.
  • Chain transforms together, filter source
    characters
  • Rule-based, customizable at runtime.
  • IDNA: International Domain Names
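Normalization and its detection and comparison utilities can be sketched with the JDK's `java.text.Normalizer`. This is a stand-in for the ICU normalizer, without its optimizations; the class name `NormalizationSketch` and the helper `canonicallyEquivalent` are hypothetical.

```java
import java.text.Normalizer;

// Illustrative only: canonical equivalence checked through normalization.
final class NormalizationSketch {

    // Two strings are canonically equivalent if their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301"; // 'a' followed by COMBINING ACUTE ACCENT
        String precomposed = "\u00E1"; // LATIN SMALL LETTER A WITH ACUTE

        System.out.println(decomposed.equals(precomposed));                 // false
        System.out.println(canonicallyEquivalent(decomposed, precomposed)); // true

        // Composition: NFC folds the pair into the single precomposed character.
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(precomposed));                                      // true

        // Detection: check whether text is already normalized before doing any work.
        System.out.println(Normalizer.isNormalized(precomposed, Normalizer.Form.NFC)); // true
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC));  // false
    }
}
```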

24
Segmentation: word, line & sentence
  • Fast state-table implementation
  • Customizable
  • Rule-based, customizable at runtime
  • Special customizations, e.g. Thai
  • Recent Additions
  • Greatly improved performance when going
    backwards (common case when doing line break)
  • Java
  • The rules syntax has been extended. Rules can now
    return information about the types of characters
    they encountered.
  • Common compiled (binary) rule format with ICU4C
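Word segmentation can be sketched with the JDK's `java.text.BreakIterator`, whose basic iteration pattern ICU4J's class of the same name also follows. The JDK version does not report rule status, so this sketch filters out punctuation and whitespace segments by hand; the class name `WordBreakSketch` and the method `words` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: extracting words by walking word-break boundaries.
final class WordBreakSketch {

    // Returns the segments between boundaries that start with a letter or digit.
    static List<String> words(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ENGLISH);
        boundaries.setText(text);

        List<String> out = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String piece = text.substring(start, end);
            // Spaces and punctuation are segments too; keep only word-like ones.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! It works.")); // [Hello, world, It, works]
    }
}
```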

25
Unicode Regular Expressions
  • Full Regex Implementation
  • C++ only; Java 1.4 has its own package (though
    not as powerful)
  • All Unicode 4.0 Properties
  • supported through UnicodeSet
  • Good performance
  • competitive with non-Unicode regex
  • Recent Additions
  • Now features a C API, instead of just C++.

26
Complex-text layout engine
  • Glyph processing, positioning adjustment
  • ligature substitution, contextual forms, kerning,
    accent placement, Bidi scripts, etc.
  • Support for
  • Drawing
  • Caret Display
  • Hit Testing
  • Selection Highlighting
  • Caret Movement
  • Layout Metrics
  • Line Break
  • ICU 3.0: Canonical Equivalence (a + combining
    acute accent, or á)

27
References
  • ICU main site
  • http://oss.software.ibm.com/icu/
  • Links to
  • Download ICU
  • User Guide, Technical FAQ, Support, Bug Reports
  • Unicode Consortium
  • http://www.unicode.org
  • Unicode glossary, Unicode character database

28
Questions and Answers