Title: Challenges of web internationalization
1Challenges of web internationalization
2Objectives
- Key International goals for businesses
- Challenges of global UI design
- Challenges related to data formats
- Standard Unicode Character Set
- Open Source i18n library ICU
3Key International Goals
- Business Goals: Growth, Growth, Growth
- Reduce development, test, and integration costs
- (Re)Use more components globally, reliably
- Shorten time to market
- Minimize changes for each language/region
- Simultaneous worldwide shipment (SimShip)
- Efficient, cost-effective localization
- (Localization cost) times (Regional Markets)
- Quality
4How Do We Get There?
- Successful Publishing Around The World
- Desirable content
- Consistent with local laws
- Compatible with local infrastructure
- Acceptable pricing models
- Desirable delivery formats (UI)
- Native language(s), Local customs
5Global User Interface Design
- Language
- Data Format/Presentation
- Imagery, Symbolism
- Use of Color
- Layout
- Media (Audio, Video, etc.)
6Date Formats
7Supporting International Formats
- Stored value: 0000 0100 0000 0000 (0x0400)
- Format(Value, Style) with Value = 0x0400, Style = European: result 1.024
- Format(Value, Style) with Value = 0x0400, Style = American: result 1,024
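The Format(Value, Style) idea can be sketched in a few lines of Python. This is only an illustration: the function name `format_grouped` and the style names "american" and "european" are hypothetical, and real applications would normally delegate this to a locale-aware library rather than hand-written rules.

```python
def format_grouped(value: int, style: str) -> str:
    """Render an integer with a style-specific thousands separator."""
    # Hypothetical style table for illustration; real locales differ in more
    # ways (decimal mark, grouping size, digits) than this sketch covers.
    separators = {"american": ",", "european": "."}
    grouped = f"{value:,}"  # Python's built-in grouping uses ","
    return grouped.replace(",", separators[style])


print(format_grouped(0x0400, "american"))  # 1,024
print(format_grouped(0x0400, "european"))  # 1.024
```

The stored value never changes; only the presentation chosen at display time does.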
8Definitions
- Internationalization (I18n)
- To design and develop an application
- without built-in cultural assumptions
- that is efficient to localize
- Localization (L10n)
- To tailor an application to meet the needs of a particular region, market, or culture
9Definition of terms
- Translation
- To render an application into another language
- Globalization
- Companies participating in the global economy, establishing themselves in foreign markets
- Adapting products and services to end users' cultural and linguistic requirements
- I18N + L10N = G11N
Y4!
10Data Validation
- Phone Numbers
- USA 1 (781) 789-1898
- France 33.1.6172.8041
- Number of digits is not fixed
- Identifiers may use international text
- Postal codes, License plates, et al.
Courtesy License Plates of the World
www.worldlicenseplates.com
11Data Validation
- Validation logic chosen dynamically
- (Based on intl, user preference, etc.)
/* choose validation logic */
if (intl == jp) then validator = JapaneseDateFormat
if (intl == uk) then validator = EnglishDateFormat
if (intl == fr) then validator = EuropeanDateFormat
result = ValidateData(input, validator)
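One way to realise the "validator chosen dynamically" pseudocode is a lookup table keyed by the international setting. The sketch below is in Python; the table `DATE_PATTERNS`, the function `validate_date`, and the concrete date patterns are assumptions made for illustration, since the slide does not specify what each validator accepts.

```python
from datetime import datetime

# Hypothetical patterns per international setting (assumed, not from the slide).
DATE_PATTERNS = {
    "jp": "%Y/%m/%d",  # year first
    "uk": "%d/%m/%Y",  # day first
    "fr": "%d/%m/%Y",  # day first
}


def validate_date(text: str, intl: str) -> bool:
    """Return True if text parses as a date under the pattern chosen for intl."""
    pattern = DATE_PATTERNS[intl]  # validation logic is selected at run time
    try:
        datetime.strptime(text, pattern)
        return True
    except ValueError:
        return False


print(validate_date("2004/12/31", "jp"))  # True
print(validate_date("31/12/2004", "uk"))  # True
print(validate_date("12/31/2004", "uk"))  # False: there is no month 31
```

Keeping the rules in data rather than in chained conditionals means a new region only adds a table entry.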
12Titles and Addresses
- United Kingdom: Mr. Badi Kumar, Yahoo! Inc, 210 Bath Road, Slough, Berkshire, England SL1 3XE
- Japan: Japan 104-0032, Tokyo Chuo-Ku Hacchoubori 3-11-12, Taiki Building, Yahoo! KK, Tanaka-san
- Callouts: Country, Title (their positions differ between the two addresses)
- Not only layout, but tab order has to change! And watch out for exit validation of country!
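The two addresses can be produced from the same kind of field data if the field order lives in a per-country template. The following Python sketch assumes hypothetical names (`ADDRESS_TEMPLATES`, `format_address`), and the line breaks within each template are a guess at the slide's layout.

```python
# Hypothetical per-country templates; the field order is data, not code.
ADDRESS_TEMPLATES = {
    "GB": "{title} {name}\n{company}\n{street}\n{city}, {region}\n"
          "{country} {postal_code}",
    "JP": "{country} {postal_code}\n{region} {city} {street}\n{building}\n"
          "{company}\n{name}{title}",
}


def format_address(fields: dict, country_code: str) -> str:
    """Lay out address fields in the order the given country expects."""
    return ADDRESS_TEMPLATES[country_code].format(**fields)


uk = {"title": "Mr.", "name": "Badi Kumar", "company": "Yahoo! Inc",
      "street": "210 Bath Road", "city": "Slough", "region": "Berkshire",
      "country": "England", "postal_code": "SL1 3XE"}
jp = {"title": "-san", "name": "Tanaka", "company": "Yahoo! KK",
      "building": "Taiki Building", "street": "Hacchoubori 3-11-12",
      "city": "Chuo-Ku", "region": "Tokyo", "country": "Japan",
      "postal_code": "104-0032"}

print(format_address(uk, "GB"))
print(format_address(jp, "JP"))
```

The title comes first in the UK layout and last in the Japanese one, while the country moves from the last line to the first, which is why form layout and tab order cannot be fixed in code.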
13Sorting
- English: ABC...RSTUVWXYZ
- German: AÄB...NOÖ...SßTUÜVYZ
- Swedish/Finnish: AB...STUVWXYZÅÄÖ
- Norwegian: AB...VWXYÜZÆØÅ
- Note: Ü sorts with Y
- Spanish: ch sorts between c and d
- Color, Charlar, Dar
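The Spanish rule can be illustrated with a toy sort key in Python. The function `traditional_spanish_key` is hypothetical and handles only the "ch" rule; it ignores accents, "ll", "ñ" and everything else a real collation library takes care of.

```python
def traditional_spanish_key(word: str) -> list:
    """Sort key that treats "ch" as one letter ordered between c and d."""
    key = []
    lowered = word.lower()
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            key.append(("c", 1))  # ranks after plain c, before d
            i += 2
        else:
            key.append((lowered[i], 0))
            i += 1
    return key


words = ["Dar", "Charlar", "Color"]
print(sorted(words))                               # ['Charlar', 'Color', 'Dar']
print(sorted(words, key=traditional_spanish_key))  # ['Color', 'Charlar', 'Dar']
```

Plain code point order puts Charlar first; the traditional rule puts it after Color, matching the slide's example.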
14Text Processing
- Sorting
- Line Wrapping
- Word Breaking, Hyphenation
- Capitalization
- Quotes
- Styles (Bold, Italic, Ruby, Amikake)
- Writing direction (LTR, RTL, Vertical)
15Internationalization Libraries
- Avoid implementing regional formats!
- IBM ICU, Basis Tech. Rosette
- Native OS or Program Language
- Windows, Java API
- Posix Locales, Formatters
16Summary
- Encapsulate data parsing/formatting and use an internationalized API plus locale
- Complex data (e.g. address) changes positions, fields, size and tab order
- Color should be a cue, not the sole indicator
- Graphics, audio, video may change
- Text display separate from graphics
- Generalize and abstract for global use
17Aren't these problems solved already?
- Yes! Partly by the open source library called ICU.
18Unicode Character Set
Example Unicode Characters
19Unicode Character Standard
- Developed by the Unicode Consortium
- www.unicode.org
- Covers all major living scripts
- Version 4.0 has 96,000 characters
- Capacity for 1 million characters
- Unicode Character Set = ISO 10646
- Unicode adds character properties and algorithms
- ISO and Unicode work together to synchronize
- ISO support enhances international acceptance
20Unicode Worldwide, Multilingual
- 17 Planes of 64K
- 0-10FFFF, 21 Bits
- Basic Multilingual Plane (BMP)
- Common characters
- 1st Supplementary Plane
- archaic, fictional characters
- 2nd Supplementary Plane
- Ideographs
21 17 Planes of 64K
22Unicode Character Set
- Organized by scripts into blocks
23Unicode Is Generative
- Composition can create new characters
- Base + non-spacing (combining) character(s)
- A + ◌̊ = Å
- U+0041 + U+030A = U+00C5
- a + ◌̂ + ◌̣ = ậ
- U+0061 + U+0302 + U+0323 = U+1EAD
- a + ◌̣ + ◌̂ = ậ
- U+0061 + U+0323 + U+0302 = U+1EAD
- Note: Unicode notation is U+hhhh
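Composition can be checked with Python's standard `unicodedata` module. The sketch below normalizes the sequences 0041 030A, 0061 0302 0323 and 0061 0323 0302 to NFC (the composed normalization form) and shows that the two orderings of the combining marks yield the same single character.

```python
import unicodedata

# A followed by COMBINING RING ABOVE composes to a single code point.
a_ring = unicodedata.normalize("NFC", "\u0041\u030A")
print(f"U+{ord(a_ring):04X}")  # U+00C5

# The order of circumflex (0302) and dot below (0323) does not matter:
# normalization reorders the marks before composing.
first = unicodedata.normalize("NFC", "\u0061\u0302\u0323")
second = unicodedata.normalize("NFC", "\u0061\u0323\u0302")
print(first == second)        # True
print(f"U+{ord(first):04X}")  # U+1EAD
```

Comparing strings without normalizing first would treat these equivalent sequences as different text.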
24Unicode Characteristics
- Multilingual
- All scripts/languages, one character set
- Character Properties
- Case, digit, alpha/letter/ideogram, directional class, mirroring, combining class, etc. provided by Unicode
- Logical order for bidirectional languages
- Round Trip Conversion To Legacy Encodings
- Byte Order Mark (BOM)
- Big vs. Little endian and encoding identifier
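Several of these characteristics can be inspected from Python's standard library. The sketch below queries a few character properties through `unicodedata` and shows how a byte order mark identifies endianness; the sample characters are arbitrary choices for illustration.

```python
import codecs
import unicodedata

# Character properties come from the Unicode Character Database.
print(unicodedata.category("A"))            # Lu (uppercase letter)
print(unicodedata.category("5"))            # Nd (decimal digit)
print(unicodedata.bidirectional("\u05D0"))  # R  (Hebrew alef, right-to-left)
print(unicodedata.mirrored("("))            # 1  (mirrored in RTL text)
print(unicodedata.combining("\u0302"))      # 230 (combining class)

# The generic "utf-16" codec writes a BOM in the platform's byte order.
data = "A".encode("utf-16")
has_bom = data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)  # True

# On decoding, the BOM tells the codec which byte order follows.
big_endian = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
print(big_endian.decode("utf-16"))  # A
```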
25Unicode Characteristics
- 3 equivalent forms
- UTF-8: 8-bit, variable width, multi-byte (max. 4)
- UTF-16: 16-bit, variable width, surrogates (max. 2)
- UTF-32: 32-bit, fixed width (max. 1)
- UCS-2 is old terminology, don't use.
- Design avoids multi-byte performance problems
- Algorithm specifications provide interoperability
- Allows one binary program image to be used worldwide
- Developers do not need to be linguists to implement
26Storage and Serialization Formats
- UTF-32
- 32 bits per character
- One unit per character
- Unicode only goes to 10FFFF (21 bits)
- UTF-16
- 16 bits per code unit
- Characters above FFFF use a surrogate pair, i.e. two code units per character
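The surrogate mechanism is simple arithmetic and can be sketched in Python. The helper `to_surrogate_pair` is a hypothetical name, and U+10400 is just a convenient sample character from the first supplementary plane.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000    # 20 bits remain
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


ch = "\U00010400"  # sample supplementary-plane character
high, low = to_surrogate_pair(ord(ch))
print(f"{high:04X} {low:04X}")       # D801 DC00
print(ch.encode("utf-16-be").hex())  # d801dc00, matching the codec
print(len(ch.encode("utf-32-be")))   # 4 bytes: one 32-bit unit
print(len("A".encode("utf-16-be")))  # 2 bytes: one 16-bit unit
```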
27Properties of UTF-8
- Transforms Unicode to sequences of octets
- ASCII-compatible (Characters 0-127)
- Non-ASCII characters are either 2, 3 or 4 bytes
- European generally 2, CJK generally 3, higher planes 4.
- Result
- Algorithms searching for ASCII characters (e.g., / \ < > ? - a b c d etc.) work correctly
- String length is not greatly increased
- All of Unicode supported
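These properties are easy to confirm in Python. The sketch below measures the UTF-8 length of one sample character per size class and then splits an encoded path on the ASCII slash; the sample characters and the path string are arbitrary illustrations.

```python
# One sample character for each UTF-8 length.
samples = ["A", "\u00E9", "\u4E2D", "\U0001F600"]  # ASCII, European, CJK, higher plane
for ch in samples:
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+4E2D -> 3 byte(s)
# U+1F600 -> 4 byte(s)

# Every byte of a multi-byte sequence is >= 0x80, so searching the raw bytes
# for an ASCII character can never match inside another character.
encoded = "na\u00EFve/\u4E2D\u6587/file.txt".encode("utf-8")
parts = [part.decode("utf-8") for part in encoded.split(b"/")]
print(parts)  # ['naïve', '中文', 'file.txt']
```

The same byte-level split is not safe on some legacy multi-byte encodings, where a trail byte can coincide with an ASCII value.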
28Choosing a UTF
- UTF-8
- Good choice for migrating legacy software and file formats (ASCII compatibility, multi-byte encoding)
- Best storage form for European languages
- UTF-16
- More efficient for sorting, processing
- Best storage for Asian languages
- Requires wide character datatypes
- Good choice for new implementations
- UTF-32: efficient processing, wastes memory
29Unicode Does Not Equal Internationalization
- Unicode simplifies development
- Single source code
- Enables multilingual processing
- Properties reduce research for each language
- Unicode does not fix all internationalization problems
- E.g. date, time, number and other formats
- Linguistic processing can require additional algorithms, data (e.g. word breaking)
- Continue to identify and support cultural requirements
- Conversion to native encodings for interface to legacy software, systems can impose limitations
30Summary Unicode
- Well-supported, ubiquitous, and often required in integrated environments.
- Simplifies working with many languages
- Large character set requires consideration
- Requires removing assumptions that 1 character is 1 byte or word.
31Developing Software For The World - Summary
- Gather international requirements.
- Design internationalization in early.
- Use global images, widgets, etc. where possible. Plan for localization elsewhere.
- Test international data early (pseudo-localize). Involve international testers.
- Maximize locale-independence.
- Use Unicode.
32International Components for Unicode (ICU)
- Why ICU?
- Open source
- Flexible
- Portable foundation
- In sync with the standards, including Unicode and CLDR.
UnicodeString
- Minimizes cost
- Solves most of the problems related to i18n
- Comprehensive functionality for globalization requirements
33ICU Features
- Text: Unicode text handling, full character properties and character set conversions (500 code pages)
- Analysis: Unicode regular expressions; full Unicode sets; character, word and line boundaries
- Comparison: language-sensitive collation and searching
34ICU Features, cont'd
- Transformations: normalization, upper/lowercase, script transliterations (50 pairs)
- Locales: comprehensive data (230), resource bundle architecture
- Complex Text Layout: Arabic, Hebrew, Indic and Thai
- Formatting and Parsing: multi-calendar and time zone, dates, times, numbers, currencies, messages
35Questions
?