Title: Introductions to Software Internationalization for Saba
1Introductions to Software Internationalization
Internal Training
By: Ernie Huang  Date: Nov 10, 2000
2Course Outlines
- Internationalization (I18N) (3 - 24)
- Locale-Specific Issues (25 - 28)
- Double-byte Enabling (29 - 38)
- Unicode Issues (39 - 44)
- Q & A (45)
3Internationalization
- The method of developing a program whose
features and code design are not based on a single
language or locale, and whose source code base
facilitates the creation of different language
editions.
- Marketing Consideration
- Design Consideration
- Coding Consideration
- Source Code Control Consideration
4Marketing Consideration
- Reach the global market with less development effort
- Meet the needs of international enterprises
- Leverage international partners for sales and
support in international markets
5Shipment of International Products
- Begin working on international editions after the
domestic edition has been released or when it is
almost finished.
- Plan for international products in advance, work
on several language editions concurrently, and
ship them all at roughly the same time.
6Categories of Internationalization for Microsoft
Windows 95
- European (Western/Central/Eastern Europe, etc.)
- single-byte, left to right
- Middle Eastern (Arabic, Hebrew)
- single-byte, bidirectional
- Far Eastern (Trad. Chinese, Simp. Chinese,
Japanese, Korean)
- double-byte, horizontal and vertical, input
method
- Thai (Thai)
- single-byte, left to right, text layout
7Design Consideration (1 of 2)
- Features important to international markets are
included
- Icons and bitmaps are generic, are culturally
acceptable, and do not contain text
- Menu and dialog-box designs leave room for text
expansion
- Text and messages are devoid of slang and
specific cultural references
- Consistent English user interface terminology is
used in strings.
8Design Consideration (2 of 2)
- Strings are documented using comments to provide
context for translators. Strings or characters
that should not be localized are marked.
- Shortcut-key combinations are accessible on
international keyboards
- International laws affecting feature designs are
considered
- Third-party agreements support international
design issues
9Coding Consideration (1 of 4)
- Code doesn't concatenate strings to form
sentences
- Example
- Code doesn't use a given string variable in more
than one context
- Code doesn't contain hard-coded character
constants, numeric constants, screen positions,
filenames, or path names that presume a
particular language
- Example
- Buffers are large enough to handle translated
words and phrases
- Example
10Coding Consideration (2 of 4)
- Program allows input of international data
- All language editions share a common file format
- Example
- Code contains support for locale-specific
hardware, if necessary.
- Features that don't apply to international
markets can be removed easily.
11Coding Consideration (3 of 4)
- Code properly handles accented characters
- Example
- Program handles non-homogeneous network
environments in which machines are running
different code pages
- Code uses an API to retrieve the lead-byte ranges
for Far East code pages
- Code correctly parses double-byte characters
unless based on Unicode
12Coding Consideration (4 of 4)
- Code supports Unicode or conversion between
Unicode and the local code page
- Code doesn't assume that all characters are 8-bit
or 16-bit
- Example
- Code uses generic data types and generic function
prototypes
- Program displays and prints text using
appropriate fonts.
13Avoid Hard-Coding Localizable Elements
- Hard-coded strings, characters, constants, screen
positions, filenames, and file paths are
difficult to track down and localize.
- Example
- if (szInputString[0] == 'O')
-     DoOpen();  // when it is "Open"
14Make Buffers Large Enough to Hold Translated Text
- Buffers that are declared to be the exact size of
a word or a sentence will probably overflow when
text is translated.
- char szOK[3];
- GetButtonName(szOK);
- With the Win32 API, stack space is not as
limited as in Win16, so feel free to make the
buffer large. Change 3 to 4095.
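The point above can be sketched in portable C. This is a minimal illustration, not code from the course: `copy_label` and `LABEL_BUF_SIZE` are hypothetical names, and standard `snprintf` stands in for whatever call fills the buffer. A larger buffer is combined with a bounded copy, so an unexpectedly long translation is truncated rather than overflowing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical size: generous, because translated text is often
   longer than the English original. */
#define LABEL_BUF_SIZE 256

/* Copies a (possibly translated) label into dst without overflowing.
   Returns 1 if the whole label fit, 0 if it had to be truncated. */
int copy_label(char *dst, size_t dst_size, const char *label)
{
    assert(dst != NULL && dst_size > 0 && label != NULL);
    int needed = snprintf(dst, dst_size, "%s", label);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

A 3-byte buffer holds "OK" but not the German "Abbrechen". Note that this truncation is byte-based, so for double-byte text it could still cut a character in half.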
15Do Not Limit Character Parsing to Latin Script
- // Search until you find a non-alphabetic character
- Wrong
- while ((*pch >= 'A' && *pch <= 'Z') ||
-        (*pch >= 'a' && *pch <= 'z'))
-     pch++;
- Correct
- while (IsCharAlpha(*pch))
-     pch++;
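Outside Win32, the same idea can be sketched with the standard C classification functions. This is an assumption-laden stand-in, not the slide's API: `skip_letters` is a hypothetical helper, and `isalpha` replaces `IsCharAlpha`. It follows whatever locale the program has selected with `setlocale`, so in a single-byte locale accented letters are no longer treated as terminators.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Advances past a run of alphabetic characters and returns a pointer to
   the first non-alphabetic one. The classification comes from the C
   library's current locale, not from hard-coded 'A'..'Z' ranges. */
const char *skip_letters(const char *pch)
{
    assert(pch != NULL);
    /* The cast avoids undefined behavior for bytes above 0x7F. */
    while (isalpha((unsigned char)*pch))
        pch++;
    return pch;
}
```

This sketch only covers single-byte text; double-byte characters need the lead-byte handling discussed in the DBCS slides.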
16Do not assume that characters are always 8-bit
- // Skip two characters
- Wrong
- szString = szString + 2;
- Correct
- szString = CharNext(CharNext(szString));
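To show what a call like `CharNext` has to do for double-byte text, here is a simplified sketch for one code page only. The function names are hypothetical, and the lead-byte ranges are hard-coded for Windows code page 932 (Japanese Shift-JIS) purely for illustration. Production code would ask the system for the ranges instead.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte ranges of code page 932: 0x81-0x9F and 0xE0-0xFC.
   Hard-coded here only to keep the sketch self-contained. */
int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns a pointer to the next character: two bytes ahead for a
   double-byte character, one byte ahead otherwise. Never moves past
   the terminating NUL. */
const char *char_next_cp932(const char *p)
{
    assert(p != NULL);
    if (*p == '\0')
        return p;
    if (is_lead_byte_cp932((unsigned char)*p) && p[1] != '\0')
        return p + 2;
    return p + 1;
}
```

For a string holding two double-byte characters followed by `A`, skipping two characters this way advances four bytes, whereas `szString + 2` would stop after only the first character.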
17Do not localize strings saved as part of your
file format
- For example, you should not localize the keyword
\bold which is used in RTF (Rich Text Format).
Otherwise the RTF file cannot be read across
different language editions.
- Another example: HTML tags should not be
localized.
18Do not concatenate strings to form sentences
- English
- String1 = "Not enough memory to"
- String2 = "the file"
- %1 = variable for open, create, edit, etc.
- %2 = variable for the file name
- Bad coding example
- String1 + %1 + String2 + %2
- In another language the word order could be
different
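One common alternative to concatenation is to keep the whole sentence as a single localizable string with numbered placeholders, so a translator can reorder them. The sketch below is a hypothetical, minimal expander (`expand_template` is not a real API). It only illustrates the idea and handles `%1` through `%9`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expands %1..%9 in tmpl with args[0..nargs-1] and writes the result
   to out. The output is truncated if it does not fit and is always
   NUL-terminated. */
void expand_template(char *out, size_t out_size, const char *tmpl,
                     const char *const args[], size_t nargs)
{
    assert(out != NULL && out_size > 0 && tmpl != NULL);
    size_t used = 0;
    for (const char *p = tmpl; *p != '\0'; p++) {
        const char *piece = p;   /* default: copy one literal byte */
        size_t len = 1;
        if (p[0] == '%' && p[1] >= '1' && p[1] <= '9' &&
            (size_t)(p[1] - '1') < nargs) {
            piece = args[p[1] - '1'];
            len = strlen(piece);
            p++;                 /* also consume the digit */
        }
        if (len > out_size - 1 - used)
            len = out_size - 1 - used;
        memcpy(out + used, piece, len);
        used += len;
    }
    out[used] = '\0';
}
```

With the template "Not enough memory to %1 the file %2." the arguments appear in English order. A template such as "File %2: not enough memory to %1." reorders them without any code change. The argument text itself (for example the verb) would also need to come from localized resources.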
19Source Code Control Consideration (1 of 2)
- Use the No-Compile Strategy for the various
localized product builds
- Example
- Localizable items are stored in resource files
- All language editions using single-byte character
sets are based on a single executable
- Example
20Source Code Control Consideration (2 of 2)
- All language editions using double-byte character
sets are based on a single executable
- All language editions using Unicode are based on
a single executable
21Internationalized Product for Localization
- Key Success Factors
- Localizable resources can be easily out-sourced
- Bug fixes for the base code can be easily applied
to all localized versions
- Leverage the testing results of the base code with
the No-Compile strategy
22Isolating Localizable Resources
- Certain algorithms
- Constants
- Dialogs
- Macro Languages
- Menus
- Messages
- Prompts
- Sounds
- Status bars
- Toolbars
Separating all localizable items into one or more
files makes localization much easier to
complete.
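As a rough illustration of isolating strings from code, the sketch below keeps every user-visible string in one table and has the code refer to strings only by ID. All names are hypothetical. A real Windows product would keep these strings in resource files rather than in a C array, so this only shows the separation itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical language and string identifiers. */
typedef enum { LANG_EN, LANG_DE, LANG_COUNT } Language;
typedef enum { STR_FILE, STR_CANCEL, STR_HELP, STR_COUNT } StringId;

/* The only place where user-visible text lives. */
static const char *const g_strings[LANG_COUNT][STR_COUNT] = {
    [LANG_EN] = { "File",  "Cancel",    "Help" },
    [LANG_DE] = { "Datei", "Abbrechen", NULL   },  /* "Help" not translated yet */
};

/* Returns the string for the given language, falling back to English
   when a translation is missing. */
const char *load_string(Language lang, StringId id)
{
    assert(lang < LANG_COUNT && id < STR_COUNT);
    const char *s = g_strings[lang][id];
    return s != NULL ? s : g_strings[LANG_EN][id];
}
```

Because the program logic never contains the text, a translator can be handed the table alone, and a missing entry degrades to English instead of failing.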
23Sample Build Tree
Developers update files in the native directory
and use batch files to propagate the changes to
the other language directories.
All files that need to be customized based on
language are in the resources directory.
24No-Compile Strategy
- Core source code doesn't require recompiling
every time an international edition of a product
is created.
- Compile your main executable only once. To create
localized editions, you compile only the
localized resource files and link them to the
executable or to a separate DLL.
- If your program is not based on Unicode, you may
need one EXE for SBCS and one EXE for DBCS.
25Localization
The process of adapting a program for a
specific international market, which includes
translating the user interface, resizing dialog
boxes, customizing features, and testing results
to ensure that the program retains the same
functionality and performance. It is not just
a translation process. The following will focus
on non-translation issues that may impact design
or coding.
26Locale-Specific Coding Consideration
- Windows operating systems (Win32 NLS API) and
development tools may already provide the API
support you need, so avoid writing proprietary
sort, case, or character property tables in your
code unless your system or development tools do
not support them.
- If you are concerned about the overhead of
continually calling the API, call it at startup
time to create static tables.
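The startup-table idea can be sketched as follows. This is a portable stand-in rather than the Win32 NLS API: the standard C `toupper` plays the role of the system call, and `init_case_table` and `fast_upper` are hypothetical names. The table is filled once from the library, so no proprietary case data is embedded in the program.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

static unsigned char g_upper[UCHAR_MAX + 1];
static int g_table_ready = 0;

/* Call once at startup, after the program has selected its locale.
   Every entry comes from the C library, not from hand-written data. */
void init_case_table(void)
{
    for (int c = 0; c <= UCHAR_MAX; c++)
        g_upper[c] = (unsigned char)toupper(c);
    g_table_ready = 1;
}

/* Table lookup that replaces repeated library calls in hot loops. */
unsigned char fast_upper(unsigned char c)
{
    assert(g_table_ready);
    return g_upper[c];
}
```

The trade-off is that the table reflects the locale in effect when it was built, so it must be rebuilt if the program switches locale at run time.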
27Locale-Specific Issues (1 of 2)
- Character set (Code Page) and Font
- Date and Time Format
- Calendar Format
- Currency and Number Format
- First Name and Last Name Format
- Address Format
- Phone Number
- Culturally or Politically Sensitive Issues
- Word or Phrase search
28Locale-Specific Issues (2 of 2)
- Word-wrap (Line Breaking)
- Character Sorting
- Many languages don't have upper / lower cases
- Unique ID for the citizens of the country
- Laws, government regulation, taxes
- Dependent hardware environment availability
- Dependent software environment availability
29DBCS-Enabling
The method of adapting a western-language
based program to be able to display, input,
store, retrieve and process double-byte
characters (in Japanese, Traditional Chinese,
Simplified Chinese and Korean) correctly. It
could be included in the internationalization or
even the localization effort. In short,
DBCS-enabling mostly concerns C programs.
30What is DBCS
- DBCS - Double Byte Character Set
- Traditional Chinese, Simplified Chinese, Korean
and Japanese each have more than 256 characters,
so one byte cannot encode all of them. In
Windows, two bytes are used.
- Many Chinese characters were borrowed or adapted
for Japanese (Kanji) and Korean (Hanja) a long
time ago.
31Lead-byte and Trail-byte Ranges for DBCS Code
Pages
32Potential DBCS-Enabling Effort (1 of 4)
- Input
- Make sure Input Method Editor (IME) can be
activated for DBCS data input.
- Display
- Font names, size and character set are changed
for DBCS data.
- Store
- Database field types are OK for DBCS data.
- Retrieve
- Buffer length unit should be consistent (byte
or character) or conversion is required.
33Potential DBCS-Enabling Effort (2 of 4)
- Search
- A DBCS character must be distinguished from an
SBCS character when searching for a delimiter
(such as \) in a string (such as a path)
- Compare
- A comparison should work on a character basis,
not on a byte basis.
- Truncate/Concatenate
- Should work on a character basis
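The delimiter problem can be made concrete with code page 932 (Japanese Shift-JIS), where the byte 0x5C (`\`) is also a valid trail byte. The sketch below is a simplified illustration with hypothetical function names and hard-coded lead-byte ranges. It finds the last real path separator by walking the string character by character.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte ranges of code page 932, hard-coded for illustration only. */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns a pointer to the last backslash that is a real separator,
   or NULL if there is none. A 0x5C byte that is the trail byte of a
   double-byte character is skipped, not matched. */
const char *find_last_separator_cp932(const char *path)
{
    assert(path != NULL);
    const char *last = NULL;
    const char *p = path;
    while (*p != '\0') {
        if (is_lead_byte_cp932((unsigned char)*p) && p[1] != '\0') {
            p += 2;              /* step over the whole character */
            continue;
        }
        if (*p == '\\')
            last = p;
        p++;
    }
    return last;
}
```

For a path whose final component ends in the double-byte character 0x95 0x5C, a byte-wise search for the last `\` lands on the trail byte, while the scan above returns the separator before that component.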
34Potential DBCS-Enabling Effort (3 of 4)
- Cursor
- In cursor placement and cursor movement, the
cursor should never stay in the middle of a
double-byte character.
- Locale-specific issues
- Sorting, line breaking, font, etc. mentioned
in the previous section.
- Code Conversion
- Conversion to and from Internet mail format,
locale-specific standards (e.g. EUC-JP) and
Unicode standards.
35Potential DBCS-Enabling Effort (4 of 4)
- Case conversions
- No upper/lower case conversions for DBCS
data.
- Third-party code
- If there is third-party code that causes
double-byte issues, solutions need to be
implemented.
36Dual Compilation
- #ifdef DBCS
- ...
- #else
- ...
- #endif
- DBCS-enabled code doesn't affect base code
- Creates a dual code base that you have to compile,
test and maintain separately
37Input Method Editor in DBCS Windows
38Some DBCS-Enabling Notes on non-C/C++ products
- Input Method Editor can be activated in VB,
Delphi, browser, etc.
- DBCS character display can be enabled by
specifying a proper font
- Font Association capability in non-Japanese DBCS
languages usually confuses the DBCS-enabling
effort
- With Unicode built in, Java programs perform the
conversion between Unicode and code page
implicitly
39Unicode
Defines all available characters of the
languages in the world as two-byte codes (under
Windows) so that every string of data has the
same interpretation across operating systems in
different languages. Example: a given
character has different code values under
the Windows code pages of Japanese, Korean,
Simplified Chinese and Traditional Chinese, but
its code value in Unicode is consistent.
40Advantages of Unicode
- Sort and process international characters
efficiently (lists, database indexes, network
user names, etc.)
- Eliminate the code to handle multiple code pages
and double-byte character sets
- Code that works for more than one language gets
thoroughly tested in the process of releasing the
first language edition
- Unicode provides details for characters with
semantic rules that can simplify text layout
- Unicode does not significantly increase file size
41Disadvantages of Unicode
- Unicode doesn't help with complex text-based
operations
- sorting
- hyphenation
- line breaking
- Unicode is not currently supported in many
applications and fonts
- Unicode fonts with DBCS characters are not built
in under Windows 95/98/NT
42Unicode on Windows NT
- Handles characters internally in Unicode
- Supports all of the wide-character variants of
the Win32 API
- GDI processes all text in Unicode
- Resource compiler compiles strings into Unicode
- System information files are stored as Unicode
- NTFS filenames are always in Unicode
- Exchanges data on the network in Unicode format
with other Unicode machines
43Potential Unicode Implementation Issues for Asian
Products (1 of 2)
- Some performance impact due to heavy conversions
- For non-Unicode programs
- Convert back to code page for display (no
DBCS Unicode font)
- Buffer length unit for Unicode is always 2 bytes
(one Unicode character). This is not the same as
in non-Unicode applications.
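The difference in length units can be sketched with the same three-character text in both encodings. The helper names are hypothetical, and the sketch follows the slide's assumption that each Unicode character occupies exactly one 16-bit unit. In code page 932 the byte count and the character count differ, while in 16-bit Unicode the byte count is always twice the character count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lead-byte ranges of code page 932, hard-coded for illustration only. */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Number of characters (not bytes) in a code page 932 string. */
size_t cp932_char_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    while (*s != '\0') {
        s += (is_lead_byte_cp932((unsigned char)*s) && s[1] != '\0') ? 2 : 1;
        count++;
    }
    return count;
}

/* Number of 16-bit units in a zero-terminated Unicode string. */
size_t utf16_length(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Two Japanese characters followed by `A` take 5 bytes for 3 characters in code page 932, and 3 units (6 bytes) in 16-bit Unicode, so a buffer size must always state which unit it is counted in.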
44Potential Unicode Implementation Issues for Asian
Products (2 of 2)
- Some code values in code pages don't have
round-trip conversions.
- For a database with data in multiple languages
where Unicode data is stored, the access program
needs to be able to get code page information and
convert the Unicode data.
- For a database with data in multiple languages
where code page data is stored, the database needs
to have a language type field so that the access
program can use it for conversion.
45Q & A
Questions?