Title: DEV10: Supporting Multiple Languages In Your Application
1. DEV-10: Supporting Multiple Languages In Your Application
Salvador Viñals
Consultant Product Manager
2. Agenda
- International support with OpenEdge 10
- OpenEdge internationalization update
- GB18030
- Sorting and Collations
- Unicode Normalization
- Default word-break tables and double-byte
- For more information, go to
- Summary
This presentation includes annotations with
additional, complementary information
3. Code-Pages and Unicode
- Code-pages
- Many code-pages
- Max 255 characters each
- Each with regionally-limited repertoire of characters
- Unicode
- Uni-code: One
- Uni-code: Universal
- Virtually all the world's characters
- Distinguishes characters by script, but not by language
- UTF-8, UTF-16, UTF-32
- Unicode binary representations (8-, 16-, and 32-bit code units)
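As a rough illustration of the code-page versus Unicode distinction (written in Python rather than ABL, purely for demonstration), the sketch below encodes one character in a single-byte code page and in the three Unicode encoding forms. The sample characters and the choice of code page 1252 are assumptions made for this example.

```python
# One character, several binary representations.
ch = "é"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

encodings = {
    "cp1252": ch.encode("cp1252"),        # single-byte Windows code page
    "utf-8": ch.encode("utf-8"),          # 8-bit code units, 1 to 4 per character
    "utf-16-be": ch.encode("utf-16-be"),  # 16-bit code units, big-endian, no BOM
    "utf-32-be": ch.encode("utf-32-be"),  # 32-bit code units, big-endian, no BOM
}

for name, raw in encodings.items():
    print(f"{name:10} {raw.hex(' ')}")

# A character outside the code page's repertoire cannot be encoded at all.
try:
    "中".encode("cp1252")
    outside_repertoire = False
except UnicodeEncodeError:
    outside_repertoire = True

print("Chinese character fits in cp1252:", not outside_repertoire)
```

The same character occupies one byte in the code page but two, two, and four bytes in UTF-8, UTF-16, and UTF-32 respectively, while the Chinese character has no representation in code page 1252.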
4. OpenEdge Products
International readiness
- OpenEdge 10 products support UTF-8 (Unicode)
- Database (Personal, Workgroup, Enterprise)
- Application Servers: AppServer, WebSpeed (Basic, Enterprise)
- GUI Clients (Client Networking, WebClient) and Batch Client
- Exceptions
- Character Client and DataServers: use code-pages instead
- Code-pages and Unicode can interoperate
5. Configurations
6. Translation Products
- Translation Manager (TranMan)
- Visual Translator (VisTran)
- Products life cycle
- Progress V9: Functionally Stable
- OpenEdge 10: Active
TranMan and VisTran run on Windows only; however, they can be used to manage translations of both ChUI and GUI applications.
8. Support for GB18030 Code Page
- Chinese code page
- Required for all new software sold in mainland China
9. Support for GB18030 Code Page
- Why is this code page unique?
- Does not fit into lead-byte / trail-byte model
- It has 1-, 2-, and 4-byte characters
- Cannot tell from lead-byte if there are 2 or 4 bytes in the character
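To make the lead-byte problem concrete, here is a minimal Python sketch (not ABL, and not how OpenEdge implements it). It relies on Python's built-in `gb18030` codec; the helper `gb18030_char_length` is a hypothetical name, assumes well-formed input, and does no range validation. It shows that the second byte, not the lead byte, decides between a 2-byte and a 4-byte character.

```python
def gb18030_char_length(data: bytes, pos: int = 0) -> int:
    """Return the byte length of the GB18030 character starting at pos.

    Simplified sketch: assumes well-formed input, no range validation.
    """
    lead = data[pos]
    if lead <= 0x7F:
        return 1          # ASCII range: single byte
    second = data[pos + 1]
    if 0x30 <= second <= 0x39:
        return 4          # a digit-range second byte marks a 4-byte character
    return 2              # otherwise it is a 2-byte character


one = "A".encode("gb18030")
two = "中".encode("gb18030")
four = "\U00020000".encode("gb18030")  # an ideograph outside the Basic Multilingual Plane

# A 2-byte character that shares its lead byte with the 4-byte character above.
same_lead_two = bytes([four[0], 0x40])

for raw in (one, two, four, same_lead_two):
    print(raw.hex(" "), "->", gb18030_char_length(raw), "byte(s)")
```

The last two sequences start with the same lead byte yet have different lengths, which is why a pure lead-byte / trail-byte table cannot describe this code page.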
10. Support for GB18030 Code Page
- Supported by making conversions of the GB18030 code page to and from UTF-8
- Requires -cpinternal to be UTF-8
- No -cpinternal for GB18030
- Reading and writing a file in GB18030
- Converts to/from UTF-8
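The same "Unicode in memory, GB18030 only at the file boundary" pattern can be sketched in Python (an analogy only, not OpenEdge code). The file name and sample text are hypothetical; the conversion is done by Python's built-in `gb18030` codec.

```python
import os
import tempfile

text = "中文 GB18030"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_gb18030.txt")  # hypothetical file name

    # Writing: Unicode text is converted to GB18030 bytes on the way out.
    with open(path, "w", encoding="gb18030") as f:
        f.write(text)

    # Raw bytes as they sit on disk.
    with open(path, "rb") as f:
        on_disk = f.read()

    # Reading: GB18030 bytes are converted back to Unicode on the way in.
    with open(path, "r", encoding="gb18030") as f:
        in_memory = f.read()

# Inside the program the text can then be handled as UTF-8.
utf8_bytes = in_memory.encode("utf-8")
print(on_disk.hex(" "))
print(utf8_bytes.hex(" "))
```

The two byte dumps differ for the Chinese characters, but both decode to the same text, so nothing is lost in the round trip.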
11. Linguistic Sorting
The goal
- Unicode sorting for UTF-8
- Language-sensitive collations
- Tailor app to expectations of locale
- Language
- Location (country, region, etc.)
- Easy to use
- Functions just like any other collation for ABL, and OpenEdge Database or SQL users
- Prior to 10.0B, UTF-8 collation was binary sort
12. Some collation examples: Latin alphabet
13. Linguistic Sorting
Internals
- OpenEdge Database meta-schema
- Table _DB-collate
- Already used for single-byte sort weights
- New functionality used for summary information
- Table _Collation
- Added in 10.0A in preparation
- Can hold any amount of collation data
14. Linguistic Sorting
- ABL Usage
- Reference collation by name
- For example, ICU-fr for French
- Specify using
- -cpcoll <table name>
- Identifies collation table to use with code page in memory at session startup
- <table name> is the collation table in convmap.cp or the name of the ICU collation
- ABL Statements
- COMPARE
- COLLATE
15. Linguistic Sorting
- COMPARE and COLLATE: new strengths supported
- 10.0A strengths: CASE-INSENSITIVE, CASE-SENSITIVE, CAPS and RAW
- Added strengths
- PRIMARY
- SECONDARY (CASE-INSENSITIVE)
- TERTIARY (CASE-SENSITIVE)
- QUATERNARY
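The first three strengths can be pictured with a toy Python model (an assumption-laden sketch, not the ICU algorithm and not ABL): PRIMARY looks at base letters only, SECONDARY adds accents but ignores case, and TERTIARY also distinguishes case. The helper names are hypothetical, and QUATERNARY is left out.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop combining marks so that only base letters remain."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def equal_at(a: str, b: str, strength: str) -> bool:
    """Toy equality test at a given comparison strength."""
    if strength == "PRIMARY":      # base letters only
        return strip_marks(a).casefold() == strip_marks(b).casefold()
    nfc_a = unicodedata.normalize("NFC", a)
    nfc_b = unicodedata.normalize("NFC", b)
    if strength == "SECONDARY":    # accents matter, case does not
        return nfc_a.casefold() == nfc_b.casefold()
    if strength == "TERTIARY":     # accents and case both matter
        return nfc_a == nfc_b
    raise ValueError(f"unsupported strength: {strength}")


for strength in ("PRIMARY", "SECONDARY", "TERTIARY"):
    print(strength,
          equal_at("Côte", "cote", strength),
          equal_at("Côte", "côte", strength))
```

In this model "Côte" and "cote" are equal only at PRIMARY, while "Côte" and "côte" are equal at PRIMARY and SECONDARY but not at TERTIARY.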
16. Linguistic Sorting
Sort order depends on selected collation
/* French collation */
DISPLAY "ICU-fr" COMPARE("côte", "<", "coté", "case-insensitive", "ICU-fr").
/* Spanish collation */
DISPLAY "ICU-es" COMPARE("côte", "<", "coté", "case-insensitive", "ICU-es").
- Output of above statements
ICU-fr yes
ICU-es no
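One way to see why the two collations can disagree is a toy Python model (not ICU, not ABL) in which base letters are compared first and accents second. The only difference between the two calls is whether accents are compared from the start of the word or from the end, the backward accent ordering traditionally associated with French. The function names are hypothetical.

```python
import unicodedata


def toy_key(word: str, backward_accents: bool = False):
    """Build a two-level key: base letters first, then per-letter accents."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch            # attach the mark to the preceding letter
        else:
            bases.append(ch.casefold())  # case is ignored in this model
            accents.append("")
    if backward_accents:
        accents.reverse()                # compare accents from the end of the word
    return ("".join(bases), accents)


def toy_less_than(a: str, b: str, backward_accents: bool = False) -> bool:
    return toy_key(a, backward_accents) < toy_key(b, backward_accents)


french_style = toy_less_than("côte", "coté", backward_accents=True)
spanish_style = toy_less_than("côte", "coté", backward_accents=False)
print("backward accents:", french_style)
print("forward accents: ", spanish_style)
```

With forward comparison the circumflex on the second letter decides first, so "côte" sorts after "coté"; with backward comparison the acute on the last letter decides first, so "côte" sorts before "coté".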
17. Linguistic Sorting
- OpenEdge uses collations for
- The -cpcoll startup parameter
- The database collation
- The collation of a database CLOB column
- An argument to the COMPARE function or COLLATE option of the BY phrase
18. Linguistic Sorting
Rules
- Once a collation is specified for the database in the _Collation table, it cannot be modified
- Once the collation is written to the _Collation table, it is the only collation with that name that can be used by that database
- It is strongly recommended that databases be backed up before using an ICU collation
19. Linguistic Sorting
Example 1 of 4
- The following examples assume
- UTF-8 database with basic collation
- Names
- beet, carrot, çedilla, entry, école, trust, zoom
FOR EACH words WHERE name < "t":
    DISPLAY name.
END.
beet carrot entry
20. Linguistic Sorting
Example 2 of 4
FOR EACH words WHERE name > "t":
    DISPLAY name.
END.
trust zoom école çedilla
21. Linguistic Sorting
Example 3 of 4
FOR EACH words WHERE COMPARE(name, "<", "t", "case-insensitive", "ICU-en"):
    DISPLAY name.
END.
beet carrot entry école çedilla
- Before, without COMPARE
beet carrot entry
22. Linguistic Sorting
Example 4 of 4
FOR EACH words WHERE COMPARE(name, "<", "t", "case-insensitive", "ICU-en")
    BY COLLATE(name, "case-insensitive", "ICU-en"):
    DISPLAY name.
END.
beet carrot çedilla école entry
- Before, without BY COLLATE
beet carrot entry école çedilla
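The difference between binary and linguistic results in these four examples can be imitated in Python (a simplified stand-in, not ICU and not ABL). The binary key compares raw UTF-8 bytes; the "linguistic" key is an assumption for illustration that compares base letters with accents stripped.

```python
import unicodedata

words = ["beet", "carrot", "çedilla", "entry", "école", "trust", "zoom"]


def binary_key(word: str) -> bytes:
    """Binary sort: compare the raw UTF-8 bytes."""
    return word.encode("utf-8")


def linguistic_key(word: str):
    """Toy linguistic sort: base letters first, accented form as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), decomposed)


binary_before_t = sorted(
    (w for w in words if binary_key(w) < binary_key("t")), key=binary_key
)
linguistic_before_t = sorted(
    (w for w in words if linguistic_key(w) < linguistic_key("t")),
    key=linguistic_key,
)

print("binary:    ", " ".join(binary_before_t))
print("linguistic:", " ".join(linguistic_before_t))
```

The binary filter misses the two accented words because their first byte is greater than that of "t", whereas the accent-aware key both selects them and places them next to their unaccented neighbours.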
23. Linguistic Sorting
Supported Collations
- OpenEdge supports ICU collations in the icui18n library for supported OpenEdge languages
- One additional collation is supported: Japanese Hiragana Quaternary as case-sensitive
- Uses the QUATERNARY strength as the CASE-SENSITIVE strength
- ICU-ja__HQ Japanese Hiragana Quaternary
24. Linguistic Sorting: ICU Collations Available, 1 of 3
- ICU-UCA UCA (default Unicode Collation Algorithm)
- ICU-ar Arabic
- ICU-be Belarusian
- ICU-bg Bulgarian
- ICU-ca Catalan
- ICU-cs Czech
- ICU-da Danish
- ICU-de__PHONEBOOK German phonebook
- ICU-el Greek
- ICU-en_BE English Belgium
- ICU-eo Esperanto
- ICU-es Spanish
- ICU-es__TRADITIONAL Spanish traditional
- ICU-et Estonian
- ICU-fa Persian
- ICU-fi Finnish
- ICU-fr French
- ICU-gu Gujarati
25. Linguistic Sorting: ICU Collations Available, 2 of 3
- ICU-he Hebrew
- ICU-hi Hindi
- ICU-hi__DIRECT Hindi direct
- ICU-hr Croatian
- ICU-hu Hungarian
- ICU-is Icelandic
- ICU-ja Japanese
- ICU-ko Korean
- ICU-kn Kannada
- ICU-lt Lithuanian
- ICU-lv Latvian
- ICU-mk Macedonian
- ICU-mr Marathi
- ICU-mt Maltese
- ICU-nb Norwegian Bokmål
- ICU-nn Norwegian Nynorsk
- ICU-pl Polish
- ICU-ro Romanian
26. Linguistic Sorting: ICU Collations Available, 3 of 3
- ICU-ru Russian
- ICU-sh Serbo-Croatian
- ICU-sk Slovak
- ICU-sl Slovenian
- ICU-sq Albanian
- ICU-sr Serbian
- ICU-sv Swedish
- ICU-ta Tamil
- ICU-te Telugu
- ICU-th Thai
- ICU-tr Turkish
- ICU-uk Ukrainian
- ICU-vi Vietnamese
- ICU-zh Chinese
- ICU-zh__PINYIN Chinese Pinyin
- ICU-zh_HK Chinese Hong Kong
- ICU-zh_MO Chinese Macau
- ICU-zh_TW Chinese Taiwan
27. Collations Gotchas
- If Database, Clients and Servers use different collations (-cpcoll), indexed and non-indexed queries may return different results
- If a client needs a different collation than the database, you can use COMPARE, COLLATE on the client
- Performance impact with large result sets
28. Configuration Gotchas
Typical character client configuration, 1/2
- Database code-page is 1252 on Windows server
- OpenEdge install startup.pf setting is
- -cpinternal 1252 -cpstream 1252
- French Windows Client with
- a default Windows code page of 1252, and
- a DOS system code page of ibm850
- DOS Character Client starts without specifying -cpinternal and -cpstream
- so uses 1252 from startup.pf
29. Configuration Gotchas
Typical character client configuration, 2/2
- User enters è (Hex 8A in ibm850)
- Since the session is started with -cpinternal 1252, OpenEdge doesn't convert when writing to the database
- The entered value is written to the database as 8A, when it should be E8 (1252)
- Solution: start the Character Client with -cpinternal and -cpstream set to ibm850
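The byte values in this scenario can be checked with a short Python sketch (an illustration of the code-page arithmetic only, not of OpenEdge itself), using Python's built-in `cp850` and `cp1252` codecs. The variable names are hypothetical.

```python
typed = "è"

# The DOS console (code page ibm850) delivers this byte for the keystroke.
keyboard_bytes = typed.encode("cp850")

# Misconfigured session: it believes the input is already 1252,
# so the byte is stored unchanged.
stored_wrong = keyboard_bytes
seen_by_1252_client = stored_wrong.decode("cp1252")

# Correctly configured session: input is interpreted as ibm850 and
# converted to the database code page 1252.
stored_right = keyboard_bytes.decode("cp850").encode("cp1252")

print("keyboard byte:      ", keyboard_bytes.hex())
print("stored (wrong):     ", stored_wrong.hex(), "shown as", seen_by_1252_client)
print("stored (converted): ", stored_right.hex())
```

Byte 8A means è in ibm850 but a different letter in 1252, so any 1252 client reading the unconverted value displays the wrong character.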
30. Unicode Normalization
What is Normalization?
- Unicode has different ways of expressing the same characters
- Decomposed
- A (U+0041, Latin Capital Letter A)
- followed by U+0301 (Combining Acute Accent)
- Composed
- Á (U+00C1, Latin Capital Letter A with Acute)
31. Unicode Normalization
Why Normalization?
- XML (and other W3C entities) expects data in NFC form
- Best way to convert from Unicode to other code pages
- Useful when doing tasks such as making comparisons
NFC: Canonical Decomposition, followed by Canonical Composition
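The comparison problem is easy to reproduce outside ABL. The Python sketch below uses the standard `unicodedata` module (not the ABL NORMALIZE function) to show that the decomposed and composed spellings of Á are different strings until both are brought to the same normalization form.

```python
import unicodedata

decomposed = "A\u0301"  # U+0041 followed by U+0301 (combining acute accent)
composed = "\u00C1"     # U+00C1 (Latin capital letter A with acute)

# Without normalization the two spellings do not compare equal.
equal_raw = decomposed == composed

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print("equal before normalization:", equal_raw)
print("equal after NFC:", nfc == composed)
print("code points in NFC:", len(nfc), "in NFD:", len(nfd))
```

Normalizing both sides to NFC (or both to NFD) before comparing removes the mismatch.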
32. Unicode Normalization
NORMALIZE Language Function
- NORMALIZE
- Returns either CHAR or LONGCHAR
- Return type matches the source string
- CHAR variable must be UTF-8
- LONGCHAR variable can be any form of Unicode
- UTF-8, UTF-16, UTF-32
result-string = NORMALIZE(source-string, normalization-mode)
33. Normalization Modes Supported
Normalization modes from ICU library
- NFD: Canonical Decomposition
- NFC: Canonical Decomposition, followed by Canonical Composition (default)
- NFKD: Compatibility Decomposition
- NFKC: Compatibility Decomposition, followed by Canonical Composition
- None: No change to source string. Turns off normalization when normalization-mode is a variable
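The four Unicode forms can be compared side by side with Python's `unicodedata.normalize` (a sketch of the standard forms, not of the ABL function). The sample string, a "fi" ligature followed by Á, is an assumption chosen because the compatibility forms change the ligature while the canonical forms do not.

```python
import unicodedata

sample = "\uFB01\u00C1"  # U+FB01 (fi ligature) followed by U+00C1 (Á)

forms = {mode: unicodedata.normalize(mode, sample)
         for mode in ("NFD", "NFC", "NFKD", "NFKC")}

for mode, value in forms.items():
    code_points = " ".join(f"U+{ord(c):04X}" for c in value)
    print(f"{mode:5} {code_points}")
```

The canonical forms (NFD, NFC) keep the ligature and only split or join the accent; the compatibility forms (NFKD, NFKC) additionally replace the ligature with the plain letters f and i.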
34. Unicode Normalization
Additional information
- Unicode Normalization Forms
- Recommended for understanding normalization forms used with NORMALIZE function
- http://www.unicode.org/unicode/reports/tr15/
- International Components for Unicode (ICU) libraries: globalization, in-depth information
- http://icu.sourceforge.net/userguide/intro.html
35. Default Word-Break Tables
- Prior to 10.1A
- User had to configure word-break tables for use with double-byte and UTF-8 databases
36. Default Word-Break Tables
10.1A simplifies implementing double-byte databases
- Default Word-Break Tables added for
- Double-byte
- UTF-8 Databases
- These are available out of the box
- Either in product or for download
- Simplifies accessing non-single-byte databases
37. Default Word-Break Tables
10.1A simplifies implementing double-byte databases
- 10.1A provides 10 compiled files
- See list on next slide
- Ranging from proword.245 to proword.254
- Located in subdirectory with corresponding empty databases
- Subdirectory prolang/<language>
38. Default Word-Break Tables: Compiled, Available out of the box
10.1A simplifies implementing double-byte databases
- Available as part of the Supplemental PROMSGS package
- Available for download
- Japanese SHIFT-JIS proword.253
- Japanese EUCJIS proword.250
- Korean CP949 proword.248
- Korean KSC5601 proword.252
- Chinese (simplified) CP936 proword.247
- Chinese (simplified) GB2312 proword.251
- Chinese (traditional) CP950 proword.249
- Chinese (traditional) BIG-5 proword.246
- Chinese (traditional) CP950-HKSCS proword.245
- UTF-8 proword.254
39. Default Word-Break Tables
- What if you are already using a proword file in the range of 245 to 254?
- Copy the file to proword.<nnn>
- Where <nnn> is less than 240
- Apply word rule to the database
- No index-build is required for this change
- Remember, apply the change in all tiers (Client, Server, Database) to prevent corruption!
41. For More Information, go to
- Expand to New Countries Business Empowerment Program
- Contact your Account Manager
- Product documentation
- OpenEdge Development: Internationalizing Applications
- OpenEdge Development: Visual Translator
- OpenEdge Development: Translation Manager
- Visit PSDN for white papers and presentations, for example
- Understanding Internationalization web seminar
- Training and Professional Services
www.progress.com
43. In Summary
- Use UTF-8
- GB18030
- Linguistic Sorting and Collations
- Use ICU-
- Unicode Normalization
- Default word-break tables and double-byte
- Expand to New Countries Business Empowerment Program
44. Questions?
45. Thank you for your time