DEV10: Supporting Multiple Languages In Your Application

Slides: 47

Provided by: PSC64

Transcript and Presenter's Notes

1
DEV-10 Supporting Multiple Languages In Your
Application
Salvador Viñals
Consultant Product Manager
2
Agenda

International support with OpenEdge 10
OpenEdge internationalization update
GB18030
Sorting and Collations
Unicode Normalization
Default word-break tables and double-byte
For more information, go to
Summary

This presentation includes annotations with
additional, complementary information
3
Code-Pages and Unicode

Code-pages
Many code-pages
Max 255 characters each
Each with regionally-limited repertoire of
characters
Unicode
Uni code = One
Uni code = Universal
Virtually all the world's characters
Distinguishes characters by script, but not by
language.
UTF-8, UTF-16, UTF-32
Unicode binary representations (8,16,32 bits)
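
The difference between the three encoding forms can be seen by counting bytes. The following is a minimal Python sketch (Python is used here only as a neutral illustration, not as part of OpenEdge); the sample characters and the helper name `encoded_sizes` are assumptions chosen for the example.

```python
# Encode the same characters in the three Unicode encoding forms and
# count how many bytes each form needs.
samples = ["A", "é", "中", "😀"]


def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # "-le" avoids a byte-order mark
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 is variable-width (1 to 4 bytes), UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes per character.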

4
OpenEdge Products
International readiness

OpenEdge 10 products support UTF-8 (Unicode)
Database (Personal, Workgroup, Enterprise)
Application Servers: AppServer, WebSpeed (Basic, Enterprise)
GUI Clients (Client Networking, WebClient) and Batch Client
Exceptions
Character Client and DataServers: use code-pages instead
Code-pages and Unicode can interoperate

5
Configurations
6
Translation Products

Translation Manager (TranMan)
Visual Translator (VisTran)
Product life cycle
Progress V9: Functionally Stable
OpenEdge 10: Active

TranMan and VisTran run on Windows only; however, they can be used to manage translations of ChUI or GUI applications.
7
Agenda

International support with OpenEdge 10
OpenEdge internationalization update
GB18030
Sorting and Collations
Unicode Normalization
Default word-break tables and double-byte
For more information, go to
Summary

This presentation includes annotations with
additional, complementary information
8
Support for GB18030 Code Page

Chinese code page
Required for all new software sold in mainland
China

9
Support for GB18030 Code Page

Why is this code page unique?
Does not fit into lead-byte / trail-byte model
It has 1-, 2-, and 4-byte characters
Cannot tell from lead-byte if there are 2 or 4
bytes in the character
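
To illustrate why the lead byte alone is not enough, here is a small Python sketch that relies on Python's built-in `gb18030` codec. The helper `gb18030_char_length` is hypothetical and written for this example; it is not how OpenEdge handles the code page internally.

```python
def gb18030_char_length(data, i=0):
    """Length in bytes of the GB18030 character starting at data[i]."""
    lead = data[i]
    if lead < 0x81:
        return 1  # single-byte, ASCII-compatible range
    # Lead bytes 0x81-0xFE start both two-byte and four-byte characters.
    # Only the second byte tells them apart: 0x30-0x39 means four bytes.
    second = data[i + 1]
    return 4 if 0x30 <= second <= 0x39 else 2


for ch in ["A", "中", "😀"]:
    encoded = ch.encode("gb18030")
    print(f"U+{ord(ch):04X}", encoded.hex(), gb18030_char_length(encoded))
```

A decoder therefore has to look at the second byte before it knows how many bytes to consume.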

10
Support for GB18030 Code Page

Supported by making conversions of the GB18030
code page to and from UTF-8
Requires -cpinternal to be UTF-8
There is no -cpinternal setting for GB18030
Reading and writing a file in GB18030
Converts to/from UTF-8
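
The same idea (keep UTF-8 internally, convert only at the file boundary) can be sketched outside OpenEdge. This Python example is an illustration under that assumption; the function names and the temporary file are made up for the sketch.

```python
import os
import tempfile


def write_utf8_as_gb18030(utf8_bytes, path):
    """Write UTF-8 data to a file encoded in GB18030."""
    with open(path, "w", encoding="gb18030", newline="") as f:
        f.write(utf8_bytes.decode("utf-8"))


def read_gb18030_as_utf8(path):
    """Read a GB18030 file and return its content as UTF-8 bytes."""
    with open(path, "r", encoding="gb18030", newline="") as f:
        return f.read().encode("utf-8")


original = "OpenEdge 中文 😀\n".encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    write_utf8_as_gb18030(original, path)
    round_trip = read_gb18030_as_utf8(path)

print(round_trip == original)
```

Because GB18030 covers the full Unicode repertoire, the round trip through UTF-8 loses nothing.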

11
Linguistic Sorting
The goal

Unicode sorting for UTF-8
Language-sensitive collations
Tailor app to expectations of locale
Language
Location (country, region, etc.)
Easy to use
Functions just like any other collation for ABL,
and OpenEdge Database or SQL users
Prior to 10.0B, UTF-8 collation was a binary sort

12
Some collation examples: Latin alphabet
13
Linguistic Sorting
Internals

OpenEdge Database meta-schema
Table _DB-collate
Already used for single-byte sort weights
New functionality used for summary information
Table _Collation
Added in 10.0A in preparation
Can hold any amount of collation data

14
Linguistic Sorting

ABL Usage
Reference collation by name
For example ICU-fr for French
Specify using
-cpcoll <table name>
Identifies collation table to use with code page in memory at session startup
<table name> is the collation table in convmap.cp or the name of the ICU collation
ABL Statements
COMPARE
COLLATE

15
Linguistic Sorting

COMPARE and COLLATE: new strengths supported
10.0A strengths: CASE-INSENSITIVE, CASE-SENSITIVE, CAPS and RAW
Added strengths
PRIMARY
SECONDARY = CASE-INSENSITIVE
TERTIARY = CASE-SENSITIVE
QUATERNARY

16
Linguistic Sorting
Sort order depends on selected collation
/* French collation */
DISPLAY "ICU-fr" COMPARE("côte", "lt", "coté", "case-insensitive", "ICU-fr").
/* Spanish collation */
DISPLAY "ICU-es" COMPARE("côte", "lt", "coté", "case-insensitive", "ICU-es").

Output of above statements

ICU-fr yes ICU-es no
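
One way to see why the two collations can disagree is that French collation has traditionally compared accents starting from the end of the word, while Spanish compares them from the start. The Python sketch below is a deliberately simplified stand-in for that rule, not the ICU algorithm; `collation_key` and `less_than` are hypothetical helpers written for this illustration.

```python
import unicodedata


def collation_key(word, backwards_accents):
    """Two-level key: base letters first, then the accents on each letter."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch  # attach the mark to the preceding letter
        else:
            base.append(ch.lower())
            accents.append("")
    if backwards_accents:
        accents.reverse()  # French-style: the last accent decides first
    return ("".join(base), tuple(accents))


def less_than(a, b, backwards_accents):
    return collation_key(a, backwards_accents) < collation_key(b, backwards_accents)


print("French-style ", less_than("côte", "coté", backwards_accents=True))
print("Spanish-style", less_than("côte", "coté", backwards_accents=False))
```

Both words share the base letters "cote", so the accent level decides, and the direction in which accents are compared flips the answer.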
17
Linguistic Sorting

OpenEdge uses collations for
The -cpcoll startup parameter
The database collation
The collation of a database CLOB column
An argument to the COMPARE function or COLLATE
option of the BY phrase

18
Linguistic Sorting
Rules

Once a collation is specified for the database in
the _Collation table, it cannot be modified
Once the collation is written to the _Collation
table, it is the only collation with that name
that can be used by that database
It is strongly recommended that databases should
be backed up before using an ICU collation

19
Linguistic Sorting
Example 1 of 4

The following examples assume
UTF-8 database with basic collation
Names
beet, carrot, çedilla, entry, école, trust, zoom

FOR EACH words WHERE name < "t": DISPLAY name. END.

Output result

beet carrot entry
20
Linguistic Sorting
Example 2 of 4
FOR EACH words WHERE name > "t": DISPLAY name. END.

Output result

trust zoom école çedilla
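
The results of Examples 1 and 2 follow from how the accented letters are encoded. As a rough Python illustration, assuming the basic collation behaves like a byte-wise comparison of the UTF-8 data for these lowercase words, the same split can be reproduced as follows (`utf8_less` is a helper made up for the sketch):

```python
words = ["beet", "carrot", "çedilla", "entry", "école", "trust", "zoom"]


def utf8_less(a, b):
    """Compare two strings by their raw UTF-8 bytes (a binary sort)."""
    return a.encode("utf-8") < b.encode("utf-8")


below_t = [w for w in words if utf8_less(w, "t")]
above_t = [w for w in words if utf8_less("t", w)]

print("name < t:", below_t)
print("name > t:", above_t)
```

In UTF-8, "ç" and "é" start with byte 0xC3, which is greater than any ASCII letter, so both words land after "t" and even after "zoom".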
21
Linguistic Sorting
Example 3 of 4
FOR EACH words WHERE COMPARE(name, "<", "t", "case-insensitive", "ICU-en"): DISPLAY name. END.

Output result
beet carrot entry école çedilla

Before, without COMPARE
beet carrot entry
22
Linguistic Sorting
Example 4 of 4
FOR EACH words WHERE COMPARE(name, "<", "t", "case-insensitive", "ICU-en") BY COLLATE(name, "case-insensitive", "ICU-en"): DISPLAY name. END.

Output result
beet carrot çedilla école entry

Before, without BY COLLATE
beet carrot entry école çedilla
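
The effect of Examples 3 and 4 can be approximated by comparing on base letters only. The Python sketch below strips accents to build a sort key; this is a simplified stand-in for a language-sensitive collation, not the ICU-en rules, and `primary_key` is a hypothetical helper.

```python
import unicodedata

words = ["beet", "carrot", "çedilla", "entry", "école", "trust", "zoom"]


def primary_key(word):
    """Lower-cased word with combining accents removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


# Selection (like COMPARE in the WHERE clause), then ordering (like BY COLLATE).
selected = [w for w in words if primary_key(w) < "t"]
ordered = sorted(selected, key=primary_key)

print(ordered)
```

With accents ignored for ordering, "çedilla" sorts with the c words and "école" with the e words, which matches the ordering shown for Example 4.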
23
Linguistic Sorting
Supported Collations

OpenEdge supports ICU collations in the icui18n
library for supported OpenEdge languages

One additional collation is supported - Japanese
Hiragana Quaternary as case-sensitive
Uses the QUATERNARY strength as the
CASE-SENSITIVE strength

ICU-ja__HQ Japanese Hiragana Quaternary

24
Linguistic Sorting
ICU Collations Available 1 of 3

ICU-UCA UCA (default Unicode Collation Algorithm)
ICU-ar Arabic
ICU-be Belarusian
ICU-bg Bulgarian
ICU-ca Catalan
ICU-cs Czech
ICU-da Danish
ICU-de__PHONEBOOK German phonebook
ICU-el Greek
ICU-en_BE English Belgium
ICU-eo Esperanto
ICU-es Spanish
ICU-es__TRADITIONAL Spanish traditional
ICU-et Estonian
ICU-fa Persian
ICU-fi Finnish
ICU-fr French
ICU-gu Gujarati

25
Linguistic Sorting
ICU Collations Available 2 of 3

ICU-he Hebrew
ICU-hi Hindi
ICU-hi__DIRECT Hindi direct
ICU-hr Croatian
ICU-hu Hungarian
ICU-is Icelandic
ICU-ja Japanese
ICU-ko Korean
ICU-kn Kannada
ICU-lt Lithuanian
ICU-lv Latvian
ICU-mk Macedonian
ICU-mr Marathi
ICU-mt Maltese
ICU-nb Norwegian Bokmål
ICU-nn Norwegian Nynorsk
ICU-pl Polish
ICU-ro Romanian

26
Linguistic Sorting
ICU Collations Available 3 of 3

ICU-ru Russian
ICU-sh Serbo-Croatian
ICU-sk Slovak
ICU-sl Slovenian
ICU-sq Albanian
ICU-sr Serbian
ICU-sv Swedish
ICU-ta Tamil
ICU-te Telugu
ICU-th Thai
ICU-tr Turkish
ICU-uk Ukrainian
ICU-vi Vietnamese
ICU-zh Chinese
ICU-zh__PINYIN Chinese Pinyin
ICU-zh_HK Chinese Hong Kong
ICU-zh_MO Chinese Macau
ICU-zh_TW Chinese Taiwan

27
Collations Gotchas

If Database, Clients and Servers use different
collations (-cpcoll), indexed and non-indexed
queries may return different results
If a client needs different collation than
database, you can use COMPARE, COLLATE on the
client
Performance impact with large result sets

28
Configuration Gotchas
Typical character client configuration, 1/2

Database code-page is 1252 on Windows server
OpenEdge install startup.pf setting is
-cpinternal 1252 -cpstream 1252
French Windows Client with
a default Windows code page of 1252, and
a DOS system code page of ibm850
DOS Character Client starts without specifying
-cpinternal and -cpstream
so uses 1252 from startup.pf

29
Configuration Gotchas
Typical character client configuration, 2/2

User enters è (hex 8A in ibm850)
Since the session is started with -cpinternal 1252, OpenEdge doesn't convert when writing to the database.
The entered value is written to the database as 8A, when it should be E8 (1252)
To fix this, start the Character Client with -cpinternal and -cpstream set to ibm850
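
The byte values in this scenario can be checked with any tool that knows both code pages. The following Python sketch reproduces the mismatch using the standard `cp850` and `cp1252` codecs; the variable names are illustrative only, and the sketch models the conversion step, not the OpenEdge client itself.

```python
typed = "è"

# The DOS console delivers the character in the system code page, ibm850.
console_bytes = typed.encode("cp850")

# Misconfigured session: the bytes are assumed to be 1252 already,
# so they are stored in the 1252 database without conversion.
stored_wrong = console_bytes

# Correctly configured session: the input is declared as ibm850 and
# converted to the database code page 1252.
stored_right = console_bytes.decode("cp850").encode("cp1252")

print("console:", console_bytes.hex())
print("wrong  :", stored_wrong.hex(), "reads back as", stored_wrong.decode("cp1252"))
print("right  :", stored_right.hex(), "reads back as", stored_right.decode("cp1252"))
```

The unconverted byte 8A is a valid but different character in 1252, so the corruption is silent: no error is raised, the data is simply wrong.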

30
Unicode Normalization
What is Normalization?

Unicode has different ways of expressing the same
characters
Decomposed
Á = U+0041 (Latin Capital Letter A) + U+0301 (Combining Acute Accent)
Composed
Á = U+00C1 (Latin Capital Letter A with Acute)

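The two representations look identical on screen but are different code point sequences. A short Python sketch using the standard `unicodedata` module (shown as an illustration of the Unicode behaviour, not of the ABL function) makes the difference visible:

```python
import unicodedata

decomposed = "A\u0301"  # U+0041 followed by U+0301
composed = "\u00C1"     # U+00C1

# The raw strings differ even though they render the same.
print(decomposed == composed)

# After normalizing to a common form, they compare equal.
to_composed = unicodedata.normalize("NFC", decomposed)
to_decomposed = unicodedata.normalize("NFD", composed)
print(to_composed == composed)
print(to_decomposed == decomposed)
```

This is why comparisons on un-normalized data can miss matches that a user would consider identical.
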
31
Unicode Normalization
Why Normalization?

XML (and other W3C entities) expects data in
NFC form
Best way to convert from Unicode to other code
pages
Useful when doing tasks such as making
comparisons

NFC: Canonical Decomposition, followed by Canonical Composition
32
Unicode Normalization
NORMALIZE Language Function

NORMALIZE
Returns either CHAR or LONGCHAR, matching the type of the source string
CHAR variable must be UTF-8
LONGCHAR variable can be any form of Unicode
UTF-8, UTF-16, UTF-32

result-string = NORMALIZE(source-string, normalization-mode)
33
Normalization Modes Supported
Normalization modes from ICU library

NFD: Canonical Decomposition
NFC: Canonical Decomposition, followed by Canonical Composition (default)
NFKD: Compatibility Decomposition
NFKC: Compatibility Decomposition, followed by Canonical Composition
None: No change to source string. Turns off normalization when normalization-mode is a variable
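
To compare what the modes do to the same input, here is a Python sketch built on `unicodedata.normalize`. The wrapper `normalize` and its handling of "NONE" are assumptions that mimic the behaviour described above; it is not the ABL NORMALIZE function.

```python
import unicodedata


def normalize(source, mode="NFC"):
    """Apply a normalization mode; "NONE" returns the source unchanged."""
    mode = mode.upper()
    if mode == "NONE":
        return source
    return unicodedata.normalize(mode, source)


# "fiancé" written with the ligature U+FB01 and a precomposed U+00E9.
sample = "\uFB01anc\u00E9"

for mode in ("NFD", "NFC", "NFKD", "NFKC", "NONE"):
    result = normalize(sample, mode)
    print(mode, len(result), " ".join(f"U+{ord(c):04X}" for c in result))
```

The canonical forms (NFD, NFC) keep the ligature and only change how the accent is stored, while the compatibility forms (NFKD, NFKC) also replace the ligature with the plain letters "f" and "i".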

34
Unicode Normalization
Additional information

Unicode Normalization Forms
Recommended for understanding normalization forms
used with NORMALIZE function
http://www.unicode.org/unicode/reports/tr15/
International Components for Unicode (ICU)
libraries globalization, in-depth information
http://icu.sourceforge.net/userguide/intro.html

35
Default Word-Break Tables

Prior to 10.1A
User had to configure word-break tables for use
with double-byte and UTF-8 databases

36
Default Word-Break Tables
10.1A simplifies implementing double-byte
databases

Default Word-Break Tables added for
Double-byte
UTF-8 Databases
These are available out of the box
Either in product or for download
Simplifies accessing non-single-byte databases

37
Default Word-Break Tables
10.1A simplifies implementing double-byte
databases

10.1A provides 10 compiled files
See list on next slide
Ranging from proword.245 to proword.254
Located in subdirectory with corresponding empty
databases
Subdirectory prolang/<language>

38
Default Word-Break Tables
Compiled, available out of the box
10.1A simplifies implementing double-byte
databases

Available as part of the Supplemental PROMSGS
package
Available for download
Japanese SHIFT-JIS proword.253
Japanese EUCJIS proword.250
Korean CP949 proword.248
Korean KSC5601 proword.252
Chinese (simplified) CP936 proword.247
Chinese (simplified) GB2312 proword.251
Chinese (traditional) CP950 proword.249
Chinese (traditional) BIG-5 proword.246
Chinese (traditional) CP950-HKSCS proword.245
UTF-8 proword.254

39
Default Word-Break Tables

What if you are using a proword file in the range of 245-254?
Copy the file to proword.<nnn>
Where <nnn> is less than 240
Apply word rule to the database
No index-build is required for this change
Remember, apply the change in all tiers (Client,
Server, Database) to prevent corruption!

40
Agenda