Creating MultiLingual and MultiLocale Databases - PowerPoint PPT Presentation

1 / 33

About This Presentation

Title:

Creating MultiLingual and MultiLocale Databases

Description:

Creating complex systems in a global environment requires more than internationalized code. ... Create associated 'installed locale' records at the hub or ... – PowerPoint PPT presentation

Number of Views:169

Avg rating:3.0/5.0

Slides: 34

Provided by: timwoodsco

Category:

more less

Transcript and Presenter's Notes

Title: Creating MultiLingual and MultiLocale Databases

1
Creating Multi-Lingual and Multi-Locale Databases

International Unicode Conference 19
Presented by Addison Phillips
Globalization Architect
webMethods, Inc.

2
Introduction

Audience Beginning Developers
Presenter Addison Phillips Globalization
Architect webMethods, Inc. mailtoaphillips_at_webm
ethods.com
Presentation http//www.inter-locale.com
Creating complex systems in a global environment
requires more than internationalized code. Since
most Enterprise system rely on relational
databases, a global-ready system must also
consider database design in order to be truly
effective.

3
Our Problem

This presentation is based on the lessons learned
in developing and deploying a B2B conversation
management system (webMethods for Trading
Networks) and partnership management software
(webMethods PartnerConnect).
The products we created share a central database
that allow webMethods customers to manage their
B2B trading partnerships.
Terminology
A trading partner is a company that you want to
do business with.
An initiative is a specific opportunity to work
together.

4
Trading Networks

Companies need to store information about
transactions and business relationships world
wide and in real time.
We call this Global Business Visibility

5
Partner Connect Goals

Centrally served (one instance).
Centrally managed (initiatives can be deployed
anywhere).
Localized (so partners can interact with
initiatives in their own language).
Cultural and market sensitivity (customized to
fit different market conditions locally).
Created and managed by the customer entirely
through HTML interface.

6
Profile and Conversation Management
7
Enter the Database

Serve both global and local initiatives from a
single instance.
Store data in multiple writing systems (scripts,
languages).
Provide for actual differences in the data due to
user location (locale).
Provide for localization of global content.
Provide for local content management.

8
Basic Rules for a Global DB Schema

Expand fields to support changes in character
encoding.
Expand fields to support differences in the
storage requirements of other locales (cultural
or linguistic expansion, as opposed to encoding)
Classify data as locale-neutral,
locale-intrinsic, or locale-related and
re-normalize the tables accordingly.
Create efficient access to both global and
locale-specific information.

9
Selecting an Encoding

If a database instance will only serve a single
locale (or compatible locales), then the
character encoding can be selected based on local
requirements (legacy encoding).
If the database must store data from many
locales (or incompatible writing systems), then
the character encoding selected must be a Unicode
encoding.
Each database vendor has a unique approach to
this.
Encodings vary in terms of performance and
capability.
Generally the two choices you have are
UTF-8
UTF-16 (formerly known as UCS-2)

10
Character Encodings and DDL

Each database vendor provides their own encoding
support.
Most support legacy encodings and their
variants.
Need a Unicode encoding to support multiple
languages (globally)
Each vendor handles Unicode encodings differently.

CREATE TABLE Address (
cust_id number,
attn varchar(50),
department varchar(50),
street1 varchar(50),
street2 varchar(50),
city varchar(50),
state char(2),
zip varchar(5),
country varchar(18))

11
Example Cloudscape

Cloudscape is a pure Java database.
Uses java.lang.String objects to store char and
varchar data, so all string data is stored as
UCS-2.
(1) java.lang.Character equals (1) unit in DDL

CREATE TABLE Address (
cust_id number,
attn varchar(50),
department varchar(50),
street1 varchar(50),
street2 varchar(50),
city varchar(50),
state char(2),
zip varchar(5),
country varchar(18))

12
Example Oracle

Oracle provides several native Unicode encodings.
The most commonly used one is called UTF8.
Characters in UTF-8 range from one to four bytes
Char and varchar2 types are defined in bytes, not
characters.
So a varchar2(30) can reliably store 10 Unicode
2.1.8 characters (and as many as 30).
Note that a varchar2(60) is required to store
surrogate pairs.

CREATE TABLE Address (
cust_id number,
attn varchar2(150),
department varchar2(150),
street1 varchar2(150),
street2 varchar2(150),
city varchar2(150),
state char(6),
zip varchar2(15),
country varchar2(18))

13
Oracle Example

Create a table and insert values.
Notice that multibyte values take more room to
store.

14
Example MS SQL Server 2000

SQL Server 2000 provides support for the UTF-16
encoding of Unicode via the nchar and nvarchar
datatypes.
Char and varchar2 must use a legacy encoding,
with sizes defined in bytes (so a varchar(30) can
store as many as 30 and as few as 15 characters
in Shift-JIS CP932).
Nchar and nvarchar are defined in characters, so
an nvarchar(30) can store 30 characters.
Note that an nvarchar(30) can only store 15
characters beyond UFFFF.

CREATE TABLE Address (
cust_id integer,
attn nvarchar(50),
department nvarchar(50),
street1 nvarchar(50),
street2 nvarchar(50),
city nvarchar(50),
state nchar(2),
zip nvarchar (5),
country nvarchar(18))

15
SQL Server Example

Create a table.
Insert data using multibyte characters.
Insert data using single-byte characters.

16
Oracle Unicode Encoding Variations

AL24UTFFSS. The original Unicode encoding
supported by Oracle. It is not compatible with
modern Unicode and should be avoided.
UTF8. A multibyte encoding used by most versions
of Oracle. This version encodes Unicode Scalar
Values larger than UFFFF as the UTF-8 sequence
for a pair of surrogate characters.
This results in binary sorting sequences
compatible with UTF-16 representations.
This violates the Unicode shortest form
requirement (note that this is invisible to my
Java application).
(All JDBC drivers adjust the connection to use
this encoding automatically.)
AL32UTF8. A UTF-8 encoding provided in Oracle 9i
that correctly encodes Unicode Scalar Values
larger than UFFFE using the shortest form. Note
that the sorting sequence is different.
nchar/nvharchar support for UTF-16 in Oracle 9i.

17
MS SQL Server 2000 Issues

Code Page 65001 This is Microsofts code page
for UTF-8.
It can not be used as a char/varchar/text
encoding, even in the most recent versions of MS
SQL Server.
See http//support.microsoft.com/support/kb/articl
es/Q232/5/80.ASP for more information.
You can use a different encoding (by setting the
collation) for each data column, but this is not
a very convenient way to work in a global
environment.
JDBC connections to SQL Server use the JDBC-ODBC
driver. This driver cannot tell the difference
between n-types and regular types, and thus
cannot retrieve Unicode string values.
Note that this also applies to several middleware
products, notably Merant.
Note that this also applies to use of variant
text types in other databases (such as Oracle 9i).

18
Some Other Databases

Sybase
ASE supports UTF-8.
Sybase 11 supports UTF-8 via an add-on.
ASE is adding support for UTF-16 via a new data
type.
IBM DB/2
Supports UTF-16 (as CCSID 13844).
Supports UTF-8 as a database encoding.
MySQL doesnt support Unicode.
The Open Source folks need to get to work ?

19
Modifying Size Constraints

UTF-8 has a maximum number of bytes-per-character
of 4.
Vast majority of characters use 3 or less.
Older systems (JDK, for example) cannot access
the 4-byte characters.
Determine size requirements
Specific constraint
--or--
Arbitrary constraint.

Specific Size Limit
Check length using code (database fields have
variable restrictions).
For UTF-8 Multiply by 3 (or 4) bytes to get
field length.
Example varchar2(10) becomes varchar2(30).
Arbitrary Size Limit
Multiply the desired maximum by 3 bytes to get
approximate size.
Adjust according to database and performance
requirements.
Example varchar2(100) becomes varchar2(255).
Was able to store 100 characters, now a minimum
maximum of 85.

20
Cultural Data Expansion

Data also changes size (and sometimes type)
because of culture or locale.
Examples
Social Security Number is 13 digits in France
Postal code not all numeric outside USA and may
be quite long.
Different address units than State
Spanish users often have two or three middle
names.
Avoid arbitrarily small char and varchar field
lengths. Most databases optimize storage of
variable length fields.
But avoid performance killing sizes.
Oracle block size limitations.
Oracle JDBC character conversion latching
maximum (2000 bytes in 8.0.5).

21
Cultural Data Expansion

State (char2) becomes province (up to 85
characters).
ZIP code (varchar9) becomes postalcode (up to 50
characters and probably much more).
Address fields expand from 50 bytes to 255 bytes
(or about 85 characters).
Dont assume that the same fields will always
represent the exact same data values.

CREATE TABLE Address (
address_id char(24),
cust_id char(24),
contact_id char(24),
country_id char(2),
department varchar2(255),
street1 varchar2(255),
street2 varchar2(255),
city varchar2(255),
province varchar2(255),
postalcode varchar2(150))

22
Whats Left?

So far weve
Expanded storage to deal with character
encodings.
Expanded storage to deal with cultural and
linguistic data expansion.
We still need to
Allow for localization of textual elements.
Allow for relational changes due to cultural or
linguistic requirements.

23
Basic Questionnaire Table Structure
QUESTION -------- Q_ID char(24) QUES_ID char(24) T
YPE_ID char(2) QUESTION varchar(255) SEQ_NUM numbe
r
24
Structure with Localization
25
Selecting the Locale