Title: Creating MultiLingual and MultiLocale Databases
1Creating Multi-Lingual and Multi-Locale Databases
- International Unicode Conference 19
- Presented by Addison Phillips
- Globalization Architect
- webMethods, Inc.
2Introduction
- Audience Beginning Developers
- Presenter Addison Phillips Globalization
Architect webMethods, Inc. mailtoaphillips_at_webm
ethods.com - Presentation http//www.inter-locale.com
- Creating complex systems in a global environment
requires more than internationalized code. Since
most Enterprise system rely on relational
databases, a global-ready system must also
consider database design in order to be truly
effective.
3Our Problem
- This presentation is based on the lessons learned
in developing and deploying a B2B conversation
management system (webMethods for Trading
Networks) and partnership management software
(webMethods PartnerConnect). - The products we created share a central database
that allow webMethods customers to manage their
B2B trading partnerships. - Terminology
- A trading partner is a company that you want to
do business with. - An initiative is a specific opportunity to work
together.
4Trading Networks
- Companies need to store information about
transactions and business relationships world
wide and in real time. - We call this Global Business Visibility
5Partner Connect Goals
- Centrally served (one instance).
- Centrally managed (initiatives can be deployed
anywhere). - Localized (so partners can interact with
initiatives in their own language). - Cultural and market sensitivity (customized to
fit different market conditions locally). - Created and managed by the customer entirely
through HTML interface.
6Profile and Conversation Management
7Enter the Database
- Serve both global and local initiatives from a
single instance. - Store data in multiple writing systems (scripts,
languages). - Provide for actual differences in the data due to
user location (locale). - Provide for localization of global content.
- Provide for local content management.
8Basic Rules for a Global DB Schema
- Expand fields to support changes in character
encoding. - Expand fields to support differences in the
storage requirements of other locales (cultural
or linguistic expansion, as opposed to encoding) - Classify data as locale-neutral,
locale-intrinsic, or locale-related and
re-normalize the tables accordingly. - Create efficient access to both global and
locale-specific information.
9Selecting an Encoding
- If a database instance will only serve a single
locale (or compatible locales), then the
character encoding can be selected based on local
requirements (legacy encoding). - If the database must store data from many
locales (or incompatible writing systems), then
the character encoding selected must be a Unicode
encoding. - Each database vendor has a unique approach to
this. - Encodings vary in terms of performance and
capability. - Generally the two choices you have are
- UTF-8
- UTF-16 (formerly known as UCS-2)
10Character Encodings and DDL
- Each database vendor provides their own encoding
support. - Most support legacy encodings and their
variants. - Need a Unicode encoding to support multiple
languages (globally) - Each vendor handles Unicode encodings differently.
- CREATE TABLE Address (
- cust_id number,
- attn varchar(50),
- department varchar(50),
- street1 varchar(50),
- street2 varchar(50),
- city varchar(50),
- state char(2),
- zip varchar(5),
- country varchar(18))
11Example Cloudscape
- Cloudscape is a pure Java database.
- Uses java.lang.String objects to store char and
varchar data, so all string data is stored as
UCS-2. - (1) java.lang.Character equals (1) unit in DDL
- CREATE TABLE Address (
- cust_id number,
- attn varchar(50),
- department varchar(50),
- street1 varchar(50),
- street2 varchar(50),
- city varchar(50),
- state char(2),
- zip varchar(5),
- country varchar(18))
12Example Oracle
- Oracle provides several native Unicode encodings.
The most commonly used one is called UTF8. - Characters in UTF-8 range from one to four bytes
- Char and varchar2 types are defined in bytes, not
characters. - So a varchar2(30) can reliably store 10 Unicode
2.1.8 characters (and as many as 30). - Note that a varchar2(60) is required to store
surrogate pairs.
- CREATE TABLE Address (
- cust_id number,
- attn varchar2(150),
- department varchar2(150),
- street1 varchar2(150),
- street2 varchar2(150),
- city varchar2(150),
- state char(6),
- zip varchar2(15),
- country varchar2(18))
13Oracle Example
- Create a table and insert values.
- Notice that multibyte values take more room to
store.
14Example MS SQL Server 2000
- SQL Server 2000 provides support for the UTF-16
encoding of Unicode via the nchar and nvarchar
datatypes. - Char and varchar2 must use a legacy encoding,
with sizes defined in bytes (so a varchar(30) can
store as many as 30 and as few as 15 characters
in Shift-JIS CP932). - Nchar and nvarchar are defined in characters, so
an nvarchar(30) can store 30 characters. - Note that an nvarchar(30) can only store 15
characters beyond UFFFF.
- CREATE TABLE Address (
- cust_id integer,
- attn nvarchar(50),
- department nvarchar(50),
- street1 nvarchar(50),
- street2 nvarchar(50),
- city nvarchar(50),
- state nchar(2),
- zip nvarchar (5),
- country nvarchar(18))
15SQL Server Example
- Create a table.
- Insert data using multibyte characters.
- Insert data using single-byte characters.
16Oracle Unicode Encoding Variations
- AL24UTFFSS. The original Unicode encoding
supported by Oracle. It is not compatible with
modern Unicode and should be avoided. - UTF8. A multibyte encoding used by most versions
of Oracle. This version encodes Unicode Scalar
Values larger than UFFFF as the UTF-8 sequence
for a pair of surrogate characters. - This results in binary sorting sequences
compatible with UTF-16 representations. - This violates the Unicode shortest form
requirement (note that this is invisible to my
Java application). - (All JDBC drivers adjust the connection to use
this encoding automatically.) - AL32UTF8. A UTF-8 encoding provided in Oracle 9i
that correctly encodes Unicode Scalar Values
larger than UFFFE using the shortest form. Note
that the sorting sequence is different. - nchar/nvharchar support for UTF-16 in Oracle 9i.
17MS SQL Server 2000 Issues
- Code Page 65001 This is Microsofts code page
for UTF-8. - It can not be used as a char/varchar/text
encoding, even in the most recent versions of MS
SQL Server. - See http//support.microsoft.com/support/kb/articl
es/Q232/5/80.ASP for more information. - You can use a different encoding (by setting the
collation) for each data column, but this is not
a very convenient way to work in a global
environment. - JDBC connections to SQL Server use the JDBC-ODBC
driver. This driver cannot tell the difference
between n-types and regular types, and thus
cannot retrieve Unicode string values. - Note that this also applies to several middleware
products, notably Merant. - Note that this also applies to use of variant
text types in other databases (such as Oracle 9i).
18Some Other Databases
- Sybase
- ASE supports UTF-8.
- Sybase 11 supports UTF-8 via an add-on.
- ASE is adding support for UTF-16 via a new data
type. - IBM DB/2
- Supports UTF-16 (as CCSID 13844).
- Supports UTF-8 as a database encoding.
- MySQL doesnt support Unicode.
- The Open Source folks need to get to work ?
19Modifying Size Constraints
- UTF-8 has a maximum number of bytes-per-character
of 4. - Vast majority of characters use 3 or less.
- Older systems (JDK, for example) cannot access
the 4-byte characters. - Determine size requirements
- Specific constraint
- --or--
- Arbitrary constraint.
- Specific Size Limit
- Check length using code (database fields have
variable restrictions). - For UTF-8 Multiply by 3 (or 4) bytes to get
field length. - Example varchar2(10) becomes varchar2(30).
- Arbitrary Size Limit
- Multiply the desired maximum by 3 bytes to get
approximate size. - Adjust according to database and performance
requirements. - Example varchar2(100) becomes varchar2(255).
Was able to store 100 characters, now a minimum
maximum of 85.
20Cultural Data Expansion
- Data also changes size (and sometimes type)
because of culture or locale. - Examples
- Social Security Number is 13 digits in France
- Postal code not all numeric outside USA and may
be quite long. - Different address units than State
- Spanish users often have two or three middle
names. - Avoid arbitrarily small char and varchar field
lengths. Most databases optimize storage of
variable length fields. - But avoid performance killing sizes.
- Oracle block size limitations.
- Oracle JDBC character conversion latching
maximum (2000 bytes in 8.0.5).
21Cultural Data Expansion
- State (char2) becomes province (up to 85
characters). - ZIP code (varchar9) becomes postalcode (up to 50
characters and probably much more). - Address fields expand from 50 bytes to 255 bytes
(or about 85 characters). - Dont assume that the same fields will always
represent the exact same data values.
- CREATE TABLE Address (
- address_id char(24),
- cust_id char(24),
- contact_id char(24),
- country_id char(2),
- department varchar2(255),
- street1 varchar2(255),
- street2 varchar2(255),
- city varchar2(255),
- province varchar2(255),
- postalcode varchar2(150))
22Whats Left?
- So far weve
- Expanded storage to deal with character
encodings. - Expanded storage to deal with cultural and
linguistic data expansion. - We still need to
- Allow for localization of textual elements.
- Allow for relational changes due to cultural or
linguistic requirements.
23Basic Questionnaire Table Structure
QUESTION -------- Q_ID char(24) QUES_ID char(24) T
YPE_ID char(2) QUESTION varchar(255) SEQ_NUM numbe
r
24Structure with Localization
25Selecting the Locale
- How Java does it
- ltbaseclassgtltspecific languagegt ltspecific
countrygtltspecific variantgt - ltbaseclassgtltspecific languagegt ltspecific
countrygt - ltbaseclassgtltspecific languagegt
- ltbaseclassgtltdefault languagegt ltdefault
countrygtltdefault variantgt - ltbaseclassgtltdefault languagegt ltdefault countrygt
- ltbaseclassgtltdefault languagegt
- ltbaseclassgt
- How can we replicate this in SQL?
26One Method
- SELECT FROM Questionnaire WHERE InitiativeID
? - SELECT FROM Question WHERE Q_ID ?
- (while more questions)
- (do)
- SELECT FROM QuestionLocale WHERE Ques_ID ?
AND LCID? - (until you find a record)
- (wend)
QUESTIONLOCALE -------------- Q_ID QUES_ID LANG_ID
TERRITORY_ID QUESTION
- Inefficient.
- Difficult to manage.
27Our Solution
- Concept of installed locale
- Create associated installed locale records at
the hub or questionnaire level. - Perform locale negotiation once.
- No additional searches required.
- Lets look
28(No Transcript)
29(No Transcript)
30Appearance of Questions
31Data and Locale
- Some data is locale neutral.
- Formatted at display time to match users locale.
- Values dont vary by locale.
- Note It may be in a language.
- Some data is locale related.
- Data locale implied by context.
- Formatting/Validation is supplied by context.
- Locale can be inherited or cascaded.
- Some data is locale intrinsic.
- Business Logic (format/validation) changes due to
datas locale. - Locale must be tagged.
- Implies a separate table.
32Simplify with Locale Related
33QUESTIONS AND ANSWERS
- Presentation Available at http//www.inter-locale.
com - mailtoaphillips_at_webmethods.com