Creating MultiLingual and MultiLocale Databases - PowerPoint PPT Presentation

1 / 33
About This Presentation
Title:

Creating MultiLingual and MultiLocale Databases

Description:

Creating complex systems in a global environment requires more than internationalized code. ... Create associated 'installed locale' records at the hub or ... – PowerPoint PPT presentation

Number of Views:169
Avg rating:3.0/5.0
Slides: 34
Provided by: timwoodsco
Category:

less

Transcript and Presenter's Notes

Title: Creating MultiLingual and MultiLocale Databases


1
Creating Multi-Lingual and Multi-Locale Databases
  • International Unicode Conference 19
  • Presented by Addison Phillips
  • Globalization Architect
  • webMethods, Inc.

2
Introduction
  • Audience Beginning Developers
  • Presenter Addison Phillips Globalization
    Architect webMethods, Inc. mailtoaphillips_at_webm
    ethods.com
  • Presentation http//www.inter-locale.com
  • Creating complex systems in a global environment
    requires more than internationalized code. Since
    most Enterprise system rely on relational
    databases, a global-ready system must also
    consider database design in order to be truly
    effective.

3
Our Problem
  • This presentation is based on the lessons learned
    in developing and deploying a B2B conversation
    management system (webMethods for Trading
    Networks) and partnership management software
    (webMethods PartnerConnect).
  • The products we created share a central database
    that allow webMethods customers to manage their
    B2B trading partnerships.
  • Terminology
  • A trading partner is a company that you want to
    do business with.
  • An initiative is a specific opportunity to work
    together.

4
Trading Networks
  • Companies need to store information about
    transactions and business relationships world
    wide and in real time.
  • We call this Global Business Visibility

5
Partner Connect Goals
  • Centrally served (one instance).
  • Centrally managed (initiatives can be deployed
    anywhere).
  • Localized (so partners can interact with
    initiatives in their own language).
  • Cultural and market sensitivity (customized to
    fit different market conditions locally).
  • Created and managed by the customer entirely
    through HTML interface.

6
Profile and Conversation Management
7
Enter the Database
  • Serve both global and local initiatives from a
    single instance.
  • Store data in multiple writing systems (scripts,
    languages).
  • Provide for actual differences in the data due to
    user location (locale).
  • Provide for localization of global content.
  • Provide for local content management.

8
Basic Rules for a Global DB Schema
  • Expand fields to support changes in character
    encoding.
  • Expand fields to support differences in the
    storage requirements of other locales (cultural
    or linguistic expansion, as opposed to encoding)
  • Classify data as locale-neutral,
    locale-intrinsic, or locale-related and
    re-normalize the tables accordingly.
  • Create efficient access to both global and
    locale-specific information.

9
Selecting an Encoding
  • If a database instance will only serve a single
    locale (or compatible locales), then the
    character encoding can be selected based on local
    requirements (legacy encoding).
  • If the database must store data from many
    locales (or incompatible writing systems), then
    the character encoding selected must be a Unicode
    encoding.
  • Each database vendor has a unique approach to
    this.
  • Encodings vary in terms of performance and
    capability.
  • Generally the two choices you have are
  • UTF-8
  • UTF-16 (formerly known as UCS-2)

10
Character Encodings and DDL
  • Each database vendor provides their own encoding
    support.
  • Most support legacy encodings and their
    variants.
  • Need a Unicode encoding to support multiple
    languages (globally)
  • Each vendor handles Unicode encodings differently.
  • CREATE TABLE Address (
  • cust_id number,
  • attn varchar(50),
  • department varchar(50),
  • street1 varchar(50),
  • street2 varchar(50),
  • city varchar(50),
  • state char(2),
  • zip varchar(5),
  • country varchar(18))

11
Example Cloudscape
  • Cloudscape is a pure Java database.
  • Uses java.lang.String objects to store char and
    varchar data, so all string data is stored as
    UCS-2.
  • (1) java.lang.Character equals (1) unit in DDL
  • CREATE TABLE Address (
  • cust_id number,
  • attn varchar(50),
  • department varchar(50),
  • street1 varchar(50),
  • street2 varchar(50),
  • city varchar(50),
  • state char(2),
  • zip varchar(5),
  • country varchar(18))

12
Example Oracle
  • Oracle provides several native Unicode encodings.
    The most commonly used one is called UTF8.
  • Characters in UTF-8 range from one to four bytes
  • Char and varchar2 types are defined in bytes, not
    characters.
  • So a varchar2(30) can reliably store 10 Unicode
    2.1.8 characters (and as many as 30).
  • Note that a varchar2(60) is required to store
    surrogate pairs.
  • CREATE TABLE Address (
  • cust_id number,
  • attn varchar2(150),
  • department varchar2(150),
  • street1 varchar2(150),
  • street2 varchar2(150),
  • city varchar2(150),
  • state char(6),
  • zip varchar2(15),
  • country varchar2(18))

13
Oracle Example
  • Create a table and insert values.
  • Notice that multibyte values take more room to
    store.

14
Example MS SQL Server 2000
  • SQL Server 2000 provides support for the UTF-16
    encoding of Unicode via the nchar and nvarchar
    datatypes.
  • Char and varchar2 must use a legacy encoding,
    with sizes defined in bytes (so a varchar(30) can
    store as many as 30 and as few as 15 characters
    in Shift-JIS CP932).
  • Nchar and nvarchar are defined in characters, so
    an nvarchar(30) can store 30 characters.
  • Note that an nvarchar(30) can only store 15
    characters beyond UFFFF.
  • CREATE TABLE Address (
  • cust_id integer,
  • attn nvarchar(50),
  • department nvarchar(50),
  • street1 nvarchar(50),
  • street2 nvarchar(50),
  • city nvarchar(50),
  • state nchar(2),
  • zip nvarchar (5),
  • country nvarchar(18))

15
SQL Server Example
  • Create a table.
  • Insert data using multibyte characters.
  • Insert data using single-byte characters.

16
Oracle Unicode Encoding Variations
  • AL24UTFFSS. The original Unicode encoding
    supported by Oracle. It is not compatible with
    modern Unicode and should be avoided.
  • UTF8. A multibyte encoding used by most versions
    of Oracle. This version encodes Unicode Scalar
    Values larger than UFFFF as the UTF-8 sequence
    for a pair of surrogate characters.
  • This results in binary sorting sequences
    compatible with UTF-16 representations.
  • This violates the Unicode shortest form
    requirement (note that this is invisible to my
    Java application).
  • (All JDBC drivers adjust the connection to use
    this encoding automatically.)
  • AL32UTF8. A UTF-8 encoding provided in Oracle 9i
    that correctly encodes Unicode Scalar Values
    larger than UFFFE using the shortest form. Note
    that the sorting sequence is different.
  • nchar/nvharchar support for UTF-16 in Oracle 9i.

17
MS SQL Server 2000 Issues
  • Code Page 65001 This is Microsofts code page
    for UTF-8.
  • It can not be used as a char/varchar/text
    encoding, even in the most recent versions of MS
    SQL Server.
  • See http//support.microsoft.com/support/kb/articl
    es/Q232/5/80.ASP for more information.
  • You can use a different encoding (by setting the
    collation) for each data column, but this is not
    a very convenient way to work in a global
    environment.
  • JDBC connections to SQL Server use the JDBC-ODBC
    driver. This driver cannot tell the difference
    between n-types and regular types, and thus
    cannot retrieve Unicode string values.
  • Note that this also applies to several middleware
    products, notably Merant.
  • Note that this also applies to use of variant
    text types in other databases (such as Oracle 9i).

18
Some Other Databases
  • Sybase
  • ASE supports UTF-8.
  • Sybase 11 supports UTF-8 via an add-on.
  • ASE is adding support for UTF-16 via a new data
    type.
  • IBM DB/2
  • Supports UTF-16 (as CCSID 13844).
  • Supports UTF-8 as a database encoding.
  • MySQL doesnt support Unicode.
  • The Open Source folks need to get to work ?

19
Modifying Size Constraints
  • UTF-8 has a maximum number of bytes-per-character
    of 4.
  • Vast majority of characters use 3 or less.
  • Older systems (JDK, for example) cannot access
    the 4-byte characters.
  • Determine size requirements
  • Specific constraint
  • --or--
  • Arbitrary constraint.
  • Specific Size Limit
  • Check length using code (database fields have
    variable restrictions).
  • For UTF-8 Multiply by 3 (or 4) bytes to get
    field length.
  • Example varchar2(10) becomes varchar2(30).
  • Arbitrary Size Limit
  • Multiply the desired maximum by 3 bytes to get
    approximate size.
  • Adjust according to database and performance
    requirements.
  • Example varchar2(100) becomes varchar2(255).
    Was able to store 100 characters, now a minimum
    maximum of 85.

20
Cultural Data Expansion
  • Data also changes size (and sometimes type)
    because of culture or locale.
  • Examples
  • Social Security Number is 13 digits in France
  • Postal code not all numeric outside USA and may
    be quite long.
  • Different address units than State
  • Spanish users often have two or three middle
    names.
  • Avoid arbitrarily small char and varchar field
    lengths. Most databases optimize storage of
    variable length fields.
  • But avoid performance killing sizes.
  • Oracle block size limitations.
  • Oracle JDBC character conversion latching
    maximum (2000 bytes in 8.0.5).

21
Cultural Data Expansion
  • State (char2) becomes province (up to 85
    characters).
  • ZIP code (varchar9) becomes postalcode (up to 50
    characters and probably much more).
  • Address fields expand from 50 bytes to 255 bytes
    (or about 85 characters).
  • Dont assume that the same fields will always
    represent the exact same data values.
  • CREATE TABLE Address (
  • address_id char(24),
  • cust_id char(24),
  • contact_id char(24),
  • country_id char(2),
  • department varchar2(255),
  • street1 varchar2(255),
  • street2 varchar2(255),
  • city varchar2(255),
  • province varchar2(255),
  • postalcode varchar2(150))

22
Whats Left?
  • So far weve
  • Expanded storage to deal with character
    encodings.
  • Expanded storage to deal with cultural and
    linguistic data expansion.
  • We still need to
  • Allow for localization of textual elements.
  • Allow for relational changes due to cultural or
    linguistic requirements.

23
Basic Questionnaire Table Structure
QUESTION -------- Q_ID char(24) QUES_ID char(24) T
YPE_ID char(2) QUESTION varchar(255) SEQ_NUM numbe
r
24
Structure with Localization
25
Selecting the Locale
  • How Java does it
  • ltbaseclassgtltspecific languagegt ltspecific
    countrygtltspecific variantgt
  • ltbaseclassgtltspecific languagegt ltspecific
    countrygt
  • ltbaseclassgtltspecific languagegt
  • ltbaseclassgtltdefault languagegt ltdefault
    countrygtltdefault variantgt
  • ltbaseclassgtltdefault languagegt ltdefault countrygt
  • ltbaseclassgtltdefault languagegt
  • ltbaseclassgt
  • How can we replicate this in SQL?

26
One Method
  • SELECT FROM Questionnaire WHERE InitiativeID
    ?
  • SELECT FROM Question WHERE Q_ID ?
  • (while more questions)
  • (do)
  • SELECT FROM QuestionLocale WHERE Ques_ID ?
    AND LCID?
  • (until you find a record)
  • (wend)

QUESTIONLOCALE -------------- Q_ID QUES_ID LANG_ID
TERRITORY_ID QUESTION
  • Inefficient.
  • Difficult to manage.

27
Our Solution
  • Concept of installed locale
  • Create associated installed locale records at
    the hub or questionnaire level.
  • Perform locale negotiation once.
  • No additional searches required.
  • Lets look

28
(No Transcript)
29
(No Transcript)
30
Appearance of Questions
31
Data and Locale
  • Some data is locale neutral.
  • Formatted at display time to match users locale.
  • Values dont vary by locale.
  • Note It may be in a language.
  • Some data is locale related.
  • Data locale implied by context.
  • Formatting/Validation is supplied by context.
  • Locale can be inherited or cascaded.
  • Some data is locale intrinsic.
  • Business Logic (format/validation) changes due to
    datas locale.
  • Locale must be tagged.
  • Implies a separate table.

32
Simplify with Locale Related
33
QUESTIONS AND ANSWERS
  • Presentation Available at http//www.inter-locale.
    com
  • mailtoaphillips_at_webmethods.com
Write a Comment
User Comments (0)
About PowerShow.com