Title: Unicode in Distributed Systems
1Unicode in Distributed Systems
Michael G. McKenna, mgm_at_globalisation.org
Globalisation Strategist, Haddon Hill International
2Distributed Systems
End user device
Data Access
Data
End User interface standards
Application
Unicode can be implemented in any of the functional areas.
3Distributed Systems Status Quo
- Heterogeneous
- Large Investments
- Mixed Proprietary and International Standards
- Often Under Parochial Control
- Work Group to Global Organizational Size
4The Enterprise in the Real World
Java
Internet Clients
SYBASE
ApplServer
DB2
Application Development
Data Collection
Oracle
Distributed Enterprise Information
Multiple SQL Database Access
Embedded Training
Flat Files
Web Servers
Legacy Non- Relational Data
Real-Time
Mainframe Data
Distributed Systems
Plug Play Users
CORBA
J2EE
5 The Enterprise in the Real World
Java
Internet Clients
SYBASE
ApplServer
DB2
Application Development
Data Collection
Oracle
Distributed Enterprise Information
Multiple SQL Database Access
Embedded Training
Flat Files
Web Servers
Legacy Non- Relational Data
Real-Time
Mainframe Data
LAN-Based Systems
Plug Play Users
TCP/IP
CORBA
J2EE
6 Enterprise Client/Server Requirements
Multilingual
Java
WWW Clients
SYBASE
Developer End-User Productivity
ApplServer
DB2
Application Development
Oracle
Data Entry
Transparency
Multiple SQL Database Access
Distributed Enterprise Information
Embedded Training
Flat Files
Remote Backup
Legacy Non- Relational Data
Real-Time
Localisable
Seamless Interoperability
Mainframe Data
System Management
Auditing
LAN Based Systems
Plug Play Users
System Configuration
Performance Monitoring
TCP/IP
CORBA
Security
J2EE
7Any Component Can Affect Globalization
Network
Database Server
Client Application
Database Design
Server API
Client API
Non-RDBMS Data
OS API
GUI API
8Globalisation Spans all Areas
Data
O/S
Network API
Server Comm API
Application
9Rating Distributed Systems
- A system for rating levels of Internationalisation
- 3: Global Ready / Local Cultural Authenticity
- 2: Global Ready
- 1: Single-Locale Ready (Europe or Asia)
- 0: Locale-Specific Early Adopter
- -1: 8-bit Clean
- -2: 7-bit Dirty
- -3: Don't Care
10Level (-3) Don't Care
- Ethnocentric attitude of organization
- Lack of understanding
- No desire
- I18N thought of as another feature
- Fear, uncertainty and doubt
11Level (-2) 7-bit dirty
- 7-bit ASCII support
- U.S. only
- ASCII sort only
- U.S. platforms/environments only
- U.S.-specific UI
- U.S. keyboards, terminals, printers
12Level (-1) 8-bit clean
- 8-bit data integrity (the 8th bit is not stripped)
- Support for 8-bit object names
- 16-bit data integrity for pass through
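The difference between a 7-bit dirty and an 8-bit clean component can be illustrated with a minimal Java sketch. The class and method names are hypothetical; the "dirty" channel is simulated by masking the high bit, which is an assumption about how such components typically damage data.

```java
import java.nio.charset.StandardCharsets;

class EightBitCleanDemo {
    // Simulates a "7-bit dirty" component that clears the 8th bit of every byte.
    static byte[] stripEighthBit(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) (in[i] & 0x7F);
        }
        return out;
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "cafe" with an e-acute (0xE9 in ISO-8859-1)
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // 8-bit clean: bytes pass through untouched, so the text survives.
        String clean = new String(latin1, StandardCharsets.ISO_8859_1);
        // 7-bit dirty: 0xE9 becomes 0x69, so the accented letter turns into 'i'.
        String dirty = new String(stripEighthBit(latin1), StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(clean));
        System.out.println(dirty);
    }
}
```

The same masking would destroy every lead and trail byte of a multibyte encoding, which is why 8-bit cleanliness is a precondition for the higher levels.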
13Level (0) Minimum I18N
- 8-bit and multibyte codeset support
- 8-bit and multibyte lexical support
- 8-bit and multibyte object names
- European sort orders
- Localizable
- European and Asian platform/HW support
- Documentation on I18N
- Multibyte input and display
- Application development in target language
- European and Asian keyboards, terminals, printers
14Level (1) Minimum Heterogeneous I18N
- Unicode support
- Can add European sort orders
- Distributed locale management
- All messages localizable
- Language-sensitive string operations
- Locale-sensitive cultural string formatting
- Transparent Connectivity
- Imperial calendars
- Codeset conversion
- Localizable user interface
- European multilingual application development
15Level (2) Global Ready
- Can add new character set conversions in the field
- Bi-directional support
- Robust codeset conversion
- Support world-wide multilingual application development
- Multiscript heterogeneous distributed processing
16Level (3) Cultural Authenticity
- Full Unicode support
- Keisen tables, radar charts in Japan
- Non-Gregorian calendars
- Composite characters
- Vertical input and display
17Evolution of Client/Server/Intranet
Departmental              Enterprise
10-100 Users              100s to 1000s of Users
Centralized Server(s)     Distributed Servers
Mainframe Extracts        Mainframe Integration
Single Function           Corporate-Wide
Stand Alone               Integrated
Simple Administration     Complex Management
Single Vendor             Many Vendors
10s of Gigabytes          Gigabytes to Terabytes
Internet
Any User, Any Machine, Anywhere
Systems, Applications, Databases
World-wide, HTTP/HTML, Remote Mgmt
18Evolution of Client/Server/Intranet
Departmental              Enterprise
10-100 Users              100s to 1000s of Users
Centralized Server(s)     Distributed Servers
Mainframe Extracts        Mainframe Integration
Single Function           Corporate-Wide
Stand Alone               Integrated
Simple Administration     Complex Management
Single Vendor             Many Vendors
10s of Gigabytes          Gigabytes to Terabytes
Intranet
Any User, Any Machine, Anywhere
Systems, Applications, Databases
World-wide, HTTP/HTML, Remote Mgmt
20Legacy Systems
- Communication through Gateways
- Proprietary Character Sets
- Many Asian Implementations
- Lots of Data, Lots of Money
Gateway
Gateway
Bridge
21Three-Tier I18N System Normalisation
LANGUAGE
VIEW
DATA
22Phased Approach for Distributed Unicode
- Phase I - encapsulated Unicode
  - used internally, conversion filters to operating system environment and external distributed APIs
- Phase II - Unicode on the wire
  - Unicode for transmission to distributed applications
  - Requires application control on both sides of the wire
- Phase III - Unicode end-to-end
  - Unicode enabled user-I/O with appropriate software
  - Competitive advantage for multiplatform portability
- Finally - Unicode everywhere
  - Operating environments and standards catch up. Change the conversion filters and the distributed applications continue to work
23Phase I - Encapsulated Unicode
- Unicode Enabled application inside a conversion
envelope.
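A conversion envelope of the kind described for Phase I might look like the following minimal Java sketch. The class name `ConversionEnvelope`, its methods, and the upper-casing "business logic" are hypothetical; the only assumption about the environment is that the outside world speaks one known legacy code set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ConversionEnvelope {
    // Code set used by the operating environment or external API (assumed known).
    private final Charset external;

    ConversionEnvelope(Charset external) {
        this.external = external;
    }

    // Inbound filter: legacy bytes become a Unicode String.
    String decode(byte[] legacyBytes) {
        return new String(legacyBytes, external);
    }

    // Outbound filter: Unicode String goes back to the legacy code set.
    byte[] encode(String text) {
        return text.getBytes(external);
    }

    // The application logic inside the envelope only ever sees Unicode.
    byte[] process(byte[] request) {
        String text = decode(request);
        String result = text.toUpperCase(Locale.ROOT); // placeholder for real logic
        return encode(result);
    }

    public static void main(String[] args) {
        ConversionEnvelope env = new ConversionEnvelope(StandardCharsets.ISO_8859_1);
        byte[] reply = env.process("caf\u00E9".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(new String(reply, StandardCharsets.ISO_8859_1));
    }
}
```

When the environment later moves to Unicode, only the `Charset` passed to the envelope changes; the logic inside `process` is untouched, which is the point of the phased approach.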
24Phase II - Unicode on the Wire
non-Unicode App
- Conversion filters to operating environment and
distributed non-Unicode APIs
25Phase III - Unicode End-to-End
non-Unicode App
- If needed, use proprietary software to enable
Unicode technology for user interfaces.
26Final Phase - Unicode Everywhere
- Distributed Environment vendors and standards bodies support Unicode
- Unicode used everywhere for communication, data representation, and user interfacing
27Legacy System Integration
PCs
3270s
Local Data Servers
AS/400
AS/400
- Rightsizing a Large Legacy System
  - 3270 terminals in 16 countries connected to AS/400 MIS system
- Integration with new Client/Server
  - Microsoft Windows clients (1st tier)
  - Sun Sparc Solaris Unix servers (2nd tier)
  - IBM AS/400 backend (3rd tier)
28Example B2B Technology to enable Global eCommerce
- All data in Unicode
- UTF-8 in XML
- All resource and message files stored in Unicode for portability
- Support for Unicode internally
- Java and XML
29Convertibility
CS0
- Mapping Tables
- National standards
- International standards
- Vendor standards
- Always a mapping
- Unicode base standards
- Replacement characters
CS0
30CORBA and Code Set Conversion
- Use Unicode for Inter-ORB global communications
- OMG Common Object Services (COS)
- Inter-ORB Bridge Support
- General Inter-ORB Protocol (GIOP)
- Internet Inter-ORB Protocol (IIOP)
- Code Set Negotiation uses the CONV_FRAME IDL module
Diagram from The Common Object Request Broker
Architecture and Specification, Rev. 2.2,
Chapter 11 ORB Interoperability Architecture
31CORBA IOP/IOR
- IOP: Inter-ORB Protocol
- IOR: Interoperable Object Reference (like a URL, with attributes)
Diagram from The Common Object Request Broker
Architecture and Specification, Rev. 2.2,
Chapter 11 ORB Interoperability Architecture
32Transmission Code Set
- Character Set: the characters, independent of encoding
- Code Set: the encoded values of a Character Set
- OSF Character and Code Set Registry
- ftp.opengroup.org/pub/code_set_registry
Diagram from The Common Object Request Broker
Architecture and Specification, Rev. 2.2,
Chapter 11 ORB Interoperability Architecture
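How a transmission code set could be chosen from the client's and server's native and conversion code sets is sketched below in Java. This is a simplified illustration loosely following the order of checks in the CORBA interoperability chapter; it omits the compatibility check that precedes the fallback, and it uses plain names where real ORBs use numeric OSF registry identifiers. The class, method, and parameter names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class CodeSetNegotiation {
    // Returns the code set to use on the wire, or the fallback if nothing matches.
    static String chooseTransmissionCodeSet(
            String clientNative, List<String> clientConversions,
            String serverNative, List<String> serverConversions,
            String fallback) {
        // Same native code set on both sides: no conversion needed.
        if (clientNative.equals(serverNative)) {
            return clientNative;
        }
        // Server can convert from the client's native code set.
        if (serverConversions.contains(clientNative)) {
            return clientNative;
        }
        // Client can convert to the server's native code set.
        if (clientConversions.contains(serverNative)) {
            return serverNative;
        }
        // Both sides convert to a shared intermediate code set
        // (server preference order is an assumption of this sketch).
        for (String candidate : serverConversions) {
            if (clientConversions.contains(candidate)) {
                return candidate;
            }
        }
        // Last resort, for example a Unicode encoding.
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(chooseTransmissionCodeSet(
                "ISO-8859-1", Arrays.asList("UTF-8"),
                "EUC-JP", Arrays.asList("Shift_JIS", "UTF-8"),
                "UTF-8"));
    }
}
```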
33Character Set Conversion
- User definable
- Table-driven
- API for user-defined routines
- Robust
- Negotiated conversion policy with Server
- CMR - Client Makes Right
- SMR - Server Makes Right
- UNR - Universal Network Representation
- Unicode based conversions
34Character Set Conversion
- Configurable error results depending on data-integrity needs
  - Exact Match
  - Best Guess
  - Error plus replacement character
- Multiple character sets supportable with ICU as a Conversion Envelope
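The three error results listed above can be approximated with the standard `java.nio.charset` encoder, as in this hedged sketch. It uses the JDK rather than ICU, and the "best guess" strategy shown (dropping combining marks after canonical decomposition) is only one possible interpretation; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.text.Normalizer;

class ConversionPolicy {
    // REPORT gives "exact match" behaviour (throws when a character has no mapping);
    // REPLACE substitutes the encoder's default replacement, typically '?'.
    static byte[] encode(String text, Charset target, CodingErrorAction policy) {
        try {
            ByteBuffer buf = target.newEncoder()
                    .onMalformedInput(policy)
                    .onUnmappableCharacter(policy)
                    .encode(CharBuffer.wrap(text));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("No exact mapping into " + target, e);
        }
    }

    // "Best guess": if the text does not fit, strip accents and try again,
    // replacing whatever still cannot be mapped.
    static byte[] encodeBestGuess(String text, Charset target) {
        if (target.newEncoder().canEncode(text)) {
            return encode(text, target, CodingErrorAction.REPORT);
        }
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return encode(stripped, target, CodingErrorAction.REPLACE);
    }
}
```

For example, encoding "café" into US-ASCII fails under the exact policy, yields "caf?" under replacement, and "cafe" under this best-guess strategy.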
35Character Set and Sort Order Definitions
- International and commercial standards supplied by ICU
- User definable for others
- 8-bit
- Multi-byte
- Unicode reference set
- Utilities for creating character sets
- Sort order issues
- Multilingual sorting
- Multiple sort orders and indexing
- Default vs expected sorting
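The gap between default and expected sorting can be seen in a short Java sketch using `java.text.Collator`. The class name and sample words are illustrative only; the JDK collator stands in for whatever sort-order definitions the target system supplies.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class SortOrderDemo {
    // Default sort: compares UTF-16 code units, so accented capitals sort after 'z'.
    static String[] binarySort(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Expected sort: uses the collation rules of the given locale.
    static String[] localeSort(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"zebra", "\u00C4pfel", "apple"}; // second word starts with A-umlaut
        System.out.println(Arrays.toString(binarySort(words)));
        System.out.println(Arrays.toString(localeSort(Locale.GERMAN, words)));
    }
}
```

A multilingual system generally needs more than one such collator, and any index built on the sorted data has to be tied to the sort order that produced it.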
36Unicode SQL Database
- Virtually every written business language supported
- Allows world-wide solutions
37Unicode in Databases
38IETF
- RFC 2277 - All Protocols
- IETF Policy on Character Sets and Languages
- The Internet is international
- Must identify charset
- Must support UTF-8
- Must identify language
- Multilanguage support required
- RFC 2130 - new Protocols and Formats
- Unicode default (UTF-8)
39MIME
- MIME charset Parameters
- Used for Character encoding identification
- HTTP
- HTML
- XML
- CSS
40HTTP encoding negotiation
Client sends Accept-Charset HTTP header
Accept-Charset: UTF-8, ISO-8859-1;q=0.9, *;q=0.1
Server knows the encoding and sends the charset parameter in the HTTP header
Content-Type: text/html; charset=UTF-8
HTML clues
document header
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
links
<a href="..." charset="UTF-8">...</a>
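A server-side choice based on the Accept-Charset header might be sketched in Java as follows. This is a simplified illustration: the class and method names are hypothetical, malformed q-values are treated as 0, and the HTTP/1.1 special case for an unlisted ISO-8859-1 is ignored.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class AcceptCharset {
    // Parses a header value into charset name (lower case) -> q-value (default 1.0).
    static Map<String, Double> parse(String header) {
        Map<String, Double> prefs = new LinkedHashMap<>();
        for (String item : header.split(",")) {
            String[] parts = item.trim().split(";");
            String name = parts[0].trim().toLowerCase(Locale.ROOT);
            if (name.isEmpty()) {
                continue;
            }
            double q = 1.0;
            for (int i = 1; i < parts.length; i++) {
                String p = parts[i].trim();
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // assumption: unreadable weight means "not acceptable"
                    }
                }
            }
            prefs.put(name, q);
        }
        return prefs;
    }

    // Picks the supported charset with the highest q-value; "*" covers unlisted names.
    // Returns null when the client accepts none of the supported charsets.
    static String choose(String header, List<String> supported) {
        Map<String, Double> prefs = parse(header);
        String best = null;
        double bestQ = 0.0;
        for (String cs : supported) {
            String key = cs.toLowerCase(Locale.ROOT);
            Double q = prefs.containsKey(key) ? prefs.get(key) : prefs.get("*");
            if (q != null && q > bestQ) {
                best = cs;
                bestQ = q;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String header = "UTF-8, ISO-8859-1;q=0.9, *;q=0.1";
        System.out.println(choose(header, Arrays.asList("ISO-8859-1", "UTF-8")));
    }
}
```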
Unicode in Dist Sys, (c) 2002 M. McKenna, IUC22
41Determining Internet Encodings
- In priority order
- 1. User override
- 2. HTTP header or protocol information
- 3. Self-identification
  - <meta> for HTML
  - encoding declaration for XML
  - @charset for CSS
- 4. charset parameters on links
- 5. User preferences/heuristics
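The priority order above amounts to taking the first available source of encoding information. A minimal Java sketch, with hypothetical class, method, and parameter names, and with each source assumed to have been extracted elsewhere (null or blank meaning "not available"):

```java
class EncodingResolver {
    // Returns the first non-blank candidate in priority order, or null if none is known.
    static String resolve(String userOverride,
                          String httpCharset,
                          String selfDeclared,
                          String linkCharset,
                          String heuristicGuess) {
        String[] inPriorityOrder = {
            userOverride,   // 1. user override
            httpCharset,    // 2. HTTP header or protocol information
            selfDeclared,   // 3. <meta>, XML encoding declaration, CSS @charset
            linkCharset,    // 4. charset parameter on the referring link
            heuristicGuess  // 5. user preferences or heuristics
        };
        for (String candidate : inPriorityOrder) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate.trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // No user override, so the HTTP header wins over the document's own claim.
        System.out.println(resolve(null, "UTF-8", "ISO-8859-1", null, "windows-1252"));
    }
}
```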
42LDAP version 3.0
LDAP strings are UTF-8
Directory entries can be in any language
RFC 2251 to RFC 2256
43XML and Java
- XML - Portable Data
- Java - Portable Code
- XML tag structures map to Java Classes
- Default encoding for XML is Unicode
- Encoding for Java Strings is Unicode
44XML
- Conforming parsers must support
- UTF-16
- UTF-8
- UTF-8 is the default encoding
- <?xml version="1.0" encoding="UTF-8" ?>
- Character entities are Unicode values
- &#dddd; (decimal)
- &#xUUUU; (hexadecimal)
- CSS: @charset "UTF-8";
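Because numeric character references carry Unicode values, any character can be written into an XML document regardless of the document's encoding. The following Java sketch (hypothetical class and method names) rewrites every non-ASCII character as a hexadecimal reference; it does not escape markup characters such as `&` or `<`, which a real serializer would also handle.

```java
import java.util.Locale;

class CharRefs {
    // Replaces each non-ASCII code point with a hexadecimal character reference.
    static String toHexRefs(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexRefs("caf\u00E9"));
    }
}
```

Iterating by code point rather than by `char` keeps supplementary characters intact: a surrogate pair becomes one reference, not two.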
45Java
java.lang.String - Unicode
InputStreamReader converts sourceCharset to Unicode
OutputStreamWriter converts Unicode to targetCharset
Different list of charsets supported per Vendor
Java 1.1 vs Java 2 and Unicode
Java 2 Swing set has better display support
http://www.javasoft.com (search on internationalization)
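The reader/writer pair described above can be combined into a small transcoder. This is a sketch with hypothetical class and method names; it relies only on `InputStreamReader` and `OutputStreamWriter` from `java.io`, and the set of charsets actually available depends on the Java runtime in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class Transcoder {
    // Streams bytes in the source charset through Unicode chars into the target charset.
    static void transcode(InputStream in, Charset source, OutputStream out, Charset target)
            throws IOException {
        try (Reader reader = new InputStreamReader(in, source);
             Writer writer = new OutputStreamWriter(out, target)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    // Convenience wrapper for in-memory data.
    static byte[] transcode(byte[] data, Charset source, Charset target) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(data), source, out, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

For instance, the single ISO-8859-1 byte 0xE9 (e-acute) comes out as the two UTF-8 bytes 0xC3 0xA9.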
46GUI in Java
- Portable consistent interface
- Use Java 1.2
- Use Java Foundation Classes (e.g. JTextArea)
- Use Java Locale class
- Link to O/S through JNI
- Java runs in native Unicode
47Java and I18n
java.io
Character set conversion
InputStreamReader, OutputStreamWriter
java.util
Locale
Date, Calendar
ResourceBundle
java.text
String handling, formatting
Collation
48Java Methods for JDBC
- Connection
- Data Binding
- Formatting Output
- Date and Time
- Collation
- Translated Messages
49Connection
What language?
System default?
Locale defLocale = Locale.getDefault();
set properties.put("LANGUAGE", defLocale.getDisplayLanguage(Locale.US).toLowerCase());
User choice?
Server choices list, e.g. us_english
select name from master..syslanguages
Server Default?
Java Application/Applet can be different
What character set?
sp_server_info server_csname
set properties.put("CHARSET", server_csname);
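The connection steps on this slide can be gathered into one helper, sketched below in Java. The property names LANGUAGE and CHARSET are taken from the slide; the class and method names are hypothetical, and the server character set is assumed to have been obtained beforehand (for example from sp_server_info). Whether the server recognises a given language name is not checked here.

```java
import java.util.Locale;
import java.util.Properties;

class ConnectionLocale {
    // Builds connection properties from the user's locale and the server's character set.
    static Properties connectionProperties(Locale userLocale, String serverCharset) {
        Properties props = new Properties();
        // English display name of the language, lower-cased, e.g. "german" or "french".
        String language = userLocale.getDisplayLanguage(Locale.US).toLowerCase(Locale.ROOT);
        props.put("LANGUAGE", language);
        props.put("CHARSET", serverCharset);
        return props;
    }

    public static void main(String[] args) {
        Properties props = connectionProperties(Locale.getDefault(), "utf8");
        System.out.println(props);
    }
}
```

The resulting `Properties` object could then be passed to `DriverManager.getConnection(url, props)`; a production version would presumably validate the language against the server's list and fall back to the server default.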
50Data Binding
Static locale
User-driven
System default
Menu pull-down
Dynamic locale
Data-driven
per-column
per-row
generated by business rules
Format using java.text
51Formatting Output
Numeric
java.text: DecimalFormat, NumberFormat, ChoiceFormat
Date and Time
java.text: DateFormat, SimpleDateFormat
java.util: Calendar, GregorianCalendar, Date, TimeZone, SimpleTimeZone
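A brief Java sketch of locale-sensitive output using the classes listed above. The class and method names are hypothetical, and the exact date text depends on the locale data shipped with the Java runtime, so no particular date output is assumed.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class LocaleFormatting {
    // Formats a number with the grouping and decimal separators of the locale.
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Formats a date in the locale's long style, in an explicit time zone.
    static String date(Date when, Locale locale, TimeZone zone) {
        DateFormat format = DateFormat.getDateInstance(DateFormat.LONG, locale);
        format.setTimeZone(zone);
        return format.format(when);
    }

    public static void main(String[] args) {
        System.out.println(number(1234.5, Locale.US));
        System.out.println(number(1234.5, Locale.GERMANY));
        System.out.println(date(new Date(), Locale.JAPAN, TimeZone.getTimeZone("Asia/Tokyo")));
    }
}
```

Passing the locale explicitly, rather than relying on the JVM default, is what allows the per-row or per-column dynamic locales mentioned under Data Binding.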
52Example www.3m.com
Flag
Banner
Meta Data
Content
53Generic Datetime for input to remote systems
- Use YYYYMMDD format (ISO 8601 basic format)
- insert into table1(date_col) values ('20021104') /* 4 November 2002 */
- avoids language and format confusions
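In Java, the YYYYMMDD form corresponds to `DateTimeFormatter.BASIC_ISO_DATE` from `java.time`. A minimal sketch with hypothetical class and method names:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class GenericDate {
    // Locale-neutral date string for sending to a remote system.
    static String toWire(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Parses the same format back into a date.
    static LocalDate fromWire(String text) {
        return LocalDate.parse(text, DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println(toWire(LocalDate.of(2002, 11, 4)));
    }
}
```

The string contains no month names and no separators, so it cannot be misread as day-month versus month-day order by the receiving side.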
54E-Marketplace Technology
XML Facilitates eCommerce.
55Example Message (DTD)
- <?xml version="1.0" encoding="UTF-8" ?>
- <!DOCTYPE Book [
- <!ELEMENT Book (BookDesc+) >
- <!ELEMENT BookDesc (Title, Author, Publisher, ISBN, Price, CoverImage, Desc) >
- <!ATTLIST BookDesc xml:lang CDATA #REQUIRED >
- <!ELEMENT Title (#PCDATA) >
- <!ELEMENT Author (#PCDATA) >
- <!ELEMENT Publisher (#PCDATA) >
- <!ELEMENT ISBN (#PCDATA) >
- <!ELEMENT Price (Currency, Amount) >
- <!ELEMENT Currency (#PCDATA) >
- <!ELEMENT Amount (#PCDATA) >
- <!ELEMENT CoverImage (#PCDATA) >
- <!ELEMENT Desc (#PCDATA) >
- <!ATTLIST CoverImage type (bmp|gif|jpg|other) "gif" >
- <!NOTATION gif SYSTEM "gwswin/gws.exe" >
- <!NOTATION bmp SYSTEM "gwswin/gws.exe" >
- <!NOTATION jpg SYSTEM "gwswin/gws.exe" >
- <!NOTATION other SYSTEM "gwswin/gws.exe" >
- ]>
56Example Message (XML)
- <Book>
- <BookDesc xml:lang="EN">
- <Title>Java in a Nutshell</Title>
- <Author>David Flanagan</Author>
- <Publisher>O'Reilly &amp; Associates</Publisher>
- <ISBN>156592262X</ISBN>
- <Price>
- <Currency>USD</Currency>
- <Amount>24.95</Amount>
- </Price>
- <CoverImage>jnut_us.gif</CoverImage>
- <Desc>The bestselling Java in a Nutshell has been updated to cover Java 1.1. If you're a Java programmer who is migrating to 1.1, this second ... </Desc>
- </BookDesc>
57Example Message (XML)
- <BookDesc xml:lang="DE">
- <Title>Java in a Nutshell</Title>
- <Author>David Flanagan</Author>
- <Publisher>O'Reilly/VVA</Publisher>
- <ISBN>3897211009</ISBN>
- <Price>
- <Currency>EUR</Currency>
- <Amount>25.95</Amount>
- </Price>
- <CoverImage>jnut_de.gif</CoverImage>
- <Desc>Dieses Handbuch ist eine unentbehrliche Kurzreferenz, die dazu gedacht ist, aufgeschlagen neben der Tastatur jedes Java-Programmierers zu liegen. Es enthält eine ... </Desc>
- </BookDesc>
58Example Message (XML)
- <BookDesc xml:lang="ja">
- <Title> </Title>
- <Author>David Flanagan</Author>
- <Publisher> </Publisher>
- <ISBN>4-900900-08-7</ISBN>
- <Price>
- <Currency>JPY</Currency>
- <Amount>3900.00</Amount>
- </Price>
- <CoverImage>jnut_jp.gif</CoverImage>
- <Desc>
- ... </Desc>
- </BookDesc>
- </Book>
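A consumer of such a message might pick the description matching the user's language, as in this hedged Java sketch using the JDK's DOM parser. The class and method names are hypothetical; the language attribute on each BookDesc is read under the name `xml:lang`, and every BookDesc is assumed to contain a Title.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class BookMessage {
    // Returns the Title of the BookDesc whose xml:lang matches, or null if none does.
    static String titleFor(String xml, String lang) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList descs = doc.getElementsByTagName("BookDesc");
            for (int i = 0; i < descs.getLength(); i++) {
                Element desc = (Element) descs.item(i);
                if (lang.equalsIgnoreCase(desc.getAttribute("xml:lang"))) {
                    return desc.getElementsByTagName("Title").item(0).getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse book message", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Book>"
                + "<BookDesc xml:lang=\"EN\"><Title>Java in a Nutshell</Title></BookDesc>"
                + "<BookDesc xml:lang=\"DE\"><Title>Java in a Nutshell (DE)</Title></BookDesc>"
                + "</Book>";
        System.out.println(titleFor(xml, "de"));
    }
}
```

Because the message is UTF-8 throughout, the same code path serves the English, German, and Japanese descriptions without any per-language conversion.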
59Example Order (XML)
- <Order>
- <OrderNum>20193786</OrderNum>
- <UserIdNum>A47US37892</UserIdNum>
- <ItemsOrdered>
- <Item>
- <ProductId> 156592262X </ProductId>
- <Qty>1</Qty>
- </Item>
- <Item>
- <ProductId> 3897211009 </ProductId>
- <Qty>12</Qty>
- </Item>
- <Item>
- <ProductId> 4900900087 </ProductId>
- <Qty>2</Qty>
- </Item>
- </ItemsOrdered>
- </Order>
60Business Rules
- Reports
- Taxes
- Currency
- Dual Currency Display
- Currency Conversion
- Payment and Settlement
- Import/Export
- Business Process
- Workflow
61Communication Protocols
- CORBA 3.0
- string, char supports UTF-8
- wstring, wchar supports UTF-16
- COM, DCOM
- Allows Unicode
- ActiveX
- Unicode interface
62Fonts
True Type
Bitstream Cyberbit
Monotype
BDF, Java
cobble together from many sources
Dynamic
Composed of multiple fonts
Bitstream Truedoc
www.truedoc.com
63Web Services
HTTP
64E-Marketplace Technology
Service Provider layers in their Services to
seamlessly add value to all trading partners.
65UDDI
- Describes
- What is it?
- Where is it?
- How do I get it?
66UDDI - I18n
- Need to track time zone usage
- Useful to have alternate names
- Specify normalized formats to use
67WSDL
Web Services Description Language
- Services are defined using six major elements
  - types: describe the messages exchanged
  - message: abstract definition of the data being transmitted
  - portType: set of abstract operations referring to an input message or output messages
  - binding: protocol and data format specifications
  - port: address for a binding, a single communication endpoint
  - service: aggregates a set of related ports
68WSDL - I18n
Web Services Description Language
- Pure XML
- Use xml:lang and locale attributes
- Export to UDDI
- Service provider localizes to supported locales
69SOAP
Simple Object Access Protocol
SOAP
SOAP
70SOAP
71SOAP - I18n
Simple Object Access Protocol
Locale B
Locale A
SOAP
I18N Info
SOAP
I18N Info
73System Architecture
Text Handling
Character Handling
Application Software
Cultural Profiler
Message System
Collations
String Formatting
Portability Layer
Conversions
en fr de jp
Operating System
Resource Store
Resource Files
Resource Files
Resource Files
74Business Globalization To Enable Global eCommerce
- Features
- Fully internationalized
- Currency, International Taxes
- Unicode support
- Business Services Framework
- Logistics
- International Payment
- Currency Exchange
- Import/Export Compliance
- Landed Cost
- Translation Services
- Regional Early Adopter Process
- Thailand
- Malaysia
- Israel
- Middle East
- Greece
- India
Tier-2: English-UK, Danish, Dutch, Finnish, Norwegian, Swedish, Portuguese-PT, Czech-SAP, Hungarian-SAP, Polish-SAP, Russian-SAP
Tier-1: English, German, French, Italian, Japanese, Portuguese-BR, Spanish-Intl, Chinese-S, Chinese-T, Korean
APAC
EMEA
Americas
Includes enterprise, content auction
applications and global services
75Summary
- Unicode is a powerful portability and interoperability solution for distributed environments
- Global, distributed computing (Level 2 of I18N) requires Unicode to be effective
- Unicode can be acquired in a phased approach
- Unicode is now required to use new technologies (RFC 2277)
  - XML, Java
76Global Vision
- Think Globally, Act Locally
- The trade relationships of the World make for a very small planet economically, but a complex one culturally
- The World Needs Unicode Today!