1
NLS and The Case of the Missing Kanji
Brian Hitchcock, OCP DBA 8, 8i, 9i
Global Sales IT, Sun Microsystems
brian.hitchcock@sun.com, brhora@aol.com
NLS -- National Language Support
Kanji -- Japanese characters
NoCOUG
2
How It All Started
  • Existing Sybase database and application
  • Needed to convert to Oracle
  • Use Oracle Migration Workbench
  • OMWB works well
  • I wasn't told there was multi-byte data in the
    Sybase database
  • After the migration to Oracle
  • Kanji data were missing

3
Kanji Become Lost
  • How was Kanji stored in Sybase?
  • How was application working with Sybase?
  • Why lost when migrated to Oracle?
  • How to fix in Oracle?
  • Would application work with Oracle?

4
Before Upgrade to Oracle
0xB0A1
0xB0A1
Source system inserts bytes of Kanji characters
into db
0xB0A1
Browser examines each byte, detects multi-byte
characters, displays Kanji character
EUC-JP character code for this character is 0xB0A1
Select Japanese (Auto-Detect) character set in
Netscape to view Kanji characters
5
Moving Sybase Data to Oracle
0xB0A1
0x3021
Oracle Db
Character set WE8ISO8859P1
0x3021
0xB0A1
Flat file produced using Sybase bcp utility
0x3021
0xB0A1
0!
Browser displays the characters for byte codes
0x30 and 0x21 which are 0 and !
EUC-JP character code for this character is 0xB0A1
6
Moving Sybase Data to Oracle
  • What happened?
  • SQL*Loader defaults to US7ASCII character set
  • 7 bits per character
  • SQL*Loader stripped the 8th bit off each byte
  • 8th bit set to 0
  • 8-bit characters are now 7-bit characters
  • Original character data is lost
  • 8-bit characters can't be represented in the
    US7ASCII character set
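The bit-stripping described above can be reproduced in a few lines (a Python sketch, not part of the original migration; it simulates SQL*Loader's US7ASCII behavior on the two EUC-JP bytes of the Kanji character):

```python
# SQL*Loader's US7ASCII handling kept only the low 7 bits of each byte.
# Simulate that on the two EUC-JP bytes of the Kanji character 0xB0A1.
kanji_bytes = [0xB0, 0xA1]
stripped = [b & 0x7F for b in kanji_bytes]  # clear the 8th (high) bit
print([hex(b) for b in stripped])           # ['0x30', '0x21']
print(bytes(stripped).decode('ascii'))      # the ASCII characters "0!"
```

Masking with 0x7F is exactly "set the 8th bit to 0", which is why the Kanji bytes come out as the printable ASCII pair "0!".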

7
US7ASCII to WE8ISO8859P1
0xB0A1 in hexadecimal and binary:
  B    0    A    1
  1011 0000 1010 0001
Strip off the 8th (highest-order) bit of each byte, setting it to 0:
  0011 0000 0010 0001
  3    0    2    1
0xB0A1 becomes 0x3021, which displays as "0!"
Hex-to-binary reference: A = 10 = 1010, B = 11 = 1011, C = 12 = 1100, D = 13 = 1101, E = 14 = 1110, F = 15 = 1111
The browser reads each byte, sees the 8th bit set to 0, and decides that each byte represents a single-byte character; character codes 0x30 and 0x21 represent the characters "0" and "!"
EUC-JP character code for this character is 0xB0A1
8
Fix -- Sybase Data to Oracle
Note WE8ISO8859P1 character set does not support
Kanji characters
0xB0A1
0xB0A1
Oracle Db
Character set WE8ISO8859P1
0xB0A1
0xB0A1
Flat file produced using Sybase bcp utility
0xB0A1
0xB0A1
Browser detects multi-byte characters, displays
Kanji character
EUC-JP character code for this character is 0xB0A1
9
Current Oracle Production
Note WE8ISO8859P1 character set does not support
Kanji characters
0xB0A1
0xB0A1
Application Oracle Db character set WE8ISO8859P1
Source system inserts bytes of Kanji characters
into db
0xB0A1
Browser examines each byte, detects multi-byte
characters, displays Kanji character
EUC-JP character code for this character is 0xB0A1
Select Japanese (Auto-Detect) character set in
Netscape to view Kanji characters
10
Existing Application
  • How does it store/retrieve Kanji?
  • Multi-byte Kanji characters
  • Stored in WE8ISO8859P1 (single-byte) db
  • Application
  • JDBC retrieves bytes from WE Oracle db
  • Java generates HTML, sent to client browser
  • Netscape, view HTML using Japanese
    (Auto-Detect) character set
  • Display Kanji

11
Existing Application
Application
ISO1
ISO1
8859-1
UCS2 -- 2 byte Unicode
Oracle Db
Japanese Auto-Detect
8859-1
Character set WE8ISO8859P1
Browser detects multi-byte characters, displays
Kanji character
12
Existing Application
  • Each piece of software makes some decision
    (default) about character set
  • You need to understand this process for your
    application

13
What Really Happened?
  • Source Kanji data
  • From EUC-JP character set
  • Multi-byte
  • Kanji multi-byte stored in Sybase db
  • Default character set ISO-1
  • 8-bit, single-byte
  • Kanji multi-byte stored in Oracle db
  • Character set WE8ISO8859P1
  • 8-bit, single-byte

14
Convert to UTF8
  • Why?
  • Eliminate all the issues shown so far
  • Store multiple languages correctly
  • Correctly encoded
  • Support clients inserting data in languages other
    than Japanese Kanji
  • Existing application can only support languages
    based on Latin characters and Kanji

15
Conversion is Simple -- Isn't It?
  • Export WE8ISO8859P1 database
  • Set export client NLS_LANG
  • AMERICAN_AMERICA.WE8ISO8859P1
  • Import into UTF8 database
  • Set import client NLS_LANG
  • AMERICAN_AMERICA.WE8ISO8859P1
  • Test application
  • Application works!
  • Is everything OK?

16
Meanwhile, Back at the Ranch
  • While application testing is going on
  • Insert sample bytes for Kanji into WE db
  • Use Oracle SQL CHR() function
  • Export from WE db, import into UTF8 db
  • Examine same bytes in UTF8 db
  • Compare UTF8 bytes to manually generated UTF8
    bytes for the Kanji characters
  • NOT the same bytes!
  • What does this mean?

17
UTF8 Encoding Process
JA16EUC to UTF8 conversion: the JA16EUC encoding is 0xB0A1; the Unicode code point for this character is 4E9C (there is no formula for this, Oracle uses a lookup table).
The number of bytes used in the UTF8 encoding is based on the Unicode code point:
  Unicode          UTF8 bytes
  0x0000 - 0x007f  0xxxxxxx
  0x0080 - 0x07ff  110xxxxx 10xxxxxx
  0x0800 - 0xffff  1110xxxx 10xxxxxx 10xxxxxx
4E9C requires 3 bytes in UTF8. The 4E9C bit pattern is 0100 1110 1001 1100:
  right-most 6 bits go to the third UTF8 byte: 01 1100
  next 6 bits go to the second UTF8 byte: 1110 10
  remaining 4 bits go in the first UTF8 byte: 0100
Result: 11100100 10111010 10011100 = E4 BA 9C
UTF8 character code is 0xE4BA9C
  • Think your life is boring?
  • Think your life is boring?

Metalink Doc ID: Note 69518.1, "Determining the codepoint for UTF8 characters"
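The hand-computed encoding above can be checked against Python's built-in UTF-8 codec (a sketch; Python's encoder implements the same 1110xxxx 10xxxxxx 10xxxxxx rule):

```python
# U+4E9C is the Unicode code point that JA16EUC 0xB0A1 maps to.
cp = 0x4E9C
utf8 = chr(cp).encode('utf-8')
print(utf8.hex())  # 'e4ba9c' -- matches the hand-computed 0xE4BA9C

# Reproduce the 3-byte layout 1110xxxx 10xxxxxx 10xxxxxx by hand:
manual = bytes([0xE0 | (cp >> 12),          # top 4 bits
                0x80 | ((cp >> 6) & 0x3F),  # middle 6 bits
                0x80 | (cp & 0x3F)])        # low 6 bits
print(manual == utf8)  # True
```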
18
Bytes is Bytes
Kanji character encodings for this character in the various character sets (related by the Unicode lookup table and conversion formulas):
  • JIS row/cell values: Row 16, Column 1
  • ISO-2022-JP character code: 0x3021
  • Shift-JIS byte code: 0x889F
  • EUC-JP byte code: 0xB0A1
  • Unicode byte code (code point, via lookup table): 0x4E9C
  • UTF8 byte code: 0xE4BA9C
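These encodings can be cross-checked with Python's standard codecs (a sketch; the codec names euc_jp, shift_jis, and iso2022_jp are Python's, not Oracle's):

```python
ch = '\u4e9c'  # the Kanji character discussed throughout
print(ch.encode('euc_jp').hex())     # 'b0a1'
print(ch.encode('shift_jis').hex())  # '889f'
# ISO-2022-JP wraps the JIS code 0x3021 (the ASCII pair "0!") in escape sequences:
print(ch.encode('iso2022_jp'))       # b'\x1b$B0!\x1b(B'
print(ch.encode('utf-8').hex())      # 'e4ba9c'
```

Note how the same two-byte JIS code 0x3021 shows up inside the ISO-2022-JP output, and how EUC-JP is that code with the high bit of each byte set (0x30+0x80 = 0xB0, 0x21+0x80 = 0xA1).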
19
Conversion Issue
Correct UTF8 byte code for this character is
0xE4BA9C
0xC2B0, 0xC2A1
Import loads export file character set
WE8ISO8859P1
0xB0A1
0xB0A1
EUC-JP character code for this character is 0xB0A1
0xB0A1
Oracle export hard codes the source db character
set into the export file
0xC2B0, 0xC2A1
20
Import to UTF8 Conversion
Existing WE8ISO8859P1 data to UTF8 conversion: the EUC-JP encoding for the character is 0xB0A1, but import detects that the data came from a single-byte export file (WE8ISO8859P1). Import reads each byte one at a time, so 0xB0A1 becomes 0xB0 followed by 0xA1, and converts these to the Unicode (UCS2) equivalent. For single-byte character codes, the Unicode equivalent simply has two leading bytes of 0s, so 0xB0 and 0xA1 become U+00B0 and U+00A1; import then converts from UCS2 to UTF8.
The number of bytes used in the UTF8 encoding is based on the Unicode code point:
  Unicode          UTF8 bytes
  0x0000 - 0x007f  0xxxxxxx
  0x0080 - 0x07ff  110xxxxx 10xxxxxx
  0x0800 - 0xffff  1110xxxx 10xxxxxx 10xxxxxx
00B0 and 00A1 both require 2 bytes in UTF8:
  00B0 bit pattern: 0000 0000 1011 0000
  00A1 bit pattern: 0000 0000 1010 0001
The right-most 6 bits go to the second UTF8 byte; the next 5 bits go in the first UTF8 byte:
  1100 0010 1011 0000 = C2 B0
  1100 0010 1010 0001 = C2 A1
The WE8ISO8859P1 character code 0xB0A1 becomes the UTF8 character codes 0xC2B0, 0xC2A1. The correct conversion of this EUC-JP character code 0xB0A1 to UTF8 is 0xE4BA9C.
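The wrong-path conversion the import performed can be reproduced exactly (a Python sketch; the ISO-8859-1 codec stands in for WE8ISO8859P1, which shares these code points):

```python
src = bytes([0xB0, 0xA1])  # the EUC-JP bytes sitting in the WE db
# Import treats each byte as a single WE8ISO8859P1 character...
as_chars = src.decode('iso-8859-1')  # U+00B0, U+00A1
# ...and re-encodes each character in UTF8:
wrong = as_chars.encode('utf-8')
print(wrong.hex())  # 'c2b0c2a1' -- not the correct 0xE4BA9C
```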
21
What Happened?
  • Oracle did exactly what it was told to do
  • Take bytes from WE database
  • Convert to UTF8 bytes
  • Export file was made from WE database
  • WE is single-byte character set
  • Convert each byte one at a time to UTF8
  • Kanji character consists of 2 bytes in WE db
  • Converting each byte to UTF8 not the same as
    converting the pair of bytes to UTF8
  • Yeah, but, application works! (?)

22
Where's the Problem?
  • UTF8 db has Kanji as 0xC2B0, 0xC2A1
  • Correct UTF8 encoding is 0xE4BA9C
  • If new, correctly encoded Kanji is inserted
  • Database contains two sets of bytes for same
    Kanji character
  • How does app deal with this?
  • Existing app only works using Netscape Japanese
    (Auto-Detect) character set
  • App is not really UTF8, only works for Japanese
    characters

23
How Does Application Work?
  • Review
  • Oracle db created using UTF8 character set
  • Java retrieves char data (bytes) from UTF8 db
  • Converts to UCS2 (Unicode)
  • Java code generates HTML
  • Client browser displays Kanji characters
  • Netscape, Japanese (Auto-Detect) char set
  • Application still works
  • bytes in UTF8 db don't represent UTF8 encoded
    Kanji

24
Application Works (?)
Application
0xB0, 0xA1
0xB0, 0xA1
0xB0, 0xA1
ISO1
UCS2
ISO1
UTF8
0xB0, 0xA1
0xC2B0, 0xC2A1
Oracle Db
Export WE database, import into UTF8 db
8859-1
Japanese Auto-Detect
Character set UTF8
Browser detects multi-byte characters, displays
Kanji character
EUC-JP character code for this character is 0xB0A1
25
Test Application
  • Insert bytes for correctly encoded Kanji
  • Into UTF8 db
  • Use CHR() function
  • Display this data using existing application
  • Does NOT display Kanji!
  • Using Japanese (Auto-Detect) character set
  • Try Netscape UTF8 character set
  • Doesn't display Kanji
  • UTF8 character set should work, shouldn't it?

26
Where Are We?
  • Correctly encoded UTF8 multi-byte character data
    for Kanji does not work with existing application
  • Simply converting (export WE, import to UTF8)
    doesn't result in correctly encoded UTF8
    character data
  • Need to figure out what app code is doing
  • Whoever wrote it is gone
  • The usual state of affairs

27
How To Debug App Code?
  • Don't use app code
  • write very simple Java Servlet
  • (The Java Diva helps with this)
  • Servlet simply retrieves character data from db
  • Runs in iPlanet web server
  • generates HTML for client browser
  • Use servlet to retrieve correct UTF8 Kanji
  • Does not display Kanji!
  • Fix servlet then can fix application code?

28
Modified Servlet Code
res.setContentType("text/html; charset=UTF-8");
PrintWriter out = new PrintWriter(
    new OutputStreamWriter(res.getOutputStream(), "UTF-8"), true);
out.println("<META HTTP-EQUIV=\"Content-Type\""
    + " CONTENT=\"text/html; charset=utf-8\">");
29
Fix Application
  • Make same changes to application code
  • Browser displays Kanji correctly
  • Manually generated, correctly encoded UTF8
  • Application interacts with Dynamo
  • Need to reconfigure Dynamo for UTF8 data
  • Application fixed (?)
  • Works with correctly encoded UTF8 multi-byte data

30
Is Application really fixed?
  • Fixed app retrieves correctly encoded UTF8
    character data
  • What about existing character data?
  • Data that was exported from WE and imported into
    UTF8 db
  • Use fixed app code to retrieve existing data
  • Existing Kanji are not displayed
  • Original app did display existing data...
  • Existing data is not correctly encoded UTF8

31
Fixed Application
Application
0xB0, 0xA1
0xC2B0, 0xC2A1
UTF8
UCS2
UTF8
UTF8
0xC2B0, 0xC2A1
0xC2B0, 0xC2A1
Oracle Db
UTF8
Browser displays characters for the UTF8 bytes
0xC2B0, 0xC2A1 which are degree sign and
upside down exclamation point
8859-1
EUC-JP character code for this character is 0xB0A1
Character set UTF8

32
How To Fix Existing Data?
  • What's wrong with existing data (UTF8 db)?
  • Character data is not correctly encoded UTF8
  • It is UTF8 encoded Unicode of each single byte
    that was exported from WE database
  • Before importing into UTF8 database?
  • EUC-JP character set (Latin ASCII and Kanji)
  • Stored in single-byte WE database
  • Need to convert the UTF8 encoding of the WE
    bytes of the EUC-JP characters to the correct
    UTF8 bytes for Kanji

33
Review of Bytes is Bytes
  • Original Kanji character 0xB0A1 (EUC-JP)
  • Inserted into Oracle database
  • 0xB0, 0xA1 in WE8ISO8859P1 db
  • Exported/imported into Oracle UTF8 db
  • Individual bytes converted to UTF8
  • Original Kanji character was 2 bytes
  • Became 4 bytes in UTF8 db
  • 0xC2B0, 0xC2A1
  • Correct UTF8 bytes are 0xE4BA9C

34
How to Convert Existing Data?
  • Fix in Oracle WE before export/import
  • No point, export/import will corrupt character
    data, will need to fix after export/import
  • Don't export/import
  • SQL select each table to flat files from WE db
  • SQL*Loader into UTF8 database
  • Use CHARACTERSET JA16EUC option
  • More work moving each table one at a time
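A control file for that SQL*Loader path might look like the sketch below (the table, column, and file names are hypothetical; the key line is CHARACTERSET JA16EUC, which tells SQL*Loader the flat file holds EUC-JP bytes so they are converted correctly into the UTF8 database):

```
LOAD DATA
CHARACTERSET JA16EUC
INFILE 'orders.dat'
INTO TABLE orders
FIELDS TERMINATED BY '|'
(order_id, description)
```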

35
SQL*Loader Option
0xE4BA9C
Oracle Db
Character set UTF8
0xB0A1
0xB0A1
0xB0A1
SQL Select to flat file for each table
0xB0A1
EUC-JP character code for this character is 0xB0A1
Select UTF8 character set to view Kanji
characters
0xE4BA9C
36
Convert Existing Data
  • Fix data after import into UTF8 database
  • Export from WE, import into UTF8 database
  • Use Oracle SQL CONVERT() function
  • CONVERT() from UTF8 to WE8ISO8859P1
  • CONVERT() from JA16EUC to UTF8
  • Need to CONVERT() each column of each table that
    contains multi-byte data
  • How to be sure which columns to CONVERT()?
  • CONVERT() all columns that contain char data?
  • Must test using CONVERT() to verify it works
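The double-CONVERT() repair can be modeled byte-for-byte (a Python sketch; the iso-8859-1 and euc_jp codecs stand in for Oracle's WE8ISO8859P1 and JA16EUC character sets):

```python
stored = bytes.fromhex('c2b0c2a1')  # miscoded bytes in the UTF8 db
# Step 1 -- like CONVERT(col, 'WE8ISO8859P1', 'UTF8'):
# undo the per-byte UTF8 encoding to recover the original WE bytes.
we_bytes = stored.decode('utf-8').encode('iso-8859-1')  # b'\xb0\xa1'
# Step 2 -- like CONVERT(col, 'UTF8', 'JA16EUC'):
# reinterpret those bytes as EUC-JP and encode properly in UTF8.
fixed = we_bytes.decode('euc_jp').encode('utf-8')
print(fixed.hex())  # 'e4ba9c' -- the correct UTF8 encoding of the Kanji
```

Running the second step on data that did not come through the first (or running the pair twice) would corrupt the data, which is why each CONVERT() pass must be tested before re-running.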

37
Fix After Import
0xB0A1
0xC2B0, 0xC2A1
0xB0, 0xA1
Bytes of original EUC-JP Kanji character
0xE4BA9C
0xB0A1
0xB0, 0xA1
Correctly encoded UTF8 bytes for this Kanji are
0xE4BA9C
EUC-JP character code for this character is 0xB0A1
38
Oracle CONVERT()
  • Syntax, examples
  • select CONVERT(<column>, '<destination char set>',
    '<source char set>')
  • select CONVERT(<column>, 'WE8ISO8859P1', 'UTF8')
  • select CONVERT(<column>, 'UTF8', 'JA16EUC')
  • Don't re-run CONVERT() without testing
  • A re-run may corrupt data
  • Regenerate original source data, then re-run CONVERT()

39
Overall Conversion Process
  • What we did
  • Identify tables/columns contain multi-byte data
  • Export from WE database
  • Import into UTF8 database
  • rows=n, create tables, don't load data
  • Widen columns for UTF8 multi-byte data
  • increase to 3 times original width
  • Import into UTF8 database (again)
  • ignore=y, load data into existing tables

40
Overall Conversion Process
  • Continued
  • CONVERT() columns that contain multi-byte data
  • Test, compare with data from existing
    application/data
  • Conversion includes converting all pieces of the
    application, not just the Oracle database

41
Details - Source Char Set?
  • How did I determine this?
  • Original Kanji data was from EUC-JP
  • How was this determined?
  • Examine bytes of original character data
  • Display Original Kanji characters
  • Find single Kanji in Japanese dictionary
  • Gives row-cell code of Kanji in JIS-0208
  • Using other reference sources
  • manually generate bytes for the Kanji in various
    encodings
  • Compare with bytes of original Kanji data

42
Rosetta Stone?
Oracle8i National Language Support Guide, Release 2 (8.1.6), December 1999, Part No. A76966-01, page 3-22
43
Reference Books Used
The New Nelson Japanese-English Character Dictionary, by John H. Haig and Andrew N. Nelson. Periplus Editions, Ltd., 11/1996. ISBN 0804820368.
The Unicode Standard (with CD-ROM), by the Unicode Consortium. Addison Wesley Longman, Inc., 04/1995. ISBN 0201483459.
CJKV Information Processing, by Ken Lunde; Gigi Estabrook (editor). O'Reilly & Associates, Inc., 01/1999. ISBN 1565922247.
44
Lessons Learned
  • Oracle (and Sybase) don't store characters
  • They store bytes, strings of bytes
  • Normally, Oracle does NO checking of character
    set
  • does NOT check that bytes inserted represent
    correct characters in database character set
  • Only under specific circumstances does Oracle
    apply a character set to char data
  • Changing character set affects more than just the
    database

45
Lessons Learned
  • Bytes of character from any char set can be
    stored in db of any charset
  • EUC-JP char in WE db, in UTF8 db
  • bytes in db are not correct bytes for the
    character in the db character set
  • all apps, users, dbs must know that db contains
    char data from other char set
  • Any char set conversion may corrupt the char data
    -- import WE into UTF8 db

46
Lessons Learned
  • Simply exporting db, importing into UTF8 does not
    solve the problems
  • Testing requires generating correctly encoded
    character data
  • Every piece of an application makes a decision
    about character set (default)
  • If all data in db really is in the db char set
  • export, import to db of other char set works
  • Need to see original character data
  • Verify data after char set conversion