Title: Migration of a 4GL and Relational Database to Unicode
1Migration of a 4GL and Relational Database to Unicode
Tex Texin
International Product Manager
2Presentation Goals
- Outline Migration Steps
- Describe Design Considerations
- Leverage Existing Double-byte Implementation
- Describe Impact on 4GL and Report Formats
3PROGRESS Application Development Suite
- Powerful tools for the rapid creation of distributed business applications
- Creates character, GUI, or web-based clients with common source
- Host-based, client-server, or n-tier distribution on a variety of platforms
- Scalable, robust RDBMS and open
- International, double-byte enabled
4Possible Configuration Options
- GUI Client
- Client-Server
- Web-based Client
- Database Server
- Progress Database
- Optional n-tier Application Server
- Host-based Character Client
- Other Database
5Why do our customers need Unicode?
- Many do not... However,
- Multinationals deploy across regions with incompatible character sets, yet they must share data between them.
- Programs are distributed worldwide with one container of text in many languages.
- Certain applications require multilingual databases, e.g. translation systems and web-based applications.
6The Existing Architecture
- 1.5M lines of C code
- 0.3M lines of 4GL code
- Double-byte enabled
- CJK, 9 double-byte charsets supported
- 2-byte only, no 3 or 4-byte
- No shift-sequenced charsets
- DBE changes earmarked, easy to find
- 4 years, 3 developers, 2 QA
7Estimated cost of implementing UCS-2 was very big!
- Changing to 16-bit text units affects almost every source module
- Largest cost is separating char variables based on usage for text or binary data.
- Use 16-bit null terminators, ignore 8-bit
- A ⇒ 0041, 0000; Ā ⇒ 0100, 0000
- Pointer arithmetic (advance 2 bytes)
- Sizing (bytes or characters)
- New API to use new WIDE TEXT datatype
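The null-terminator and sizing points can be sketched in a few lines of Python. This is only an illustration, not the product's C code, and the names `narrow_strlen` and `wide_strlen` are hypothetical. It stores the code points 0041 and 0100 from the slide as big-endian 16-bit units, then compares an 8-bit scan for a zero byte with a 16-bit scan for a 0000 unit.

```python
def narrow_strlen(buf: bytes) -> int:
    """Bytes before the first zero byte, as an 8-bit strlen would count."""
    return buf.index(0)


def wide_strlen(buf: bytes) -> int:
    """16-bit units before the first 0x0000 unit."""
    for i in range(0, len(buf) - 1, 2):
        if buf[i] == 0 and buf[i + 1] == 0:
            return i // 2
    raise ValueError("no 16-bit terminator found")


# U+0041 and U+0100 as 16-bit units, followed by a 16-bit terminator.
buf = "A\u0100".encode("utf-16-be") + b"\x00\x00"

print(buf.hex(" ", 2))        # the raw 16-bit units
print(narrow_strlen(buf))     # 8-bit scan stops at the high byte of 0041
print(wide_strlen(buf))       # 16-bit scan counts both characters
print(wide_strlen(buf) * 2)   # size in bytes differs from size in characters
```

Every zero byte inside a valid character is a place where byte-oriented code would truncate the string, which is why each char variable has to be classified as text or binary.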
8Product requirements for a multilingual version
- Minimize cost for application migration
- Minimize cost for application upgrade
- Minimize support cost
- One executable!
- Maintain user-definable character sets
- Add UTF-8 as just another character set
- UTF-8 algorithms are compatible with other
charsets
9Scaled down multilingual proposal: UTF-8 implementation
- Implement UTF-8 as 3-byte character set
- Leverage and extend double-byte enabling
- Places to change are already earmarked
- Restrict to composed characters for now
- Restrict to no surrogates
- Supports all the markets we are in
- UTF-8-enable 4GL and RDBMS first
- Provides multilingual logic and storage
- Java and other client technologies coming
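Why "3-byte character set" and "no surrogates" go together can be checked with a short Python sketch. The helper names `utf8_size` and `is_surrogate` are hypothetical; the size thresholds are those of the UTF-8 encoding itself. Every 16-bit code point that is not a surrogate fits in at most three UTF-8 bytes, and only characters beyond that range need a fourth.

```python
def utf8_size(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def is_surrogate(code_point: int) -> bool:
    return 0xD800 <= code_point <= 0xDFFF


# Cross-check the size rule against Python's own encoder for every
# 16-bit code point that is not a surrogate.
bmp = [cp for cp in range(0x10000) if not is_surrogate(cp)]
assert all(utf8_size(cp) == len(chr(cp).encode("utf-8")) for cp in bmp)

largest = max(utf8_size(cp) for cp in bmp)
print(largest)               # longest encoding without surrogates
print(utf8_size(0x10000))    # first code point that would need surrogates
```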
10Architecture changes: UTF-8-enabling the string library
- N-byte enable character and string functions
- GetNextChar, GetPreviousChar
- GetCharacterSize (table-based)
- Modified IsFirstByte
- New GetColumnLength
- New datatype: normalized BIG char
- Minor algorithm changes for efficiency
- Find Character
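A minimal Python sketch of what N-byte-enabled navigation functions might look like for UTF-8 follows. The function names echo the slide, but the bodies and the `CHAR_SIZE` table are assumptions for illustration, not the product's implementation. The table is indexed by the first byte, in the spirit of a table-based GetCharacterSize, and does no validation (0xC0 and 0xC1 are never valid UTF-8 but are listed as 2-byte leads).

```python
# Size of a UTF-8 character, indexed by its first byte. 0 means the byte
# cannot start a character here: it is a tail byte (0x80-0xBF) or the lead
# byte of a 4-byte sequence, which a 3-byte-only implementation excludes.
CHAR_SIZE = [1] * 0x80 + [0] * 0x40 + [2] * 0x20 + [3] * 0x10 + [0] * 0x10


def get_character_size(first_byte: int) -> int:
    return CHAR_SIZE[first_byte]


def is_first_byte(byte: int) -> bool:
    return CHAR_SIZE[byte] != 0


def get_next_char(buf: bytes, pos: int) -> int:
    """Offset of the character after the one starting at pos."""
    size = get_character_size(buf[pos])
    if size == 0:
        raise ValueError(f"offset {pos} is not the start of a character")
    return pos + size


def get_previous_char(buf: bytes, pos: int) -> int:
    """Offset of the character before the one starting at pos."""
    if pos <= 0:
        raise ValueError("already at the start of the buffer")
    pos -= 1
    while pos > 0 and not is_first_byte(buf[pos]):
        pos -= 1
    return pos


buf = "aé日".encode("utf-8")   # 1-byte, 2-byte, and 3-byte characters
offsets = [0]
while offsets[-1] < len(buf):
    offsets.append(get_next_char(buf, offsets[-1]))
print(offsets)
```

Because UTF-8 tail bytes never overlap with lead bytes, stepping backward only needs to skip tail bytes, which is simpler than in many double-byte charsets.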
11Architecture changes: UTF-8-enabling character tables
- String libraries use character tables
- Alphanumeric, Lead-byte, Tail-byte
- Upper, lower case (700 characters)
- New property ColumnCount
- New table formats
- Old architecture presumed 256 byte table
- Now organized by range lists and trie
- Update table compiler to allow hex entry
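A range list is one way to hold a property such as ColumnCount once a flat 256-entry table no longer fits. The Python sketch below is an assumption about the general technique, not the product's table format. `WIDE_RANGES` and `column_count` are hypothetical names, and the three ranges are real Unicode blocks chosen only as examples, not a complete width table.

```python
import bisect

# Sorted, non-overlapping (first, last, column_count) ranges.
# Illustrative only: a real table would cover many more ranges.
WIDE_RANGES = [
    (0x3040, 0x309F, 2),  # Hiragana block
    (0x30A0, 0x30FF, 2),  # Katakana block
    (0x4E00, 0x9FFF, 2),  # CJK Unified Ideographs block
]
RANGE_STARTS = [first for first, _, _ in WIDE_RANGES]


def column_count(code_point: int) -> int:
    """Display columns for a code point; 1 unless a range says otherwise."""
    # Find the last range that starts at or before the code point.
    i = bisect.bisect_right(RANGE_STARTS, code_point) - 1
    if i >= 0:
        first, last, columns = WIDE_RANGES[i]
        if code_point <= last:
            return columns
    return 1


for ch in "Aéあ日":
    print(f"U+{ord(ch):04X}", column_count(ord(ch)))
```

The lookup is a binary search over range starts, so table size grows with the number of ranges rather than with the size of the character set.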
12Architecture changes: UTF-8-enabling sorting
- How to sort multilingual data?
- Binary sort used for double-byte data
- With UTF-8, Europe is 2-byte, CJK 3-byte
- Solution
- Binary sort on server
- Client uses native sort
- Bump key length limit for UTF-8
- Next phase will be enhanced sort
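What a binary sort does to multilingual UTF-8 keys, and why the key length limit has to grow, can be seen in a short Python sketch. The word list is made up for illustration. Comparing raw UTF-8 bytes gives the same order as comparing code points, which is stable and cheap but not linguistic.

```python
words = ["zebra", "Äpfel", "日本", "apple", "été"]

# Binary sort: compare the raw UTF-8 bytes of each key.
binary_sorted = sorted(words, key=lambda w: w.encode("utf-8"))

# Python compares str values by code point, so this is code point order.
code_point_sorted = sorted(words)

print(binary_sorted)
print(binary_sorted == code_point_sorted)

# The same text needs more key bytes in UTF-8 than in a legacy charset.
for word, legacy in [("été", "latin-1"), ("日本", "shift_jis")]:
    print(word, len(word.encode(legacy)), len(word.encode("utf-8")))
```

In the binary order "Äpfel" lands after "zebra", which is the kind of result a native client-side sort or a later enhanced sort would correct.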
13Architecture changes: Character conversion algorithms
- Existing, user-definable, conversions
- Single-byte character set table maps
- Double-byte Shift-JIS <-> EUCJIS algorithm
- New table-driven automated conversions
- Single-byte to UTF-8, and back
- Double-byte to UCS-2 and back
- UTF-8 <-> UCS-2
- Trie for speed and memory optimization
- Requires significant QA for data integrity
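A table-driven single-byte conversion and its round-trip check can be sketched in Python as follows. This is an assumption about the general approach, not the product's code: the names `TO_UTF8`, `FROM_UTF8`, and the two functions are hypothetical, and the table is built here from Python's ISO 8859-15 codec rather than from a compiled character table.

```python
# 256-entry table: single-byte value -> UTF-8 byte sequence.
TO_UTF8 = [bytes([b]).decode("iso8859_15").encode("utf-8") for b in range(256)]

# Reverse table: UTF-8 byte sequence -> single-byte value.
FROM_UTF8 = {seq: b for b, seq in enumerate(TO_UTF8)}


def single_byte_to_utf8(data: bytes) -> bytes:
    """Convert by table lookup, one input byte at a time."""
    return b"".join(TO_UTF8[b] for b in data)


def utf8_to_single_byte(data: bytes) -> bytes:
    """Convert back; raises KeyError for characters the charset lacks."""
    return bytes(FROM_UTF8[ch.encode("utf-8")] for ch in data.decode("utf-8"))


# Data-integrity check: every byte value must survive a round trip.
all_bytes = bytes(range(256))
round_trip = utf8_to_single_byte(single_byte_to_utf8(all_bytes))
print(round_trip == all_bytes)

# One input byte can become one, two, or three UTF-8 bytes.
print(sorted({len(seq) for seq in TO_UTF8}))
```

An exhaustive round trip like this is feasible for single-byte tables; double-byte tables need the same kind of check over every defined code.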
14Architecture changes: Impact on the 4GL user
- 4GL is character set independent
- Almost all functions are character-based
- 3 functions require optional byte-basing
- Length, Substring, Overlay
- Options: Byte, Character
- Add new option: Column
- Format (Picture) Phrase
- XXXX has different meaning for UTF-8
15Status
- Functioning Well
- Going to second beta
- Implemented with very low cost
- Performance is OK
- Metrics not yet available
- Testing is most significant cost
- Reviewing all character set properties
- Evaluating all conversions
16Pièce de Résistance
17Futures
- For the Progress International Team
- Multilingual Clients
- Enhanced Character Folding
- Enhanced Sorting
- For Progress Customers
- Deployment of multilingual databases
- Worldwide access to these databases
- Worldwide deployment of multi-language
applications
18Conclusions
- Migration can be achieved in phases
- Migration thru UTF-8 can be low cost
- Double-byte applications can migrate easily to UTF-8
- Asian users can integrate with other languages now
- Non-English users can integrate with Asian languages now
19Any questions?