Experience Unicode Enabling MySQL - PowerPoint PPT Presentation

1
Experience Unicode Enabling MySQL
Thomas Emerson, Senior Software Engineer
Basis Technology Corp.
7
Overview
  • Introduction
  • What is MySQL?
  • Character Set Architecture in MySQL
  • Phased Implementation
  • Summary
  • Q & A

10
Preliminaries
  • Cell Phones? Just say vibrate.
  • If you need to take a call, please get up and
    leave.
  • If you fall asleep, you will be ridiculed.

12
Assumptions
  • Unicode and Unicode Terminology
  • Basic RDBMS concepts

14
Introduction
  • Who am I and why am I here?
  • Large amounts of linguistic and lexicographic
    data
      • Simplified and Traditional Chinese
      • Japanese
      • Korean
      • Thai
      • Western and Eastern European Languages

15
Introduction
  • Who am I and why am I here?
  • Large amounts of linguistic and lexicographic
    data
  • Accessibility
      • Across Platforms
      • Web-based Interface

16
Introduction
  • Who am I and why am I here?
  • Large amounts of linguistic and lexicographic
    data
  • Accessibility
  • Low Impact
      • Could not take cycles (hard-, soft-, or wet-)
        from our Oracle 8i system and its DBA.
      • Didn't have big iron available
      • No budget

17
What is MySQL?
  • GPL'd, buzzword-compliant SQL engine
      • High Performance
      • Robust
      • Popular

18
What is MySQL?
  • GPL'd, buzzword-compliant SQL engine
  • Supports Industry Standards
      • Entry-level SQL92
      • ODBC Level 0-2

19
What is MySQL?
  • GPL'd, buzzword-compliant SQL engine
  • Supports Industry Standards
  • Extensions
      • Advanced (though complex) authentication system

20
What is MySQL?
  • GPL'd, buzzword-compliant SQL engine
  • Supports Industry Standards
  • Extensions
      • Advanced (though complex) authentication system
      • Extra datatypes, including ENUM and SET

21
What is MySQL?
  • GPL'd, buzzword-compliant SQL engine
  • Supports Industry Standards
  • Extensions
  • Excellent Support for Legacy Encodings
      • Big Five, GB 2312, and GBK
      • EUC-JP and Shift-JIS
      • TIS 620
      • ISO-Latin-1
      • KOI8-R

22
What is MySQL?
  • GPL'd, buzzword-compliant SQL engine
  • Supports Industry Standards
  • Extensions
  • Excellent Support for Legacy Encodings
  • C and C++ APIs, and bindings for Python, Perl,
    PHP, and others.

23
I18N Architecture in MySQL
  • Server can be built to support multiple encodings
  • A database can contain only a single character set
  • Support for single- and double-byte character
    sets.

24
Phased Implementation
  • UTF-8 in and out
  • UTF-8 as a multibyte encoding
  • UCS-2 as the internal encoding

26
Phase I
  • No Unicode-specific features.
  • Unicode support is piggy-backed as ISO-Latin-1.
  • This is surprisingly effective, but:
      • Wildcard searches are awkward (since each
        character is composed of up to three Latin-1
        characters)
      • No regular expression support
      • No collation support
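
The wildcard problem can be illustrated outside the database. The following is a minimal sketch in Python, not MySQL code: it assumes the column holds UTF-8 bytes that the server interprets as Latin-1, and the helper like() is a hypothetical stand-in for SQL LIKE matching.

```python
import re

text = "日本語"                        # three characters
utf8 = text.encode("utf-8")            # nine bytes, three per character
as_latin1 = utf8.decode("latin-1")     # how a Latin-1 column would see the data

def like(value, pattern):
    """Hypothetical stand-in for SQL LIKE: '%' is any run, '_' is one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, re.DOTALL) is not None

# Literal substrings still work, as long as the search term is sent as UTF-8 too.
needle = "本".encode("utf-8").decode("latin-1")
print(like(as_latin1, "%" + needle + "%"))   # True

# One '_' per character fails: the column holds nine Latin-1 characters.
print(like(as_latin1, "___"))                # False
print(like(as_latin1, "_" * 9))              # True
```

Literal searches succeed because the bytes match exactly, which is why the approach is effective at all. Only the per-character wildcard exposes the mismatch between bytes and characters.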

27
Phase I (cont.)
  • The Front End problem was solved with PHP
    (www.php.org)
  • An HTML front end using UTF-8 as the document
    charset
  • PHP is not Unicode-aware, but it just doesn't matter!
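
Why Unicode awareness does not matter here: if every layer passes bytes through untouched, the browser gets back exactly the UTF-8 it sent. A minimal sketch in Python (standing in for the PHP layer; store() and fetch() are hypothetical names for the database round trip):

```python
def store(raw: bytes) -> str:
    # Assumption: the database treats each incoming byte as one Latin-1 character.
    return raw.decode("latin-1")

def fetch(stored: str) -> bytes:
    # Reading back maps each Latin-1 character to the same byte value.
    return stored.encode("latin-1")

# Bytes as submitted from an HTML form on a UTF-8 page.
page_input = "中文 ภาษาไทย čeština".encode("utf-8")

round_tripped = fetch(store(page_input))
print(round_tripped == page_input)        # True
print(round_tripped.decode("utf-8"))      # the browser decodes it as UTF-8 again
```

Latin-1 maps all 256 byte values to distinct characters, so the round trip loses nothing. The scheme would break if any layer transcoded, truncated by character count, or changed case.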

30
Phase II
  • Treat UTF-8 as a multibyte character set
  • Simple collation model
  • Still no regular expression support
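
Treating UTF-8 as a multibyte character set comes down to finding character boundaries from the lead byte. The sketch below is in Python, not the actual MySQL character-set hooks. The function names are hypothetical, and code point order is only an assumed example of a "simple collation model".

```python
from typing import List

def utf8_char_len(lead: int) -> int:
    """Byte length of the UTF-8 sequence starting with `lead`, or 0 if not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead byte

def split_chars(data: bytes) -> List[bytes]:
    """Split a UTF-8 byte string into one byte sequence per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_char_len(data[i])
        if n == 0 or i + n > len(data):
            raise ValueError(f"invalid UTF-8 at byte offset {i}")
        chars.append(data[i:i + n])
        i += n
    return chars

def collation_key(data: bytes) -> List[int]:
    """Assumed simple collation: order strings by Unicode code point."""
    return [ord(c.decode("utf-8")) for c in split_chars(data)]

sample = "aé日".encode("utf-8")
print(len(sample), len(split_chars(sample)))   # 6 3
rows = [s.encode("utf-8") for s in ["日", "é", "a"]]
print([r.decode("utf-8") for r in sorted(rows, key=collation_key)])   # ['a', 'é', '日']
```

With character boundaries known, a single-character wildcard can consume one whole sequence instead of one byte.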

31
Phase II (cont.)
  • Rosette is used as the Unicode layer
  • No longer limited to a single character set
  • But now we need to differentiate between language
    and script!

32
Phase III
  • Use UCS-2 as the internal character
    representation.
  • Transcoding to legacy encodings as needed, so
    existing databases will continue to work.
  • Each column can have a different legacy encoding
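
The per-column transcoding idea can be sketched as follows. This uses Python's built-in codecs rather than Rosette; the column names are invented for illustration, and UTF-16BE restricted to the BMP stands in for UCS-2.

```python
def to_ucs2(raw: bytes, column_encoding: str) -> bytes:
    """Transcode legacy-encoded column data to the internal form."""
    text = raw.decode(column_encoding)
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the BMP cannot be stored as UCS-2")
    return text.encode("utf-16-be")

def from_ucs2(internal: bytes, column_encoding: str) -> bytes:
    """Transcode back to the column's legacy encoding for existing clients."""
    return internal.decode("utf-16-be").encode(column_encoding)

# Hypothetical columns, each declared with its own legacy encoding.
columns = {
    "title_ja": ("euc_jp", "日本語".encode("euc_jp")),
    "title_zh": ("big5", "中文".encode("big5")),
    "title_ru": ("koi8_r", "Привет".encode("koi8_r")),
}

for name, (encoding, raw) in columns.items():
    internal = to_ucs2(raw, encoding)
    # Existing clients get back the bytes they originally stored.
    assert from_ucs2(internal, encoding) == raw
    print(name, encoding, len(raw), "bytes in,", len(internal), "bytes internal")
```

Because every value passes through one internal representation, comparison and collation code only needs to handle that one form.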

33
Phase III (cont.)
  • Data can be imported, transcoded and filtered
    using Rosette's full transform functionality.
      • Hankaku/Zenkaku transformation
      • Case Conversion
      • SGML Entity Folding
      • Ad nauseam
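
To give a feel for these transforms, here is a rough sketch using the Python standard library in place of Rosette. The substitutions are assumptions: NFKC normalization approximates Hankaku/Zenkaku folding (it also folds other compatibility characters), and html.unescape handles HTML character references rather than arbitrary SGML entities.

```python
import html
import unicodedata

def fold_entities(text: str) -> str:
    # Replace character references such as &amp; and &eacute; with the characters.
    return html.unescape(text)

def fold_width(text: str) -> str:
    # NFKC maps full-width Latin to ASCII and half-width katakana to full-width.
    return unicodedata.normalize("NFKC", text)

def fold_case(text: str) -> str:
    # Unicode-aware case folding for caseless comparison.
    return text.casefold()

def import_filter(text: str) -> str:
    """Chain the transforms, as an import step might before storing data."""
    return fold_case(fold_width(fold_entities(text)))

print(import_filter("ＭｙＳＱＬ &amp; ｶﾀｶﾅ caf&eacute;"))
```

Entities are folded first so that any full-width or mixed-case characters they produce also pass through the width and case steps.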

34
Status
  • Phase I is complete and live.
  • Phase II is underway, as time allows. UTF-8
    support is in place; collation is still in progress.
  • Phase III is planned, but not yet started.

35
Status (cont.)
  • Removal of Rosette to use glibc features and/or
    ICU
  • Measure and improve performance
  • All will be released to the MySQL code

36
Q & A
  • Tom Emerson, tree_at_basistech.com
  • Slides and other information available
    at http://cymru.basistech.com/iuc17