1
EACC to Unicode Migration
OCLC CJK Users Group 2006 Annual Meeting, April 8, 2006, San Francisco

Ki Tat LAM
Head of Library Systems
The Hong Kong University of Science and
Technology Library
lblkt_at_ust.hk

2
Contents

Migrating systems from EACC to Unicode
environments
Why migrating?
What has been done?
HKIUG Unicode Initiatives
Issues
EACC/Unicode mapping table
Round-trip cross-walk
Improving searching with TSVCC Linking
Font display

3
An Observation
4
[CJK characters not preserved in transcript]
System for determining the beginning, length and
divisions of a year
5
(No Transcript)
6
Why Migrating?

EACC (East Asian Character Code, ANSI
Z39.64-1989) was introduced into the CJK library
community by RLG in the early 1980s (known as
REACC at that time)
It was an important milestone: for the first time, we had a unified C-J-K standard with a relatively large character set (about 16,000 characters) for use in bibliographic records

7
Why Migrating? cont.

By adopting EACC as an alternate character set in
MARC 21 (at that time it was called USMARC),
libraries with East Asian collections were able
to share and use CJK cataloging records via the
OCLC and RLIN cataloging platforms
However, great effort is required for integrated
library systems (ILS) to make use of the
EACC-based CJK data in the records

8
Why Migrating? cont.

Communicating in EACC is extremely difficult because EACC was never supported in the mainstream IT environment
EACC is hardly supported by any operating systems, fonts, input methods, editors, etc., either in the old days or today
It is also unlikely that web browsers will support EACC in the current Internet era
Why? EACC's three-byte coding structure is alien to the binary computing world

9
Why Migrating? cont.

Due to its unpopularity, EACC became a frozen standard, and there is no way to fix errors or add characters
If EACC is stored natively in the bibliographic database, then in order to input and display CJK characters at the application layer (such as the OPAC and record editor), the ILS has to rely on lossy mapping tables to map EACC to other character encodings (e.g. BIG5, GB, JIS, KSC and UTF-8)

10
Why Migrating? cont.

Unicode comes to the rescue
Single standard for written texts of almost all
languages in the world
Has more than 96,000 characters, most of them CJK
An active standard, with constant updates
Widely adopted and supported in the current IT environment: major operating systems and web browsers, plus many devices and applications, speak the Unicode language

11
Why Migrating? cont.

After more than 25 years of EACC's influence, it is unlikely that all library systems and data can be migrated overnight to the Unicode mainstream
It is anticipated that there will be a period of parallel operation, with EACC and Unicode bibliographic data co-existing and being interchanged among systems, resulting in confusion and data loss
Even after systems have migrated to Unicode, there are still problems that require attention

12
What has been done?

MARC 21 specifications for MARC-8 and UCS/Unicode environment
LC's code tables for mapping between MARC-8 and Unicode
OCLC WorldCat migration to Unicode platform
OCLC Connexion's Unicode support
LC's Voyager upgrade
INNOPAC/Millennium
HKIUG Unicode Initiatives

13
MARC 21 Specifications

In 2000, the Library of Congress issued specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment: http://www.loc.gov/marc/specifications/speccharintro.html
MARC-8 means characters are encoded in one 8-bit byte (e.g. ASCII) or three 8-bit bytes (e.g. EACC)
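
To make the one-byte versus three-byte distinction concrete, here is a minimal Python sketch that splits a MARC-8 byte string into ASCII characters and three-byte EACC codes. It assumes the escape sequences ESC $ 1 (switch to EACC) and ESC ( B (return to ASCII) given in the LC specification and ignores every other MARC-8 character set. The function name `split_marc8` is made up for illustration and is not part of any library.

```python
ESC = 0x1B  # escape byte that introduces a character-set switch


def split_marc8(data: bytes) -> list:
    """Split MARC-8 bytes into ('ascii', char) and ('eacc', 'XXXXXX') tokens.

    Assumption: only ESC $ 1 (EACC) and ESC ( B (ASCII) are handled.
    """
    tokens = []
    mode = "ascii"
    i = 0
    while i < len(data):
        if data[i] == ESC:
            designator = data[i + 1:i + 3]
            if designator == b"$1":
                mode = "eacc"
            elif designator == b"(B":
                mode = "ascii"
            else:
                raise ValueError("escape sequence not handled by this sketch")
            i += 3
            continue
        if mode == "eacc":
            chunk = data[i:i + 3]  # one EACC character = three 8-bit bytes
            if len(chunk) < 3:
                raise ValueError("truncated EACC character")
            tokens.append(("eacc", chunk.hex().upper()))
            i += 3
        else:
            tokens.append(("ascii", chr(data[i])))  # one ASCII character = one byte
            i += 1
    return tokens


# Sample data: ASCII text, then one EACC character (274349), then ASCII again.
sample = b"Li fa " + b"\x1b$1" + bytes.fromhex("274349") + b"\x1b(B" + b"."
tokens = split_marc8(sample)
```

Note that the three EACC bytes 27 43 49 are themselves printable ASCII values, which is why EACC data looks like random punctuation and letters when a record is opened in a plain text editor.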

14
A MARC 21 bibliographic record in ISO2709 format
viewed in Notepad, showing CJK characters encoded
in EACC in MARC-8 environment
15
MARC 21 Specifications cont.

UCS/Unicode Environment: http://www.loc.gov/marc/specifications/speccharucs.html
Use UTF-8 as character encoding
Leader position 9 contains value "a"
Field 066 (Character Sets Present) is not needed
The script identification information in subfield $6 (Linkage) can be dropped
Lengths are specified by number of 8-bit bytes, rather than number of characters.
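
Two of these points are easy to check in code: the coding scheme flag in the leader, and the fact that lengths count UTF-8 bytes rather than characters. The sketch below is only an illustration; the two leader strings and the helper names are hypothetical samples, not taken from a real record.

```python
LEADER_LENGTH = 24      # a MARC 21 leader is always 24 characters long
CODING_SCHEME_POS = 9   # leader position 9: "a" = UCS/Unicode, blank = MARC-8


def is_unicode_record(leader: str) -> bool:
    """Return True when the leader flags the record as UCS/Unicode."""
    if len(leader) != LEADER_LENGTH:
        raise ValueError("a MARC 21 leader is 24 characters long")
    return leader[CODING_SCHEME_POS] == "a"


def length_in_bytes(value: str) -> int:
    """Length as counted in a UCS/Unicode MARC record: UTF-8 bytes."""
    return len(value.encode("utf-8"))


# Hypothetical sample leaders, differing only in position 9.
UNICODE_LEADER = "00714cam a2200205 a 4500"
MARC8_LEADER = "00714cam  2200205 a 4500"

# Six ASCII characters plus two CJK characters (U+5386, U+6CD5).
title = "Li fa \u5386\u6cd5"
chars = len(title)              # counts characters
octets = length_in_bytes(title)  # counts 8-bit bytes; each CJK character takes three
```

A system that computes directory lengths with the character count instead of the byte count will produce records that other systems cannot parse.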

16
MARC 21 Specifications cont.

Unicode combining rule for diacritics, i.e.
combining marks follow rather than precede the
character they modify
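
The reordering rule can be sketched in a few lines of Python. This is a simplification: it assumes the MARC-8 diacritic bytes have already been mapped one-to-one to Unicode combining characters and only fixes their position. The function name `marks_after_base` is invented for this example.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move combining marks that precede a base character to follow it."""
    out = []
    pending = []  # marks seen before their base character
    for ch in text:
        if unicodedata.combining(ch):  # non-zero class means a combining mark
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # keep stray marks rather than dropping data
    return "".join(out)


# MARC-8 order: acute accent first, then the letter it modifies.
marc8_order = "\u0301e"
unicode_order = marks_after_base(marc8_order)
composed = unicodedata.normalize("NFC", unicode_order)
```

After reordering, standard Unicode normalization (NFC here) can compose the pair into a single precomposed character where one exists.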

17
(No Transcript)
18
MARC 21 Specifications cont.

LC issued code tables for mapping between MARC-8
and UCS/Unicode
Not only for EACC, but also for other Latin and
non-Latin scripts such as ANSEL, Hebrew,
Cyrillic, Arabic and Greek
Provide essential information for an ILS's Unicode implementation

19
(No Transcript)
20
(No Transcript)
21
MARC 21 Specifications cont.

UNICODE-MARC Discussion List: http://listserv.loc.gov/listarch/unicode-marc.html
Since July 2005
Active discussion on issues concerning Unicode implementation in MARC 21
Some of the discussion was summarized as MARC Proposal 2006-04, "Technique for conversion of Unicode to MARC-8", which was approved by MARBI in January 2006, with changes. http://www.loc.gov/marc/marbi/2006/2006-04.html

22
OCLC WorldCat and Connexion

WorldCat migrated to Oracle with Unicode
support
Released Connexion client software
Unicode-based, running on Windows
Comprehensive CJK support
Relies on Windows IME for input of CJK characters
Export and import of records in both MARC-8 and
UCS/Unicode environments.

23
LC's Catalog

Its Voyager system was upgraded recently to provide Unicode support
Capable of displaying and searching CJK data in 880 fields
Allows export of records in MARC-8 and Unicode environments
Issued a cataloging policy position paper for the Unicode implementation at LC (March 2006), with details on current implementation and future opportunities: http://www.loc.gov/catdir/cpso/unicode.pdf

24
INNOPAC/Millennium

INNOPAC has been supporting EACC, and CJK in
general, since its implementation at HKUST
Library 15 years ago
Millennium clients run on Windows XP with Unicode
support
CJK records are stored in EACC internally, but the system provides an option to migrate the storage to Unicode
HKIUG Unicode Task Force is working with the
vendor to improve the Unicode storage

25
HKIUG Unicode Initiatives

HKIUG: Hong Kong Innovative Users Group
Founded in 1996
Members from all 15 INNOPAC libraries in Hong Kong and Macau, including the eight Hong Kong government-funded universities
HKIUG Unicode Initiatives: since 2003, working closely with the ILS vendor (Innovative Interfaces Inc.) to improve INNOPAC/Millennium's CJK support

26
HKIUG Unicode Initiatives cont.

Achievements
Developed HKIUG Version of the EACC to Unicode
mapping table
Resolved EACC to Unicode multi-mapping problem
Developed TSVCC (Traditional, Simplified, Variant
Chinese Characters) linking tables
HKIUG Unicode Task Force: maintains the Unicode and TSVCC tables and assists the vendor with Unicode migration; members from CUHK, CITYU, HKUST and HKU

27
Migration Issues

The need for an EACC/Unicode mapping table
Multi-mapping and round trip failure problems
TSVCC linking
Font display problem

28
HKIUG EACC/Unicode Table

First released in September 2003; last revised in August 2005
Contains
15672 EACC characters
7043 pure CCCII characters
Mapping for EACC characters follows LC as much as possible
The 7043 pure CCCII characters have no EACC equivalent; they are included to avoid too many missing characters

29
(No Transcript)
30
(No Transcript)
31
HKIUG EACC/Unicode Table cont.

Identified
160 multi-mapping linked cases [example not in transcript]
49 multi-mapping unlinked cases [example not in transcript]
These cause failure in round-trip crosswalk

32
(No Transcript)
33
U+5386
34
Export
35
Export output is 27 46 2A: incorrect!
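
The failure shown on these slides can be reproduced with a two-entry table. The sketch below uses the two EACC codes that the deck reports as multi-mapped to U+5386 (27462A and 274349); it assumes the record originally held 274349, and the rule "take the first EACC code listed" is a stand-in for whatever a real export routine does. All function names are illustrative.

```python
from collections import defaultdict

# Forward table: EACC code -> Unicode code point (two entries from the deck).
EACC_TO_UNICODE = {
    "27462A": 0x5386,
    "274349": 0x5386,
}


def build_reverse(forward: dict) -> dict:
    """Invert the table, keeping every EACC code that maps to a code point."""
    reverse = defaultdict(list)
    for eacc, code_point in forward.items():
        reverse[code_point].append(eacc)
    return dict(reverse)


def multi_mapped(forward: dict) -> dict:
    """Code points reached from more than one EACC code."""
    return {cp: codes for cp, codes in build_reverse(forward).items() if len(codes) > 1}


def round_trip(eacc: str, forward: dict) -> str:
    """EACC -> Unicode -> EACC, picking the first candidate on the way back."""
    reverse = build_reverse(forward)
    return reverse[forward[eacc]][0]  # assumption: naive 'first match' export


original = "274349"  # assumed original value in the record
exported = round_trip(original, EACC_TO_UNICODE)
survived = exported == original
```

Once two EACC codes collapse into one code point, no reverse table can recover the original without extra information, which is why the multi-mapping cases had to be identified and resolved one by one.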
36
TSVCC Linking

When searching 曆法 (Li fa), you will prefer to retrieve records that have
曆法
历法
where 曆 and 历 have a Traditional/Simplified relationship
Similarly, when searching a character, you will prefer to retrieve its Variant form as well [example characters not in transcript]
Requires linking T, S and V forms during searching

37
(No Transcript)
38
(No Transcript)
39
[CJK text not in transcript] Excuse me, are they typos? Shouldn't it be [CJK text not in transcript]?
40
Google is capable of linking the two forms [characters not in transcript]
41
TSVCC Linking cont.

HKIUG Unicode Task Force constructed two versions of TSVCC Linking tables
EACC Version: released November 2004
Unicode Version: draft created March 2006
for ILSs that store characters in EACC and in Unicode, respectively

42
TSVCC Linking cont.

EACC Version
Table M (80 entries): linking relationship is not purely from EACC, e.g.
214349, 274349, 2D4349, 21462A, 27462A, 4B462A; U+5386 is multi-mapped to 27462A and 274349
Table V (3065 entries): linking relationship is purely from EACC, e.g.
21306C, 2D306C, 33306C, 4B306C
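
In the examples above, linked EACC codes share their last two bytes and differ only in the first byte. The following sketch groups codes on that basis. Treating "same last two bytes" as the linking rule is an inference from the listed examples, not a statement of the EACC standard or of how the HKIUG tables were built; `group_by_position` is a made-up name.

```python
from collections import defaultdict


def group_by_position(codes: list) -> dict:
    """Group six-digit EACC hex codes by their last two bytes."""
    groups = defaultdict(list)
    for code in codes:
        code = code.upper()
        if len(code) != 6:
            raise ValueError("expected a six-digit hex EACC code: " + code)
        groups[code[2:]].append(code)  # key = second and third bytes
    return dict(groups)


# Codes quoted on the slide for Table V and Table M.
table_v_example = ["21306C", "2D306C", "33306C", "4B306C"]
table_m_example = ["214349", "274349", "2D4349", "21462A", "27462A", "4B462A"]

v_groups = group_by_position(table_v_example)
m_groups = group_by_position(table_m_example)
```

The Table V example falls into a single group, so EACC's own structure is enough to link it. The Table M example falls into two groups (4349 and 462A) that are joined only through the multi-mapped U+5386, which is consistent with the slide's remark that this linking is not purely from EACC.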

43
(No Transcript)
44
(No Transcript)
45
TSVCC Linking cont.

Unicode Version
Still in draft, under construction
So far has 3061 entries, e.g.
U+5C5B 屛, U+5C4F 屏, U+6452 摒: EACC link (27/21415A) AND Variant form of U+5C4F is U+5C5B
U+965D 陝, U+965C 陜, U+9655 陕: EACC link (23/294A44) AND Simplified form of U+965D is U+9655
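
One way an ILS could apply such a table at search time is to fold every character in a linked group to a single representative before comparing. The sketch below uses only the two groups quoted on this slide; choosing the lowest code point as representative and the names `build_normalizer`, `search_key` and `matches` are assumptions made for illustration, not the HKIUG or vendor implementation.

```python
# Two linked groups taken from the slide (code points only).
TSVCC_GROUPS = [
    ("\u5C5B", "\u5C4F", "\u6452"),
    ("\u965D", "\u965C", "\u9655"),
]


def build_normalizer(groups: list) -> dict:
    """Map every character in a group to one representative character."""
    table = {}
    for group in groups:
        representative = min(group)  # assumption: lowest code point wins
        for ch in group:
            table[ord(ch)] = representative
    return table


NORMALIZE = build_normalizer(TSVCC_GROUPS)


def search_key(text: str) -> str:
    """Fold linked T/S/V forms so that they compare equal."""
    return text.translate(NORMALIZE)


def matches(query: str, heading: str) -> bool:
    """True when the query occurs in the heading after folding."""
    return search_key(query) in search_key(heading)


# A query in one form finds a heading stored in a linked form.
found = matches("\u965D", "\u9655\u897F")
```

The folded key should be used for matching only. Displaying or sorting by it would show users a form that is not in the record.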

46
(No Transcript)
47
(No Transcript)
48
TSVCC Linking cont.

Plan to include linking of New/Old forms in the TSVCC Unicode Version [example not in transcript]

49
TSVCC Linking cont.

Results of implementing TSVCC Linking
Improvement in searching: higher recall
Trade-off: lower precision
If search results are sorted or displayed in TSVCC normalized form, misleading and inaccurate display may occur, such as the OCLC Connexion browse list display problem mentioned previously

50
Font Issues

Do not believe in "What you see is what you have", because what you see varies with fonts!
For example, the following glyphs have different code points in EACC

51
Font Issues

But in Unicode, they are assigned the same code
points. Depending on the font in use, you will
see different glyphs

52
Conclusion

How far are we?
Both LC and OCLC have done enormous work in
enabling and promoting the use of Unicode in MARC
records
ILS vendors are working very hard to implement
and enhance the Unicode support
Libraries and CJK experts are providing advice
and suggesting solutions

53
Conclusion cont.

We have reviewed various migration issues
The need for an accurate EACC/Unicode mapping
table
Extending to non-EACC characters
Multi-mappings and round-trip failure
TSVCC Linking
Font display issues

54
Conclusion cont.

The failure of round-trip crosswalk between
systems will continue to be a problem until
everyone interchanges MARC records purely in
Unicode. This will only happen when the majority
of systems store and use data natively in
Unicode.
Unlike EACC, Unicode does not have a built-in linking relationship. Implementing TSVCC linking is essential for improving searching.

55
Additional References

Assessment of Options for Handling Full Unicode Character Encodings in MARC 21. Part 1: New Scripts (January 2004) and Part 2: Issues (June 2005). http://www.loc.gov/marc/marbi/list-report.html
Joan M. Aliprand. The structure and content of MARC 21 records in the Unicode environment. Information technology and libraries, v.24, no.4, December 2005, p.170-179.
Wong, Philip and K.T. Lam. HKIUG's Unicode projects: untangling the chaotic codes. HKIUG Annual Meeting 2005. http://hdl.handle.net/1783.1/2429

56
Thank You!
