Title: EACC to Unicode Migration
1EACC to Unicode Migration
OCLC CJK Users Group 2006 Annual Meeting,
April 8, 2006, San Francisco
- Ki Tat LAM
- Head of Library Systems
- The Hong Kong University of Science and
Technology Library
- lblkt_at_ust.hk
2Contents
- Migrating systems from EACC to Unicode
environments
- Why migrating?
- What has been done?
- HKIUG Unicode Initiatives
- Issues
- EACC/Unicode mapping table
- Round-trip cross-walk
- Improving searching with TSVCC Linking
- Font display
3An Observation
4????
System for determining the beginning, length and
divisions of a year
6Why Migrating?
- EACC (East Asian Character Code, ANSI
Z39.64-1989) was introduced into the CJK library
community by RLG in the early 1980s (known as
REACC at that time)
- It was an important milestone: for the first
time, we had a unified C-J-K standard
with a relatively large character set (about
16,000 characters) for use in bibliographic records
7Why Migrating? cont.
- By adopting EACC as an alternate character set in
MARC 21 (at that time it was called USMARC),
libraries with East Asian collections were able
to share and use CJK cataloging records via the
OCLC and RLIN cataloging platforms
- However, great effort is required for integrated
library systems (ILS) to make use of the
EACC-based CJK data in the records
8Why Migrating? cont.
- Communicating in EACC is extremely difficult
because EACC was never supported in the
mainstream IT environment
- EACC is hardly supported by operating
systems, fonts, input methods, editors, etc.,
either in the old days or today
- It is also unlikely that EACC will be supported in
web browsers in the current Internet era
- Why? EACC's three-byte coding structure is
alien to the binary computing world
9Why Migrating? cont.
- Due to its unpopularity, EACC became a frozen
standard, and there is no way to fix errors or
add characters
- If EACC is stored natively in the bibliographic
database, then in order to input and display CJK
characters at the application layers (such as
OPAC and record editor), the ILS has to rely on
lossy mapping tables to map EACC to other
character encodings (e.g. BIG5, GB, JIS, KSC and
UTF-8)
10Why Migrating? cont.
- Unicode comes to the rescue
- Single standard for written texts of almost all
languages in the world
- Has more than 96,000 characters, most of them
CJK
- An active standard, with constant updates
- Widely adopted and supported in the current IT
environment: major operating systems and web
browsers, plus many devices and applications,
speak the Unicode language
11Why Migrating? cont.
- After more than 25 years of EACC influence, it is
unlikely that all library systems and data can be
migrated overnight to the Unicode mainstream
- It is anticipated that there will be a period of
parallel operation, with co-existing EACC and
Unicode bibliographic data interchanged among
systems, resulting in confusion and data loss
- Even after systems have migrated to Unicode, there
are still problems that require attention
12What has been done?
- MARC 21 specifications for MARC-8 and UCS/Unicode
environments
- LC's code tables for mapping between MARC-8 and
Unicode
- OCLC WorldCat migration to Unicode platform
- OCLC Connexion's Unicode support
- LC's Voyager upgrade
- INNOPAC/Millennium
- HKIUG Unicode Initiatives
13MARC 21 Specifications
- In 2000, the Library of Congress issued
specifications to distinguish the encoding of
MARC 21 records in the original (MARC-8)
environment and in the new UCS/Unicode
environment
http://www.loc.gov/marc/specifications/speccharintro.html
- MARC-8 means characters are encoded in one 8-bit
byte (e.g. ASCII) or three 8-bit bytes (e.g.
EACC)
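The difference between one-byte and three-byte characters can be illustrated with a minimal Python sketch. The helper name `split_eacc` is hypothetical, and the sample bytes simply reuse two EACC codes quoted elsewhere in this talk (27462A and 214349). The sketch ignores the escape sequences that switch character sets inside a real MARC-8 field.

```python
def split_eacc(data: bytes) -> list[str]:
    """Split a run of EACC bytes into three-byte codes, shown as hex strings."""
    if len(data) % 3 != 0:
        raise ValueError("EACC data must be a multiple of three bytes")
    return [data[i:i + 3].hex().upper() for i in range(0, len(data), 3)]


# Six bytes hold exactly two EACC characters.
eacc_run = bytes.fromhex("27462A214349")
codes = split_eacc(eacc_run)

# By contrast, four ASCII characters take four bytes, one per character.
ascii_length = len("EACC".encode("ascii"))

print(codes, ascii_length)
```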
14A MARC 21 bibliographic record in ISO2709 format
viewed in Notepad, showing CJK characters encoded
in EACC in MARC-8 environment
15MARC 21 Specifications cont.
- UCS/Unicode Environment
http://www.loc.gov/marc/specifications/speccharucs.html
- Use UTF-8 as character encoding
- Leader position 9 contains value "a"
- Field 066 (Character Sets Present) is not needed
- The script identification information in subfield
$6 (Linkage) can be dropped
- Lengths specified by number of 8-bit bytes,
rather than number of characters.
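Two of these points can be checked mechanically. The following is a minimal sketch, not part of any MARC library: the function names and the two sample leaders are made up for illustration. It tests Leader position 9 and shows why lengths must be counted in bytes, using U+5386, a character discussed later in this talk.

```python
def is_unicode_environment(leader: str) -> bool:
    """Return True when Leader position 9 (counting from 0) holds 'a'."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    return leader[9] == "a"


def utf8_length(text: str) -> int:
    """Length in 8-bit bytes, which is what MARC length fields record."""
    return len(text.encode("utf-8"))


# Hypothetical leaders: identical except for position 9.
unicode_leader = "00714cam a2200205 a 4500"
marc8_leader = "00714cam  2200205 a 4500"

cjk_char = "\u5386"  # one character ...
print(is_unicode_environment(unicode_leader), is_unicode_environment(marc8_leader))
print(len(cjk_char), utf8_length(cjk_char))  # ... but three bytes in UTF-8
```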
16MARC 21 Specifications cont.
- Unicode combining rule for diacritics, i.e.
combining marks follow rather than precede the
character they modify
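The ordering rule can be sketched in Python as follows. This is an illustration only: it assumes the text has already been mapped to Unicode code points but is still in MARC-8 order (marks before the base character), and the function name `marks_after_base` is hypothetical.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move each run of combining marks so that it follows its base character."""
    out = []
    pending = []  # marks seen before their base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # stray marks with no base are kept at the end
    return "".join(out)


marc8_order = "\u0301ecole"  # combining acute accent, then "e"
unicode_order = marks_after_base(marc8_order)
composed = unicodedata.normalize("NFC", unicode_order)
print(ascii(unicode_order), ascii(composed))
```

Once the marks follow their base characters, standard normalization such as NFC can compose them where a precomposed form exists.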
18MARC 21 Specifications cont.
- LC issued code tables for mapping between MARC-8
and UCS/Unicode
- Not only for EACC, but also for Latin and other
non-Latin character sets such as ANSEL, Hebrew,
Cyrillic, Arabic and Greek
- Provide essential information for ILS Unicode
implementations
21MARC 21 Specifications cont.
- UNICODE-MARC Discussion List
http://listserv.loc.gov/listarch/unicode-marc.html
- Since July 2005
- Active discussion on issues concerning Unicode
implementation in MARC 21
- Some of the discussion was summarized as MARC
Proposal 2006-04, "Technique for conversion of
Unicode to MARC-8", which was approved by MARBI in
January 2006, with changes.
http://www.loc.gov/marc/marbi/2006/2006-04.html
22OCLC WorldCat and Connexion
- WorldCat migrated to Oracle with Unicode
support
- Released Connexion client software
- Unicode-based, running on Windows
- Comprehensive CJK support
- Relies on Windows IME for input of CJK characters
- Export and import of records in both MARC-8 and
UCS/Unicode environments.
23LC's Catalog
- Its Voyager system was upgraded recently to
provide Unicode support
- Capable of displaying and searching CJK data in
880 fields
- Allows export of records in MARC-8 and Unicode
environments
- Issued a cataloging policy position paper for the
Unicode implementation at LC (March 2006), with
details on current implementation and future
opportunities
http://www.loc.gov/catdir/cpso/unicode.pdf
24INNOPAC/Millennium
- INNOPAC has supported EACC, and CJK in
general, since it was implemented at HKUST
Library 15 years ago
- Millennium clients run on Windows XP with Unicode
support
- CJK records are stored internally in EACC, but the
system provides an option to migrate the storage
to Unicode
- HKIUG Unicode Task Force is working with the
vendor to improve the Unicode storage
25HKIUG Unicode Initiatives
- HKIUG: Hong Kong Innovative Users Group
- Founded in 1996
- Members from all 15 INNOPAC libraries in Hong
Kong and Macau, including the eight Hong Kong
government-funded universities
- HKIUG Unicode Initiatives: since 2003, to work
closely with the ILS vendor (Innovative
Interfaces Inc.) to improve INNOPAC /
Millennium's CJK support
26HKIUG Unicode Initiatives cont.
- Achievements
- Developed HKIUG Version of the EACC to Unicode
mapping table
- Resolved EACC to Unicode multi-mapping problem
- Developed TSVCC (Traditional, Simplified, Variant
Chinese Characters) linking tables
- HKIUG Unicode Task Force: to maintain the
Unicode and TSVCC tables and to assist the vendor
on Unicode migration; members from CUHK, CITYU,
HKUST and HKU
27Migration Issues
- The need for an EACC/Unicode mapping table
- Multi-mapping and round-trip failure problems
- TSVCC linking
- Font display problem
28HKIUG EACC/Unicode Table
- First released in September 2003; last revised in
August 2005
- Contains
- 15,672 EACC characters
- 7,043 pure CCCII characters
- Mapping for EACC characters: follows LC as much
as possible
- The 7,043 pure CCCII characters have no EACC
equivalent; they are included to avoid too many
missing characters
31HKIUG EACC/Unicode Table cont.
- Identified
- 160 multi-mapping linked cases, e.g.
- 49 multi-mapping unlinked cases, e.g.
- Causing failure in round-trip crosswalk
33U+5386
34Export
35Export output is 27 46 2A: incorrect!
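The round-trip failure can be reproduced with a small Python sketch. The two tables below are toy stand-ins, not the HKIUG or LC tables. They contain only the one multi-mapping named in this talk (U+5386 mapped from both 27462A and 274349). That the record originally held 274349 is an assumption made for illustration; the export result 27462A is the one shown on the slide.

```python
# Import direction: two different EACC codes map to the same code point.
EACC_TO_UNICODE = {
    "27462A": 0x5386,
    "274349": 0x5386,
}

# Export direction: each code point can map back to only one EACC code.
UNICODE_TO_EACC = {0x5386: "27462A"}


def round_trip(eacc: str) -> str:
    """Convert EACC to Unicode and back again."""
    return UNICODE_TO_EACC[EACC_TO_UNICODE[eacc]]


def round_trip_failures(import_table: dict, export_table: dict) -> list[str]:
    """List EACC codes that do not come back unchanged after a round trip."""
    return sorted(
        eacc for eacc, cp in import_table.items() if export_table.get(cp) != eacc
    )


failures = round_trip_failures(EACC_TO_UNICODE, UNICODE_TO_EACC)
print(round_trip("274349"), failures)
```

Whichever EACC code the export table picks, the other one cannot survive the round trip, so the failure is a property of the many-to-one mapping rather than of a particular table.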
36TSVCC Linking
- When searching for ?? (Li fa), you would want to
retrieve records that contain either
- ??
- ??
- where ? and ? have a Traditional/Simplified
relationship
- Similarly, when searching for ?, you would want to
retrieve its Variant ?
- This requires linking the T, S and V forms during
searching
39????? Excuse me, are they typos? Shouldn't it be
?????
40Google is capable of linking ? and ?
41TSVCC Linking cont.
- HKIUG Unicode Task Force constructed two versions
of TSVCC Linking tables
- EACC Version: released November 2004
- Unicode Version: draft created March 2006
- for ILSs that store characters in EACC and in
Unicode respectively
42TSVCC Linking cont.
- EACC Version
- Table M (80 entries): linking relationship is not
purely from EACC, e.g.
- 214349 ? 274349 ? 2D4349 ? 21462A ?
27462A ? 4B462A ? (U+5386 is multi-mapped to
27462A and 274349)
- Table V (3065 entries): linking relationship is
purely from EACC, e.g.
- 21306C ? 2D306C ? 33306C ? 4B306C ?
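The two kinds of entry can be sketched in Python. The sketch assumes, as a reading of the examples above, that EACC codes sharing their last two bytes belong to one linked group (the Table V case), and that a Unicode code point multi-mapped to codes in different groups joins those groups (the Table M case). Function names are hypothetical, and this is not the Task Force's actual procedure.

```python
from collections import defaultdict


def link_by_last_two_bytes(codes: list[str]) -> list[set[str]]:
    """Group EACC hex codes whose second and third bytes are identical."""
    groups = defaultdict(set)
    for code in codes:
        groups[code[2:]].add(code)  # key: last four hex digits
    return list(groups.values())


def merge_via_multi_mapping(groups: list[set[str]], multi_mapped: dict) -> list[set[str]]:
    """Join groups that are bridged by a multi-mapped Unicode code point."""
    merged = [set(g) for g in groups]
    for eacc_codes in multi_mapped.values():
        linked = set(eacc_codes)
        touched = [g for g in merged if g & linked]
        if len(touched) > 1:
            combined = set().union(*touched)
            merged = [g for g in merged if g not in touched] + [combined]
    return merged


table_v = link_by_last_two_bytes(["21306C", "2D306C", "33306C", "4B306C"])

eacc_groups = link_by_last_two_bytes(
    ["214349", "274349", "2D4349", "21462A", "27462A", "4B462A"]
)
table_m = merge_via_multi_mapping(eacc_groups, {0x5386: ["27462A", "274349"]})
print(table_v, table_m)
```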
45TSVCC Linking cont.
- Unicode Version
- Still in draft construction
- So far has 3061 entries, e.g.
- U+5C5B ? U+5C4F ? U+6452 ? EACC link
(27/21415A) AND Variant form of U+5C4F is
U+5C5B
- U+965D ? U+965C ? U+9655 ? EACC link
(23/294A44) AND Simplified form of U+965D is
U+9655
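One way such a table could be used at search time is to reduce every character to a single representative of its linked group before indexing and before querying. The sketch below is an assumption about usage, not the vendor's implementation: it treats each of the two example entries above as one linked group, and the choice of the lowest code point as representative is arbitrary.

```python
# The two example entries above, each read as one linked group.
LINK_GROUPS = [
    [0x5C5B, 0x5C4F, 0x6452],
    [0x965D, 0x965C, 0x9655],
]


def build_key_table(groups: list[list[int]]) -> dict[int, int]:
    """Map every code point in a group to one representative (the lowest)."""
    table = {}
    for group in groups:
        representative = min(group)
        for code_point in group:
            table[code_point] = representative
    return table


KEY_TABLE = build_key_table(LINK_GROUPS)


def search_key(text: str) -> str:
    """Normalize text for matching only; the stored text is left untouched."""
    return "".join(chr(KEY_TABLE.get(ord(ch), ord(ch))) for ch in text)


# A query with one form matches an index entry written with a linked form.
print(search_key("\u965d") == search_key("\u9655"))
```

Keeping the normalized key separate from the stored text matters: the key is suitable for matching, while displays should continue to show the characters as cataloged.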
48TSVCC Linking cont.
- Plan to include linking of New/Old forms in the
TSVCC Unicode Version, e.g.
49TSVCC Linking cont.
- Results of implementing TSVCC Linking
- Improvement in searching: higher recall
- Trade-off: lower precision
- If search results are sorted/displayed in TSVCC
normalized form, misleading and inaccurate
display may occur - such as the OCLC Connexion
browse list display problem mentioned previously
50Font Issues
- Do not believe in "What you see is what you have",
because what you see varies with fonts!
- For example, the following glyphs have different
code points in EACC
51Font Issues
- But in Unicode, they are assigned the same code
point. Depending on the font in use, you will
see different glyphs
52Conclusion
- How far are we?
- Both LC and OCLC have done enormous work in
enabling and promoting the use of Unicode in MARC
records
- ILS vendors are working very hard to implement
and enhance the Unicode support
- Libraries and CJK experts are providing advice
and suggesting solutions
53Conclusion cont.
- We have reviewed various migration issues
- The need for an accurate EACC/Unicode mapping
table
- Extending to non-EACC characters
- Multi-mappings and round-trip failure
- TSVCC Linking
- Font display issues
54Conclusion cont.
- The failure of round-trip crosswalk between
systems will continue to be a problem until
everyone interchanges MARC records purely in
Unicode. This will only happen when the majority
of systems store and use data natively in
Unicode.
- Unlike EACC, Unicode does not have a built-in
linking relationship. Implementing TSVCC is
essential for improving searching.
55Additional References
- Assessment of Options for Handling Full Unicode
Character Encodings in MARC 21 -- Part 1: New
Scripts (January 2004) and Part 2: Issues (June
2005).
http://www.loc.gov/marc/marbi/list-report.html
- Joan M. Aliprand. The structure and content of
MARC 21 records in the Unicode environment.
Information technology and libraries, v.24, no.4,
December 2005, p.170-179.
- Wong, Philip and K.T. Lam. HKIUG's Unicode
projects: untangling the chaotic codes. HKIUG
Annual Meeting 2005.
http://hdl.handle.net/1783.1/2429
56Thank You!