EACC to Unicode Migration

1
EACC to Unicode Migration
OCLC CJK Users Group 2006 Annual Meeting,
April 8, 2006, San Francisco
  • Ki Tat LAM
  • Head of Library Systems
  • The Hong Kong University of Science and
    Technology Library
  • lblkt_at_ust.hk

2
Contents
  • Migrating systems from EACC to Unicode
    environments
  • Why migrating?
  • What has been done?
  • HKIUG Unicode Initiatives
  • Issues
  • EACC/Unicode mapping table
  • Round-trip cross-walk
  • Improving searching with TSVCC Linking
  • Font display

3
An Observation
4
????
System for determining the beginning, length and
divisions of a year
5
(No Transcript)
6
Why Migrating?
  • EACC (East Asian Character Code, ANSI
    Z39.64-1989) was introduced into the CJK library
    community by RLG in the early 1980s (known as
    REACC at that time)
  • It was an important milestone: for the first
    time, we had a C-J-K unified standard with a
    relatively large character set (about 16,000
    characters) for use in bibliographic records

7
Why Migrating? cont.
  • By adopting EACC as an alternate character set in
    MARC 21 (at that time it was called USMARC),
    libraries with East Asian collections were able
    to share and use CJK cataloging records via the
    OCLC and RLIN cataloging platforms
  • However, great effort is required for integrated
    library systems (ILS) to make use of the
    EACC-based CJK data in the records

8
Why Migrating? cont.
  • Communicating in EACC is extremely difficult
    because EACC was never supported in the
    mainstream IT environment
  • EACC is hardly supported by operating systems,
    fonts, input methods, editors, etc., either in
    the old days or today
  • EACC is also unlikely to be supported in web
    browsers in the current Internet era
  • Why? EACC's three-byte coding structure is
    alien to the binary computing world

9
Why Migrating? cont.
  • Due to its unpopularity, EACC became a frozen
    standard, and there is no way to fix errors or
    add characters
  • If EACC is stored natively in the bibliographic
    database, then in order to input and display CJK
    characters at the application layers (such as
    the OPAC and record editor), the ILS has to rely
    on lossy mapping tables to map EACC to other
    character encodings (e.g. BIG5, GB, JIS, KSC and
    UTF-8)

10
Why Migrating? cont.
  • Unicode comes to the rescue
  • Single standard for written texts of almost all
    languages in the world
  • Has more than 96,000 characters, most of them
    CJK
  • An active standard, with constant updates
  • Widely adopted and supported in the current IT
    environment: major operating systems and web
    browsers, plus many devices and applications,
    speak the Unicode language

11
Why Migrating? cont.
  • After more than 25 years of EACC influence, it
    is unlikely that all library systems and data
    can be migrated overnight to the Unicode
    mainstream
  • It is anticipated that there will be a period of
    parallel operation, with co-existing EACC and
    Unicode bibliographic data interchanging among
    systems, resulting in confusion and data loss
  • Even if systems have migrated to Unicode, there
    are still problems that require attention

12
What has been done?
  • MARC 21 specifications for MARC-8 and UCS/Unicode
    environment
  • LC's code tables for mapping between MARC-8 and
    Unicode
  • OCLC WorldCat migration to Unicode platform
  • OCLC Connexion's Unicode support
  • LC's Voyager upgrade
  • INNOPAC/Millennium
  • HKIUG Unicode Initiatives

13
MARC 21 Specifications
  • In 2000, the Library of Congress issued
    specifications to distinguish the encoding of
    MARC 21 records in the original (MARC-8)
    environment and in the new UCS/Unicode
    environment:
    http://www.loc.gov/marc/specifications/speccharintro.html
  • MARC-8 means characters are encoded in either
    one 8-bit byte (e.g. ASCII) or three 8-bit bytes
    (e.g. EACC)
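
The three-byte structure can be made concrete with a short Python sketch. It takes the code 27462A, which appears later in this presentation, and splits it into its three bytes. The helper names are hypothetical, and the observation that each byte falls in the printable ASCII range is an assumption about EACC that fits the "viewed in Notepad" slide rather than something stated here.

```python
def eacc_bytes(code: str) -> bytes:
    """Split a six-hex-digit EACC code such as '27462A' into three bytes."""
    raw = bytes.fromhex(code)
    if len(raw) != 3:
        raise ValueError(f"EACC codes are three bytes long, got {len(raw)}")
    return raw


def format_eacc(raw: bytes) -> str:
    """Render the bytes the way the slides do, e.g. '27 46 2A'."""
    return " ".join(f"{b:02X}" for b in raw)


raw = eacc_bytes("27462A")
print(format_eacc(raw))       # 27 46 2A
# Read as single-byte ASCII, the same three bytes are just punctuation
# and a letter, which is what a plain text editor would display.
print(raw.decode("ascii"))    # 'F*
```

A program that does not know a run of bytes is EACC has no way to tell these three bytes apart from ordinary text, which is one way to read "alien to the binary computing world".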

14
A MARC 21 bibliographic record in ISO 2709 format
viewed in Notepad, showing CJK characters encoded
in EACC in the MARC-8 environment
15
MARC 21 Specifications cont.
  • UCS/Unicode Environment:
    http://www.loc.gov/marc/specifications/speccharucs.html
  • Use UTF-8 as character encoding
  • Leader position 9 contains value "a"
  • Field 066 (Character Sets Present) is not needed
  • The script identification information in subfield
    6 (Linkage) can be dropped
  • Lengths specified by number of 8-bit bytes,
    rather than number of characters.
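
Two of the points above lend themselves to a small Python sketch: testing Leader position 9 for the value "a", and counting a field's length in UTF-8 bytes rather than characters. The function names and the sample leader string are made up for illustration, and the two-character string U+66C6 U+6CD5 is only an assumed example of CJK data.

```python
def is_unicode_record(leader: str) -> bool:
    """True if the 24-character MARC leader has 'a' at position 9 (0-based)."""
    return len(leader) == 24 and leader[9] == "a"


def length_in_bytes(text: str) -> int:
    """Length as counted in the UCS/Unicode environment: UTF-8 bytes."""
    return len(text.encode("utf-8"))


# Hypothetical leaders that differ only at position 9.
unicode_leader = "00714cam a2200205 a 4500"
marc8_leader = "00714cam  2200205 a 4500"

print(is_unicode_record(unicode_leader))  # True
print(is_unicode_record(marc8_leader))    # False

title = "\u66c6\u6cd5"                    # two CJK characters
print(len(title))                         # 2 characters
print(length_in_bytes(title))             # 6 bytes in UTF-8
```

The gap between 2 characters and 6 bytes is why directory and record lengths have to be recomputed, not copied, when a record moves from MARC-8 to UTF-8.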

16
MARC 21 Specifications cont.
  • Unicode combining rule for diacritics, i.e.
    combining marks follow rather than precede the
    character they modify
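
As a rough Python illustration of the ordering rule, the sketch below assumes the text has already been transcoded character by character, so the combining marks are Unicode code points but still sit in MARC-8 order (mark first, base letter second). The function name is hypothetical; unicodedata.combining is used only to tell marks from base characters.

```python
import unicodedata


def marks_after_base(chars: str) -> str:
    """Move each run of combining marks to follow the next base character."""
    out = []
    pending = []                      # marks seen before their base character
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)       # marks now follow the base
            pending.clear()
    out.extend(pending)               # stray marks with no base are kept
    return "".join(out)


marc8_order = "caf\u0301e"            # acute accent precedes the "e"
unicode_order = marks_after_base(marc8_order)
print(unicode_order == "cafe\u0301")                                   # True
print(unicodedata.normalize("NFC", unicode_order) == "caf\u00e9")      # True
```

The final NFC step is optional and is shown only to confirm that the reordered sequence composes to the expected precomposed letter.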

17
(No Transcript)
18
MARC 21 Specifications cont.
  • LC issued code tables for mapping between MARC-8
    and UCS/Unicode
  • Not only for EACC, but also for other Latin and
    non-Latin scripts such as ANSEL, Hebrew,
    Cyrillic, Arabic and Greek
  • Provide essential information for ILSs' Unicode
    implementation

19
(No Transcript)
20
(No Transcript)
21
MARC 21 Specifications cont.
  • UNICODE-MARC Discussion List:
    http://listserv.loc.gov/listarch/unicode-marc.html
  • Since July 2005
  • Active discussion on issues concerning Unicode
    implementation in MARC 21
  • Some of the discussion was summarized as MARC
    Proposal 2006-04, "Technique for conversion of
    Unicode to MARC-8", which was approved by MARBI
    in January 2006, with changes:
    http://www.loc.gov/marc/marbi/2006/2006-04.html

22
OCLC WorldCat and Connexion
  • WorldCat migrated to Oracle with Unicode
    support
  • Released Connexion client software
  • Unicode-based, running on Windows
  • Comprehensive CJK support
  • Relies on Windows IME for input of CJK characters
  • Export and import of records in both MARC-8 and
    UCS/Unicode environments.

23
LC's Catalog
  • Its Voyager system was upgraded recently to
    provide Unicode support
  • Capable of displaying and searching CJK data in
    880 fields
  • Allows export of records in MARC-8 and Unicode
    environments
  • Issued a cataloging policy position paper for the
    Unicode implementation at LC (March 2006), with
    details on current implementation and future
    opportunities:
    http://www.loc.gov/catdir/cpso/unicode.pdf

24
INNOPAC/Millennium
  • INNOPAC has been supporting EACC, and CJK in
    general, since its implementation at HKUST
    Library 15 years ago
  • Millennium clients run on Windows XP with Unicode
    support
  • CJK records are stored in EACC internally, but
    the system provides an option to migrate the
    storage to Unicode
  • HKIUG Unicode Task Force is working with the
    vendor to improve the Unicode storage

25
HKIUG Unicode Initiatives
  • HKIUG: Hong Kong Innovative Users Group
  • Founded in 1996
  • Members from all 15 INNOPAC libraries in Hong
    Kong and Macau, including the eight Hong Kong
    government-funded universities
  • HKIUG Unicode Initiatives: since 2003, working
    closely with the ILS vendor (Innovative
    Interfaces Inc.) to improve INNOPAC /
    Millennium's CJK support

26
HKIUG Unicode Initiatives cont.
  • Achievements
  • Developed HKIUG Version of the EACC to Unicode
    mapping table
  • Resolved EACC to Unicode multi-mapping problem
  • Developed TSVCC (Traditional, Simplified, Variant
    Chinese Characters) linking tables
  • HKIUG Unicode Task Force: maintains the Unicode
    and TSVCC tables and assists the vendor with
    Unicode migration; members from CUHK, CITYU,
    HKUST and HKU

27
Migration Issues
  • The need for an EACC/Unicode mapping table
  • Multi-mapping and round-trip failure problems
  • TSVCC linking
  • Font display problem

28
HKIUG EACC/Unicode Table
  • First released in September 2003; last revised
    in August 2005
  • Contains
  • 15,672 EACC characters
  • 7,043 pure CCCII characters
  • Mapping for EACC characters follows LC as much
    as possible
  • The 7,043 pure CCCII characters have no EACC
    equivalent; they are included to avoid too many
    missing characters

29
(No Transcript)
30
(No Transcript)
31
HKIUG EACC/Unicode Table cont.
  • Identified
  • 160 multi-mapping linked cases
  • 49 multi-mapping unlinked cases
  • Both cause failure in round-trip crosswalk

32
(No Transcript)
33
U+5386
34
Export
35
Export output is 27 46 2A, which is incorrect!
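
The round-trip failure can be reproduced with a minimal Python sketch. It uses only the multi-mapping named in this presentation (EACC 27462A and 274349 both map to U+5386). The function names are hypothetical, and two things are assumptions: that the record originally held 274349, and that the reverse table keeps the first EACC code it sees for each code point.

```python
# Forward table: two EACC codes share one Unicode code point.
EACC_TO_UNICODE = {
    "27462A": 0x5386,
    "274349": 0x5386,
}


def build_reverse(forward):
    """Unicode -> EACC table; assumed policy: the first EACC code wins."""
    reverse = {}
    for eacc, code_point in forward.items():
        reverse.setdefault(code_point, eacc)
    return reverse


def multi_mapped(forward):
    """Code points that more than one EACC code maps to."""
    by_code_point = {}
    for eacc, code_point in forward.items():
        by_code_point.setdefault(code_point, []).append(eacc)
    return {cp: codes for cp, codes in by_code_point.items() if len(codes) > 1}


def round_trip(eacc, forward, reverse):
    """Import an EACC code to Unicode, then export it back to EACC."""
    return reverse[forward[eacc]]


reverse = build_reverse(EACC_TO_UNICODE)
print(round_trip("27462A", EACC_TO_UNICODE, reverse))  # 27462A (survives)
print(round_trip("274349", EACC_TO_UNICODE, reverse))  # 27462A (changed)
print(multi_mapped(EACC_TO_UNICODE))   # {21382: ['27462A', '274349']}
```

Whatever policy the reverse table uses, one of the two EACC codes cannot survive the trip, because the Unicode side no longer records which one the data came from.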
36
TSVCC Linking
  • When searching 曆法 (Li fa), you will prefer to
    retrieve records that have
  • 曆法
  • 历法
  • where 曆 and 历 have a Traditional/Simplified
    relationship
  • Similarly, when searching ?, you will prefer to
    retrieve its Variant ?
  • This requires linking T, S and V forms during
    searching
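
One common way to link forms at search time is to reduce both the query and the indexed text to a single representative per link group. The Python sketch below illustrates that idea and does not describe the HKIUG or vendor implementation. The link table is hypothetical and holds one assumed group: U+66C6 (traditional) and U+5386 (simplified).

```python
# Hypothetical link table: each tuple is one group of linked forms.
TSVCC_GROUPS = [
    ("\u66c6", "\u5386"),   # assumed traditional / simplified pair
]


def build_canonical_map(groups):
    """Map every member of a group to the group's first member."""
    canonical = {}
    for group in groups:
        head = group[0]
        for ch in group:
            canonical[ch] = head
    return canonical


CANONICAL = build_canonical_map(TSVCC_GROUPS)


def search_key(text):
    """Replace each linked character with its group representative."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)


def search(query, records):
    """Return indexes of records whose normalized text contains the query."""
    key = search_key(query)
    return [i for i, rec in enumerate(records) if key in search_key(rec)]


records = [
    "\u66c6\u6cd5",   # traditional form of the two-character term
    "\u5386\u6cd5",   # simplified form of the same term
    "\u6cd5",         # second character alone
]
print(search("\u5386\u6cd5", records))   # [0, 1]
print(search("\u66c6\u6cd5", records))   # [0, 1]
```

Only the search key is normalized; the stored records keep their original characters, so what is displayed remains what was cataloged.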

37
(No Transcript)
38
(No Transcript)
39
????? Excuse me, are they typos? Shouldn't it be
?????
40
Google is capable of linking ? and ?
41
TSVCC Linking cont.
  • HKIUG Unicode Task Force constructed two versions
    of TSVCC Linking tables, for ILSs that store
    characters in EACC and in Unicode respectively
  • EACC Version: released November 2004
  • Unicode Version: draft created March 2006

42
TSVCC Linking cont.
  • EACC Version
  • Table M (80 entries): linking relationship is not
    purely from EACC, e.g.
  • 214349 ? 274349 ? 2D4349 ? 21462A ?
    27462A ? 4B462A ? U+5386 multi-mapped
    27462A,274349
  • Table V (3065 entries): linking relationship is
    purely from EACC, e.g.
  • 21306C ? 2D306C ? 33306C ? 4B306C ?
43
(No Transcript)
44
(No Transcript)
45
TSVCC Linking cont.
  • Unicode Version
  • Still in draft construction
  • So far has 3061 entries, e.g.
  • U+5C5B ? U+5C4F ? U+6452 ? EACC link
    (27/21415A) AND Variant form of U+5C4F is
    U+5C5B
  • U+965D ? U+965C ? U+9655 ? EACC link
    (23/294A44) AND Simplified form of U+965D is
    U+9655
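
Entries like these combine links from more than one source (an EACC link plus a variant or simplified relationship), so one plausible way to build the table is to merge pairwise links into groups. The Python sketch below does this with a small union-find. The code points come from the two examples above, but which two code points each individual link connects is an assumption, as is the function name.

```python
def build_groups(pairs):
    """Merge pairwise links into groups of mutually linked code points."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for code_point in list(parent):
        groups.setdefault(find(code_point), set()).add(code_point)
    return sorted(sorted(group) for group in groups.values())


# Assumed pairings; only the resulting groups are taken from the slide.
links = [
    (0x5C4F, 0x5C5B),   # variant form
    (0x5C4F, 0x6452),   # assumed to be the EACC link
    (0x965D, 0x9655),   # simplified form
    (0x965D, 0x965C),   # assumed to be the EACC link
]

for group in build_groups(links):
    print(" ".join(f"U+{cp:04X}" for cp in group))
# U+5C4F U+5C5B U+6452
# U+9655 U+965C U+965D
```

Because linking is transitive, two characters with no direct relationship can end up in the same group, which is worth keeping in mind when weighing recall against precision.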

46
(No Transcript)
47
(No Transcript)
48
TSVCC Linking cont.
  • Plan to include linking of New/Old forms in the
    TSVCC Unicode Version, e.g.

49
TSVCC Linking cont.
  • Results of implementing TSVCC Linking
  • Improvement in searching: higher recall
  • Trade-off: lower precision
  • If search results are sorted/displayed in TSVCC
    normalized form, misleading and inaccurate
    display may occur - such as the OCLC Connexion
    browse list display problem mentioned previously

50
Font Issues
  • Do not believe in "What you see is what you
    have", because what you see varies with fonts!
  • For example, the following glyphs have different
    code points in EACC

51
Font Issues
  • But in Unicode, they are assigned the same code
    points. Depending on the font in use, you will
    see different glyphs

52
Conclusion
  • How far are we?
  • Both LC and OCLC have done enormous work in
    enabling and promoting the use of Unicode in MARC
    records
  • ILS vendors are working very hard to implement
    and enhance the Unicode support
  • Libraries and CJK experts are providing advice
    and suggesting solutions

53
Conclusion cont.
  • We have reviewed various migration issues
  • The need for an accurate EACC/Unicode mapping
    table
  • Extending to non-EACC characters
  • Multi-mappings and round-trip failure
  • TSVCC Linking
  • Font display issues

54
Conclusion cont.
  • The failure of round-trip crosswalk between
    systems will continue to be a problem until
    everyone interchanges MARC records purely in
    Unicode. This will only happen when the majority
    of systems store and use data natively in
    Unicode.
  • Unlike EACC, Unicode does not have a built-in
    linking relationship. Implementing TSVCC is
    essential for improving searching.

55
Additional References
  • Assessment of Options for Handling Full Unicode
    Character Encodings in MARC 21. Part 1: New
    Scripts (January 2004) and Part 2: Issues (June
    2005).
    http://www.loc.gov/marc/marbi/list-report.html
  • Aliprand, Joan M. The structure and content of
    MARC 21 records in the Unicode environment.
    Information Technology and Libraries, v.24, no.4,
    December 2005, p.170-179.
  • Wong, Philip and K.T. Lam. HKIUG's Unicode
    projects: untangling the chaotic codes. HKIUG
    Annual Meeting 2005.
    http://hdl.handle.net/1783.1/2429

56
Thank You!