Standards, Text Representation, and Localization - PowerPoint PPT Presentation

Title:

Standards, Text Representation, and Localization

Description:

Standards, Text Representation, and Localization. Lisa Moore. IBM. Unicode Consortium ... Open source and locales. The future. GK3 3rd Global Knowledge ... – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Standards, Text Representation, and Localization


1
Standards, Text Representation, and
Localization
  • Lisa Moore
  • IBM
  • Unicode Consortium

2
Agenda
  • Key standards organizations
  • Script/language support in Unicode
  • Locales
  • Open source and locales
  • The future

3
Agenda
  • Key standards organizations
  • Script/language support in Unicode
  • Locales
  • Open source and locales
  • The future

4
Web, Internet-related Standards Organizations
  • IETF
  • ICANN
  • W3C
  • Unicode
  • ISO
  • SEI
  • Unicode has liaison relationships with IETF, W3C,
    ISO, and SEI

5
Internet Organizations
  • IETF http://www.ietf.org/
  • Internet Engineering Task Force
  • International community of network designers,
    operators, vendors, and researchers concerned
    with the architecture and smooth operation of the
    Internet (e.g. TCP/IP, data transmission)
  • Defines UTF-8 as default character encoding form
    for new protocols
  • ICANN http://www.icann.org/
  • Internet Corporation for Assigned Names and
    Numbers
  • Manages domain names and addresses
  • Looks to IETF to define protocols such as for
    Internationalized Domain Names

6
Web Publishing and Characters
  • W3C - http://www.w3.org
  • World Wide Web Consortium
  • Text markup languages (HTML, XHTML, XML, MathML),
    Style Sheets (CSS, XSL), web services, etc.
  • Interactions with the Web (text publishing,
    services)
  • Default character encoding for XML is Unicode
    (UTF-16 or UTF-8)
  • Unicode http://www.unicode.org
  • Characters
  • And all those properties
  • ISO - http://www.iso.org
  • International Organization for Standardization
  • Many, many standards, including IT
  • JTC1/SC2/WG2 also encodes characters, ISO/IEC
    10646
  • Complete alignment with Unicode
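
As a rough illustration of the two encoding forms named on this slide, the sketch below encodes one short string as UTF-8 and as UTF-16 and compares the byte counts. The sample text and the helper name byte_lengths are assumptions made for this example; they are not part of the original slides.

```python
# Sample text written with escapes so the code points are unambiguous:
# "caf" + U+00E9 (e with acute) + space + U+4E16 U+754C (two Han characters).
SAMPLE_TEXT = "caf\u00e9 \u4e16\u754c"


def byte_lengths(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string.

    UTF-16 is encoded big-endian without a byte order mark so that only
    the character data is counted.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


if __name__ == "__main__":
    utf8_len, utf16_len = byte_lengths(SAMPLE_TEXT)
    print("UTF-8 bytes: ", utf8_len)
    print("UTF-16 bytes:", utf16_len)
    # Both encodings decode back to the same sequence of characters.
    assert SAMPLE_TEXT.encode("utf-8").decode("utf-8") == SAMPLE_TEXT
    assert SAMPLE_TEXT.encode("utf-16-be").decode("utf-16-be") == SAMPLE_TEXT
```

ASCII letters take one byte in UTF-8 and two in UTF-16, while the Han characters here take three bytes in UTF-8 and two in UTF-16, so which form is more compact depends on the text.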

7
Script Encoding Initiative (SEI)
  • SEI http://linguistics.berkeley.edu/sei/
  • Established in the UC Berkeley Department of
    Linguistics in 2002
  • Project devoted to the preparation of formal
    proposals for encoding scripts and script
    elements not yet supported in Unicode (ISO/IEC
    10646)
  • Focus on less common and historical scripts
  • Proposals for the encoding of minority and
    historical scripts often entail significant
    research
  • User communities are often not well-connected to
    the ICT standardization process
  • Goal is to fund the preparation of script
    proposals that will be approved by UTC and WG2
    without requiring extensive involvement of the
    committees
  • A secondary goal for certain scripts is to
    produce freely-available fonts to help to promote
    widespread adoption and implementation of the
    scripts
  • SEI project leader is an active participant in
    Unicode and WG2

8
What happens when you go to a website
  • Enter a URL: http://www.gkpeventsonthefuture.org/GK3/
  • gkpeventsonthefuture.org is an address managed by
    ICANN; transmission between computers to the
    address is defined by IETF
  • HTTP defines the exchange between browser and
    server (specified by IETF in cooperation with W3C)
  • Web page structure is displayed using HTML,
    defined by W3C
  • Individual text elements are defined by Unicode:
    characters are assigned to code points, hex
    numbers between 0000 and 10FFFF
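
The walk-through on this slide can be sketched in a few lines: split the URL into the pieces that different standards govern, then list the Unicode code points of a piece of text. This is a minimal sketch using only the Python standard library; the function names split_url and code_points are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Upper bound of the Unicode code space, as stated on the slide.
MAX_CODE_POINT = 0x10FFFF


def split_url(url):
    """Split a URL into scheme (protocol), host (domain name), and path."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


def code_points(text):
    """Return each character of text as a 'U+XXXX' code point label."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    print(split_url("http://www.gkpeventsonthefuture.org/GK3/"))
    print(code_points("GK3"))
    # Every character falls inside the range 0000..10FFFF.
    assert all(0 <= ord(ch) <= MAX_CODE_POINT for ch in "GK3")
```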

9
Agenda
  • Key standards organizations
  • Script/language support in Unicode
  • Locales
  • Open source and locales
  • The future

10
Script support in Unicode
  • Unicode 5.0
  • Asian: 34
  • African: 3
  • European: 6
  • Middle East: 4
  • Other: 6
  • Unicode 5.1
  • Asian: 7
  • African: 1

11
Number of Characters
  • Unicode 5.0
  • Alphabetics, Symbols: 16,486
  • Han (CJK): 71,226
  • Hangul: 11,172
  • Total graphic characters: 98,884
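
The figures on this slide can be checked with simple arithmetic. The sketch below sums the three categories and also reproduces the Hangul count from the structure of precomposed Hangul syllables (19 leading consonants, 21 vowels, and 28 trailing options including "none"); that decomposition is background knowledge added here, not something stated on the slide.

```python
# Character counts for Unicode 5.0, copied from the slide.
COUNTS = {
    "Alphabetics, Symbols": 16_486,
    "Han (CJK)": 71_226,
    "Hangul": 11_172,
}


def total_graphic_characters(counts):
    """Sum the per-category character counts."""
    return sum(counts.values())


def hangul_syllable_count(leading=19, vowels=21, trailing=28):
    """Number of precomposed Hangul syllables: every combination of a
    leading consonant, a vowel, and a trailing consonant (or none)."""
    return leading * vowels * trailing


if __name__ == "__main__":
    print(total_graphic_characters(COUNTS))
    print(hangul_syllable_count())
```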

12
Agenda
  • Key standards organizations
  • Script/language support in Unicode
  • Locales
  • Open source and locales
  • The future

13
Locale Identifiers
  • A locale is represented by an identifier,
    typically with two parts
  • Language: 2-letter code from the ISO 639 standard
  • Territory (country): 2-letter territory code
    from the ISO 3166 standard
  • Examples: en_US, ms_MY, so_SO
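
A minimal sketch of how the two-part identifier described above might be split into its language and territory codes. It handles only the simple form shown on this slide; real-world identifiers can carry further parts (such as a script or variant) that this sketch deliberately rejects. The names LocaleId and parse_locale are assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str             # 2-letter ISO 639 code, e.g. "en"
    territory: Optional[str]  # 2-letter ISO 3166 code, e.g. "US", or None


# Two lowercase letters, optionally followed by "_" or "-" and two
# uppercase letters.
_LOCALE_PATTERN = re.compile(r"^([a-z]{2})(?:[_-]([A-Z]{2}))?$")


def parse_locale(identifier):
    """Parse identifiers such as 'en_US' into a LocaleId."""
    match = _LOCALE_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a simple locale identifier: {identifier!r}")
    return LocaleId(match.group(1), match.group(2))


if __name__ == "__main__":
    for example in ("en_US", "ms_MY", "so_SO"):
        print(example, "->", parse_locale(example))
```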

14
Types of Locale Data
  • Dates, times, calendar formats
  • Number, currency formats
  • Measurement system
  • Collation specification
  • Sorting
  • Searching
  • Matching
  • Translated names for language, territory, script,
    time zone, currencies, etc.
  • Script and characters used by a language
  • ICU Locale Explorer
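
To make "number formats" concrete, the sketch below formats one number with two different sets of separators. The small SEPARATORS table is hand-written for illustration and is not read from CLDR or ICU; de_DE is added as a contrasting example and does not appear on the slides. A real application would take this data from a library such as ICU rather than hard-coding it.

```python
# Hypothetical, hand-written locale data: (grouping separator, decimal separator).
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
}


def format_number(value, locale_id):
    """Format value with two decimals using the locale's separators."""
    group, decimal = SEPARATORS[locale_id]
    # Start from the "," grouping and "." decimal form, e.g. "1,234,567.89".
    text = f"{value:,.2f}"
    # Swap the separators via a placeholder so they do not clobber each other.
    return text.replace(",", "\0").replace(".", decimal).replace("\0", group)


if __name__ == "__main__":
    for locale_id in SEPARATORS:
        print(locale_id, format_number(1234567.891, locale_id))
```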

15
Agenda
  • Key standards organizations
  • Script/language support in Unicode
  • Locales
  • Open source and locales
  • The future

16
Open Source and Localization
  • ICU - http://www.icu-project.org/
  • International Components for Unicode
  • ICU is a mature, widely used set of C/C++ and
    Java libraries providing Unicode and
    Globalization support for software applications.
  • Unicode Locales Project CLDR - http://www.unicode.org/cldr/
  • Both received significant IBM support

17
Common Locale Data Repository (CLDR)
  • Began as Common XML Locale Repository (CXLR)
    developed by OpenI18N in 2003
  • CLDR project began in 2004
  • Hosted by Unicode Consortium
  • http://www.unicode.org/cldr/
  • Goals
  • Common, necessary software locale data for all
    world languages
  • Collect and maintain locale data
  • XML format for effective interchange
  • Freely available
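
As a sketch of what "XML format for effective interchange" can look like in practice, the example below parses a small, hand-written fragment in the style of LDML (the CLDR markup language) with the Python standard library. The fragment is heavily simplified and is not copied from a CLDR release; the function name read_locale is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment in the style of an LDML locale file.
SAMPLE_LDML = """<ldml>
  <identity>
    <language type="en"/>
    <territory type="US"/>
  </identity>
  <numbers>
    <symbols>
      <decimal>.</decimal>
      <group>,</group>
    </symbols>
  </numbers>
</ldml>"""


def read_locale(xml_text):
    """Extract the locale identifier and number symbols from the fragment."""
    root = ET.fromstring(xml_text)
    language = root.find("identity/language").get("type")
    territory = root.find("identity/territory").get("type")
    symbols = root.find("numbers/symbols")
    return {
        "id": f"{language}_{territory}",
        "decimal": symbols.findtext("decimal"),
        "group": symbols.findtext("group"),
    }


if __name__ == "__main__":
    print(read_locale(SAMPLE_LDML))
```

Because the data is plain XML, any platform with an XML parser can consume the same files, which is the interchange benefit this slide points to.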

18
CLDR in use (partial list)
  • Companies / Organizations
  • Adobe, Apple (Mac OS X), abas Software,
    Ascential Software, Avaya, BEA, BluePhoenix
    Solutions, BMC Software (Remedy), Business
    Objects, caris, CERN, ClearCommerce, Cognos,
    Debian Linux, D programming language, Gentoo
    Linux, GNU Classpath, Google, HP, Hyperion, IBM,
    Inktomi, Innodata Isogen, Isogon, Informatica,
    Intel, Interlogics, IONA, IXOS, Macromedia,
    Mathworks, OpenOffice, Language Analysis Systems,
    Lawson Software, Leica Geosystems GIS Mapping
    LLC, Mandrake Linux, Novell (SuSE), Optio
    Software, PayPal, Progress Software, Python, QNX,
    Quark, Rogue Wave, SAP, Siebel, SIL, SPSS,
    Software AG, Sun Microsystems (Solaris, Java),
    Sybase, Teradata (NCR), Trados, Trend Micro,
    Virage, webMethods, WMS Gaming, Xerox, Yahoo!,
    and many more

19
Latest Release CLDR 1.5
  • Released July 31, 2007
  • 394 locales
  • 135 languages
  • 149 territories
  • 42% more data
  • 27,000 new or modified data items
  • Over 160 different contributors

20
Agenda
  • Key standards organizations
  • Script/language support in Unicode
  • Locales
  • Open source and locales
  • The future

21
The Future Could Include You!
  • Simplest involvement
  • Use Unicode CLDR locales and ICU
  • Survey tool/bug report/feature request
  • Contact the SEI Project to encode new
    characters/scripts
  • More formal involvement
  • Vetting, assessment, tools, policies, decisions
  • Any Unicode member is eligible to name CLDR
    representatives, including country liaison members
  • Participate in UTC through membership

22
Summary
  • Maintaining and enhancing the localization
    available on the Internet and Web is a challenge
  • Many standards groups with overlapping
    responsibilities are involved
  • Businesses and the open source community are also
    engaged
  • The Unicode Consortium hopes more people will use
    ICU and contribute to the Unicode Locales project

23
For More Information
  • Unicode
  • http://www.unicode.org/
  • CLDR
  • http://www.unicode.org/cldr/
  • LDML specification
  • http://unicode.org/reports/tr35
  • Survey tool
  • http://unicode.org/cldr/apps/survey/
  • ICU Locale Explorer
  • http://demo.icu-project.org/icu-bin/locexp
  • lisam_at_us.ibm.com