Language Tags and Locale Identifiers

1
Language Tags and Locale Identifiers
  • A Status Report

2
Presenter and Agenda
  • Addison Phillips
  • Internationalization Architect, Yahoo!
  • Co-Editor, Language Tag Registry Update (LTRU)
    Working Group (RFC 3066bis, draft-matching)
  • Language tags
  • Locale identifiers

3
Languages? Locales?
  • What's a language tag?
  • What the @ is a locale?
  • Why do identifiers matter?

4
Language Tags
  • Enable presentation, selection, and negotiation
    of content
  • Defined by BCP 47
  • Widely used! XML, HTML, RSS, MIME, SOAP, SMTP,
    LDAP, CSS, XSL, CCXML, Java, C#, ASP, Perl.
  • Well understood (?)

5
Locale Identifiers
  • Different ideas
  • Accept-Locale vs. Accept-Language
  • URIs/URNs, etc.
  • CLDR/LDML
  • And Requirements
  • Operating environments and harmonization
  • App Servers
  • Web Services
  • New Solution? Cost of Adoption
  • UTF-8 to the browser: 8 long years

6
In the Beginning
  • Received Wisdom from the Dark Ages
  • Locales
  • japanese, french, german, C
  • ENU, FRA, JPN
  • ja_JP.PCK
  • AMERICAN_AMERICA.WE8ISO8859P1
  • Languages
  • looked a lot like locales (and vice versa)

7
Locales and Language Tags meet
  • Conversations in Prague
  • Language tags are being used as locale identifiers
    anyway
  • Not going to need a big new thing
  • Just a few things to fix
  • we can do this really fast

8
BCP 47 Basic Structure
  • Alphanumeric (ASCII only) subtags
  • Up to eight characters long
  • Separated by hyphens
  • Case not important (i.e. zh = ZH = zH = Zh)

1*8alphanum - 1*8alphanum
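
The basic structure above can be sketched as a simple shape check. This is a minimal illustration, not a validator: it follows the slide's simplification (every subtag is 1 to 8 ASCII alphanumerics), whereas RFC 3066 itself restricts the first subtag to letters. The function names are made up for this example.

```python
import re

# Shape from the slide: 1 to 8 ASCII alphanumerics per subtag, hyphen-separated.
_TAG_SHAPE = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def looks_like_language_tag(tag: str) -> bool:
    """Return True if the string has the basic subtag shape (no registry lookup)."""
    return _TAG_SHAPE.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags ignoring case, since case carries no meaning."""
    return a.lower() == b.lower()


print(looks_like_language_tag("zh-TW"))   # True
print(looks_like_language_tag("en_US"))   # False: underscore is not a separator
print(same_tag("zh-TW", "ZH-tw"))         # True
```
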
9
RFC 1766
  • zh-TW
  • zh: ISO 639-1 (alpha2)
  • TW: ISO 3166 (alpha2)
  • i-klingon: registered value

10
RFC 3066
  • sco-GB
  • sco: ISO 639-2 (alpha 3 codes)
  • But use alpha 2 codes when they exist (eng-GB is wrong)

11
Problems
  • Script Variation
  • zh-Hant/zh-Hans
  • (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.)
  • Obsolescence of registrations
  • art-lojban (now jbo), i-klingon (now tlh)
  • Instability in underlying standards
  • sr-CS (CS used to be Czechoslovakia)

12
And More Problems
  • Lack of scripts
  • Little support for registered values in software
  • Reassignment of values by ISO 3166
  • Lack of consistent tag formation (Chinese
    dialects?)
  • Standards not readily available, bad references
  • Bad implementation assumptions
  • 1*8alphanum - 1*8alphanum
  • 2*3ALPHA - 2ALPHA
  • Many registrations to cover small variations
  • 8 German registrations to cover two variations

13
LTRU and draft-registry
  • Defines a generative syntax
  • machine readable
  • future proof, extensible
  • Defines a single source
  • Stable subtags, no conflicts
  • Machine readable
  • Defines when to use subtags
  • (sometimes)

14
RFC 3066bis and LTRU
  • sl-Latn-IT-rozaj-x-mine
  • sl: ISO 639-1/2 (alpha2/3)
  • Latn: ISO 15924 script codes (alpha 4)
  • IT: ISO 3166 (alpha2) or UN M49
  • rozaj: Registered variants (any number)
  • x-mine: Private Use and Extension

15
More Examples
  • es-419 (Spanish for Americas)
  • en-US (English for USA)
  • de-CH-1996 (Old tags are all valid)
  • sl-rozaj-nedis (Multiple variants)
  • zh-t-wadegile (Extensions)

16
Benefits
  • Subtag registry in one place: one source.
  • Subtags identified by length/content
  • Extensible
  • Compatible with RFC 3066 tags
  • Stable: subtags are forever
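
To illustrate "subtags identified by length/content", here is a rough Python sketch that splits a tag into its parts using only each subtag's length and character class. It is a simplification under stated assumptions: `ParsedTag` and `parse_tag` are names invented for this example, grandfathered tags (such as i-klingon), private-use-only tags, and extended language subtags are not handled, and nothing is checked against the registry (well-formed, not validating).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParsedTag:
    language: str
    script: Optional[str] = None
    region: Optional[str] = None
    variants: List[str] = field(default_factory=list)
    extensions: Dict[str, List[str]] = field(default_factory=dict)
    private_use: List[str] = field(default_factory=list)


def _is_variant(subtag: str) -> bool:
    # Variants: 5 to 8 alphanumerics, or 4 characters starting with a digit.
    if not subtag.isalnum():
        return False
    return 5 <= len(subtag) <= 8 or (len(subtag) == 4 and subtag[0].isdigit())


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag by position, length, and content; raise ValueError otherwise."""
    if not tag.isascii():
        raise ValueError("subtags are ASCII only")
    subtags = tag.split("-")
    lang = subtags[0]
    if not (lang.isalpha() and 2 <= len(lang) <= 8):
        raise ValueError(f"unsupported primary subtag: {lang!r}")
    parsed = ParsedTag(language=lang.lower())
    i = 1
    # Script: exactly four letters.
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        parsed.script = subtags[i].title()
        i += 1
    # Region: two letters (ISO 3166) or three digits (UN M.49).
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        parsed.region = subtags[i].upper()
        i += 1
    # Variants: any number of them.
    while i < len(subtags) and _is_variant(subtags[i]):
        parsed.variants.append(subtags[i].lower())
        i += 1
    # Extensions: a one-character singleton (not "x") plus 2 to 8 character subtags.
    while i < len(subtags) and len(subtags[i]) == 1 and subtags[i].lower() != "x":
        singleton = subtags[i].lower()
        i += 1
        values: List[str] = []
        while i < len(subtags) and 2 <= len(subtags[i]) <= 8 and subtags[i].isalnum():
            values.append(subtags[i].lower())
            i += 1
        if not values:
            raise ValueError(f"extension {singleton!r} has no subtags")
        parsed.extensions[singleton] = values
    # Private use: "x" followed by everything that remains.
    if i < len(subtags) and subtags[i].lower() == "x":
        rest = subtags[i + 1:]
        if not rest or not all(s.isalnum() and len(s) <= 8 for s in rest):
            raise ValueError("malformed private-use sequence")
        parsed.private_use = [s.lower() for s in rest]
        i = len(subtags)
    if i != len(subtags):
        raise ValueError(f"unexpected subtag: {subtags[i]!r}")
    return parsed
```

For example, `parse_tag("sl-Latn-IT-rozaj-x-mine")` yields language `sl`, script `Latn`, region `IT`, one variant `rozaj`, and private use `mine`, while `parse_tag("de-CH-1996")` treats `1996` as a variant because it is four characters starting with a digit.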

17
Problems
  • Matching
  • Does en-US match en-Latn-US?
  • Tag Choices
  • Users have more to choose from.
  • Implementations
  • More to do, more to think about
  • (easier to parse, process, support the good stuff)

18
Tag Matching
  • Uses Language Ranges in a Language Priority
    List to select sets of content according to the
    language tag
  • Four Schemes
  • Basic Filtering
  • Extended Filtering
  • Scored Filtering
  • Lookup

19
Filtering
  • Ranges specify the least specific item
  • en matches en, en-US, en-Brai, en-boont
  • Basic matching uses plain prefixes
  • Extended matching can match inside bits
  • en-*-US
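
The two filtering schemes can be sketched roughly as follows. This is an illustration, assuming the algorithms as they were later published in RFC 4647 (the draft discussed in this talk may differ in detail); the function names are invented here. A range such as en-*-US uses "*" as a wildcard subtag.

```python
def basic_filter_match(lang_range: str, tag: str) -> bool:
    """Basic filtering: the range must equal the tag or be a hyphen-bounded prefix."""
    r, t = lang_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_filter_match(lang_range: str, tag: str) -> bool:
    """Extended filtering: range subtags may match 'inside' the tag."""
    r = lang_range.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must agree unless the range starts with a wildcard.
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1              # wildcard: nothing to consume in the tag
        elif ti >= len(t):
            return False         # range wants more than the tag has
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False         # do not skip past a singleton such as "x"
        else:
            ti += 1              # skip an unmatched tag subtag
    return True


print(basic_filter_match("en", "en-boont"))          # True
print(basic_filter_match("en-US", "en-Latn-US"))     # False
print(extended_filter_match("en-*-US", "en-Latn-US"))  # True
```

Basic filtering is a plain prefix test, which is why en-US fails to select en-Latn-US; extended filtering skips the intervening script subtag.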

20
Scored Filtering
  • Assigns a weight or score to each match
  • Result set is ordered by match quality
  • Postulated by John Cowan

21
Lookup
  • Range specifies the most specific tag in a match.
  • en-US matches en and en-US but not
    en-US-boont
  • Mirrors the locale fallback mechanism and many
    language negotiation schemes.
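
Lookup can be sketched as progressive truncation of each range until an exact match is found. This is a minimal illustration assuming the behaviour later published in RFC 4647; `lookup` and its parameters are names chosen for this example.

```python
from typing import Iterable, Optional


def lookup(priority_list: Iterable[str],
           available_tags: Iterable[str],
           default: Optional[str] = None) -> Optional[str]:
    """Return the single best available tag, falling back range by range."""
    available = {t.lower(): t for t in available_tags}
    for lang_range in priority_list:
        subtags = lang_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available:
                return available[candidate]
            # Truncate from the end; a trailing singleton (such as "x")
            # is removed together with the subtag that followed it.
            subtags.pop()
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


print(lookup(["en-US"], ["en-US-boont", "en"]))   # en
print(lookup(["en-US"], ["en-US-boont"]))         # None
```

Unlike filtering, the range is the most specific thing that can match, so en-US-boont is never selected for the range en-US.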

22
What Do I Do (Content Author)?
  • Not much.
  • Existing tags are all still valid; tagging is
    mostly unchanged.
  • Resist temptation to (ab)use the private use
    subtags.
  • Unless your language has script variations
  • Tag content with the appropriate script subtag(s)
  • Script subtags only apply to a small number of
    languages: zh, sr, uz, az, mn, and a
    very small number of others.

23
What Do I Do (Programmer)?
  • Check code for compliance with 3066bis
  • Decide on well-formed or validating
  • Implement suppress-script
  • Change to using the registry
  • Bother infrastructure folks (Java, MS, Mozilla,
    etc.) to implement the standard
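
As one possible reading of "Implement suppress-script": the registry records, for some languages, a script that should normally be omitted from tags, so a tag such as en-Latn-US can be reduced to en-US before matching. The sketch below assumes a tiny hand-copied excerpt of Suppress-Script values; a real implementation would load them from the registry itself, and the function name is invented here.

```python
# Assumption: a small excerpt of Suppress-Script fields, hard-coded for illustration.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn", "de": "Latn", "ru": "Cyrl"}


def drop_suppressed_script(tag: str) -> str:
    """Remove the script subtag when it is the language's suppressed script."""
    subtags = tag.split("-")
    # In this simplified view the script, if present, is the second subtag
    # and is exactly four letters long.
    if len(subtags) >= 2 and len(subtags[1]) == 4 and subtags[1].isalpha():
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if suppressed is not None and suppressed.lower() == subtags[1].lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("en-Latn-US"))  # en-US
print(drop_suppressed_script("zh-Hant-TW"))  # zh-Hant-TW
```

Languages with real script variation (zh, sr, and so on) have no suppressed script in this table, so their script subtags are left alone.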

24
What Do I Do (End-User)?
  • Check and update your language ranges.
  • Tag content wisely.

25
LTRU Milestone Dates
  • (Done) RFC 3066bis
  • Registry went live in December 2005
  • Produce Matching RFC
  • Draft-11 available (WG Last Call started Monday)
  • (Anticipated) Produce RFC 3066ter
  • This includes ISO 639-3 support, extended
    language subtags, and possibly ISO 639-6

26
Things to Read
  • Registry Draft
  • http://www.inter-locale.com
  • http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-12.txt
  • Matching Draft
  • http://www.inter-locale.com
  • LTRU Mailing List
  • https://www1.ietf.org/mailman/listinfo/ltru

27
Things to Do (languages)
  • Get involved in LTRU
  • Get involved in W3C I18N Core WG!
  • Write implementations
  • Work on adoption of 3066bis: understand the
    impact
  • Then get involved with Locale identifiers

28
Back to Locales
  • IUC 20 Round Table
  • Suzanne Topping's Multilingual article
  • Tex Texin and the Locales list

29
Locale Identifiers and Web Services
30
W3C and Unicode
  • W3C
  • Identifiers and cross-over with language tags
  • Web services
  • XML, HTML
  • Unicode Consortium
  • LDML
  • CLDR
  • Standards for content

31
Language Tags and Locale Identifiers SPEC
  • First Working Draft coming soon
  • URIs?
  • Simple tags?

32
WS-I18N SPEC
  • First Working Draft now available
  • http://www.w3.org/TR/ws-i18n

33
Ideas?