Issues in Indic Collation - PowerPoint PPT Presentation


1
Issues in Indic Collation
  • Cathy Wissink
  • Program Manager, Globalization
  • Windows Division
  • Microsoft

2
Well, this is an obscure topic
  • History
  • Very hot topic among developers of Indic-script
    products!
  • Greater implications for the standard
  • Need to get the word out
  • This paper (as well as Michael Kaplan's encoding
    talk) was also given at the Tamil Internet
    conference, August 2001

3
Assumptions
  • You have a very basic understanding of collation
  • You have a very basic understanding of Indic
    scripts

4
So we're all on the same page
  • A definition of collation (for today)
  • The culturally expected ordering of linguistic
    characters in a particular language
  • Often referred to as sorting, ordering,
    alphabetizing
  • Informants recognize correct vs. incorrect
    collation for their language, but often have a
    hard time explaining the particular collation
    rules

5
Concerns
  • Within the Indic community, belief that encoding
    order = collation order
  • ISCII (encoding standard of India) also makes an
    attempt to use encoding order as a collation
    order
  • Note: placement of codepoints in ISCII differs
    between the 1988 and 1991 versions

6
Concerns, part 2
  • Encoding order of Unicode not appropriate for
    collation order, thus Unicode will not work!
  • Other concerns (invalid codepoints in repertoire,
    Devanagari structures applied to other Indic
    scripts as in ISCII, lack of transliteration
    support)

7
And why should developers care about these
concerns?
  • Indic development community is creating new
    encoding alternatives to Unicode
  • Unnecessary fragmenting reminiscent of the bad
    old days
  • What type of precedent does this set regarding
    new encodings?

8
Encoding ≠ Collation
  • Many reasons this is not the case, but will focus
    on two major reasons, applicable to many scripts
    and languages
  • A single order (as used in an encoding) is not
    sufficient for most scripts
  • Proper sorting must take characters (or sorting
    elements) rather than codepoints into account

9
Why is a single encoding order not sufficient?
  • One script, but often multiple languages and
    multiple sort orders (e.g., Latin script includes
    German, Danish, Swedish, Spanish, French, Turkish)

10
Single order of encoding

11
Examples of different sorts within same script
  • German: a < ä < b, o < ö < p
  • Danish: y < z < ä < ö
  • Swedish: y < z < å < ä < ö
  • Turkish: a < ä < b < o < ö < p
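
The difference between code point order and a language-specific order can be sketched in a few lines of Python. This is an illustration only, not how a production collation engine works: the `tailored` helper and the one-level alphabets are hypothetical, the German alphabet follows the simplified order on this slide (ä as its own letter after a), and the position of å in the German alphabet is an assumption added so that all three sorts can share one word list.

```python
# Hypothetical, heavily simplified alphabets: lowercase only, one comparison level.
BASE = "abcdefghijklmnopqrstuvwxyz"


def tailored(alphabet):
    """Return a sort key function that ranks letters by position in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return lambda word: [rank[ch] for ch in word]


# German, per the slide: ä directly after a, ö directly after o.
# Placing å after ä is an assumption of this sketch, not taken from the slide.
german = tailored(BASE.replace("a", "aäå").replace("o", "oö"))

# Swedish, per the slide: å, ä, ö follow z, in that order.
swedish = tailored(BASE + "åäö")

words = ["zon", "öl", "oval", "åka", "ära", "arm", "bad"]

by_code_point = sorted(words)           # raw encoding order
by_german = sorted(words, key=german)   # ä and ö interleaved with a and o
by_swedish = sorted(words, key=swedish) # å, ä, ö at the end of the alphabet

for label, result in [("code point", by_code_point),
                      ("German", by_german),
                      ("Swedish", by_swedish)]:
    print(f"{label:>10}: {' '.join(result)}")
```

The same seven strings come out in three different orders, and the code point order matches neither language: it puts ä before å, while the Swedish order on the slide requires å before ä.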

12
How this applies to Indic
  • Is one order sufficient for all Devanagari script
    languages?
  • Hindi vs. Marathi: different sort position for
    Lla (U+0933)
  • Hindi: U+0932 < U+0933 < U+0934, that is,
    ल < ळ < ऴ
  • Marathi: U+0939 < U+0933 < U+0915 U+094D U+0937
    (conjunct), that is, ह < ळ < क्ष
  • Early indications show that Konkani ordering is
    not the same as either Sanskrit or Hindi ordering
    (research not yet complete)
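
A minimal Python sketch of the Hindi/Marathi difference, restricted to single letters. The key functions are hypothetical: the Hindi key simply assumes code point order among the consonants (which matches the Hindi order given on this slide), and the Marathi key only moves Lla to just after Ha. The Ksha conjunct is left out here, because it needs multi-codepoint sorting elements.

```python
LA, LLA, HA = "\u0932", "\u0933", "\u0939"  # ल, ळ, ह


def hindi_key(text):
    # Assumption for this sketch: code point order is acceptable among consonants.
    return [ord(ch) for ch in text]


def marathi_key(text):
    # Rank LLA just after HA; the +0.5 slots it between HA and the next code point.
    return [ord(HA) + 0.5 if ch == LLA else ord(ch) for ch in text]


letters = [HA, LLA, LA]

hindi_order = sorted(letters, key=hindi_key)      # LA, LLA, HA
marathi_order = sorted(letters, key=marathi_key)  # LA, HA, LLA

print("Hindi:  ", " < ".join(hindi_order))
print("Marathi:", " < ".join(marathi_order))
```

One encoding order can satisfy at most one of these two languages, which is why the ordering has to live in per-language collation data rather than in the code chart.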

13
Proper sorting must take sorting elements into
account
  • The building blocks of collation!
  • Sorting element
  • (definition) The discrete elements in a language
    that carry a primary weight in sorting
  • Users consider these characters in their
    language
  • Users expect groupings of these strings to be
    collected based on these elements

14
With a monolingual script, can there be a useful
encoding order?
  • Not necessarily! Even with a single language per
    script, code point order is not always enough.
    Sorting elements often include more than one
    codepoint

15
Multiple codepoints, single sorting elements
  • Compressions of base characters (two or three
    codepoints = one sorting element)
  • Base character + modifier mark treated as sorting
    element (two or three codepoints = one sorting
    element)
  • Syllable-like sorting elements and conjuncts in
    Indic
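
A compression can be sketched as a longest-match split of the string into sorting elements before any weights are compared. The example below uses the traditional Spanish alphabet, in which ch and ll are letters in their own right; the names `ALPHABET`, `sorting_elements` and `spanish_key` are hypothetical, and the sketch ignores case, accents and everything beyond a single comparison level.

```python
# Traditional Spanish alphabet as a list of sorting elements (lowercase only).
ALPHABET = "a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z".split()
RANK = {element: i for i, element in enumerate(ALPHABET)}
LONGEST = max(len(element) for element in ALPHABET)


def sorting_elements(word):
    """Split a word into sorting elements, always preferring the longest match."""
    elements, i = [], 0
    while i < len(word):
        for size in range(LONGEST, 0, -1):
            chunk = word[i:i + size]
            if chunk in RANK:
                elements.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError(f"no sorting element for {word[i]!r}")
    return elements


def spanish_key(word):
    """Sort key: one weight per sorting element, not per codepoint."""
    return [RANK[element] for element in sorting_elements(word)]


words = ["cuna", "chico", "dama", "cama"]
by_code_point = sorted(words)
by_elements = sorted(words, key=spanish_key)

print(sorting_elements("chico"))  # ['ch', 'i', 'c', 'o']
print(by_code_point)              # ['cama', 'chico', 'cuna', 'dama']
print(by_elements)                # ['cama', 'cuna', 'chico', 'dama']
```

Because "ch" carries a single weight that falls after "c", "chico" moves behind "cuna", something no per-codepoint comparison can produce.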

16
Examples of characters ≠ codepoints
  • Traditional Spanish ch, Croatian lj,
    Vietnamese ng
  • Indic conjuncts (Ksha, Shri)
  • Ä (i.e., base char + combining mark) in many
    languages
  • Indic: consonants + anusvara, nukta
  • There are often multiple ways to create sorting
    elements from codepoints; these must all be taken
    into account when designing sorting data!
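
One way to handle the "multiple ways" problem is to bring every string to a single canonical form before looking up sorting elements. The sketch below uses Unicode normalization from Python's standard `unicodedata` module; choosing NFD as the form, and the helper name `canonical`, are assumptions of this example rather than anything prescribed by the talk.

```python
import unicodedata


def canonical(text):
    """Decompose to NFD so equivalent codepoint sequences compare equal."""
    return unicodedata.normalize("NFD", text)


# Latin: precomposed Ä versus A followed by COMBINING DIAERESIS.
a_precomposed = "\u00C4"
a_decomposed = "A\u0308"

# Devanagari: the single codepoint QA versus KA followed by NUKTA.
qa_single = "\u0958"
qa_pair = "\u0915\u093C"

for one, other in [(a_precomposed, a_decomposed), (qa_single, qa_pair)]:
    print([f"U+{ord(c):04X}" for c in one],
          [f"U+{ord(c):04X}" for c in other],
          "raw equal:", one == other,
          "canonical equal:", canonical(one) == canonical(other))
```

Both pairs differ as raw codepoint sequences but are identical after normalization, so sorting data written against the normalized form covers both spellings.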

17
How this applies to Indic
  • Hindi: base consonant + candrabindu <
  • base consonant + anusvara <
  • base consonant + visarga <
  • base consonant
  • i.e. C U+0901 < C U+0902 < C U+0903 < C, for any
    base consonant C
  • (each of the above lines is a unique sorting
    element; the first three are multiple codepoints)
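
The Hindi ordering above can be sketched as a sort key over (base, mark) pairs. Everything here is an illustrative assumption: the helper names are hypothetical, the base consonants are simply ranked by code point, क (U+0915) is picked as an arbitrary example consonant, and vowel signs, nukta and conjuncts are ignored.

```python
CANDRABINDU, ANUSVARA, VISARGA = "\u0901", "\u0902", "\u0903"

# Rank of the mark within one sorting element; a bare consonant sorts last.
MARK_RANK = {CANDRABINDU: 0, ANUSVARA: 1, VISARGA: 2}
NO_MARK = 3


def sorting_elements(text):
    """Pair each character with a following candrabindu, anusvara or visarga."""
    elements, i = [], 0
    while i < len(text):
        base = text[i]
        has_mark = i + 1 < len(text) and text[i + 1] in MARK_RANK
        elements.append((base, text[i + 1] if has_mark else None))
        i += 2 if has_mark else 1
    return elements


def hindi_key(text):
    """One weight pair per sorting element: (base rank, mark rank)."""
    return [(ord(base), MARK_RANK.get(mark, NO_MARK))
            for base, mark in sorting_elements(text)]


KA = "\u0915"  # example base consonant, chosen for illustration only
items = [KA, KA + VISARGA, KA + CANDRABINDU, KA + ANUSVARA]

by_code_point = sorted(items)
by_elements = sorted(items, key=hindi_key)

print([[f"U+{ord(c):04X}" for c in s] for s in by_code_point])
print([[f"U+{ord(c):04X}" for c in s] for s in by_elements])
```

In code point order the bare consonant comes first, because it is a prefix of the other three strings; with sorting elements it comes last, as the slide requires.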

18
Again, encoding ≠ collation
  • Unicode is the ideal encoding, but it should not
    be overburdened with other global functionality
  • It is the responsibility of an application to
    handle collation properly

19
Unicode only one part of the globalization
solution
  • To fully support scripts, software developers
    need to consider not just encodings, but also:
  • Character display and layout
  • Input methods
  • Font support and rendering engines
  • National Language Support

20
Is it actually possible to implement Indic on
Unicode?
  • Yes!
  • Commercial solutions currently available,
    including Windows 2000/XP (any version)
  • ISCII only supported in codepage translation
    functions (WideCharToMultiByte, etc.)
  • All other globalization support handled via
    Unicode (rendering, fonts, input, NLS)

21
If Unicode works for Indic, how do we get the
word out?
  • Commercial vendors need to talk about their
    solutions (like this talk!)
  • Consider better clarification in the UCA
    technical report re encoding order vs.
    collation?
  • Consider a technical report on this topic,
    perhaps expanded to include other Southeast Asian
    scripts?

22
Comments? Questions?