Issues in Indic Collation presentation

Title: Issues in Indic Collation

1
Issues in Indic Collation

2
Well, this is an obscure topic

History
Very hot topic among developers of Indic-script
products!
Greater implications for the standard
Need to get the word out
This paper (as well as Michael Kaplan's encoding
talk) was also given at the Tamil Internet conference
in August 2001

3
Assumptions

4
So we're all on the same page

A definition of collation (for today)
The culturally expected ordering of linguistic
characters in a particular language
Often referred to as sorting, ordering,
alphabetizing
Informants recognize correct vs. incorrect
collation for their language, but often have a
hard time explaining the particular collation
rules

5
Concerns

Within the Indic community, a belief that encoding
order = collation order
ISCII (encoding standard of India) also makes an
attempt to use encoding order as a collation
order
Note: placement of codepoints in ISCII differs
between the 1988 and 1991 versions

6
Concerns, part 2

Encoding order of Unicode not appropriate for
collation order, thus Unicode will not work!
Other concerns (invalid codepoints in repertoire,
Devanagari structures applied to other Indic
scripts as in ISCII, lack of transliteration
support)

7
And why should developers care about these
concerns?

8
Encoding ≠ Collation

Many reasons this is not the case, but will focus
on two major reasons, applicable to many scripts
and languages
A single order (as used in an encoding) is not
sufficient for most scripts
Proper sorting must take characters (or sorting
elements) rather than codepoints into account

9
Why is a single encoding order not sufficient?

One script, but often multiple languages and
multiple sort orders (e.g., Latin script includes
German, Danish, Swedish, Spanish, French, Turkish)
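
A minimal sketch of the point above, assuming two toy tailorings that are not real locale data: the names SWEDISH_ORDER, GERMAN_FOLD, swedish_key and german_key are hypothetical and cover only the letters used here. The same four Latin-script words come out in three different orders depending on whether they are sorted by codepoint, by a Swedish-style alphabet (å, ä, ö after z), or by a German-style rule (umlauted vowels sort with their base vowel).

```python
# Toy tailorings for illustration only; a real product would use
# full locale data rather than these hand-written tables.
SWEDISH_ORDER = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"  # ... z, å, ä, ö
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ORDER)}

# German-style assumption: accented vowels sort with the base vowel.
GERMAN_FOLD = {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u", "\u00e5": "a"}


def swedish_key(word):
    """Rank each letter by its position in the Swedish-style alphabet."""
    return [SWEDISH_RANK[ch] for ch in word.lower()]


def german_key(word):
    """Compare on the folded (base-letter) form first, then the original."""
    lowered = word.lower()
    primary = "".join(GERMAN_FOLD.get(ch, ch) for ch in lowered)
    return (primary, lowered)


WORDS = ["zon", "\u00e4ra", "arm", "\u00e5ka"]  # zon, ära, arm, åka

codepoint_order = sorted(WORDS)
swedish_order = sorted(WORDS, key=swedish_key)
german_order = sorted(WORDS, key=german_key)

print(codepoint_order)
print(swedish_order)
print(german_order)
```

No single fixed order of the letters, and so no single encoding order, can produce both the Swedish-style and the German-style result.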

10
Single order of encoding
11
Examples of different sorts within same script

12
How this applies to Indic

Is one order sufficient for all Devanagari script
languages?
Hindi vs. Marathi: different sort of Lla (U+0933)
Hindi: U+0932 < U+0933 < U+0934, that is, ल < ळ < ऴ
Marathi: U+0939 < U+0933 < U+0915 U+094D U+0937
(conjunct), that is, ह < ळ < क्ष
Early indications show that Konkani ordering is
not the same as either Sanskrit or Hindi ordering
(research not yet complete)
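
The Hindi and Marathi orderings above can be sketched as two small per-language tables. This is only an illustration: the tables hold just the three elements the slide names for each language, and the helper names (split_elements, make_key) are hypothetical. The Marathi table treats the conjunct U+0915 U+094D U+0937 as a single sorting element.

```python
LA = "\u0932"    # ल
LLA = "\u0933"   # ळ
LLLA = "\u0934"  # ऴ
HA = "\u0939"    # ह
KSSA = "\u0915\u094d\u0937"  # क्ष, three codepoints, one sorting element

# Only the relative orders stated on the slide; not full alphabets.
HINDI_ELEMENTS = [LA, LLA, LLLA]
MARATHI_ELEMENTS = [HA, LLA, KSSA]


def split_elements(text, elements):
    """Greedy longest-match split of text into known sorting elements."""
    by_length = sorted(elements, key=len, reverse=True)
    result = []
    i = 0
    while i < len(text):
        for element in by_length:
            if text.startswith(element, i):
                result.append(element)
                i += len(element)
                break
        else:
            raise ValueError(f"no sorting element at position {i}")
    return result


def make_key(elements):
    """Build a sort key that ranks elements by their table position."""
    rank = {element: i for i, element in enumerate(elements)}
    return lambda text: [rank[e] for e in split_elements(text, elements)]


sample = [KSSA, LLA, HA]
codepoint_order = sorted(sample)
marathi_order = sorted(sample, key=make_key(MARATHI_ELEMENTS))
hindi_order = sorted([LLLA, LLA, LA], key=make_key(HINDI_ELEMENTS))

print(codepoint_order)
print(marathi_order)
print(hindi_order)
```

For this sample the Marathi result is the exact reverse of codepoint order, because the conjunct begins with U+0915 but sorts after U+0933.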

13
Proper sorting must take sorting elements into
account

The building blocks of collation!
Sorting element
(definition) The discrete elements in a language
that carry a primary weight in sorting
Users consider these to be the "characters" of
their language
Users expect strings to be grouped based on
these elements

14
With a monolingual script, can there be a useful
encoding order?

Not necessarily! Even with a single language per
script, code point order is not always enough.
Sorting elements often include more than one
codepoint

15
Multiple codepoints, single sorting elements

Compressions of base characters (two or three
codepoints = one sorting element)
Base character + modifier mark treated as a sorting
element (two or three codepoints = one sorting
element)
Syllable-like sorting elements and conjuncts in
Indic

16
Examples of characters ≠ codepoints

Traditional Spanish ch, Croatian lj,
Vietnamese ng
Indic conjuncts (Ksha, Shri)
Ä (i.e., base char + combining mark) in many
languages
Indic: consonants + anusvara, nukta
There are often multiple ways to create sorting
elements from codepoints; these must all be taken
into account when designing sorting data!
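
As a rough sketch of both points, the code below treats traditional Spanish "ch" and "ll" as single sorting elements and normalizes input so that two different codepoint spellings of ñ yield the same element. The element table and the names SPANISH_ELEMENTS and spanish_key are assumptions made for illustration; only unicodedata.normalize comes from the Python standard library.

```python
import string
import unicodedata

# Toy traditional-Spanish element table: "ch" after "c", "ll" after "l",
# "ñ" after "n". Lowercase only; an assumption, not real locale data.
SPANISH_ELEMENTS = []
for letter in string.ascii_lowercase:
    SPANISH_ELEMENTS.append(letter)
    if letter == "c":
        SPANISH_ELEMENTS.append("ch")
    elif letter == "l":
        SPANISH_ELEMENTS.append("ll")
    elif letter == "n":
        SPANISH_ELEMENTS.append("\u00f1")  # ñ

RANK = {element: i for i, element in enumerate(SPANISH_ELEMENTS)}
BY_LENGTH = sorted(SPANISH_ELEMENTS, key=len, reverse=True)


def spanish_key(word):
    """Normalize, then split into sorting elements by longest match."""
    text = unicodedata.normalize("NFC", word.lower())
    key = []
    i = 0
    while i < len(text):
        for element in BY_LENGTH:
            if text.startswith(element, i):
                key.append(RANK[element])
                i += len(element)
                break
        else:
            raise ValueError(f"no sorting element at position {i}")
    return key


words = ["dama", "chico", "cuna"]
codepoint_order = sorted(words)
spanish_order = sorted(words, key=spanish_key)

# Two spellings of the same sorting element:
precomposed = "\u00f1u"   # ñ as a single codepoint, then u
decomposed = "n\u0303u"   # n + combining tilde, then u

print(codepoint_order)
print(spanish_order)
print(spanish_key(precomposed) == spanish_key(decomposed))
```

Without the normalization step, the decomposed spelling would have no matching element in this table, which is the kind of case the sorting data has to anticipate.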

17
How this applies to Indic

Hindi: base consonant + candrabindu <
base consonant + anusvara <
base consonant + visarga <
base consonant
e.g., with क as the base consonant: कँ < कं < कः < क
(each of the above lines is a unique sorting
element; the first three are multiple codepoints)
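
A minimal sketch of the Hindi ordering above, assuming each input string is exactly one sorting element (a base consonant optionally followed by one modifier). The names MODIFIER_RANK and element_key are hypothetical, and the base consonant is ranked by codepoint only as a placeholder for a real consonant table.

```python
KA = "\u0915"           # क, used here as the sample base consonant
CANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VISARGA = "\u0903"

# Order taken from the slide: candrabindu < anusvara < visarga < bare.
MODIFIER_RANK = {CANDRABINDU: 0, ANUSVARA: 1, VISARGA: 2}
BARE_CONSONANT_RANK = 3


def element_key(element):
    """Key for one sorting element: (base consonant, modifier rank)."""
    base, modifier = element[0], element[1:]
    # ord(base) is a placeholder; a real table would rank consonants.
    # Anything other than the three known modifiers ranks as "bare".
    return (ord(base), MODIFIER_RANK.get(modifier, BARE_CONSONANT_RANK))


elements = [KA, KA + VISARGA, KA + ANUSVARA, KA + CANDRABINDU]

codepoint_order = sorted(elements)
hindi_order = sorted(elements, key=element_key)

print(codepoint_order)
print(hindi_order)
```

Codepoint order puts the bare consonant first, since it is a prefix of the other three strings, while the ordering on the slide puts it last.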

18
Again, encoding ≠ collation

Unicode is the ideal encoding, but it should not
be overburdened with other global functionality
It is the responsibility of an application to
handle collation properly

19
Unicode is only one part of the globalization
solution

To fully support scripts, software developers
need to consider not just encodings, but also:
Character display and layout
Input methods
Font support and rendering engines
National Language Support

20
Is it actually possible to implement Indic on
Unicode?

Yes!
Commercial solutions currently available,
including Windows 2000/XP (any version)
ISCII only supported in codepage translation
functions (WideCharToMultiByte, etc.)
All other globalization support handled via
Unicode (rendering, fonts, input, NLS)

21
If Unicode works for Indic, how do we get the
word out?

Commercial vendors need to talk about their
solutions (like this talk!)
Consider better clarification in the UCA
technical report re encoding order vs.
collation?
Consider a technical report on this topic,
perhaps expanded to include other Southeast Asian
scripts?

22
Comments? Questions?
