Title: Issues in Indic Collation
1Issues in Indic Collation
- Cathy Wissink
- Program Manager, Globalization
- Windows Division
- Microsoft
2Well, this is an obscure topic
- History
- Very hot topic among developers of Indic-script
products!
- Greater implications for the standard
- Need to get the word out
- This paper (as well as Michael Kaplan's encoding
talk) was also given at the Tamil Internet conference,
August 2001
3Assumptions
- You have a very basic understanding of collation
- You have a very basic understanding of Indic
scripts
4So we're all on the same page
- A definition of collation (for today)
- The culturally expected ordering of linguistic
characters in a particular language
- Often referred to as sorting, ordering,
alphabetizing
- Informants recognize correct vs. incorrect
collation for their language, but often have a
hard time explaining the particular collation
rules
5Concerns
- Within the Indic community, belief that encoding
order = collation order
- ISCII (encoding standard of India) also makes an
attempt to use encoding order as a collation
order
- Note: placement of codepoints in ISCII is
different between the 1988 and 1991 versions
6Concerns, part 2
- Encoding order of Unicode not appropriate for
collation order, thus Unicode will not work!
- Other concerns (invalid codepoints in repertoire,
Devanagari structures applied to other Indic
scripts as in ISCII, lack of transliteration
support)
7And why should developers care about these
concerns?
- Indic development community is creating new
encoding alternatives to Unicode
- Unnecessary fragmenting reminiscent of the bad
old days
- What type of precedent does this set regarding
new encodings?
8Encoding ≠ Collation
- Many reasons this is not the case, but will focus
on two major reasons, applicable to many scripts
and languages
- A single order (as used in an encoding) is not
sufficient for most scripts
- Proper sorting must take characters (or sorting
elements) rather than codepoints into account
9Why is a single encoding order not sufficient?
- One script, but often multiple languages and
multiple sort orders (e.g., Latin script includes
German, Danish, Swedish, Spanish, French, Turkish)
10Single order of encoding
11Examples of different sorts within same script
- German: a < ä < b, o < ö < p
- Danish: y < z < ä < ö
- Swedish: y < z < å < ä < ö
- Turkish: a < ä < b < o < ö < p
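
To make the point concrete, here is a minimal sketch of per-language ordering within one script. The two alphabets are simplified stand-ins built from the German and Swedish orderings above (the names `ALPHABETS` and `sort_key` are illustrative, not from any library), and real collation data is far richer than a single string per language.

```python
# Simplified, hypothetical alphabets: one ordering string per language.
# German places ä right after a and ö right after o;
# Swedish places å, ä, ö after z.
ALPHABETS = {
    "german": "aäbcdefghijklmnoöpqrstuüvwxyz",
    "swedish": "abcdefghijklmnopqrstuvwxyzåäö",
}


def sort_key(word, language):
    """Map each letter to its position in the chosen language's alphabet."""
    alphabet = ALPHABETS[language]
    return [alphabet.index(ch) for ch in word.lower()]


words = ["zon", "ära", "arm", "öl", "ost"]

by_codepoint = sorted(words)  # plain code point comparison
german = sorted(words, key=lambda w: sort_key(w, "german"))
swedish = sorted(words, key=lambda w: sort_key(w, "swedish"))

print(by_codepoint)
print(german)
print(swedish)
```

The same five strings come out in two different orders, so no single encoding order can serve both languages. For this particular sample, code point order happens to agree with the Swedish result but not with the German one.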
12How this applies to Indic
- Is one order sufficient for all Devanagari script
languages?
- Hindi vs. Marathi: different sort of Lla (U+0933)
- Hindi: U+0932 < U+0933 < U+0934, that is, ल < ळ < ऴ
- Marathi: U+0939 < U+0933 < U+0915 U+094D U+0937 (conjunct),
that is, ह < ळ < क्ष
- Early indications show that Konkani ordering is
not the same as either Sanskrit or Hindi ordering
(research not yet complete)
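
The Hindi and Marathi placements of Lla can be sketched as two small ordering tables. These are fragments only: the relative positions of U+0932, U+0933, U+0934 (Hindi) and U+0939, U+0933, Ksha (Marathi) come from the orderings above, while the position of Ha in the Hindi fragment and of La in the Marathi fragment are assumptions added so that the same sample can be sorted both ways.

```python
LA, LLA, LLLA, HA = "\u0932", "\u0933", "\u0934", "\u0939"
KSSA = "\u0915\u094d\u0937"  # conjunct Ksha: KA + VIRAMA + SSA (three codepoints)

# Fragments of two orderings for the same script.
# Assumption: HA follows LLLA in Hindi; LA precedes HA in Marathi.
HINDI_FRAGMENT = [LA, LLA, LLLA, HA]
MARATHI_FRAGMENT = [LA, HA, LLA, KSSA]


def sort_elements(elements, order):
    """Sort whole sorting elements by their position in an ordering table."""
    return sorted(elements, key=order.index)


sample = [HA, LLA, LA]
hindi_sorted = sort_elements(sample, HINDI_FRAGMENT)
marathi_sorted = sort_elements(sample, MARATHI_FRAGMENT)

print(hindi_sorted)    # Lla sits between La and Ha
print(marathi_sorted)  # Lla comes after Ha
```

Note that the Marathi table holds a three-codepoint conjunct as a single entry, which a codepoint-by-codepoint comparison could not express.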
13Proper sorting must take sorting elements into
account
- The building blocks of collation!
- Sorting element
- (definition) The discrete elements in a language
that carry a primary weight in sorting
- Users consider these to be the characters of their
language
- Users expect strings to be grouped based on
these elements
14With a monolingual script, can there be a useful
encoding order?
- Not necessarily! Even with a single language per
script, code point order is not always enough.
Sorting elements often include more than one
codepoint
15Multiple codepoints, single sorting elements
- Compressions of base characters (two or three
codepoints = one sorting element)
- Base character + modifier mark treated as sorting
element (two or three codepoints = one sorting
element)
- Syllable-like sorting elements and conjuncts in
Indic
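
A rough sketch of how several codepoints can be folded into one sorting element: split each string by greedy longest match against a table of elements, then compare element ranks rather than codepoints. The table below is an illustrative traditional-Spanish alphabet in which "ch" and "ll" count as single letters; the names `ELEMENTS`, `sorting_elements`, and `sort_key` are hypothetical.

```python
# Illustrative table: "ch" and "ll" are single sorting elements.
ELEMENTS = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {element: i for i, element in enumerate(ELEMENTS)}
MAX_LEN = max(len(element) for element in ELEMENTS)


def sorting_elements(word):
    """Split a word into sorting elements, preferring the longest match."""
    result = []
    i = 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RANK:
                result.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError(f"no sorting element for {word[i]!r}")
    return result


def sort_key(word):
    return [RANK[element] for element in sorting_elements(word.lower())]


words = ["cuna", "chico", "dama", "cama"]
by_codepoint = sorted(words)
by_element = sorted(words, key=sort_key)
print(sorting_elements("chico"))
print(by_codepoint)
print(by_element)
```

With elements as the unit of comparison, "chico" sorts after every word beginning with a plain "c", which code point order cannot produce.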
16Examples of characters ≠ codepoints
- Traditional Spanish ch, Croatian lj,
Vietnamese ng
- Indic conjuncts (Ksha, Shri)
- Ä (e.g., base char + combining mark) in many
languages
- Indic: consonants + anusvara, nukta
- There are often multiple ways to create sorting
elements from codepoints; these must all be taken
into account when designing sorting data!
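
One way to see the "multiple ways" problem is with Unicode normalization. The sketch below uses Python's standard `unicodedata.normalize` to show that a precomposed character and its base + combining mark sequence are different codepoint strings that reduce to the same canonical form; the helper name `canonical` is illustrative. Canonical equivalence is only one source of alternate spellings, so this is a partial picture.

```python
import unicodedata


def canonical(text):
    """Reduce text to canonical decomposed form (NFD) before comparison."""
    return unicodedata.normalize("NFD", text)


# Latin: Ä as one codepoint vs. A + COMBINING DIAERESIS
a_precomposed = "\u00c4"
a_combining = "A\u0308"

# Devanagari: QA as one codepoint vs. KA + NUKTA
qa_precomposed = "\u0958"
qa_combining = "\u0915\u093c"

for one, two in [(a_precomposed, a_combining), (qa_precomposed, qa_combining)]:
    print(one == two, canonical(one) == canonical(two))
```

Sorting data that only lists one of the two spellings would treat the other as a different element, so both have to map to the same weight.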
17How this applies to Indic
- Hindi: base consonant + candrabindu <
- base consonant + anusvara <
- base consonant + visarga <
- base consonant
- i.e. ?? < ?? < ?? < ?
- (each of the above lines is a unique sorting
element; the first three are multiple codepoints)
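
The ordering above can be sketched with a small rank table for the modifier that follows the base consonant. Ka (U+0915) is used here purely as an assumed example consonant, and `MODIFIER_RANK` is an illustrative name; a real implementation would also rank the base consonants themselves.

```python
KA = "\u0915"  # assumed example base consonant
CANDRABINDU, ANUSVARA, VISARGA = "\u0901", "\u0902", "\u0903"

# Base + candrabindu < base + anusvara < base + visarga < bare base.
MODIFIER_RANK = {CANDRABINDU: 0, ANUSVARA: 1, VISARGA: 2, "": 3}


def element_key(element):
    """Rank a 'base consonant (+ modifier)' element by its modifier."""
    modifier = element[len(KA):]
    return MODIFIER_RANK[modifier]


elements = [KA, KA + VISARGA, KA + CANDRABINDU, KA + ANUSVARA]

by_codepoint = sorted(elements)
by_element = sorted(elements, key=element_key)
print(by_codepoint)
print(by_element)
```

Code point comparison puts the bare consonant first because it is a prefix of the other three strings, whereas the ordering described above puts it last.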
18Again, encoding ≠ collation
- Unicode is the ideal encoding, but it should not
be overburdened with other global functionality
- It is the responsibility of an application to
handle collation properly
19Unicode only one part of the globalization
solution
- To fully support scripts, software developers
need to consider not just encodings, but also:
- Character display and layout
- Input methods
- Font support and rendering engines
- National Language Support
20Is it actually possible to implement Indic on
Unicode?
- Yes!
- Commercial solutions currently available,
including Windows 2000/XP (any version)
- ISCII only supported in codepage translation
functions (WideCharToMultiByte, etc.)
- All other globalization support handled via
Unicode (rendering, fonts, input, NLS)
21If Unicode works for Indic, how do we get the
word out?
- Commercial vendors need to talk about their
solutions (like this talk!)
- Consider better clarification in the UCA
technical report regarding encoding order vs.
collation?
- Consider a technical report on this topic,
perhaps expanded to include other Southeast Asian
scripts?
22Comments? Questions?