Title: EACL 2003, Budapest: April 12-17, 2003
EACL 2003, Budapest, April 12-17, 2003
Computational Linguistics for South Asian Languages: Expanding Synergies with Europe
CORPORA IN MINOR LANGUAGES OF INDIA: SOME ISSUES
Dr. B. Mallikarjun
Central Institute of Indian Languages, Mysore 570 006, INDIA
mallikarjun_at_ciil.stpmy.soft.net
www.ciil.org/faculty/mallikarjun.html
www.ciilcorpora.net
Overview of the Presentation
1. Current status of corpora: major Indian languages
2. Current status of corpora: minor Indian languages
3. Importance of minor languages corpora
4. Objectives
5. Categorization of minor languages for corpora building
6. Minor languages: A sample
7. Issues in corpora building
8. Corpus processing tools
   a. Basic
   b. Advanced
9. Conclusion and a mission
EACL 2003, CLSAL, Budapest, April 12-17, 2003
Major Indian Languages
India has 1652 mother tongues belonging to 4 language families. The Constitution of India, in its 8th Schedule, has recognized 18 languages, spoken by 96.29% of the population.
Current status of their corpora
Assamese     2,622,836
Bengali      3,535,863
Gujarati     (no figure given)
Hindi        3,003,004
Kannada      2,239,537
Kashmiri     2,266,588
Konkani      (no figure given)
Malayalam    2,349,526
Manipuri     (no figure given)
Marathi      2,213,241
Nepali       (no figure given)
Oriya        2,727,670
Punjabi      1,966,260
Sanskrit     (no figure given)
Sindhi       (no figure given)
Tamil        3,381,525
Telugu       3,967,926
Urdu         1,64,125
- Different quantum. Comparable quality.
- Quantum and coverage are inadequate for wider NLP activities.
- The corpora need to be augmented with wider coverage.
- Attempts at enhancement have some problems that need immediate solution.
Minor Indian languages and status of corpora
The remaining 1634 mother tongues are minor languages, spoken by 3.71% of the population. The Indo-Aryan and Dravidian language families have both major and minor languages. Almost all the languages of the other two families, Munda and Tibeto-Burman, are minor languages. Text corpora building has not taken place in these languages.
Importance of minor languages corpora
- Minor languages hardly attract the attention of policy makers anywhere in the world.
- These languages are endangered in Indian social, educational and linguistic contexts.
- Linguists show great interest in studying the richness of languages and try to save endangered languages from extinction.
- Minor languages seldom attract technological research or become a source for it.
- Technology has made it possible to empower all languages, whether they are major or minor.
- Creating corpora in minor languages, especially those that have little or no written literature, has certain critical advantages for linguistic computing.
- Experimentation with corpus designs and standards is more easily done in these languages because of the manageable quantum of data.
Objectives
- Archival and cross-linguistic comparison within a language family and across language families.
- Utilize language technology for the preservation and continued use of these languages.
- Fine-tune language analysis where grammatical analysis is available.
- Use the machine-readable form of the texts to produce as precise an analysis of the language as possible wherever such analysis is not available.
- Use some of the minor language corpora for machine translation purposes.
- Speech corpora have greater significance in minor languages, since most of them exist in spoken form and many are yet to be rendered into written form.
- Indigenous knowledge systems: Most of the minor languages are resources of cultural heritage and a treasure house of indigenous knowledge systems. Once this knowledge is available in machine-readable form, it can be made available to the universal knowledge base by using UNL.
Categorization of minor languages
Minor languages can be classified into 3 groups on the basis of the issues to be tackled while building corpora.
- First category: Languages other than the 18 major languages that have a good amount of literary and other texts and are also used in wider domains, like Bodo, Kurukh, Maithili, Santhali, Tripuri, etc.
- Second category: Languages with a limited quantity of written texts that are not widely used in different domains such as education, administration, etc., like Kodava, Tulu, etc.
- Third category: Languages available only in spoken form and yet to be rendered into written form, like Toda, Kota, Yerava, etc.
Minor languages: A sample
These languages are representative of the linguistic reality on the ground in India.
Issues in corpora building
Corpus processing tools: Basic tools for statistical analysis
- Frequency count of words and syllables
- The facilities created for languages like Hindi and Kannada are available and, wherever necessary, language-specific modifications are made and used.
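The kind of frequency counting named on this slide can be sketched roughly as follows. This is only an illustration, not the CIIL tools: the function names, the punctuation list, the sample string and the `naive_syllables` segmenter (which assumes romanized text and a simple consonant-plus-vowel pattern) are all assumptions. A real tool would need a script-aware, language-specific syllable segmenter, which is why the segmenter is passed in as a parameter. Word length is measured in characters here, which is also an assumption.

```python
import re
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"  # assumed punctuation set; real corpora need more care


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (raw.strip(PUNCTUATION) for raw in text.split())
    return [tok for tok in tokens if tok]


def naive_syllables(word):
    """Hypothetical segmenter for romanized text: consonants + vowel run,
    with any word-final consonants attached to the last syllable."""
    return re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]+$)?", word.lower())


def corpus_statistics(text, split_syllables=naive_syllables):
    """Return corpus size, word types, average word length and syllable counts."""
    words = tokenize(text)
    word_freq = Counter(words)
    syllable_freq = Counter(s for w in words for s in split_syllables(w))
    return {
        "corpus_size": len(words),       # running words (tokens)
        "word_types": len(word_freq),    # distinct word forms
        "average_word_length": (
            sum(len(w) for w in words) / len(words) if words else 0.0
        ),
        "most_frequent_syllable": (
            syllable_freq.most_common(1)[0][0] if syllable_freq else None
        ),
        "word_freq": word_freq,
        "syllable_freq": syllable_freq,
    }


# Made-up sample string, used only to exercise the functions.
sample = "ka ra ka ru ka"
stats = corpus_statistics(sample)
print(stats["corpus_size"], stats["word_types"], stats["most_frequent_syllable"])
```

Passing a different `split_syllables` function is one way to make the language-specific modifications mentioned above without changing the counting code.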
Statistical Analysis
Comparison of Maithili, Kodava and Yerava Corpora

Statistical distribution   Maithili   Kodava   Yerava
Corpus size                328146     9432     3881
Word types                 51902      6050     3030
Most frequent syllable     ka         ra       ru
Average word length        3.52       5.70     3.10
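One way to read the corpus size and word type figures for Maithili, Kodava and Yerava together is the ratio of word types to corpus size. The slides do not report this ratio; the sketch below simply derives it from the figures given, and the function and variable names are assumptions. Smaller corpora naturally show higher ratios, so the values are not directly comparable across corpora of very different sizes.

```python
# Figures copied from the comparison of Maithili, Kodava and Yerava corpora:
# (corpus size, word types)
corpora = {
    "Maithili": (328146, 51902),
    "Kodava": (9432, 6050),
    "Yerava": (3881, 3030),
}


def type_token_ratio(corpus_size, word_types):
    """Ratio of distinct word forms to running words."""
    return word_types / corpus_size


ratios = {
    name: round(type_token_ratio(size, types), 3)
    for name, (size, types) in corpora.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```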
Comparison of Maithili and Hindi Corpora

Statistical distribution   Maithili   Hindi (CIIL)   Hindi (Naiduniya)   Hindi (India Today)   Hindi (Premchand)
Corpus size                328146     2327129        3140729             1566779               671171
Word types                 51902      189860         71953               47640                 24745
Most frequent syllable     ka         ka             ka                  ka                    ka
Average word length        3.52       4.96           4.96                4.71                  4.36
Comparison of Kodagu, Yerava, Kannada and Malayalam Corpora

Statistical distribution   Kodagu   Yerava   Kannada   Malayalam
Corpus size                9432     3881     1977987   2119935
Word types                 6050     3030     346850    526802
Average word length        5.70     3.10     8.68      10.25
Average sentence length    4.64     4.36     8.42      6.93
Most frequent syllable     ra       ru       ru        ra
Basic tools for retrieval
- Key Word in Context
- Search by required word
- Sorting and indexing
- The facilities created for languages like Hindi and Kannada are available and, wherever necessary, language-specific modifications can be made and used.
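The retrieval tools listed above (Key Word in Context, word search, sorting and indexing) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CIIL facilities: the function names `kwic` and `build_index`, the window size and the English sample sentence are all made up for the example.

```python
from collections import defaultdict


def kwic(tokens, keyword, window=3):
    """Key Word in Context: return (left context, keyword, right context)
    for every occurrence of keyword in the token list."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def build_index(tokens):
    """Sorted index mapping each word form to the positions where it occurs."""
    index = defaultdict(list)
    for position, tok in enumerate(tokens):
        index[tok].append(position)
    return dict(sorted(index.items()))


# Made-up sample text, used only to exercise the functions.
tokens = "the corpus of a minor language is a small corpus".split()

concordance = kwic(tokens, "corpus", window=2)
index = build_index(tokens)

for left, key, right in concordance:
    print(f"{left:>12} [{key}] {right}")
```

Sorting with Python's default string order is itself an assumption; sorting in the native alphabetical order of an Indian script would need a language-specific collation key.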
Advanced tools for analysis
- Part-of-speech tagging
- Morphological analyzer
Part-of-speech tagging
1. The non-availability of a standard basic tag set is one of the major drawbacks.
2. Each institution or group of scholars uses its own notation: CLAWS, Research institution in IT, CIIL (major languages), CIIL (minor languages).
3. The tagging tools being developed even for major languages are at different stages of development.
4. The POS tagging tool developed for Hindi can be tried out first on Maithili to test its viability. Hindi, too, does not yet have a fully working POS tagging tool.
5. Due to the limited data in Kodava and Yerava, manual tagging is preferred.
Morphological analyzer
The morphological analyzers designed for the minor languages of India should be sensitive enough to take care of their specific features.
- Tagged lexicon
- Rules to cover the processes of
  - Inflection: suffixing is normally based on word ending
  - Derivation: both prefixing and suffixing are possible, depending on the lexical item
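The two components named above, a tagged lexicon and affixation rules, might be combined along the following lines. This is a toy sketch, not the analyzer described in the presentation: the data structures, the function name `analyze` and the feature labels are hypothetical. The romanized Kannada forms (mane 'house', manegaLu 'houses', maneyalli 'in the house', nyaaya 'justice', anyaaya 'injustice') are used purely as illustration. Inflectional suffixes carry a condition on the stem ending, while derivational prefixes are listed per lexical item; derivational suffixes are left out to keep the example short.

```python
# Toy tagged lexicon: each stem has a POS tag and the derivational
# prefixes it is assumed to accept (hypothetical entries).
LEXICON = {
    "mane": {"tag": "NOUN", "prefixes": {}},
    "nyaaya": {"tag": "NOUN", "prefixes": {"a": "NEG"}},
}

# Inflection rules: (suffix, feature, required stem ending or None).
INFLECTION_RULES = [
    ("gaLu", "PLURAL", None),
    ("yalli", "LOCATIVE", "e"),  # toy rule: glide y after stems ending in e
]


def analyze(word):
    """Return a list of (stem, tag, features) analyses for a word form."""
    analyses = []

    # Bare stem found directly in the lexicon.
    if word in LEXICON:
        analyses.append((word, LEXICON[word]["tag"], []))

    # Inflection: strip a suffix and check the stem-ending condition.
    for suffix, feature, required_ending in INFLECTION_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            ending_ok = required_ending is None or stem.endswith(required_ending)
            if stem in LEXICON and ending_ok:
                analyses.append((stem, LEXICON[stem]["tag"], [feature]))

    # Derivation: prefixes are licensed by the individual lexical item.
    for stem, entry in LEXICON.items():
        for prefix, feature in entry["prefixes"].items():
            if word == prefix + stem:
                analyses.append((stem, entry["tag"], [feature]))

    return analyses


for form in ["mane", "manegaLu", "maneyalli", "anyaaya"]:
    print(form, analyze(form))
```

Keeping the rules in tables rather than in code is one way to let language-specific features be added without rewriting the analyzer itself.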
Semantics and Pragmatics
The Yerava word -ati has three meanings: 'to sweep', 'wind blow' and 'bottom'. The appropriate meaning has to be chosen depending on the context. In such cases the morphological analyzer demands a semantic tool.
The Kodava word bappe means 'I am coming', but when it is used in the context of leave-taking, it means 'I am leaving'. Cultural nuances in the context of leave-taking do not allow one to use the word poope ('going' or 'leaving'), because it would only mean that the person is saying the ultimate good-bye to this world. It is possible to judge the meaning of such words only with knowledge of the culture represented by a language.
Disambiguation
Ambiguities are seen in three senses: word sense, pronoun sense and structural sense. Word sense ambiguity arises from words with multiple meanings, which are found in all languages. With regard to the second, pronominal and adjectival anaphora are also ambiguities.
In English, disambiguation tools have been developed. Since the inception of a few lexical databases such as WordNet, EuroWordNet, etc., researchers seem to have overcome the ambiguity problem to a certain extent. In the case of Indian languages, however, in the absence of such a sensitive tool, one has to disambiguate manually, even in the case of major languages. Minor languages need better linguistic analysis to arrive at tangible and usable disambiguation procedures.
Conclusion
- India abounds in endangered languages. Technology can actually help maintain a language.
- Technology should immediately take into account the concerns of minority languages. In particular, the major language technologies of the region should accommodate the needs of the minor languages too.
- Corpora building in minor languages poses new challenges and calls for novel ways to accommodate and adequately describe the distinctive features of these languages.
- Comparison of corpora studies within a family of languages, across families of languages and at the international level will be helpful in bringing out a standard module for developing corpora.
Thank You
8.1 Kannada Code Chart
Demography; Astrology; Criminology; Physical Education / Sports; Health and Family Welfare; Forestry; Sexology; Culture; Anthropology
Commerce: Banking; Accountancy; Industry handicrafts; Finance; Textile Technology
Official And Media Languages: Mass Media; Legislative; Administrative
Translated Material: Literature; Scientific; Legal; Administration
Translated Psychology; Film Technology; Photography; Marine Biology; Fisheries; Textile Technology
Social Sciences: Sociology; Linguistics; Psychology; Anthropology; History, Archeology, Epigraphy; Political Science; Home Science; Library Science; Religion, Philosophy; Economics; Logic; Journalism; Folklore/Mythology; Public Administration; Law; Business Management; Education; Text Books-Social Science
Natural, Physical And Professional Sciences: Botany; Zoology; Geology; Geography; Bio Chemistry; Micro Biology; Physics; Chemistry; Mathematics; Statistics; Computer Sciences; Astronomy; Text book (Science); Medicine; Ayurveda; Homeopathy; Yoga; Naturopathy; Engineering; Architecture; Oceanology; Agriculture; Veterinary
Aesthetics
Literature: Novel; Short Story; Essays; Criticism; Humour; Children's Literature; Biographies; Autobiographies; Travelogues; Letters/Diaries/Speeches; Plays; Science Fiction; Folk Tales; Text Books (School); Social Sciences
Fine Arts: Music; Dance/Impersonations; Drawing; Sculpture; Musical Instruments; Hobbies