1
EACL 2003, Budapest, April 12-17, 2003
Computational Linguistics for South Asian Languages: Expanding Synergies with Europe
CORPORA IN MINOR LANGUAGES OF INDIA: SOME ISSUES
Dr. B. Mallikarjun
Central Institute of Indian Languages
Mysore 570 006, INDIA
mallikarjun_at_ciil.stpmy.soft.net
www.ciil.org/faculty/mallikarjun.html
www.ciilcorpora.net
2
Overview of the Presentation
1. Current status of corpora: major Indian languages
2. Current status of corpora: minor Indian languages
3. Importance of minor languages corpora
4. Objectives
5. Categorization of minor languages for corpora building
6. Minor languages: a sample
7. Issues in corpora building
8. Corpus processing tools
   a. Basic
   b. Advanced
9. Conclusion and a mission
EACL 2003, CLSAL Budapest April 12 17, 2003
3
Major Indian Languages
India has 1652 mother tongues belonging to 4
language families. The Eighth Schedule of the
Constitution of India recognizes 18 languages,
spoken by 96.29% of the population.
Current Status of their corpora
Assamese     2,622,836
Bengali      3,535,863
Gujarati     (no figure given)
Hindi        3,003,004
Kannada      2,239,537
Kashmiri     2,266,588
Konkani      (no figure given)
Malayalam    2,349,526
Manipuri     (no figure given)
Marathi      2,213,241
Nepali       (no figure given)
Oriya        2,727,670
Punjabi      1,966,260
Sanskrit     (no figure given)
Sindhi       (no figure given)
Tamil        3,381,525
Telugu       3,967,926
Urdu         1,64,125
4
The corpora differ in quantum but are comparable
in quality. Their quantum and coverage are
inadequate for wider NLP activities, so they need
to be augmented with wider coverage. Attempts at
enhancement face problems that need immediate
solutions.
5
Minor Indian languages and status of corpora
The remaining 1634 are minor languages, spoken by
3.71% of the population. The Indo-Aryan and
Dravidian language families have both major and
minor languages. Almost all the languages of the
other two families, Munda and Tibeto-Burman, are
minor languages. Text corpora building has not
taken place in these languages.
6
Importance of minor languages corpora
  • Minor languages hardly attract the
    attention of the policy makers anywhere in
    the world.
  • These are endangered in Indian social,
    educational and linguistic contexts.
  • Linguists evince great interest in studying
    the richness of languages and try to save the
    endangered languages from extinction.

7
  • They hardly attract technological research
    or become a source for it.
  • Technology has made it possible to
    empower all languages, whether they are major or
    minor ones.
  • Creating corpora in minor languages,
    especially those that have little or no written
    literature, has certain critical advantages for
    linguistic computing.
  • Experimentation with corpora designs and
    standards is more easily done in these
    languages because of the manageable quantum of
    data.

8
Objectives
  • Archival and cross-linguistic comparison within a
    language family and across language families.
  • Utilize language technology for their
    preservation and continued use.
  • Fine-tune language analysis where grammatical
    analysis is available.
  • Use the machine-readable form of the texts to
    produce as precise an analysis of the language as
    possible wherever such analysis is not available.
  • Use some of the minor languages corpora for
    machine translation purposes.
  • Speech corpora have more significance in minor
    languages, since most of them exist in spoken
    form and many are yet to be rendered into written
    form.
  • Indigenous knowledge systems: Most of the minor
    languages are resources of cultural heritage and
    a treasure house of indigenous knowledge systems.
    Once this knowledge is available in
    machine-readable form, it can be made available
    to the universal knowledge base by using UNL.
9
Categorization of minor languages
Minor languages can be classified into 3 groups
on the basis of the issues to be tackled while
building corpora.
First category: Languages other than the 18 major
languages that have a good amount of literary and
other texts and are also used in wider domains,
like Bodo, Kurukh, Maithili, Santhali, Tripuri, etc.
Second category: Languages with a limited quantity
of written texts that are not widely used in
domains such as education, administration, etc.,
like Kodava, Tulu, etc.
Third category: Languages available only in spoken
form and yet to be rendered into written form,
like Toda, Kota, Yerava, etc.
10
Minor languages: A sample
These languages are representative of the ground
linguistic reality in India.
11
Issues in corpora building
12
Corpus processing tools: Basic tools for
statistical analysis
Frequency count of words and syllables.
The facilities created for languages like Hindi and
Kannada are available; wherever necessary,
language-specific modifications are made and used.
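A word frequency count of the kind described here can be sketched in a few lines of Python. This is a minimal illustration, not the CIIL tool: the function name `corpus_statistics`, the whitespace tokenisation and the sample tokens are all assumptions. Word length is measured in Unicode code points; counting syllables in an Indic script would need a language-specific segmenter, which is not shown.

```python
from collections import Counter

def corpus_statistics(text):
    """Return basic statistics for a whitespace-tokenised text."""
    # Assumption: tokens are separated by whitespace and punctuation
    # has already been stripped.
    tokens = text.split()
    frequencies = Counter(tokens)
    total_length = sum(len(token) for token in tokens)
    return {
        "corpus_size": len(tokens),        # running words (tokens)
        "word_types": len(frequencies),    # distinct word forms
        "average_word_length": total_length / len(tokens) if tokens else 0.0,
        "most_frequent": frequencies.most_common(3),
    }

# Placeholder tokens for illustration only, not real corpus data.
sample = "ka ra ka ru ka ra"
stats = corpus_statistics(sample)
print(stats)
```

The language-specific modifications mentioned on the slide would mainly affect the tokenisation step and the unit in which word length is counted.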
13
Statistical Analysis
Comparison of Maithili, Kodava and Yerava Corpora
Statistical distribution    Maithili    Kodava    Yerava
Corpus size                 328146      9432      3881
Word types                  51902       6050      3030
Most frequent syllable      ka          ra        ru
Average word length         3.52        5.70      3.10
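One way to compare corpora of such different sizes is the type-token ratio (word types divided by corpus size). The presentation does not mention this measure; the sketch below is added only as an illustration, using the corpus size and word type figures given on this slide. The names `corpora` and `type_token_ratio` are hypothetical.

```python
# Figures copied from this slide: (corpus size, word types).
corpora = {
    "Maithili": (328146, 51902),
    "Kodava": (9432, 6050),
    "Yerava": (3881, 3030),
}

def type_token_ratio(size, types):
    """Word types divided by corpus size (running words)."""
    return types / size

ratios = {
    name: round(type_token_ratio(size, types), 3)
    for name, (size, types) in corpora.items()
}
print(ratios)
```

The ratio is sensitive to corpus size, so the higher values for the two very small corpora should not be read as evidence of a richer vocabulary.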
14
Comparison of Maithili and Hindi Corpora
Statistical distribution    Maithili    Hindi (CIIL)    Hindi (Naiduniya)    Hindi (India Today)    Hindi (Premchand)
Corpus size                 328146      2327129         3140729              1566779                671171
Word types                  51902       189860          71953                47640                  24745
Most frequent syllable      ka          ka              ka                   ka                     ka
Average word length         3.52        4.96            4.96                 4.71                   4.36
15
Comparison of Kodagu, Yerava, Kannada and
Malayalam Corpora
Statistical distribution    Kodagu    Yerava    Kannada    Malayalam
Corpus size                 9432      3881      1977987    2119935
Word types                  6050      3030      346850     526802
Average word length         5.70      3.10      8.68       10.25
Average sentence length     4.64      4.36      8.42       6.93
Most frequent syllable      ra        ru        ru         ra
16
Basic tools for retrieval
  • Key Word in Context
  • Search by required word
  • Sorting and indexing
  • The facilities created for languages like Hindi
    and Kannada are available; wherever necessary,
    language-specific modifications can be made and
    used.
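The retrieval tools listed above (Key Word in Context, search by word, sorting) can be sketched as follows. This is a minimal illustration under stated assumptions, not the facility built for Hindi and Kannada: the function name `kwic`, the `window` parameter and the English sample sentence are hypothetical.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# English stand-in text, assumed for illustration only.
tokens = "the corpus has many words and the corpus grows".split()
concordance = kwic(tokens, "corpus", window=2)

# Sorting: order the concordance lines by their right-hand context.
concordance_sorted = sorted(concordance, key=lambda line: line[2])
for left, word, right in concordance_sorted:
    print(f"{left:>10} [{word}] {right}")
```

Sorting on the right or left context is what brings recurring patterns around a key word together for inspection.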

17
Advanced tools for analysis
  • Part-of-speech tagging
  • Morphological analyzer

18
Part of speech tagging
  • 1. Non-availability of a standard basic tag set is
    one of the major drawbacks.
  • 2. Each institution or group of scholars uses its
    own notation: CLAWS, Research institution in IT,
    CIIL (Maj. lg.), CIIL (Min. lg.).
  • 3. The tagging tools being developed even for
    major languages are at different stages of
    development.
  • 4. The POS tagging tool developed for Hindi can
    be tried out in the first instance on Maithili to
    see its viability. Hindi too does not yet have a
    fully working POS tagging tool.
  • 5. Due to limited data in Kodava and Yerava,
    manual tagging is preferred.

19
Morphological analyzer
The Morphological Analyzers designed for the
minor languages of India should be sensitive
enough to take care of their specific features.
  • Tagged lexicon
  • Rules to cover the processes of:
  • Inflection: suffixing is normally based on the
    word ending
  • Derivation: both prefixing and suffixing are
    possible, depending on the lexical item
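The two components named above, a tagged lexicon and rules in which the choice of suffix depends on the word ending, can be sketched with a toy analyzer. English forms are used as stand-ins because the slide gives no Kodava or Yerava paradigms; `LEXICON`, `RULES` and `analyze` are hypothetical names, and derivation by prefixing is not covered.

```python
# Tagged lexicon: stem -> part-of-speech tag (English stand-ins,
# assumed for illustration only).
LEXICON = {"book": "N", "box": "N", "walk": "V"}

# Inflection rules: (tag, suffix, feature, condition on the stem ending).
SIBILANT_ENDINGS = ("s", "x", "ch", "sh")
RULES = [
    ("N", "es", "PL", lambda stem: stem.endswith(SIBILANT_ENDINGS)),
    ("N", "s", "PL", lambda stem: not stem.endswith(SIBILANT_ENDINGS)),
    ("V", "ed", "PAST", lambda stem: True),
]

def analyze(word):
    """Return all (stem, tag, feature) analyses licensed by lexicon and rules."""
    analyses = []
    if word in LEXICON:
        analyses.append((word, LEXICON[word], "BASE"))
    for tag, suffix, feature, stem_ok in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The stem must be in the lexicon with the right tag, and its
            # ending must license this particular suffix.
            if LEXICON.get(stem) == tag and stem_ok(stem):
                analyses.append((stem, tag, feature))
    return analyses

for form in ["boxes", "books", "walked", "boxs"]:
    print(form, analyze(form))
```

A form such as "boxs" receives no analysis because the stem ending does not license the plain "s" suffix, which is the sense in which suffixing depends on the word ending.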

20
Semantics and Pragmatics
The Yerava word -ati has three meanings, "to
sweep", "wind blow" and "bottom"; the appropriate
meaning has to be chosen depending upon the
context. In such cases the morphological analyzer
demands a semantic tool.
The Kodava word bappe means "I am coming", but
when it is used in the context of leave-taking, it
means "I am leaving". Cultural nuances in the
context of leave-taking do not allow one to use
the word poope ("going" or "leaving"), because it
would only mean that the person is saying the
ultimate good-bye to this world. It is possible to
judge the meaning of such words only with
knowledge of the culture represented by a
language.
21
Disambiguation
Ambiguities are seen in three senses: word
sense, pronoun sense and structural sense. Word
sense ambiguities arise from words having multiple
meanings, which are found in all languages.
With regard to the second, pronominal and
adjectival anaphora are also ambiguities. In
English, disambiguation tools have been
developed. After the inception of a few lexical
databases such as WordNet, EuroWordNet, etc.,
researchers seem to have overcome the ambiguity
problem to a certain extent. In the case of
Indian languages, however, in the absence of such
a sensitive tool, one has to work manually in
order to disambiguate, even in the case
of major languages. Minor languages need better
linguistic analysis to arrive at tangible and
usable disambiguation procedures.
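Word sense disambiguation from context is often approximated by counting how many context words match a set of cue words for each sense (a simplified, Lesk-style overlap). The sketch below applies that idea to the three meanings of the Yerava form cited earlier in the presentation. It is an illustration only: the cue words are hypothetical English glosses, not corpus-derived data, and `SENSES` and `choose_sense` are invented names.

```python
# Sense inventory for the Yerava form cited in the presentation. The cue
# words are hypothetical English glosses chosen for illustration.
SENSES = {
    "to sweep": {"broom", "floor", "dust", "house"},
    "wind blow": {"wind", "storm", "air", "tree"},
    "bottom": {"pot", "under", "below", "vessel"},
}

def choose_sense(context_glosses, senses=SENSES):
    """Pick the sense whose cue set overlaps most with the context.

    Returns None when no cue word occurs in the context.
    """
    context = set(context_glosses)
    best_sense, best_overlap = None, 0
    for sense, cues in senses.items():
        overlap = len(cues & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(choose_sense(["the", "broom", "floor"]))
print(choose_sense(["wind", "tree"]))
print(choose_sense(["river"]))
```

Returning None when nothing matches leaves the decision to a human annotator, which fits the manual procedure the slide describes for languages without lexical databases.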
22
Conclusion
  • India abounds in endangered languages.
    Technology can actually help maintain a language.
  • Technology should immediately take into account
    the concerns of minority languages. In particular,
    major language technologies of the region should
    accommodate the needs of the minor languages too.
  • Corpora building in minor languages poses new
    challenges and calls for novel ways to accommodate
    and adequately describe the distinctive features
    of these languages.
  • Comparison of corpora studies within a family
    of languages, across families of languages
    and at the international level will be helpful in
    bringing out a standard module for developing
    corpora.

23
Thank You
24
8.1 Kannada Code Chart
28
Demography Astrology Criminology Physical
Education / Sports Health and Family
Welfare Forestry Sexology Culture
Anthropology Commerce Banking Accountancy Industry
Handicrafts Finance Textile Technology Official
And Media Languages Mass Media Legislative
Administrative Translated Material Literature Scientific
Legal Administration Translated Psychology
Film Technology Photography Marine
Biology Fisheries Textile Technology Social
Sciences Sociology Linguistics Psychology
Anthropology History, Archeology, Epigraphy Political
Science Home Science Library Science Religion,
Philosophy Economics Logic Journalism Folklore/Mythology
Public Administration Law Business
Management Education Text Books-Social Science
Natural, Physical And Professional
Sciences Botany Zoology Geology Geography Bio
Chemistry Micro Biology Physics Chemistry Mathematics
Statistics Computer Sciences Astronomy Text
book(Science) Medicine Ayurveda Homeopathy Yoga Naturopathy
Engineering Architecture Oceanology Agriculture
Veterinary
Aesthetics Literature Novel Short Story
Essays Criticism Humour Children's
Literature Biographies Autobiographies
Travelogues Letters/Diaries/Speeches Plays
Science Fiction Folk Tales Text Books(School)
Social Sciences Fine Arts Music
Dance/Impersonations Drawing Sculpture Musical
Instruments Hobbies