Title: A PROPOSAL FOR CREATION OF A LINGUISTIC DATA CONSORTIUM FOR INDIA
1 A PROPOSAL FOR CREATION OF A LINGUISTIC DATA CONSORTIUM FOR INDIA
Focus: linguistic data
2 What is Linguistic Data?
- Printed words: in different scripts, fonts, platforms and environments
- Domain-specific texts (e.g. the 90-odd domains in the current Indian languages corpora)
- Samples of spoken corpus: telephone talk, public lectures, formal discussions, in-group conversations, radio talks, natural language queries, etc.
- Hand-written samples
- Ritualistic use of languages: scriptures, chanting, etc.
- Language of performance: reading, recitations, enactment

But this data is of use only if it comes with linguistic analysis, because it must be tagged and aligned to be of use.
THAT'S WHAT CREATES AN IMPORTANT ROLE FOR LINGUISTS IN THIS ENTERPRISE.
3
- The Brown University text corpus was adopted to build statistical language models.
- The TI-46 and TI-DIGITS databases of Texas Instruments (early 80's) were distributed by NIST.
- The LDC at U-Penn was established in 1992.
- CIIL houses 45 million word corpora in 15 Indian languages, built with DoE-TDIL support. CIIL has been distributing them to R&D groups the world over.
- Now converted into Unicode jointly with the University of Lancaster, and with another 45 million word corpora from five Indian languages under the EMILLE project coming in, the collection was released in early 2004.
- CIIL is now working with the University of Uppsala on corpora of lesser-known languages of India. See www.ciiluppsala-spokencorpus.net
- SO WHAT made us PROPOSE an LDC-IL?

How the Idea of an Indian LDC Came About
- The giant strides in IT that India has made.
- Demands made by several software and telecom giants: Reliance, IBM, HP Labs, Modular Systems, Infosys.
- Suggestions of the Hindi Committee.
- As decided in the 1st ILPC meeting, 2004.
4 RECOLLECTING THE EVOLUTION OF THE PROPOSAL
The proposal evolved through discussions held with many institutions in India and abroad.

August 13, 2003: 1st presentation at the MHRD, with the then ES in the chair, and FA, AS, J.S. (L), Director (L) and experts from C-DAC and IIT-Kanpur attending.

August 17 and 18, 2003: An international workshop on LDC was held at the CIIL, Mysore, in collaboration with IIIT-Hyderabad and HP Labs, India. It was inaugurated by Smt. Kumud Bansal (the then AS, now Secretary, Elementary Education), and attended by the J.S. (L). Those who created the LDC in the USA participated.

August 19, 2003: A follow-up meeting of a smaller group was held at the Indian Institute of Science to thrash out further details. A Project Committee was set up.
5
- Nov 18, 2003: Modified proposal submitted.
- Dec 19, 2003: During the 2nd ICON, representatives of lead institutes met in Mysore to discuss the draft sent to the Ministry. Prof. Aravind Joshi also participated.
- January 2004: With additional inputs, the proposal was modified.
- Feb 24, 2004: A number of suggestions were made (see minutes) during the 2nd presentation, for the ES, AS, JS (L) and IFD.
- April 16, 2004: After the presentation before the TDIL Advisory Committee, DoE offered full support.
- The Project Drafting Committee had top NLP specialists and linguists, with the Director, CIIL as the Coordinator.
- Five experts from IIT-B, IIT-M, IISc, IIIT-Hyd and CIIL contributed, with inputs from the industry.
- All changes were made through email exchanges, and after four rounds of teleconferencing during Sept-Oct 2003.
6 Why LDC-IL?
- The importance of creating a large data archive of Indian languages is undeniable. In fact, it is this realization that resulted in the government's plan for corpora development in the early 90s.
- Indian languages often pose a difficult challenge for specialists in AI/NLP.
- Technology developers building mass-application tools/products have long been calling for the availability of linguistic data on a large scale.
- However, the data should be collected, organized and stored in a manner that suits different groups of technology developers.
- These issues require us to involve a number of disciplines, like linguistics, statistics and CS.
- Further, this data must be of high quality, with defined standards.
- Resources must be shared, so that all R&D groups benefit.
- All these are possible with a data consortium.
7 Spoken language data: importance of phoneticians
- Numerous Indian languages, each with many sound patterns, identified and studied by phoneticians for centuries.
- The IPA inventory is invaluable for a spoken language corpus, but identifying these sounds in speech data requires finesse.
- For speech technology, we have to create both phonetic and acoustic models of languages.
- Even though the task is now aided and eased by visual phonetics technology, as available in the CIIL or TIFR labs, what we need in addition is trained phoneticians.
8 THE MODEL
- An ideal model of a consortium is the Linguistic Data Consortium (LDC) hosted by the University of Pennsylvania.
- LDC (USA) is an open consortium of universities, companies and government R&D labs that creates, collects and distributes speech and text databases, lexicons, and other resources for R&D.
- The LDC has 100-plus agencies as its active users and members, and covers some non-western languages: Arabic, Chinese, Korean.
- Its core operations became self-supporting after ten years.
- Its activities include maintaining the data archives, producing and distributing CD-ROMs, and arranging networked data distribution.
- All these have provided a great impetus to R&D in the field of language technology for English and other European languages.
- It is proposed to adopt a similar approach in the Indian context.
9 Who funded the LDC in the US? Who managed it? 1. Govt 2. Industry 3. University
- The LDC was supported initially by US Govt grant IRI-9528587 from the Information and Intelligent Systems division.
- It was also supported by grant 9982201 from the Human Computer Interaction Program of the National Science Foundation.
- It was powered in part by Academic Equipment Grant 7826-990 237-US from Sun Microsystems.
- No member institution could afford to produce this individually.
10 Who will set up LDC-IL in India? What will it actually do?
- The Ministry of HRD, through the Central Institute of Indian Languages (CIIL), Mysore, along with other institutions working on Indian language technology, such as the Indian Institute of Science, Bangalore, the Indian Institutes of Technology at Mumbai and Chennai, and the International Institute of Information Technology, Hyderabad, proposes to set up this LDC-IL.
- It is proposed that they will be the Lead Institutions in this initiative, with CIIL as the coordinating body.

LDC-IL will be an archive plus: besides data, tools and standards of data representation and analysis must be developed. It will create, analyze, segment, tag, align, and upload different kinds of linguistic resources. It will accept electronic resources from authors, newspapers, publishers, film, TV and radio, and process them for the use of the community.
11 Potential Participants / Institutions in India
All academic institutes, research organizations and corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
- IISc Bangalore
- All Indian Institutes of Technology
- IIITs at Hyderabad and elsewhere
- ISI Calcutta/Hyderabad/Bangalore
- C-DAC, Pune
- TIFR Mumbai
- Universities like U of Hyderabad, DU, JNU, NEHU
- HP Labs India
- IBM, Infosys, Reliance Infocomm
- Language institutions like CIEFL, KHS, NCPUL, RSKS
12 Major areas of Linguistic Resource Development as proposed
- Speech Recognition and Synthesis
- Character Recognition
- Creation of different kinds of Corpora
- NLP
- By-products: word finders, lexicons of different kinds, thesauri, usage compilations, etc.
13 Other possible applications
- Collocational restrictions for OCR building
- Statistical probability models for TTS
- Building a speech recognition model
- Auto-summarization
- Developing tree-bank tools
- Skeletal parses
- Forming a basis for MAT or MT systems

IN A WAY, ALL THESE WILL ONLY BE COMPLEMENTARY TO WHAT IS BEING PLANNED / ENCOURAGED BY THE TDIL PROGRAMME OF MCIT.
14 Funding & Management
- The core funding will come from the Government of India, spanning two plan periods.
- All activities will be in a project mode and run through CIIL's PL account.
- All staff will be on contract.
- All receipts and payments, whether through internet gateways or conventional means, will go to this special bank account.
- The project will attempt to leverage expertise already available, to cut avoidable cost and delay.
- As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions.
- An annual progress report will be submitted to the government.
15 Arrangements
16 PAC of LDC-IL
17 Membership: differential rates of annual fee
Other countries:
- 1. Individual researchers: US $2,000 per annum
- 2. Educational institutions: US $20,000 per annum
- 3. Software and related industry: US $50,000 per annum
India:
- 1. Individual researchers: Rs. 2,000 per annum
- 2. Educational institutions: Rs. 20,000 per annum
- 3. Software and related industry: Rs. 2,00,000 per annum

IT GOES WITHOUT SAYING THAT THIS WOULD REQUIRE CONSTANT UPDATING AND UPGRADING, AS WELL AS EXPANSION OF OUR DATA / TOOLS / PRODUCTS.
18 Estimation
- It is estimated that by the third year, LDC-IL will have 50 institutional members from India and 200 Indian scholars as individual members, contributing Rs. 12 lakh annually.
- In addition, it is estimated to have at least 20 researchers from abroad as individual members, contributing US $40,000, or Rs. 20 lakhs, more.
- The attempt will be to secure industrial support from the IT sector internationally, raising at least 10 institutional memberships initially and creating a corpus of US $200,000 annually by the third year. Should that happen, it will generate a substantial amount for LDC-IL.
19 Budget: a broad indication. Rs. 221.60 lakhs per year; Rs. 1772.8 lakhs in total for the next 8 years.
- 1. Human Resources: 69,84,000
- 2. Tasks: 64,76,000
- 3. Events (meetings, workshops, seminars & training programs): 50,00,000
- 4. Equipment & maintenance: 27,00,000
- 5. IPR costs & publications: 10,00,000
- Total: Rs. 2,21,60,000
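These figures can be cross-checked mechanically; a minimal Python sketch using only the amounts in the table above:

```python
# Budget heads for one year, in rupees (figures from the table above).
heads = {
    "Human Resources": 69_84_000,
    "Tasks": 64_76_000,
    "Events (meetings, workshops, seminars & training)": 50_00_000,
    "Equipment & maintenance": 27_00_000,
    "IPR costs & publications": 10_00_000,
}

annual = sum(heads.values())
assert annual == 2_21_60_000      # Rs. 221.60 lakhs per year (1 lakh = 100,000)
print(annual / 1_00_000)          # 221.6 lakhs per year
print(8 * annual / 1_00_000)      # 1772.8 lakhs over the 8-year horizon
```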
- NB: The Director, CIIL, on the advice of the Project Advisory Committee of the LDC-IL, may be authorized to re-appropriate funds among the heads indicated here, without exceeding the overall budget.
- In case people serving in the Government or in autonomous institutions in a substantive capacity are selected, their service and salary will be protected.
21 Resource Generation: Details
- The first 2 years of the project are incubation years. It will take time to set up and test-run tools and deliverables, and to advertise.
- It is estimated that from the third year onwards, the annual revenue may be 8 to 10% of the annual investment, i.e. Rs. 17.73 lakhs to Rs. 22.16 lakhs, contributing to the Corpus Fund.
- From the 6th year on, it will be around 25 to 30% of the amount invested, i.e. Rs. 55.4 lakhs to Rs. 66.48 lakhs annually.
- At the end of eight years, there will be at least Rs. 201.66 lakhs to Rs. 243.76 lakhs, plus interest, in the corpus funds.
- Hopefully, there will be new lead institutions contributing further to the corpus fund once LDC-IL works in full swing.
22 Core Operations to Be Self-Supporting
- Beyond eight years, the Govt may support only events (Rs. 50 lakhs from CIIL's OC-Plan), tasks of software development (Rs. 64.76 lakhs from our OE-Plan), and maintenance of equipment (Rs. 15.24 lakhs from OE-Non-Plan), i.e. Rs. 130 lakhs a year.
- The services of the personnel and the IPR costs will be paid from 6% interest on the corpus funds (Rs. 14.63 lakhs) plus the anticipated annual income of Rs. 66.48 lakhs, i.e. Rs. 81.11 lakhs generated annually. With the Rs. 130 lakhs above, the total comes to Rs. 211.11 lakhs (approx).
23 Thank you
24 Speech Recognition and Synthesis: Objectives
- 1. Primarily, to build speech recognition and synthesis systems.
- 2. Although there are ASR & TTS systems for many western languages, commercially viable speech systems for Indian languages are unavailable.
- 3. Voice user interfaces for IT applications and services are useful especially in telephony-based applications.
- 4. If such technology is available in Indian languages, people in various semi-urban and rural parts of India will be able to use telephones and the Internet to access a wide range of services and information on health, agriculture, travel, etc.
- 5. However, for this a computer has to be able to accept speech input in the user's language and provide natural speech output.
- 6. The benefits multiply in India if speech technology is coupled with translation systems between the various Indian languages.
- 7. The main obstacle to customizing this technology for various Indian languages is the lack of appropriate annotated speech databases.
- 8. Focus: (i) to collect data that can be used for building speech-enabled systems in Indian languages, and (ii) to develop tools that facilitate collection of high-quality speech data.
25 Goals: long & short term
26 Methodology
27 Possible Applications
- Speech-to-speech translation for a pair of Indian languages, namely Hindi and Telugu.
- Command and control applications.
- Multimodal interfaces to the computer in Indian languages.
- E-mail readers over the telephone.
- Readers for the visually disadvantaged.
- Speech-enabled office suites.

The effort for both speech recognition and speech synthesis will be repeated across all 22 Scheduled languages. For speech recognition, spontaneous speech data will be collected along with read speech. For speech synthesis, data will be collected from professional speakers with very good voice quality. Additional speech data will be collected to build models for prosody (intonation, duration, etc.) to improve the naturalness of synthesized speech. A database (lexicon) of proper names (of Indian origin) will be created, with the equivalent phonetic representation for each of the names, as sketched below.
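One way such a proper-name lexicon could be represented is a simple mapping from names to phone sequences; a minimal sketch with hypothetical entries and an illustrative (not project-mandated) phone set:

```python
# Minimal proper-name pronunciation lexicon: each name of Indian origin
# maps to one or more phone sequences. The phone symbols here are
# illustrative; the project would fix a standard transcription (e.g. IPA).
name_lexicon = {
    "Mysore": [["m", "ai", "s", "oo", "r"]],
    "Ganga":  [["g", "a", "ng", "g", "aa"]],
}

def pronunciations(name):
    """Return all recorded phone sequences for a name (empty if unknown)."""
    return name_lexicon.get(name, [])

for phones in pronunciations("Mysore"):
    print(" ".join(phones))        # m ai s oo r
```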
28 Character Recognition
- Character recognition refers to the conversion of printed or handwritten characters to a machine-interpretable form.
- Online handwriting recognition (online HWR) refers to the interpretation of handwriting captured dynamically using a handheld or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and also aids the teaching of handwriting skills using computers.
- Offline handwriting recognition (offline HWR) refers to the interpretation of handwriting captured statically as an image (the contrast with the online case is sketched after this list).
- Optical character recognition (OCR) refers to the interpretation of printed text captured as an image. It can be used for the conversion of printed or typewritten material such as books and documents into electronic form.
- These different areas of language technology require different algorithms and linguistic resources.
- They are all hard research problems because of the variety of writing styles and fonts encountered.
- Of these, OCR has seen some research in a few Indian scripts because of support from the TDIL program. However, the technology is not yet mature, and there is only one commercial offering.
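The online/offline contrast shows up directly in the input data structures; a minimal sketch, with illustrative field names rather than any project standard:

```python
from dataclasses import dataclass

@dataclass
class OnlineSample:
    """Online HWR input: pen strokes captured dynamically.
    Each stroke is a time-ordered list of (x, y, t) points."""
    strokes: list          # e.g. [[(0.0, 0.0, 0.00), (0.1, 0.2, 0.01)], ...]
    label: str             # ground-truth text for the sample

@dataclass
class OfflineSample:
    """Offline HWR / OCR input: a static image only; stroke order
    and timing are lost and must be inferred, if at all, from pixels."""
    pixels: list           # height x width grayscale raster
    label: str

# Online recognizers can exploit stroke order and pen dynamics;
# offline recognizers must recover structure from the image alone.
```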
29 Possible Applications
30 Natural Language Processing
Electronic dictionaries
- Electronic dictionaries are a primary requisite for developing any software in NLP.
- ED 1: Monolingual/bilingual dictionaries, 25,000 words per year (per language).
- ED 2: Transfer Lexicon and Grammar (TransLexGram), per language.
- Building the Transfer Lexicon and Grammar involves developing a language resource which would contain:
- English headwords
- Their grammatical category
- Their various senses in Hindi
- The corresponding sense in the other Indian language
- An example sentence in English for each sense of a word
- The corresponding translation in the concerned Indian language
- In the case of verbs, parallel verb frames from English to the Indian language
- As is obvious from the above, TransLexGram will be a rich lexicon which will contain not only word-level information but also the crucial information of verb-argument structure and the vibhaktis associated with specific senses of a verb (see the sketch after this list).
- The resource, once created, will be a parallel resource not only between English and Indian languages but also across all Indian languages.
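The record structure described above might be sketched as follows; the field names and the sample entry are illustrative, not prescribed by the proposal:

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    hindi_sense: str             # the sense expressed in Hindi
    other_language_sense: str    # corresponding sense in the other Indian language
    english_example: str         # example sentence in English for this sense
    translated_example: str      # its translation in the concerned Indian language
    verb_frames: list = field(default_factory=list)  # parallel verb frames (verbs only)

@dataclass
class TransLexGramEntry:
    headword: str                # English headword
    category: str                # grammatical category
    senses: list                 # one Sense per distinct meaning

entry = TransLexGramEntry(
    headword="run",
    category="verb",
    senses=[Sense("daudna", "<sense in the other language>",
                  "The children run in the park.",
                  "<translation in the concerned language>")],
)
```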
31 Creation of Corpora
Domain-Specific Corpora
- Apart from basic text corpora creation, an attempt will be made to create domain-specific corpora in the following areas:
- a. Newspaper corpora
- b. Child language corpus
- c. Pathological speech/language data
- d. Speech error data
- e. Historical/inscriptional databases of Indian languages, which are among the most important resources, both as living documents of Indian history and for the historical linguistics of Indian languages.
- f. Grammars of the comparative/descriptive/reference kind, which need to be considered as corpus databases.
- g. Morphological analyzers and morphological generators (a toy sketch follows).
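For item (g), the simplest form of a rule-based morphological analyzer strips known suffixes against a stem list; a toy sketch with hypothetical stems and suffixes (a real analyzer would need full paradigm tables and sandhi handling):

```python
# Toy suffix-stripping analyzer. Stems, suffixes and feature labels
# below are hypothetical, for illustration only.
STEMS = {"ghar", "kitab"}
SUFFIXES = {"": "sg.direct", "on": "pl.oblique", "en": "pl.direct"}

def analyze(word):
    """Return all (stem, features) analyses consistent with the rules."""
    analyses = []
    for suffix, features in SUFFIXES.items():
        stem = word[: len(word) - len(suffix)] if suffix else word
        if word.endswith(suffix) and stem in STEMS:
            analyses.append((stem, features))
    return analyses

print(analyze("kitaben"))   # [('kitab', 'pl.direct')]
print(analyze("gharon"))    # [('ghar', 'pl.oblique')]
```

A generator simply runs the same table in the other direction, concatenating a stem with the suffix that realizes the requested features.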
32 POS-Tagged Corpora
- Part-of-speech (or POS) tagged corpora are collections of texts in which the part-of-speech category for each word is marked.
- They are to be developed in a bootstrapping manner, as the sketch below summarizes.
- First, manual tagging will be done on some amount of text.
- Then, a POS tagger which uses learning techniques will learn from the tagged data.
- After the training, the tool will automatically tag another set of the raw corpus.
- The automatically tagged corpus will then be manually validated and used as additional training data for enhancing the performance of the tool.
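The loop can be summarized in code; a schematic sketch in which a most-frequent-tag baseline stands in for the actual trainable tagger, and manual validation is only marked, not performed:

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Learn a most-frequent-tag baseline from (word, tag) pairs.
    A real trainable tagger (HMM, etc.) would replace this."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, sentence, default="NN"):
    return [(w, model.get(w, default)) for w in sentence]

# Round 0: a small manually tagged seed corpus (hypothetical tags).
seed = [[("birds", "NNS"), ("fly", "VB")], [("fish", "NNS"), ("swim", "VB")]]
model = train(seed)

# Each round: auto-tag raw text, have annotators validate it (the manual
# step, not shown), then fold the validated data back in as training data.
raw = [["birds", "swim"]]
auto_tagged = [tag(model, s) for s in raw]
validated = auto_tagged            # stand-in for manual validation
model = train(seed + validated)
print(auto_tagged)                 # [[('birds', 'NNS'), ('swim', 'VB')]]
```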
33 Other Kinds of Corpora
Semantically tagged corpora: The real challenge in any NLP and text information processing application is the task of disambiguating senses. In spite of long years of R&D in this area, fully automatic WSD with 100% accuracy has remained an elusive goal. One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. One such resource is the "semantically tagged corpus".
Chunked corpora
- The chunked corpora will be prepared in a manner similar to the POS tagging. Here also, the initial training set will be a completely manual effort. Thereafter, it will be a man-machine effort. That is why the target in the first year is lower, doubling in the successive years. Chunked corpora are a useful resource for various applications.
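A chunked sentence adds a shallow bracketing layer over the POS tags; one common encoding is IOB labels, as in this illustrative sketch:

```python
# IOB-encoded chunked sentence: each token carries a POS tag plus a chunk
# label (B = begins a chunk, I = continues it). Tags here are illustrative.
chunked_sentence = [
    ("the", "DT",  "B-NP"),
    ("old", "JJ",  "I-NP"),
    ("man", "NN",  "I-NP"),
    ("sat", "VBD", "B-VP"),
]
```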
34 Syntactic Tree Bank
- Preparation of this resource requires a higher level of linguistic expertise and more human effort. First, experts will manually tag the data for syntactic parsing.
- A crucial point related to this task is to arrive at a consensus regarding the tags, the degree of fineness in analysis, and the methodology to be followed. This calls for discussions among scholars from varying fields, such as Sanskritists, linguists and computer scientists. It will be achieved through workshops and meetings.
Parallel aligned corpora
- A text available in multiple languages through translation constitutes a parallel corpus.
- NBT & Sahitya Akademi are some of the official agencies who develop parallel texts in different languages through translation.
- Such institutions have given permission to CIIL to use their works for the creation of electronic versions as parallel corpora.
- Literary magazines and newspaper houses with multiple language editions will have to be approached for parallel corpora.
- Computer programs have to be written for creating: I. Aligned texts, II. Aligned sentences and III. Aligned chunks.
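For step II, a classic starting point is length-based alignment in the style of Gale and Church; a simplified sketch that scores only 1-1 links by character-length similarity, with skips as a fallback:

```python
def align_sentences(src, tgt, gap=2.0):
    """Simplified length-based sentence alignment: only 1-1 links and
    skips are modelled (full Gale-Church also handles 1-2/2-1 merges)."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:          # link src[i] with tgt[j]
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j])) / (len(src[i]) + len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, True)
            if i < n and cost[i][j] + gap < cost[i + 1][j]:      # skip a source sentence
                cost[i + 1][j], back[i + 1][j] = cost[i][j] + gap, (i, j, False)
            if j < m and cost[i][j] + gap < cost[i][j + 1]:      # skip a target sentence
                cost[i][j + 1], back[i][j + 1] = cost[i][j] + gap, (i, j, False)
    links, i, j = [], n, m
    while (i, j) != (0, 0):              # trace the cheapest path back
        pi, pj, linked = back[i][j]
        if linked:
            links.append((pi, pj))
        i, j = pi, pj
    return links[::-1]

src = ["A short sentence.", "And a noticeably longer second sentence."]
tgt = ["Ek chhota vakya.", "Aur ek kaafi lamba doosra vakya."]
print(align_sentences(src, tgt))     # [(0, 0), (1, 1)]
```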
35 Corpora Tools
- 1. Tools for Transfer Lexicon & Grammar (including creation of an interface for building the Transfer Lexicon & Grammar)
- 2. Spellchecker and corrector tools (a minimal sketch appears at the end of this list)
- 3. Tools for POS tagging (a trainable tagging tool and an interface for editing POS-tagged corpora)
- 4. Tools for chunking (rule-based, language-independent chunkers)
- 5. Interface for chunking (building an interface for editing and validating the chunked corpora)
- 6. Tools for the syntactic tree bank, including an interface for developing the syntactic tree bank
- 7. Tools for semantic tagging, the basic resources being the Indian language WordNets: a browser with two windows, one showing the text and the other showing the senses (i.e., synsets) from the WordNet, from which a manual selection of the sense can be done
- 8. A (semi-)automatic tagger based on statistical NLP (a preliminary version of which is ready at IITB)
- 9. Tools for text alignment, including a text alignment tool, a sentence alignment tool and a chunk alignment tool, as well as an interface for aligning corpora
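For tool 2, the core of most spellcheckers is edit distance between the input word and lexicon entries; a minimal sketch with a hypothetical wordlist:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, max_dist=2):
    """Return lexicon entries within max_dist edits, nearest first."""
    near = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for d, w in near if d <= max_dist]

# Hypothetical lexicon; a deployed tool would load a full per-language wordlist.
print(suggest("recieve", ["receive", "relieve", "believe", "recite"]))
# ['relieve', 'believe', 'receive', 'recite']  (all within 2 edits)
```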