1
A PROPOSAL FOR CREATION OF A LINGUISTIC DATA CONSORTIUM FOR INDIA
Focus: linguistic data
2
What is Linguistic Data?
  • Printed words - in different scripts, fonts, platforms & environments
  • Domain-specific texts (e.g. the 90-odd domains in current Indian languages corpora)
  • Samples of spoken corpus: telephone talk, public lectures, formal discussions, in-group conversations, radio talks, natural language queries, etc.
  • Hand-written samples
  • Ritualistic use of languages: scriptures, chanting, etc.
  • Language of performance: reading, recitations, enactment

But this data is of use only if it comes with linguistic analysis, because it must be tagged and aligned to be of use.
THAT'S WHAT CREATES AN IMPORTANT ROLE FOR LINGUISTS IN THIS ENTERPRISE.
3
  • The Brown University text corpus was adopted to
    build statistical language models.
  • The TI-46 & TI DIGITS databases of Texas Instruments (early 1980s), distributed by NIST.
  • The LDC at U-Penn was established in 1992.
  • CIIL houses 45-million-word corpora in 15 Indian languages, built with DoE-TDIL support. CIIL has been distributing them to R&D groups the world over.
  • Now converted into Unicode jointly with the University of Lancaster; with another 45-million-word corpora from five Indian languages under the EMILLE project coming in, it was released in early 2004.
  • CIIL is now working with the University of Uppsala on corpora of lesser-known languages of India. See www.ciiluppsala-spokencorpus.net
  • SO WHAT made us PROPOSE an LDC-IL?

How did the idea of an Indian LDC come about?
  • The giant strides in IT that India has made.
  • Because demands were made by several software and telecom giants: Reliance, IBM, HP Labs, Modular Systems & Infosys.
  • Due to suggestions of the Hindi Committee.
  • As decided in the 1st ILPC meeting, 2004.

4
RECOLLECTING THE EVOLUTION OF THE PROPOSAL
The proposal evolved through discussions held with many institutions in India and abroad.
August 13, 2003: 1st presentation at the MHRD, with the then ES in the chair, and FA, AS, J.S.(L), Director (L) and experts from C-DAC and IIT-Kanpur.
August 17 and 18, 2003: An international workshop on LDC was held at the CIIL, Mysore, in collaboration with IIIT-Hyderabad and HP Labs, India. It was inaugurated by Smt. Kumud Bansal (the then AS, now Secretary, Elementary Education) and attended by the J.S. (L). Those who created the LDC in the USA participated.
August 19, 2003: A follow-up meeting of a smaller group was held at the Indian Institute of Science to thrash out further details. A Project Committee was set up.
5
  • Nov 18, 2003: Modified proposal submitted.
  • Dec 19, 2003: During the 2nd ICON, representatives of the lead institutes met in Mysore to discuss the draft sent to the Ministry. Prof. Aravind Joshi also participated.
  • January 2004: With additional inputs, the proposal was modified.
  • Feb 24, 2004: A number of suggestions were made (see minutes) during the 2nd presentation for ES, AS, JS(L), IFD.
  • April 16, 2004: After the presentation before the TDIL Advisory Committee, DoE offered full support.
  • The Project Drafting Committee had top NLP specialists and linguists, with the Director, CIIL as the Coordinator.
  • Five experts from IIT-B, IIT-M, IISc, IIIT-Hyd and CIIL, with inputs from the industry.
  • All changes were made through email exchanges and four rounds of teleconferencing during Sept-Oct 2003.

6
Why LDC-IL?
  • The importance of creating a large data archive of Indian languages is undeniable. In fact, it is this realization that resulted in the government's plan for corpora development in the early 90s.
  • Indian languages often pose a difficult challenge for specialists in AI/NLP.
  • Technology developers building mass-application tools/products have long been calling for the availability of linguistic data on a large scale.
  • However, the data should be collected, organized and stored in a manner that suits different groups of technology developers.
  • These issues require us to involve a number of disciplines like linguistics, statistics and CS.
  • Further, this data must be of high quality, with defined standards.
  • Resources must be shared, so that all R&D groups benefit.
  • All these are possible with a data consortium.

7
Spoken language data: the importance of phoneticians
  • Numerous Indian languages, each with many sound patterns identified/studied by phoneticians for centuries.
  • The inventory of the IPA is invaluable for a spoken language corpus, but identifying these sounds in speech data requires finesse.
  • For speech technology, we have to create both phonetic and acoustic models of languages.
  • Even though this is now aided and eased by visual phonetics technology, as available in CIIL or TIFR labs, what we need in addition is trained phoneticians.

8
THE MODEL
  • An ideal model of a consortium is the Linguistic Data Consortium (LDC) hosted by the University of Pennsylvania.
  • LDC (USA) is an open consortium of universities, companies & government R&D labs that creates, collects and distributes speech and text databases, lexicons, and other resources for R&D.
  • This LDC has 100-plus agencies as active users and members. It includes some non-Western languages: Arabic, Chinese, Korean.
  • Its core operations became self-supporting after ten years.
  • The activities include maintaining the data archives, producing and distributing CD-ROMs, arranging networked data distribution, etc.
  • All these have provided a great impetus to R&D in the field of language technology for English and other European languages.
  • It is proposed to adopt a similar approach in the Indian context.

9
Who funded the LDC in the US? Who managed it? 1. Govt 2. Industry 3. University
  • The LDC was supported initially by US Govt grant IRI-9528587 from the Information and Intelligent Systems division.
  • Also by grant 9982201 from the Human Computer Interaction Program of the National Science Foundation.
  • Powered in part by Academic Equipment Grant 7826-990 237-US from Sun Microsystems.
  • No member institution could afford to produce this individually.

10
Who will set up LDC-IL in India? What will it actually do?
  • The Ministry of HRD, through the Central Institute of Indian Languages (CIIL), Mysore, along with other institutions working on Indian language technology, such as the Indian Institute of Science, Bangalore, the Indian Institutes of Technology at Mumbai and Chennai, and the International Institute of Information Technology, Hyderabad, proposes to set up this LDC-IL.
  • It is proposed that they will be the lead institutions in this initiative, with CIIL as the coordinating body.

LDC-IL will be an archive plus. Besides data, tools and standards of data representation and analysis must be developed. It will create, analyze, segment, tag, align, and upload different kinds of linguistic resources. It will accept electronic resources from authors, newspapers, publishers, film, TV and radio, and process them for the use of the community.
11
Potential Participants / Institutions in India
All academic institutes, research organizations and corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
  • IISc Bangalore
  • All Indian Institutes of Technology
  • IIITs at Hyderabad and elsewhere
  • ISI Calcutta/Hyderabad/Bangalore
  • C-DAC, Pune
  • TIFR Mumbai
  • Universities like the University of Hyderabad, DU, JNU, NEHU
  • HP Labs India
  • IBM, Infosys, Reliance Infocom
  • Language institutions like CIEFL, KHS, NCPUL & RSKS

12
Major areas of Linguistic Resource Development as proposed
  • Speech Recognition and Synthesis
  • Character Recognition
  • Creation of different kinds of Corpora
  • NLP
  • By-products: word finders, lexicons of different kinds, thesauri, usage compilations, etc.

13
Other possible applications
  • Collocational restrictions for OCR building
  • Statistical probability models for TTS
  • Building a speech recognition model
  • Auto-summarization
  • Developing tree-bank tools
  • Skeletal parses
  • Forming a basis for MAT or MT systems

IN A WAY, ALL THESE WILL ONLY COMPLEMENT WHAT IS BEING PLANNED / ENCOURAGED BY TDIL OF MCIT.
14
Funding & Management
  • The core funding will come from the Government of India. It will span two plan periods.
  • All activities will be in project mode and through CIIL's PL account.
  • All staff will be on contract.
  • All receipts and payments, through internet gateways or conventional means, will go to this special bank account.
  • LDC-IL will attempt to leverage expertise already available to cut avoidable cost and delay.
  • As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions.
  • An annual progress report will be submitted to the government.

15
Arrangements
16
PAC of LDC-IL
17
Membership: Differential rates of annual fee
  • Other countries
  • 1. Individual researchers: 2,000/- per annum
  • 2. Educational institutions: 20,000/- per annum
  • 3. Software and related industry: 50,000/- per annum
  • India
  • 1. Individual researchers: Rs. 2,000/- per annum
  • 2. Educational institutions: Rs. 20,000/- per annum
  • 3. Software and related industry: Rs. 2,00,000/- per annum

IT GOES WITHOUT SAYING THAT THIS WOULD REQUIRE CONSTANT UPDATING AND UPGRADING, AS WELL AS EXPANSION OF OUR DATA / TOOLS / PRODUCTS.
18
Estimation
  • It is estimated that by the third year, LDC-IL will have 50 institutional members from India and 200 Indian scholars as individual members, contributing Rs. 12 lakh annually.
  • In addition, it is estimated to have at least 20 researchers from abroad as individual members, contributing another 40,000 at the foreign rate, i.e. about Rs. 20 lakh.
  • The attempt will be to secure industrial support from the IT sector internationally, with at least 10 institutional memberships initially, creating a corpus of 200,000 annually by the third year. Should that happen, it will generate a substantial amount for LDC-IL.

19
Budget: A broad indication. Rs. 221.60 lakh per year; total Rs. 1772.8 lakh for the next 8 years.
  • 1. Human Resources: 69,84,000
  • 2. Tasks: 64,76,000
  • 3. Events (meetings, workshops, seminars & training programs): 50,00,000
  • 4. Equipment & maintenance: 27,00,000
  • 5. IPR costs & publications: 10,00,000
  • Total: Rs. 2,21,60,000
  • NB: The Director, CIIL, on the advice of the Project Advisory Committee of the LDC-IL, may be authorized to re-appropriate funds among the heads indicated here, without exceeding the overall budget.
  • In case people serving in the Government or autonomous institutions in a substantive capacity are selected, their service and salary will be protected.

21
Resource Generation - Details
  • The first 2 years of the project are incubation years. It will take time to set up and test-run tools and deliverables, and to advertise.
  • It is estimated that from the third year onwards, the annual revenue may be 8 to 10% of the annual investment, i.e. Rs. 17.73 lakh to Rs. 22.16 lakh, contributing to the Corpus Fund.
  • From the 6th year on, it will be around 25 to 35% of the amount invested, i.e. Rs. 55.4 lakh to Rs. 66.48 lakh annually.
  • At the end of eight years, there will be at least Rs. 201.66 lakh to Rs. 243.76 lakh, plus interest, in the corpus fund.
  • Hopefully, there will be new lead institutions contributing further to the corpus fund once LDC-IL is in full swing.

22
Core operations to be self-supporting
  • Beyond eight years, the Govt may support only events (Rs. 50 lakh from CIIL's OC-Plan), tasks of software development (Rs. 64.76 lakh from our OE-Plan), and maintenance of equipment (Rs. 15.24 lakh from OE-Non-Plan), i.e. Rs. 130 lakh a year.
  • The services of the personnel and the IPR costs will be paid from 6% interest on the corpus funds (Rs. 14.63 lakh) plus the anticipated annual income of Rs. 66.48 lakh, i.e. Rs. 81.11 lakh generated annually. With the Rs. 130 lakh above, the total comes to approximately Rs. 211.11 lakh, as the worked figures below show.
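A quick check of these figures (all amounts in Rs. lakh, taken from the slide above):

    Government support:   50.00 + 64.76 + 15.24 = 130.00
    Own resources:        14.63 (6% interest on corpus fund) + 66.48 (annual income) = 81.11
    Total annual outlay: 130.00 + 81.11 = 211.11 (approx.)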

23
Thank you
24
Speech Recognition and Synthesis: Objectives
  • 1. Primarily to build speech recognition and synthesis systems.
  • 2. Although there are ASR & TTS systems for many Western languages, commercially viable speech systems are unavailable for Indian languages.
  • 3. Voice user interfaces for IT applications and services, useful especially in telephony-based applications.
  • 4. If such technology is available in Indian languages, people in various semi-urban and rural parts of India will be able to use telephones and the Internet to access a wide range of services and information on health, agriculture, travel, etc.
  • 5. However, for this a computer has to be able to accept speech input in the user's language and provide natural speech output.
  • 6. The benefit will be even greater in India if speech technology is coupled with translation systems between the various Indian languages.
  • 7. The main obstacle to customizing this technology for various Indian languages is the lack of appropriate annotated speech databases.
  • 8. Focus: (i) to collect data that can be used for building speech-enabled systems in Indian languages and (ii) to develop tools that facilitate collection of high quality speech data.

25
Goals: long & short term
26
Methodology
27
Possible Applications
  • Speech-to-speech translation for a pair of Indian languages, namely Hindi and Telugu.
  • Command and control applications.
  • Multimodal interfaces to the computer in Indian languages.
  • E-mail readers over the telephone.
  • Readers for the visually disadvantaged.
  • Speech-enabled office suite.

The effort for both speech recognition and speech synthesis will be repeated across all 22 Scheduled languages. For speech recognition, spontaneous speech data will be collected along with read speech. For speech synthesis, data will be collected from professional speakers with very good voice quality. Additional speech data will be collected to build models for prosody (intonation, duration, etc.) to improve the naturalness of synthesized speech. A database (lexicon) of proper names (of Indian origin) will be created, with an equivalent phonetic representation for each name.
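As a purely illustrative aside, the proper-name pronunciation lexicon mentioned above could be stored as a simple mapping from names to phone sequences. The Python sketch below assumes such a format; the example names, the placeholder phone strings and the file name are illustrative assumptions, not part of the proposal.

    # Illustrative sketch of a proper-name pronunciation lexicon (assumed format).
    # The phone sequences are rough placeholders, not authoritative transcriptions.
    import json

    pronunciation_lexicon = {
        "Mysore":  "m ai s uu r",
        "Kaveri":  "k aa v e r i",
        "Lakshmi": "l a k sh m i",
    }

    def lookup(name: str) -> str:
        """Return the stored phone sequence, or an empty string if the name is unknown."""
        return pronunciation_lexicon.get(name, "")

    if __name__ == "__main__":
        # Persist the lexicon so a synthesis front-end could load it later (hypothetical file name).
        with open("name_lexicon.json", "w", encoding="utf-8") as f:
            json.dump(pronunciation_lexicon, f, ensure_ascii=False, indent=2)
        print(lookup("Mysore"))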
28
Character Recognition
  • Character Recognition refers to the conversion of
    printed or handwritten characters to a
    machine-interpretable form.
  • Online handwriting recognition or online HWR refers to the interpretation of handwriting captured dynamically using a handheld or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and also for imparting handwriting skills using computers.
  • Offline handwriting recognition or Offline HWR
    refers to the interpretation of handwriting
    captured statically as an image.
  • Optical character recognition or OCR refers to
    the interpretation of printed text captured as an
    image. It can be used for conversion of printed
    or typewritten material such as books and
    documents into electronic form.
  • These different areas of language technology
    require different algorithms and linguistic
    resources.
  • They are all hard research problems because of
    the variety of writing styles and fonts
    encountered.
  • Of these, OCR has seen some research in a few Indian scripts because of support from the TDIL program. However, the technology is not yet mature and there is only one commercial offering. (A minimal invocation sketch follows this list.)
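To make the OCR step concrete, here is a minimal sketch of how an existing open-source engine might be invoked on a scanned page; the choice of Tesseract (via the pytesseract wrapper), the 'hin' Devanagari/Hindi language pack and the file name are assumptions for illustration, not the tools proposed here.

    # Minimal sketch: running an off-the-shelf OCR engine on one scanned page image.
    # Assumes Tesseract and its Hindi ('hin') traineddata are installed locally.
    from PIL import Image       # pip install pillow
    import pytesseract          # pip install pytesseract

    def ocr_page(image_path: str, lang: str = "hin") -> str:
        """Return the text recognised on one scanned page."""
        page = Image.open(image_path)
        return pytesseract.image_to_string(page, lang=lang)

    if __name__ == "__main__":
        print(ocr_page("sample_page.png"))   # hypothetical input file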

29
Possible Applications
30
Natural Language Processing
  • Electronic dictionaries
  • Electronic dictionaries are a primary requisite
    for developing any software in NLP.
  • ED 1: Monolingual/bilingual dictionaries - 25,000 words per year (per language)
  • ED 2: Transfer Lexicon and Grammar (TransLexGram) (per language)
  • The Transfer Lexicon and Grammar involves developing a language resource which would contain:
  • English headwords
  • Their grammatical category
  • Their various senses in Hindi
  • The corresponding sense in the other Indian language
  • An example sentence in English for each sense of a word
  • The corresponding translation in the concerned Indian language
  • In the case of verbs, parallel verb frames from English to the Indian language.
  • As is obvious from the above, TransLexGram will be a rich lexicon which will contain not only word-level information but also the crucial information of verb-argument structure and the vibhaktis associated with specific senses of a verb. (A sketch of one such entry follows this list.)
  • The resource, once created, will be a parallel resource not only between English and Indian languages but also across all Indian languages.
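To make the intended structure concrete, the sketch below models one hypothetical TransLexGram entry as a small Python data class; the field names and the example content are illustrative assumptions, not the proposal's actual schema.

    # Illustrative sketch of a Transfer Lexicon and Grammar (TransLexGram) entry.
    # Field names and example values are assumptions made for illustration only.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Sense:
        hindi_sense: str                # sense of the English headword in Hindi
        other_language_sense: str       # corresponding sense in another Indian language
        english_example: str            # example sentence in English for this sense
        translated_example: str         # its translation in the concerned Indian language
        verb_frames: List[str] = field(default_factory=list)   # parallel verb frames (verbs only)

    @dataclass
    class TransLexGramEntry:
        english_headword: str
        grammatical_category: str
        senses: List[Sense]

    # A hypothetical entry; the content is a placeholder, not a curated record.
    entry = TransLexGramEntry(
        english_headword="run",
        grammatical_category="verb",
        senses=[Sense(
            hindi_sense="daudna",
            other_language_sense="(sense in the other Indian language)",
            english_example="The children run in the park.",
            translated_example="(translation in the concerned Indian language)",
            verb_frames=["NP-agent V"],
        )],
    )
    print(entry.english_headword, len(entry.senses))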

31
Creation of Corpora
  • Domain-specific corpora
  • Apart from this basic text corpora creation, an attempt will be made to create domain-specific corpora in the following areas:
  • a. Newspaper corpora
  • b. Child language corpus
  • c. Pathological speech/language data
  • d. Speech error data
  • e. Historical/inscriptional databases of Indian languages, which are among the most important resources, both as living documents of Indian history and for the historical linguistics of Indian languages.
  • f. Comparative/descriptive/reference grammars, which also need to be considered as corpus databases.
  • g. Morphological analyzers and morphological generators.

32
POS tagged corpora
  • Part-of-speech (or POS) tagged corpora are collections of texts in which the part-of-speech category of each word is marked.
  • They will be developed in a bootstrapping manner.
  • First, manual tagging will be done on some amount of text.
  • Then, a POS tagger which uses learning techniques will learn from the tagged data.
  • After training, the tool will automatically tag another set of the raw corpus.
  • The automatically tagged corpus will then be manually validated and used as additional training data for enhancing the performance of the tool. (A minimal sketch of this loop follows this list.)
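A minimal sketch of the bootstrapping loop described above, with a toy most-frequent-tag model standing in for the actual trainable tagger; the words, the tagset and the validation step are placeholders.

    # Sketch of bootstrapped POS tagging: train on a manually tagged seed corpus,
    # auto-tag raw text, have annotators validate it, then retrain on the larger set.
    from collections import Counter, defaultdict
    from typing import Dict, List, Tuple

    TaggedSent = List[Tuple[str, str]]

    def train(tagged_sents: List[TaggedSent]) -> Dict[str, str]:
        """Learn the most frequent tag for every word seen in the training data."""
        counts: Dict[str, Counter] = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def auto_tag(model: Dict[str, str], sent: List[str]) -> TaggedSent:
        return [(w, model.get(w, "UNK")) for w in sent]    # "UNK" marks unseen words

    # Round 1: a tiny manually tagged seed (placeholder words and tagset).
    seed: List[TaggedSent] = [[("raama", "NN"), ("jaataa", "VB"), ("hai", "AUX")]]
    model = train(seed)

    # Round 2: tag raw text automatically, then pass it to human validators.
    raw_sentence = ["siitaa", "jaataa", "hai"]
    print(auto_tag(model, raw_sentence))                   # 'siitaa' comes out as UNK
    validated = [("siitaa", "NN"), ("jaataa", "VB"), ("hai", "AUX")]   # after manual correction

    # Round 3: the validated output becomes additional training data.
    model = train(seed + [validated])
    print(auto_tag(model, raw_sentence))                   # now tagged from the enlarged model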

33
Other kinds of Corpora
Semantically tagged corpora: The real challenge in any NLP and text information processing application is the task of disambiguating senses. In spite of long years of R&D in this area, fully automatic WSD with 100% accuracy has remained an elusive goal. One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. One such resource is the "semantically tagged corpus".
  • Chunked corpora
  • The chunked corpora will be prepared in a manner similar to the POS tagging. Here also, the initial training set will be a completely manual effort. Thereafter, it will be a man-machine effort. That is why the target in the first year is lower and doubles in the successive years. Chunked corpora are a useful resource for various applications.

34
  • Syntactic tree bank
  • Preparation of this resource requires a higher level of linguistic expertise and more human effort. First, experts will manually tag the data for syntactic parsing.
  • A crucial point related to this task is to arrive at a consensus regarding the tags, the degree of fineness in analysis and the methodology to be followed. This calls for discussions among scholars from varying fields such as Sanskritists, linguists and computer scientists, and will be achieved through workshops and meetings.
  • Parallel aligned corpora
  • A text available in multiple languages through translation constitutes a parallel corpus.
  • NBT and Sahitya Akademi are some of the official agencies that develop parallel texts in different languages through translation.
  • Such institutions have given permission to CIIL to use their works for creating electronic versions as parallel corpora.
  • Literary magazines and newspaper houses with multiple language editions will have to be approached for parallel corpora.
  • Computer programs have to be written for creating (i) aligned texts, (ii) aligned sentences and (iii) aligned chunks. (A simplified sentence-alignment sketch follows this list.)
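The alignment programs mentioned in the last item could start from something as simple as length-based pairing of sentences (in the spirit of the Gale-Church method); the sketch below implements only a greedy one-to-one version under that assumption and is not the proposal's actual tool.

    # Simplified length-based sentence alignment sketch (greedy 1-to-1 pairing).
    # Real parallel-corpus tools would also handle 1-2 / 2-1 merges and omissions.
    from typing import List, Tuple

    def align_sentences(src: List[str], tgt: List[str],
                        max_len_ratio: float = 1.8) -> List[Tuple[str, str]]:
        """Pair sentences in order; flag pairs whose character lengths diverge too much."""
        pairs = []
        for s, t in zip(src, tgt):
            ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
            if ratio <= max_len_ratio:          # lengths compatible: accept the pair
                pairs.append((s, t))
            else:                               # suspicious pair: leave it for manual review
                pairs.append((s, "<REVIEW> " + t))
        return pairs

    if __name__ == "__main__":
        english = ["The river is full.", "The farmers are happy."]
        hindi   = ["nadii bharii huii hai.", "kisaan khush hain."]   # placeholder transliterations
        for e, h in align_sentences(english, hindi):
            print(e, "|||", h)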

35
Corpora Tools
  • 1. Tools for Transfer Lexicon & Grammar (including creation of an interface for building the Transfer Lexicon & Grammar)
  • 2. Spellchecker and corrector tools
  • 3. Tools for POS tagging (a trainable tagging tool and an interface for editing POS tagged corpora)
  • 4. Tools for chunking (rule-based language-independent chunkers)
  • 5. Interface for chunking (building an interface for editing and validating the chunked corpora)
  • 6. Tools for syntactic tree bank, including an interface for developing the syntactic tree bank
  • 7. Tools for semantic tagging, with the Indian language WordNets as the basic resource: a browser with two windows, one showing the text and the other showing the senses (i.e., synsets) from the WordNet, from which a manual selection of the sense can be made. (A minimal console sketch of this idea follows this list.)
  • 8. (Semi-)automatic tagger based on statistical NLP (a preliminary version of which is ready at IITB)
  • 9. Tools for text alignment, including a text alignment tool, a sentence alignment tool and a chunk alignment tool, as well as an interface for aligning corpora
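As a purely illustrative sketch of the two-window sense-selection idea in item 7, the snippet below lists candidate synsets for a word and records a manual choice; it uses the English WordNet via NLTK as a stand-in for the Indian-language WordNets, so the library, the required data download and the example sentence are assumptions.

    # Console sketch of manual sense selection: show the sentence, list candidate
    # synsets, and record the annotator's choice.
    # Requires: pip install nltk, then nltk.download('wordnet') once.
    from nltk.corpus import wordnet as wn

    def choose_sense(sentence: str, target_word: str) -> str:
        print("Sentence:", sentence)
        synsets = wn.synsets(target_word)
        for i, syn in enumerate(synsets):
            print(f"{i}: {syn.name()} - {syn.definition()}")
        choice = int(input("Pick the sense number: "))    # the manual selection step
        return synsets[choice].name()

    if __name__ == "__main__":
        tag = choose_sense("The bank sanctioned the loan.", "bank")
        print("Recorded sense tag:", tag)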