Title: What is the expected category of contribution: CLIR
1 Title: Cross Lingual Information Retrieval for Indian Languages
Proposer: CDAC Noida
Name of the company: CDAC Noida
Language/Language pair: English, Hindi, Punjabi and Marathi
What is the expected category of contribution: CLIR
2 Strength of CDAC Noida
- Technical Capabilities
  - NLP Lab equipped with necessary software and 50 trained software professionals
- Manpower Involved
  - Technical
    - K K Arora
    - Vijay Kumar
    - Manish Kumar
    - Pragdeeshvaran
  - Linguistic Support
    - Prof K K Goswami
    - Prof Thakur Das
    - Mrs Ragini
    - Prof Jagannathan, Mr Chandra Mohan and others
3 List of previous collaborations with universities/R&D institutions
- IIT Kanpur
- CHD, Delhi
- CSIO, Chandigarh
- ISI Kolkata
- Kendriya Hindi Sansthan
- DRDO, Delhi
- Delhi Press Prakashan
- Pustak Mahal
- IISc Bangalore
- CSTT, New Delhi
- COCOSDA, Japan
- Sahitya Akademi
- IIT Roorkee
- ELDA, France
- MGAHV, Wardha
- Abbyy, Russia
- W3C
- Jamia-Milia
- GKV, Hardwar
- BITS Pilani
- GBPUAT, PantNagar
- Kumaon Univ, Nainital
- Banasthali Vidyapeeth
4 Previous work done in this or similar areas
- Machine Translation
- Parallel Corpus for Indian languages
- Dictionaries / Terminologies like Shabdika, Lexicon for MAT, IT Terminology
- Prototype development for CLIR
5 Need for CLIR in the Indian Context
- Content is available on the Internet in multiple Indian languages
- People in India are generally familiar with more than one language
- Users need to retrieve related information that may be available in any of their known languages by querying in just one of them
6 Block diagram for CLIR
7 Sample entry in Dictionary
8 Language Resources needed for CLIR
- Dictionary of words in English, Hindi, Punjabi and Marathi
- Root word dictionaries
- Database of phrases and collocations
- Stop word lists in Hindi, Marathi and Punjabi
- List of hyphenated words
- Proper names database
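As a rough illustration of how these resources might work together at query time, the sketch below drops stop words, reduces each remaining word to its root, and looks the root up in a bilingual dictionary. It is a minimal sketch under stated assumptions, not the project's implementation: the names `STOP_WORDS`, `ROOT_WORDS`, `EN_HI_DICTIONARY` and `translate_query` are hypothetical, and the toy entries stand in for the real resources listed above.

```python
# Toy stand-ins for the language resources; every entry is an
# illustrative assumption, not data from the project.
STOP_WORDS = {"the", "of", "in", "a", "an", "for"}
ROOT_WORDS = {"rivers": "river", "schools": "school"}
EN_HI_DICTIONARY = {
    "river": ["नदी"],
    "water": ["पानी", "जल"],
    "school": ["विद्यालय"],
}


def translate_query(query):
    """Return (translated_terms, untranslated_terms) for an English query."""
    translated, untranslated = [], []
    for token in query.lower().split():
        if token in STOP_WORDS:
            continue  # stop words carry no retrieval value
        root = ROOT_WORDS.get(token, token)  # fall back to the surface form
        targets = EN_HI_DICTIONARY.get(root)
        if targets:
            translated.extend(targets)  # keep every listed translation
        else:
            untranslated.append(root)  # e.g. proper names
    return translated, untranslated


if __name__ == "__main__":
    print(translate_query("water in the rivers of Punjab"))
```

Words missing from the dictionary, typically proper names, are returned separately so that a later step such as transliteration or a proper names lookup can handle them.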
9 Language Tools needed for CLIR
- Inflator/Stemmer routines for English, Hindi, Punjabi and Marathi
- Transliteration routine for language pairs of the targeted languages, based on a hybrid of rule-based and dictionary-based approaches
- N-gram Analyzer tool
- Spell variant generator
- Proper name extractor from Parallel corpus
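The hybrid transliteration idea and the n-gram analyzer can be sketched as follows. This is only an illustration under stated assumptions: the dictionary is consulted first and a small rule table is the fallback, the rule table covers just the Devanagari characters used in the example, and all names (`NAME_DICTIONARY`, `rule_based`, `transliterate`, `char_ngrams`) are hypothetical rather than taken from the project.

```python
# Hypothetical exception dictionary: names with an established spelling
# that character-level rules would not produce.
NAME_DICTIONARY = {"दिल्ली": "Delhi"}

# Tiny rule table; a real one would cover the full script.
CONSONANTS = {"क": "k", "म": "m", "ल": "l", "न": "n", "र": "r"}
VOWEL_SIGNS = {
    "\u093e": "a",  # sign AA
    "\u093f": "i",  # sign I
    "\u0940": "i",  # sign II
    "\u0941": "u",  # sign U
}
VIRAMA = "\u094d"  # suppresses the inherent vowel


def rule_based(word):
    """Map Devanagari to Latin one character at a time."""
    out = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else None
            # Add the inherent "a" unless a vowel sign or virama follows,
            # or the consonant is word-final (final schwa is usually silent).
            if nxt is not None and nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch != VIRAMA:
            out.append(ch)  # pass unknown characters through unchanged
    return "".join(out)


def transliterate(word):
    """Hybrid lookup: dictionary entry if present, otherwise the rules."""
    if word in NAME_DICTIONARY:
        return NAME_DICTIONARY[word]
    return rule_based(word)


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(transliterate("दिल्ली"), transliterate("कुमार"))
    print(char_ngrams(transliterate("कुमार")))
```

Checking the dictionary first keeps established spellings intact, while the rules cover unseen words. Character n-grams of the output could then be compared to match spelling variants, though that pairing is an assumption here.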