Center for Excellence in Computational Engineering and Networking (CEN)

1
  • Center for Excellence in Computational
    Engineering and Networking (CEN)

Amrita Vishwa Vidyapeetham Ettimadai, Coimbatore
641 105.
valluvan: English to Tamil Statistical
Machine Translation
A Proposal
2
Investigators
  • Prof. K.P. Soman (PI)
  • Prof. P. Venkat Rangan
  • Prof. Srinivas Narayanan
  • Dr. Harini Jayaraman
  • Mr. C.J. Srinivasan

3
  • Excerpts from Businessline, Jun 21, 2005.
  • The Minister for Communications and IT, Mr
    Dayanidhi Maran, said efforts were on to make the
    fonts available in all the 22 languages within
    six months. This would open the door for computer
    penetration in villages where 95 per cent of the
    people speak in their mother tongue, he pointed
    out.
  • "We have 16 million computers and net
    penetration of five million. There is much to
    achieve. First we should have local language
    content, which is at a minimal level at present,"
    Mr Maran said. Mr Maran said the idea was to
    create a translation browser in different
    languages so that news provided by different
    organizations such as CNN could be translated.
  • In a lighter vein, he referred to his plight
    during the speeches made by the Railway Minister,
    Mr Lalu Prasad. "When Laluji spoke in Hindi, I
    used to look for people who would translate his
    speeches for me.... In future, there could be a
    software which can be worn like a watch and
    plugged to the ear so that when Laluji speaks, I
    can understand," he said evoking laughter from
    the audience.

4
valluvan: our objectives
  • To build a working machine translation (MT)
    framework that provides a reasonable translation
    of English documents into Tamil.
  • To build and release an English-Tamil parallel
    corpus for MT research in English-Tamil.
  • To build and release balanced monolingual
    corpora for Tamil.
  • To deliver various NLP tools, such as:
  • Named Entity Recognizer
  • Morphological Analyzer
  • Language models
  • Sentence aligners for parallel corpora

5
Task 1: Resources
  • Resources to be developed include monolingual
    and bilingual corpora.
  • Parallel texts in English-Tamil have been
    identified. They need to be OCRed.
  • A team of linguists will supervise the annotation
    and any correction efforts needed for the
    development of other tools.
  • Annotation of any kind will be done by trained
    students.
  • Consultation with experienced linguists at
    other universities.

6
Task 2: NLP Tools
  • Various tools for Tamil, such as:
  • Named Entity Recognizer
  • Morphological Analyzer
  • Language Model
  • Sentence Aligner
  • Rule-based method for bootstrapping.
  • Support Vector Machines / large margin methods
    have been used for a variety of tasks in Natural
    Language Processing.
  • We have a multi-classifier implementation of SVM
    and we intend to use it to develop the tools.

7
Task 3: Translation System
  • Statistical Framework (Brown et al.)

$$P(t \mid e) = \frac{P(e \mid t)\,P(t)}{P(e)}$$
$$T = \arg\max_{t} P(e \mid t)\,P(t)$$
where e is the English source sentence and t is a
candidate Tamil translation.
  • $P(e \mid t)$: Translation Model
  • Built and trained using the bilingual
    (English-Tamil) corpus
  • A bigger corpus trains a better translation model
  • $P(t)$: Language Model
  • Built and trained using a monolingual Tamil
    corpus (EMILLE)
  • A bigger corpus gives a better model
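
The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate names and all probability values are invented, and a real system would score many more candidates using trained models.

```python
import math

# Hypothetical toy values, invented for illustration only.
# translation_model[t] stands in for P(e | t) for one fixed English sentence e.
translation_model = {
    "candidate_a": 0.20,
    "candidate_b": 0.25,
    "candidate_c": 0.05,
}
# language_model[t] stands in for P(t), the fluency of each Tamil candidate.
language_model = {
    "candidate_a": 0.30,
    "candidate_b": 0.10,
    "candidate_c": 0.60,
}

def noisy_channel_score(candidate):
    """Return log P(e | t) + log P(t); P(e) is the same for every t, so it is dropped."""
    return math.log(translation_model[candidate]) + math.log(language_model[candidate])

def best_translation(candidates):
    """Pick the candidate t that maximises P(e | t) * P(t)."""
    return max(candidates, key=noisy_channel_score)

best = best_translation(list(translation_model))
```

Note that candidate_b has the highest translation probability and candidate_c the highest fluency, yet the product favours candidate_a: neither model decides alone.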

8
Fluency Model
  • Language Model for target language.
  • Fluency Model measures the fluency of the
    utterance.
  • "the car hit me" is more English than "car the
    hit me".
  • Fluency model assigns more probability to the
    former using N-grams.
  • N-grams can be learned from large monolingual
    corpus.
  • More sophisticated grammar based models can be
    used in place of N-grams.
  • This model does not care about the translation
    part.
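
A minimal sketch of such a fluency model, assuming a bigram model with add-one smoothing trained on a three-sentence toy corpus (the corpus and the smoothing choice are assumptions made for illustration; toolkits such as those listed later in the deck use larger corpora and better smoothing):

```python
import math
from collections import Counter

# Toy monolingual corpus, invented for illustration.
corpus = [
    "the car hit me",
    "the car hit the wall",
    "the dog saw me",
]

def tokens(sentence):
    """Split a sentence and add start and end markers."""
    return ["<s>"] + sentence.split() + ["</s>"]

context_counts = Counter()  # how often each word is followed by something
bigram_counts = Counter()   # how often each adjacent word pair occurs
for sentence in corpus:
    toks = tokens(sentence)
    context_counts.update(toks[:-1])
    bigram_counts.update(zip(toks, toks[1:]))

vocab = {word for sentence in corpus for word in tokens(sentence)}

def log_prob(sentence):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    toks = tokens(sentence)
    total = 0.0
    for prev, word in zip(toks, toks[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total

fluent = log_prob("the car hit me")
scrambled = log_prob("car the hit me")
```

Both sentences contain the same words, so the difference in score comes only from word order, which is what the fluency model is meant to capture.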

9
Faithfulness Model
  • Translation Model.
  • How faithful is the translation?
  • Measures the degree to which words in the target
    sentence are plausible translations of the words
    in the source sentence.
  • We need to know the probability of mapping each
    source word to one or more target words.
  • These probabilities can be learned from parallel
    texts.

10
Faithfulness Model
  • Sentence Alignment
  • Figuring out which source language sentence maps
    to which target language sentence.
  • Word alignment
  • Figuring out which word in source language
    sentence maps to which word in the target
    language sentence.
  • When learning, all word alignments are initially
    assumed to be equally likely; as the training
    proceeds, these assumptions are revised using
    the Expectation Maximization (EM) algorithm.
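
The EM procedure described above can be sketched with an IBM Model 1 style word translation table. This is a simplified illustration, not the proposal's implementation: the three German-English sentence pairs are a toy corpus, the NULL word is omitted, and the iteration count is an arbitrary assumption.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (source words, target words), for illustration only.
pairs = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]

source_vocab = {f for src, _ in pairs for f in src}
target_vocab = {e for _, tgt in pairs for e in tgt}

# Initially every word translation is assumed equally likely.
t = {(e, f): 1.0 / len(target_vocab) for e in target_vocab for f in source_vocab}

for _ in range(20):
    count = defaultdict(float)  # expected count of e aligned to f
    total = defaultdict(float)  # expected count of f aligned to anything
    # E-step: distribute each target word over the source words of its sentence.
    for src, tgt in pairs:
        for e in tgt:
            norm = sum(t[(e, f)] for f in src)
            for f in src:
                fraction = t[(e, f)] / norm
                count[(e, f)] += fraction
                total[f] += fraction
    # M-step: re-estimate t(e | f) from the expected counts.
    for (e, f) in t:
        t[(e, f)] = count[(e, f)] / total[f]

def best_target(f):
    """Most probable target word for source word f under the learned table."""
    return max(target_vocab, key=lambda e: t[(e, f)])
```

After the first iteration "Haus" is still split evenly between "the" and "house"; the evidence from the other sentence pairs, where "the" co-occurs with "das" more often, is what lets later iterations resolve the tie.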

11
Decoding
  • Finding the target language sentence t that
    maximises $P(t)\,P(s \mid t)$ for an unseen source
    sentence s.
  • Input: Language Model, Translation Model and test
    set.
  • Stack-based decoding, greedy decoding and beam
    search decoding algorithms are available.
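
As a rough sketch of beam search decoding, the example below translates word by word, left to right, keeping only the best few partial hypotheses at each step. It is heavily simplified compared with real decoders: there is no reordering and no phrase table, and the translation options, bigram probabilities and the `UNSEEN` fallback value are all invented for illustration.

```python
import math

# Hypothetical toy tables, invented for illustration.
# translation_options[s] lists (target word, P(s | target word)) pairs.
translation_options = {
    "das": [("the", 0.6), ("that", 0.4)],
    "Haus": [("house", 0.7), ("home", 0.3)],
}
# Bigram language model probabilities P(word | previous word).
bigram_prob = {
    ("<s>", "the"): 0.5, ("<s>", "that"): 0.2,
    ("the", "house"): 0.4, ("the", "home"): 0.1,
    ("that", "house"): 0.2, ("that", "home"): 0.1,
}
UNSEEN = 1e-4  # assumed fallback probability for bigrams missing from the table

def beam_search(source_words, beam_size=2):
    """Monotone beam search maximising log P(t) + log P(s | t)."""
    beam = [(0.0, ["<s>"])]  # (log score, target words so far)
    for s in source_words:
        expanded = []
        for score, words in beam:
            for target, tm_prob in translation_options[s]:
                lm_prob = bigram_prob.get((words[-1], target), UNSEEN)
                new_score = score + math.log(tm_prob) + math.log(lm_prob)
                expanded.append((new_score, words + [target]))
        # Prune: keep only the beam_size best partial hypotheses.
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    best_score, best_words = beam[0]
    return best_words[1:], best_score

translation, score = beam_search(["das", "Haus"])
```

A beam size of 1 turns this into greedy decoding; a larger beam trades speed for a lower risk of pruning the hypothesis that would have scored best overall.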

12
German-English MT
  • Demo version.
  • Created using the de-news parallel corpus.
  • Used existing tools for language modeling (lm),
    translation modeling (tm) and decoding.
  • Results were comparable to Google's MT output for
    the same German passage.
  • Proposal: English-Tamil MT along the same lines!

13
Downloadable Tools for SMT
  • Language Modeling
  • SRI LM Toolkit
  • CMU-Cambridge LM Toolkit
  • Translation Modeling
  • Giza
  • Decoder
  • ISI ReWrite Decoder
  • Pharaoh (Phrase based decoder)

14
English-Tamil MT
  • Most tools are available.
  • Improve language modeling
  • Tamil is distinctly different from European
    languages, e.g. in word order, clause
    constructions and rich morphology.
  • The language model should be modified, preferably
    to work on sequences of morphemes instead of
    words. This will increase the frequency of stems.
  • Improve translation modeling
  • Sometimes whole English clauses can be
    translated into just one word in Tamil. E.g. "one
    who came" can be translated as "vandhavar".

15
English-Tamil corpus
  • A parallel corpus is the most important resource
    for SMT systems.
  • Manual translation for corpus development is
    costly.
  • Using existing printed text available in both
    English and Tamil is a more viable and cheaper
    option.
  • OCR can be used to scan and make electronic
    versions of English and Tamil texts.
  • SVMs are being used in OCR systems for
    classifying the characters recognized.
  • SVM-based OCR systems have been reported to be
    successful for Kannada (Ashwin et al.) and
    Tamil (Seethalakshmi et al.).
  • Although English OCR systems are readily
    available in the market, Tamil OCR systems are
    not. We propose to build a new Tamil OCR system
    using SVMs if one is not available to us.

16
English-Tamil MT
  • Morphological Analyzer
  • Segment the morphemes of the word.
  • Identify the stem (Stemmer).
  • PoS Tag the sequence.
  • Generate words when presented with the morphemes.
  • Use Tamil language rules and automatic discovery
    of morphemes (Goldsmith) from a large Tamil corpus.
  • PoS tagging can be done using SVMs (e.g. SVMTool
    by Giménez et al.). An accuracy of 98.86 has been
    demonstrated on a Spanish corpus by SVMTool.
  • No open-source tool is available.

17
English-Tamil MT
  • Sentence Aligner
  • Given a large parallel corpus of English-Tamil
    data, this tool should identify aligned
    English-Tamil sentence pairs.
  • Gale and Church (character length) and Brown
    (word length) won't help for the English-Tamil
    pair.
  • Bootstrapping: rule and lexicon based.
  • Improvements, if necessary: SVM based (Ceausu et
    al.)

18
English-Tamil MT
  • Named Entity Recognizer
  • Identifies person names, institutions and
    locations in source text.
  • Named entities should be appropriately translated
    / transliterated.
  • Bootstrap: Build a rule-based NER that uses
    contextual cues to identify the named entities
    from morphologically analyzed text.
  • Improvement: Improve the rule-based NER using
    SVMs.
  • SVMs for NER in Japanese have shown an F-score of
    90.3 (Isozaki et al.).

19
NLP @ CEN
  • Workshop on Machine Translation in Indian
    Languages.
  • 03-Jan-2006.
  • "Towards Semantically Oriented Statistical
    Machine Translation" by Prof. Srinivas Narayanan,
    UC-Berkeley.
  • "Activities of TDIL" by Dr. B.K. Murthy, MCIT,
    GoI.
  • "Mantra-Raajbhasha MT system" by Dr. Darbari
    and Dr. Pandey, CDAC-Pune.
  • Using Sanskrit Shaastras in NLP by Pandit
    Shrinivasa Varkhedi.

20
NLP @ CEN
  • Current Projects
  • Plagiarism Detection in Patents
  • Patent Classification using SVM
  • Funded by IPR Cell, MCIT, GoI.
  • Morpheme segmentation and stemming for Tamil.
  • Sentence Alignment for the English-Tamil language
    pair.

21
NLP @ CEN
  • Strong team in Kernel Methods, SVMs, Data and Text
    mining
  • Computational Linguistics - electives for
    Undergrad (proposed)
  • M.Tech Informatics (proposed)
  • - Computational Linguistics and Bioinformatics
  • Undergrad Textbooks
  • - Insight into Data Mining by Dr. K.P. Soman,
    Shyam Diwakar, V. Ajay (PHI Publication)

22
References
  • Peter Brown, Stephen Della Pietra, Vincent Della
    Pietra, and Robert Mercer. 1993. The mathematics
    of statistical machine translation: Parameter
    estimation. Computational Linguistics,
    19(2):263-311, June.
  • Philip Clarkson and Ronald Rosenfeld. 1997.
    Statistical language modeling using the
    CMU-Cambridge toolkit. In ESCA Eurospeech
    Proceedings.
  • W.A. Gale and K.W. Church. 1991. A program for
    aligning sentences in bilingual corpora.
    Proceedings of the 29th Annual Meeting of the ACL,
    Berkeley, California.
  • John Goldsmith. 2001. The unsupervised learning
    of natural language morphology. Computational
    Linguistics, 27(2).
  • Alexandru Ceausu, Dan Stefanescu, and Dan Tufis.
    Acquis Communautaire sentence alignment using
    Support Vector Machines.
  • Jesús Giménez and Lluís Màrquez. SVMTool: A
    general POS tagger generator based on Support
    Vector Machines. Proceedings of LREC'04.
  • T.V. Ashwin and P.S. Sastry. 2002. A font and
    size-independent OCR system for printed Kannada
    documents using support vector machines.
    Sadhana, Vol. 27, Part 1, February.
  • N. Cristianini and J. Shawe-Taylor. 2000. Support
    Vector Machines. Cambridge University Press.
  • C.J.C. Burges. 1998. A tutorial on support vector
    machines for pattern recognition. Data Mining and
    Knowledge Discovery, 2(2):955-974.

23
Thank you.