Center for Excellence in Computational Engineering and Networking (CEN)

1

Center for Excellence in Computational
Engineering and Networking (CEN)

Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore
641 105.
valluvan: English to Tamil Statistical
Machine Translation
A Proposal
2
Investigators

Prof. K.P. Soman (PI)
Prof. P. Venkat Rangan
Prof. Srinivas Narayanan
Dr. Harini Jayaraman
Mr. C.J. Srinivasan

3
Excerpts from Businessline, Jun 21, 2005.
The Minister for Communications and IT, Mr
Dayanidhi Maran, said efforts were on to make the
fonts available in all the 22 languages within
six months. This would open the door for computer
penetration in villages where 95 per cent of the
people speak in their mother tongue, he pointed
out.
"We have 16 million computers and net
penetration of five million. There is much to
achieve. First we should have local language
content, which is at a minimal level at present,"
Mr Maran said. Mr Maran said the idea was to
create a translation browser in different
languages so that news provided by different
organizations such as CNN could be translated.
In a lighter vein, he referred to his plight
during the speeches made by the Railway Minister,
Mr Lalu Prasad. "When Laluji spoke in Hindi, I
used to look for people who would translate his
speeches for me.... In future, there could be a
software which can be worn like a watch and
plugged to the ear so that when Laluji speaks, I
can understand," he said evoking laughter from
the audience.

4
valluvan: our objectives

To build a working machine translation (MT)
framework that provides a reasonable translation of
English documents into Tamil.
To build and release an English-Tamil parallel
corpus for MT research.
To build and release a balanced monolingual
corpus for Tamil.
To deliver NLP tools such as
Named Entity Recognizer
Morphological Analyzer
Language models
Sentence aligners for parallel corpora

5
Task 1: Resources

Resources to be developed include monolingual and
bilingual corpora.
Parallel texts in English and Tamil have been
identified. They need to be OCRed.
A team of linguists will supervise the annotation
and any correction efforts needed for the development
of other tools.
Annotation of any kind will be done by trained
students.
Experienced linguists at other universities will be
consulted.

6
Task 2: NLP Tools

Various tools for Tamil, such as
Named Entity Recognizer
Morphological Analyzer
Language Model
Sentence Aligner
Rule-based methods for bootstrapping.
Support Vector Machines / large-margin methods
have been used for a variety of tasks in Natural
Language Processing.
We have a multi-classifier implementation of SVM
and we intend to use it to develop the tools.

7
Task 3: Translation System

Statistical Framework (Brown et al.)

$P(t \mid e) = P(e \mid t)\,P(t) / P(e)$
$T = \arg\max_t P(e \mid t)\,P(t)$
Here e is the English source sentence, t a
candidate Tamil sentence and T the chosen translation.

$P(e \mid t)$: Translation Model
Built and trained using the
bilingual corpus
(English-Tamil corpus)
A bigger corpus trains a better
translation model

$P(t)$: Language Model
Built and trained using a
monolingual corpus
(Tamil corpus, EMILLE)
A bigger corpus gives a better model
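
The step from the posterior to the argmax on this slide can be spelled out as follows. This is a short derivation sketch, assuming e denotes the English source sentence, t a candidate Tamil sentence and T the selected translation; the slide itself does not define the symbols.

```latex
\begin{align*}
T &= \arg\max_t P(t \mid e) \\
  &= \arg\max_t \frac{P(e \mid t)\,P(t)}{P(e)} \\
  &= \arg\max_t P(e \mid t)\,P(t)
\end{align*}
```

The denominator P(e) is the same for every candidate t, so dropping it does not change which t wins.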

8
Fluency Model

Language model for the target language.
The fluency model measures the fluency of an
utterance.
"the car hit me" is more English than "car the
hit me".
The fluency model assigns a higher probability to
the former using N-grams.
N-grams can be learned from a large monolingual
corpus.
More sophisticated grammar-based models can be
used in place of N-grams.
This model does not consider the translation
itself.
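
A minimal sketch of an N-gram fluency model, assuming bigrams with add-one smoothing and a three-sentence toy corpus (the corpus, function names and smoothing choice are illustrative assumptions, not part of the proposal):

```python
from collections import Counter
import math

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return total

# Toy monolingual corpus (an assumption for illustration only).
corpus = ["the car hit me", "the car hit the wall", "the dog bit me"]
uni, bi = train_bigram(corpus)
fluent = log_prob("the car hit me", uni, bi)
scrambled = log_prob("car the hit me", uni, bi)
print(fluent > scrambled)  # the well-ordered sentence scores higher
```

A real system would likely rely on a toolkit with better smoothing; this only shows why word order changes the score.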

9
Faithfulness Model

Translation model.
How faithful is the translation?
Measures the degree to which words in the target
sentence are plausible translations of the words
in the source sentence.
We need to know the probability of mapping each
source word to one or more target words.
These probabilities can be learned from parallel
texts.

10
Faithfulness Model

Sentence alignment
Figuring out which source-language sentence maps
to which target-language sentence.
Word alignment
Figuring out which word in the source-language
sentence maps to which word in the target-language
sentence.
During training, all word alignments are initially
assumed to be equally likely; as training
proceeds, these estimates are revised using the
Expectation Maximization (EM) algorithm.
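
The EM procedure described above can be sketched in the style of IBM Model 1 (Brown et al.). This is a simplified illustration without a NULL word; the romanised Tamil toy pairs and the function name are assumptions made for the example.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM from sentence pairs."""
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Uniform start: every word alignment is equally likely.
    t = {(tw, sw): 1.0 / len(tgt_vocab) for tw in tgt_vocab for sw in src_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: re-estimate the translation probabilities.
        for (tw, sw) in t:
            t[(tw, sw)] = count[(tw, sw)] / total[sw]
    return t

# Toy English / romanised Tamil pairs (illustrative only).
pairs = [
    ("big house".split(), "periya veedu".split()),
    ("big book".split(), "periya puththagam".split()),
    ("small book".split(), "siriya puththagam".split()),
]
table = ibm_model1(pairs)
best_for_big = max(["periya", "veedu", "puththagam", "siriya"],
                   key=lambda tw: table[(tw, "big")])
print(best_for_big)
```

Because "big" co-occurs with "periya" in two pairs, the probability mass shifts towards that pairing as the iterations proceed.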

11
Decoding

Finding the target-language sentence t that
maximises $P(t)\,P(s \mid t)$ for an unseen
source sentence s.
Input: language model, translation model and test
set.
Stack-based decoding, greedy decoding and beam
search decoding algorithms are available.
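
A toy beam search decoder, to make the search concrete. It is a sketch under strong assumptions: translation is word by word with no reordering (which a real English-Tamil decoder would need), and the probability tables below are invented for illustration.

```python
import math

def beam_decode(source, trans, lm, beam_size=2):
    """Monotone beam search maximising log P(t) + log P(s|t)."""
    beam = [(0.0, ["<s>"])]  # (log score, partial target sentence)
    for s_word in source:
        expanded = []
        for score, hyp in beam:
            for t_word, p_trans in trans[s_word].items():
                # Small floor probability for bigrams missing from the toy LM.
                p_lm = lm.get((hyp[-1], t_word), 1e-4)
                new_score = score + math.log(p_trans) + math.log(p_lm)
                expanded.append((new_score, hyp + [t_word]))
        # Keep only the best beam_size partial hypotheses.
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    best_score, best_hyp = beam[0]
    return best_hyp[1:], best_score

# Hypothetical translation model P(s|t) and bigram language model P(t).
trans = {
    "das": {"the": 0.7, "that": 0.3},
    "Haus": {"house": 0.8, "home": 0.2},
}
lm = {
    ("<s>", "the"): 0.5, ("<s>", "that"): 0.1,
    ("the", "house"): 0.4, ("the", "home"): 0.05,
    ("that", "house"): 0.1,
}
best, score = beam_decode(["das", "Haus"], trans, lm)
print(best)
```

Setting beam_size to 1 turns this into greedy decoding; a larger beam keeps more partial hypotheses alive at each step.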

12
German-English MT

Demo version.
Created using the de-news parallel corpus.
Used existing tools for language modeling (lm),
translation modeling (tm) and decoding.
Results were comparable to Google's MT output for
the same German passage.
Proposal: English-Tamil MT along the same lines!

13
Downloadable Tools for SMT

Language Modeling
SRI LM Toolkit
CMU-Cambridge LM Toolkit
Translation Modeling
Giza
Decoder
ISI ReWrite Decoder
Pharaoh (Phrase based decoder)

14
English-Tamil MT

Most tools are available.
Improve language modeling
Tamil is distinctly different from European
languages, e.g. in word order, clause construction
and rich morphology.
The language model should be modified, preferably to
work on sequences of morphemes instead of words.
This will increase the frequency of stems.
Improve translation modeling
Sometimes a whole English clause can be
translated into just one word in Tamil, e.g. "one
who came" can be translated as "vandhavar".
15
English-Tamil corpus

A parallel corpus is the most important resource
for SMT.
Manual translation for corpus development is
costly.
Using existing printed text available in English
and Tamil is a more viable and cheaper option.
OCR can be employed to scan and make electronic
versions of English and Tamil texts.
SVMs are used in OCR systems to classify the
recognized characters.
SVM-based OCR has been reported to be
successful for Kannada (Ashwin et al.) and
Tamil (Seethalakshmi et al.).
Although English OCR systems are readily available on
the market, Tamil OCR systems are not. We propose to
build a new Tamil OCR system using SVMs if one is not
available to us.

16
English-Tamil MT

Morphological Analyzer
Segment the word into morphemes.
Identify the stem (stemmer).
PoS-tag the sequence.
Generate words from given morphemes.
Use Tamil language rules and automatic discovery
of morphemes (Goldsmith) from a large Tamil corpus.
PoS tagging can be done using SVMs (e.g. SVMTool
by Giménez et al.). An accuracy of 98.86 has been
demonstrated on a Spanish corpus by SVMTool.
No open-source tool is available.

17
English-Tamil MT

Sentence Aligner
Given a large parallel corpus of English-Tamil
data, this tool should identify aligned
English-Tamil sentence pairs.
Gale and Church (character length) and Brown (word
length) won't help for the English-Tamil pair.
Bootstrapping: rule- and lexicon-based.
Improvements, if necessary: SVM-based (Ceausu et
al.)
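
One way a lexicon-based bootstrap aligner could look, as a rough sketch. The scoring rule, the threshold and the tiny romanised lexicon are all assumptions for illustration; the proposal does not specify them.

```python
def lexicon_score(en_sentence, ta_sentence, lexicon):
    """Fraction of English words whose lexicon entry occurs in the Tamil sentence."""
    en_words = en_sentence.lower().split()
    ta_words = set(ta_sentence.lower().split())
    if not en_words:
        return 0.0
    hits = sum(1 for w in en_words if lexicon.get(w) in ta_words)
    return hits / len(en_words)

def align(en_sentences, ta_sentences, lexicon, threshold=0.3):
    """Pair each English sentence with its best-scoring Tamil sentence."""
    pairs = []
    for i, en in enumerate(en_sentences):
        scores = [lexicon_score(en, ta, lexicon) for ta in ta_sentences]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j))
    return pairs

# Hypothetical bilingual lexicon and sentences (romanised Tamil).
lexicon = {"big": "periya", "house": "veedu",
           "book": "puththagam", "small": "siriya"}
en = ["big house", "small book"]
ta = ["siriya puththagam", "periya veedu"]
aligned = align(en, ta, lexicon)
print(aligned)
```

Unlike length-based methods, this score depends only on lexical overlap, so it does not assume that sentence lengths in the two languages are correlated.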

18
English-Tamil MT

Named Entity Recognizer
Identifies person names, institutions and
locations in the source text.
Named entities should be appropriately translated
or transliterated.
Bootstrap: build a rule-based NER that uses
contextual cues to identify the named entities
in morphologically analyzed text.
Improvement: improve the rule-based NER using
SVMs.
SVMs for NER in Japanese have shown an F-score of
90.3 (Isozaki et al.).
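
A rule-based bootstrap of this kind might start from simple contextual cues. The sketch below handles only person names that follow a title, on raw English text rather than morphologically analyzed text; the cue list and name pattern are hypothetical.

```python
import re

# Hypothetical cue list; a real system would use far richer rules.
PERSON_CUES = r"(?:Mr|Mrs|Dr|Prof)\.?"
# A name is taken to be one or more capitalised words (a simplification).
NAME = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"

def find_persons(text):
    """Return capitalised word sequences that directly follow a title cue."""
    pattern = re.compile(rf"\b{PERSON_CUES}\s+({NAME})")
    return [m.group(1) for m in pattern.finditer(text)]

sample = "Mr Dayanidhi Maran met Prof. Srinivas Narayanan in Coimbatore."
persons = find_persons(sample)
print(persons)
```

Names with initials or without a preceding title are missed by this pattern, which is the kind of gap a trained classifier would be expected to close.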

19
NLP @ CEN

Workshop on Machine Translation in Indian
Languages.
03-Jan-2006.
"Towards Semantically Oriented Statistical
Machine Translation" by Prof. Srinivas Narayanan,
UC Berkeley.
"Activities of TDIL" by Dr. B.K. Murthy, MCIT,
GoI.
"Mantra-Raajbhasha MT system" by Dr. Darbari and
Dr. Pandey, CDAC-Pune.
"Using Sanskrit Shaastras in NLP" by Pandit
Shrinivasa Varkhedi.

20
NLP @ CEN

Current Projects
Plagiarism Detection in Patents
Patent Classification using SVM
Funded by IPR Cell, MCIT, GoI.
Morpheme segmentation and stemming for Tamil.
Sentence alignment for the English-Tamil language
pair.

21
NLP @ CEN

Strong team in kernel methods, SVMs, data and text
mining
Computational Linguistics - electives for
Undergrad (proposed)
M.Tech Informatics (proposed)
- Computational Linguistics and Bioinformatics
Undergrad Textbooks
- Insight into Data Mining by Dr. K.P. Soman,
Shyam Diwakar, V. Ajay (PHI Publication)

22
References

Peter Brown, Stephen Della Pietra, Vincent Della
Pietra, and Robert Mercer. 1993. The mathematics
of machine translation: Parameter estimation.
Computational Linguistics, 19(2):263-311, June.
Philip Clarkson and Ronald Rosenfeld. 1997.
Statistical language modeling using the
CMU-Cambridge toolkit. In ESCA Eurospeech
Proceedings.
W.A. Gale and K.W. Church. 1991. A program for aligning
sentences in bilingual corpora. Proceedings of
the 29th Annual Meeting of the ACL, Berkeley,
California.
John Goldsmith. 2001. The unsupervised learning
of natural language morphology. Computational
Linguistics, 27(2).
Alexandru Ceausu, Dan Stefanescu, and Dan Tufis.
Acquis Communautaire sentence alignment using
Support Vector Machines.
Jesús Giménez and Lluís Márquez. SVMTool: A
general POS tagger generator based on Support
Vector Machines. Proceedings of LREC'04.
T.V. Ashwin and P.S. Sastry. 2002. A font and
size-independent OCR system for printed Kannada
documents using support vector machines.
Sadhana, Vol. 27, Part 1, February.
N. Cristianini and J. Shawe-Taylor. 2000. Support
Vector Machines. Cambridge University Press.
C.J.C. Burges. 1998. A tutorial on support vector
machines for pattern recognition. Data Mining and
Knowledge Discovery, 2(2):955-974.