Statistical Machine Translation English to Hindi, by Sumit Goswami, Nirav Shah, Devshri Roy, Sudeshna Sarkar

1
Statistical Machine Translation English to Hindi
Sumit Goswami
Nirav Shah
Devshri Roy
Sudeshna Sarkar
2
Introduction
  • Machine translation (MT) is the automatic
    translation from one natural language into
    another language using computers.
  • Statistical machine translation (SMT) is an
    approach to MT that is characterized by the use
    of machine learning methods.

3
Objective
  • The objective of our work is to explore different
    ways of combining statistical techniques with
    linguistic inputs to improve a baseline statistical
    machine translation system from English to Hindi.

4
Problem with the limited training data
  • In general, in statistical machine translation,
    the more data is provided for training, the
    higher the quality of translation.
  • The size of our training data is limited to 50k
    sentence pairs.
  • As a result, the bilingual data available in the
    resulting phrase translation table is limited.

5
Proposed Methodology
  • We propose a method of increasing the bilingual
    data in the phrase translation table by appending
    entries from a freely available dictionary.

6
Problem with the word translation
  • The main problem with word-by-word translation is
    that words with equivalent meanings may not
    appear in the same order in both sentences.
  • For example:
  • In English: I read a book.
  • In Hindi: Maine Pustak Padhi.
  • In English, the typical sentence order
    is subject-verb-object (SVO). In Hindi, it is
    subject-object-verb (SOV).
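
The reordering problem can be made concrete with a small sketch. The word alignment below is written by hand for the slide's example sentence pair (an assumption for illustration, not the output of an aligner), and `count_crossings` is a hypothetical helper that counts alignment links that cross, which is what a strictly word-by-word, left-to-right model cannot express.

```python
# Hand-written alignment for the example sentence pair (an assumption for
# illustration): each pair is (English position, Hindi position).
english = ["I", "read", "a", "book"]
hindi = ["Maine", "Pustak", "Padhi"]
alignment = [(0, 0), (1, 2), (3, 1)]  # "a" has no Hindi counterpart here


def count_crossings(links):
    """Count pairs of alignment links that cross each other."""
    crossings = 0
    for i, (e1, h1) in enumerate(links):
        for e2, h2 in links[i + 1:]:
            # Two links cross when their order differs on the two sides.
            if (e1 - e2) * (h1 - h2) < 0:
                crossings += 1
    return crossings


for e, h in alignment:
    print(f"{english[e]} -> {hindi[h]}")
print("crossing links:", count_crossings(alignment))
```

The verb and object links cross because the verb moves to the end in Hindi; a monotone alignment would give zero crossings.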

7
Methodologies
  • We have used phrase-based modeling
    instead of word-based modeling.
  • However, phrase-based models are limited to the
    mapping of small text chunks (phrases) without
    any explicit use of linguistic information, be
    it morphological, syntactic, or semantic. Such
    additional information is valuable and can be
    integrated in pre-processing or
    post-processing.
  • We extended the phrase-based statistical machine
    translation models using a factored
    representation.

8
Resources
  • Statistical machine translation: Moses
  • Hindi dictionary: Sabdakosh
  • PoS Tagger
  • Hindi Morphological Analyser
  • Training and testing data: TIDES IIIT Dataset,
    EILMT Tourism Dataset

9
Addition of new source of knowledge
  • Downloaded Hindi dictionary Sabdhakosh
  • Preprocessed the parallel data
  • Obtained a maximum likelihood lexical translation
    table.
  • Generated a phrase translation table.

10
Aligned word table
  • We obtain the English-Hindi equivalents of almost
    all major and frequently used root words.
  • We ensure that each Hindi translation is placed
    next to its English word.
  • Example
  • want AvaSyakawA (0 0)
  • want cAhanA (0 0)
  • war yuxXa (0 0)
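
A minimal sketch of how such an aligned word table could be produced from dictionary entries. The `dictionary` mapping and the helper `aligned_word_table` are hypothetical; only the three word pairs and the output layout come from the slide.

```python
# Toy dictionary (entries taken from the slide's example; the dict layout
# itself is an assumption for illustration).
dictionary = {
    "want": ["AvaSyakawA", "cAhanA"],
    "war": ["yuxXa"],
}


def aligned_word_table(entries):
    """Return one 'english hindi (0 0)' line per translation pair."""
    lines = []
    for english in sorted(entries):
        for hindi in entries[english]:
            # Single-word entries align position 0 with position 0.
            lines.append(f"{english} {hindi} (0 0)")
    return lines


for line in aligned_word_table(dictionary):
    print(line)
```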

11
Lexical translation table
  • We assign the word translation probability w(e|h)
    as well as the inverse w(h|e) in the word
    translation table.
  • Example
  • English  Hindi       w(e|h)
  • want     AvaSyakawA  0.5
  • want     cAhanA      0.5
  • war      yuxXa       0.5
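
One common way to obtain a maximum likelihood lexical translation table is relative frequency estimation. The sketch below assumes each dictionary entry counts as one observed word pair; `lexical_tables` is a hypothetical helper, and with only three toy pairs its values are not expected to reproduce the figures on the slide.

```python
from collections import Counter

# Word pairs from the slide's example; treating each dictionary entry as one
# observed co-occurrence is an assumption made for this sketch.
pairs = [("want", "AvaSyakawA"), ("want", "cAhanA"), ("war", "yuxXa")]


def lexical_tables(word_pairs):
    """Relative-frequency estimates of w(h|e) and the inverse w(e|h)."""
    pair_count = Counter(word_pairs)
    e_count = Counter(e for e, _ in word_pairs)
    h_count = Counter(h for _, h in word_pairs)
    w_h_given_e = {(e, h): c / e_count[e] for (e, h), c in pair_count.items()}
    w_e_given_h = {(e, h): c / h_count[h] for (e, h), c in pair_count.items()}
    return w_h_given_e, w_e_given_h


w_he, w_eh = lexical_tables(pairs)
for (e, h), p in sorted(w_he.items()):
    print(f"{e} {h} w(h|e)={p:.2f} w(e|h)={w_eh[(e, h)]:.2f}")
```

An English word with two listed translations splits its probability mass evenly, which is where values such as 0.5 for "want" come from.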

12
Scoring of words in the translation table
  • Five different phrase translation scores are
    assigned and the phrase translation table is
    generated:
  • phrase translation probability φ(e|h)
  • lexical weighting lex(e|h)
  • phrase translation probability φ(h|e)
  • lexical weighting lex(h|e)
  • phrase penalty (always exp(1) = 2.718)
  • Example
  • want AvaSyakawA (0) (0) 0.5 0.5 1 0.5 2.718
  • want cAhanA (0) (0) 0.5 0.5 1 0.5 2.718
  • war yuxXa (0) (0) 0.5 0.5 0.5 0.5 2.718
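
As a sketch of how one such entry could be written out, the helper below formats four model scores followed by the constant phrase penalty. The function name `phrase_table_line` is hypothetical, and the `|||` field separator follows the usual Moses phrase-table layout, which is an assumption here because the slide shows the fields without separators; the alignment fields are omitted for brevity.

```python
import math

PHRASE_PENALTY = math.exp(1)  # constant fifth score, about 2.718


def phrase_table_line(source, target, scores):
    """Format one entry with four model scores plus the phrase penalty.

    The '|||' separator is an assumption based on the common Moses layout.
    """
    if len(scores) != 4:
        raise ValueError("expected four model scores")
    fields = [f"{s:g}" for s in scores] + [f"{PHRASE_PENALTY:.3f}"]
    return f"{source} ||| {target} ||| {' '.join(fields)}"


print(phrase_table_line("want", "AvaSyakawA", [0.5, 0.5, 1, 0.5]))
```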

13
Phrase Based Statistical Machine Translation
  • We used phrase-based modeling and
    generated the phrase table from the 50k
    parallel corpus provided by ICON, then appended
    the preprocessed English-Hindi dictionary to
    that phrase translation table.
  • Example output of the phrase translation table:
  • ! time came for samaya AyA waba waka parisZWiwiyAM () (0) (1) (3) (1) (2) () (3) () 0.5 8.11523e-08 0.5 5.50722e-12 2.718
  • ! time came for samaya AyA waba waka () (0) (1) (3) (1) (2) () (3) 0.5 8.11523e-08 0.5 5.14693e-07 2.718
  • ! time came samaya AyA waba () (0) (1) (1) (2) () 0.5 2.21728e-06 0.5 4.83502e-05 2.718
  • ! time came samaya AyA () (0) (1) (1) (2) 0.333333 2.21728e-06 0.5 0.0328891 2.718
  • time samaya (0) (0) 0.5 0.5 0.5 0.5 2.718
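
The appending step can be sketched as a merge of two lists of entries. Everything here is a simplifying assumption: the in-memory tuple layout, the helper name `append_dictionary`, and the policy of keeping the corpus-derived scores when a pair already exists (the slides do not say how duplicates were handled). The scores for the dictionary entries are illustrative values only.

```python
def append_dictionary(phrase_table, dictionary_entries):
    """Return the phrase table with unseen dictionary pairs appended.

    Both arguments are lists of (source, target, scores) tuples; this
    in-memory layout is an assumption made for the sketch.
    """
    seen = {(src, tgt) for src, tgt, _ in phrase_table}
    merged = list(phrase_table)
    for src, tgt, scores in dictionary_entries:
        if (src, tgt) not in seen:  # keep corpus-derived scores if present
            merged.append((src, tgt, scores))
            seen.add((src, tgt))
    return merged


corpus_table = [("time", "samaya", (0.5, 0.5, 0.5, 0.5, 2.718))]
# Illustrative dictionary-derived entries (scores are made up for the demo).
dictionary_table = [
    ("time", "samaya", (0.5, 0.5, 1.0, 0.5, 2.718)),
    ("war", "yuxXa", (0.5, 0.5, 0.5, 0.5, 2.718)),
]
merged = append_dictionary(corpus_table, dictionary_table)
for src, tgt, scores in merged:
    print(src, tgt, *scores)
```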

14
Factored Translation Model
  • Extended the phrase-based ST model using a factored
    representation
  • We annotate each word with a feature vector
  • The feature vector includes:
      • surface form
      • root
      • part-of-speech tag
      • morphological information
  • The annotation is used to construct ST models that
    can be combined to maximize translation
    quality

15
Sentence Analysis
  • The sentence analysis is broken up into the
    following steps:
  • Tokenize the given sentence.
  • Take the surface form of each word.
  • Generate the root word.
  • Generate the part-of-speech factor.
  • Obtain the other morphological information, such as
    gender and number.
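
The analysis steps above can be sketched as a small pipeline. The `LEXICON` table is a hypothetical stand-in for the PoS tagger and morphological analyser (its three entries borrow values from the factored sample in this deck), and joining factors with `|` follows the Moses factored-token convention.

```python
# Hypothetical toy lexicon: surface form -> (root, PoS tag, morphology).
# A real system would call a morphological analyser and a PoS tagger.
LEXICON = {
    "ye": ("yaha", "P", "any.p"),
    "loga": ("loga", "n", "m.p"),
    "hEM": ("hE", "v", "any.p"),
}


def analyse(sentence):
    """Tokenize on whitespace and attach root, PoS and morphology factors."""
    factored = []
    for surface in sentence.split():
        root, pos, morph = LEXICON.get(surface, (surface, "UNK", "null"))
        # Moses-style factored token: factors joined by '|'.
        factored.append("|".join([surface, root, pos, morph]))
    return factored


print(" ".join(analyse("ye loga hEM")))
```

Unknown words fall back to the surface form as root with placeholder tags, which is one simple choice among several.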

16
Sample of Factored Translation
  • <Sentence id="2">
  • 1 ye
  • <fs af='yaha,P,any,p,a,,0,'>
  • 2 loga
  • <fs af='loga,n,m,s,,0,,'>
  • <fs af='loga,n,m,p,,0,,'>
  • <fs af='loga,n,m,s,,1,,'>
  • kAPI
  • <fs af='kAPI,n,f,s,,0,,'>
  • <fs af='kAPI,n,f,p,,0,,'>
  • <fs af='kAPI,n,f,s,,1,,'>
  • <fs af='kAPI,n,f,p,,1,,'>
  • <fs af='kAPI,D,,,,,,'>
  • 4 pariSramI <fs af='pariSramI,n,f,s,,0,,'>
  • <fs af='pariSramI,n,f,p,,0,,'>
  • <fs af='pariSramI,n,f,s,,1,,'>
  • <fs af='pariSramI,n,f,p,,1,,'>
  • hEM <fs af='hE,v,any,p,u,,,hE'>
  • <fs af='hE,v,any,p,a,,,hE'>
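
A minimal sketch of reading one analysis from this sample, assuming the markup is of the form `<fs af='...'>` with eight comma-separated slots. The field names in `FIELDS` are guesses made for readability (the slides do not name the slots, and the last two names in particular are uncertain); `parse_af` is a hypothetical helper.

```python
import re

# Field names are an assumption based on the eight comma-separated slots
# seen in the sample; they are not stated on the slide.
FIELDS = ["root", "category", "gender", "number",
          "person", "case", "vibhakti", "tam"]

AF_PATTERN = re.compile(r"<fs\s*af='([^']*)'>")


def parse_af(markup):
    """Return a dict of features for the first af='...' value in markup."""
    match = AF_PATTERN.search(markup)
    if match is None:
        raise ValueError("no af attribute found")
    values = match.group(1).split(",")
    return dict(zip(FIELDS, values))


features = parse_af("<fs af='hE,v,any,p,u,,,hE'>")
print(features["root"], features["category"], features["number"])
```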

17
Sample of Hindi Tagged File
apaneapanAadjadj.m.s 3030 sAWiyoMsAWInn.m.p
kokopp.null.null sAWasAWaDD.null.null
lekaralevv.any.any saraxArasaraxArann.m.s
guraxIwasiMhasiMhann.m.s kalakawAkIkAsh_nsh_n.f.s
galiyoMgalInn.f.p meMmeMpp.null.null
vilInavilInann.m.s hohovv.any.any
gaejAvv.m.p .
unakevahash_Psh_P.m.a
anyaanyaadjadj.any.any kRewrakRewrann.m.s
hEhEvv.any.s raMgawaraMgawann.f.s
vavaAvyAvy.null.null gretagretaadjadj.any.any
nikobAra.
unakIvahash_Psh_P.f.a
muKyamuKyaadjadj.any.any gawiviXiyAMportableyar
ameMmeMpp.null.null keMxriwakeMxriwaadjadj.any.any
hEMhEvv.any.p jahAMjahAzDD.null.null
paraparapp.null.null unakIvahash_Psh_P.f.a
anekaanekann.m.s sAmAjikasAmAjikaadjadj.any.any
waWAwaWADD.null.null sAMskqwikasAMskqwikaadjadj.any.any
saMsWAeMsaMsWAnn.f.p
hEMhEvv.any.p .
18
Sample of English Tagged File
city_NN palace_NN is_VBZ a_DT magnificent_JJ
structure_NN ,_, the_DT palace_NN occupies_VBZ
one_CD seventh_JJ of_IN the_DT walled_JJ city_NN
of_IN jaipur_NN and_CC is_VBZ a_DT wonderful_JJ
blend_VB of_IN rajput_NN and_CC mughal_JJ
architecture_NN ._.
the_DT jeep_FW safari_FW
not_RB only_RB refreshes_VBZ and_CC
revitalizes_VBZ but_CC one_PRP feels_VBZ close_RB
to_TO nature_NN while_IN diving_NN through_IN
the_DT quiet_JJ and_CC beautiful_JJ
countryside_NN ._.
boparais_NN organization_NN
is_VBZ running_VBG a_DT camping_NN site_NN at_IN
barog_NN in_IN district_NN solan_NN ._.
this_DT
was_VBD to_TO prevent_VB tobacco_NN smuggling_NN
from_IN coimbatore_NN ._.
shimla_NN is_VBZ
surrounded_VBN by_IN pine_VB ,_, cedar_NN ,_,
oak_NN and_CC rhododendron_NN forests_NNS
._.
the_DT monumental_JJ labor_NN of_IN love_NN
of_IN a_DT great_JJ ruler_NN for_IN his_PRP
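
Tagged text in this `word_TAG` layout is easy to read back into pairs. The sketch below uses a hypothetical helper `parse_tagged` on the opening tokens of the sample; it assumes every token contains at least one underscore.

```python
def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rsplit keeps any underscore that is part of the word itself.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


sample = "city_NN palace_NN is_VBZ a_DT magnificent_JJ structure_NN"
tagged = parse_tagged(sample)
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```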
19
Evaluation on TIDES IIIT Test Dataset
NIST score = 4.3187  BLEU score = 0.0976 for system "iit kgp"

Individual N-gram scoring
       1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
       ------  ------  ------  ------  ------  ------  ------  ------  ------
NIST   3.7437  0.5263  0.0433  0.0048  0.0006  0.0001  0.0000  0.0000  0.0000  "iit kgp"
BLEU   0.4408  0.1444  0.0575  0.0248  0.0119  0.0062  0.0034  0.0020  0.0013  "iit kgp"

Cumulative N-gram scoring
       1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
       ------  ------  ------  ------  ------  ------  ------  ------  ------
NIST   3.7437  4.2700  4.3133  4.3181  4.3187  4.3188  4.3188  4.3188  4.3188  "iit kgp"
BLEU   0.4408  0.2523  0.1541  0.0976  0.0641  0.0435  0.0302  0.0215  0.0157  "iit kgp"

MT evaluation scorer ended on 2008 Dec 1 at 11:38:48
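
The cumulative rows can be checked against the individual rows. BLEU combines n-gram precisions as a geometric mean multiplied by a brevity penalty; the sketch assumes a penalty of 1.0 for this test set because that reproduces the reported figures. For NIST, the cumulative values on this slide are consistent with a running sum of the individual values. The function names are hypothetical.

```python
import math

# Individual n-gram scores (1-gram to 4-gram) copied from the TIDES IIIT
# results above.
bleu_individual = [0.4408, 0.1444, 0.0575, 0.0248]
nist_individual = [3.7437, 0.5263, 0.0433, 0.0048]


def cumulative_bleu(precisions, brevity_penalty=1.0):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)


def cumulative_nist(gains):
    """Running total of the per-order NIST contributions."""
    return sum(gains)


for n in range(1, 5):
    print(n,
          round(cumulative_bleu(bleu_individual[:n]), 4),
          round(cumulative_nist(nist_individual[:n]), 4))
```

Up to rounding of the inputs, the 4-gram line matches the headline scores of 0.0976 (BLEU) and 4.3181 (cumulative 4-gram NIST).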
20
Evaluation on EILMT Tourism Test Dataset
MT evaluation scorer began on 2008 Dec 11 at 00:28:11

NIST score = 5.1704  BLEU score = 0.1873 for system "iit kgp"

Individual N-gram scoring
       1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
       ------  ------  ------  ------  ------  ------  ------  ------  ------
NIST   4.3795  0.7168  0.0645  0.0080  0.0016  0.0002  0.0001  0.0001  0.0000  "iit kgp"
BLEU   0.5282  0.2334  0.1269  0.0837  0.0607  0.0469  0.0364  0.0290  0.0246  "iit kgp"

Cumulative N-gram scoring
       1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
       ------  ------  ------  ------  ------  ------  ------  ------  ------
NIST   4.3795  5.0963  5.1608  5.1688  5.1704  5.1706  5.1706  5.1707  5.1707  "iit kgp"
BLEU   0.5200  0.3457  0.2462  0.1873  0.1490  0.1226  0.1028  0.0876  0.0759  "iit kgp"

MT evaluation scorer ended on 2008 Dec 11 at 00:28:14
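
On this test set the cumulative 1-gram BLEU (0.5200) is slightly lower than the individual 1-gram precision (0.5282). One plausible reading, inferred from the numbers rather than stated in the log, is a brevity penalty below 1. The sketch estimates that penalty from the 1-gram figures and checks that it explains the other cumulative scores; `geometric_mean` is a hypothetical helper.

```python
import math

# Scores copied from the EILMT Tourism results above (1-gram to 4-gram).
bleu_individual = [0.5282, 0.2334, 0.1269, 0.0837]
bleu_cumulative = [0.5200, 0.3457, 0.2462, 0.1873]

# With one n-gram order, cumulative BLEU = penalty * 1-gram precision,
# so the penalty can be estimated from the two 1-gram figures.
brevity_penalty = bleu_cumulative[0] / bleu_individual[0]


def geometric_mean(values):
    """Geometric mean computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


for n in range(1, 5):
    estimate = brevity_penalty * geometric_mean(bleu_individual[:n])
    print(n, round(estimate, 4), bleu_cumulative[n - 1])
```

The estimated penalty is about 0.984, and the recomputed values agree with the reported cumulative scores to within rounding.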
21
Comparison of Results
  • System         BLEU 1-gram  BLEU 3-gram  NIST
  • IIIT baseline  0.1059       0.1649       3.9
  • IIT, Kgp       0.5282       0.2462       5.1

22
Future Work
  • Named entity recognition (NER) followed by a
    transliteration system
  • Increasing the amount of parallel text by
    paraphrasing
  • Extracting parallel data from comparable
    bilingual corpora to increase our training
    corpus
  • Thank You