CSCI 5832 Natural Language Processing - PowerPoint PPT Presentation

Slides: 58
Provided by: jamesm5
1
CSCI 5832 Natural Language Processing
  • Jim Martin
  • Lecture 23

2
Change in plans
  • Going straight to Chapter 25 (translation)
  • I'll come back to 23 (Q/A, summarization)

3
Machine Translation
  • Slides mostly stolen from Kevin Knight (USC/ISI)

4
Today 4/22
  • Machine translation framework
  • State of the art results
  • Evaluation methods
  • Word-based models

5
Progress
2002
  • insistent Wednesday may recurred her trips to
    Libya tomorrow for flying
  • Cairo 6-4 ( AFP ) - An official announced
    today in the Egyptian lines company for flying
    Tuesday is a company "insistent for flying" may
    resumed a consideration of a day Wednesday
    tomorrow her trips to Libya of Security Council
    decision trace international the imposed ban
    comment.
2003
  • Egyptair Has Tomorrow to Resume Its Flights to
    Libya
  • Cairo 4-6 (AFP) - Said an official at the
    Egyptian Aviation Company today that the company
    egyptair may resume as of tomorrow, Wednesday its
    flights to Libya after the International Security
    Council resolution to the suspension of the
    embargo imposed on Libya.

6
Commercial Applications
news broadcast -> foreign language speech recognition -> English translation -> searchable archive
7
Commercial Applications
8
Statistical Machine Translation
"Hmm, every time he sees 'banco', he either types
'bank' or 'bench', but if he sees 'banco de',
he always types 'bank', never 'bench'."
"Man, this is so boring."
Translated documents
9
Things are Consistently Improving
[Chart: annual evaluation of Arabic-to-English MT systems. Y-axis: translation quality (10 to 70). X-axis: year (2002 to 2006). Annotation: "Exceeded commercial-grade translation here."]
10
Progress Driven by Empirical Measures of Success
[Chart: translation quality (y-axis, 15 to 35) over time (x-axis, Mar 1 to May 1, 2005) for the USC/ISI Syntax-Based MT System, Chinese/English, NIST 2002 Test Set]
11
Current Approaches
  • Same old noisy channel model
  • If we're translating French to English, the French
    we're seeing is just a weird garbled version of
    English
  • There must have been some process that generated
    the French from the original English
  • The key is to decode the garbles back into the
    original English by
  • argmax over E of P(E | F), via Bayes' rule
  • A very old idea

12
Warren Weaver (1947)
When I look at an article in Russian, I say to
myself: "This is really written in English, but it
has been coded in some strange symbols. I will
now proceed to decode."
13
Centauri/Arcturan [Knight, 1997]
Your assignment: translate this to Arcturan
farok crrrok hihok yorok clok kantok ok-yurp
22
Centauri/Arcturan [Knight, 1997]
Your assignment: translate this to Arcturan
farok crrrok hihok yorok clok kantok ok-yurp
Hint: process of elimination
23
Centauri/Arcturan [Knight, 1997]
Your assignment: translate this to Arcturan
farok crrrok hihok yorok clok kantok ok-yurp
Hint: cognate?
24
Centauri/Arcturan [Knight, 1997]
Your assignment: put these words in order
jjat, arrat, mat, bat, oloat, at-yurp
Hint: zero fertility
25
Spanish/English text
Translate: Clients do not sell pharmaceuticals in Europe.
 
26
Bilingual Training Data
Millions of words (English side)
1m-20m words for many language pairs
(Data stripped of formatting, in sentence-pair
format, available from the Linguistic Data
Consortium at UPenn).
27
Sample Learning Curves
[Plot: BLEU score (y-axis) vs. number of sentence pairs used in training (x-axis), with one curve each for Swedish/English, French/English, German/English, and Finnish/English]
Experiments by Philipp Koehn
28
MT Evaluation
  • Traditionally difficult because there is no
    single right answer.
  • 20 human translators will translate the same
    sentence 20 different ways.

29
Evaluation Metric (BLEU)
  • N-gram precision (score is between 0 and 1)
  • What percentage of machine n-grams can be found
    in the reference translation?
  • An n-gram is a sequence of n words
  • Not allowed to use the same portion of the reference
    translation twice (can't cheat by typing out
    "the the the the the")
  • Brevity penalty
  • Can't just type out the single word "the" (precision
    1.0!)
  • Amazingly hard to game the system (i.e., find a
    way to change machine output so that BLEU goes
    up, but quality doesn't)
  • The converse doesn't hold. You can find perfectly good
    improvements that hurt, or don't help, BLEU
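
The two ingredients above (clipped n-gram precision and a brevity penalty) can be sketched in a few lines of Python. This is a minimal sentence-level illustration, not the official scoring script: real BLEU is computed over a whole test corpus, and the function names here (`clipped_precision`, `brevity_penalty`, `bleu`) are made up for this sketch. It assumes the usual formulation: a geometric mean of 1- to 4-gram precisions times exp(1 - r/c) when the candidate length c is no longer than the reference length r.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in a reference, where each
    n-gram is credited at most as often as it occurs in any one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """Sentence-level sketch: brevity penalty times the geometric mean
    of the clipped 1..max_n-gram precisions (no smoothing)."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_avg)

reference = "the gunman was shot to death by the police .".split()
# Repeating "the" is clipped: only 2 of the 4 copies can be matched.
print(clipped_precision("the the the the".split(), [reference], 1))  # 0.5
# A lone "the" has perfect unigram precision but a heavy brevity penalty.
print(clipped_precision(["the"], [reference], 1))  # 1.0
print(brevity_penalty(["the"], [reference]))
```

Note that without smoothing, any candidate with no matching 4-gram scores 0 at the sentence level, which is one reason the metric is normally reported over a full test set.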

30
Multiple Reference Translations
31
BLEU in Action
???????? (Foreign Original)
the gunman was shot to death by the police . (Reference Translation)
System Outputs:
  1. the gunman was police kill .
  2. wounded police jaya of
  3. the gunman was shot dead by the police .
  4. the gunman arrested by police kill .
  5. the gunmen were killed .
  6. the gunman was shot to death by the police .
  7. gunmen were killed by police
  8. al by the police .
  9. the ringer is killed by the police .
  10. police killed the gunman .
Legend: green = 4-gram match (good!); red = word not matched (bad!)
33
NIST 2006 Results
34
Statistical MT Systems
Candidate outputs: "What hunger have I", "Hungry I am so",
"I am so hungry", "Have I that hunger"
Que hambre tengo yo
I am so hungry
35
Statistical MT Systems
Spanish/English Bilingual Text -> Statistical Analysis -> Translation Model P(s | e)
English Text -> Statistical Analysis -> Language Model P(e)
Spanish -> Garbled English -> English
Que hambre tengo yo -> I am so hungry
Decoding algorithm: argmax over e of P(e) x P(s | e)
36
Bayes Rule/Noisy Channel
Given a source sentence s, the decoder should
consider many possible translations and return
the target string e that maximizes P(e | s).
By Bayes' Rule, we can also write this as
P(e) x P(s | e) / P(s) and maximize that instead.
P(s) never changes while we compare different
e's, so we can equivalently maximize
P(e) x P(s | e).
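
As a small illustration of the argument above, the sketch below picks the best English candidate for "Que hambre tengo yo" by maximizing log P(e) + log P(s | e). Every probability in the table is invented purely for illustration; only the candidate strings come from the slides. The names `CANDIDATES` and `decode` are hypothetical.

```python
import math

# Hypothetical model scores for the source sentence "Que hambre tengo yo".
# "lm" stands in for P(e), "tm" for P(s | e). All numbers are made up.
CANDIDATES = {
    "what hunger have i": {"lm": 1e-10, "tm": 2e-3},
    "hungry i am so":     {"lm": 1e-9,  "tm": 1e-4},
    "i am so hungry":     {"lm": 1e-6,  "tm": 1e-5},
    "have i that hunger": {"lm": 1e-10, "tm": 1e-3},
}

def decode(candidates):
    """Return the candidate e maximizing P(e) x P(s | e).

    P(s) is the same for every candidate, so it is dropped.
    Log probabilities are summed to avoid numeric underflow.
    """
    return max(
        candidates,
        key=lambda e: math.log(candidates[e]["lm"]) + math.log(candidates[e]["tm"]),
    )

print(decode(CANDIDATES))  # i am so hungry
```

With these made-up numbers, the translation model alone would prefer the word-for-word "what hunger have i"; it is the language model term that pulls the fluent candidate to the top.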
37
Three Sub-Problems of Statistical MT
  • Language model
  • Given an English string e, assigns P(e) by
    formula
  • good English string - high P(e)
  • random word sequence - low P(e)
  • Translation model
  • Given a pair of strings (f, e), assigns P(f | e)
    by formula
  • (f, e) look like translations - high P(f | e)
  • (f, e) don't look like translations - low
    P(f | e)
  • Decoding algorithm
  • Given a language model, a translation model, and
    a new sentence f, find the translation e maximizing
    P(e) x P(f | e)

38
Translation Model
Generative story
Mary did not slap the green witch
Source-language morphological analysis
Source parse tree Semantic representation
Generate target structure
Maria no dió una bofetada a la bruja verde
39
Translation Model?
Generative story
Mary did not slap the green witch
Source-language morphological analysis
Source parse tree Semantic representation
Generate target structure
Way too hard.
Maria no dió una bofetada a la bruja verde
40
The Classic Translation Model: Word
Substitution/Permutation [IBM Model 3, Brown et
al., 1993]
Generative story
Mary did not slap the green witch
n(3 | slap)
Mary not slap slap slap the green witch
p-Null
Mary not slap slap slap NULL the green witch
t(la | the)
Maria no dió una bofetada a la verde bruja
d(j | i)
Maria no dió una bofetada a la bruja verde
41
Parts List
  • We need probabilities for
  • n(x | y): the probability that word y will yield x
    outputs in the translation (fertility)
  • p: the probability of a null insertion
  • t: the actual word translation probability table
  • d(j | i): the probability that a word at position i
    will make an appearance at position j in the
    translation

42
Parts List
  • Every one of these can be learned from a
    sentence-aligned corpus
  • I.e., a corpus where sentences are paired but
    nothing else is specified
  • And the EM algorithm

43
Word Alignment
la maison      la maison bleue      la fleur

the house      the blue house       the flower
  • Assume that all word alignments are equally likely.
  • That is, all P(french-word | english-word)
    are equal
  • Recall that we want P(f | e)

44
Word Alignment
la maison      la maison bleue      la fleur

the house      the blue house       the flower
"la" and "the" are observed to co-occur frequently,
so P(la | the) is increased.
45
Word Alignment
la maison      la maison bleue      la fleur

the house      the blue house       the flower
46
Word Alignment
la maison      la maison bleue      la fleur

the house      the blue house       the flower
settling down after another iteration
47
Word Alignment
la maison      la maison bleue      la fleur

the house      the blue house       the flower
Inherent hidden structure revealed by EM training
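
The progression on these slides can be reproduced on the same three sentence pairs with a few lines of EM. The sketch below uses the simpler IBM Model 1 (word translation probabilities only, no fertility, distortion, or NULL word), not the Model 3 from the parts list, so it only estimates the t table. The name `train_model1` and the iteration count are choices made for this illustration.

```python
from collections import defaultdict

# The three sentence pairs from the slides: (French words, English words).
CORPUS = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]

def train_model1(corpus, iterations=20):
    """Estimate t[(f, e)] = P(f | e) with EM, IBM Model 1 style."""
    french_vocab = {f for fs, _ in corpus for f in fs}
    # Start with every co-occurring word pair equally likely.
    t = {}
    for fs, es in corpus:
        for f in fs:
            for e in es:
                t[(f, e)] = 1.0 / len(french_vocab)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: split each French word's mass over the English words
        # in its sentence, in proportion to the current t values.
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

t = train_model1(CORPUS)
for e in ["the", "house", "blue", "flower"]:
    best_f = max((f for (f, e2) in t if e2 == e), key=lambda f: t[(f, e)])
    print(e, "->", best_f, round(t[(best_f, e)], 3))
```

Nothing in the input says which word goes with which. The pairing emerges because "la" and "the" co-occur in all three pairs, which in turn frees "maison", "bleue", and "fleur" to pair with "house", "blue", and "flower".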

48
Parts List
  • Given a sentence alignment we can induce a word
    alignment
  • Given that word alignment we can get the p, t, d
    and n parameters we need for the model.
  • I.e., we can find argmax P(e | f) by maximizing
    P(f | e) P(e), and we can do that by searching over
    some large space of possible e.

49
Decoding
  • Remember Viterbi? Just a fancier Viterbi
  • Given foreign sentence f, find English sentence e
    that maximizes P(e) x P(f | e)

50
Decoding
Que     hambre   tengo   yo
what    hunger   have    I
that    hungry   am      me
so      make     where
54
Decoder Actually Translates New Sentences
[Diagram: a lattice of partial hypotheses running from "start" through the 1st, 2nd, 3rd, and 4th target word to "end" (all source words covered)]
Each partial translation hypothesis contains:
- Last English word chosen, plus the source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)
55
Dynamic Programming Beam Search
[Diagram: the same lattice from "start" through the 1st, 2nd, 3rd, and 4th target word to "end" (all source words covered), with a "best predecessor link" kept for each hypothesis]
Each partial translation hypothesis contains:
- Last English word chosen, plus the source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)
[Jelinek, 1969; Brown et al., 1996 US Patent; Och, Ueffing, and Ney, 2001]
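
A heavily simplified version of this search can be sketched as follows. It assumes each source word produces exactly one English word (no fertility, no NULL, no distortion cost) and uses a bigram language model, so a hypothesis only needs its last English word rather than the last two. The translation options are the ones shown on the Decoding slide, but every probability below is invented, and the names `TRANSLATIONS`, `BIGRAMS`, `UNSEEN`, and `decode` are hypothetical.

```python
import math

# Toy models. Word options come from the Decoding slide; all numbers are made up.
TRANSLATIONS = {  # stands in for t(english | source word)
    "que":    {"what": 0.5, "that": 0.3, "so": 0.2},
    "hambre": {"hunger": 0.5, "hungry": 0.4, "make": 0.1},
    "tengo":  {"have": 0.5, "am": 0.4, "where": 0.1},
    "yo":     {"i": 0.7, "me": 0.3},
}
BIGRAMS = {  # stands in for P(word | previous word)
    ("<s>", "i"): 0.4, ("i", "am"): 0.5, ("am", "so"): 0.3,
    ("so", "hungry"): 0.3, ("hungry", "</s>"): 0.5,
    ("<s>", "what"): 0.1, ("what", "hunger"): 0.01,
    ("hunger", "have"): 0.01, ("have", "i"): 0.1, ("i", "</s>"): 0.1,
}
UNSEEN = 1e-4  # floor probability for bigrams missing from the table

def lm(prev, word):
    return math.log(BIGRAMS.get((prev, word), UNSEEN))

def decode(source, beam_size=5):
    """Beam search over hypotheses keyed by (coverage set, last English word)."""
    n = len(source)
    # stacks[k] holds hypotheses covering k source words:
    # key -> (log score so far, key of best predecessor)
    stacks = [dict() for _ in range(n + 1)]
    stacks[0][(frozenset(), "<s>")] = (0.0, None)
    for k in range(n):
        # Beam pruning: expand only the best few hypotheses in this stack.
        beam = sorted(stacks[k].items(), key=lambda kv: kv[1][0], reverse=True)
        for (coverage, last), (score, _) in beam[:beam_size]:
            for i, src in enumerate(source):
                if i in coverage:
                    continue  # this source word is already translated
                for word, p in TRANSLATIONS[src].items():
                    new_score = score + math.log(p) + lm(last, word)
                    key = (coverage | {i}, word)
                    # Dynamic programming: keep the best way to reach each state.
                    if key not in stacks[k + 1] or new_score > stacks[k + 1][key][0]:
                        stacks[k + 1][key] = (new_score, (coverage, last))
    # Pick the best complete hypothesis, including the end-of-sentence score.
    key = max(stacks[n], key=lambda s: stacks[n][s][0] + lm(s[1], "</s>"))
    words = []
    for k in range(n, 0, -1):  # follow best-predecessor links back to the start
        words.append(key[1])
        key = stacks[k][key][1]
    return " ".join(reversed(words))

print(decode("que hambre tengo yo".split()))  # i am so hungry
```

Merging hypotheses that share a coverage set and last word is the dynamic-programming part; keeping only `beam_size` hypotheses per stack is the beam part, and it is what makes the search approximate. A source word missing from `TRANSLATIONS` raises a `KeyError` in this sketch.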
56
Flaws of Word-Based MT
  • Multiple English words for one foreign word
  • IBM models can do one-to-many (fertility) but not
    many-to-one
  • Phrasal Translation
  • real estate, note that, interest in
  • Syntactic Transformations
  • Verb at the beginning in Arabic
  • Translation model penalizes any proposed
    re-ordering
  • Language model not strong enough to force the
    verb to move to the right place
