Building Lexicons

About This Presentation

Title:

Building Lexicons

Description:

Empirical estimation of statistical translation models is typically based on ... models benefit from the best of both the empiricist and rationalist traditions ... – PowerPoint PPT presentation

Number of Views:57

Avg rating:3.0/5.0

Slides: 70

Provided by: Matthi80

Category:

more less

Transcript and Presenter's Notes

Title: Building Lexicons

1
Building Lexicons

Jae Dong Kim
Matthias Eck

2
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

3
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

4
Definitions

Translational equivalence A relation that holds
between two expressions with the same meaning,
where two expressions are in different languages.
Statistical Translation Models statistical
models of translational equivalence
Empirical estimation of statistical translation
models is typically based on parallel texts or
bitexts
Word-to-Word Lexicon
A list of word pairs
(source word, target word )
Bidirectional
Probabilistic word-to-word lexicon (source word,
target word, prob.)

5
Additional Universal Property

Translation models benefit from the best of both
the empiricist and rationalist traditions
Models to be proposed
Most word tokens translate to only one word
token. Approximated by one-to-one assumption -
Method A
Most text segments are not translated word for
word. Explicit Noise Model - Method B
Different linguistic objects have statistically
different behavior in translation. Translation
models on different word classes. - Method C
Human judgment has shown that each of three
estimation biases improves translation model
accuracy over a baseline knowledge-free model

6
Applications of Translation Models

Where word order is not important
Cross-language information retrieval
Multilingual document filtering
Computer-assisted language learning
Certain machine-assisted translation tools
Concordancing for bilingual lexicography
Corpus linguistics
crummy machine translation
Where word order is important
Speech transcription for translation
Bootstrapping of OCR systems for new languages
Interactive translation
Fully automatic high-quality machine translation

7
Advantages of translation models

Compared to handcrafted models
The possibility of better coverage
The possibility of frequent updates
More accurate information about relative
importance of different translations

IRDB
Q
T
Qi
IR
Uniform Importance?
8
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

9
Models of Co-occurrence

Intuition words that are translations of each
other are more likely to appear in corresponding
bitext regions than other pairs of words.
A boundary-based model assumes that both halves
of the bitext have been segmented into s
segments, so that segment Ui in one half of the
bitext and segment Vi in the other half are
mutual translations, 1ltilts
Co-occurrence count by Brown et al
Co-occurrence count by Melamed

10
Nonprobabilistic Translation Lexicons (1)

Summary of non-probabilistic translation lexicon
algorithms
Choose a similarity function S between word types
in L1 and word types L2
Compute association scores S(u,v) for a set of
word type pairs (u,v) ? (L1 x L2) that occur in
training data
Sort the word pairs in descending order of their
association scores
Discard all word pairs for which S(u,v) is less
than a chosen threshold. The remaining word pairs
become the entries in the translation lexicon
Main difference choice of similarity function
Those functions are based on a model of
co-occurrence with some linguistically motivated
filtering

11
Nonprobabilistic Translation Lexicons (2)

Problem independence assumption in step 2
Models of translational equivalence that are
ignorant of indirect association have a tendency
to be confused by collocates
If all the entries in a translation lexicon are
sorted by their association scores, the direct
associations will be very dense near the top of
the list, and sparser towards the bottom

He nods his
head Il hoche la
tete
Direct association
Indirect association
12
Nonprobabilistic Translation Lexicons (3)

The very top of the list can be over 98 correct
- Gale and Church (1991)
Gleaned lexicon entries for about 61 of the word
tokens in a sample of 800 English sentences
Selected only entries with high association score
61 word tokens represent 4.5word types
71.6 precision with top 23.8 of noun-noun
entries - Fung(1995)
Automatic acquisition of 6,517 lexicon entries
with 86 precision from 3.3-million-word corpus -
Wu Xia (1994)
19 recall
Weighted precision in (E1,C1,0.533),
(E1,C2,0.277), (E1,C3,0.190), if (E1,C3,0.190)
is wrong, we have precision of 0.810
Higher than unweighted one

13
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

14
Decomposition of Translation Model (1)

Two stage decomposition of sequence-to-sequence
model
First stage
Every sequence L is just an ordered bag, and the
bag B can be modeled independently of its order O

15
Decomposition of Translation Model (2)

First Stage
Let L1 and L2 be two sequences and let A be a
one-to-one mapping between the elements of L1 and
the elements of L2

16
Decomposition of Translation Model (2)

First Stage
Let L1 and L2 be two sequences and let A be a
one-to-one mapping between the elements of L1 and
the elements of L2

17
Decomposition of Translation Model (3)

First Stage
Bag-to-bag translation model

18
Decomposition of Translation Model (4)

Second Stage
From bags of words to the words that they contain
Bag pair generation process - how word-to-word
model is embeded
Generate a bag size l. l is also the assignment
size
Generate l language-independent concepts C1,,Cl.
From each concept Ci, 1ltiltl, generate a pair of
word sequences from L1 x L2, according to
the distribution , to lexicalize
the concept in the two languages. Some concepts
are not lexicalized in some languages, so one of
ui and vi may be empty.
Bags
An assignment (i1,j1),,(il,jl)

19
Decomposition of Translation Model (5)

Second Stage
The probability of generating a pair of bags
(B1,B2)

20
Decomposition of Translation Model (5)

Second Stage
The probability of generating a pair of bags
(B1,B2)
is zero for all concepts
except one
is symmetric unlike the models
of Brown et al.

21
The One-to-One Assumption

and may consist of at most one word each
A pair of bags containing m and n nonempty words
can be generated by a process where the bag size
l is anywhere between max(m,n) and mn
Not as restrictive as it may appear. What if we
extend a word to include spaces?

22
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

23
Reestimated Seq.-to-Seq. Trans. Model (1)

Variations on the theme proposed by Brown et al.
Conditional probabilities, but can be compared to
symmetric models if the letter are normalized
marginally
Only Co-occurrence Information
EM
When information about segment lengths is not
available

24
Reestimated Seq.-to-Seq. Trans. Model (2)

Word Order Correlation Biases
In any bitext, the positions of words relative to
the true bitext map correlate with the positions
of their translations
The word order correlation bias is most useful
when it has high predictive power
Absolute word positions - Brown et al. 1988
A much smaller set of relative offset parameters
- Dagan, Church, and Gale. 1993
Even more efficient parameter estimation using
HMM with some additional assumptions - Vogel,
Ney, and Tillman. 1996

25
Reestimated Bag-to-Bag Trans. Models

Another Bag-to-Bag model by Hiemstra. 1996
The same one-to-one assumption
The difference empty words are allowed in only
one of the two bags, the one representing the
shorter sentence
Iterative Proportional Fitting Procedure(IPFP)
for parameter estimation
IPFP is subjective to initial conditions
With the most advantageous, more accurate than
Model 1

26
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

27
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

28
Parameter Estimation

Methods for estimating the parameters of a
symmetric word-to-word translation model from a
bitext.
Interested in probability trans(u,v) Probability
to jointly generate the pair of words (u,v)
trans(u,v) cannot be directly inferred It is
unknown which words were generated together
Observable in bitext is only cooc(u,v)
(co-occurrence count)

29
Definitions

Link counts links(u,v) hypothesis about the
number of times u and v were generated together
Link token Ordered Pair of word tokens
Link type Ordered Pair of word types
links(u,v) ranges over Link types
trans(u,v) can be calculated using links(u,v)

30
Definitions (continued)

score(u,v) chance u and v can ever be mutual
translationssimilar to trans(u,v), convenient
for estimation
Relationship between trans(u,v) and score(u,v)
can be direct (depending on model)

31
General outline for all Methods

Initialize the score parameter to a first
approximation based only on cooc(u,v)
REPEAT
Approximate links(u,v) based on score and cooc
Calculate trans(u,v), Stop if only little change
Reestimate score(u,v) based on links and cooc

32
EM-Algorithm!

Initialize the score parameter to a first
approximation based only on cooc(u,v)
REPEAT
Approximate links(u,v) based on score and cooc
Calculate trans(u,v), Stop if only little change
Re-estimate score(u,v) based on links and cooc

Initial E-Step
M-Step
E-Step
33
EM Maximum Likelihood Approach

Find the parameters that maximize the probability
of the given bitext
Assignments cannot be decomposed due to the
one-to-one assumption (compare to Brown et al.
1993)
MLE approach is infeasible
Approximating EM is necessary

34
Maximum a Posteriori

Evaluate Expectations using the single most
probable assignment only (Maximum a posteriori
(MAP) assignment)

35
Maximum a Posteriori

Evaluate Expectations using the single most
probable assignment (Maximum a posteriori (MAP)
assignment)
l number of Concepts, number of produced words

36
Maximum a Posteriori

Evaluate Expectations using the single most
probable assignment (Maximum a posteriori (MAP)
assignment)

37
Maximum a Posteriori

Evaluate Expectations using the single most
probable assignment (Maximum a posteriori (MAP)
assignment)
l, Pr(l) constant

38
Maximum a Posteriori

Evaluate Expectations using the single most
probable assignment (Maximum a posteriori (MAP)
assignment)

39
Bipartite Graph

Represent bitext as bipartite graph
Find solution for weighted maximum matching
Still too expensive to solve
Competitive Linking Algorithm approximates

u

log(trans(u,v))
v

40
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

41
Method A Competitive Linking

Step 1
Co-occurrence counts
Use whole table information
Initialize score(u,v) to G2(u,v) (similar to
Chi-square)
Good-Turing Smoothing gives improvements

42
Step 2 Estimation of link counts

Competitive Linking algorithm is employed
Greedy approximation of the MAP approximation

Algorithm
Sort all score(u,v) from the highest to the
lowest
For each score(u,v) in order
Link all co-occurring token pairs (u,v) in the
bitext(If u is NULL consider all tokens of v in
the bitext linked to NULL and vice versa)
One-to-One assumption Linked words cannot be
linked againRemove all linked words from the
bitext

43
Example Competitive Linking
u
a
b
c
d
v
44
Competitive Linking
X
X
X
u
X
X
X
X
X
X
a
X
b
X
X
X
c
d
v
45
Competitive Linking
X
X
X
X
X
X
u
X
X
X
X
X
X
a
X
X
X
X
X
X
b
X
X
X
X
X
X
c
d
v
46
Competitve Linking per sentence
b
a

links(a,c) links(b,d)
c
d

a
b

links(a,d) links(b,e)
c
d
e

47
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

48
Method B

Most texts are not translated word-for-word
Why is that a problem with Method A?

a
b
x

c
d
e
f

49
Method B

Most texts are not translated word-for-word
Why is that a problem with Method A?

a
b
x

Competitive Linking
c
d
e
f

a
b
x

We are forced to connect (b,d)!
c
d
e
f

50
Method B

After one iteration of Method A on 300k sentences
Hansard

links cooc
often, probably correct
links lt cooc
rare, might be correct
links ltlt cooc
often, probably incorrect

51
Method B

Use information links(u,v)/cooc(u,v) to bias
parameter estimation
Introduce p(u,v) as the probability of u and v
being linked when they co-occur.
Leads to binomial process for each co-occurrence
(either linked or not linked)
Too sparse data to model p(u,v)
Just 2 cases

If u,v are mutual translations (Rate of true
positives)
If u,v are not mutual translations (Rate of false
positives)
52
Method B
53
Maximum Likelihood Estimation
54
Maximum Likelihood Estimation

on 300k sentences Hansard

55
Method B

Overall score calculation for Method B
Probability for generating correct links(u,v)
given cooc(u,v)
Probability for generating incorrect links(u,v)
given cooc(u,v)
Score is ratio

56
Building Lexicons

Introduction
Previous Work
Translation Model Decomposition
Reestimated Models
Parameter Estimation
Method A
Method B
Method C
Evaluation
Conclusion

57
Method C

Improved Estimation using Preexisting Word
Classes
Method A, B
All word pairs that co-occur the same number of
times and are linked the same number of times are
assigned the same score
But Frequent words are translated less
consistently than rare words
Introduce classes to get Statistics per class

58
Building Lexicons