1
Cognates and Word Alignment in Bitexts
  • Greg Kondrak
  • University of Alberta

2
Outline
  • Background
  • Improving LCSR
  • Cognates vs. word alignment links
  • Experimental results

3
Motivation
  • Claim: words that are orthographically similar
    are more likely to be mutual translations than
    words that are not.
  • Reason: the existence of cognates, which are
    usually orthographically and semantically similar.
  • Use: considering cognates can improve word
    alignment and translation models.

4
Objective
  • Evaluation of orthographic similarity measures in
    the context of word alignment in bitexts.

5
MT applications
  • sentence alignment
  • word alignment
  • improving translation models
  • inducing translation lexicons
  • aid in manual alignment

6
Cognates
  • Similar in orthography or pronunciation.
  • Often mutual translations.
  • May include:
  • genetic cognates
  • lexical loans
  • names
  • numbers
  • punctuation

7
The task of cognate identification
  • Input: two words
  • Output: the likelihood that they are cognate
  • One method: compute their
    orthographic/phonetic/semantic similarity

8
Scope
  • The measures that we consider are:
  • language-independent
  • orthography-based
  • operating on the level of individual letters
  • based on a binary letter-identity function

9
Similarity measures
  • Prefix method
  • Dice coefficient
  • Longest Common Subsequence Ratio (LCSR)
  • Edit distance
  • Phonetic alignment
  • Many other methods

10
IDENT
  • 1 if two words are identical, 0 otherwise
  • The simplest similarity measure
  • e.g. IDENT(colour, couleur) = 0

11
PREFIX
  • The ratio of the longest common prefix of two
    words to the length of the longer word
  • e.g. PREFIX(colour, couleur) = 2/7 ≈ 0.28
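
A minimal Python sketch of IDENT and PREFIX (function names are mine, not from the slides):

    def ident(x, y):
        # 1 if the two words are identical, 0 otherwise
        return 1.0 if x == y else 0.0

    def prefix(x, y):
        # Length of the longest common prefix, divided by
        # the length of the longer word.
        p = 0
        for a, b in zip(x, y):
            if a != b:
                break
            p += 1
        return p / max(len(x), len(y))

    # prefix("colour", "couleur") -> 2/7, about 0.28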

12
DICE coefficient
  • Twice the number of common letter bigrams,
    divided by the total number of letter bigrams
  • e.g. DICE(colour, couleur) = 6/11 ≈ 0.55
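
A sketch of the bigram Dice coefficient under the same conventions (counting shared bigrams with multiplicity, which is one plausible reading of the definition):

    def bigrams(w):
        # Adjacent letter pairs: "colour" -> ["co", "ol", "lo", "ou", "ur"]
        return [w[i:i + 2] for i in range(len(w) - 1)]

    def dice(x, y):
        # Twice the number of shared bigrams over the total bigram count.
        bx, by = bigrams(x), bigrams(y)
        pool = list(by)
        shared = 0
        for b in bx:
            if b in pool:
                pool.remove(b)
                shared += 1
        return 2 * shared / (len(bx) + len(by))

    # dice("colour", "couleur") -> 2*3 / (5+6) = 6/11, about 0.55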

13
Longest Common Sub-sequence Ratio (LCSR)
  • The ratio of the longest common subsequence of
    two words to the length of the longer word.
  • e.g. LCSR(colour, couleur) = 5/7 ≈ 0.71
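
LCSR follows from the standard LCS dynamic program; a sketch:

    def lcs_length(x, y):
        # dp[i][j] = length of the LCS of x[:i] and y[:j]
        m, n = len(x), len(y)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[i - 1] == y[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def lcsr(x, y):
        # LCS length normalized by the length of the longer word.
        return lcs_length(x, y) / max(len(x), len(y))

    # lcsr("colour", "couleur") -> 5/7, about 0.71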

14
LCSR
  • Method of choice in several papers
  • Weak point: insensitive to word length
  • Example:
  • LCSR(walls, allés) = 0.8
  • LCSR(sanctuary, sanctuaire) = 0.8
  • Sometimes a minimal word length is imposed
  • A principled solution?

15
The random model
  • Assumption: strings are generated randomly from a
    given distribution of letters.
  • Problem: what is the probability of seeing k
    matches between two strings of length m and n?

16
A special case
  • Assumption: k = 0 (no matches)
  • t = alphabet size
  • S(n, i) = Stirling number of the second kind
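
The slide's formula did not survive the transcript. One reconstruction consistent with the ingredients listed above, assuming uniform random letters and taking "no matches" to mean that the two strings share no letters at all:

    P(k = 0 \mid m, n) = \frac{1}{t^{m+n}} \sum_{i=1}^{\min(n,t)} S(n,i) \; t(t-1)\cdots(t-i+1) \; (t-i)^m

Here S(n,i) · t(t−1)···(t−i+1) counts the length-n strings that use exactly i distinct letters, and (t−i)^m counts the length-m strings that avoid all of them.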

17
The problem
  • What is the probability of seeing k matches
    between two strings of length m and n?
  • An exact analytical formula is unlikely to exist.
  • A very similar problem has been studied in
    bioinformatics as statistical significance of
    alignment scores.
  • Approximations developed in bioinformatics are
    not applicable to words because of length
    differences.

18
Solutions for the general case
  • Sampling:
    • not reliable for small probability values
    • works well for low k/n ratios (uninteresting)
    • depends on a given alphabet size and letter
      frequencies
    • offers no insight
  • Inexact approximation:
    • works well for high k/n ratios (interesting)
    • easy to use
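
As an illustration of the sampling option, a minimal Monte Carlo sketch; it assumes uniform letter frequencies, which the real model need not, and reuses lcs_length from the LCSR sketch above:

    import random
    import string

    def sample_match_prob(m, n, k, t=26, trials=100_000):
        # Estimate P(LCS >= k) for two random strings of lengths
        # m and n over a t-letter alphabet by direct simulation.
        alphabet = string.ascii_lowercase[:t]
        hits = 0
        for _ in range(trials):
            x = "".join(random.choices(alphabet, k=m))
            y = "".join(random.choices(alphabet, k=n))
            if lcs_length(x, y) >= k:
                hits += 1
        return hits / trials

This makes the weakness noted above concrete: if the true probability is far below 1/trials, the estimate is essentially always zero.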

19
Formula 1
  • p = probability of a match

20
Formula 1
  • Exact for k = m = n
  • Inexact in general
  • Reason: an implicit independence assumption
  • A lower bound for the actual probability
  • A good approximation for high k/n ratios
  • Runs into numerical problems for larger n

21
Formula 2
  • Expected number of pairs of k-letter substrings.
  • Approximates the required probability for high
    k/n ratios.
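
The formula itself is not preserved in the transcript. Reading "k-letter substrings" as k-letter subsequences (the LCS setting), one natural form for the expected count, stated here as an assumption rather than a quotation of the slide, is

    E(k, m, n) = \binom{m}{k} \binom{n}{k} p^k

where p is the single-position match probability from Formula 1: each choice of k positions in one string and k in the other matches in full with probability p^k. Since LCS(X, Y) ≥ k exactly when at least one such pair matches, this expectation upper-bounds the required probability, and approximates it well when matching pairs are rare, i.e. for high k/n ratios.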

22
Formula 2
  • Does not work for low k/n ratios.
  • Not monotonic.
  • Simpler than Formula 1.
  • More robust against numerical underflow for very
    long words.

23
Comparison of both formulas
  • Both are exact for k = m = n
  • For k close to max(m, n):
    • both formulas are good approximations
    • their values are very close
  • Both can be quickly computed using dynamic
    programming.

24
LCSF
  • A new similarity measure based on Formula 2.
  • LCSR(X, Y) = k/n
  • LCSF(X, Y) is computed from Formula 2, given the
    same k and n.
  • LCSF is as fast as LCSR because its values, which
    depend only on k and n, can be pre-computed and
    stored.
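
The defining formula of LCSF is not preserved here, so the sketch below only illustrates the precompute-and-cache pattern the slide describes. formula2 uses the assumed reading from slide 21 (with both lengths taken as n), and mapping the count through a negative log is my choice, not necessarily the paper's:

    import math
    from functools import lru_cache
    from math import comb

    P = 1 / 26  # assumed single-position match probability

    @lru_cache(maxsize=None)
    def formula2(k, n):
        # Assumed reading of Formula 2 (see slide 21); cached
        # because the value depends only on (k, n).
        return comb(n, k) ** 2 * P ** k

    def lcsf(x, y):
        # Higher score = the observed LCS length k is less
        # likely to arise by chance under the random model.
        k = lcs_length(x, y)  # DP from the LCSR sketch
        n = max(len(x), len(y))
        e = formula2(k, n)
        return -math.log(e) if e > 0 else float("inf")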

25
Evaluation - motivation
  • Intrinsic evaluation of orthographic similarity
    is difficult and subjective.
  • My idea: extrinsic evaluation on cognates and
    word-aligned bitexts.
  • Most cross-language cognates are orthographically
    similar, and vice versa.
  • Cognation is binary and not subjective.

26
Cognates vs alignment links
  • Manual identification of cognates is tedious.
  • Manually word-aligned bitexts are available, but
    only some of the links are between cognates.
  • Question 1: can we use manually constructed word
    alignment links instead?

27
Manual vs automatic alignment links
  • Automatically word-aligned bitexts are easily
    obtainable, but a good fraction of the links are
    wrong.
  • Question 2: can we use machine-generated word
    alignment links instead?

28
Evaluation methodology
  • Assumption: a word-aligned bitext
  • Treat aligned sentences as bags of words
  • Compute similarity for all word pairs
  • Order word pairs by their similarity value
  • Compute precision against a gold standard:
    either a cognate list or alignment links
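
In code, the methodology is a ranking evaluation; a minimal sketch (the function names and data layout are my own illustration):

    def candidate_pairs(bitext, measure):
        # bitext: list of (source_words, target_words) sentence
        # pairs, each side treated as a bag of words.
        for src, tgt in bitext:
            for s in src:
                for t in tgt:
                    yield measure(s, t), (s, t)

    def precision_at(scored_pairs, gold, cutoffs):
        # gold: set of word pairs from the gold standard
        # (a cognate list or manual alignment links).
        ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
        return {c: sum(wp in gold for _, wp in ranked[:c]) / c
                for c in cutoffs}

    # Example: precision_at(candidate_pairs(bitext, lcsr), gold, [100, 500])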

29
Test data
  • Blinker bitext (French-English)
  • 250 Bible verse pairs
  • manual word alignment
  • all cognates manually identified
  • Hansards (French-English)
  • 500 sentences
  • manual and automatic word alignment
  • Romanian-English
  • 248 sentences
  • manually aligned

30
Blinker results
31
Hansards results
32
Romanian-English results
33
Contributions
  • We showed that word alignment links can be used
    instead of cognates for evaluating word
    similarity measures.
  • We proposed a new similarity measure which
    outperforms LCSR.

34
Future work
  • Extend our approach to length normalization to
    edit distance and other similarity measures.
  • Incorporate cognate information into statistical
    MT models as an additional feature function.

35
Thank you
36
Applications
  • Recognition of cognates:
    • historical linguistics
    • machine translation
    • sentence and word alignment
    • confusable drug names
  • Edit distance tasks:
    • spelling error correction

37
Improved word alignment quality
  • GIZA trained on 50,000 sentences from the
    Hansards.
  • Tested on 500 manually aligned sentences.
  • 10% reduction of the error rate when cognates
    are added.

38
Blinker results
39
Problems with links (1)
  • The lion(1) killed enough for his cubs and
    strangled the prey for his mate(2)
  • Le lion(1) déchirait pour ses petits, étranglait
    pour ses lionnes(2)

40
Problems with links (2)
  • But let justice(1) roll on like a river,
    righteousness(2) like a never-failing stream.
  • Mais que la droiture(1) soit comme un courant
    d'eau, et la justice(2) comme un torrent qui
    jamais ne tarit.