1
Cognates and Word Alignment in Bitexts
  • Greg Kondrak
  • University of Alberta

2
Outline
  • Background
  • Improving LCSR
  • Cognates vs. word alignment links
  • Experimental results

3
Motivation
  • Claim: words that are orthographically similar
    are more likely to be mutual translations than
    words that are not.
  • Reason: the existence of cognates, which are
    usually orthographically and semantically similar.
  • Use: considering cognates can improve word
    alignment and translation models.

4
Objective
  • Evaluation of orthographic similarity measures in
    the context of word alignment in bitexts.

5
MT applications
  • sentence alignment
  • word alignment
  • improving translation models
  • inducing translation lexicons
  • aid in manual alignment

6
Cognates
  • Similar in orthography or pronunciation.
  • Often mutual translations.
  • May include:
  • genetic cognates
  • lexical loans
  • names
  • numbers
  • punctuation

7
The task of cognate identification
  • Input: two words
  • Output: the likelihood that they are cognate
  • One method: compute their
    orthographic/phonetic/semantic similarity

8
Scope
  • The measures that we consider are:
  • language-independent
  • orthography-based
  • operating on the level of individual letters
  • based on a binary letter-identity function

9
Similarity measures
  • Prefix method
  • Dice coefficient
  • Longest Common Subsequence Ratio (LCSR)
  • Edit distance
  • Phonetic alignment
  • Many other methods

10
IDENT
  • 1 if two words are identical, 0 otherwise
  • The simplest similarity measure
  • e.g. IDENT(colour, couleur) = 0

11
PREFIX
  • The ratio of the longest common prefix of two
    words to the length of the longer word
  • e.g. PREFIX(colour, couleur) = 2/7 ≈ 0.28
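
A minimal Python sketch of IDENT and PREFIX (function names are mine, not from the slides):

    def ident(x, y):
        # 1 if the two words are identical, 0 otherwise
        return 1.0 if x == y else 0.0

    def prefix(x, y):
        # Length of the longest common prefix, divided by
        # the length of the longer word.
        p = 0
        for a, b in zip(x, y):
            if a != b:
                break
            p += 1
        return p / max(len(x), len(y))

    # prefix("colour", "couleur") -> 2/7, about 0.28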

12
DICE coefficient
  • Twice the number of common letter bigrams,
    divided by the total number of letter bigrams
  • e.g. DICE(colour, couleur) = 6/11 ≈ 0.55
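
A sketch of the bigram Dice coefficient under the same conventions (counting shared bigrams with multiplicity, which is one plausible reading of the definition):

    def bigrams(w):
        # Adjacent letter pairs: "colour" -> ["co", "ol", "lo", "ou", "ur"]
        return [w[i:i + 2] for i in range(len(w) - 1)]

    def dice(x, y):
        # Twice the number of shared bigrams over the total bigram count.
        bx, by = bigrams(x), bigrams(y)
        pool = list(by)
        shared = 0
        for b in bx:
            if b in pool:
                pool.remove(b)
                shared += 1
        return 2 * shared / (len(bx) + len(by))

    # dice("colour", "couleur") -> 2*3 / (5+6) = 6/11, about 0.55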

13
Longest Common Sub-sequence Ratio (LCSR)
  • The ratio of the longest common subsequence of
    two words to the length of the longer word.
  • e.g. LCSR(colour, couleur) = 5/7 ≈ 0.71
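
LCSR follows from the standard LCS dynamic program; a sketch:

    def lcs_length(x, y):
        # dp[i][j] = length of the LCS of x[:i] and y[:j]
        m, n = len(x), len(y)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[i - 1] == y[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def lcsr(x, y):
        # LCS length normalized by the length of the longer word.
        return lcs_length(x, y) / max(len(x), len(y))

    # lcsr("colour", "couleur") -> 5/7, about 0.71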

14
LCSR
  • Method of choice in several papers
  • Weak point: insensitive to word length
  • Example:
  • LCSR(walls, allés) = 0.8
  • LCSR(sanctuary, sanctuaire) = 0.8
  • Sometimes a minimal word length is imposed
  • A principled solution?

15
The random model
  • Assumption: strings are generated randomly from a
    given distribution of letters.
  • Problem: what is the probability of seeing k
    matches between two strings of length m and n?

16
A special case
  • Assumption: k = 0 (no matches)
  • t = alphabet size
  • S(n, i) = Stirling number of the second kind
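
The slide's formula did not survive the transcript. One reconstruction consistent with the ingredients listed above, assuming uniform random letters and taking "no matches" to mean that the two strings share no letters at all:

    P(k = 0 \mid m, n) = \frac{1}{t^{m+n}} \sum_{i=1}^{\min(n,t)} S(n,i) \; t(t-1)\cdots(t-i+1) \; (t-i)^m

Here S(n,i) · t(t−1)···(t−i+1) counts the length-n strings that use exactly i distinct letters, and (t−i)^m counts the length-m strings that avoid all of them.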

17
The problem
  • What is the probability of seeing k matches
    between two strings of length m and n?
  • An exact analytical formula is unlikely to exist.
  • A very similar problem has been studied in
    bioinformatics as statistical significance of
    alignment scores.
  • Approximations developed in bioinformatics are
    not applicable to words because of length
    differences.

18
Solutions for the general case
  • Sampling:
    • not reliable for small probability values
    • works well for low k/n ratios (uninteresting)
    • depends on a given alphabet size and letter
      frequencies
    • offers no insight
  • Inexact approximation:
    • works well for high k/n ratios (interesting)
    • easy to use
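
As an illustration of the sampling option, a minimal Monte Carlo sketch; it assumes uniform letter frequencies, which the real model need not, and reuses lcs_length from the LCSR sketch above:

    import random
    import string

    def sample_match_prob(m, n, k, t=26, trials=100_000):
        # Estimate P(LCS >= k) for two random strings of lengths
        # m and n over a t-letter alphabet by direct simulation.
        alphabet = string.ascii_lowercase[:t]
        hits = 0
        for _ in range(trials):
            x = "".join(random.choices(alphabet, k=m))
            y = "".join(random.choices(alphabet, k=n))
            if lcs_length(x, y) >= k:
                hits += 1
        return hits / trials

This makes the weakness noted above concrete: if the true probability is far below 1/trials, the estimate is essentially always zero.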

19
Formula 1
  • p = probability of a match

20
Formula 1
  • Exact for k = m = n
  • Inexact in general
  • Reason: an implicit independence assumption
  • A lower bound for the actual probability
  • A good approximation for high k/n ratios
  • Runs into numerical problems for larger n

21
Formula 2
  • Expected number of pairs of k-letter substrings.
  • Approximates the required probability for high
    k/n ratios.
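
The formula itself is not preserved in the transcript. Reading "k-letter substrings" as k-letter subsequences (the LCS setting), one natural form for the expected count, stated here as an assumption rather than a quotation of the slide, is

    E(k, m, n) = \binom{m}{k} \binom{n}{k} p^k

where p is the single-position match probability from Formula 1: each choice of k positions in one string and k in the other matches in full with probability p^k. Since LCS(X, Y) ≥ k exactly when at least one such pair matches, this expectation upper-bounds the required probability, and approximates it well when matching pairs are rare, i.e. for high k/n ratios.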

22
Formula 2
  • Does not work for low k/n ratios.
  • Not monotonic.
  • Simpler than Formula 1.
  • More robust against numerical underflow for very
    long words.

23
Comparison of both formulas
  • Both are exact for k = m = n
  • For k close to max(m, n):
    • both formulas are good approximations
    • their values are very close
  • Both can be quickly computed using dynamic
    programming.

24
LCSF
  • A new similarity measure based on Formula 2.
  • LCSR(X, Y) = k/n
  • LCSF(X, Y) is computed from Formula 2, given the
    same k and n.
  • LCSF is as fast as LCSR because its values, which
    depend only on k and n, can be pre-computed and
    stored.
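
The defining formula of LCSF is not preserved here, so the sketch below only illustrates the precompute-and-cache pattern the slide describes. formula2 uses the assumed reading from slide 21 (with both lengths taken as n), and mapping the count through a negative log is my choice, not necessarily the paper's:

    import math
    from functools import lru_cache
    from math import comb

    P = 1 / 26  # assumed single-position match probability

    @lru_cache(maxsize=None)
    def formula2(k, n):
        # Assumed reading of Formula 2 (see slide 21); cached
        # because the value depends only on (k, n).
        return comb(n, k) ** 2 * P ** k

    def lcsf(x, y):
        # Higher score = the observed LCS length k is less
        # likely to arise by chance under the random model.
        k = lcs_length(x, y)  # DP from the LCSR sketch
        n = max(len(x), len(y))
        e = formula2(k, n)
        return -math.log(e) if e > 0 else float("inf")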

25
Evaluation - motivation
  • Intrinsic evaluation of orthographic similarity
    is difficult and subjective.
  • My idea: extrinsic evaluation on cognates and
    word-aligned bitexts.
  • Most cross-language cognates are orthographically
    similar, and vice versa.
  • Cognation is binary and not subjective.

26
Cognates vs alignment links
  • Manual identification of cognates is tedious.
  • Manually word-aligned bitexts are available, but
    only some of the links are between cognates.
  • Question 1: can we use manually constructed word
    alignment links instead?

27
Manual vs automatic alignment links
  • Automatically word-aligned bitexts are easily
    obtainable, but a good fraction of the links are
    wrong.
  • Question 2: can we use machine-generated word
    alignment links instead?

28
Evaluation methodology
  • Assumption: a word-aligned bitext
  • Treat aligned sentences as bags of words
  • Compute similarity for all word pairs
  • Order word pairs by their similarity value
  • Compute precision against a gold standard:
    either a cognate list or alignment links
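
In code, the methodology is a ranking evaluation; a minimal sketch (the function names and data layout are my own illustration):

    def candidate_pairs(bitext, measure):
        # bitext: list of (source_words, target_words) sentence
        # pairs, each side treated as a bag of words.
        for src, tgt in bitext:
            for s in src:
                for t in tgt:
                    yield measure(s, t), (s, t)

    def precision_at(scored_pairs, gold, cutoffs):
        # gold: set of word pairs from the gold standard
        # (a cognate list or manual alignment links).
        ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
        return {c: sum(wp in gold for _, wp in ranked[:c]) / c
                for c in cutoffs}

    # Example: precision_at(candidate_pairs(bitext, lcsr), gold, [100, 500])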

29
Test data
  • Blinker bitext (French-English)
  • 250 Bible verse pairs
  • manual word alignment
  • all cognates manually identified
  • Hansards (French-English)
  • 500 sentences
  • manual and automatic word alignment
  • Romanian-English
  • 248 sentences
  • manually aligned

30
Blinker results
31
Hansards results
32
Romanian-English results
33
Contributions
  • We showed that word alignment links can be used
    instead of cognates for evaluating word
    similarity measures.
  • We proposed a new similarity measure which
    outperforms LCSR.

34
Future work
  • Extend our approach to length normalization to
    edit distance and other similarity measures.
  • Incorporate cognate information into statistical
    MT models as an additional feature function.

35
Thank you
36
Applications
  • Recognition of cognates:
    • historical linguistics
    • machine translation
    • sentence and word alignment
    • confusable drug names
  • Edit distance tasks:
    • spelling error correction

37
Improved word alignment quality
  • GIZA trained on 50,000 sentences from the
    Hansards.
  • Tested on 500 manually aligned sentences.
  • 10% reduction of the error rate when cognates
    are added.

38
Blinker results
39
Problems with links (1)
  • The lion(1) killed enough for his cubs and
    strangled the prey for his mate(2)
  • Le lion(1) déchirait pour ses petits, étranglait
    pour ses lionnes(2)

40
Problems with links (2)
  • But let justice(1) roll on like a river,
    righteousness(2) like a never-failing stream.
  • Mais que la droiture(1) soit comme un courant
    d'eau, et la justice(2) comme un torrent qui
    jamais ne tarit.