Alignment - PowerPoint PPT Presentation

About This Presentation
Title:

Alignment

Description:

Brown, Lai & Mercer / Gale & Church algorithm. Sentence alignment as a precursor of ... Cognates: similar orthography, similar meaning, historically related. ... – PowerPoint PPT presentation

Slides: 21
Provided by: osirisSun

Transcript and Presenter's Notes

Title: Alignment


1
Alignment

2
Contents
  • Uses of sentence and character alignment.
  • Brown, Lai & Mercer / Gale & Church algorithm.
  • Sentence alignment as a precursor of word
    alignment.
  • Discovery and incorporation of cognates into
    sentence alignment: Dice, dynamic programming.
  • Cognates and the discovery of regular sound
    changes.
  • Conclusion.

3
Sentence Alignment
  • The task of sentence alignment is to discover
    exactly which sentence or sentences in the first
    language of a parallel corpus correspond to which
    sentence or sentences in the other language.
  • Uses:
  • Example-Based Machine Translation (Nagao, 1984).
  • Terminology extraction (Daille, 1994).
  • Computer Assisted Language Learning (Catizone et
    al., 1989).
  • Detection of plagiarism (Debili & Sammouda, 1992;
    Piao).

4
Character alignment
  • Character alignment: which characters correspond
    in a pair of cognate words?
  • t e l e f o n
  • t e l e ph o n e
  • Uses:
  • Confirmation of cognateness.
  • Spelling checkers (Wagner & Fischer, 1974).
  • Discovery of regular sound changes.

5
Statistical sentence alignment
  • The Brown, Lai & Mercer / Gale & Church
    statistical sentence alignment algorithm depends
    on:
  • Relative sentence lengths: longer sentences tend
    to be translated by longer sentences, shorter
    sentences tend to be translated by shorter
    sentences (see next slide).
  • Certain alignment types are more common than
    others:
  • 1:1 > 2:1 or 1:2 > 2:2 > 1:0 or 0:1 (see slide 7).
  • See examples on slides 8 to 10.
  • Here there are only 3 possible alignments, so we
    can examine them all. If there are many possible
    alignments, use dynamic programming, which
    examines only alignments with a high a priori
    probability (see slide 11).
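
The two ingredients above (a length-ratio cost and an alignment-type penalty) can be combined in a small dynamic programme. The following is a minimal sketch, not the authors' program: the function names are hypothetical, the 1:1, 1:2 and 0:1 penalties are the ones that appear in the worked costs on slides 8 to 10, the 2:2 penalty and the defaults c = 1 and s2 = 6.8 are assumptions, and sentences are represented only by their lengths in characters.

```python
import math

# Alignment-type penalties, in units of -100 * ln(probability).
# 1:1, 1:2 and 0:1 follow the worked costs on slides 8 to 10;
# the 2:2 value is an assumption.
TYPE_PENALTY = {
    (1, 1): 0, (1, 2): 250, (2, 1): 250,
    (0, 1): 450, (1, 0): 450, (2, 2): 440,
}


def length_cost(len1, len2, c=1.0, s2=6.8):
    """Cost of pairing len1 source characters with len2 target characters.

    c (expected target/source length ratio) and s2 (variance) are assumed.
    """
    if len1 == 0 and len2 == 0:
        return 0.0
    mean = (len1 + len2 / c) / 2.0
    z = abs(c * len1 - len2) / math.sqrt(s2 * mean)
    p = math.erfc(z / math.sqrt(2.0))  # two-tailed normal probability
    return -100.0 * math.log(p) if p > 0 else 1e9  # guard against underflow


def align(src, tgt):
    """Return (total cost, beads); each bead is (source indices, target indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            # Extend the best path ending at (i, j) by every allowed bead type.
            for (di, dj), penalty in TYPE_PENALTY.items():
                if i + di > n or j + dj > m:
                    continue
                cost = (best[i][j] + penalty
                        + length_cost(sum(src[i:i + di]), sum(tgt[j:j + dj])))
                if cost < best[i + di][j + dj]:
                    best[i + di][j + dj] = cost
                    back[i + di][j + dj] = (di, dj)
    # Follow the back-pointers from the end to recover the beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return best[n][m], beads


# One 71-character sentence against sentences of 25 and 39 characters.
cost, beads = align([71], [25, 39])
print(beads, int(cost))
```

For this input the single 1:2 bead is cheapest, at a cost of about 279. Because only six bead types are considered at each cell, the table is filled in time proportional to the product of the two sentence counts rather than by enumerating every possible alignment.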

6
(No Transcript)
7
(No Transcript)
8
Cost = 1:2 (alignment type) + 71:64 (length
ratio) = 250 + 29 = 279
9
Cost = 1:1 + 71:25 + 0:1 + 0:39 = 0 +
451 + 450 + 725 = 1626
10
Cost = 0:1 + 0:25 + 1:1 + 75:39 = 450 + 500 +
0 + 232 = 1182
11
(No Transcript)
12
Sentence alignment as a precursor of word
alignment
  • Gaussier, Langé & Daille 1995: compare with K-vec
    (Fung & Church 1994).
  • In the aligned corpus on the next slide, what is
    the most probable translation of chat?
  • Simple co-occurrence frequency: the 3, cat 3, is
    2, grey 1, small 1, black 1, other 1.

13
(No Transcript)
14
The contingency table
Table for chat, cat
Table for chat, the
N = a + b + c + d = 5. Cubic association
coefficient (Daille, 1995), MI³.
MI³ = log2 (a³N / ((a + b)(a + c)))
cat 3.91, the 3.17, is 2.15, grey 0.74, small 0.74,
other 0.74, black 0.26.
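
The scores above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function name mi3 is hypothetical, and the cell layout (a = sentence pairs containing both words, b = source word only, c = target word only, d = neither) and the cell counts in the examples are inferred from N = 5, the co-occurrence frequencies and the reported scores, since the tables themselves are not in the transcript.

```python
import math


def mi3(a, b, c, d):
    """Cubic association ratio: log2(a^3 * N / ((a + b) * (a + c)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the words never co-occur
    return math.log2(a ** 3 * n / ((a + b) * (a + c)))


# Assumed cells for chat/cat: together in 3 of 5 pairs, never apart.
print(round(mi3(3, 0, 0, 2), 2))  # 3.91
# Assumed cells for chat/the: "the" occurs in all 5 pairs.
print(round(mi3(3, 0, 2, 0), 2))  # 3.17
# Assumed cells for chat/is: together twice, each once without the other.
print(round(mi3(2, 1, 1, 1), 2))  # 2.15
```

Cubing a rewards frequent co-occurrence, which is why cat now outranks the even though both co-occur with chat three times.
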
15
Incorporation of cognates into sentence alignment
  • Cognates: similar orthography, similar meaning,
    historically related.
  • To find cognate anchors, McEnery & Oakes (1996)
    used Approximate String Matching, e.g. Dice's
    similarity coefficient:
  • co-ou-ul-le-eu-ur
  • co-ol-lo-ou-ur
  • Dice = (2 x matches / total bigrams)
  • = 6/11 = 0.55
  • Incorporation of cognate anchors into the Gale &
    Church algorithm:
  • Improvement for English-Norwegian (Hofland,
    1996)
  • No improvement for English-Polish (Lewandowska et
    al., 1999)
  • Simard (2001) suggested looking for isolated
    cognates.
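
The bigram calculation on this slide can be sketched as follows. The function names are hypothetical, and counting matches as a multiset intersection is an assumption about how repeated bigrams should be handled; for the example pair it makes no difference.

```python
from collections import Counter


def bigrams(word):
    """Adjacent character pairs, e.g. 'colour' -> co, ol, lo, ou, ur."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice(word1, word2):
    """Dice's coefficient: 2 x matching bigrams / total bigrams."""
    b1 = Counter(bigrams(word1.lower()))
    b2 = Counter(bigrams(word2.lower()))
    total = sum(b1.values()) + sum(b2.values())
    if total == 0:
        return 0.0  # both words are shorter than two characters
    matches = sum((b1 & b2).values())  # multiset intersection
    return 2 * matches / total


print(round(dice("couleur", "colour"), 2))  # 0.55
```

The three shared bigrams are co, ou and ur, giving 2 x 3 / (6 + 5) = 6/11.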

16
Dynamic programming for alignment at the
character level
  • The difference between two word forms (edit
    distance) is the smallest number of operations
    (insertions, deletions and substitutions)
    required to transform one word form into the
    other.
  • e.g. Malay and Tagalog words for egg:
  • t e l u r
  • i t l o g
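
The definition above corresponds to the standard dynamic-programming recurrence for edit distance. A minimal sketch, with a hypothetical function name and unit cost assumed for every insertion, deletion and substitution:

```python
def edit_distance(a, b):
    """Smallest number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("telur", "itlog"))        # 4
print(edit_distance("telefon", "telephone"))  # 3
```

Only two rows of the table are kept because this version returns the distance alone; recovering the alignment itself requires the full table.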

17
  • DP is much slower than Dice, but has the
    advantage that it gives not only edit distance
    but alignment.
  • Can we find other examples in Malay and Tagalog
    where u → o, r → g (regular sound changes)?
  • The smaller the edit distance, the more likely
    two words are to be translations of each other
    (see next slide).
  • To discover regular sound changes: for each pair
    of words in Swadesh's list, if edit distance < 2,
    consider the word pair cognate, and keep a
    tally of all aligned character pairs.
  • Regular sound changes found between Javanese and
    Malay (see next slide but one).
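
The tallying procedure described above might look like the sketch below. The function names are hypothetical, "-" is assumed as the gap symbol, and the Malay/Javanese word pairs are illustrative assumptions rather than data from the slides; the third pair is the Malay/Tagalog egg example, included to show a pair rejected by the distance threshold.

```python
from collections import Counter


def align_chars(a, b):
    """Return (edit distance, aligned character pairs); '-' marks a gap."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", b[j - 1]))
            j -= 1
    pairs.reverse()
    return d[n][m], pairs


def tally_correspondences(word_pairs, max_distance=2):
    """Tally aligned character pairs over word pairs with distance < max_distance."""
    tally = Counter()
    for w1, w2 in word_pairs:
        dist, pairs = align_chars(w1, w2)
        if dist < max_distance:  # treat the pair as cognate
            tally.update(pairs)
    return tally


# Illustrative input only (assumed, not from the slides).
word_pairs = [("batu", "watu"), ("bulan", "wulan"), ("telur", "itlog")]
tally = tally_correspondences(word_pairs)
changes = [(pair, count) for pair, count in tally.most_common() if pair[0] != pair[1]]
print(changes)  # [(('b', 'w'), 2)]
```

Filtering out identical pairs at the end leaves only the candidate sound correspondences, ranked by how often they recur across the word list.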

18
Malay 195-word list compared with
Less than 100 at edit distance 0 due to
semantic drift, e.g. kembang, kulit.
19
(No Transcript)
20
Conclusions
  • Word order is less well preserved in translation
    than sentence order or the order of the
    characters in cognate words.
  • Dynamic programming assumes no crossover.
  • Thus dynamic programming is suitable for sentence
    level and character level alignment, but indirect
    methods are required for word level alignment,
    e.g.:
  • 1. Sentence level alignment, then co-occurrence
    statistics,
  • 2. Character level alignment, then regular sound
    substitutions.