Title: Alignment
1 Alignment
2 Contents
- Uses of sentence and character alignment.
- Brown, Lai & Mercer / Gale & Church algorithm.
- Sentence alignment as a precursor of word alignment.
- Discovery and incorporation of cognates into sentence alignment: Dice, dynamic programming.
- Cognates and the discovery of regular sound changes.
- Conclusion.
3 Sentence Alignment
- The task of sentence alignment is to discover exactly which sentence or sentences in the first language of a parallel corpus correspond to which sentence or sentences in the other language.
- Uses
- Example-Based Machine Translation (Nagao, 1984).
- Terminology extraction (Daille, 1994).
- Computer Assisted Language Learning (Catizone et al., 1989).
- Detection of plagiarism (Debili & Sammouda, 1992; Piao).
4 Character alignment
- Character alignment: which characters correspond in a pair of cognate words?
- t e l e f o n
- t e l e ph o n e
- Uses
- Confirmation of cognateness.
- Spelling checkers (Wagner & Fischer, 1974).
- Discovery of regular sound changes.
5 Statistical sentence alignment
- The Brown, Lai & Mercer / Gale & Church statistical sentence alignment algorithm depends on:
- Relative sentence lengths: longer sentences tend to be translated by longer sentences, and shorter sentences tend to be translated by shorter sentences (see next slide).
- Certain alignment types are more common than others: 1-1 > 2-1 or 1-2 > 2-2 > 1-0 or 0-1 (see slide 7).
- See the examples on slides 8 to 10.
- Here there are only 3 possible alignments, so we can examine them all. If there are many possible alignments, use dynamic programming, which examines only alignments with a high a priori probability (see slide 11).
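A minimal sketch of this cost model (not from the slides themselves): the cost of one alignment "bead" is a penalty for its alignment type plus a cost for the discrepancy in lengths. The length model, Gaussian in the standardised length difference with c = 1 and s² = 6.8, follows Gale & Church (1993); the type penalties for 1-1, 1-2 and 0-1 are read off slides 8 to 10, while the 1-0, 2-1 and 2-2 values are the published figures. Function and variable names are illustrative.

    import math

    # -100 * log(prior probability) of each alignment type. The 1-1, 1-2 and 0-1
    # values match slides 8-10; the 1-0, 2-1 and 2-2 values are Gale & Church's.
    TYPE_PENALTY = {(1, 1): 0, (1, 0): 450, (0, 1): 450,
                    (2, 1): 250, (1, 2): 250, (2, 2): 440}

    def length_cost(len1, len2, c=1.0, s2=6.8):
        """-100 * log of the probability of the observed length discrepancy."""
        if len1 == 0 and len2 == 0:
            return 0.0
        mean = (len1 + len2 / c) / 2.0
        delta = (len2 - len1 * c) / math.sqrt(mean * s2)
        # two-sided tail probability of a standard normal variable
        tail = max(2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0)))), 1e-12)
        return -100.0 * math.log(tail)

    def bead_cost(src_lens, tgt_lens):
        """Cost of aligning one group of source sentences with one group of target sentences."""
        return TYPE_PENALTY[(len(src_lens), len(tgt_lens))] + length_cost(sum(src_lens), sum(tgt_lens))

    # The three candidate alignments of slides 8 to 10: one 71-character source
    # sentence against target sentences of 25 and 39 characters.
    print(bead_cost([71], [25, 39]))                    # 1-2: about 279
    print(bead_cost([71], [25]) + bead_cost([], [39]))  # 1-1 then 0-1: about 1626
    print(bead_cost([], [25]) + bead_cost([71], [39]))  # 0-1 then 1-1: about 1182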
6 (No Transcript)
7 (No Transcript)
8 Cost of the 1-2 alignment (lengths 71 : 64): 250 (alignment type) + 29 (length ratio) = 279
9 Cost of the 1-1 (71 : 25) plus 0-1 (0 : 39) alignment: 0 + 451 + 450 + 725 = 1626
10 Cost of the 0-1 (0 : 25) plus 1-1 (71 : 39) alignment: 450 + 500 + 0 + 232 = 1182
11 (No Transcript)
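Slide 11 illustrates the dynamic-programming search. Below is a sketch of that search, assuming a bead_cost function like the one above: D[i][j] holds the cost of the best alignment of the first i source sentences with the first j target sentences, and only the six common bead types are ever considered, which keeps the search tractable. Names are illustrative, not taken from the slides.

    def align(src_lens, tgt_lens, bead_cost):
        """Minimum-cost sentence alignment by dynamic programming."""
        n, m = len(src_lens), len(tgt_lens)
        beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]   # allowed alignment types
        INF = float("inf")
        D = [[INF] * (m + 1) for _ in range(n + 1)]       # best cost found so far
        back = [[None] * (m + 1) for _ in range(n + 1)]   # back-pointers for the best path
        D[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if D[i][j] == INF:
                    continue
                for di, dj in beads:
                    if i + di <= n and j + dj <= m:
                        cost = D[i][j] + bead_cost(src_lens[i:i + di], tgt_lens[j:j + dj])
                        if cost < D[i + di][j + dj]:
                            D[i + di][j + dj] = cost
                            back[i + di][j + dj] = (i, j)
        # follow the back-pointers from (n, m) to recover the best path
        path, cell = [], (n, m)
        while cell != (0, 0):
            prev = back[cell[0]][cell[1]]
            path.append((prev, cell))
            cell = prev
        return D[n][m], list(reversed(path))

    # e.g. align([71], [25, 39], bead_cost) picks the 1-2 alignment, the cheapest of
    # the three candidates examined on slides 8 to 10.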
12 Sentence alignment as a precursor of word alignment
- Gaussier, Langé & Daille (1995) compare with K-vec (Fung & Church, 1994).
- In the aligned corpus on the next slide, what is the most probable translation of chat?
- Simple co-occurrence frequency: the 3, cat 3, is 2, grey 1, small 1, black 1, other 1.
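A sketch of the co-occurrence count. The three aligned pairs below are invented (slide 13 is not transcribed here) but are chosen so that the counts come out as quoted above.

    from collections import Counter

    # Toy aligned corpus: each pair is (French sentence, English sentence).
    aligned_pairs = [
        ("le chat est gris", "the cat is grey"),
        ("le petit chat est noir", "the small cat is black"),
        ("l'autre chat", "the other cat"),
    ]

    def cooccurrence_counts(source_word, pairs):
        """Count how often each target word occurs in pairs whose source side contains source_word."""
        counts = Counter()
        for src, tgt in pairs:
            if source_word in src.split():
                counts.update(set(tgt.split()))
        return counts

    print(cooccurrence_counts("chat", aligned_pairs).most_common())
    # 'the' and 'cat' both score 3, 'is' scores 2, the rest 1: the tie between
    # 'the' and 'cat' is why an association measure such as MI3 (slide 14) is needed.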
13 (No Transcript)
14 The contingency table
Table for chat, cat. Table for chat, the.
- In each table: a = aligned sentence pairs containing both words, b = the first word only, c = the second word only, d = neither; N = a + b + c + d = 5.
- Cubic association coefficient (Daille, 1995): MI³ = log2(a³N / ((a+b)(a+c))).
- MI³ scores for chat: cat 3.91, the 3.17, is 2.15, grey 0.74, small 0.74, other 0.74, black 0.26.
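A short check of the MI³ formula. The cell values for (chat, cat) are not transcribed, so the values below are inferred from the score on the slide: a = 3, b = 0, c = 0, d = 2, N = 5.

    import math

    def mi3(a, b, c, d):
        """Cubic association coefficient: log2(a^3 * N / ((a+b) * (a+c)))."""
        n = a + b + c + d
        return math.log2(a ** 3 * n / ((a + b) * (a + c)))

    print(round(mi3(a=3, b=0, c=0, d=2), 2))   # 3.91, the score given for cat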
15 Incorporation of cognates into sentence alignment
- Cognates: similar orthography, similar meaning, historically related.
- To find cognate anchors, McEnery & Oakes (1996) used Approximate String Matching, e.g. Dice's similarity coefficient (a sketch follows this list).
- Co-ou-ul-le-eu-ur
- Co-ol-lo-ou-ur
- Dice = (2 x matching bigrams) / (total bigrams) = 6/11 = 0.55
- Incorporation of cognate anchors into the Gale & Church algorithm:
- Improvement for English-Norwegian (Hofland, 1996).
- No improvement for English-Polish (Lewandowska et al., 1999).
- Simard (2001) suggested looking for isolated cognates.
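A sketch of the bigram Dice coefficient used for the approximate string matching above; applied to couleur and colour it reproduces the 6/11 = 0.55. Function names are illustrative.

    from collections import Counter

    def bigrams(word):
        """Multiset of adjacent character pairs in a word."""
        return Counter(word[i:i + 2] for i in range(len(word) - 1))

    def dice(word1, word2):
        """Dice = 2 * shared bigrams / total bigrams."""
        b1, b2 = bigrams(word1.lower()), bigrams(word2.lower())
        shared = sum((b1 & b2).values())      # bigrams common to both words
        return 2 * shared / (sum(b1.values()) + sum(b2.values()))

    print(round(dice("couleur", "colour"), 2))   # 0.55, as above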
16 Dynamic programming for alignment at the character level
- The difference between two word forms (edit distance) is the smallest number of operations (insertions, deletions and substitutions) required to transform one word form into the other.
- e.g. the Malay and Tagalog words for egg:
- t e l u r
- i t l o g
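A sketch of edit distance by dynamic programming in the Wagner & Fischer style, with back-pointers added so that it returns a character alignment as well as the distance. Applied to telur and itlog it finds one of the minimum-cost alignments, in which u is aligned with o and r with g.

    def edit_distance_with_alignment(w1, w2):
        """Levenshtein distance plus one minimum-cost character alignment."""
        n, m = len(w1), len(w2)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i                              # i deletions
        for j in range(m + 1):
            d[0][j] = j                              # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if w1[i - 1] == w2[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        # trace back from the bottom-right cell to recover one optimal alignment
        pairs, i, j = [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if w1[i - 1] == w2[j - 1] else 1):
                pairs.append((w1[i - 1], w2[j - 1])); i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                pairs.append((w1[i - 1], "-")); i -= 1
            else:
                pairs.append(("-", w2[j - 1])); j -= 1
        return d[n][m], list(reversed(pairs))

    print(edit_distance_with_alignment("telur", "itlog"))
    # (4, [('t', 'i'), ('e', 't'), ('l', 'l'), ('u', 'o'), ('r', 'g')])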
17
- DP is much slower than Dice, but has the advantage that it gives not only the edit distance but also the alignment.
- Can we find other examples in Malay and Tagalog where u → o and r → g (regular sound changes)?
- The smaller the edit distance, the more likely two words are to be translations of each other (see next slide).
- To discover regular sound changes: for each pair of words in Swadesh's list, if the edit distance is < 2, consider the word pair cognate, and keep a tally of all aligned character pairs (a sketch follows this list).
- Regular sound changes found between Javanese and Malay (see next slide but one).
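A sketch of that tallying step, reusing edit_distance_with_alignment from the previous sketch; the word pairs passed in would be the entries of a Swadesh list in the two languages, and the names here are illustrative.

    from collections import Counter

    def sound_change_tally(word_pairs, max_distance=2):
        """Tally aligned-but-different character pairs over presumed-cognate word pairs."""
        tally = Counter()
        for w1, w2 in word_pairs:
            distance, alignment = edit_distance_with_alignment(w1, w2)
            if distance < max_distance:              # treat the pair as cognate
                tally.update((c1, c2) for c1, c2 in alignment if c1 != c2)
        return tally

    # sound_change_tally(pairs).most_common() then lists aligned character pairs by
    # frequency; regular correspondences such as u -> o or r -> g stand out because
    # they recur across many word pairs, while one-off mismatches do not.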
18 Malay 195-word list compared with
Less than 100 at edit distance 0, due to semantic drift, e.g. kembang, kulit.
19 (No Transcript)
20 Conclusions
- Word order is less well preserved in translation than sentence order or the order of the characters in cognate words.
- Dynamic programming assumes no crossover.
- Thus dynamic programming is suitable for sentence-level and character-level alignment, but indirect methods are required for word-level alignment, e.g.:
- 1. Sentence-level alignment, then co-occurrence statistics.
- 2. Character-level alignment, then regular sound substitutions.