Title: An integrated platform for high-accuracy word alignment
1. An integrated platform for high-accuracy word alignment
- Dan Tufis, Alexandru Ceausu, Radu Ion, Dan Stefanescu
- RACAI Research Institute for Artificial Intelligence, Bucharest
2. COWAL
- The main task of COWAL is to combine the output of two or more comparable word aligners.
- To achieve this task, COWAL is also an integrated platform with modules for tokenization, POS-tagging, lemmatization, collocation detection, dependency annotation, chunking and word sense disambiguation.
3. Word alignment algorithms (YAWA)
- YAWA starts with all plausible links (those with a log-likelihood score higher than 11).
- Then, using a competitive linking strategy, it retains the links that maximize the sentence translation equivalence score while minimizing the number of crossing links (see the sketch after this list).
- In this way, it generates only 1-1 alignments. N-M alignments are possible only when chunking and/or dependency linking is available.
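A minimal Python sketch of competitive linking over precomputed log-likelihood scores (an illustration, not YAWA's actual code; for brevity it omits the crossing-link minimization and keeps only the greedy 1-1 selection):

```python
# Competitive linking: greedily keep the best-scoring plausible links,
# using each source and target token at most once (hence 1-1 alignments).

def competitive_linking(scores, threshold=11.0):
    """scores: dict mapping (src_idx, tgt_idx) -> log-likelihood score."""
    links, used_src, used_tgt = [], set(), set()
    candidates = sorted(
        ((s, i, j) for (i, j), s in scores.items() if s > threshold),
        reverse=True,  # best-scoring candidates first
    )
    for score, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            links.append((i, j, score))
            used_src.add(i)
            used_tgt.add(j)
    return links

# Source token 0 competes for two targets; only the stronger link survives.
scores = {(0, 0): 25.3, (0, 1): 14.8, (1, 1): 19.2}
print(competitive_linking(scores))  # [(0, 0, 25.3), (1, 1, 19.2)]
```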
4. Word alignment algorithms (MEBA)
- MEBA iterates several times over each pair of aligned sentences, at each iteration adding only the highest-scoring links.
- The links already established in previous iterations lend support to, or impose restrictions on, the links to be added in a subsequent iteration.
- MEBA uses different weights and different significance thresholds for each feature and iteration step (see the sketch after this list).
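The overall control flow might look like the following sketch (the data layout, the per-iteration schedule of weights and thresholds, and the conflict rule are assumptions; only the iterative, context-feeding structure is taken from the slide):

```python
# Iterative linking: each iteration has its own feature weights and
# significance threshold; links fixed earlier are passed to the
# context-dependent features of later iterations.

def iterative_alignment(pairs, features, schedule):
    """
    pairs:    candidate (src, tgt) token pairs of one sentence pair
    features: dict name -> f(src, tgt, established_links) in [0, 1]
    schedule: list of (weights, threshold), one entry per iteration
    """
    established = []
    for weights, threshold in schedule:
        for src, tgt in pairs:
            # a token linked in an earlier iteration restricts later links
            if any(src == s or tgt == t for s, t, _ in established):
                continue
            score = sum(weights[name] * feat(src, tgt, established)
                        for name, feat in features.items())
            if score >= threshold:
                established.append((src, tgt, score))
    return established
```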
5. Features characterizing a link
- A link <Token1, Token2> is characterized by a set of features whose values are real numbers in the [0, 1] interval.
- Context-independent features (CIF) refer only to the tokens of the current link.
- Context-dependent features (CDF) refer to the properties of the current link with respect to the rest of the links in a bi-text.
6. Context-independent features
- Translation equivalents (lemma and/or word form)
- Translation equivalents entropy (lemma)
- Part-of-Speech affinity
- Cognates
7. Translation equivalents (TE)
- YAWA and TREQ-AL use competitive linking based on log-likelihood scores, plus the Ro-En aligned wordnets.
- MEBA uses GIZA-generated candidates filtered with a log-likelihood threshold (11).
- The TE candidate search space is limited by lemmatization and POS meta-classes (e.g. meta-class 1 includes only N, V, Adj and Adv; meta-class 8 includes only proper names).
- For a pair of languages, translation equivalents are computed in both directions. The value of the TE feature of a candidate link <TOKEN1, TOKEN2> is 1/2 (P_TR(TOKEN1, TOKEN2) + P_TR(TOKEN2, TOKEN1)) (see the sketch after this list).
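In code, the TE feature is a simple average of the two directional probabilities; the nested-dictionary layout below is an assumption:

```python
# TE(<tok1, tok2>) = 1/2 * (P_TR(tok1, tok2) + P_TR(tok2, tok1)),
# with translation probabilities estimated in both directions.

def te_feature(tok1, tok2, p_src2tgt, p_tgt2src):
    """p_src2tgt[a][b]: probability that a translates to b (and vice versa)."""
    return 0.5 * (p_src2tgt.get(tok1, {}).get(tok2, 0.0)
                  + p_tgt2src.get(tok2, {}).get(tok1, 0.0))

p_src2tgt = {"casa": {"house": 0.8, "home": 0.2}}
p_tgt2src = {"house": {"casa": 0.9}}
print(te_feature("casa", "house", p_src2tgt, p_tgt2src))  # 0.85
```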
8. Entropy Score (ES)
- The entropy of a word's translation-equivalents distribution proved to be an important hint for identifying highly reliable links (anchoring links).
- Skewed distributions are favored over uniform ones.
- For a link <A, B>, the link feature value is 0.5 (ES(A) + ES(B)) (see the sketch after this list).
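The slide does not spell out ES itself; one plausible formulation (an assumption, not the authors' definition) inverts the normalized entropy of the translation distribution, so skewed distributions score close to 1 and uniform ones close to 0:

```python
import math

def entropy_score(trans_probs):
    """trans_probs: distribution over a word's translation equivalents."""
    n = len(trans_probs)
    if n <= 1:
        return 1.0  # a single translation is maximally skewed
    h = -sum(p * math.log(p) for p in trans_probs if p > 0)
    return 1.0 - h / math.log(n)  # 1 = skewed/reliable, 0 = uniform

def es_feature(probs_a, probs_b):
    """Link feature for <A, B>: 0.5 * (ES(A) + ES(B))."""
    return 0.5 * (entropy_score(probs_a) + entropy_score(probs_b))

print(entropy_score([0.97, 0.01, 0.01, 0.01]))  # ~0.88 (skewed)
print(entropy_score([0.25, 0.25, 0.25, 0.25]))  # 0.0 (uniform)
```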
9. Part-of-speech affinity (PA)
- An important clue in word alignment is that translated words tend to keep their part of speech, and when they have different POSes, this is not arbitrary.
- We tried using GIZA (replacing tokens with their respective POSes), but there was too much noise!
- The information was computed from a gold standard (the revised NAACL 2003 data), in both directions (source-target and target-source), as sketched after this list.
- For a link <A, B>: PA = 0.5 (P(cat(A) | cat(B)) + P(cat(B) | cat(A))).
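The two conditional probability tables can be estimated directly from the gold-standard links; a sketch (variable names are hypothetical):

```python
from collections import Counter

def pos_affinity_tables(gold_links):
    """gold_links: (source_POS, target_POS) pairs from the gold alignment."""
    joint = Counter(gold_links)
    src_tot = Counter(s for s, _ in gold_links)
    tgt_tot = Counter(t for _, t in gold_links)
    p_t_given_s = {(s, t): c / src_tot[s] for (s, t), c in joint.items()}
    p_s_given_t = {(s, t): c / tgt_tot[t] for (s, t), c in joint.items()}
    return p_t_given_s, p_s_given_t

def pa_feature(cat_a, cat_b, p_t_given_s, p_s_given_t):
    """PA(<A, B>) = 0.5 * (P(cat(B)|cat(A)) + P(cat(A)|cat(B)))."""
    return 0.5 * (p_t_given_s.get((cat_a, cat_b), 0.0)
                  + p_s_given_t.get((cat_a, cat_b), 0.0))
```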
10. Cognates (COG)
- The cognates feature assigns a string similarity (using the Levenshtein distance) to the tokens of a candidate link.
- We estimated the probability that a pair of orthographically similar words appearing in aligned sentences are cognates, for different string-similarity thresholds. For the threshold 0.6 we found no exceptions. Therefore, the value of this feature is either 1 (if the similarity score is above the threshold) or 0 (otherwise).
- Before the string similarity score is computed, the words are normalized (duplicate letters are removed, diacritics are removed, some suffixes are discarded), as in the sketch after this list.
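A sketch of the cognate test; the diacritic and duplicate-letter normalization follows the slide, while the suffix stripping is omitted here for brevity:

```python
import unicodedata

def normalize(word):
    """Lowercase, strip diacritics, collapse duplicate letters."""
    word = ''.join(c for c in unicodedata.normalize('NFD', word.lower())
                   if unicodedata.category(c) != 'Mn')
    out = []
    for c in word:               # "committee" -> "comite"
        if not out or out[-1] != c:
            out.append(c)
    return ''.join(out)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cog_feature(w1, w2, threshold=0.6):
    """1 if the normalized similarity clears the threshold, else 0."""
    a, b = normalize(w1), normalize(w2)
    sim = 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
    return 1.0 if sim >= threshold else 0.0

print(cog_feature("paralel", "parallel"))  # 1.0
```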
11. Context-dependent features
- Locality
- Links crossed
- Relative position/Distortion
- Collocation/Fertility
- Coherence
12. Collocation
- Bi-gram lists (content words only) were built from each monolingual part of the training corpus, using the log-likelihood score (threshold of 10) and a minimal occurrence frequency (3) to filter candidates. Collocation probabilities are estimated for each surviving bi-gram (see the sketch after this list).
- If neither token of a candidate link has a relevant collocation score with the tokens in its neighborhood, the value of this feature is 0. Otherwise the value is the maximum of the collocation probabilities of the link's tokens. Competing links (those starting or finishing in the same token) are licensed if and only if at least one of them has a non-null collocation score.
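The candidate filter can be sketched with Dunning's log-likelihood ratio, using the slide's threshold (10) and minimum frequency (3); the count variables are assumptions:

```python
import math

def ll_score(c12, c1, c2, n):
    """Dunning's LLR for bigram (w1, w2): c12 bigram count, c1/c2 unigram
    counts of w1/w2, n total number of bigrams in the corpus."""
    def logl(k, m, p):  # log-likelihood of k successes in m trials under p
        p = min(max(p, 1e-12), 1 - 1e-12)
        return k * math.log(p) + (m - k) * math.log(1 - p)
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    return 2 * (logl(c12, c1, p1) + logl(c2 - c12, n - c1, p2)
                - logl(c12, c1, p) - logl(c2 - c12, n - c1, p))

def is_collocation_candidate(c12, c1, c2, n, ll_threshold=10.0, min_freq=3):
    return c12 >= min_freq and ll_score(c12, c1, c2, n) >= ll_threshold
```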
13. Distortion/Relative position
- Each token on both sides of a bi-text is characterized by a position index, computed as the ratio between its relative position in the sentence and the length of the sentence. The absolute value of the difference between two tokens' position indexes gives the link's obliqueness.
- The distortion feature of a link is its obliqueness: D(link) = OBL(SW_i, TW_j) (illustrated after this list).
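The computation in code (assuming 1-based token positions):

```python
def obliqueness(i, j, src_len, tgt_len):
    """Absolute difference of the two tokens' relative position indexes."""
    return abs(i / src_len - j / tgt_len)

# D(link) = OBL(SW_i, TW_j); matching relative positions give 0.
print(obliqueness(3, 6, 10, 20))  # 0.0
```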
14. Localization
- This feature is relevant with or without the chunking or dependency parsing modules. It accounts for the degree of cohesion of the links.
- When the chunking module is available and the chunks are aligned via the linking of their respective heads, the links starting in one chunk should finish in the aligned chunk.
- When chunking information is not available, link localization is judged against a window whose span depends on the aligned sentences' lengths.
- Maximum localization (1) is reached when all the tokens in the source window are linked to tokens in the target window (see the sketch after this list).
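One plausible window-based localization score (the slide does not give the exact formula, so this is an assumption): the fraction of linked source-window tokens whose partners land inside the target window:

```python
def localization(links, src_window, tgt_window):
    """links: set of (i, j) token-position pairs; windows: position ranges."""
    in_window = [(i, j) for i, j in links if i in src_window]
    if not in_window:
        return 0.0
    return sum(j in tgt_window for _, j in in_window) / len(in_window)

links = {(4, 7), (5, 8), (6, 12)}
print(localization(links, range(4, 7), range(6, 10)))  # 2/3: (6, 12) escapes
```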
15. Crossed links
- The crossed-links feature counts (for a window size depending on the categories of the candidates and the sentences' lengths) the links that are crossed (see the sketch after this list).
- The normalization factor (the maximum number of crossable links) is set empirically, based on the categories of the link's tokens.
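Counting crossed links reduces to counting inverted index pairs; a small sketch (the category-dependent window and the empirical normalization factor are left out):

```python
from itertools import combinations

def crossed_links(links):
    """Links (i1, j1) and (i2, j2) cross when i1 < i2 but j1 > j2."""
    return sum((i1 - i2) * (j1 - j2) < 0
               for (i1, j1), (i2, j2) in combinations(links, 2))

print(crossed_links([(1, 2), (2, 1), (3, 3)]))  # 1: (1, 2) crosses (2, 1)
```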
16. EVALUATION: Official ranking
- U.RACAI.Combined
- L.ISI.Run5.vocab.grow
20. Word alignment combiners
- The COWAL (ACL 2005) combiner is fine-tuned for the language pair concerned (rule-based).
- The SVM filter is a language-independent combiner (trainable on positive and negative examples).
- There is a trade-off between human introspection and performance.
21. SVM filter
- Combining word alignments requires the ability to distinguish between the correct and incorrect links of the two or more merged alignments. SVM technology is specifically suited to this task.
- The SVM combiner is a classifier trained on both positive and negative examples (a minimal sketch follows).
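A minimal sketch of such a filter with scikit-learn (the library choice, feature layout, and toy numbers are assumptions; the slides only specify an SVM trained on positive and negative link examples):

```python
from sklearn.svm import SVC

# Hypothetical training data: one feature vector per link
# (e.g. TE, ES, PA, COG, distortion), labeled 1 = correct, 0 = wrong.
X_train = [[0.9, 0.8, 0.7, 1.0, 0.05],
           [0.1, 0.3, 0.2, 0.0, 0.40],
           [0.8, 0.9, 0.6, 0.0, 0.10],
           [0.2, 0.2, 0.1, 0.0, 0.55]]
y_train = [1, 0, 1, 0]

clf = SVC(kernel='rbf').fit(X_train, y_train)

# Filtering: keep only the merged links the classifier accepts.
merged_links = [((3, 5), [0.85, 0.70, 0.60, 1.0, 0.08]),
                ((3, 9), [0.15, 0.40, 0.30, 0.0, 0.50])]
kept = [link for link, feats in merged_links
        if clf.predict([feats])[0] == 1]
print(kept)
```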
22. SVM filter evaluation

              MEBA     COWAL    MEBA filtered   YAWA+MEBA filtered
  Precision   0.9122   0.8795   0.9315          0.8830
  Recall      0.6976   0.7775   0.6712          0.7713
  F-measure   0.7924   0.8254   0.7802          0.8234

SVM filtering results. The SVM model was trained on the NAACL 2003 gold standard.
23. Romanian Acquis
- The available Romanian documents were downloaded from CCVISTA (over 12,000 Microsoft Word documents).
- We kept only 11,228 files (some of the others were different versions of the same document).
- The remaining documents were converted into the same XML format as the ACQUIS corpus.
- Of the 11,228 Romanian files, only 6,256 are also available in English in the JRC distribution.
24. Romanian Acquis
- Tokenization
- Sentence splitting
- POS-tagging
- Lemmatization
- Chunking
- Sentence aligning