1
An integrated platform for high-accuracy word
alignment
  • Dan Tufiş, Alexandru Ceauşu, Radu Ion,
    Dan Ştefănescu
  • RACAI Research Institute for Artificial
    Intelligence, Bucharest

2
COWAL
  • The main task of COWAL is to combine the output
    of two or more comparable word aligners.
  • To support this task, COWAL is also an integrated
    platform with modules for tokenization,
    POS-tagging, lemmatization, collocation detection,
    dependency annotation, chunking and word sense
    disambiguation.

3
Word alignment algorithms (YAWA)
  • YAWA starts with all plausible links (those with
    a log-likelihood score higher than 11).
  • Then, using a competitive linking strategy, it
    retains the links that maximize the sentence
    translation equivalence score while minimizing
    the number of crossing links (see the sketch
    below).
  • In this way, it generates only 1-1 alignments;
    N-M alignments are possible only when chunking
    and/or dependency linking is available.
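
A minimal sketch of the competitive-linking step, assuming a
precomputed score table; the crossing-link minimization is only
hinted at by the greedy order, and all names here are illustrative,
not YAWA's actual API:

  def competitive_linking(scores, threshold=11.0):
      """Greedy 1-1 linking: repeatedly take the highest-scoring
      remaining pair whose tokens are both still unlinked.
      scores maps (src_pos, tgt_pos) -> log-likelihood score."""
      used_src, used_tgt, links = set(), set(), []
      # Keep only plausible links (LL score above the threshold).
      candidates = sorted((p for p, s in scores.items() if s > threshold),
                          key=lambda p: scores[p], reverse=True)
      for i, j in candidates:
          if i not in used_src and j not in used_tgt:
              links.append((i, j))  # each token ends up in at most one link
              used_src.add(i)
              used_tgt.add(j)
      return links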

4
Word alignment algorithms (MEBA)
  • MEBA iterates several times over each pair of
    aligned sentences, at each iteration adding only
    the highest-scoring links (sketched below).
  • The links already established in previous
    iterations give support to, or create
    restrictions for, the links to be added in a
    subsequent iteration.
  • MEBA uses different weights and different
    significance thresholds for each feature and
    iteration step.
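
A rough sketch of the iteration scheme, assuming a placeholder
scoring function; MEBA's actual weights, thresholds, and feature
combination are not reproduced here:

  def meba_align(candidate_links, score_link, schedule):
      """Iterative alignment: each pass has its own feature weights
      and significance threshold; links fixed earlier support or
      restrict later candidates (via score_link's context argument)."""
      established = []
      for weights, threshold in schedule:
          for link in candidate_links:
              if link in established:
                  continue
              if score_link(link, established, weights) >= threshold:
                  established.append(link)
      return established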

5
Features characterizing a link
  • A link <Token1, Token2> is characterized by a set
    of features whose values are real numbers in the
    [0, 1] interval.
  • Context-independent features (CIF) refer only to
    the tokens of the current link.
  • Context-dependent features (CDF) refer to the
    properties of the current link with respect to
    the rest of the links in a bi-text.

6
Context independent features
  • Translation equivalents (lemma and/or wordform)
  • Translation equivalents entropy (lemma)
  • Part-of-Speech affinity
  • Cognates

7
Translation equivalents (TE)
  • YAWA and TREQ-AL use competitive linking based on
    log-likelihood scores, plus the Ro-En aligned
    wordnets.
  • MEBA uses GIZA-generated candidates filtered with
    a log-likelihood threshold (11).
  • The TE candidate search space is limited by
    lemmatization and POS meta-classes (e.g.
    meta-class 1 includes only N, V, Aj and Adv;
    meta-class 8 includes only proper names).
  • For a pair of languages, translation equivalents
    are computed in both directions. The value of the
    TE feature of a candidate link <TOKEN1, TOKEN2>
    is 1/2 (P_TR(TOKEN1, TOKEN2) + P_TR(TOKEN2,
    TOKEN1)).
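
As a worked example of this formula (names are illustrative; p_tr
would be estimated from the bi-text in each direction):

  def te_feature(tok1, tok2, p_tr):
      # TE(<TOKEN1, TOKEN2>) = 1/2 (P_TR(tok1, tok2) + P_TR(tok2, tok1)),
      # where p_tr[(a, b)] is the translation probability of a into b.
      return 0.5 * (p_tr.get((tok1, tok2), 0.0) + p_tr.get((tok2, tok1), 0.0))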

8
Entropy Score (ES)
  • The entropy of a word's translation-equivalents
    distribution proved to be an important hint for
    identifying highly reliable links (anchoring
    links).
  • Skewed distributions are favored over uniform
    ones (see the sketch below).
  • For a link <A, B>, the link feature value is
    0.5 (ES(A) + ES(B)).
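
A minimal sketch of one plausible entropy score; the 1 - H/log(N)
normalization (high for skewed distributions) is our assumption, not
necessarily the exact formula used by COWAL:

  import math

  def entropy_score(trans_probs):
      # trans_probs: distribution over a word's translation equivalents.
      # ES = 1 - H/log(N): near 1 for a skewed (reliable) distribution,
      # near 0 for a uniform one.  NOTE: assumed normalization.
      n = len(trans_probs)
      if n <= 1:
          return 1.0
      h = -sum(p * math.log(p) for p in trans_probs if p > 0)
      return 1.0 - h / math.log(n)

  def es_feature(es_a, es_b):
      # Link feature value for <A, B>: 0.5 * (ES(A) + ES(B))
      return 0.5 * (es_a + es_b)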

9
Part-of-speech affinity (PA)
  • An important clue in word alignment is the fact
    that translated words tend to keep their
    part-of-speech, and when they have different
    POSes, this is not arbitrary.
  • We tried to use GIZA (replacing tokens with their
    respective POSes), but there was too much noise!
  • The information was computed from a gold standard
    (the revised NAACL 2003 data), in both directions
    (source-target and target-source), as sketched
    below.
  • For a link <A, B>:
    PA = 0.5 (P(cat(A)|cat(B)) + P(cat(B)|cat(A))).
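
A sketch of how the POS-affinity tables could be estimated from gold
links; the data layout ((src_tag, tgt_tag) pairs) is an assumption
made for illustration:

  from collections import Counter

  def pos_affinity_tables(gold_tag_pairs):
      # Estimate P(tgt_cat | src_cat) and P(src_cat | tgt_cat) from
      # gold-standard links given as (src_tag, tgt_tag) pairs.
      pair_c = Counter(gold_tag_pairs)
      src_c = Counter(a for a, _ in gold_tag_pairs)
      tgt_c = Counter(b for _, b in gold_tag_pairs)
      p_ts = {(a, b): c / src_c[a] for (a, b), c in pair_c.items()}  # P(b|a)
      p_st = {(a, b): c / tgt_c[b] for (a, b), c in pair_c.items()}  # P(a|b)
      return p_ts, p_st

  def pa_feature(cat_a, cat_b, p_ts, p_st):
      # PA = 0.5 * (P(cat(A)|cat(B)) + P(cat(B)|cat(A)))
      return 0.5 * (p_st.get((cat_a, cat_b), 0.0) +
                    p_ts.get((cat_a, cat_b), 0.0))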

10
Cognates (COG)
  • The cognates feature assigns a string similarity
    (using Levenshtein distance) to the tokens of a
    candidate link.
  • We estimated the probability that a pair of
    orthographically similar words appearing in
    aligned sentences are cognates, at different
    string-similarity thresholds. For the threshold
    0.6 we didn't find any exception. Therefore, the
    value of this feature is either 1 (if the
    similarity score is above the threshold) or 0
    (otherwise).
  • Before computing the string similarity score, the
    words are normalized (duplicate letters are
    removed, diacritics are removed, some suffixes
    are discarded), as in the sketch below.
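
A self-contained sketch of the cognate test; the exact normalization
rules (in particular which suffixes are stripped) are not reproduced:

  import re
  import unicodedata

  def normalize(word):
      # Lowercase, strip diacritics, collapse duplicate letters.
      # (Suffix stripping is omitted in this sketch.)
      word = unicodedata.normalize("NFD", word.lower())
      word = "".join(c for c in word if not unicodedata.combining(c))
      return re.sub(r"(.)\1+", r"\1", word)

  def similarity(a, b):
      # 1 - Levenshtein(a, b) / max(len): string similarity in [0, 1].
      m, n = len(a), len(b)
      prev = list(range(n + 1))
      for i in range(1, m + 1):
          cur = [i] + [0] * n
          for j in range(1, n + 1):
              cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (a[i - 1] != b[j - 1]))
          prev = cur
      return 1.0 - prev[n] / max(m, n, 1)

  def cog_feature(w1, w2, threshold=0.6):
      # Binary: 1 if the normalized similarity clears the threshold.
      return 1.0 if similarity(normalize(w1), normalize(w2)) >= threshold else 0.0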

11
Context dependent features
  • Locality
  • Links crossed
  • Relative position/Distortion
  • Collocation/Fertility
  • Coherence

12
Collocation
  • Bi-gram lists (content words only) were built
    from each monolingual part of the training
    corpus, using the log-likelihood score (threshold
    of 10) and a minimal occurrence frequency (3) for
    candidate filtering. Collocation probabilities
    are estimated for each surviving bi-gram.
  • If neither token of a candidate link has a
    relevant collocation score with the tokens in its
    neighborhood, the value of this feature is 0.
    Otherwise, the value is the maximum of the
    collocation probabilities of the link's tokens
    (sketched below). Competing links (starting or
    finishing in the same token) are licensed if and
    only if at least one of them has a non-null
    collocation score.
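
A sketch of the link-level part of this feature, assuming the bi-gram
list has already been built and scored (colloc_prob); the window size
and data layout are illustrative:

  def colloc_feature(link, src_sent, tgt_sent, colloc_prob, window=2):
      # Maximum collocation probability of either linked token with
      # the words in its neighborhood; 0 if no relevant collocation.
      i, j = link
      scores = []
      for sent, pos in ((src_sent, i), (tgt_sent, j)):
          lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
          for k in range(lo, hi):
              if k != pos:
                  scores.append(colloc_prob.get((sent[pos], sent[k]), 0.0))
                  scores.append(colloc_prob.get((sent[k], sent[pos]), 0.0))
      return max(scores) if scores else 0.0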

13
Distortion/Relative position
  • Each token on both sides of a bi-text is
    characterized by a position index, computed as
    the ratio between its relative position in the
    sentence and the length of the sentence. The
    absolute value of the difference between the
    tokens' position indexes gives the link's
    obliqueness.
  • The distortion feature of a link is its
    obliqueness: D(link) = OBL(SW_i, TW_j)
    (see the example below).
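
For example, a direct transcription of the definition above:

  def obliqueness(i, j, src_len, tgt_len):
      # Position index = relative position / sentence length;
      # OBL(SW_i, TW_j) is the absolute difference of the two indexes.
      return abs(i / src_len - j / tgt_len)

  # The distortion feature of a link is its obliqueness:
  # D(link) = OBL(SW_i, TW_j)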

14
Localization
  • This feature is relevant with or without the
    chunking or dependency parsing modules. It
    accounts for the degree of cohesion of the links.
  • When the chunking module is available and the
    chunks are aligned via the linking of their
    respective heads, the links starting in one chunk
    should finish in the aligned chunk.
  • When chunking information is not available, the
    link localization is judged against a window, the
    span of which depends on the aligned sentences'
    lengths.
  • Maximum localization (1) is reached when all the
    tokens in the source window are linked to tokens
    in the target window (see the sketch below).
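
One plausible reading of this definition, sketched under the
assumption that localization is the fraction of source-window links
landing inside the target window:

  def localization(links, src_window, tgt_window):
      # links: established (src_pos, tgt_pos) pairs; the windows are
      # sets of positions whose span depends on the sentence lengths.
      in_window = [(i, j) for (i, j) in links if i in src_window]
      if not in_window:
          return 0.0
      inside = sum(1 for (i, j) in in_window if j in tgt_window)
      return inside / len(in_window)  # 1.0 = maximally cohesive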

15
Crossed links
  • The crossed links feature counts (within a window
    whose size depends on the categories of the
    candidates and on the sentences' lengths) the
    links that the candidate would cross.
  • The normalization factor (the maximum number of
    crossable links) is set empirically, based on the
    categories of the link's tokens (sketched below).
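
A minimal sketch; the window restriction and the empirical
normalization table are abstracted into arguments:

  def crossed_links_feature(candidate, links, max_crossable):
      # Two links (i1, j1) and (i2, j2) cross when their source and
      # target orders disagree.
      i1, j1 = candidate
      crossings = sum(1 for (i2, j2) in links if (i1 - i2) * (j1 - j2) < 0)
      return min(crossings / max_crossable, 1.0) if max_crossable else 0.0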

16
EVALUATION: Official ranking
(Top-ranked submissions: U.RACAI.Combined,
L.ISI.Run5.vocab.grow)
20
Word alignment combiners
  • The COWAL (ACL 2005) combiner is fine-tuned for
    the language pair concerned (rule-based).
  • The SVM filter is a language-independent combiner
    (trainable on positive and negative examples).
  • There is a trade-off between human introspection
    and performance.

21
SVM filter
  • Combining word alignments requires the ability to
    distinguish between the correct and the incorrect
    links of the two or more merged alignments. SVM
    technology is particularly well suited to this
    task (see the sketch below).
  • The SVM combiner is a classifier trained on both
    positive and negative examples.
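
A minimal sketch with scikit-learn; the kernel choice and the feature
layout are assumptions, not the original SVM setup:

  from sklearn.svm import SVC

  def train_svm_filter(train_features, train_labels):
      # train_features: one row of link-feature values (TE, ES, PA,
      # COG, collocation, distortion, localization, crossings) per
      # gold link; train_labels: 1 = correct link, 0 = incorrect.
      clf = SVC(kernel="rbf")  # kernel choice is an assumption
      clf.fit(train_features, train_labels)
      return clf

  def filter_links(clf, merged_links, link_features):
      # Keep only the merged links the classifier accepts as correct.
      keep = clf.predict(link_features)
      return [link for link, k in zip(merged_links, keep) if k == 1]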

22
SVM filter evaluation
              MEBA    COWAL   MEBA      YAWA+MEBA
                              filtered  filtered
Precision     0.9122  0.8795  0.9315    0.8830
Recall        0.6976  0.7775  0.6712    0.7713
F-measure     0.7924  0.8254  0.7802    0.8234

SVM filtering results. The SVM model was trained on
the NAACL 2003 gold standard.
23
Romanian Acquis
  • The available Romanian documents were downloaded
    from CCVISTA (over 12000 Microsoft Word
    documents).
  • We kept only 11228 files (some of the downloaded
    documents were different versions of the same
    text).
  • The remaining documents were converted into the
    same XML format as the ACQUIS corpus.
  • Of the 11228 Romanian files, only 6256 are
    available for English in the JRC distribution.

24
Romanian Acquis
  • Tokenization
  • Sentence splitting
  • POS-tagging
  • Lemmatization
  • Chunking
  • Sentence aligning
