Combining Rules and Statistics in the German-to-English MT System METIS-II


1
Combining Rules and Statistics in the
German-to-English MT System METIS-II
  • Michael Carl
  • IAI

2
Overview
  • METIS-II General Introduction
  • Project description
  • Architecture of German to English
  • Detailed Description
  • Source language analysis
  • Dictionary matching and lookup
  • Target language adjustment
  • Translation ranking and selection
  • Target language token generation.

3
METIS-II Time Table
  • EU Project within IST-STREP
  • Started October 2004
  • Duration: 3 years
  • Until June 2006: exploration of various methods
  • Since June 2006: refinement/integration of
    modules, interface for user adaptation

4
Participants
  • ILSP, Athens: Greek -> English
  • CCL, Leuven: Dutch -> English
  • IAI, Saarbrücken: German -> English
  • UPF, Barcelona: Spanish -> English

5
METIS-II Goals
  • METIS-II is the continuation of METIS-I
  • exploit entities below sentence border
  • METIS-II resources
  • use only 'basic' tools and resources
  • METIS-II does not need a particular format
  • enable different tag-set for SL and TL
  • plug in different taggers
  • METIS-II can be used for several languages

6
Required Resources
  • Bilingual Dictionary
  • German to English
  • Basic 'linguistic' tools
  • SL and TL tagger, chunker
  • Monolingual TL Corpus
  • BNC (10^8 words, 10^6 sentences)
  • Parallel Corpora (SMT/EBMT) not required
  • avoid data-acquisition bottleneck

7
METIS-II German to English
  • Generate partial translation hypotheses
  • Store hypotheses in an AND/OR graph
  • Rank best combination of partial translation
    hypotheses
  • Resources needed
  • Bilingual German to English Dictionary
  • Basic 'linguistic' tools: SL and TL tagger,
    chunker
  • Permutation/reordering rules
  • Monolingual TL Corpus: BNC (10^8 words, 10^6
    sentences)
  • Parallel Corpora (SMT/EBMT) not required
  • avoid data-acquisition bottleneck

8
Overview of the System

9
SL Model: German Analysis
  • MPRO
  • lemmatization
  • morphological analyser
  • KURD / FRED (shallow syntax analysis)
  • grammar is basis for
  • Duden Korrektor (German grammar checker)
  • CLAT (Controlled language technology)
  • text indexation
  • pattern-based formalism to detect and mark
    phrases, clauses, topological fields

10
German Grammar
  • Recognised Constituents (flat representation)
  • NPs, PPs, Verbal groups
  • clauses
  • topological fields
  • does not detect/mark relation between
    constituents
  • Method
  • originates in requirements for grammar correction
  • iterative process
  • mark 'secure' patterns
  • disambiguate the pattern

11
Input/Output of German Analyser
  • Das Haus wurde von Hans gekauft
    (The house was from Hans bought)
  • Lemma (lu), wnr, PoS (c), phrase (phr), clause/field (cl):
    {lu=das, wnr=1, c=w, sc=art, phr=np;subjF, cl=hs;vf}
    ,{lu=haus, wnr=2, c=noun, phr=np;subj, cl=hs;vf}
    ,{lu=werden, wnr=3, c=verb, vt=fiv, phr=vg;fiv, cl=hs;lk}
    ,{lu=von, wnr=4, c=w, sc=p, phr=np;nosubjF, cl=hs;mf}
    ,{lu=Hans, wnr=5, c=noun, phr=np;nosubj, cl=hs;mf}
    ,{lu=kaufen, wnr=6, c=verb, vt=ptc2, phr=vg;ptc, cl=hs;rk}.
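A minimal sketch of how such analyser output could be held in memory, assuming one flat attribute-value record per word. The dict layout, the tuple encoding of two-part values, and the helper `by_field` are illustrative assumptions, not the MPRO/KURD format itself.

```python
from collections import defaultdict

# One record per word; attribute names follow the slide (lu, wnr, c, phr, cl).
# Encoding two-part values as tuples is an assumption of this sketch.
ANALYSIS = [
    {"lu": "das", "wnr": 1, "c": "w", "phr": ("np", "subjF"), "cl": ("hs", "vf")},
    {"lu": "haus", "wnr": 2, "c": "noun", "phr": ("np", "subj"), "cl": ("hs", "vf")},
    {"lu": "werden", "wnr": 3, "c": "verb", "phr": ("vg", "fiv"), "cl": ("hs", "lk")},
    {"lu": "von", "wnr": 4, "c": "w", "phr": ("np", "nosubjF"), "cl": ("hs", "mf")},
    {"lu": "Hans", "wnr": 5, "c": "noun", "phr": ("np", "nosubj"), "cl": ("hs", "mf")},
    {"lu": "kaufen", "wnr": 6, "c": "verb", "phr": ("vg", "ptc"), "cl": ("hs", "rk")},
]


def by_field(words):
    """Group lemmas by topological field (second part of the cl value)."""
    fields = defaultdict(list)
    for word in words:
        fields[word["cl"][1]].append(word["lu"])
    return dict(fields)


FIELDS = by_field(ANALYSIS)
```

The grouping shows the flat representation at work: the fields (vf, lk, mf, rk) are marked per word, while relations between constituents are not represented.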

12
Translation Model: Dictionary Look-up

13
German-to-English Dictionary
  • > 600,000 entries
  • Independent tag sets in SL and TL
  • Single- and multi word units, phrase translations
  • Represented as flat trees
  • leaves contain lexical information
  • mother node contains meta information

14
Goals of Dictionary Look-up
  • Discontinuous Entries
  • separable prefix: lehnt ... ab <--> reject
  • reflexive verbs: sich ... beeilen <--> hurry up
  • support verbs: in Gefahr bringen <--> endanger
  • idioms: vom Mund ablesen <--> lip-read
  • Lexical Overgeneration
  • lex.-sem. ambiguities: Bank <--> bank | bench
  • main/aux. verb: werden <--> will | be | become
  • negation: nicht <--> do not | not
  • magnifiers/intensifiers: stark <-->
    strong | good | heavy ...
  • prepositions: auf <--> on | in | up | onto ...

15
Types of Discontinuous Verbal Realisations
  • Dictionary entry:
    Anweisung ausführen <--> execute statement
  • Realisation in a subordinate clause (en bloc):
    dass er sofort die Anweisungen ausführt
    (that he immediately the statements executes ...)
  • Realisation in a main clause (left Klammer + Mittelfeld):
    Er führt die Anweisung sofort aus
    (He executes the statement immediately VPREF ...)
  • Realisation in a modal main clause (Mittelfeld + right Klammer):
    Er will die Anweisung sofort ausführen.
    (He will the statement immediately execute.)

16
Dictionary Maintenance and Look-up
  • Structure and Maintenance of the dictionary
  • lemmatisation and morphological analysis of
    entries
  • consistency of entries
  • generation of variants
  • indexation of morphemes
  • Dictionary look-up
  • retrieve entries and filter 'best' matches
  • lexical similarity
  • contextual consolidation

17
Structure of Dictionary Entry
  • {de=einsperren, mde={c=verb}, en=lock_<so.>_away,
    men={c=verb}}.
  • {de=ausführen, mde={c=verb}, en=execute,
    men={c=verb}}.
  • Structure of dictionary entries
  • represented as flat trees
  • contain lexical information and meta information
  • leaves follow canonical representation
  • Problem
  • inflexion, derivation, variation
    --> lemmatisation, analysis of morphological structure

18
Canonical Forms of Dictionary Entries (German side)
  • c=verb: last word of entry is an infinitive
    verb, e.g. tanzen gehen (dancing go)
  • c=noun: last word of entry is a noun (singular, nominative),
    e.g. dritte Welt (third world)
  • c=adj: last word of entry is an
    adjective, e.g. hell grün (light green)
  • c=p: last word of entry is a preposition, e.g.
    in Bezug auf (with respect to)

19
Morphological Analysis of German Entries
  • {de=ausführen, mde={c=verb}, en=execute, men={c=verb}}.
  • ausführen has morphological structure
    ls=aus_führen
  • and several morphological analyses
  • {lu=ausführen, c=noun, ehead={nb=sg, case=acc;dat;nom, g=n}}
  • {lu=ausführen, c=verb, vtyp=fiv, nb=plu, per=1;3, tns=pres}
  • {lu=ausführen, c=verb, vtyp=inf}.
  • Disambiguated entry (according to canonical
    form)
  • {c=verb}@{lu=ausführen, ls=aus_führen, c=verb}.
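A minimal sketch of disambiguation by canonical form, assuming each analysis is a flat dict of feature sets (the nested ehead structure is flattened here for brevity). The `CANONICAL` table encodes the canonical-form constraints for entry categories. The encoding is an assumption of this sketch.

```python
# Canonical-form constraints per entry category (assumed encoding).
CANONICAL = {
    "verb": {"c": "verb", "vtyp": "inf"},
    "noun": {"c": "noun", "nb": "sg", "case": "nom"},
    "adj": {"c": "adj"},
    "p": {"c": "p"},
}


def disambiguate(analyses, category):
    """Keep the analyses that are compatible with the canonical form."""
    wanted = CANONICAL[category]
    return [a for a in analyses
            if all(value in a.get(key, ()) for key, value in wanted.items())]


# The three analyses of "ausführen"; ambiguous values are sets.
ANALYSES = [
    {"c": {"noun"}, "nb": {"sg"}, "case": {"acc", "dat", "nom"}, "g": {"n"}},
    {"c": {"verb"}, "vtyp": {"fiv"}, "nb": {"plu"}, "per": {"1", "3"}, "tns": {"pres"}},
    {"c": {"verb"}, "vtyp": {"inf"}},
]

VERB_READING = disambiguate(ANALYSES, "verb")
```

For an entry marked c=verb only the infinitive analysis survives, which is the reading kept in the disambiguated entry.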

20
Lexical Variation
  • Abfertigung des Gepäcks --> Gepäckabfertigung
    (check-in of the luggage --> luggage check-in)
  • Anzahl der Mitarbeiter --> Mitarbeiteranzahl
    (number of workers --> worker number)
    {c=noun}@{c=noun,ls=anzahl},{c=art,ls=art},{c=noun,ls=mit_arbeiten}.
    --> {c=noun}@{c=noun,ls=mit_arbeiten anzahl}.
  • ausführen --> führen aus
    {c=verb,type=ns}@{c=verb,ls=aus_führen}.
    --> {c=verb,type=hs}@{c=verb,ls=führen},{c=vpref,ls=aus}.

21
Dictionary Look-up: Retrieval and Filtering
  • Retrieve entries which share morphological
    structure
  • Filter best candidates
  • consolidate word order
  • dictionary entry and match in same word-order
  • compute lexical delta
  • find most 'similar' word form
  • contextual consolidation
  • check 'internal' and external context of match

22
Some Surface Realisations of aus_führen
  Word            Lemma           PoS       Derivation  Feature Info
  Ausführbarkeit  ausführbarkeit  noun      bar;heit    nb=sg
  Ausführer       ausführer       noun      er          nb=sg
  Ausführung      ausführung      noun      ung         nb=sg
  ausführen       ausführen       verb      ---         per=1;3, tns=pres
  ausführbar      ausführbar      adv       bar         deg=base
  ausführbarer    ausführbar      adv;adj   bar         deg=comp
  ausgeführten    ausgeführt      adj       ptc2        deg=base
  ausgeführteren  ausgeführt      adj       ptc2        deg=comp
  ausgeführt      ausführen       adj;verb  ptc2
  Ausführender    ausführend      adj       ptc1        deg=base
  ausführend      ausführend      adv       ptc1

23
Lexical Delta for Morph. Structure aus_führen
  • Dictionary entries
  • ausführen <--> export (verb)
    {lu=ausführen, c=verb, nb=plu, per=1;3, tns=pres}
  • ausgeführt <--> executed (participle)
    {lu=ausgeführt, c=adj, ptc2, deg=base}
  • Inflected German forms in sentence
  • ausführst (inflected verb)
    {lu=ausführen, c=verb, nb=sg, per=2, tns=pres}
    --> match ausführen <--> export
  • ausführende (present participle)
    {lu=ausführend, c=adj, ptc1, deg=base}
    --> match ausgeführt <--> executed
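The slides do not define the lexical delta itself. As an illustration only, the sketch below assumes it is the number of features on which a word form and a dictionary entry disagree, and picks the entry with the smallest count. The feature dicts restate the example above. `lexical_delta` and `best_match` are hypothetical names.

```python
def lexical_delta(form, entry):
    """Count features that differ between a word form and an entry.
    A feature missing on one side counts as a difference."""
    keys = set(form) | set(entry)
    return sum(1 for key in keys if form.get(key) != entry.get(key))


def best_match(form, entries):
    """Return the entry whose features are most similar to the form."""
    return min(entries, key=lambda e: lexical_delta(form, e["features"]))


ENTRIES = [
    {"de": "ausführen", "en": "export",
     "features": {"c": "verb", "nb": "plu", "per": "1;3", "tns": "pres"}},
    {"de": "ausgeführt", "en": "executed",
     "features": {"c": "adj", "ptc": "2", "deg": "base"}},
]

AUSFUEHRST = {"c": "verb", "nb": "sg", "per": "2", "tns": "pres"}
AUSFUEHRENDE = {"c": "adj", "ptc": "1", "deg": "base"}
```

With this measure the inflected verb selects the verbal entry and the present participle selects the participle entry, as in the example.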

24
Contextual Consolidation of 'verbal' Entries (1)
  • Anweisung ausführen <--> execute statement
    (main clause)
  • If ( (verbal part of entry is left Klammer) and
    (nominal part of entry is in Mittelfeld))
    then consolidate match
    end
  • Er führt die Anweisung sofort aus
    (He executes the statement immediately VPREF ...)

25
Contextual Consolidation of 'verbal' Entries (2)
  • Anweisung ausführen <--> execute statement
    (modal main clause)
  • If ( (verbal part of entry is right Klammer) and
    (nominal part of entry is in Mittelfeld))
    then consolidate match
    end
  • Er will die Anweisung sofort ausführen.
    (He will the statement immediately execute.)
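The two consolidation rules for verbal entries can be written as one predicate over topological fields. The field labels lk (left Klammer), mf (Mittelfeld), rk (right Klammer) and vf follow the analyser output. The function name and signature are assumptions of this sketch.

```python
def consolidate_verbal(verbal_field, nominal_field):
    """Accept a verb+noun entry match based on topological fields."""
    nominal_ok = nominal_field == "mf"
    main_clause = verbal_field == "lk" and nominal_ok        # rule (1)
    modal_main_clause = verbal_field == "rk" and nominal_ok  # rule (2)
    return main_clause or modal_main_clause


# "Er führt die Anweisung sofort aus": verb in lk, noun in mf.
MAIN_OK = consolidate_verbal("lk", "mf")
# "Er will die Anweisung sofort ausführen": verb in rk, noun in mf.
MODAL_OK = consolidate_verbal("rk", "mf")
```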

26
Contextual Consolidation of Nominal Entries
  • Abbau der Ozonschicht <--> depletion of ozone
    (within a noun phrase, np)
  • Only additional adjectives may modify the entry:
    Abbau der arktischen Ozonschicht
    (depletion of arctic ozone)
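The adjective constraint can be checked with a single pass over the noun phrase. In this sketch the phrase is assumed to be a list of (lemma, PoS) pairs, and the function name is hypothetical.

```python
def consolidate_nominal(entry_lemmas, phrase):
    """True if the phrase contains the entry lemmas in order and every
    extra word between or around them is an adjective."""
    i = 0
    for lemma, pos in phrase:
        if i < len(entry_lemmas) and lemma == entry_lemmas[i]:
            i += 1          # next word of the entry found
        elif pos != "adj":
            return False    # only adjectives may intervene
    return i == len(entry_lemmas)


ENTRY = ["abbau", "der", "ozonschicht"]
WITH_ADJECTIVE = [("abbau", "noun"), ("der", "art"),
                  ("arktisch", "adj"), ("ozonschicht", "noun")]
WITH_NOUN = [("abbau", "noun"), ("der", "art"),
             ("erde", "noun"), ("ozonschicht", "noun")]
```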

27
Output of Dictionary Look-up
{lu=das,wnrr=1,c=w,sc=art, ...}
  @{c=art,n=146471}@{lu=the,c=AT0}. .
,{lu=Haus,wnrr=2,c=noun, ...}
  @{c=noun,n=268244}@{lu=company,c=NN1}.
  ,{c=noun,n=268246}@{lu=home,c=NN1}.
  ,{c=noun,n=268247}@{lu=house,c=NN1}.
  ,{c=noun,n=268249}@{lu=site,c=NN1}. .
,{lu=werden,wnrr=3,c=verb,vtyp=fiv, ...}
  @{c=verb,n=604071}@{lu=be,c=VBD}.
  ,{c=verb,n=604076}@{lu=will,c=VM0}. . ...

28
Discontinuous Match
Das geht, solange es Frauen gibt, nie vor die Hunde.
vor die Hunde gehen <--> go to the dogs | be buggered
{lu=gehen...vor;der;hund, wnrr=2;10;11;12, c=verb, mark=cl;hs}
  @{c=verb,n=13}@{lu=go,c=VVB;VVD;VVI;VVN;VVZ}
    ,{lu=to,c=TO0;PRP}, {lu=the,c=AT0}
    ,{lu=dog,c=NN2;NN1}.
  ,{c=verb,n=14}@{lu=be,c=VBB;VBD;VBI;VBN;VBZ}
    ,{lu=bugger,c=VVN;VVD}. . ...

29
Translation Model: Expander

30
Expander
  • Rule-based device to adjust word order
  • insert/delete/permute/modify translation
    hypotheses in AND/OR graph
  • insert article: Hans ist Lehrer --> Hans is a
    teacher
  • verbal group: Das Haus wurde von Hans gekauft
    --> The house was bought by Hans.
  • add hypotheses: Die Milch trinkt die Katze.
    --> (The cat drinks the milk. | The milk drinks
    the cat.)
    Peters Auto --> (Peter's car | the car of Peter)

31
Example of an Expander Rule
Hans hat das Haus gekauft. --> Hans hat gekauft das Haus.
Hans has the house bought. --> Hans has bought the house.
V N P --> V P N
ReorderFinVerb_hs =
  Ve{mark=hs}e{mark=vg_fiv},
  Ne{mark=hs}a{mark=vg_ptc;vg_inf},
  Pe{mark=hs}e{mark=vg_ptc}
  p(move=V->VPN).
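The effect of this reordering (V N P --> V P N) can be sketched on a list of (word, mark) pairs. This is a simplification under stated assumptions: it is not the rule formalism shown above, P is a single token, and the marks `np`, `vg_fiv`, `vg_ptc` are attached directly to tokens.

```python
def reorder_fin_verb(tokens):
    """Move the participle directly behind the finite verb:
    V N P --> V P N. Returns the input unchanged if no V ... P is found."""
    marks = [mark for _, mark in tokens]
    if "vg_fiv" not in marks:
        return tokens
    v = marks.index("vg_fiv")
    p = next((i for i in range(v + 1, len(tokens))
              if marks[i] == "vg_ptc"), None)
    if p is None:
        return tokens
    return tokens[:v + 1] + [tokens[p]] + tokens[v + 1:p] + tokens[p + 1:]


SENTENCE = [("Hans", "np"), ("has", "vg_fiv"), ("the", "np"),
            ("house", "np"), ("bought", "vg_ptc")]
REORDERED = [word for word, _ in reorder_fin_verb(SENTENCE)]
```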
32
Negation
  • Negation: Hans kommt nicht. --> (Hans does
    not come. | Hans comes not.)
  • Dictionary: nicht --> not | do not
  • Expander rule:
    Negation_hs2 =
      Ae{mark=hs, mark=vg_fiv}, Be{mark=hs, lu=nicht},
      Ne{mark=hs, lu=nicht}
      p(move=A->ANB).

33
AND/OR Graph for Hans kommt nicht
  • {lu=Hans,c=noun,wnr=1}
      @{c=noun}@{lu=hans,c=NP0}. .
    ,{lu=nicht,c=adv,wnr=3}
      @{c=verb}@{lu=do,c=VDZ},{lu=not,c=XX0}.
      ,{c=adv}@{lu=not,c=XX0}. .
    ,{lu=kommen,c=verb,wnr=2}
      @{c=verb}@{lu=come,c=VVB}.
      ,{c=verb}@{lu=come,c=VVB},{lu=along,c=AVP}.
      ,{c=verb}@{lu=come,c=VVB},{lu=off,c=AVP}.
      ,{c=verb}@{lu=come,c=VVB},{lu=up,c=AVP}. . .
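Read as AND over the source units and OR over their translation options, this graph encodes a small set of full hypotheses. A minimal sketch, assuming the graph is a list of slots with each slot a list of alternative word sequences (tags are dropped for brevity, and the nested-list encoding is illustrative):

```python
from itertools import product

# Slots in graph order: Hans, nicht (already moved before the verb), kommen.
GRAPH = [
    [["hans"]],
    [["do", "not"], ["not"]],
    [["come"], ["come", "along"], ["come", "off"], ["come", "up"]],
]


def expand(graph):
    """AND over slots, OR over alternatives: yield every full hypothesis."""
    for choice in product(*graph):
        yield [word for option in choice for word in option]


HYPOTHESES = list(expand(GRAPH))
```

Here 1 x 2 x 4 = 8 hypotheses result. Exhaustive expansion grows multiplicatively with sentence length, which is why a search engine scores and prunes instead of enumerating.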

34
TL Model: Search Engine

35
Search Engine: Scoring n-best Translations
  • Beam-search algorithm (breadth first)
  • Traverses AND/OR graph to score n-best
    Translations
  • Heuristic Function
  • Feature functions
  • Weighting
  • Log-linear Combination of feature functions
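A minimal sketch of breadth-first beam search with a log-linear score, assuming the AND/OR graph is a list of slots whose alternatives are word sequences. The two feature functions, their weights, and the bigram values are invented for illustration and are not the system's trained models.

```python
# Toy bigram log-probabilities (made up for this sketch).
BIGRAM_LOGPROB = {("hans", "do"): -1.0, ("do", "not"): -0.5,
                  ("not", "come"): -1.0, ("hans", "not"): -3.0}
UNSEEN = -4.0


def lm_feature(prefix, option):
    """Sum of bigram log-probabilities for the words added by option."""
    previous = prefix[-1] if prefix else "<s>"
    total = 0.0
    for word in option:
        total += BIGRAM_LOGPROB.get((previous, word), UNSEEN)
        previous = word
    return total


def length_feature(prefix, option):
    """Penalise longer options."""
    return -float(len(option))


FEATURES = [(1.0, lm_feature), (0.1, length_feature)]  # (weight, function)


def beam_search(slots, features, beam_size=3):
    """Expand slot by slot, keeping the beam_size best partial hypotheses.
    Score = sum over steps of sum_i weight_i * feature_i (log-linear)."""
    beam = [(0.0, [])]
    for alternatives in slots:
        expanded = []
        for score, words in beam:
            for option in alternatives:
                step = sum(w * h(words, option) for w, h in features)
                expanded.append((score + step, words + option))
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_size]
    return beam


SLOTS = [
    [["hans"]],
    [["do", "not"], ["not"]],
    [["come"], ["come", "along"], ["come", "off"], ["come", "up"]],
]
N_BEST = beam_search(SLOTS, FEATURES)
```

The first element of `N_BEST` is the top-ranked hypothesis. Changing the weights in `FEATURES` changes the ranking without touching the graph.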

36
Heuristic Function
  • trained on BNC (10^8 words, 10^6 sentences)
  • Lemma language model (3-gram, 4-gram)
  • Tag language model (5-gram to 7-gram)
  • Lemma/tag co-occurrence model

  lemma          tag   count  w(lem, tag)
  tape-recorder  AJ0       3        1.003
  tape-recorder  NN1      87       22.080
  tape-recorder  NN2      13        3.512
  tape-recorder  <>        0        0.250

37
Search Engine Output
lemma, tag, dictionary, expander rule
<s id=3-0 lp="-9.227912">
the      AT0  146471
company  NN1  268244
is       VBD  604071  PermFinVerb_hs
buy      VVN  307263  PermFinVerb_hs
by       PRP  587268  PermFinVerb_hs
hans     NP0  265524  PermFinVerb_hs
.        PUN  367491
</s>

38
TL Model: Token Generation

39
Reversible Lemmatiser and Token Generator
  • Token Generator (for English): Lemma + Tag -->
    Token, trained on reversible Lemmatiser: Token +
    Tag --> Lemma
  • Reversible lemmatiser
  • based on BNC Lemmatiser (10^8 words)
  • inflection rules are regular expressions
  • augmented with additional tags (for
    reversibility)
  • 10 types of inflection rules for ADJ, NN2, VVG
    ...
  • ca. 200 rules and ca. 1500 lexical exceptions

40
Reversible Lemmatiser
  • Token + Tag <---> Lemma + Tag_O_IR
    Setting VVG <---> set VVG_f_4
  • Lemma: normalised lemma
  • Tag: BNC/CLAWS5 tag set
  • O: orthographic properties of token
  • IR: inflexion rule
  • 100% reversible, no loss of information
    --> but O and IR are not known when generating
41
Lemmatisation and Token Generation Rules
  • Lemmatisation: knowing token + tag
  • apply first matching rule
  • Example
      Tag  token suffix       lemma suffix  token --> lemma
    1 VVG  ffing          --> ff            stuffing --> stuff
    2 VVG  (.{1,3}ll)ing  --> \1            selling --> sell
    3 VVG  ssing          --> ss            kissing --> kiss
    ....
  • Token generation: knowing lemma + tag
  • guess lemmatisation rule
  • apply inverse lemmatisation rule
42
Token Generation
  • Generation while guessing inflexion rule
  • abort VVG ---> aborting VVG
  • Method: guess inflexion rule from suffix of
    lemma.
  • Collect 27,000 suffixes from lemmatised BNC
      Tag  lemma suffix  inflection rules
      VVG  (unknown)     28
      VVG  t              5
      VVG  rt             2
      VVG  ort            2
      VVG  bort           1   (deterministic token generation)
  • The longer the known lemma suffix the better the
    guess
43
Evaluation of Reversible Lemmatiser
  • Lemmatiser
  • 96.18% correct lemmas
  • incorrect mostly for closed class words (he, the,
    a, ...)
  • Token Generator
  • 99.5% correct reproduction of original
    token, tested on 244,500 different word forms
  • incorrect for writing variants: burned / burnt
    VVN --> burn VVN (BNC, British English: burned more
    likely)

44
Conclusion
  • METIS-II German-to-English Basic Idea
  • First use 'secure' symbolic resources
  • generate partial translation hypotheses
  • store hypotheses in an AND/OR graph
  • Then use statistical resources
  • rank best combination of partial translation
    hypotheses
  • integrate various global resources with feature
    functions

45
Main Components
  • Lexicon
  • basic translation equivalences
  • match phrases and discontinuous entries
  • overgeneration
  • Expander
  • structural adjustment
  • permute, insert, delete translation units
  • Search engine
  • rank translation hypotheses
  • use target language knowledge

46
Distribution of Information
  • Search Engine vs. Lexicon:
    stark <--> heavy, strong, large, big; Raucher <--> smoker
    or starker Raucher <--> heavy smoker
  • Expander vs. Lexicon:
    guerra <--> war; civil <--> civil; española <--> spanish
    or guerra civil española <--> spanish civil war

47
Evaluation
  • Evaluation depends on
  • Dictionary and performance of matching algorithm
  • Expander rules
  • Number and weights of feature functions
  • Impact of modifications on BLEU score
  • Changing weights of feature functions: BLEU
    scores from 1.6 to 1.8
  • Modifying expander rules: BLEU scores from 1.6 to
    2.2
  • Test set of 200 German sentences
      NIST    BLEU    lemma LM  tag LM
      5.4801  0.1861  6M-n3     100K-n4
      5.3004  0.2030  5M-n3     5M-n7

48
Future Prospects
  • Enhancement of components
  • Dictionary look-up (maintenance, matching)
  • Generation of discontinuous English
    fragments, e.g. give <sth> away, make <sth> easy
  • Testing more feature functions (lexical weight)
  • Dynamic adaptation to user needs
  • explore automated weighting strategies for
  • Dictionary entries
  • Expander rules