Title: Combining Rules and Statistics in the German-to-English MT System METIS-II
1Combining Rules and Statistics in the
German-to-English MT System METIS-II
2Overview
- METIS-II General Introduction
- Project description
- Architecture of German to English
- Detailed Description
- Source language analysis
- Dictionary matching and lookup
- Target language adjustment
- Translation ranking and selection
- Target language token generation.
3METIS-II Time Table
- EU Project within IST-STREP
- Started October 2004
- Duration: 3 years
- Until June 2006: exploration of various methods
- Since June 2006: refinement/integration of modules, interface for user adaptation
4Participants
- ILSP, Athens: Greek -> English
- CCL, Leuven: Dutch -> English
- IAI, Saarbrücken: German -> English
- UPF, Barcelona: Spanish -> English
5METIS-II Goals
- METIS-II is the continuation of METIS-I
- exploit entities below sentence border
- METIS-II resources
- use only 'basic' tools and resources
- METIS-II does not need a particular format
- enable different tag-set for SL and TL
- plug in different taggers
- METIS-II can be used for several languages
6Required Resources
- Bilingual Dictionary
- German to English
- Basic 'linguistic' tools
- SL and TL tagger, chunker
- Monolingual TL Corpus
- BNC (10^8 words, 10^6 sentences)
- Parallel Corpora (SMT/EBMT) not required
- avoid data-acquisition bottleneck
7METIS-II German to English
- Generate partial translation hypotheses
- Store hypotheses in an AND/OR graph
- Rank best combination of partial translation hypotheses
- Resources needed
- Bilingual German to English Dictionary
- Basic 'linguistic' tools: SL and TL tagger, chunker
- Permutation/reordering rules
- Monolingual TL Corpus: BNC (10^8 words, 10^6 sentences)
- Parallel Corpora (SMT/EBMT) not required
- avoid data-acquisition bottleneck
8 Overview of the System
9SL Model German Analysis
- MPRO
- lemmatization
- morphological analyser
- KURD / FRED (shallow syntax analysis)
- grammar is basis for
- Duden Korrektor (German grammar checker)
- CLAT (Controlled language technology)
- text indexation
- pattern-based formalism to detect and mark
phrases, clauses, topological fields
10German Grammar
- Recognised Constituents (flat representation)
- NPs, PPs, Verbal groups
- clauses
- topological fields
- does not detect/mark relation between constituents
- Method
- originates in requirements for grammar correction
- iterative process
- mark 'secure' patterns
- disambiguate the pattern
11Input/Output of German Analyser
- Das Haus wurde von Hans gekauft (word-by-word gloss: The house was from Hans bought)
Lemma (lu)  wnr  PoS (c, sc, vt)   phrase (phr)   clause/field (cl)
das         1    w, art            np, subjF      hs, vf
haus        2    noun              np, subj       hs, vf
werden      3    verb, fiv         vg, fiv        hs, lk
von         4    w, p              np, nosubjF    hs, mf
Hans        5    noun              np, nosubj     hs, mf
kaufen      6    verb, ptc2        vg, ptc        hs, rk
12Translation Model: Dictionary Look-up
13German-to-English Dictionary
- more than 600,000 entries
- Independent tag sets in SL and TL
- Single- and multi word units, phrase translations
- Represented as flat trees
- leaves contain lexical information
- mother node contains meta information
14Goals of Dictionary Look-up
- Discontinuous Entries
- separable prefix: lehnt ... ab <--> reject
- reflexive verbs: sich ... beeilen <--> hurry up
- support verbs: in Gefahr bringen <--> endanger
- idioms: vom Mund ablesen <--> lip-read
- Lexical Overgeneration
- lex.-sem. ambiguities: Bank <--> bank | bench
- main/aux. verb: werden <--> will | be | become
- negation: nicht <--> do not | not
- magnifiers/intensifiers: stark <--> strong | good | heavy ...
- prepositions: auf <--> on | in | up | onto ...
15Types of Discontinuous Verbal Realisations
- Dictionary entry
- Anweisung ausführen <--> execute statement
- Realisation in a subordinate clause (en bloc)
- dass er sofort die Anweisungen ausführt (that he immediately the statements executes) ...
- Realisation in a main clause (left Klammer and Mittelfeld)
- Er führt die Anweisung sofort aus (He executes the statement immediately VPREF) ...
- Realisation in a modal main clause (Mittelfeld and right Klammer)
- Er will die Anweisung sofort ausführen. (He will the statement immediately execute.)
16Dictionary Maintenance and Look-up
- Structure and Maintenance of the dictionary
- lemmatisation and morphological analysis of entries
- consistency of entries
- generation of variants
- indexation of morphemes
- Dictionary look-up
- retrieve entries and filter 'best' matches
- lexical similarity
- contextual consolidation
17Structure of Dictionary Entry
- {de=einsperren, mde={c=verb}, en=lock_<so.>_away, men={c=verb}}.
- {de=ausführen, mde={c=verb}, en=execute, men={c=verb}}.
- Structure of dictionary entries
- represented as flat trees
- contain lexical information and meta information
- leaves follow canonical representation
- Problem
- inflexion, derivation, variation => lemmatization, analyse morphological structure
18Canonical Forms of Dictionary Entries (German side)
- c=verb: last word of entry is an infinitive verb, e.g. tanzen gehen (dancing go)
- c=noun: last word of entry is noun/sing/nominative, e.g. dritte Welt (third world)
- c=adj: last word of entry is an adjective, e.g. hell grün (light green)
- c=p: last word of entry is a preposition, e.g. in Bezug auf (with respect to)
19Morphological Analysis of German Entries
- {de=ausführen, mde={c=verb}, en=execute, men={c=verb}}.
- ausführen has morphological structure ls=aus_führen
- and several morphological analyses
- {lu=ausführen,c=noun,ehead={nb=sg,case=acc;dat;nom,g=n}}
- {lu=ausführen,c=verb,vtyp=fiv,nb=plu,per=1;3,tns=pres}
- {lu=ausführen,c=verb,vtyp=inf}.
- Disambiguated entry (according to canonical form)
- {c=verb}@{lu=ausführen,ls=aus_führen,c=verb}.
20Lexical Variation
- Abfertigung des Gepäcks --> Gepäckabfertigung (check-in of the luggage --> luggage check-in)
- Anzahl der Mitarbeiter --> Mitarbeiteranzahl (number of worker --> worker number)
- {c=noun}@{c=noun,ls=anzahl},{c=art,ls=art},{c=noun,ls=mit_arbeiten}. --> {c=noun}@{c=noun,ls=mit_arbeiten anzahl}.
- ausführen --> führen aus
- {c=verb,type=ns}@{c=verb,ls=aus_führen}. --> {c=verb,type=hs}@{c=verb,ls=führen},{c=vpref,ls=aus}.
21Dictionary Look-up: Retrieval and Filtering
- Retrieve entries which share morphological structure
- Filter best candidates
- consolidate word order
- dictionary entry and match in same word-order
- compute lexical delta
- find most 'similar' word form
- contextual consolidation
- check 'internal' and external context of match
22Some Surface Realisations of aus_führen
Word            Lemma           PoS       Derivation  Feature Info
Ausführbarkeit  ausführbarkeit  noun      bar, heit   nb=sg
Ausführer       ausführer       noun      er          nb=sg
Ausführung      ausführung      noun      ung         nb=sg
ausführen       ausführen       verb      ---         per=1;3, tns=pres
ausführbar      ausführbar      adv       bar         deg=base
ausführbarer    ausführbar      adv, adj  bar         deg=comp
ausgeführten    ausgeführt      adj       ptc2        deg=base
ausgeführteren  ausgeführt      adj       ptc2        deg=comp
ausgeführt      ausführen       adj, verb ptc2
Ausführender    ausführend      adj       ptc1        deg=base
ausführend      ausführend      adv       ptc1
23Lexical Delta for Morph. Structure aus_führen
- Dictionary entries
- ausführen <--> export (verb): {lu=ausführen,c=verb,nb=plu,per=1;3,tns=pres}
- ausgeführt <--> executed (participle): {lu=ausgeführt,c=adj,ptc2,deg=base}
- Inflected German forms in sentence
- ausführst (inflected verb): {lu=ausführen,c=verb,nb=sg,per=2,tns=pres} --> match ausführen <--> export
- ausführende (present participle): {lu=ausführend,c=adj,ptc1,deg=base} --> match ausgeführt <--> executed
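The slides do not define how the lexical delta is computed. As a minimal sketch, assume it is simply the number of features on which the analysed word form and a candidate dictionary entry disagree, and that the candidate with the smallest delta wins. The feature key `der` (for the ptc1/ptc2 derivation) and the function names are hypothetical; only the feature values come from the slide.

```python
def lexical_delta(form, entry):
    """Count features whose values differ; a feature missing on one side counts as a difference."""
    keys = set(form) | set(entry)
    return sum(1 for key in keys if form.get(key) != entry.get(key))


def best_match(form, entries):
    """Return the dictionary entry with the smallest lexical delta to the word form."""
    return min(entries, key=lambda e: lexical_delta(form, e["features"]))


# Dictionary entries sharing the morphological structure aus_führen (values from the slide).
ENTRIES = [
    {"de": "ausführen", "en": "export",
     "features": {"lu": "ausführen", "c": "verb", "nb": "plu", "per": "1;3", "tns": "pres"}},
    {"de": "ausgeführt", "en": "executed",
     "features": {"lu": "ausgeführt", "c": "adj", "der": "ptc2", "deg": "base"}},
]

# Analyses of two inflected forms found in a sentence.
AUSFUEHRST = {"lu": "ausführen", "c": "verb", "nb": "sg", "per": "2", "tns": "pres"}
AUSFUEHRENDE = {"lu": "ausführend", "c": "adj", "der": "ptc1", "deg": "base"}

print(best_match(AUSFUEHRST, ENTRIES)["en"])    # export
print(best_match(AUSFUEHRENDE, ENTRIES)["en"])  # executed
```

Under this assumed measure the verb form is closest to the verbal entry and the present participle is closest to the adjectival entry, which reproduces the two matches listed on the slide. A real system would probably weight features differently.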
24Contextual Consolidation of 'verbal' Entries (1)
- Anweisung ausführen <--> execute statement
- main clause
- If ((verbal part of entry is left Klammer) and (nominal part of entry is in Mittelfeld)) then consolidate match end
- Er führt die Anweisung sofort aus (He executes the statement immediately VPREF) ...
25Contextual Consolidation of 'verbal' Entries (2)
- Anweisung ausführen <--> execute statement
- modal main clause
- If ((verbal part of entry is right Klammer) and (nominal part of entry is in Mittelfeld)) then consolidate match end
- Er will die Anweisung sofort ausführen. (He will the statement immediately execute.)
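The two consolidation rules for verbal entries can be sketched as a check on topological fields. This is only an illustration: it assumes each token carries its lemma and one of the field labels vf, lk, mf, rk (Vorfeld, left Klammer, Mittelfeld, right Klammer) produced by the German analyser, and the function names are hypothetical.

```python
def field_of(tokens, lemma):
    """Return the topological field of the first token with this lemma, or None."""
    for token in tokens:
        if token["lu"] == lemma:
            return token["field"]
    return None


def consolidate_verbal(tokens, verbal_lemma, nominal_lemma):
    """Accept the match if the verbal part is in the left or right Klammer
    and the nominal part is in the Mittelfeld (rules 1 and 2 combined)."""
    verbal_field = field_of(tokens, verbal_lemma)
    nominal_field = field_of(tokens, nominal_lemma)
    return verbal_field in ("lk", "rk") and nominal_field == "mf"


# Er führt die Anweisung sofort aus (main clause, verb stem in the left Klammer)
MAIN = [
    {"lu": "er", "field": "vf"}, {"lu": "führen", "field": "lk"},
    {"lu": "der", "field": "mf"}, {"lu": "anweisung", "field": "mf"},
    {"lu": "sofort", "field": "mf"}, {"lu": "aus", "field": "rk"},
]
# Er will die Anweisung sofort ausführen (modal main clause, verb in the right Klammer)
MODAL = [
    {"lu": "er", "field": "vf"}, {"lu": "wollen", "field": "lk"},
    {"lu": "der", "field": "mf"}, {"lu": "anweisung", "field": "mf"},
    {"lu": "sofort", "field": "mf"}, {"lu": "ausführen", "field": "rk"},
]
# Die Anweisung führt er sofort aus (nominal part in the Vorfeld, so neither rule fires)
FRONTED = [
    {"lu": "der", "field": "vf"}, {"lu": "anweisung", "field": "vf"},
    {"lu": "führen", "field": "lk"}, {"lu": "er", "field": "mf"},
    {"lu": "sofort", "field": "mf"}, {"lu": "aus", "field": "rk"},
]

print(consolidate_verbal(MAIN, "führen", "anweisung"))      # True
print(consolidate_verbal(MODAL, "ausführen", "anweisung"))  # True
print(consolidate_verbal(FRONTED, "führen", "anweisung"))   # False
```

The sketch only covers the two rules shown on the slides; the subordinate-clause (en bloc) realisation would need its own condition.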
26Contextual Consolidation of nominal Entries
- Abbau der Ozonschicht <--> depletion of ozone
- within a noun phrase (np)
- Only additional adjectives may modify the entry
- Abbau der arktischen Ozonschicht (depletion of arctic ozone)
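The nominal constraint can be illustrated in the same spirit. The sketch below assumes the noun phrase is given as a list of (lemma, PoS) pairs and accepts a match only if every word that is not part of the entry is an adjective; the representation and the function name are assumptions made for this example.

```python
def consolidate_nominal(entry_lemmas, phrase):
    """Match the entry lemmas in order inside one noun phrase,
    allowing only adjectives as additional material."""
    position = 0
    for lemma, pos in phrase:
        if position < len(entry_lemmas) and lemma == entry_lemmas[position]:
            position += 1          # next word of the dictionary entry found
        elif pos != "adj":
            return False           # intervening material that is not an adjective
    return position == len(entry_lemmas)


ENTRY = ["abbau", "der", "ozonschicht"]  # Abbau der Ozonschicht <--> depletion of ozone

# Abbau der arktischen Ozonschicht
WITH_ADJECTIVE = [("abbau", "noun"), ("der", "art"), ("arktisch", "adj"), ("ozonschicht", "noun")]
# Same phrase with an adverb inserted, which the rule does not allow
WITH_ADVERB = [("abbau", "noun"), ("der", "art"), ("heute", "adv"), ("ozonschicht", "noun")]

print(consolidate_nominal(ENTRY, WITH_ADJECTIVE))  # True
print(consolidate_nominal(ENTRY, WITH_ADVERB))     # False
```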
27Output of Dictionary Look-up
{lu=das,wnrr=1,c=w,sc=art, ...}
  @{c=art,n=146471}@{lu=the,c=AT0}. .
,{lu=Haus,wnrr=2,c=noun, ...}
  @{c=noun,n=268244}@{lu=company,c=NN1}. ,
   {c=noun,n=268246}@{lu=home,c=NN1}. ,
   {c=noun,n=268247}@{lu=house,c=NN1}. ,
   {c=noun,n=268249}@{lu=site,c=NN1}. .
,{lu=werden,wnrr=3,c=verb,vtyp=fiv, ...}
  @{c=verb,n=604071}@{lu=be,c=VBD} . ,
   {c=verb,n=604076}@{lu=will,c=VM0} . . ...
28Discontinuous Match
Das geht, solange es Frauen gibt, nie vor die Hunde.
vor die Hunde gehen <---> go to the dogs | be buggered
,{lu=gehen...vor;der;hund,wnrr=2;10;11;12,c=verb,mark=cl;hs}
  @{c=verb,n=13}@{lu=go,c=VVB;VVD;VVI;VVN;VVZ}, {lu=to,c=TO0;PRP}, {lu=the,c=AT0}, {lu=dog,c=NN2;NN1} . ,
   {c=verb,n=14}@{lu=be,c=VBB;VBD;VBI;VBN;VBZ}, {lu=bugger,c=VVN;VVD} . . ...
29Translation Model Expander
30Expander
- Rule-based device to adjust word order
- insert/delete/permute/modify translation hypotheses in AND/OR graph
- insert article: Hans ist Lehrer --> Hans is a teacher
- verbal group: Das Haus wurde von Hans gekauft --> The house was bought by Hans.
- add hypotheses: Die Milch trinkt die Katze. --> (The cat drinks the milk. | The milk drinks the cat.)
- add hypotheses: Peters Auto --> (Peter's car | the car of Peter)
31Example of an Expander Rule
Hans hat das Haus gekauft. --> Hans hat gekauft das Haus.
Hans has the house bought. --> Hans has bought the house.
V N P --> V P N
ReorderFinVerb_hs
  Ve{mark=hs}e{mark=vg_fiv},
  Ne{mark=hs}a{mark=vg_ptc;vg_inf},
  Pe{mark=hs}e{mark=vg_ptc}
  p(move=V->VPN).
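The effect of the reordering rule (V N P --> V P N) can be sketched without the rule formalism itself. The code below is not the expander's rule language; it is a plain illustration that assumes each node carries a lemma and a `mark` value such as vg_fiv (finite verb) or vg_ptc (participle), and it moves the first participle to directly follow the finite verb.

```python
def reorder_fin_verb(nodes):
    """Return a new node list in which the first participle after the finite verb
    is moved to directly follow that verb (V N P --> V P N)."""
    for i, node in enumerate(nodes):
        if node["mark"] != "vg_fiv":
            continue
        for j in range(i + 1, len(nodes)):
            if nodes[j]["mark"] == "vg_ptc":
                # V, then P, then the material N that stood between them, then the rest
                return nodes[: i + 1] + [nodes[j]] + nodes[i + 1 : j] + nodes[j + 1 :]
    return list(nodes)  # rule does not apply: keep the original order


# Hans hat das Haus gekauft.
SENTENCE = [
    {"lu": "hans", "mark": "np"},
    {"lu": "haben", "mark": "vg_fiv"},
    {"lu": "der", "mark": "np"},
    {"lu": "haus", "mark": "np"},
    {"lu": "kaufen", "mark": "vg_ptc"},
]

print([node["lu"] for node in reorder_fin_verb(SENTENCE)])
# ['hans', 'haben', 'kaufen', 'der', 'haus']
```

The function returns a new list and leaves its input untouched, so both orderings could be kept side by side as alternative hypotheses.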
32Negation
- Negation: Hans kommt nicht. --> Hans does not come. | Hans comes not.
- Dictionary: nicht --> not | do not
- Expander Rule
Negation_hs2
  Ae{mark=hs,mark=vg_fiv},
  Be{mark=hs,lu=nicht},
  Ne{mark=hs,lu=nicht}
  p(move=A->ANB).
33AND/OR Graph for Hans kommt nicht
{lu=Hans,c=noun,wnr=1}
  @{c=noun}@{lu=hans,c=NP0}. .
,{lu=nicht,c=adv,wnr=3}
  @{c=verb}@{lu=do,c=VDZ},{lu=not,c=XX0}. ,
   {c=adv}@{lu=not,c=XX0}. .
,{lu=kommen,c=verb,wnr=2}
  @{c=verb}@{lu=come,c=VVB}. ,
   {c=verb}@{lu=come,c=VVB},{lu=along,c=AVP}. ,
   {c=verb}@{lu=come,c=VVB},{lu=off,c=AVP}. ,
   {c=verb}@{lu=come,c=VVB},{lu=up,c=AVP}. . ...
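To make the AND/OR structure concrete, here is a minimal sketch that encodes the graph for "Hans kommt nicht" as nested Python lists and enumerates every full translation hypothesis. The nested-list encoding is an assumption chosen for illustration: the outer list is the AND level (one slot per source word, in the order shown on the slide), each slot is an OR list of alternatives, and each alternative is a sequence of (lemma, tag) pairs.

```python
from itertools import product

GRAPH = [
    # Hans
    [[("hans", "NP0")]],
    # nicht
    [[("do", "VDZ"), ("not", "XX0")],
     [("not", "XX0")]],
    # kommen
    [[("come", "VVB")],
     [("come", "VVB"), ("along", "AVP")],
     [("come", "VVB"), ("off", "AVP")],
     [("come", "VVB"), ("up", "AVP")]],
]


def expand(graph):
    """Yield every hypothesis: one alternative per OR node, concatenated in AND order."""
    for choice in product(*graph):
        yield [token for alternative in choice for token in alternative]


hypotheses = list(expand(GRAPH))
print(len(hypotheses))  # 8
print(" ".join(lemma for lemma, _ in hypotheses[0]))  # hans do not come
```

The number of hypotheses is the product of the alternatives per slot (1 x 2 x 4 here), which is why a search procedure that prunes is needed for longer sentences.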
34 TL Model Search Engine
35Search Engine: Scoring n-best Translations
- Beam-search algorithm (breadth first)
- Traverses AND/OR graph to score n-best translations
- Heuristic Function
- Feature functions
- weighting
- Log-linear combination of feature functions
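A log-linear combination scores a hypothesis as the weighted sum of its feature function values, score(h) = sum_i w_i * f_i(h). The sketch below combines this with a breadth-first beam search over the AND/OR graph. It is an illustration under stated assumptions, not the METIS-II search engine: the graph encoding, the two toy feature functions and all log-probability values are invented for the example.

```python
GRAPH = [
    [[("hans", "NP0")]],
    [[("do", "VDZ"), ("not", "XX0")], [("not", "XX0")]],
    [[("come", "VVB")], [("come", "VVB"), ("along", "AVP")]],
]

# Toy bigram log-probabilities (invented values, not estimated from the BNC).
LEMMA_BIGRAMS = {("hans", "do"): -1.0, ("do", "not"): -0.5,
                 ("not", "come"): -1.0, ("hans", "not"): -3.0}
TAG_BIGRAMS = {("NP0", "VDZ"): -1.0, ("VDZ", "XX0"): -0.5, ("XX0", "VVB"): -1.0}


def lemma_lm(hyp):
    lemmas = [lemma for lemma, _ in hyp]
    return sum(LEMMA_BIGRAMS.get(pair, -5.0) for pair in zip(lemmas, lemmas[1:]))


def tag_lm(hyp):
    tags = [tag for _, tag in hyp]
    return sum(TAG_BIGRAMS.get(pair, -4.0) for pair in zip(tags, tags[1:]))


def log_linear(hyp, features, weights):
    """Weighted sum of feature function values."""
    return sum(w * f(hyp) for f, w in zip(features, weights))


def beam_search(graph, features, weights, beam_width=3):
    """Extend all partial hypotheses slot by slot, keeping the beam_width best."""
    beam = [[]]
    for or_node in graph:
        candidates = [partial + list(alt) for partial in beam for alt in or_node]
        candidates.sort(key=lambda h: log_linear(h, features, weights), reverse=True)
        beam = candidates[:beam_width]
    return [(h, log_linear(h, features, weights)) for h in beam]


results = beam_search(GRAPH, [lemma_lm, tag_lm], [1.0, 0.5])
best, score = results[0]
print(" ".join(lemma for lemma, _ in best), score)  # hans do not come -3.75
```

Changing the weights shifts the balance between the lemma model and the tag model, which is the kind of adjustment whose effect on BLEU is reported in the evaluation slide.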
36Heuristic Function
- trained on BNC (10^8 words, 10^6 sentences)
- Lemma Language Model (3-gram, 4-gram)
- Tag Language Model (5-gram to 7-gram)
- Lemma/tag co-occurrence model
lemma          tag   count   w(lem|tag)
tape-recorder  AJ0   3       1.003
tape-recorder  NN1   87      22.080
tape-recorder  NN2   13      3.512
tape-recorder  <>    0       0.250
37Search Engine Output
Columns: lemma, tag, dictionary entry, expander rule
<s id=3-0 lp="-9.227912">
the      AT0  146471
company  NN1  268244
is       VBD  604071  PermFinVerb_hs
buy      VVN  307263  PermFinVerb_hs
by       PRP  587268  PermFinVerb_hs
hans     NP0  265524  PermFinVerb_hs
.        PUN  367491
</s>
38TL Model Token Generation
39Reversible Lemmatiser and Token Generator
- Token Generator (for English): Lemma, Tag --> Token
- trained on reversible Lemmatiser: Token, Tag --> Lemma
- Reversible lemmatiser
- based on BNC Lemmatiser (10^8 words)
- inflection rules are regular expressions
- augmented with additional tags (for reversibility)
- 10 types of inflection rules for ADJ, NN2, VVG ...
- ca. 200 rules and ca. 1500 lexical exceptions
40Reversible Lemmatiser
- Token Tag <---> Lemma Tag O IR
- Setting VVG <---> set VVG_f_4
- Lemma: normalised lemma
- Tag: BNC/CLAWS5 tag set
- O: orthographic properties of token
- IR: inflexion rule
- 100% reversible, no loss of information --> but O and IR are not known when generating
41Lemmatisation and Token Generation Rules
- Lemmatisation knowing token tag
- apply first matching rule
- Example
No.  Tag  token suffix --> lemma suffix   token --> lemma
1    VVG  ffing --> ff                    stuffing --> stuff
2    VVG  (.{1,3}ll)ing --> \1            selling --> sell
3    VVG  ssing --> ss                    kissing --> kiss
....
- Token generation knowing lemma tag
- guess lemmatisation rule
- apply inverse lemmatisation rule
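The two directions can be sketched with Python regular expressions. The three VVG rules are taken from the example on this slide, reading the second pattern as `(.{1,3}ll)ing` (an assumption about the original notation). The inverse table and all function names are hypothetical and written by hand for this illustration; the slides do not show how the inverse rules are stored.

```python
import re

# (rule number, tag, token pattern, lemma replacement); first matching rule wins.
RULES = [
    (1, "VVG", re.compile(r"ffing$"), "ff"),
    (2, "VVG", re.compile(r"(.{1,3}ll)ing$"), r"\1"),
    (3, "VVG", re.compile(r"ssing$"), "ss"),
]

# Hand-written inverses: rule number -> (lemma suffix, token suffix).
INVERSE = {1: ("ff", "ffing"), 2: ("ll", "lling"), 3: ("ss", "ssing")}


def lemmatise(token, tag):
    """Apply the first rule whose tag and pattern match; return (lemma, rule number)."""
    for number, rule_tag, pattern, replacement in RULES:
        if rule_tag == tag and pattern.search(token):
            return pattern.sub(replacement, token, count=1), number
    return token, None


def generate(lemma, rule_number):
    """Apply the inverse of a known lemmatisation rule to rebuild the token."""
    lemma_suffix, token_suffix = INVERSE[rule_number]
    if not lemma.endswith(lemma_suffix):
        raise ValueError(f"rule {rule_number} does not apply to {lemma!r}")
    return lemma[: -len(lemma_suffix)] + token_suffix


for token in ("stuffing", "selling", "kissing"):
    lemma, rule = lemmatise(token, "VVG")
    print(token, "-->", lemma, "(rule", str(rule) + ")", "-->", generate(lemma, rule))
```

Storing the rule number next to the lemma is what makes the round trip lossless here. When generating from a bare lemma and tag, that number is unknown and has to be guessed.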
42Token Generation
- Generation while guessing inflexion rule
- abort VVG ---> aborting VVG
- Method: guess inflexion rule from suffix of lemma.
- Collect 27,000 suffixes from lemmatised BNC
Tag  lemma suffix  inflection rules
VVG  (unknown)     28
VVG  t             5
VVG  rt            2
VVG  ort           2
VVG  bort          1   (deterministic token generation)
- The longer the known lemma suffix, the better the guess
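The suffix lookup can be sketched as a longest-match search. The table below holds only the five VVG rows from this slide, mapping (tag, lemma suffix) to the number of candidate inflection rules; the dictionary layout and the function name are assumptions for this illustration.

```python
# (tag, lemma suffix) -> number of candidate inflection rules (values from the slide).
SUFFIX_TABLE = {
    ("VVG", ""): 28,      # unknown lemma suffix
    ("VVG", "t"): 5,
    ("VVG", "rt"): 2,
    ("VVG", "ort"): 2,
    ("VVG", "bort"): 1,   # deterministic token generation
}


def longest_known_suffix(lemma, tag, table):
    """Try the whole lemma first, then ever shorter suffixes, ending with the empty suffix."""
    for start in range(len(lemma) + 1):
        suffix = lemma[start:]
        if (tag, suffix) in table:
            return suffix, table[(tag, suffix)]
    return None, 0


for lemma in ("abort", "start", "sing"):
    suffix, candidates = longest_known_suffix(lemma, "VVG", SUFFIX_TABLE)
    print(lemma, repr(suffix), candidates)
# abort 'bort' 1
# start 'rt' 2
# sing '' 28
```

For "abort" the longest known suffix leaves a single candidate rule, so generation is deterministic. Shorter matches leave several candidates, and some further choice among them is then needed.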
43Evaluation of Reversible Lemmatiser
- Lemmatiser
- 96.18% correct lemmas
- incorrect mostly for closed class words (he, the, a, ...)
- Token Generator
- 99.5% correct reproduction of original token
- tested on 244,500 different wordforms
- incorrect for writing variants: burned / burnt VVN --> burn VVN
- in the BNC (British English), burned is more likely
44Conclusion
- METIS-II German-to-English Basic Idea
- First use 'secure' symbolic resources
- generate partial translation hypotheses
- store hypotheses in an AND/OR graph
- Then use statistical resources
- rank best combination of partial translation hypotheses
- integrate various global resources with feature functions
45Main Components
- Lexicon
- basic translation equivalences
- match phrases and discontinuous entries
- overgeneration
- Expander
- structural adjustment
- permute, insert, delete translation units
- Search engine
- rank translation hypotheses
- use target language knowledge
46Distribution of Information
- Search Engine vs. Lexicon
- stark <--> heavy, strong, large, big
- Raucher <--> smoker
- or starker Raucher <--> heavy smoker
- Expander vs. Lexicon
- guerra <--> war
- civil <--> civil
- española <--> spanish
- or guerra civil española <--> spanish civil war
47Evaluation
- Evaluation depends on
- Dictionary and performance of matching algorithm
- Expander rules
- Number and weights of feature functions
- Impact of modifications on BLEU score
- Changing weights of feature functions: BLEU scores from 1.6 to 1.8
- Modifying expander rules: BLEU scores from 1.6 to 2.2
- Test set of 200 German sentences
NIST    BLEU    lemma LM  tag LM
5.4801  0.1861  6M-n3     100K-n4
5.3004  0.2030  5M-n3     5M-n7
48Future Prospects
- Enhancement of components
- Dictionary look-up (maintenance, matching)
- Generation of discontinuous English fragments, e.g. give <sth> away, make <sth> easy
- Testing more feature functions (lexical weight)
- Dynamic adaptation to user needs
- explore automated weighting strategies for
- Dictionary entries
- Expander rules