Title: Crosslanguage algorithms: the progressive conflation of the MT and IR paradigms
1. Cross-language algorithms: the progressive conflation of the MT and IR paradigms
- Yorick Wilks
- University of Sheffield
- www.dcs.shef.ac.uk/yorick
- www.nlp.shef.ac.uk
U. Tartu, Estonia, Feb. 05.
2. Main points of the talk
- The empirical-rational MT stand-off in the early Nineties: what happened then and next?
- What was the stone soup metaphor? The piecemeal research agenda for the Nineties that took over all NLP.
- The underlying problem for statistical MT was data sparseness, but was the answer just more data?
- The web as ultimate data: gains and losses.
- Meanwhile, MT not only disintegrated as a task but itself became integrated into others!
- E.g. information retrieval, extraction, and question answering: ALL THESE CAN NOW BE CALLED MT!!!
- Difficulty now of locating MT intellectually, but its continuing paramount importance to NLP.
3. Stone soup days (some who were there can't remember the point of the metaphor!!)
- IBM's CANDIDE, a wholly statistical, corpus-based F-E/E-F MT system, was evaluated against commercial systems and other DARPA symbolic systems, e.g. PANGLOSS.
- CANDIDE never beat SYSTRAN over texts on which neither had been trained.
- The stone soup analogy focussed on the way that Jelinek and Brown at IBM began to add modules to CANDIDE which were statistically based but linguistically motivated.
- Hence, what then was the statistical magic stone that made the soup??
- CANDIDE was composed of statistically-based modules (e.g. alignment), and creating more such modules of greater complexity (e.g. statistical parsing, word-sense disambiguation etc.) became the NLP agenda.
- But the component modules were not all evaluable against gold-standard data in the way MT was.
- Hence the problem of losing MT as an evaluation paradigm for NLP/CL.
4. The barrier to further advance with the CANDIDE paradigm was data sparseness
- You can think about this as the way the repetitions of n-grams drop off with increasing n for a corpus of any imaginable size.
- A system that had noted COWS EAT and LIONS EAT would probably have no idea what to do with ELEPHANTS EAT (not to mention PRINTERS EAT PAPER).
- A standard way of putting this is that language consists of large numbers of rare events, but the scale of this is not always realised.
- Jelinek himself became interested in what seem to be symbolic methods of classification to reduce this sparseness, e.g. semantic annotations and classifications.
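The drop-off in n-gram repetition can be seen even on a toy scale. The sketch below is only an illustration: the corpus is invented, and `singleton_fraction` is a hypothetical helper name. It measures, for each n, what share of distinct n-grams occurs exactly once, and shows that an unseen pair such as ELEPHANTS EAT simply has a raw count of zero.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def singleton_fraction(tokens, n):
    """Fraction of distinct n-gram types that occur exactly once."""
    counts = Counter(ngrams(tokens, n))
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus, invented for illustration only.
text = "cows eat grass . lions eat meat . cows eat hay . lions eat zebras ."
tokens = text.split()

# The share of once-only n-grams grows as n increases.
for n in (1, 2, 3):
    print(n, round(singleton_fraction(tokens, n), 2))

# A plausible but unseen pair has no count at all.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[("elephants", "eat")])
```

On a real corpus the same trend holds at every scale, which is the point of the slide: adding data shifts the curve but does not remove the tail of unseen combinations.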
5. A home-grown example
- Suppose you ask the following very simple question:
- In the British National Corpus (BNC, 100m words, now a tiny corpus), suppose we find all the finite verbs with objects and ask what proportion of them are unique in that corpus?
6. 85%!
- For quite other (lexical semantic) reasons, a student and I concentrated on those where both the verb and the object word were frequent (i.e. avoiding rare words, which give separate problems: the issue here is only combinatorial!)
- We looked for ones not present at all in 1990, once in 1991-2, but occurring more than 8 times in 1993.
7.
Books made       358, 15822
Eyes studied     4040, 483
Police closed    2551, 1774
Directors make   340, 3757
Phone began      328, 3654
Body opened      1612, 2176
Probe follows    78, 3581
Mouth became     816, 2816
Look says        644, 2976
The table changes if you take the whole web, but not that much.
8. What morals to draw here?
- The figures may suggest that even very very large corpora may not help in the way that a pure statistics method requires.
- Note Amsler's call on the corpora list for a new approach to smaller corpora because of this, and Dunning's claims that more can be got from smaller samples than people think.
- It seems clear people are working with classifications that they cannot have derived purely bottom-up from vast corpora.
- Google creates sets over the whole web of 4bn pages it uses (look at labs.google.com/sets), and they aren't all that good!
- Such empirical semantic set construction was a major research enterprise for Jelinek and Brown in 1990.
- Hence all the current efforts to use WordNet (or to do more Stonesoupery by creating a WordNet substitute on empirical principles).
- The web has provided a new market for MT but, as a vast corpus, it has not yet provided a solution to any problems in MT, given the tools we have.
- Warning note on what may or may not help: look at the success of WSD and its failure to help practical MT systems. Will it stay totally dislocated, as WSD has from IR?
9. Transition to looking at MT and nearby tasks (IE, IR etc.) but staying with very large corpora for the moment, and staying with the optimists.
- Consider Grefenstette's vast lexicon concept and an MT task for it.
- Example 1: you want to translate the collocation XY into another language, and have an appropriate bilingual dictionary with n equivalents for X and m for Y, giving mn combinations.
- You throw all the mn versions of XY at a large target language corpus and rank order the target collocations.
- Take the top one.
- This sounds like asking the audience in Who Wants To Be A Millionaire, but it works rather well!
- But the earlier 85% figure makes you think that maybe it shouldn't, OR that the BNC really is far too small (it's the latter, of course).
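Example 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Grefenstette's implementation: the dictionary fragment and the target-corpus counts are invented, `translate_collocation` is a hypothetical name, and the counts are assumed to record how often two target words occur together as a collocation, regardless of word order.

```python
from collections import Counter
from itertools import product

def translate_collocation(x, y, dictionary, target_counts):
    """Rank every candidate translation of the collocation (x, y).

    dictionary maps a source word to its list of target equivalents;
    target_counts maps a pair of target words to its corpus frequency.
    """
    # All n * m combinations of the equivalents of x and of y.
    candidates = list(product(dictionary[x], dictionary[y]))
    # Most frequent target collocation first; unseen pairs count as zero.
    return sorted(candidates,
                  key=lambda pair: target_counts.get(pair, 0),
                  reverse=True)

# Invented French-English dictionary fragment (n = 3, m = 3).
dictionary = {
    "groupe": ["group", "band", "cluster"],
    "travail": ["work", "labour", "job"],
}

# Invented co-occurrence counts standing in for a large target corpus.
target_counts = Counter({
    ("group", "work"): 120,
    ("group", "job"): 5,
    ("band", "work"): 2,
    ("cluster", "job"): 1,
})

ranked = translate_collocation("groupe", "travail", dictionary, target_counts)
print(ranked[0])    # take the top one
print(len(ranked))  # number of combinations tried
```

The method depends entirely on the target corpus being large enough for the right combination to have been seen at all, which is why the sparseness figure matters.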
10. Example 2: this one is also probably Grefenstette's, but lots of people are having it again in some form.
- Expand the last idea by storing from a vast corpus all forms of Agent-Action-Object triples (i.e. all examples of who does what to whom etc.).
- Use these to resolve ambiguity and interpretation problems of the kind that obsess people who are into concepts like coercion, projection, metonymy etc. in lexical semantics.
- E.g. if in doubt what 'my car drinks gasoline' means, look at the stored things cars do with gasoline and take a guess.
- This isn't a very good algorithm, but it should stir memories of Bar-Hillel's (1959) argument against MT, namely that you couldn't store all the facts in the world you would need to interpret sentences.
- For me, of course, it stirs quite different memories of a more empirical version of the old Preference Semantics (1967) notion of doing interpretation by means of a list of all possible interlingual Agent-Action-Object triples! (only I made the list up!)
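The "look at the stored things cars do with gasoline and take a guess" step might look like the sketch below. Everything in it is assumed for illustration: the `TripleStore` class, the `interpret` function and the counts are hypothetical, and a real system would extract the triples from parsed corpus text.

```python
from collections import Counter

class TripleStore:
    """Counts of (agent, action, object) triples observed in a corpus."""

    def __init__(self):
        self._actions = {}  # (agent, object) -> Counter of actions

    def add(self, agent, action, obj, count=1):
        self._actions.setdefault((agent, obj), Counter())[action] += count

    def actions(self, agent, obj):
        """Actions seen linking agent and object, most frequent first."""
        return self._actions.get((agent, obj), Counter()).most_common()

def interpret(store, agent, action, obj):
    """Keep an attested action; otherwise guess the most frequent stored one."""
    seen = store.actions(agent, obj)
    if not seen:
        return None  # no stored facts to guess from
    if action in dict(seen):
        return action
    return seen[0][0]

# Invented counts standing in for triples harvested from a vast corpus.
store = TripleStore()
store.add("car", "use", "gasoline", 40)
store.add("car", "burn", "gasoline", 25)
store.add("car", "need", "gasoline", 10)

print(interpret(store, "car", "drink", "gasoline"))
print(interpret(store, "printer", "eat", "paper"))
```

The second call returns nothing because no printer/paper triple was stored, which is Bar-Hillel's objection in miniature.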
12. More on the Bar-Hillelish car/road example
- Where one might hope to find that there are not ROADS IN CARS but there are CARS ON ROADS.
- But, conversely and for identical syntactic structure, in:
- HE CANOED DOWN A RIVER IN BRAZIL
- There would be, in the supposed corpus, RIVERS IN BRAZIL but not BRAZIL IN RIVERS.
- So, may there be hope for a vast lexicon of proto-facts derived from a corpus to settle questions of interpretation?
- Will there be enough facts in a corpus of weblike size?
- But so many webfacts are nonfacts (but maybe we need only their forms, not their truth).
- Yet the above example suggests we may need negative facts as well, and there is an INFINITE number of them!
- Maybe there is no escape from some cognitive approach, or is this one too?
13. OK, let's now stand back and look at MT in a wider context. Consider well-known tasks that may be MT or involve MT
- Machine-aided translation (Kay's defence of this as a separate task to be fused with editing technology: remember that came from his total pessimism about MT's future!)
- Multilingual IE based on templates (Gaizauskas, Azzam, Humphreys: templates as interlingua)
- Multilingual IE could grow to become Crosslingual Question Answering (QA) (not quite there yet; could be seen again as a form of template-as-interlingua, as in CLIE).
14. Also, Cross-Language IR (CLIR)
- Cross-language IR (CLIR): initially Salton used a thesaurus as interlingua between documents in different languages; later work used Machine Readable Bilingual Dictionaries (MRDs) to build lexical taxonomies in one language from another, or derived search clusters from bilingual texts.
15. CLIR and MT
- One main difference is that CLIR can still be useful at low precision (recall more important).
- MT, on the other hand, is hard to use if alternatives are included in the output: how much of stuff like the following can you read??
- They decided to have PITCH, TAR, FISH, FISHFOOD for dinner.
16. Forms of CLIR
- Multi/crosslingual IR without interlinguas (significant terms expanded, texts not necessarily aligned, result nearly as good as monolingual)
- Normally using a priori resources for expansion
- MRDs for CLIR (Davis, Ballesteros and Croft)
- Use of Wordnets (i.e. EWN) for CLIR (original aim of EWN project!)
17. Return of Garvin's MT pivot in CLIR
- Metaphor strengthened by use of (old MT) notion of pivot languages in IR.
- Multiple pivot languages to reach same target documents, thus strengthening retrieval (Gollins and Sanderson, SIGIR 01) (parallel CLIR).
- Also, Latvian-English and Latvian-Russian could probably reach any EU language from e.g. Latvian via multiple CLIR pivot retrievals (sequential CLIR based on Russian-X or English-X). (CLARITY project: Sanderson and Gaizauskas, www.nlp.shef.ac.uk).
- This IR usage differs from MT use, where the pivot was an interlingua, not a language (except in the BSO Esperanto case and Aymara), and was used once, never iteratively.
18. Or, using existing MT systems for IR
- Using an MT system to determine terminology in an unknown language (Oh et al. 2001, J-K system)
- Use of a strong established MT system for CLIR (e.g. SYSTRAN; Gachot et al. in Grefenstette (ed.), Cross Language Information Retrieval)
19. Partial MT processing for MRD construction
- Hierarchies in one language created from another (E-ESP; Guthrie, Farwell, Cowie, using LDOCE and Collins)
- EuroWordNet construction from bilingual and monolingual resources (easy and hard way! The easy way is straight lexical MT; the hard way is monolingual models plus the EWN interlingua)
20. Vice-versa: MT and IR metaphors changing places over ten years.
- Some developments in IR are now deemed MT by IR researchers.
- Treating retrieval of one string by another as a form of, or use of, an MT algorithm.
- The last has also been applied to any use of alignment (or any of the IBM Jelinek/Brown tools in CANDIDE), now used to mean MT by transfer when applied back to IR-like tasks, but where the retrieved string can become a document!
21. IR as MT, continued
- More technically, the use of language models in IR (Ponte and Croft, SIGIR 98; Lafferty and Croft 2000).
- Note that this is the reverse of what Sparck Jones predicted in her 2000 article in the AI Journal on the use of IR in AI! (cf. IR as Statistical Translation, Berger and Lafferty, 2001).
- Work at UCB on visual features and collateral text as translations; cf. Jelinek applied to parallel picture corpora, or recognition criteria as truth conditions.
22. Treating retrieval of one string by another as a form of an MT algorithm
- This metaphoric shift rests on using techniques used to develop MT by IBM (including alignment, above).
- Deeming pairs of strings in a retrieval relationship to be in some sense different languages.
- Extreme case: treating QA as a form of MT between two languages.
- FAQ questions and their answer (texts) taken to define a pair of languages in a translation relationship (Berger et al. 2000).
- Theoretical underpinning is matching of language models, i.e. what is the most likely query given this answer (cf. IBM/Jelinek: search for the most probable source given the translation). Cf. same way up as Science: prove the data from the theory but actually infer the theory from the data!!
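"What is the most likely query given this answer" can be made concrete with a generic query-likelihood ranker. The sketch below is not the model of Berger et al. or of Ponte and Croft: it assumes a plain unigram language model per answer, interpolated with a collection-wide model (the weight `lam` and the function name are illustrative choices), and the two answer texts are invented.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) under a smoothed unigram model of the document."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for term in query:
        if coll_counts[term] == 0:
            continue  # term unknown to the whole collection: ignore it
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        # Interpolate document and collection estimates to avoid zeros.
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

# Invented candidate answers, tokenised by whitespace.
answers = [
    "morris released the internet worm in 1988".split(),
    "the eiffel tower is in paris".split(),
]
collection = [token for answer in answers for token in answer]

query = "who released the internet worm".split()
scores = [query_log_likelihood(query, a, collection) for a in answers]
best = max(range(len(answers)), key=lambda i: scores[i])
print(" ".join(answers[best]))
```

The translation-style models go one step further by letting a query word be generated from a different answer word, which is where the IBM alignment machinery comes in.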
23. QA and multilinguality
- Little cross/multilingual QA has been done but it will soon appear, as have CLIE and CLIR.
- It is also a form of MT, and has already been subjected monolingually to pure IR machine learning (Berger et al. 2000) using their new 'IR is MT' paradigm.
- If Qs and As are actually in different languages, it will reinforce their metaphor that they are monolingually as well!!!
- However, progress in CLIR and CLIE (and the DARPA QA track) suggests this will be a task with a large symbolic component (Moldovan and Harabagiu) (even if large chunks can be machine learned). NO CONTRADICTION THERE: cf. Jelinek setting off the learning of individual NLP modules!!
24. Looking in a little more detail (and plugging Sheffield stuff!) at systems of the types mentioned in:
- Cross-language IR
- IE and multilingual IE
- Question answering
25. The parallel CLIR idea: Gollins and Sanderson (2001, www.ir.shef.ac.uk)
- Retrieve documents in another language even though bilingual dictionaries may be unavailable, sparse, incomplete etc.
- IDEA: use different transitive routes and compare (merge) the results
- Hope to reduce the introduced error
- Assume that errors are independent on the different routes
- Assume translations in common are the best ones and thus eliminate independent errors
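The merge step can be sketched as a set intersection over pivot routes. This is a minimal illustration, not Gollins and Sanderson's code: the function names are hypothetical, the dictionary entries are invented toy data, and the choice of German as source with Spanish and Dutch as pivots is an assumption made for the example.

```python
def translate_via(word, to_pivot, from_pivot):
    """Transitive translation source -> pivot -> target, as a set of words."""
    targets = set()
    for pivot_word in to_pivot.get(word, []):
        targets.update(from_pivot.get(pivot_word, []))
    return targets

def triangulate(word, routes):
    """Keep only the target translations on which every route agrees."""
    candidate_sets = [translate_via(word, a, b) for a, b in routes]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)

# Invented dictionary fragments: German source, English target.
german_spanish = {"fisch": ["pez", "pescado"]}
spanish_english = {"pez": ["fish", "pitch", "tar"], "pescado": ["fish"]}
german_dutch = {"fisch": ["vis"]}
dutch_english = {"vis": ["fish"]}

routes = [(german_spanish, spanish_english), (german_dutch, dutch_english)]

# One route alone lets the pivot's other senses leak through.
print(sorted(translate_via("fisch", german_spanish, spanish_english)))
# Merging the routes keeps only what they have in common.
print(sorted(triangulate("fisch", routes)))
```

Intersection is the strictest way to merge; if the routes share nothing it returns an empty set, so a practical system would need a fallback such as the union of the routes.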
26. Lexical Triangulation
(Diagram labels: Dutch, German, Spanish, English, Merge; example term: fisch)
27. Concept Of Triangulation
- A simple noise or error cancellation technique
- A special case of the more general approach of using multiple evidence for retrieval
- Singhal on spoken documents, Bartell on monolingual and McCarley on CLIR
- The three languages used as pivots are not equally independent
- Expect Spanish-Dutch and Italian-Dutch to be better than Spanish-Italian.
30. What is IE?
- Getting information from the content of huge document collections by computer at high speed
- Looking not for key words but for information that fits some template pattern or scenario.
- Delivery of information as a structured database of the template fillers (usually pieces of text)
- Classic IE phase is over and methods now have to be machine learning based (AMILCARE at Sheffield)
31. The Sheffield LaSIE system (for IE)
- LaSIE was Sheffield's MUC-6 entry and is one IE system under on-going development at Sheffield
- Distinctive features of LaSIE:
  - use of a feature-based unification grammar with bottom-up chart parser to do partial parsing
  - parsing of tags rather than lexical entries (no conventional lexicon for parsing)
  - construction of a semantic representation of all of the text
  - reliance on a coreference algorithm and a domain model to extend semantic links not discovered during partial parsing
32. Challenges for IE: Multilinguality
- Most work to date on IE is English only: DARPA MUCs.
- Exceptions:
  - MUC-5 included Japanese extraction task
  - MET: DARPA Multilingual Extraction Task; named entity recognition in Chinese, Japanese and Spanish
  - recent CEC LE projects: ECRAN, AVENTINUS, SPARKLE, TREE, FACILE.
  - French AUPELF ARC-4: potential IE evaluation exercise for French systems
  - Japanese Information Retrieval and Extraction Exercise (IREX): IR and NE evaluation
33. What is a Multilingual IE System?
- Two possibilities:
- An IE system that does monolingual IE in multiple languages.
  - Monolingual IE: IE where source language and extraction language are the same.
  - Extraction language: language of template fills and/or of summaries that an IE system generates.
- An IE system that does cross-lingual IE.
  - Cross-lingual IE (CLIE): IE where source language and extraction language differ.
37. An Architecture for Multilingual IE
- Design objectives for a multilingual IE system:
  - maximise reuse of algorithmic and domain model components
  - minimise language-specific mechanisms and data resources.
- Given these requirements we have opted for approach 3.
- Advantages:
  - new languages can be added independently (no need to consider language pairs)
  - single language-independent conceptual model of domain.
- Is it possible?
40. M-LaSIE Development
- M-LaSIE has been developed for French, English and Spanish.
- English: same modules as the LaSIE system, all developed at Sheffield except the Brill part-of-speech tagger.
- French: morpho-tokenizer module developed at U. de Fribourg; other modules at Sheffield.
- Spanish: tokeniser and parser developed at UPC, Barcelona; these and the morphological analyser and tagger integrated into GATE (www.gate.ac.uk) by UPC; other modules at Sheffield.
41. QA-LaSIE (Gaizauskas)
- Derived from LaSIE: Large Scale Information Extraction System
- LaSIE developed to participate in the DARPA Message Understanding Conferences (MUC-6/7)
  - Template filling (elements, relations, scenarios)
  - Named Entity recognition
  - Coreference identification
- QA-LaSIE is a pipeline of 9 component modules; the first 8 are borrowed (with minor modifications) from LaSIE
- The question document and each candidate answer document pass through all nine components
- Key difference between MUC and QA task: IE template filling tasks are domain-specific; QA is domain-independent
42. TREC-9 250 Byte Runs
43. The TREC QA Track: Task Definition (TREC 8/9)
- Inputs
  - 4GB newswire texts (from the TREC text collection)
  - File of natural language questions (200 TREC-8/700 TREC-9), e.g.
    - Where is the Taj Mahal?
    - How tall is the Eiffel Tower?
    - Who was Johnny Mathis' high school track coach?
- Outputs
  - Five ranked answers per question, including pointer to source document
  - 50 byte category
  - 250 byte category
  - Up to two runs per category per site
- Limitations
  - Each question has an answer in the text collection
  - Each answer is a single literal string from a text (no implicit or multiple answers)
44. Sheffield QA System Architecture
- Overall objective is to use:
  - IR system as fast filter to select small set of documents with high relevance to query from the initial, large text collection
  - IE system to perform slow, detailed linguistic analysis to extract answer from limited set of docs proposed by IR system
45. QA in Detail (1): Question Parsing
- Phrase structure rules are used to parse different question types and produce a quasi-logical form (QLF) representation which contains:
  - a qvar predicate identifying the sought entity
  - a qattr predicate identifying the property or relation whose value is sought for the qvar (this may not always be present).
46. Question Answering in Detail: An Example
Q: Who released the internet worm?
A: Morris testified that he released the internet worm
Sentence Score: 2
Entity Score (e1): 0.91
Total (normalized): 0.97
47. Conclusions on QA
- Our TREC-9 test results represent a significant drop wrt best training results.
- But, much better than TREC-8, vindicating the looser approach to matching answers.
- QA-LaSIE scores better than Okapi-baseline, suggesting NLP is playing a significant role.
- But, a more intelligent baseline (e.g. selecting answer passages based on word overlap with query) might prove otherwise.
- Computing confidence measures provides some support that our objective scoring function is sensible. They can be used for:
  - User support
  - Helping to establish thresholds for no-answer response
  - Tuning parameters in the scoring function (ML techniques?)
48. QA and multilinguality
- Little cross/multilingual QA has been done but it will soon appear, as have CLIE and CLIR.
- It is also a form of MT, and has already been subjected monolingually to pure IR machine learning (Berger et al. 2000) using their new 'IR is MT' paradigm.
- If Qs and As are actually in different languages, it will reinforce their metaphor that they are monolingually as well!!!
- However, progress in CLIR and CLIE suggests this will be a largely symbolic task: the Sheffield phone multilingual QA prototype rests (like multilingual IE) on translation only of the search template.
49. IE, QA, IR, MT form a complex of information access methods
- but which are now hard to distinguish methodologically
- IR is normally done before IE in an application, to cut down the text searched.
- The database that IE produces can then be searched with IR or QA, or can be translated by MT.
- MT and IR now have very similar cross-language methodologies, and cross-language QA, IE and summarization are very close in technique.
- But all these are real tasks (with associated and different evaluation methods), which is not true of all the partial modules that savoured in the Stone Soup (WSD, syntax parsing etc.) and whose value remains problematic.
50. THE END