Cross-Language Retrieval (and Laboratory)

Transcript and Presenter's Notes


1
Cross-Language Retrieval (and Laboratory)

Philip Resnik
University of Maryland

With many slides borrowed from Doug Oard
2
Information Access
Information Use
Translation
Translingual Browsing
Translingual Search
Select
Examine
Query
Document
3
A Little (Confusing) Vocabulary

Multilingual document
Document containing more than one language
Multilingual collection
Collection of documents in different languages
Multilingual system
Can retrieve from a multilingual collection
Cross-language system
Query in one language finds document in another
Translingual system
Queries can find documents in any language

4
Who needs Cross-Language Search?

When users can read several languages
Eliminate multiple queries
Query in most fluent language
Monolingual users can also benefit
If translations can be provided
If it suffices to know that a document exists
If text captions are used to search for images

5
Motivations

Commerce
Security
Social

6
Global Internet Hosts
Source: Network Wizards, Jan 99 Internet Domain Survey
7

Global Web Page Languages
Source: Jack Xu, Excite@Home, 1999
8
European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
9
European Web Size Projection
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
10
Global Internet Audio
Almost 2000 Internet-accessible radio and television stations
Source: www.real.com, Feb 2000
11
13 Months Later
About 2500 Internet-accessible radio and television stations
Source: www.real.com, Mar 2001
12
User Needs Assessment

Who are the potential users?
What goals do we seek to support?
What language skills must we accommodate?

13
Global Languages
Source: http://www.g11n.com/faq.html
14
Global Trade
Billions of US Dollars (1999)
Source: World Trade Organization 2000 Annual Report
15
Global Internet User Population
2000
2005
English
English
Chinese
Source: Global Reach
16
The Search Process
Author
Choose Document-Language Terms
Query-Document Matching
Document
17
(No Transcript)
18
Some History

1964 International Road Research
Multilingual thesauri
1970 SMART
Dictionary-based free-text cross-language
retrieval
1978 ISO Standard 5964 (revised 1985)
Guidelines for developing multilingual thesauri
1990 Latent Semantic Indexing
Corpus-based free-text translingual retrieval

19
Multilingual Thesauri

Build a cross-cultural knowledge structure
Cultural differences influence indexing choices
Use language-independent descriptors
Matched to language-specific lead-in vocabulary
Three construction techniques
Build it from scratch
Translate an existing thesaurus
Merge monolingual thesauri

20
Free Text CLIR

What to translate?
Queries or documents
Where to get translation knowledge?
Dictionary or corpus
How to use it?

21
Translingual Retrieval Architecture
Chinese Term Selection
Monolingual Chinese Retrieval
1 0.72 2 0.48
Language Identification
Chinese Term Selection
Chinese Query
English Term Selection
Cross- Language Retrieval
3 0.91 4 0.57 5 0.36
22
Evidence for Language Identification

Metadata
Included in HTTP and HTML
Word-scale features
Which dictionary gets the most hits?
Subword features
Character n-gram statistics
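
The subword evidence listed above can be sketched in a few lines of Python. This is a toy illustration only: the two training sentences, the trigram size, and the overlap score are assumptions made for the example, not the method used in the presentation.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding the text with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def build_profiles(samples, n=3):
    """samples maps a language label to training text (toy data here)."""
    return {lang: char_ngrams(text, n) for lang, text in samples.items()}

def identify(text, profiles, n=3):
    """Pick the language whose profile shares the most n-gram mass with text."""
    grams = char_ngrams(text, n)

    def overlap(profile):
        total = sum(profile.values())
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# Hypothetical training snippets; a real system would use far more text.
SAMPLES = {
    "en": "the bank is on the street and the river is near the town",
    "de": "die bank ist an der strasse und der fluss ist nahe der stadt",
}
profiles = build_profiles(SAMPLES)
print(identify("the river and the town", profiles))
print(identify("der fluss und die stadt", profiles))
```

Character n-grams need no dictionary, which is why they remain usable on short or noisy text where word-scale lookup finds few hits.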

23
Query-Language Retrieval
Chinese Query Terms
English Document Terms
Monolingual Chinese Retrieval
3 0.91 4 0.57 5 0.36
Document Translation
24
Example: Modular Use of MT

Select a single query language
Translate every document into that language
Perform monolingual retrieval

25
Is Machine Translation Enough?
TDT-3 Mandarin Broadcast News
Systran
Balanced 2-best translation
26
Document-Language Retrieval
Chinese Query Terms
Query Translation
English Document Terms
Monolingual English Retrieval
3 0.91 4 0.57 5 0.36
27
Query vs. Document Translation

Query translation
Efficient for short queries (not relevance feedback)
Limited context for ambiguous query terms
Document translation
Rapid support for interactive selection
Need only be done once (if query language is same)
Merged query and document translation
Can produce better effectiveness than either alone

28
The Short Query Challenge
Source: Jack Xu, Excite@Home, 1999
29
Interlingual Retrieval
Chinese Query Terms
Query Translation
English Document Terms
Interlingual Retrieval
3 0.91 4 0.57 5 0.36
Document Translation
30
Key Challenges in CLIR
probe survey take samples
cymbidium goeringii
oil petroleum
restrain
31
What's a Term?

Granularity of a term depends on the task
Long for translation, more fine-grained for retrieval
Phrases improve translation two ways
Less ambiguous than single words
Idiomatic expressions translate as a single concept
Three ways to identify phrases
Semantic (e.g., appears in a dictionary)
Syntactic (e.g., parse as a noun phrase)
Co-occurrence (appear together unexpectedly often)

32
Segmentation

Choose a model
Assemble evidence
Choose a preference criterion
Choose a search strategy

33
Segmentation Models

Unique segmentation
Assign each item to at most one term
Plausible segmentation
Allow alternative segmentations of a string
Plausible inference
Expand contractions and abbreviations

34
Sources of Evidence for Segmentation

Lexical resources
Dictionaries, term lists, name lists, gazetteers
Corpus statistics
Within-document, within-collection, balanced
Algorithms
Transliteration rules, name cues, parsers,
The user
Forced join, forced split
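
One concrete combination of the choices above is a unique segmentation driven by a lexicon, with "longest match" as the preference criterion and a greedy left-to-right search. The sketch below assumes a tiny hypothetical lexicon and is only one of many possible segmenters.

```python
def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation of an unspaced string.

    At each position, take the longest lexicon entry (up to max_len
    characters); fall back to a single character when nothing matches.
    """
    terms, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                terms.append(piece)
                i += size
                break
    return terms

# Hypothetical lexicon entries chosen to show an ambiguous string.
LEXICON = {"北京", "大学", "北京大学", "学生", "大学生"}
print(segment("北京大学生", LEXICON))
```

The greedy search commits to the four-character entry and strands the last character, which is the kind of case that motivates keeping alternative (plausible) segmentations rather than a single one.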

35
Sources of Evidence for Translation

Lexical resources
Corpus statistics
Algorithms
The user

36
Types of Lexical Resources

Ontology
Representation of concepts and relationships
Thesaurus
Ontology specialized for retrieval
Lexicon
Ontology specialized for machine translation
Dictionary
Ontology specialized for human translation
Bilingual term list
List of translation-equivalent pairs

37
Machine Readable Dictionaries

Based on printed bilingual dictionaries
Becoming widely available
Cross-language term mappings are accessible
Sometimes listed in order of most common usage
Some knowledge structure is also present
Hard to extract and represent automatically

38
TREC topic 351, title and description fields
Manual translation, then automatic segmentation
Unbalanced translation: all translations of every term
Balanced translation: 1-best translation of each term
39
Dictionary-Based Query Translation

Original query: El Nino and infectious diseases
Term selection: "El Nino", "infectious diseases"
Term translation
(Dictionary coverage: El Nino is not found)
Translation selection
Query formulation
Structure
Post-translation resegmentation

40
The Unbalanced Translation Problem

Common query terms may have many translations
Some of the translations may be rare
IR systems give rare translations higher weights
The wrong documents get highly ranked

41
Solution 1: Balanced Translation

Replace each term with plausible translations
Common terms have many translations
Specific terms are more useful for retrieval
Balance the contribution of each translation
Modular: duplicate translations
Integrated: average the contributions
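
The "integrated" variant can be sketched as follows. The TF-IDF weighting, the smoothing, and all document frequencies are assumptions invented for this example; the point is only the difference between summing and averaging the contributions of a term's translations.

```python
import math

def idf(term, doc_freq, n_docs):
    """Smoothed inverse document frequency (an assumed weighting)."""
    return math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))

def score(doc_tf, query_translations, doc_freq, n_docs, balanced=True):
    """Score a document against a translated query.

    query_translations holds one list of target-language translations
    per source-language query term.
    """
    total = 0.0
    for translations in query_translations:
        weights = [doc_tf.get(t, 0) * idf(t, doc_freq, n_docs) for t in translations]
        if not weights:
            continue
        # Balanced: average over the alternatives; unbalanced: sum them.
        total += sum(weights) / len(weights) if balanced else sum(weights)
    return total

# Toy statistics: "ufer" (river bank) is a rare translation of "bank".
DOC_FREQ = {"bank": 50, "ufer": 2, "schweizer": 10}
N_DOCS = 100
QUERY = [["schweizer"], ["bank", "ufer"]]        # "Swiss bank"
swiss_doc = {"schweizer": 1, "bank": 1}
river_doc = {"ufer": 1}

for balanced in (False, True):
    print(balanced,
          round(score(swiss_doc, QUERY, DOC_FREQ, N_DOCS, balanced), 3),
          round(score(river_doc, QUERY, DOC_FREQ, N_DOCS, balanced), 3))
```

With these toy numbers the unbalanced score ranks the river document first because the rare translation carries a large IDF, while averaging restores the document that matches both query terms.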

42
Solution 2: Structured Queries

Weight of term a in a document i depends on
TF(a,i): Frequency of term a in document i
DF(a): How many documents term a occurs in
Build pseudo-terms from alternate translations
TF(syn(a,b),i) = TF(a,i) + TF(b,i)
DF(syn(a,b)) = |{docs with a} ∪ {docs with b}|
Downweights terms with any common translation
Particularly effective for long queries
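
A minimal sketch of the pseudo-term statistics, assuming an inverted index stored as a dict from term to a set of document ids. The final TF times log-IDF weight is a placeholder chosen for the example, not the weighting function of any particular retrieval system.

```python
import math

def syn_tf(doc_tf, translations):
    """TF of the pseudo-term: sum of its translations' frequencies in the document."""
    return sum(doc_tf.get(t, 0) for t in translations)

def syn_df(postings, translations):
    """DF of the pseudo-term: size of the union of the translations' document sets."""
    docs = set()
    for t in translations:
        docs |= postings.get(t, set())
    return len(docs)

def syn_weight(doc_tf, postings, translations, n_docs):
    """Assumed TF * smoothed-IDF weight computed on the pseudo-term."""
    tf = syn_tf(doc_tf, translations)
    if tf == 0:
        return 0.0
    df = syn_df(postings, translations)
    return tf * math.log((n_docs + 1) / (df + 1))

# Toy index: "bank" is common, "ufer" is rare; document 6 contains both.
POSTINGS = {"bank": {1, 2, 3, 4, 5, 6}, "ufer": {6, 7}}
N_DOCS = 10
print(syn_df(POSTINGS, ["bank", "ufer"]))
print(round(syn_weight({"ufer": 1}, POSTINGS, ["bank", "ufer"], N_DOCS), 3))
```

Because the document frequency is taken over the union, a single common translation makes the whole pseudo-term common, so a rare alternative can no longer dominate the ranking.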

43
Computing Weights

Unbalanced
Overweights query terms that have many translations
Balanced (sum)
Sensitive to rare translations
Pirkola (syn)
Deemphasizes query terms with any common translation

(Query terms: 1, 2, 3)
44
Relative Effectiveness
NTCIR-2 ECIR Collection, CETALDC Dictionary,
Inquery 3.1p1
45
Exploiting Part-of-Speech (POS)

Constrain translations by part-of-speech
Requires POS tagger and POS-tagged lexicon
Works well when queries are full sentences
Short queries provide little basis for tagging
Constrained matching can hurt monolingual IR
Nouns in queries often match verbs in documents

46
Types of Bilingual Corpora

Parallel corpora: translation-equivalent pairs
Document pairs
Sentence pairs
Term pairs
Comparable corpora: topically related
Collection pairs
Document pairs

47
Corpus-Based CLIR Example
Top ranked French Documents
French Query Terms
Top ranked English Documents
English Translations
Parallel Corpus
French IR System
English IR System
48
Exploiting Comparable Corpora

Blind relevance feedback
Existing CLIR technique + collection-linked corpus
Lexicon enrichment
Existing lexicon + collection-linked corpus
Dual-space techniques
Document-linked corpus

49
Blind Relevance Feedback

Augment a representation with related terms
Find related documents, extract distinguishing terms
Multiple opportunities
Before doc translation: Enrich the vocabulary
After doc translation: Mitigate translation errors
Before query translation: Improve the query
After query translation: Mitigate translation errors
Short queries get the most dramatic improvement
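
The two steps named above (find related documents, extract distinguishing terms) might look like this in outline. The initial ranking is taken as given, and the TF times IDF term selection, the cutoffs, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
import math

def expand_query(query_terms, ranked_docs, doc_freq, n_docs, top_docs=2, new_terms=2):
    """Append the most distinguishing terms from the top-ranked documents.

    ranked_docs is a list of token lists, best first, produced by some
    initial retrieval run that is outside the scope of this sketch.
    """
    counts = Counter()
    for tokens in ranked_docs[:top_docs]:
        counts.update(tokens)

    def weight(term):
        return counts[term] * math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))

    candidates = [t for t in counts if t not in query_terms]
    candidates.sort(key=lambda t: (-weight(t), t))
    return list(query_terms) + candidates[:new_terms]

# Toy initial ranking and collection statistics.
RANKED = [
    ["el", "nino", "flood", "cholera", "the"],
    ["nino", "cholera", "outbreak", "the"],
    ["stock", "market"],
]
DOC_FREQ = {"the": 100, "cholera": 3, "flood": 10, "outbreak": 5, "el": 4, "nino": 4}
print(expand_query(["el", "nino"], RANKED, DOC_FREQ, n_docs=100))
```

The same routine can be run before translation (on a source-language collection) or after it (on the target-language collection), matching the opportunities listed above.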

50
Example: Post-Translation Document Expansion
English Query
Term Selection
IR System
Document to be Indexed
Top 5
IR System
Results
Single Document
Term-to-Term Translation
English Corpus
Automatic Segmentation
Mandarin Chinese Documents
51
Post-Translation Document Expansion
Mandarin Newswire Text
52
Why Document Expansion Works

Story-length objects provide useful context
Ranked retrieval finds signal amid the noise
Selective terms discriminate among documents
Enrich index with low DF terms from top documents
Similar strategies work well in other
applications
CLIR query translation
Monolingual spoken document retrieval

53
Lexicon Enrichment

Use a bilingual lexicon to align context regions
Regions with high coincidence of known translations
Pair unknown terms with unmatched terms
Unknown: language A, not in the lexicon
Unmatched: language B, not covered by translation
Treat the most surprising pairs as new translations
Not yet tested in a CLIR application

54
Lexicon Enrichment
Similar techniques can guide translation selection
55
Learning From Document Pairs
56
Similarity Thesauri

For each term, find most similar in other language
Terms E1 and S1 (or E3 and S4) are used in similar ways
Treat top related terms as candidate translations
Applying dictionary-based techniques
Performed well on a comparable news corpus
Automatically linked based on date and subject codes

57
Generalized Vector Space Model

Term space of each language is different
Document links define a common document space
Describe documents based on the corpus
Vector of similarities to each corpus document
Compute cosine similarity in document space
Very effective in a within-domain evaluation

58
Latent Semantic Indexing

Cosine similarity captures noise with signal
Term choice variation and word sense ambiguity
Signal-preserving dimensionality reduction
Conflates terms with similar usage patterns
Reduces term choice effect, even across languages
Computationally expensive

59
Exploiting Parallel Corpora

Document-linked techniques
Corpus-guided translation selection
Statistical machine translation

60
Hieroglyphic
Egyptian Demotic
Greek
61
Corpus-Guided Translation Selection

Build target-language term n-gram language model
Can use the collection being searched
Smooth statistics using comparable and balanced corpora
Use it to rank translation alternatives for each term
Unigram language models are easy to build
Limits uncommon translation and spelling error effects
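
A unigram version of this idea fits in a few lines. The add-one smoothing below stands in for the corpus-based smoothing mentioned on the slide and is an assumption, as is the toy collection.

```python
from collections import Counter

def unigram_model(tokens):
    """Return a probability function with add-one smoothing (assumed here)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts)

    def prob(term):
        return (counts[term] + 1) / (total + vocab + 1)

    return prob

def rank_translations(candidates, prob):
    """Order translation alternatives from most to least likely in the collection."""
    return sorted(candidates, key=lambda t: (-prob(t), t))

# Toy target-language collection; in practice, the collection being searched.
collection = "bank bank bank ufer bank damm".split()
prob = unigram_model(collection)
print(rank_translations(["damm", "ufer", "bank", "wall"], prob))
```

Translations that never occur in the collection sink to the bottom of the list, which is how a unigram model limits the effect of uncommon translations and misspelled dictionary entries.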

62
Statistical Machine Translation

Add a translation model
Trained using a term-aligned corpus
Statistical MT toolkits are becoming available
Excellent results on hand-assembled corpora
Promising initial results on harvested Web pages

63
Obtaining Parallel Corpora

Translating monolingual corpora is impractical
Humans are expensive, machines are inaccurate
Harvesting new parallel corpora can be expensive
Reverse engineering collection-specific link encoding
Crawling the Web offers an interesting alternative
Low-quality translations, but lots of them
Reuse of existing parallel corpora is limited
Cross-domain applications typically perform poorly

64
Cognate Matching

Dictionary coverage is inherently limited
Translation of proper names
Translation of newly coined terms
Translation of unfamiliar technical terms
Strategy: model derivational translation
Orthography-based
Pronunciation-based

65
Matching Orthographic Cognates

Retain untranslatable words unchanged
Often works well between European languages
Rule-based systems
Even off-the-shelf spelling correction can help!
Character-level statistical MT
Trained using a set of representative cognates
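
As a stand-in for off-the-shelf spelling correction, Python's standard `difflib.get_close_matches` can map an untranslatable word onto the closest target-language vocabulary entry. The vocabulary and the 0.8 similarity cutoff are assumptions for this sketch.

```python
import difflib

def match_cognate(word, target_vocab, cutoff=0.8):
    """Return the closest target-language spelling, or None if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), target_vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical English vocabulary and an untranslated German query term.
ENGLISH_VOCAB = ["parliament", "department", "payment"]
print(match_cognate("Parlament", ENGLISH_VOCAB))
print(match_cognate("xyzzy", ENGLISH_VOCAB))
```

A high cutoff keeps false cognates out at the cost of missing looser spelling correspondences; a rule-based or trained character-level model would handle systematic differences better.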

66
Matching Phonetic Cognates

Forward transliteration
Generate all potential transliterations
Reverse transliteration
Guess source string(s) that produced a transliteration
Match in phonetic space

67
Post-Translation Resegmentation
68
Cross-Language Information Retrieval
Controlled Vocabulary
Free Text
Query Translation, Document Translation
Dictionary-based
Corpus-based
Term-aligned, Document-aligned, Collection-aligned
Parallel, Comparable
69
Evaluation Collections

TREC
TREC-6/7/8: English, French, German, Italian text
TREC-9: Chinese text
TREC-10: Arabic text
CLEF
CLEF-1: English, French, German, Italian text
CLEF-2: Above, plus Spanish and Dutch
TDT
TDT-2/3: Chinese and English, text and speech
NTCIR
NTCIR-1: Japanese and English text
NTCIR-2: Japanese, Chinese and English text

70
Topic Detection and Tracking

English and Chinese news stories
Newswire, radio, and television sources
Query-by-example, mixed language/source
Merged multilingual result set
Set-based retrieval measures
Focus on utility

71
TDT: Evaluating CL Speech Retrieval
Development Collection TDT-2
Evaluation Collection TDT-3
Oct 98
Dec 98
Jun 98
Jan 98
17 topics, variable number of exemplars
56 topics, variable number of exemplars
English text topic exemplars: Associated Press, New York Times
2265 manually segmented stories
3371 manually segmented stories
Mandarin audio broadcast news: Voice of America
Mar 98
Exhaustive relevance assessment based on event overlap
72
President Bill Clinton and Chinese President
Jiang Zemin engaged in a spirited, televised
debate Saturday over human rights and
the Tiananmen Square crackdown, and announced a
string of agreements on arms control, energy and
environmental matters. There were no announced
breakthroughs on American human rights concerns,
including Tibet, but both leaders accentuated
the positive
Query by Example
English Newswire Exemplars
Mandarin Audio Stories
73
Known Item Retrieval

Design queries to retrieve a single document
Measure the rank of that document in the list
Average the inverse of the rank across queries
Use sign test for statistical significance
Useful first pass evaluation strategy
Avoids the cost of relevance judgments
Poor mean inverse rank implies poor average precision
Does not distinguish well among fairly close systems
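
The measure and the significance test described above can be sketched as follows. Scoring a missing target as zero and using a two-sided sign test with ties dropped are assumptions of this example.

```python
import math

def mean_reciprocal_rank(rankings, targets):
    """Average 1/rank of each query's single known item (0 if it is not retrieved)."""
    total = 0.0
    for ranking, target in zip(rankings, targets):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(targets)

def sign_test_p(wins, losses):
    """Two-sided sign test p-value; tied queries are assumed to be dropped beforehand."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy runs: three queries, each with one known target document.
rankings = [["d1", "d2"], ["d3", "d4", "d5"], ["d6"]]
targets = ["d1", "d5", "d9"]
print(round(mean_reciprocal_rank(rankings, targets), 4))
print(sign_test_p(wins=8, losses=2))
```

Here `wins` and `losses` count the queries on which one system ranks the known item higher than the other; with 8 wins out of 10 the difference is still not significant at the 0.05 level.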

74
Evaluating Corpus-Based Techniques

Within-domain evaluation (upper bound)
Partition a bilingual corpus into training and test
Use the training part to tune the system
Generate relevance judgments for evaluation part
Cross-domain evaluation (fair)
Use existing corpora and evaluation collections
No good metric for degree of domain shift

75
Evaluating Lexicon Coverage

Lexicon size
Vocabulary coverage of the collection
Term instance coverage of the collection
Term weight coverage of the collection
Term weight coverage on representative queries
Retrieval performance on a test collection
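
The difference between vocabulary coverage and term instance coverage is easy to see in code. The sketch assumes a tokenized collection and a lexicon represented as a set of source-language headwords; both are toy data.

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (vocabulary coverage, term instance coverage) of a lexicon.

    Vocabulary coverage counts distinct terms; term instance coverage
    counts every occurrence, so frequent terms weigh more.
    """
    counts = Counter(tokens)
    covered_types = sum(1 for term in counts if term in lexicon)
    covered_tokens = sum(c for term, c in counts.items() if term in lexicon)
    return covered_types / len(counts), covered_tokens / sum(counts.values())

# Toy collection and lexicon; "nino" is the out-of-vocabulary term.
tokens = "the bank the river the nino".split()
lexicon = {"the", "bank", "river"}
vocab_cov, instance_cov = coverage(tokens, lexicon)
print(vocab_cov, round(instance_cov, 3))
```

Instance coverage is usually the higher figure because common words are well covered, yet the uncovered rare terms are often the ones that matter most for retrieval, which is why the list above continues on to weight-based and retrieval-based measures.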

76
Outline

Questions
Overview
Search
Browsing

77
Interactive CLIR

Important: Strong support for interactive relevance judgments can make up for less accurate nominations
Hersh et al., SIGIR 2000
Practical: Interactive relevance judgments based on imperfect translations can beat fully automatic nominations alone
Oard and Resnik, IPM 1997

78
User-Assisted Query Translation
79
Reverse Translation
Search
Swiss bank
Query in English
Click on a box to remove a possible translation
bank
Bankgebäude ( )
bankverbindung (bank account, correspondent)
bank (bench, settle)
damm (causeway, dam, embankment)
ufer (shore, strand, waterside)
wall (parapet, rampart)
Continue
80
Selection

Goal: Provide information to support decisions
May not require very good translations
e.g., term-by-term title translation
People can read past some ambiguity
May help to display a few alternative translations

81
Language-Specific Selection
Search
Swiss bank
Query in English
English
German
(Swiss) (Bankgebäude, bankverbindung, bank)
1 (0.72) Swiss Bankers Criticized AP / June 14, 1997
2 (0.48) Bank Director Resigns AP / July 24, 1997

1 (0.91) U.S. Senator Warpathing NZZ / June 14, 1997
2 (0.57) Bankensecret Law Change SDA / August 22, 1997
3 (0.36) Banks Pressure Existent NZZ / May 3, 1997
82
Translingual Selection
Search
Swiss bank
Query in English
German Query
(Swiss) (Bankgebäude, bankverbindung, bank)
1 (0.91) U.S. Senator Warpathing NZZ / June 14, 1997
2 (0.57) Bankensecret Law Change SDA / August 22, 1997
3 (0.52) Swiss Bankers Criticized AP / June 14, 1997
4 (0.36) Banks Pressure Existent NZZ / May 3, 1997
5 (0.28) Bank Director Resigns AP / July 24, 1997
83
Merging Ranked Lists
1 voa4062 .22
2 voa3052 .21
3 voa4091 .17
...
1000 voa4221 .04

1 voa4062 .52
2 voa2156 .37
3 voa3052 .31
...
1000 voa2159 .02

Types of Evidence
Rank
Score
Evidence Combination
Weighted round robin
Score combination
Parameter tuning
Condition-based
Query-based

1 voa4062
2 voa3052
3 voa2156
...
1000 voa4201
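
Both evidence-combination strategies named on this slide can be sketched briefly. The list weights, the per-round pick counts, and the use of only the top three entries from each list are assumptions; in practice the parameters would be tuned as the slide indicates.

```python
def score_combination(lists, weights=None):
    """Merge by weighted sum of scores; documents missing from a list add nothing."""
    weights = weights or [1.0] * len(lists)
    combined = {}
    for w, ranked in zip(weights, lists):
        for doc, s in ranked:
            combined[doc] = combined.get(doc, 0.0) + w * s
    return sorted(combined, key=lambda d: (-combined[d], d))

def weighted_round_robin(lists, picks_per_round):
    """Merge by rank only: take picks_per_round[i] items from list i each round."""
    iters = [iter(doc for doc, _ in ranked) for ranked in lists]
    merged, seen = [], set()
    active = True
    while active:
        active = False
        for it, picks in zip(iters, picks_per_round):
            for _ in range(picks):
                doc = next(it, None)
                if doc is None:
                    break
                active = True
                if doc not in seen:
                    seen.add(doc)
                    merged.append(doc)
    return merged

# Top three entries of the two ranked lists shown on the slide.
list_a = [("voa4062", 0.22), ("voa3052", 0.21), ("voa4091", 0.17)]
list_b = [("voa4062", 0.52), ("voa2156", 0.37), ("voa3052", 0.31)]
print(score_combination([list_a, list_b]))
print(weighted_round_robin([list_a, list_b], [1, 1]))
print(weighted_round_robin([list_a, list_b], [2, 1]))
```

Score combination assumes the two systems' scores are comparable, which is rarely true across languages without normalization; round robin sidesteps that by using rank alone.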
84
Examination Interface

Two goals
Refine document delivery decisions
Support vocabulary discovery for query refinement
Rapid translation is essential
Document translation retrieval strategies are a good fit
Focused on-the-fly translation may be a viable alternative

85
The Challenge
86
State-of-the-Art Machine Translation
87
Term-by-Term Gloss Translation
88
Experiment Design
Participant   Task Order
1             Topic11, Topic17    Topic13, Topic29
2             Topic11, Topic17    Topic13, Topic29
3             Topic17, Topic11    Topic29, Topic13
4             Topic17, Topic11    Topic29, Topic13

Topic Key
Narrow: 11, 13
Broad: 17, 29

System Key
System A
System B
89
An Experiment Session

Task and system familiarization (30 minutes)
Gain experience with both systems
4 searches (20 minutes x 4)
Read topic description (in a language you know)
Examine translations (into that same language)
Select one of 5 relevance judgments (two clusters)
Relevant
Somewhat relevant, Not relevant, Unsure, Not judged
Instructed to seek high precision
8 questionnaires
Initial (1), each topic (4), each system (2), Final (1)

90
Measure of Effectiveness

Unbalanced F-Measure
P = precision
R = recall
α = 0.8
Favors precision over recall
This models an application in which
Fluent translation is expensive
Missing some relevant documents would be okay
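
The formula itself did not survive in this transcript. The sketch below assumes the common weighted harmonic-mean form, F = 1 / (α/P + (1 - α)/R), with the 0.8 on the slide taken to be the weight α on precision.

```python
def f_alpha(precision, recall, alpha=0.8):
    """Unbalanced F-measure (assumed form); alpha > 0.5 favors precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# Perfect precision with half the recall beats the reverse situation.
print(round(f_alpha(1.0, 0.5), 4))
print(round(f_alpha(0.5, 1.0), 4))
```

Under this form a searcher who judges cautiously and misses some relevant documents scores better than one who marks many documents relevant and is often wrong, which matches the instruction to seek high precision.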

91
French Results Overview
92
English Results Overview
93
Maryland Experiments
---------- Broad topics -----------
--------- Narrow topics -----------