Title: Cross-Language Retrieval and Laboratory
1Cross-Language Retrieval (and Laboratory)
- Philip Resnik
- University of Maryland
With many slides borrowed from Doug Oard
2Information Access
Information Use
Translation
Translingual Browsing
Translingual Search
Select
Examine
Query
Document
3A Little (Confusing) Vocabulary
- Multilingual document
- Document containing more than one language
- Multilingual collection
- Collection of documents in different languages
- Multilingual system
- Can retrieve from a multilingual collection
- Cross-language system
- Query in one language finds document in another
- Translingual system
- Queries can find documents in any language
4Who needs Cross-Language Search?
- When users can read several languages
- Eliminate multiple queries
- Query in most fluent language
- Monolingual users can also benefit
- If translations can be provided
- If it suffices to know that a document exists
- If text captions are used to search for images
5Motivations
6Global Internet Hosts
Source: Network Wizards, Jan 99 Internet Domain Survey
7 Global Web Page Languages
Source: Jack Xu, Excite@Home, 1999
8European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
9European Web Size Projection
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
10Global Internet Audio
Almost 2000 Internet-accessible Radio and Television Stations
Source: www.real.com, Feb 2000
1113 Months Later
About 2500 Internet-accessible Radio and Television Stations
Source: www.real.com, Mar 2001
12User Needs Assessment
- Who are the potential users?
- What goals do we seek to support?
- What language skills must we accommodate?
13Global Languages
Source: http://www.g11n.com/faq.html
14Global Trade
Billions of US Dollars (1999)
Source: World Trade Organization 2000 Annual Report
15Global Internet User Population
[Figure: Internet user population by language, 2000 and 2005; labeled segments include English and Chinese]
Source: Global Reach
16The Search Process
Author
Choose Document-Language Terms
Query-Document Matching
Document
18Some History
- 1964 International Road Research
- Multilingual thesauri
- 1970 SMART
- Dictionary-based free-text cross-language retrieval
- 1978 ISO Standard 5964 (revised 1985)
- Guidelines for developing multilingual thesauri
- 1990 Latent Semantic Indexing
- Corpus-based free-text translingual retrieval
19Multilingual Thesauri
- Build a cross-cultural knowledge structure
- Cultural differences influence indexing choices
- Use language-independent descriptors
- Matched to language-specific lead-in vocabulary
- Three construction techniques
- Build it from scratch
- Translate an existing thesaurus
- Merge monolingual thesauri
20Free Text CLIR
- What to translate?
- Queries or documents
- Where to get translation knowledge?
- Dictionary or corpus
- How to use it?
21Translingual Retrieval Architecture
Chinese Term Selection
Monolingual Chinese Retrieval
1 0.72 2 0.48
Language Identification
Chinese Term Selection
Chinese Query
English Term Selection
Cross-Language Retrieval
3 0.91 4 0.57 5 0.36
22Evidence for Language Identification
- Metadata
- Included in HTTP and HTML
- Word-scale features
- Which dictionary gets the most hits?
- Subword features
- Character n-gram statistics
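The subword evidence above can be illustrated with a minimal sketch that compares character trigram profiles by cosine similarity. The three training sentences, the `SAMPLES` and `PROFILES` names, and the choice of cosine are illustrative assumptions, not something taken from the slides; a real identifier would train on far more text.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, with a space padded on each end."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = (math.sqrt(sum(c * c for c in p.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0

# Hypothetical toy training text, one sentence per language.
SAMPLES = {
    "english": "the bank is on the street and the people are waiting for the train",
    "german": "die bank ist an der straße und die leute warten auf den zug",
    "french": "la banque est dans la rue et les gens attendent le train",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify_language(text):
    """Return the language whose trigram profile is closest to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

Word-scale evidence (which dictionary gets the most hits) could be sketched the same way by counting whole words instead of trigrams.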
23Query-Language Retrieval
Chinese Query Terms
English Document Terms
Monolingual Chinese Retrieval
3 0.91 4 0.57 5 0.36
Document Translation
24Example: Modular Use of MT
- Select a single query language
- Translate every document into that language
- Perform monolingual retrieval
25Is Machine Translation Enough?
TDT-3 Mandarin Broadcast News
Systran
Balanced 2-best translation
26Document-Language Retrieval
Chinese Query Terms
Query Translation
English Document Terms
Monolingual English Retrieval
3 0.91 4 0.57 5 0.36
27Query vs. Document Translation
- Query translation
- Efficient for short queries (not relevance feedback)
- Limited context for ambiguous query terms
- Document translation
- Rapid support for interactive selection
- Need only be done once (if query language is same)
- Merged query and document translation
- Can produce better effectiveness than either alone
28The Short Query Challenge
Source: Jack Xu, Excite@Home, 1999
29Interlingual Retrieval
Chinese Query Terms
Query Translation
English Document Terms
Interlingual Retrieval
3 0.91 4 0.57 5 0.36
Document Translation
30Key Challenges in CLIR
probe survey take samples
cymbidium goeringii
oil petroleum
restrain
31What's a Term?
- Granularity of a term depends on the task
- Long for translation, more fine-grained for retrieval
- Phrases improve translation two ways
- Less ambiguous than single words
- Idiomatic expressions translate as a single concept
- Three ways to identify phrases
- Semantic (e.g., appears in a dictionary)
- Syntactic (e.g., parse as a noun phrase)
- Co-occurrence (appear together unexpectedly often)
32Segmentation
- Choose a model
- Assemble evidence
- Choose a preference criterion
- Choose a search strategy
33Segmentation Models
- Unique segmentation
- Assign each item to at most one term
- Plausible segmentation
- Allow alternative segmentations of a string
- Plausible inference
- Expand contractions and abbreviations
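A small sketch may help contrast the first two models. `longest_match` produces one unique segmentation by greedy left-to-right longest match, while `all_segmentations` enumerates every plausible segmentation that a term list allows. Both function names, the four-entry lexicon in the usage note, and the rule that a single character is always an allowed term are assumptions made for illustration.

```python
def all_segmentations(text, lexicon):
    """Plausible segmentation: every split of text into lexicon terms.

    Single characters are always allowed so that some split exists.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if end == 1 or piece in lexicon:
            for rest in all_segmentations(text[end:], lexicon):
                results.append([piece] + rest)
    return results

def longest_match(text, lexicon):
    """Unique segmentation: greedy left-to-right longest match."""
    terms, i = [], 0
    max_len = max(map(len, lexicon), default=1)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                terms.append(piece)
                i += size
                break
    return terms
```

With the lexicon `{"中国", "人民", "国人", "中国人"}` and the string `"中国人民"`, the enumeration yields six candidate segmentations, and the greedy search picks `["中国人", "民"]` rather than `["中国", "人民"]`, which is why the preference criterion and search strategy are separate choices.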
34Sources of Evidence for Segmentation
- Lexical resources
- Dictionaries, term lists, name lists, gazetteers
- Corpus statistics
- Within-document, within-collection, balanced
- Algorithms
- Transliteration rules, name cues, parsers,
- The user
- Forced join, forced split
35Sources of Evidence for Translation
- Lexical resources
- Corpus statistics
- Algorithms
- The user
36Types of Lexical Resources
- Ontology
- Representation of concepts and relationships
- Thesaurus
- Ontology specialized for retrieval
- Lexicon
- Ontology specialized for machine translation
- Dictionary
- Ontology specialized for human translation
- Bilingual term list
- List of translation-equivalent pairs
37Machine Readable Dictionaries
- Based on printed bilingual dictionaries
- Becoming widely available
- Cross-language term mappings are accessible
- Sometimes listed in order of most common usage
- Some knowledge structure is also present
- Hard to extract and represent automatically
38TREC topic 351, title and description fields
Manual translation, then automatic segmentation
Unbalanced translation All translations of every
term
Balanced translation 1-best translation of each
term
39Dictionary-Based Query Translation
- Original query: El Nino and infectious diseases
- Term selection: El Nino infectious diseases
- Term translation
- (Dictionary coverage: El Nino is not found)
- Translation selection
- Query formulation
- Structure
- Post-translation resegmentation
40The Unbalanced Translation Problem
- Common query terms may have many translations
- Some of the translations may be rare
- IR systems give rare translations higher weights
- The wrong documents get highly ranked
41Solution 1: Balanced Translation
- Replace each term with plausible translations
- Common terms have many translations
- Specific terms are more useful for retrieval
- Balance the contribution of each translation
- Modular: duplicate translations
- Integrated: average the contributions
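The integrated variant can be sketched as follows, assuming a bilingual term list stored as a plain dict from source terms to lists of translations. The function name, the `balanced` flag, and the rule of keeping untranslatable terms unchanged are illustrative choices rather than the method of any particular system.

```python
def translate_query(terms, dictionary, balanced=True):
    """Map source-language terms to weighted target-language terms.

    Unbalanced: every translation gets weight 1, so a term with many
    translations dominates the query. Balanced: the translations of one
    source term share a total weight of 1.
    """
    weights = {}
    for term in terms:
        # Assumption: terms missing from the dictionary are kept unchanged.
        translations = dictionary.get(term) or [term]
        share = 1.0 / len(translations) if balanced else 1.0
        for translation in translations:
            weights[translation] = weights.get(translation, 0.0) + share
    return weights
```

For a source term with four translations and another with one, the unbalanced query gives the first term four times the total weight of the second; the balanced query gives each source term a total weight of 1.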
42Solution 2: Structured Queries
- Weight of term a in a document i depends on
- TF(a,i): Frequency of term a in document i
- DF(a): How many documents term a occurs in
- Build pseudo-terms from alternate translations
- TF(syn(a,b),i) = TF(a,i) + TF(b,i)
- DF(syn(a,b)) = |{docs with a} ∪ {docs with b}|
- Downweights terms with any common translation
- Particularly effective for long queries
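A minimal sketch of the pseudo-term statistics, assuming each document is a list of tokens and the collection is a list of such documents. The final `TF * log(N / DF)` weight is only a stand-in weighting chosen for illustration; the slides do not say which weighting formula the retrieval system used.

```python
import math

def syn_tf(translations, doc_tokens):
    """TF of the pseudo-term: summed frequency of all translations in the document."""
    return sum(doc_tokens.count(t) for t in translations)

def syn_df(translations, collection):
    """DF of the pseudo-term: documents containing at least one translation."""
    return sum(1 for doc in collection if any(t in doc for t in translations))

def syn_weight(translations, doc_tokens, collection):
    """Illustrative TF * log(N / DF) weight for the pseudo-term (an assumption)."""
    df = syn_df(translations, collection)
    if df == 0:
        return 0.0
    return syn_tf(translations, doc_tokens) * math.log(len(collection) / df)
```

Because the DF of the pseudo-term is the size of a union, one common translation is enough to make the whole set look common, which is the downweighting effect described above.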
43Computing Weights
- Unbalanced
- Overweights query terms that have many translations
- Balanced (sum)
- Sensitive to rare translations
- Pirkola (syn)
- Deemphasizes query terms with any common translation
(Query Terms: 1, 2, 3)
44Relative Effectiveness
NTCIR-2 ECIR Collection, CETALDC Dictionary, Inquery 3.1p1
45Exploiting Part-of-Speech (POS)
- Constrain translations by part-of-speech
- Requires POS tagger and POS-tagged lexicon
- Works well when queries are full sentences
- Short queries provide little basis for tagging
- Constrained matching can hurt monolingual IR
- Nouns in queries often match verbs in documents
46Types of Bilingual Corpora
- Parallel corpora: translation-equivalent pairs
- Document pairs
- Sentence pairs
- Term pairs
- Comparable corpora: topically related
- Collection pairs
- Document pairs
47Corpus-Based CLIR Example
Top ranked French Documents
French Query Terms
Top ranked English Documents
English Translations
Parallel Corpus
French IR System
English IR System
48Exploiting Comparable Corpora
- Blind relevance feedback
- Existing CLIR technique + collection-linked corpus
- Lexicon enrichment
- Existing lexicon + collection-linked corpus
- Dual-space techniques
- Document-linked corpus
49Blind Relevance Feedback
- Augment a representation with related terms
- Find related documents, extract distinguishing terms
- Multiple opportunities
- Before doc translation: Enrich the vocabulary
- After doc translation: Mitigate translation errors
- Before query translation: Improve the query
- After query translation: Mitigate translation errors
- Short queries get the most dramatic improvement
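The core loop of blind relevance feedback can be sketched in a few lines. This toy version ranks documents by raw query-term counts and scores expansion candidates by `TF * log(N / DF)`; both scoring choices, the function names, and the parameter defaults are assumptions for illustration. The same routine could be applied to a query or to a document treated as a query, before or after translation.

```python
from collections import Counter
import math

def score(query_terms, doc):
    """Toy retrieval score: total occurrences of query terms in the document."""
    counts = Counter(doc)
    return sum(counts[t] for t in query_terms)

def expand_query(query_terms, collection, top_docs=2, new_terms=2):
    """Add the most distinguishing terms from the top-ranked documents."""
    ranked = sorted(collection, key=lambda d: score(query_terms, d), reverse=True)
    feedback = ranked[:top_docs]
    df = Counter(t for doc in collection for t in set(doc))
    candidates = Counter()
    for doc in feedback:
        for term, tf in Counter(doc).items():
            if term not in query_terms:
                # Favor terms frequent in feedback documents but rare overall.
                candidates[term] += tf * math.log(len(collection) / df[term])
    best = [term for term, _ in candidates.most_common(new_terms)]
    return list(query_terms) + best
```

No relevance judgments are consulted, which is what makes the feedback "blind": the top-ranked documents are simply assumed to be on topic.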
50Example: Post-Translation Document Expansion
English Query
Term Selection
IR System
Document to be Indexed
Top 5
IR System
Results
Single Document
Term-to-Term Translation
English Corpus
Automatic Segmentation
Mandarin Chinese Documents
51Post-Translation Document Expansion
Mandarin Newswire Text
52Why Document Expansion Works
- Story-length objects provide useful context
- Ranked retrieval finds signal amid the noise
- Selective terms discriminate among documents
- Enrich index with low DF terms from top documents
- Similar strategies work well in other applications
- CLIR query translation
- Monolingual spoken document retrieval
53Lexicon Enrichment
- Use a bilingual lexicon to align context regions
- Regions with high coincidence of known translations
- Pair unknown terms with unmatched terms
- Unknown: language A, not in the lexicon
- Unmatched: language B, not covered by translation
- Treat the most surprising pairs as new translations
- Not yet tested in a CLIR application
54Lexicon Enrichment
Similar techniques can guide translation selection
55Learning From Document Pairs
56Similarity Thesauri
- For each term, find most similar in other language
- Terms E1 and S1 (or E3 and S4) are used in similar ways
- Treat top related terms as candidate translations
- Applying dictionary-based techniques
- Performed well on a comparable news corpus
- Automatically linked based on date and subject codes
57Generalized Vector Space Model
- Term space of each language is different
- Document links define a common document space
- Describe documents based on the corpus
- Vector of similarities to each corpus document
- Compute cosine similarity in document space
- Very effective in a within-domain evaluation
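A toy sketch of the idea, assuming a document-linked training corpus held as two aligned lists (`ENGLISH_SIDE[i]` is linked to `SPANISH_SIDE[i]`). Raw term overlap stands in for whatever within-language similarity a real system would use, and all names and data are hypothetical.

```python
import math
from collections import Counter

def term_overlap(tokens_a, tokens_b):
    """Within-language similarity: dot product of raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    return sum(a[t] * b[t] for t in a)

def to_document_space(tokens, training_side):
    """Describe a text by its similarity to each training document in its own language."""
    return [term_overlap(tokens, train_doc) for train_doc in training_side]

def cosine(u, v):
    """Cosine similarity between two vectors in the shared document space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical linked training pairs.
ENGLISH_SIDE = [["bank", "money"], ["river", "water"]]
SPANISH_SIDE = [["banco", "dinero"], ["rio", "agua"]]

def cross_language_similarity(english_query, spanish_doc):
    """Compare an English query and a Spanish document without translating either."""
    q = to_document_space(english_query, ENGLISH_SIDE)
    d = to_document_space(spanish_doc, SPANISH_SIDE)
    return cosine(q, d)
```

The two term spaces never meet: the query and the document are compared only through the positions they occupy relative to the linked training documents.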
58Latent Semantic Indexing
- Cosine similarity captures noise with signal
- Term choice variation and word sense ambiguity
- Signal-preserving dimensionality reduction
- Conflates terms with similar usage patterns
- Reduces term choice effect, even across languages
- Computationally expensive
59Exploiting Parallel Corpora
- Document-linked techniques
- Corpus-guided translation selection
- Statistical machine translation
60Hieroglyphic
Egyptian Demotic
Greek
61Corpus-Guided Translation Selection
- Build target-language term n-gram language model
- Can use the collection being searched
- Smooth statistics using comparable and balanced corpora
- Use it to rank translation alternatives for each term
- Unigram language models are easy to build
- Limits uncommon translation and spelling error effects
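A unigram version of this ranking can be sketched as below, assuming the collection being searched is available as lists of target-language tokens. Add-one smoothing is used here only as a simple placeholder for smoothing with comparable and balanced corpora, and the function names are illustrative.

```python
from collections import Counter

def build_unigram_model(collection):
    """Return a function giving an add-one smoothed unigram probability."""
    counts = Counter(term for doc in collection for term in doc)
    total = sum(counts.values())
    vocabulary = len(counts)

    def probability(term):
        # The extra 1 in the denominator reserves mass for unseen terms.
        return (counts[term] + 1) / (total + vocabulary + 1)

    return probability

def rank_translations(candidates, probability):
    """Order candidate translations from most to least likely in the collection."""
    return sorted(candidates, key=probability, reverse=True)
```

Candidates that never occur in the collection fall to the bottom of the ranking, which is how rare translations and misspelled dictionary entries lose influence.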
62Statistical Machine Translation
- Add a translation model
- Trained using a term-aligned corpus
- Statistical MT toolkits are becoming available
- Excellent results on hand-assembled corpora
- Promising initial results on harvested Web pages
63Obtaining Parallel Corpora
- Translating monolingual corpora is impractical
- Humans are expensive, machines are inaccurate
- Harvesting new parallel corpora can be expensive
- Reverse engineering collection-specific link encoding
- Crawling the Web offers an interesting alternative
- Low-quality translations, but lots of them
- Reuse of existing parallel corpora is limited
- Cross-domain applications typically perform poorly
64Cognate Matching
- Dictionary coverage is inherently limited
- Translation of proper names
- Translation of newly coined terms
- Translation of unfamiliar technical terms
- Strategy: model derivational translation
- Orthography-based
- Pronunciation-based
65Matching Orthographic Cognates
- Retain untranslatable words unchanged
- Often works well between European languages
- Rule-based systems
- Even off-the-shelf spelling correction can help!
- Character-level statistical MT
- Trained using a set of representative cognates
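As a rough stand-in for the spelling-correction idea, the sketch below uses `difflib.get_close_matches` from the Python standard library to find the closest target-language vocabulary entry, and retains the word unchanged when nothing is close enough. The function name, the cutoff of 0.6, and the sample vocabulary are assumptions; a rule-based or character-level statistical system would replace the string similarity used here.

```python
import difflib

def match_cognate(word, target_vocabulary, cutoff=0.6):
    """Return the closest target-language spelling, or the word itself if none is close."""
    matches = difflib.get_close_matches(word.lower(), target_vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

For example, with the vocabulary `["telephone", "television", "table"]`, the untranslatable word `"telefono"` is matched to `"telephone"`, while an unrelated string such as `"xyz"` is passed through unchanged.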
66Matching Phonetic Cognates
- Forward transliteration
- Generate all potential transliterations
- Reverse transliteration
- Guess source string(s) that produced a transliteration
- Match in phonetic space
67Post-Translation Resegmentation
68Cross-Language Information Retrieval
Controlled Vocabulary
Free Text
Query Translation
Document Translation
Dictionary-based
Corpus-based
Term-aligned
Document-aligned
Collection-aligned
Parallel
Comparable
69Evaluation Collections
- TREC
- TREC-6/7/8: English, French, German, Italian text
- TREC-9: Chinese text
- TREC-10: Arabic text
- CLEF
- CLEF-1: English, French, German, Italian text
- CLEF-2: Above, plus Spanish and Dutch
- TDT
- TDT-2/3: Chinese and English, text and speech
- NTCIR
- NTCIR-1: Japanese and English text
- NTCIR-2: Japanese, Chinese and English text
70Topic Detection and Tracking
- English and Chinese news stories
- Newswire, radio, and television sources
- Query-by-example, mixed language/source
- Merged multilingual result set
- Set-based retrieval measures
- Focus on utility
71TDT: Evaluating CL Speech Retrieval
Development Collection TDT-2
Evaluation Collection TDT-3
Oct 98
Dec 98
Jun 98
Jan 98
17 topics, variable number of exemplars
56 topics, variable number of exemplars
English text topic exemplars: Associated Press, New York Times
2265 manually segmented stories
3371 manually segmented stories
Mandarin audio broadcast news Voice of America
Mar 98
Exhaustive relevance assessment based on event overlap
72 President Bill Clinton and Chinese President
Jiang Zemin engaged in a spirited, televised
debate Saturday over human rights and
the Tiananmen Square crackdown, and announced a
string of agreements on arms control, energy and
environmental matters. There were no announced
breakthroughs on American human rights concerns,
including Tibet, but both leaders accentuated
the positive
Query by Example
English Newswire Exemplars
Mandarin Audio Stories
[Mandarin story text; characters not recoverable]
73Known Item Retrieval
- Design queries to retrieve a single document
- Measure the rank of that document in the list
- Average the inverse of the rank across queries
- Use sign test for statistical significance
- Useful first pass evaluation strategy
- Avoids the cost of relevance judgments
- Poor mean inverse rank implies poor average precision
- Does not distinguish well among fairly close systems
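The averaging step can be written down directly. In this sketch each query has one ranked list of document identifiers and one known target; a target that is never retrieved contributes zero, which is an assumption about how misses are scored.

```python
def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the known item across queries (0 when it is not retrieved)."""
    total = 0.0
    for ranked, target in zip(rankings, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)
```

With three queries whose known items appear at rank 1, at rank 3, and not at all, the measure is (1 + 1/3 + 0) / 3, or about 0.44.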
74Evaluating Corpus-Based Techniques
- Within-domain evaluation (upper bound)
- Partition a bilingual corpus into training and test
- Use the training part to tune the system
- Generate relevance judgments for evaluation part
- Cross-domain evaluation (fair)
- Use existing corpora and evaluation collections
- No good metric for degree of domain shift
75Evaluating Lexicon Coverage
- Lexicon size
- Vocabulary coverage of the collection
- Term instance coverage of the collection
- Term weight coverage of the collection
- Term weight coverage on representative queries
- Retrieval performance on a test collection
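The difference between vocabulary coverage and term instance coverage is easy to see in code. The sketch assumes a tokenized collection and a lexicon held as a set of source terms; the function name and return format are illustrative.

```python
from collections import Counter

def coverage(collection, lexicon):
    """Return (vocabulary coverage, term instance coverage) of a lexicon."""
    counts = Counter(term for doc in collection for term in doc)
    if not counts:
        return 0.0, 0.0
    covered = [term for term in counts if term in lexicon]
    vocabulary = len(covered) / len(counts)                              # distinct terms
    instances = sum(counts[t] for t in covered) / sum(counts.values())   # occurrences
    return vocabulary, instances
```

A lexicon that covers a few very frequent words scores much higher on instance coverage than on vocabulary coverage, yet the uncovered rare terms (names, technical terms) are often the ones that carry the most term weight.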
76Outline
- Questions
- Overview
- Search
- Browsing
77Interactive CLIR
- Important: Strong support for interactive relevance judgments can make up for less accurate nominations
- Hersh et al., SIGIR 2000
- Practical: Interactive relevance judgments based on imperfect translations can beat fully automatic nominations alone
- Oard and Resnik, IPM 1997
78User-Assisted Query Translation
79Reverse Translation
Search
Swiss bank
Query in English
Click on a box to remove a possible translation
bank
Bankgebäude ( )
bankverbindung (bank account, correspondent)
bank (bench, settle)
damm (causeway, dam, embankment)
ufer (shore, strand, waterside)
wall (parapet, rampart)
Continue
80Selection
- Goal: Provide information to support decisions
- May not require very good translations
- e.g., Term-by-term title translation
- People can read past some ambiguity
- May help to display a few alternative translations
81Language-Specific Selection
Search
Swiss bank
Query in English
English
German
(Swiss) (Bankgebäude, bankverbindung, bank)
1 (0.72) Swiss Bankers Criticized AP / June 14, 1997
2 (0.48) Bank Director Resigns AP / July 24, 1997
1 (0.91) U.S. Senator Warpathing NZZ / June 14, 1997
2 (0.57) Bankensecret Law Change SDA / August 22, 1997
3 (0.36) Banks Pressure Existent NZZ / May 3, 1997
82Translingual Selection
Search
Swiss bank
Query in English
German Query
(Swiss) (Bankgebäude, bankverbindung, bank)
1 (0.91) U.S. Senator Warpathing NZZ June 14, 1997
2 (0.57) Bankensecret Law Change SDA August 22, 1997
3 (0.52) Swiss Bankers Criticized AP June 14, 1997
4 (0.36) Banks Pressure Existent NZZ May 3, 1997
5 (0.28) Bank Director Resigns AP July 24, 1997
83Merging Ranked Lists
1 voa4062 .22
2 voa3052 .21
3 voa4091 .17
1000 voa4221 .04
1 voa4062 .52
2 voa2156 .37
3 voa3052 .31
1000 voa2159 .02
- Types of Evidence
- Rank
- Score
- Evidence Combination
- Weighted round robin
- Score combination
- Parameter tuning
- Condition-based
- Query-based
1 voa4062
2 voa3052
3 voa2156
1000 voa4201
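Rank-based combination by weighted round robin might look like the sketch below. Each list takes a turn of `weights[i]` positions per round, and a document already in the merged list is skipped. How duplicates consume a turn, the function name, and the requirement that weights be positive integers are assumptions; the slide does not specify them.

```python
def weighted_round_robin(lists, weights=None):
    """Interleave ranked lists, taking weights[i] positions from list i per round.

    Weights must be positive integers. Duplicates use up a position but are
    added to the merged list only once.
    """
    weights = weights or [1] * len(lists)
    positions = [0] * len(lists)
    merged, seen = [], set()
    while any(pos < len(lst) for pos, lst in zip(positions, lists)):
        for i, lst in enumerate(lists):
            turn = lst[positions[i]:positions[i] + weights[i]]
            positions[i] += weights[i]
            for doc in turn:
                if doc not in seen:
                    seen.add(doc)
                    merged.append(doc)
    return merged
```

Applied with equal weights to the top three entries of the two lists shown on this slide, the sketch returns voa4062, voa3052, voa2156, voa4091. Score combination would instead need the two systems' scores to be made comparable first, which is where parameter tuning comes in.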
84Examination Interface
- Two goals
- Refine document delivery decisions
- Support vocabulary discovery for query refinement
- Rapid translation is essential
- Document translation retrieval strategies are a good fit
- Focused on-the-fly translation may be a viable alternative
85The Challenge
86State-of-the-Art Machine Translation
87Term-by-Term Gloss Translation
88Experiment Design
Participant   Task Order
1             Topic11, Topic17; Topic13, Topic29
2             Topic11, Topic17; Topic13, Topic29
3             Topic17, Topic11; Topic29, Topic13
4             Topic17, Topic11; Topic29, Topic13
Topic Key: Narrow 11, 13; Broad 17, 29
System Key: System A, System B
89An Experiment Session
- Task and system familiarization (30 minutes)
- Gain experience with both systems
- 4 searches (20 minutes x 4)
- Read topic description (in a language you know)
- Examine translations (into that same language)
- Select one of 5 relevance judgments (two clusters)
- Relevant
- Somewhat relevant, Not relevant, Unsure, Not judged
- Instructed to seek high precision
- 8 questionnaires
- Initial (1), each topic (4), each system (2), Final (1)
90Measure of Effectiveness
- Unbalanced F-Measure
- P = precision
- R = recall
- α = 0.8
- Favors precision over recall
- This models an application in which
- Fluent translation is expensive
- Missing some relevant documents would be okay
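The formula itself did not survive in this transcript. A common form of the unbalanced F-measure, assumed here, is van Rijsbergen's F = 1 / (α/P + (1 - α)/R), where a weight of 0.8 on precision produces the stated bias. The sketch below implements that assumed form.

```python
def f_alpha(precision, recall, alpha=0.8):
    """Unbalanced F-measure, assumed form: 1 / (alpha/P + (1 - alpha)/R)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)
```

Under this form, halving precision costs more than halving recall: P = 0.5, R = 1.0 gives about 0.56, while P = 1.0, R = 0.5 gives about 0.83.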
91French Results Overview
92English Results Overview
93Maryland Experiments
[Figure: results by participant, grouped into broad topics and narrow topics]
- MT is almost always better
- Significant overall and for narrow topics alone (one-tailed t-test, p ...)
- F measure is less insightful for narrow topics
- Always near 0 or 1
94Some Observations
- Small agreement with CLEF assessments!
- Time pressure, precision bias, strict judgments
- Systran was fairly consistent across sites
- Only when the language pair was the same
- Monolingual > Systran > Gloss
- In both recall and precision
- UNED's phrase translations improve recall
- With no adverse effect on precision
95Delivery
- Use may require high-quality translation
- Machine translation quality is often rough
- Route to best translator based on
- Acceptable delay
- Required quality (language and technical skills)
- Cost
96Summary
- Controlled vocabulary
- Mature, efficient, easily explained
- Dictionary-based
- Simple, broad coverage
- Collection-aligned corpus-based
- Generally helpful
- Document- and Term-aligned corpus-based
- Effective in the same domain
- User interface
- Only very preliminary results available
97Research Opportunities (Oard's view)
Segmentation Phrase Indexing
Lexical Coverage