Title: Cross-Lingual IR
1Cross-Lingual IR
- Salim Roukos, IBM T. J. Watson Research Center
- 9/11/02
2Assumptions for 2010 (Asilomar Report)
- 1 TB Mem, 1000 TB disk, 1B users
- 1T devices > 1B servers
- self-managing, very secure, and very reliable
- Auto-x: install, heal, adaptive, auto-tuning wizard
- Information discovery: metadata for describing schema
- cast operations
- Federation across 1k, 1m databases
- "Find the average enterprise-wide employee salary."
- "Are there any really good Italian restaurants within 5 miles of where I live?"
3Exploit multilingual information streams
- Xinhua
- SDA
- AFP
- AP
- ...
- Parallel vs comparable documents
- Build translingual search
4X-lingual Retrieval
[Diagram labels: xxx Docs, English Docs, French Docs, Chinese Docs (online); E → X MT, X → E MT, E → C MT; Query (English, Chinese); IR scoring; Ranked Docs; English for gisting]
Caveat: Machine translation isn't perfect, and queries tend to be short.
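One path through the diagram above (English query, E → X MT, IR scoring, ranked docs) can be sketched roughly as follows. This is only an illustration under stated assumptions: `TOY_LEXICON` is a hypothetical word-for-word lexicon standing in for a real MT system, the French documents are made up, and the scoring is plain term counting rather than any model from the talk. Because short queries give little context for choosing a translation, the sketch keeps every listed alternative for each query term.

```python
from collections import Counter

# Hypothetical English-to-French lexicon; a stand-in for an E -> X MT system.
TOY_LEXICON = {
    "market": ["marché"],
    "share": ["part"],
    "notebook": ["portable", "ordinateur"],
}

# Made-up French documents, used only for illustration.
FRENCH_DOCS = [
    "le restaurant italien est ouvert",
    "la part de marché du portable ibm augmente",
    "dell annonce un nouveau portable",
]


def translate_query(query, lexicon):
    """Replace each English term by all of its listed translations.

    Terms missing from the lexicon (e.g. proper nouns) pass through unchanged.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(lexicon.get(term, [term]))
    return translated


def rank_documents(query, docs, lexicon):
    """Rank documents by how many translated query terms they contain."""
    terms = translate_query(query, lexicon)
    scored = []
    for index, doc in enumerate(docs):
        counts = Counter(doc.lower().split())
        scored.append((sum(counts[t] for t in terms), index))
    # Highest score first; ties broken by original document order.
    return [index for _, index in sorted(scored, key=lambda x: (-x[0], x[1]))]


print(rank_documents("notebook market share", FRENCH_DOCS, TOY_LEXICON))
```

Keeping all alternatives trades precision for recall; a real system would presumably weight translations by probability instead of treating them equally.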
5From information need to query
- Who has the largest market share for notebooks, IBM or Dell?
- Q1: notebook market share
- Q2: laptop market share IBM Dell
- Q3: ThinkPad IBM Dell
[Diagram: information need I, query q, documents D; labels $P(q \mid I)$ and $p(q \mid D \text{ is } R, C)$]
6Probabilistic Models of IR
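D: document, C: doc collection, q: query
$$P(D \text{ is } R \mid q, C) \propto P(q \mid D \text{ is } R, C)\, P(D \text{ is } R \mid C)$$
Prior $P(D \text{ is } R \mid C)$: link analysis, other?
LM $P(q \mid D \text{ is } R, C)$: beyond 1-gram? Currently $P(q \mid D \text{ is } R) = k\, p(q \mid D) + (1-k)\, p(q)$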
- Need training data to estimate model
- On the order of 100k queries (not 1k)
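The ranking rule on this slide can be made concrete with a small sketch. It assumes a unigram model with independent query terms, lowercase whitespace tokenization, a uniform prior (`log_prior=0.0`) in place of link analysis, and that $p(q)$ is estimated from the whole collection. The value `k=0.5`, the function names, and the toy documents are illustrative choices, not taken from the talk.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def score(query, doc, collection_counts, collection_len, k=0.5, log_prior=0.0):
    """Return log prior + log query likelihood for one document.

    Each query term contributes log(k * p(term | D) + (1 - k) * p(term)),
    where p(term) is the collection (background) estimate.
    """
    doc_tokens = tokenize(doc)
    doc_counts = Counter(doc_tokens)
    total = log_prior
    for term in tokenize(query):
        p_doc = doc_counts[term] / len(doc_tokens) if doc_tokens else 0.0
        p_background = collection_counts[term] / collection_len
        p = k * p_doc + (1 - k) * p_background
        if p == 0.0:
            return float("-inf")  # term never occurs anywhere in the collection
        total += math.log(p)
    return total


def rank(query, docs, k=0.5):
    """Rank document indices by descending score, ties by original order."""
    collection_counts = Counter(t for d in docs for t in tokenize(d))
    collection_len = sum(collection_counts.values())
    scored = [
        (score(query, d, collection_counts, collection_len, k), i)
        for i, d in enumerate(docs)
    ]
    return [i for _, i in sorted(scored, key=lambda x: (-x[0], x[1]))]


# Toy collection, made up for illustration.
DOCS = [
    "ibm thinkpad notebook market share grows",
    "dell laptop market share report",
    "italian restaurants near the city",
]
print(rank("notebook market share", DOCS))
```

The background term $(1-k)\,p(q)$ keeps a document from scoring zero just because it lacks one query word, which is why the second toy document still outranks the unrelated third one.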
7Probabilistic Model of What?
$P(R \mid a, D, q, C)$
Many features in ME/MIX models: word n-grams, synonyms, WordNet, ontologies, hidden topics, top N docs, ...
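As a rough illustration of combining several features in one relevance model, the sketch below assumes "ME" means a maximum-entropy (log-linear) model, which for a binary relevance variable reduces to logistic regression. It ignores the conditioning variables a and C from the slide, uses only three of the listed feature types (words, bigrams, synonyms), and the weights and synonym table are hypothetical values chosen by hand rather than estimated from training data.

```python
import math


def features(query, doc, synonyms):
    """Return [bias, word overlap, bigram overlap, synonym overlap]."""
    q_terms = query.lower().split()
    d_terms = doc.lower().split()
    if not q_terms:
        return [1.0, 0.0, 0.0, 0.0]
    d_set = set(d_terms)
    d_bigrams = set(zip(d_terms, d_terms[1:]))
    q_bigrams = list(zip(q_terms, q_terms[1:]))
    # Fraction of query words found in the document.
    word = sum(t in d_set for t in q_terms) / len(q_terms)
    # Fraction of query bigrams found in the document.
    bigram = (
        sum(b in d_bigrams for b in q_bigrams) / len(q_bigrams)
        if q_bigrams
        else 0.0
    )
    # Fraction of query words with at least one synonym in the document.
    syn = sum(
        any(s in d_set for s in synonyms.get(t, [])) for t in q_terms
    ) / len(q_terms)
    return [1.0, word, bigram, syn]


def p_relevant(query, doc, weights, synonyms):
    """P(R | D, q) under a log-linear model: sigmoid of the weighted feature sum."""
    z = sum(w * f for w, f in zip(weights, features(query, doc, synonyms)))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and synonym table; a real model would learn the weights.
WEIGHTS = [-2.0, 2.0, 1.5, 1.0]
SYNONYMS = {"notebook": ["laptop"]}

print(p_relevant("notebook market share", "dell laptop market share report", WEIGHTS, SYNONYMS))
```

New evidence such as hidden topics or top-N document statistics would enter as extra entries in the feature vector, each with its own weight.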
8Goal -- Give users info they are seeking in context
- Is XIR different from IR?
- Translingual search ? improved monolingual retrieval?
- Monolingual vs multilingual users
- How are XIR and MT related?
- How can we scale up?
- Create training sets to foster probabilistic modeling research for IR (100k queries)
- Modeling multilingual web content and link structure
- Dialog interaction
- It's about modeling!