Title: Towards Semantics for IR
1. Towards Semantics for IR
- Eugene Agichtein
- Emory University
Acknowledgements: a number of slides in this talk are adapted from other people, including Chris Manning, ChengXiang Zhai, James Allan, Ray Mooney, and Jimmy Lin.
2. Who is this guy?
- Since Sept 2006: Assistant Professor in the Math/CS department at Emory.
- 2004-2006: Postdoc in the Text Mining, Search, and Navigation group at Microsoft Research, Redmond.
- 2004: Ph.D. in Computer Science from Columbia University; dissertation on extracting structured relations from large unstructured text databases.
- 1998: B.S. in Engineering from The Cooper Union.
Research interests: accessing, discovering, and managing information in unstructured (text) data, with current emphasis on developing robust and scalable text mining techniques for the biology and health domains.
3. Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events in Text (my
research area)
4. Information Retrieval From Text
(Diagram: an IR system mediating between documents and queries.)
5. Was that the whole story in IR?
(Diagram of the search process: source selection, query formulation, search, selection, examination, delivery.)
6. Supporting the Search Process
(Diagram: the stages of the search process (source selection, query formulation, search, selection, examination, delivery) connected to the artifacts they use or produce (resource, query, ranked list, documents); on the indexing side, acquisition builds the collection and indexing builds the index.)
7. Example Query
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
- Slow (for large corpora)
- NOT Calpurnia requires a second, inverted grep pass (e.g., egrep -v)
- Other operations (e.g., find the word Romans near countrymen, or return the top-K scenes most about the query) are not feasible
8. Term-document incidence
(Matrix of terms by plays: 1 if the play contains the word, 0 otherwise.)
Query: Brutus AND Caesar but NOT Calpurnia
9. Incidence vectors
- So we have a 0/1 vector for each term.
- Boolean model
- To answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them.
- 110100 AND 110111 AND 101111 = 100100
- Vector-space model
- Compute query-document similarity as the dot product/cosine between the query and document vectors
- Rank by similarity
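A minimal Python sketch of the Boolean incidence-vector retrieval just described; the list of six plays and the 0/1 vectors follow the standard textbook example and are assumed here for illustration.

```python
# Boolean retrieval over term incidence vectors (toy example, assumed play list).
import numpy as np

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# 0/1 incidence vector per term: 1 if the play contains the word.
brutus    = np.array([1, 1, 0, 1, 0, 0])
caesar    = np.array([1, 1, 0, 1, 1, 1])
calpurnia = np.array([0, 1, 0, 0, 0, 0])

# Brutus AND Caesar AND NOT Calpurnia -> bitwise AND with the complement.
hits = brutus & caesar & (1 - calpurnia)          # array([1, 0, 0, 1, 0, 0])
print([p for p, h in zip(plays, hits) if h])      # ['Antony and Cleopatra', 'Hamlet']
```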
10. Answers to query
- Antony and Cleopatra, Act III, Scene ii:
  Agrippa [aside to Domitius Enobarbus]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii:
  Lord Polonius: I did enact Julius Caesar: I was killed i' the
  Capitol; Brutus killed me.
11. Modern Search Engines in 1 Minute
- Crawl Time
- Inverted List: terms → doc IDs
- Content chunks (doc copies)
- Query Time
- Look up query terms in the inverted list → filter set
- Get content chunks for the doc IDs
- Rank documents using hundreds of features (e.g., term weights, web topology, proximity, position)
- Retrieve the top K documents for the query (K is much smaller than the filter set)
(Diagram: an index with inverted lists, e.g. angina → 5 and treatment → 4, plus the stored content chunks; a toy code sketch follows.)
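A toy sketch of the crawl-time/query-time split: build an inverted list mapping terms to doc IDs, then intersect postings at query time to get the filter set. The tiny documents and IDs below are made up for illustration.

```python
# Build a toy inverted list (term -> doc IDs) and answer a query with it.
from collections import defaultdict

docs = {4: "new treatment options for angina",
        5: "angina symptoms causes and treatment",
        7: "migraine treatment guidelines"}       # hypothetical content chunks

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

def filter_set(query):
    """Look up each query term in the inverted list and intersect the postings."""
    postings = [inverted.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

candidates = filter_set("angina treatment")       # {4, 5}; ranking features would order these
print(sorted(candidates))
```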
12. Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events
13. The Central Problem in IR
(Diagram: the information seeker expresses concepts as query terms, while authors express concepts as document terms. Do these represent the same concepts?)
14. Noisy-Channel Model of IR
(Diagram: an information need on one side, the document collection d1, d2, ..., dn on the other, connected by the query.)
- The user has an information need, thinks of a relevant document, and writes down a query.
- The task of information retrieval: given the query, figure out which document it came from.
15. How is this a noisy channel?
- No one seriously claims that this is actually what's going on
- But this view is mathematically convenient!
(Diagram: in the classic setting, a source sends a message to a destination through a channel that adds noise; in IR, the information need is the source message, the query-formulation process acts as the encoder/channel, and the query terms are what arrives at the destination.)
16. Problems with term-based retrieval
- Synonymy: power law vs. Zipf distribution
- Polysemy: Saturn
- Ambiguity: What do frogs eat?
17. Polysemy and Context
- Document similarity on the single-word level: polysemy and context
18. Ambiguity
- Different documents with the same keywords may have different meanings
Query: What do frogs eat? (keywords: frogs, eat)
Query: What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
Candidate passages matching the frog query's keywords (only the first answers the question):
- Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders. (relevant)
- Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds. (not relevant)
- Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats. (not relevant)
19. Indexing Word Synsets/Senses
- How does indexing word senses solve the synonymy/polysemy problem?
- Okay, so where do we get the word senses?
- WordNet: a lexical database for standard English
- Or automatically find clusters of words that describe the same concepts
Examples:
dog, canine, doggy, puppy, etc. → concept 112986
"I deposited my check in the bank." bank → concept 76529
"I saw the sailboat from the bank." bank → concept 53107
http://wordnet.princeton.edu/
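A small illustration of pulling word senses from WordNet via NLTK; this setup is an assumption (it requires `pip install nltk` and `nltk.download('wordnet')`), and the exact sense inventory depends on the WordNet version.

```python
# List a few WordNet synsets for the ambiguous word "bank".
from nltk.corpus import wordnet as wn

for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())
# Typical output includes a "sloping land beside a body of water" sense and a
# "financial institution" sense, i.e. the two readings in the examples above.
```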
20. Example: Contextual Word Similarity
- Use mutual information
Dagan et al., Computer Speech & Language, 1995
21. Word Sense Disambiguation
- Given a word in context, automatically determine the sense (concept)
- This is the Word Sense Disambiguation (WSD) problem
- Context is the key
- For each ambiguous word, note the surrounding words
- Learn a classifier from a collection of examples
- Use the classifier to determine the senses of words in the documents
bank: river, sailboat, water, etc. → side of a river
bank: check, money, account, etc. → financial institution
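A minimal sketch of the classifier idea above: represent each occurrence of "bank" by its surrounding words and learn a sense classifier from labeled examples. The four training contexts and sense labels are invented for illustration, and scikit-learn is assumed.

```python
# Supervised WSD as text classification over context words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

contexts = ["deposited my check in the",        # ... bank (financial institution)
            "saw the sailboat from the",        # ... bank (side of a river)
            "opened a savings account at the",  # ... bank (financial institution)
            "fished along the muddy river"]     # ... bank (side of a river)
senses = ["financial_institution", "river_side",
          "financial_institution", "river_side"]

wsd = make_pipeline(CountVectorizer(), MultinomialNB())
wsd.fit(contexts, senses)

print(wsd.predict(["check my account at the"]))   # -> ['financial_institution']
```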
22. Example: Unsupervised WSD
- Hypothesis: the same sense of a word will have similar neighboring words
- Disambiguation algorithm
- Identify context vectors corresponding to all occurrences of a particular word
- Partition them into regions of high density
- Assign a sense to each such region
- Example contexts for "chair":
- Sit on a chair
- Take a seat on this chair
- The chair of the Math Department
- The chair of the meeting
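A minimal sketch of this unsupervised idea: cluster the context vectors of the "chair" occurrences above and treat each dense region as an induced sense. scikit-learn is assumed, and with only four toy contexts the clusters are merely suggestive.

```python
# Cluster contexts of "chair" into two induced senses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

contexts = ["sit on a chair",
            "take a seat on this chair",
            "the chair of the math department",
            "the chair of the meeting"]

X = TfidfVectorizer().fit_transform(contexts)                 # context vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: furniture contexts vs. "chairperson" contexts
```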
23. Does it help retrieval?
- Not really
- Examples of limited success:
Ellen M. Voorhees (1993). Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
Mark Sanderson (1994). Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
Hinrich Schütze and Jan O. Pedersen (1995). Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
Rada Mihalcea and Dan Moldovan (2000). Semantic Indexing Using WordNet Senses. Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.
And others.
24. Why Disambiguation Can Hurt
- Bag-of-words techniques already disambiguate
- Context for each term is established in the query
- Heuristics (e.g., always choosing the most frequent sense) work better
- WSD is hard!
- Many words are highly polysemous, e.g., interest
- Granularity of senses is often domain/application-specific
- Queries are short: not enough context for accurate WSD
- WSD tries to improve precision
- But incorrect sense assignments would hurt recall
- Slight gains in precision do not offset large drops in recall
25. Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: word synsets, WSD
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events
26. Latent Semantic Analysis
- Perform a low-rank approximation of the document-term matrix (typical rank: 100-300)
- General idea
- Map documents (and terms) to a low-dimensional representation.
- Design the mapping such that the low-dimensional space reflects semantic associations (a latent semantic space).
- Compute document similarity based on the inner product in this latent semantic space
- Goals
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
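A minimal numpy sketch of the low-rank idea: take the SVD of a toy term-document matrix, keep k dimensions, and compare documents in the latent space. The matrix below is invented for illustration; real systems use rank 100-300 as noted above.

```python
# Rank-k approximation of a term-document matrix and similarity in latent space.
import numpy as np

# rows = terms (ship, boat, car, truck), columns = documents (toy counts)
A = np.array([[2, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 2, 0, 1],
              [0, 1, 0, 2]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T            # each row: a document in k dims

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(doc_vecs[0], doc_vecs[2]), 3))  # ~1.0: the two "ship/boat" docs coincide
print(round(cosine(doc_vecs[0], doc_vecs[1]), 3))  # ~0.0: unrelated "car/truck" doc
```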
27. Latent Semantic Analysis
- Latent semantic space: illustrating example courtesy of Susan Dumais
28. (No transcript: figure only.)
29. Simplistic picture
(Figure: three regions labeled Topic 1, Topic 2, and Topic 3.)
30-34. (No transcript: figures only.)
35. Some (old) empirical evidence
- Precision at or above median TREC precision
- Top scorer on almost 20 TREC 1, 2, 3 topics (cf. 1990)
- Slightly better on average than the original vector space
- Effect of dimensionality
36. (No transcript: figure only.)
37. Problems with term-based retrieval
- Synonymy: power law vs. Zipf distribution
- Polysemy: Saturn
- Ambiguity: What do frogs eat?
38. Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events
39. IR based on Language Models (LM)
(Diagram: each document d1, d2, ..., dn in the collection is a candidate generator of the query.)
- A common search heuristic is to use words that you expect to find in matching documents as your query. (Why, I saw Sergey Brin advocating that strategy on late-night TV one night in my hotel room, so it must be good!)
- The LM approach directly exploits that idea!
40. Formal Language (Model)
- Traditional generative model: generates strings
- Finite state machines or regular grammars, etc.
- Example: a simple automaton over the words "I" and "wish" generates
I wish
I wish I wish
I wish I wish I wish
I wish I wish I wish I wish
...
(but cannot generate a string like "wish I wish")
41. Stochastic Language Models
- Model the probability of generating strings in the language (commonly all strings over an alphabet Σ)
Model M: the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02
Sample string s: the man likes the woman
P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
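The same computation in a couple of lines of Python (math.prod needs Python 3.8+):

```python
# P(s | M) for the unigram model M on this slide.
from math import prod

M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
s = "the man likes the woman".split()

print(prod(M[w] for w in s))   # 0.2 * 0.01 * 0.02 * 0.2 * 0.01 -> 8e-08 (up to floating point)
```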
42. Stochastic Language Models
- Model the probability of generating any string
Model M1: the 0.2, class 0.0001, sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, woman 0.0001
Model M2: the 0.2, class 0.01, sayst 0.0001, pleaseth 0.0001, yon 0.0001, maiden 0.0005, woman 0.01
The same string s receives very different probabilities under the two models; comparing P(s | M1) with P(s | M2) tells us which model better explains s.
43. Stochastic Language Models
- A statistical model for generating text
- Probability distribution over strings in a given language
(Diagram: a model M generating text.)
44. Unigram and higher-order models
- Unigram Language Models (easy and effective!)
- Bigram (generally, n-gram) Language Models
- Other Language Models
- Grammar-based models (PCFGs), etc.
- Probably not the first thing to try in IR
45. Using Language Models in IR
- Treat each document as the basis for a model (e.g., unigram sufficient statistics)
- Rank document d based on P(d | q)
- P(d | q) = P(q | d) × P(d) / P(q)
- P(q) is the same for all documents, so ignore it
- P(d), the prior, is often treated as the same for all d
- But we could use criteria like authority, length, genre
- P(q | d) is the probability of q given d's model
- A very general formal approach
46. The fundamental problem of LMs
- Usually we don't know the model M
- But we have a sample of text representative of that model
- Estimate a language model from the sample
- Then compute the observation probability
(Diagram: a model M estimated from observed text.)
47. Language Models for IR
- Language Modeling Approaches:
- Attempt to model the query-generation process
- Documents are ranked by the probability that a query would be observed as a random sample from the respective document model
- Multinomial approach
48. Retrieval based on probabilistic LM
- Treat the generation of queries as a random process.
- Approach:
- Infer a language model for each document.
- Estimate the probability of generating the query according to each of these models.
- Rank the documents according to these probabilities.
- Usually a unigram estimate of words is used
- There is some work on bigrams, paralleling van Rijsbergen
49. Retrieval based on probabilistic LM
- Intuition
- Users:
- Have a reasonable idea of the terms that are likely to occur in documents of interest.
- Will choose query terms that distinguish these documents from others in the collection.
- Collection statistics:
- Are integral parts of the language model.
- Are not used heuristically as in many other approaches.
- In theory. In practice, there's usually some wiggle room for empirically set parameters.
50-51. (No transcript: figures only.)
52. Query generation probability
- Ranking formula: P(d | q) ∝ P(d) P(q | M_d)
- The probability of producing the query given the language model M_d of document d, using MLE, is
  P(q | M_d) = ∏_{t in q} P_mle(t | M_d) = ∏_{t in q} tf(t, d) / |d|
- Unigram assumption: given a particular language model, the query terms occur independently
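A minimal Python sketch of this MLE query-likelihood scoring (no smoothing yet, so any query term missing from the document zeroes the whole score); the toy document below reappears in the worked example on slide 59.

```python
# Unigram query likelihood with maximum-likelihood estimates.
from collections import Counter

def p_query_mle(query, doc):
    tf = Counter(doc.lower().split())
    doc_len = sum(tf.values())
    p = 1.0
    for t in query.lower().split():
        p *= tf[t] / doc_len          # MLE: relative frequency of t in the document
    return p

d1 = "xerox reports a profit but revenue is down"
print(p_query_mle("revenue down", d1))   # (1/8) * (1/8) = 0.015625
```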
53-55. (No transcript: figures only.)
56. Smoothing (continued)
- There's a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, 1/2, or ε to counts, Dirichlet priors, discounting, and interpolation [Chen and Goodman, 1998]
- Another simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution
57. Smoothing: Mixture model
- P(w | d) = λ Pmle(w | Md) + (1 - λ) Pmle(w | Mc)
- Mixes the probability from the document with the general collection frequency of the word.
- Correctly setting λ is very important
- A high value of λ makes the search conjunctive-like, suitable for short queries
- A low value is more suitable for long queries
- Can tune λ to optimize performance
- Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
58. Basic mixture model: summary
- General formulation of the LM for IR:
  P(q, d) = P(d) × ∏_{t in q} [ λ Pmle(t | Md) + (1 - λ) Pmle(t | Mc) ]
  where Pmle(t | Mc) is the general (collection) language model and Pmle(t | Md) is the individual-document model
- The user has a document in mind, and generates the query from this document.
- The equation represents the probability that the document the user had in mind was in fact this one.
59. Example
- Document collection (2 documents)
- d1: Xerox reports a profit but revenue is down
- d2: Lucent narrows quarter loss but revenue decreases further
- Model: MLE unigrams from the documents; λ = 1/2
- Query: revenue down
- P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
- P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256
- A component of the model is missing: what is it, and why?
- Ranking: d1 > d2
60. Language Models for IR Tasks
- Cross-lingual IR
- Distributed IR
- Structured doc retrieval
- Personalization
- Modelling redundancy
- Predicting query difficulty
- Predicting information extraction accuracy
- PLSI
61. Standard Probabilistic IR
(Diagram: the query derived from the information need is matched against each document d1, d2, ..., dn in the collection.)
62. IR based on Language Models (LM)
(Diagram: each document d1, d2, ..., dn in the collection is a candidate generator of the query.)
- A common search heuristic is to use words that you expect to find in matching documents as your query. (Why, I saw Sergey Brin advocating that strategy on late-night TV one night in my hotel room, so it must be good!)
- The LM approach directly exploits that idea!
63. Collection-Topic-Document Model
(Diagram: the query is generated from the document collection d1, d2, ..., dn.)
64. Collection-Topic-Document model
- 3-level model
- Whole-collection model
- Specific-topic model / relevant-documents model
- Individual-document model
- Relevance hypothesis
- A request (query; topic) is generated from a specific-topic model.
- Iff a document is relevant to the topic, the same model will apply to the document.
- It will replace part of the individual-document model in explaining the document.
- The probability of relevance of a document:
- The probability that this topic model explains part of the document
- The probability that the collection + topic + document combination explains the document better than the collection + document combination
65. Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events
66. Probabilistic LSI
- Uses the LSI idea, but grounded in probability theory
- Comes from the statistical Aspect (Language) Model
- Generates a co-occurrence model based on a non-observed (latent) class
- This is a mixture model
- Models a distribution through a mixture (weighted sum) of other distributions
- Independence assumptions
- Observed pairs (doc, word) are generated randomly
- Conditional independence: conditioned on the latent class, words are generated independently of the document
67. Aspect Model
K is chosen in advance (how many topics are in the collection?)
- Generation process
- Choose a doc d with prob P(d)
- There are N d's
- Choose a latent class z with (generated) prob P(z | d)
- There are K z's (K is fixed in advance)
- Generate a word w with (generated) prob P(w | z)
- This creates the pair (d, w), without direct concern for z
- Joining the probabilities gives
  P(d, w) = P(d) Σ_z P(z | d) P(w | z)
Remember: P(z | d) means the probability of z, given d
68. Aspect Model (2)
- Applying Bayes' theorem gives the symmetric form P(d, w) = Σ_z P(z) P(d | z) P(w | z)
- This is conceptually different from LSI
- The word distribution P(w | d) is based on a combination of specific classes/factors/aspects, P(w | z)
69. Detour: the EM Algorithm
- Tunes the parameters of distributions with missing/hidden data
- Here, the topics are the hidden classes
- An extremely useful, general technique
70-75. (No transcript: figures only.)
76. Expectation Maximization
- Sketch of an EM algorithm for pLSI
- E-step: calculate the posterior probabilities of z based on the current estimates
- M-step: update the parameter estimates based on the calculated probabilities
- Problem: overfitting
77. Similarities: LSI and pLSI
- Both use intermediate, latent, non-observed data for classification (hence the L)
- Can compose a joint probability similar to the LSI SVD:
- U → U_hat = [P(d_i | z_k)]
- V → V_hat = [P(w_j | z_k)]
- S → S_hat = diag(P(z_k))_k
- JP = U_hat S_hat V_hat^T
- JP is similar to the SVD of the term-document matrix N
- Values are calculated probabilistically
78. Differences: LSI and pLSI
- Basis
- LSI: term frequencies (usually); performs dimension reduction via projection, zeroing out the weaker components
- pLSI: statistical; generates a model of the probabilistic relation between W, D, and Z, and refines it until an effective model is produced
79. Experiment: 128-factor decomposition
80. Experiments
81. pLSI Improves on LSI
- Consistently better accuracy curves than LSI
- TEM vs. SVD, computationally
- Better in a modeling sense
- Uses the likelihood of the sample and aims to maximize it
- SVD uses an L2-norm or other implicit Gaussian noise assumption
- More intuitive
- Polysemy is recognizable (by viewing P(w | z))
- Similar handling of synonymy
82. LSA, pLSA in Practice?
- Only rumors (about Teoma using it)
- Both LSA and pLSA are VERY expensive
- LSA
- Running times of one day on 10K docs
- pLSA
- M. Federico (ICASSP 2002, hermes.itc.it) used a corpus of 1.2 million newspaper articles with a vocabulary of 200K words, and approximated pLSA using Non-negative Matrix Factorization (NMF)
- 612 hours of CPU time (7 processors, 2.5 hours/iteration, 35 iterations)
- Do we need (p)LSI for web search?
83. Did we solve our problem?
84. Ambiguity
- Different documents with the same keywords may have different meanings
Query: What do frogs eat? (keywords: frogs, eat)
Query: What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
Candidate passages matching the frog query's keywords (only the first answers the question):
- Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders. (relevant)
- Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds. (not relevant)
- Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats. (not relevant)
85. What we need
- Detect and exploit semantic relations between entities
- That's a whole other lecture!