Title: Towards Semantics for IR
1. Towards Semantics for IR
- Eugene Agichtein
- Emory University
Acknowledgements: a number of slides in this talk are adapted from other people, including Chris Manning, ChengXiang Zhai, James Allan, Ray Mooney, and Jimmy Lin.
2. Who is this guy?
- Since Sept 2006: Assistant Professor in the Math/CS department at Emory.
- 2004-2006: Postdoc in the Text Mining, Search, and Navigation group at Microsoft Research, Redmond.
- 2004: Ph.D. in Computer Science from Columbia University; dissertation on extracting structured relations from large unstructured text databases.
- 1998: B.S. in Engineering from The Cooper Union.
Research interests: accessing, discovering, and managing information in unstructured (text) data, with current emphasis on developing robust and scalable text mining techniques for the biology and health domains.
3. Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events in Text (my
research area)
4. Information Retrieval From Text
(Diagram: an IR system mediating between documents and queries.)
5. Was that the whole story in IR?
(Diagram of the search process: source selection, query formulation, search, selection, examination, delivery.)
6. Supporting the Search Process
(Diagram: the stages of the search process (source selection, query formulation, search, selection, examination, delivery) connected to the artifacts they use or produce (resource, query, ranked list, documents); on the indexing side, acquisition builds the collection and indexing builds the index.)
7. Example Query
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
- Slow (for large corpora)
- NOT Calpurnia requires a second, inverted grep pass (e.g., egrep -v)
- Other operations (e.g., find the word Romans near countrymen, or return the top-K scenes most about the query) are not feasible
8. Term-document incidence
(Matrix of terms by plays: 1 if the play contains the word, 0 otherwise.)
Query: Brutus AND Caesar but NOT Calpurnia
9. Incidence vectors
- So we have a 0/1 vector for each term.
- Boolean model
- To answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them.
- 110100 AND 110111 AND 101111 = 100100
- Vector-space model
- Compute query-document similarity as the dot product/cosine between the query and document vectors
- Rank by similarity
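A minimal Python sketch of the Boolean incidence-vector retrieval just described; the list of six plays and the 0/1 vectors follow the standard textbook example and are assumed here for illustration.

```python
# Boolean retrieval over term incidence vectors (toy example, assumed play list).
import numpy as np

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# 0/1 incidence vector per term: 1 if the play contains the word.
brutus    = np.array([1, 1, 0, 1, 0, 0])
caesar    = np.array([1, 1, 0, 1, 1, 1])
calpurnia = np.array([0, 1, 0, 0, 0, 0])

# Brutus AND Caesar AND NOT Calpurnia -> bitwise AND with the complement.
hits = brutus & caesar & (1 - calpurnia)          # array([1, 0, 0, 1, 0, 0])
print([p for p, h in zip(plays, hits) if h])      # ['Antony and Cleopatra', 'Hamlet']
```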
10. Answers to query
- Antony and Cleopatra, Act III, Scene ii:
  Agrippa [aside to Domitius Enobarbus]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii:
  Lord Polonius: I did enact Julius Caesar: I was killed i' the
  Capitol; Brutus killed me.
11. Modern Search Engines in 1 Minute
- Crawl Time
- Inverted List: terms → doc IDs
- Content chunks (doc copies)
- Query Time
- Look up query terms in the inverted list → filter set
- Get content chunks for the doc IDs
- Rank documents using hundreds of features (e.g., term weights, web topology, proximity, position)
- Retrieve the top K documents for the query (K is much smaller than the filter set)
(Diagram: an index with inverted lists, e.g. angina → 5 and treatment → 4, plus the stored content chunks; a toy code sketch follows.)
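A toy sketch of the crawl-time/query-time split: build an inverted list mapping terms to doc IDs, then intersect postings at query time to get the filter set. The tiny documents and IDs below are made up for illustration.

```python
# Build a toy inverted list (term -> doc IDs) and answer a query with it.
from collections import defaultdict

docs = {4: "new treatment options for angina",
        5: "angina symptoms causes and treatment",
        7: "migraine treatment guidelines"}       # hypothetical content chunks

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

def filter_set(query):
    """Look up each query term in the inverted list and intersect the postings."""
    postings = [inverted.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

candidates = filter_set("angina treatment")       # {4, 5}; ranking features would order these
print(sorted(candidates))
```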
12. Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events
13. The Central Problem in IR
(Diagram: the information seeker expresses concepts as query terms, while authors express concepts as document terms. Do these represent the same concepts?)
14. Noisy-Channel Model of IR
(Diagram: an information need on one side, the document collection d1, d2, ..., dn on the other, connected by the query.)
- The user has an information need, thinks of a relevant document, and writes down a query.
- The task of information retrieval: given the query, figure out which document it came from.
15. How is this a noisy channel?
- No one seriously claims that this is actually what's going on
- But this view is mathematically convenient!
(Diagram: in the classic setting, a source sends a message to a destination through a channel that adds noise; in IR, the information need is the source message, the query-formulation process acts as the encoder/channel, and the query terms are what arrives at the destination.)
16. Problems with term-based retrieval
- Synonymy: power law vs. Zipf distribution
- Polysemy: Saturn
- Ambiguity: What do frogs eat?
17. Polysemy and Context
- Document similarity on the single-word level: polysemy and context
18. Ambiguity
- Different documents with the same keywords may have different meanings
Query: What do frogs eat? (keywords: frogs, eat)
Query: What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
Candidate passages matching the frog query's keywords (only the first answers the question):
- Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders. (relevant)
- Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds. (not relevant)
- Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats. (not relevant)
19. Indexing Word Synsets/Senses
- How does indexing word senses solve the synonymy/polysemy problem?
- Okay, so where do we get the word senses?
- WordNet: a lexical database for standard English
- Or automatically find clusters of words that describe the same concepts
Examples:
dog, canine, doggy, puppy, etc. → concept 112986
"I deposited my check in the bank." bank → concept 76529
"I saw the sailboat from the bank." bank → concept 53107
http://wordnet.princeton.edu/
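A small illustration of pulling word senses from WordNet via NLTK; this setup is an assumption (it requires `pip install nltk` and `nltk.download('wordnet')`), and the exact sense inventory depends on the WordNet version.

```python
# List a few WordNet synsets for the ambiguous word "bank".
from nltk.corpus import wordnet as wn

for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())
# Typical output includes a "sloping land beside a body of water" sense and a
# "financial institution" sense, i.e. the two readings in the examples above.
```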
20. Example: Contextual Word Similarity
- Use mutual information
Dagan et al., Computer Speech & Language, 1995
21. Word Sense Disambiguation
- Given a word in context, automatically determine the sense (concept)
- This is the Word Sense Disambiguation (WSD) problem
- Context is the key
- For each ambiguous word, note the surrounding words
- Learn a classifier from a collection of examples
- Use the classifier to determine the senses of words in the documents
bank: river, sailboat, water, etc. → side of a river
bank: check, money, account, etc. → financial institution
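A minimal sketch of the classifier idea above: represent each occurrence of "bank" by its surrounding words and learn a sense classifier from labeled examples. The four training contexts and sense labels are invented for illustration, and scikit-learn is assumed.

```python
# Supervised WSD as text classification over context words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

contexts = ["deposited my check in the",        # ... bank (financial institution)
            "saw the sailboat from the",        # ... bank (side of a river)
            "opened a savings account at the",  # ... bank (financial institution)
            "fished along the muddy river"]     # ... bank (side of a river)
senses = ["financial_institution", "river_side",
          "financial_institution", "river_side"]

wsd = make_pipeline(CountVectorizer(), MultinomialNB())
wsd.fit(contexts, senses)

print(wsd.predict(["check my account at the"]))   # -> ['financial_institution']
```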
22. Example: Unsupervised WSD
- Hypothesis: the same sense of a word will have similar neighboring words
- Disambiguation algorithm
- Identify context vectors corresponding to all occurrences of a particular word
- Partition them into regions of high density
- Assign a sense to each such region
- Example contexts for "chair":
- Sit on a chair
- Take a seat on this chair
- The chair of the Math Department
- The chair of the meeting
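A minimal sketch of this unsupervised idea: cluster the context vectors of the "chair" occurrences above and treat each dense region as an induced sense. scikit-learn is assumed, and with only four toy contexts the clusters are merely suggestive.

```python
# Cluster contexts of "chair" into two induced senses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

contexts = ["sit on a chair",
            "take a seat on this chair",
            "the chair of the math department",
            "the chair of the meeting"]

X = TfidfVectorizer().fit_transform(contexts)                 # context vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: furniture contexts vs. "chairperson" contexts
```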
23. Does it help retrieval?
- Not really
- Examples of limited success:
Ellen M. Voorhees (1993). Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
Mark Sanderson (1994). Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
Hinrich Schütze and Jan O. Pedersen (1995). Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
Rada Mihalcea and Dan Moldovan (2000). Semantic Indexing Using WordNet Senses. Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.
And others.
24. Why Disambiguation Can Hurt
- Bag-of-words techniques already disambiguate
- Context for each term is established in the query
- Heuristics (e.g., always choosing the most frequent sense) work better
- WSD is hard!
- Many words are highly polysemous, e.g., interest
- Granularity of senses is often domain/application-specific
- Queries are short: not enough context for accurate WSD
- WSD tries to improve precision
- But incorrect sense assignments would hurt recall
- Slight gains in precision do not offset large drops in recall
25. Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: word synsets, WSD
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events
26. Latent Semantic Analysis
- Perform a low-rank approximation of the document-term matrix (typical rank: 100-300)
- General idea
- Map documents (and terms) to a low-dimensional representation.
- Design the mapping such that the low-dimensional space reflects semantic associations (a latent semantic space).
- Compute document similarity based on the inner product in this latent semantic space
- Goals
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
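A minimal numpy sketch of the low-rank idea: take the SVD of a toy term-document matrix, keep k dimensions, and compare documents in the latent space. The matrix below is invented for illustration; real systems use rank 100-300 as noted above.

```python
# Rank-k approximation of a term-document matrix and similarity in latent space.
import numpy as np

# rows = terms (ship, boat, car, truck), columns = documents (toy counts)
A = np.array([[2, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 2, 0, 1],
              [0, 1, 0, 2]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T            # each row: a document in k dims

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(doc_vecs[0], doc_vecs[2]), 3))  # ~1.0: the two "ship/boat" docs coincide
print(round(cosine(doc_vecs[0], doc_vecs[1]), 3))  # ~0.0: unrelated "car/truck" doc
```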
27. Latent Semantic Analysis
- Latent semantic space: illustrating example courtesy of Susan Dumais
28. (No transcript: figure only.)
29. Simplistic picture
(Figure: three regions labeled Topic 1, Topic 2, and Topic 3.)
30-34. (No transcript: figures only.)
35. Some (old) empirical evidence
- Precision at or above median TREC precision
- Top scorer on almost 20 TREC 1, 2, 3 topics (cf. 1990)
- Slightly better on average than the original vector space
- Effect of dimensionality
36. (No transcript: figure only.)
37. Problems with term-based retrieval
- Synonymy: power law vs. Zipf distribution
- Polysemy: Saturn
- Ambiguity: What do frogs eat?
38. Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events
39. IR based on Language Models (LM)
(Diagram: each document d1, d2, ..., dn in the collection is a candidate generator of the query.)
- A common search heuristic is to use words that you expect to find in matching documents as your query. (Why, I saw Sergey Brin advocating that strategy on late-night TV one night in my hotel room, so it must be good!)
- The LM approach directly exploits that idea!
40. Formal Language (Model)
- Traditional generative model: generates strings
- Finite state machines or regular grammars, etc.
- Example: a simple automaton over the words "I" and "wish" generates
I wish
I wish I wish
I wish I wish I wish
I wish I wish I wish I wish
...
(but cannot generate a string like "wish I wish")
41. Stochastic Language Models
- Model the probability of generating strings in the language (commonly all strings over an alphabet Σ)
Model M: the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02
Sample string s: the man likes the woman
P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
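The same computation in a couple of lines of Python (math.prod needs Python 3.8+):

```python
# P(s | M) for the unigram model M on this slide.
from math import prod

M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
s = "the man likes the woman".split()

print(prod(M[w] for w in s))   # 0.2 * 0.01 * 0.02 * 0.2 * 0.01 -> 8e-08 (up to floating point)
```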
42. Stochastic Language Models
- Model the probability of generating any string
Model M1: the 0.2, class 0.0001, sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, woman 0.0001
Model M2: the 0.2, class 0.01, sayst 0.0001, pleaseth 0.0001, yon 0.0001, maiden 0.0005, woman 0.01
The same string s receives very different probabilities under the two models; comparing P(s | M1) with P(s | M2) tells us which model better explains s.
43. Stochastic Language Models
- A statistical model for generating text
- Probability distribution over strings in a given language
(Diagram: a model M generating text.)
44. Unigram and higher-order models
- Unigram Language Models (easy and effective!)
- Bigram (generally, n-gram) Language Models
- Other Language Models
- Grammar-based models (PCFGs), etc.
- Probably not the first thing to try in IR
45. Using Language Models in IR
- Treat each document as the basis for a model (e.g., unigram sufficient statistics)
- Rank document d based on P(d | q)
- P(d | q) = P(q | d) × P(d) / P(q)
- P(q) is the same for all documents, so ignore it
- P(d), the prior, is often treated as the same for all d
- But we could use criteria like authority, length, genre
- P(q | d) is the probability of q given d's model
- A very general formal approach
46. The fundamental problem of LMs
- Usually we don't know the model M
- But we have a sample of text representative of that model
- Estimate a language model from the sample
- Then compute the observation probability
(Diagram: a model M estimated from observed text.)
47. Language Models for IR
- Language Modeling Approaches:
- Attempt to model the query-generation process
- Documents are ranked by the probability that a query would be observed as a random sample from the respective document model
- Multinomial approach
48. Retrieval based on probabilistic LM
- Treat the generation of queries as a random process.
- Approach:
- Infer a language model for each document.
- Estimate the probability of generating the query according to each of these models.
- Rank the documents according to these probabilities.
- Usually a unigram estimate of words is used
- There is some work on bigrams, paralleling van Rijsbergen
49. Retrieval based on probabilistic LM
- Intuition
- Users:
- Have a reasonable idea of the terms that are likely to occur in documents of interest.
- Will choose query terms that distinguish these documents from others in the collection.
- Collection statistics:
- Are integral parts of the language model.
- Are not used heuristically as in many other approaches.
- In theory. In practice, there's usually some wiggle room for empirically set parameters.
50-51. (No transcript: figures only.)
52. Query generation probability
- Ranking formula: P(d | q) ∝ P(d) P(q | M_d)
- The probability of producing the query given the language model M_d of document d, using MLE, is
  P(q | M_d) = ∏_{t in q} P_mle(t | M_d) = ∏_{t in q} tf(t, d) / |d|
- Unigram assumption: given a particular language model, the query terms occur independently
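A minimal Python sketch of this MLE query-likelihood scoring (no smoothing yet, so any query term missing from the document zeroes the whole score); the toy document below reappears in the worked example on slide 59.

```python
# Unigram query likelihood with maximum-likelihood estimates.
from collections import Counter

def p_query_mle(query, doc):
    tf = Counter(doc.lower().split())
    doc_len = sum(tf.values())
    p = 1.0
    for t in query.lower().split():
        p *= tf[t] / doc_len          # MLE: relative frequency of t in the document
    return p

d1 = "xerox reports a profit but revenue is down"
print(p_query_mle("revenue down", d1))   # (1/8) * (1/8) = 0.015625
```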
53-55. (No transcript: figures only.)
56. Smoothing (continued)
- There's a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, 1/2, or ε to counts, Dirichlet priors, discounting, and interpolation [Chen and Goodman, 1998]
- Another simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution
57. Smoothing: Mixture model
- P(w | d) = λ Pmle(w | Md) + (1 - λ) Pmle(w | Mc)
- Mixes the probability from the document with the general collection frequency of the word.
- Correctly setting λ is very important
- A high value of λ makes the search conjunctive-like, suitable for short queries
- A low value is more suitable for long queries
- Can tune λ to optimize performance
- Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
58. Basic mixture model: summary
- General formulation of the LM for IR:
  P(q, d) = P(d) × ∏_{t in q} [ λ Pmle(t | Md) + (1 - λ) Pmle(t | Mc) ]
  where Pmle(t | Mc) is the general (collection) language model and Pmle(t | Md) is the individual-document model
- The user has a document in mind, and generates the query from this document.
- The equation represents the probability that the document the user had in mind was in fact this one.
59. Example
- Document collection (2 documents)
- d1: Xerox reports a profit but revenue is down
- d2: Lucent narrows quarter loss but revenue decreases further
- Model: MLE unigrams from the documents; λ = 1/2
- Query: revenue down
- P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
- P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256
- A component of the model is missing: what is it, and why?
- Ranking: d1 > d2
60. Language Models for IR Tasks
- Cross-lingual IR
- Distributed IR
- Structured doc retrieval
- Personalization
- Modelling redundancy
- Predicting query difficulty
- Predicting information extraction accuracy
- PLSI
61. Standard Probabilistic IR
(Diagram: the query derived from the information need is matched against each document d1, d2, ..., dn in the collection.)
62. IR based on Language Models (LM)
(Diagram: each document d1, d2, ..., dn in the collection is a candidate generator of the query.)
- A common search heuristic is to use words that you expect to find in matching documents as your query. (Why, I saw Sergey Brin advocating that strategy on late-night TV one night in my hotel room, so it must be good!)
- The LM approach directly exploits that idea!
63. Collection-Topic-Document Model
(Diagram: the query is generated from the document collection d1, d2, ..., dn.)
64. Collection-Topic-Document model
- 3-level model
- Whole-collection model
- Specific-topic model / relevant-documents model
- Individual-document model
- Relevance hypothesis
- A request (query; topic) is generated from a specific-topic model.
- Iff a document is relevant to the topic, the same model will apply to the document.
- It will replace part of the individual-document model in explaining the document.
- The probability of relevance of a document:
- The probability that this topic model explains part of the document
- The probability that the collection + topic + document combination explains the document better than the collection + document combination
65. Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval
- Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
- LSI
- Language Models for IR
- PLSI
- Towards real semantic search
- Entities, Relations, Facts, Events
66. Probabilistic LSI
- Uses the LSI idea, but grounded in probability theory
- Comes from the statistical Aspect (Language) Model
- Generates a co-occurrence model based on a non-observed (latent) class
- This is a mixture model
- Models a distribution through a mixture (weighted sum) of other distributions
- Independence assumptions
- Observed pairs (doc, word) are generated randomly
- Conditional independence: conditioned on the latent class, words are generated independently of the document
67. Aspect Model
K is chosen in advance (how many topics are in the collection?)
- Generation process
- Choose a doc d with prob P(d)
- There are N d's
- Choose a latent class z with (generated) prob P(z | d)
- There are K z's (K is fixed in advance)
- Generate a word w with (generated) prob P(w | z)
- This creates the pair (d, w), without direct concern for z
- Joining the probabilities gives
  P(d, w) = P(d) Σ_z P(z | d) P(w | z)
Remember: P(z | d) means the probability of z, given d
68. Aspect Model (2)
- Applying Bayes' theorem gives the symmetric form P(d, w) = Σ_z P(z) P(d | z) P(w | z)
- This is conceptually different from LSI
- The word distribution P(w | d) is based on a combination of specific classes/factors/aspects, P(w | z)
69. Detour: the EM Algorithm
- Tunes the parameters of distributions with missing/hidden data
- Here, the topics are the hidden classes
- An extremely useful, general technique
70-75. (No transcript: figures only.)
76. Expectation Maximization
- Sketch of an EM algorithm for pLSI
- E-step: calculate the posterior probabilities of z based on the current estimates
- M-step: update the parameter estimates based on the calculated probabilities
- Problem: overfitting
77. Similarities: LSI and pLSI
- Both use intermediate, latent, non-observed data for classification (hence the L)
- Can compose a joint probability similar to the LSI SVD:
- U → U_hat = [P(d_i | z_k)]
- V → V_hat = [P(w_j | z_k)]
- S → S_hat = diag(P(z_k))_k
- JP = U_hat S_hat V_hat^T
- JP is similar to the SVD of the term-document matrix N
- Values are calculated probabilistically
78. Differences: LSI and pLSI
- Basis
- LSI: term frequencies (usually); performs dimension reduction via projection, zeroing out the weaker components
- pLSI: statistical; generates a model of the probabilistic relation between W, D, and Z, and refines it until an effective model is produced
79. Experiment: 128-factor decomposition
80. Experiments
81. pLSI Improves on LSI
- Consistently better accuracy curves than LSI
- TEM vs. SVD, computationally
- Better in a modeling sense
- Uses the likelihood of the sample and aims to maximize it
- SVD uses an L2-norm or other implicit Gaussian noise assumption
- More intuitive
- Polysemy is recognizable (by viewing P(w | z))
- Similar handling of synonymy
82. LSA, pLSA in Practice?
- Only rumors (about Teoma using it)
- Both LSA and pLSA are VERY expensive
- LSA
- Running times of one day on 10K docs
- pLSA
- M. Federico (ICASSP 2002, hermes.itc.it) used a corpus of 1.2 million newspaper articles with a vocabulary of 200K words, and approximated pLSA using Non-negative Matrix Factorization (NMF)
- 612 hours of CPU time (7 processors, 2.5 hours/iteration, 35 iterations)
- Do we need (p)LSI for web search?
83. Did we solve our problem?
84. Ambiguity
- Different documents with the same keywords may have different meanings
Query: What do frogs eat? (keywords: frogs, eat)
Query: What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
Candidate passages matching the frog query's keywords (only the first answers the question):
- Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders. (relevant)
- Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds. (not relevant)
- Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats. (not relevant)
85. What we need
- Detect and exploit semantic relations between entities
- That's a whole other lecture!