Title: Information Retrieval
1. Information Retrieval
- CSE 8337
- Spring 2005
- Modeling
- Material for these slides obtained from
  - Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/hearst/irbook/
  - Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983.
2. Modeling TOC
- Introduction
- Classic IR Models
  - Boolean Model
  - Vector Model
  - Probabilistic Model
- Set Theoretic Models
  - Fuzzy Set Model
  - Extended Boolean Model
- Generalized Vector Model
- Latent Semantic Indexing
- Neural Network Model
- Alternative Probabilistic Models
  - Inference Network
  - Belief Network
3. Introduction
- IR systems usually adopt index terms to process queries
- Index term
  - a keyword or group of selected words
  - any word (more general)
- Stemming might be used
  - connecting, connection, connections → connect
- An inverted file is built for the chosen index terms
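The inverted-file idea above can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' implementation: the documents, the `crude_stem` suffix-stripper (a stand-in for a real stemmer such as Porter's), and the function names are all invented for this example.

```python
from collections import defaultdict

def crude_stem(word):
    # Toy suffix stripping for the connect/connecting/connection example;
    # a real system would use a proper stemmer (e.g., Porter).
    for suffix in ("ions", "ion", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_inverted_file(docs):
    # Map each (stemmed) index term to the set of doc ids containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[crude_stem(word)].add(doc_id)
    return index

docs = ["connecting networks", "a network connection", "unrelated text"]
index = build_inverted_file(docs)
print(sorted(index["connect"]))  # [0, 1]: both docs map to the stem "connect"
```

At query time, the system looks query terms up in this index instead of scanning documents, which is what makes index-term matching fast.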
4. Introduction
- [Diagram: docs are indexed into index terms; the user's information need is expressed as a query; matching the query against the index terms produces a ranking of documents]
5. Introduction
- Matching at the index term level is quite imprecise
- No surprise that users frequently get unsatisfactory results
- Since most users have no training in query formation, the problem is even worse
  - Frequent dissatisfaction of Web users
- The issue of deciding relevance is critical for IR systems: ranking
6. Introduction
- A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the query
- A ranking is based on fundamental premises regarding the notion of relevance, such as
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model
7. IR Models
- User Task
  - Retrieval
    - Ad hoc
    - Filtering
  - Browsing
8. IR Models
9. Classic IR Models - Basic Concepts
- Each document is represented by a set of representative keywords or index terms
- An index term is a document word useful for remembering the document's main themes
- Usually, index terms are nouns because nouns have meaning by themselves
- However, search engines assume that all words are index terms (full-text representation)
10. Classic IR Models - Basic Concepts
- The importance of the index terms is represented by weights associated with them
- ki: an index term
- dj: a document
- wij: a weight associated with the pair (ki, dj)
- The weight wij quantifies the importance of the index term for describing the document's contents
11. Classic IR Models - Basic Concepts
- t is the total number of index terms
- K = {k1, k2, ..., kt} is the set of all index terms
- wij > 0 is a weight associated with (ki, dj)
- wij = 0 indicates that the term does not belong to the doc
- dj = (w1j, w2j, ..., wtj) is a weighted vector associated with the document dj
- gi(dj) = wij is a function which returns the weight associated with the pair (ki, dj)
12. The Boolean Model
- Simple model based on set theory
- Queries specified as Boolean expressions
  - precise semantics and neat formalism
- Terms are either present or absent; thus wij ∈ {0, 1}
- Consider
  - q = ka ∧ (kb ∨ ¬kc)
  - qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  - qcc = (1,1,0) is a conjunctive component
13. The Boolean Model
- q = ka ∧ (kb ∨ ¬kc)
- sim(q,dj) =
  - 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc))
  - 0 otherwise
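The Boolean similarity for the running query q = ka ∧ (kb ∨ ¬kc) can be sketched as follows. This is an illustrative sketch: a document is reduced to its binary weight vector (ga, gb, gc), and sim is 1 exactly when that vector equals one of the query's DNF conjunctive components.

```python
# DNF of q = ka AND (kb OR NOT kc), as on the slide:
# the three conjunctive components over (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_weights):
    # 1 if the document's binary term vector matches some conjunctive
    # component of the query's DNF, 0 otherwise.
    return 1 if tuple(doc_weights) in Q_DNF else 0

print(sim((1, 1, 0)))  # 1: matches the conjunctive component (1,1,0)
print(sim((0, 1, 1)))  # 0: ka is absent, so no component matches
```

Note the all-or-nothing outcome: a document containing ka and kb but also kc still matches (component (1,1,1)), while one missing only ka scores 0, with no partial credit.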
14. Drawbacks of the Boolean Model
- Retrieval based on a binary decision criterion with no notion of partial matching
- No ranking of the documents is provided
- The information need has to be translated into a Boolean expression
- The Boolean queries formulated by users are most often too simplistic
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
15. The Vector Model
- Use of binary weights is too limiting
- Non-binary weights provide consideration for partial matches
- These term weights are used to compute a degree of similarity between a query and each document
- A ranked set of documents provides for better matching
16. The Vector Model
- wij > 0 whenever ki appears in dj
- wiq > 0 is associated with the pair (ki, q)
- dj = (w1j, w2j, ..., wtj)
- q = (w1q, w2q, ..., wtq)
- To each term ki is associated a unit vector i
- The unit vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
- The t unit vectors i form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors
17. The Vector Model
- [Diagram: document vector dj and query vector q in term space, separated by angle θ]
- sim(q,dj) = cos(θ) = (dj • q) / (|dj| × |q|) = Σi wij × wiq / (|dj| × |q|)
- Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1
- A document is retrieved even if it matches the query terms only partially
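The cosine formula can be sketched directly from its definition. A minimal illustration; the example weight vectors are invented for this sketch.

```python
import math

def cosine_sim(d, q):
    # cos(theta) = (d . q) / (|d| * |q|); 0.0 if either vector is all zeros.
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

d = [0.5, 0.8, 0.3]   # document term weights (assumed data)
q = [0.0, 1.0, 0.5]   # query term weights (assumed data)
print(round(cosine_sim(d, q), 3))  # ≈ 0.858
```

Because all weights are non-negative, the result always lands in [0, 1], and a document sharing only some query terms (here, two of three) still gets a positive score, which is exactly the partial matching the slide describes.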
18. Weights wij and wiq?
- One approach is to examine the frequency of occurrence of a word in a document
- Absolute frequency
  - tf factor, the term frequency within a document
  - freq(i,j): raw frequency of ki within dj
  - Both high-frequency and low-frequency terms may not actually be significant
- Relative frequency: tf divided by the number of words in the document
- Normalized frequency
  - fi,j = freq(i,j) / maxl freq(l,j)
19. Inverse Document Frequency
- The importance of a term may depend more on how well it can distinguish between documents
- Quantification of inter-document separation
- Dissimilarity, not similarity
- idf factor, the inverse document frequency
20. IDF
- N: the total number of docs in the collection
- ni: the number of docs which contain ki
- The idf factor is computed as
  - idfi = log(N/ni)
  - The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
- IDF Example
  - N = 1000, n1 = 100, n2 = 500, n3 = 800
  - idf1 = 3 - 2 = 1
  - idf2 = 3 - 2.7 = 0.3
  - idf3 = 3 - 2.9 = 0.1
21. The Vector Model
- The best term-weighting schemes take both factors into account
  - wij = fi,j × log(N/ni)
- This strategy is called a tf-idf weighting scheme
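Putting the two factors together, a tf-idf weight can be computed as a one-liner. This sketch uses base-10 logarithms, matching the slides' IDF example (N = 1000, n1 = 100 giving idf1 = 1); the frequency values passed in below are invented for illustration.

```python
import math

def tf_idf(freq_ij, max_freq_j, N, n_i):
    # w_ij = f_ij * log(N / n_i), where f_ij is the max-normalized tf.
    f_ij = freq_ij / max_freq_j
    return f_ij * math.log10(N / n_i)

# Term 1 from the slides' example: N = 1000 docs, appears in n1 = 100 of them.
# If it is also the most frequent term in dj (f_ij = 1), its weight is the idf:
print(tf_idf(freq_ij=4, max_freq_j=4, N=1000, n_i=100))  # 1.0
```

A rare term (small ni) gets a large idf boost, while a term occurring in most documents contributes little regardless of its tf, which is the distinguishing behavior slide 19 asks for.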
22. The Vector Model
- For the query term weights, a suggestion is
  - wiq = (0.5 + 0.5 × freq(i,q) / maxl freq(l,q)) × log(N/ni)
- The vector model with tf-idf weights is a good ranking strategy for general collections
- The vector model is usually as good as any known ranking alternative
- It is also simple and fast to compute
23. The Vector Model
- Advantages
  - term weighting improves the quality of the answer set
  - partial matching allows retrieval of docs that approximate the query conditions
  - the cosine ranking formula sorts documents according to their degree of similarity to the query
- Disadvantages
  - assumes independence of index terms; not clear that this is bad, though
24. The Vector Model: Example I
25. The Vector Model: Example II
26. The Vector Model: Example III
27. Probabilistic Model
- Objective: to capture the IR problem using a probabilistic framework
- Given a user query, there is an ideal answer set
- Querying as specification of the properties of this ideal answer set (clustering)
- But what are these properties?
  - Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
  - Improve by iteration
28. Probabilistic Model
- An initial set of documents is retrieved somehow
- The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
- The IR system uses this information to refine the description of the ideal answer set
- By repeating this process, it is expected that the description of the ideal answer set will improve
- Keep in mind the need to guess the description of the ideal answer set at the very beginning
- The description of the ideal answer set is modeled in probabilistic terms
29. Probabilistic Ranking Principle
- Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
- But,
  - how to compute the probabilities?
  - what is the sample space?
30. The Ranking
- Probabilistic ranking computed as
  - sim(q,dj) = P(dj relevant to q) / P(dj non-relevant to q)
- This is the odds of the document dj being relevant
- Taking the odds minimizes the probability of an erroneous judgement
- Definitions
  - wij ∈ {0, 1}
  - P(R | dj): probability that the given doc is relevant
  - P(¬R | dj): probability that the doc is not relevant
31. The Ranking
- sim(dj,q) = P(R | dj) / P(¬R | dj)
            = [P(dj | R) × P(R)] / [P(dj | ¬R) × P(¬R)]   (by Bayes' rule)
            ~ P(dj | R) / P(dj | ¬R)   (P(R) and P(¬R) are the same for all docs, so they can be dropped for ranking)
- P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents
32. The Ranking
- Assuming independence of index terms,
  sim(dj,q) ~ [ Π(ki∈dj) P(ki | R) × Π(ki∉dj) P(¬ki | R) ] / [ Π(ki∈dj) P(ki | ¬R) × Π(ki∉dj) P(¬ki | ¬R) ]
- P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents
33. The Ranking
- sim(dj,q)
  ~ log { [ Π P(ki | R) × Π P(¬ki | R) ] / [ Π P(ki | ¬R) × Π P(¬ki | ¬R) ] }
  ~ K + Σ [ log ( P(ki | R) / P(¬ki | R) ) + log ( P(¬ki | ¬R) / P(ki | ¬R) ) ]
- where
  - P(¬ki | R) = 1 - P(ki | R)
  - P(¬ki | ¬R) = 1 - P(ki | ¬R)
34. The Initial Ranking
- sim(dj,q) ~ Σ wiq × wij × ( log [ P(ki | R) / (1 - P(ki | R)) ] + log [ (1 - P(ki | ¬R)) / P(ki | ¬R) ] )
- Probabilities P(ki | R) and P(ki | ¬R)?
- Estimates based on assumptions
  - P(ki | R) = 0.5
  - P(ki | ¬R) = ni / N
- Use this initial guess to retrieve an initial ranking
- Improve upon this initial ranking
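The initial ranking formula can be sketched as follows, under the slides' starting assumptions P(ki | R) = 0.5 and P(ki | ¬R) = ni/N. This is an illustrative sketch with binary weights (wij = wiq = 1 for terms present); the document frequencies in the example are invented.

```python
import math

def initial_sim(doc_terms, query_terms, n, N):
    # Sum, over terms present in both query and document, the log-odds
    # weights log[p_r/(1-p_r)] + log[(1-p_nr)/p_nr].
    score = 0.0
    for ki in query_terms & doc_terms:
        p_r = 0.5           # initial guess for P(ki | R)
        p_nr = n[ki] / N    # P(ki | ~R) approximated by document frequency
        score += math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)
    return score

n = {"model": 100, "fuzzy": 10}   # doc frequencies ni (assumed data)
score = initial_sim({"model", "fuzzy"}, {"fuzzy"}, n, N=1000)
print(round(score, 3))  # log(1) + log(0.99/0.01) ≈ 4.595
```

With P(ki | R) fixed at 0.5, the first term vanishes and the initial ranking is driven entirely by the idf-like second term, so rarer matching terms score higher.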
35. Improving the Initial Ranking
- Let
  - V: set of docs initially retrieved
  - Vi: subset of docs retrieved that contain ki
- Re-evaluate the estimates
  - P(ki | R) = |Vi| / |V|
  - P(ki | ¬R) = (ni - |Vi|) / (N - |V|)
- Repeat recursively
36. Improving the Initial Ranking
- To avoid problems with |V| = 1 and |Vi| = 0
  - P(ki | R) = (|Vi| + 0.5) / (|V| + 1)
  - P(ki | ¬R) = (ni - |Vi| + 0.5) / (N - |V| + 1)
- Alternatively,
  - P(ki | R) = (|Vi| + ni/N) / (|V| + 1)
  - P(ki | ¬R) = (ni - |Vi| + ni/N) / (N - |V| + 1)
37. Pluses and Minuses
- Advantages
  - docs are ranked in decreasing order of their probability of relevance
- Disadvantages
  - need to guess the initial estimates for P(ki | R)
  - method does not take into account tf and idf factors
38. Brief Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered to be the weakest classic model
- Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
- This also seems to be the view of the research community
39. Set Theoretic Models
- The Boolean model imposes a binary criterion for deciding relevance
- The question of how to extend the Boolean model to accommodate partial matching and a ranking has attracted considerable attention in the past
- We now discuss two set theoretic models
  - Fuzzy Set Model
  - Extended Boolean Model
40. Fuzzy Set Model
- The vagueness of document/query matching can be modeled using a fuzzy framework, as follows
  - with each term is associated a fuzzy set
  - each doc has a degree of membership in this fuzzy set
- Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
41. Fuzzy Set Theory
- A fuzzy subset A of U is characterized by a membership function μ(A,u): U → [0,1] which associates with each element u of U a number μ(A,u) in the interval [0,1]
- Definition
  - Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of A. Then,
    - μ(¬A,u) = 1 - μ(A,u)
    - μ(A∪B,u) = max(μ(A,u), μ(B,u))
    - μ(A∩B,u) = min(μ(A,u), μ(B,u))
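The three fuzzy-set operations are one-liners on membership degrees. A minimal sketch; the membership values are invented for the example.

```python
def f_not(mu):
    # Complement: mu(~A, u) = 1 - mu(A, u)
    return 1.0 - mu

def f_or(mu_a, mu_b):
    # Union: mu(A u B, u) = max of the memberships
    return max(mu_a, mu_b)

def f_and(mu_a, mu_b):
    # Intersection: mu(A n B, u) = min of the memberships
    return min(mu_a, mu_b)

mu_a, mu_b = 0.75, 0.5          # assumed membership degrees
print(f_not(mu_a))              # 0.25
print(f_or(mu_a, mu_b))         # 0.75
print(f_and(mu_a, mu_b))        # 0.5
```

With crisp memberships (0 or 1) these reduce to ordinary Boolean NOT/OR/AND, which is why the fuzzy model is a strict generalization of the Boolean one.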
42. Fuzzy Information Retrieval
- Fuzzy sets are modeled based on a thesaurus
- This thesaurus is built as follows
  - Let c be a term-term correlation matrix
  - Let ci,l be a normalized correlation factor for (ki, kl):
    ci,l = ni,l / (ni + nl - ni,l)
  - ni: number of docs which contain ki
  - nl: number of docs which contain kl
  - ni,l: number of docs which contain both ki and kl
- We now have a notion of proximity among index terms
43. Fuzzy Information Retrieval
- The correlation factor ci,l can be used to define fuzzy set membership for a document dj as follows:
  μi,j = 1 - Π(kl ∈ dj) (1 - ci,l)
- μi,j: membership of doc dj in the fuzzy subset associated with ki
- The above expression computes an algebraic sum over all terms in the doc dj
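Both the correlation factor and the membership computation are short enough to sketch. An illustrative sketch; the counts and correlation values below are invented.

```python
def correlation(n_i, n_l, n_il):
    # c_il = n_il / (n_i + n_l - n_il): Jaccard-style normalized co-occurrence.
    return n_il / (n_i + n_l - n_il)

def membership(correlations):
    # mu_ij = 1 - prod over terms kl in dj of (1 - c_il):
    # the algebraic sum, computed as the complement of a negated product.
    prod = 1.0
    for c in correlations:
        prod *= (1.0 - c)
    return 1.0 - prod

print(correlation(n_i=10, n_l=10, n_il=10))       # 1.0: terms always co-occur
print(round(membership([0.9, 0.2]), 3))           # 1 - 0.1*0.8 = 0.92
```

As the slide notes, one term in dj that is strongly correlated with ki (ci,l near 1) is enough to pull the membership μi,j near 1, regardless of the other terms.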
44. Fuzzy Information Retrieval
- A doc dj belongs to the fuzzy set for ki if its own terms are associated with ki
- If doc dj contains a term kl which is closely related to ki, we have
  - ci,l ≈ 1
  - μi,j ≈ 1
  - the index ki is a good fuzzy index for the doc
45. Fuzzy IR: An Example
- q = ka ∧ (kb ∨ ¬kc)
- qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) = cc1 ∨ cc2 ∨ cc3
- μq,dj = μ(cc1 ∨ cc2 ∨ cc3),j
        = 1 - (1 - μa,j × μb,j × μc,j) × (1 - μa,j × μb,j × (1 - μc,j)) × (1 - μa,j × (1 - μb,j) × (1 - μc,j))
46. Fuzzy Information Retrieval
- Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
- Experiments with standard test collections are not available
- Difficult to compare at this time
47. Extended Boolean Model
- The Boolean model is simple and elegant
- But it makes no provision for a ranking
- As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
- Extend the Boolean model with the notions of partial matching and term weighting
- Combine characteristics of the vector model with properties of Boolean algebra
48. The Idea
- The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
- Let
  - q = kx ∧ ky
  - wxj = fxj × idfx / maxi idfi, the weight associated with (kx, dj)
  - Further, let wxj = x and wyj = y
49The Idea
qand kx ? ky wxj x and wyj y
AND
50The Idea
qor kx ? ky wxj x and wyj y
(1,1)
OR
51. Generalizing the Idea
- We can extend the previous model to consider Euclidean distances in a t-dimensional space
- This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter
52. Generalizing the Idea
- A generalized disjunctive query is given by
  - qor = k1 ∨p k2 ∨p ... ∨p kt
- A generalized conjunctive query is given by
  - qand = k1 ∧p k2 ∧p ... ∧p kt
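The standard p-norm similarities for these generalized queries (from Salton, Fox, and Wu) can be sketched as follows; the weight vectors in the example are invented for illustration.

```python
def sim_or(xs, p):
    # sim(q_or, d) = ((x1^p + ... + xt^p) / t)^(1/p)
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    # sim(q_and, d) = 1 - (((1-x1)^p + ... + (1-xt)^p) / t)^(1/p)
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

xs = [1.0, 0.0]               # one query term matched perfectly, one absent
print(round(sim_or(xs, 2), 4))   # 0.7071: partial credit, unlike Boolean OR
print(round(sim_and(xs, 2), 4))  # 0.2929: penalized, but not zeroed out
```

Varying p interpolates between models: as p grows the formulas approach Boolean max/min behavior, while p = 1 makes both reduce to the same average of weights, i.e., vector-like ranking.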
53. Properties
54. Properties
- This is quite powerful and is a good argument in favor of the extended Boolean model
- Operators can be nested in a single query; for instance, k1 and k2 can contribute as in vector retrieval while k3 enters through the outer operator
- For q = (k1 ∧ k2) ∨ k3 with p = 2:
  - sim(q,dj) = ( [ (1 - √( ((1-x1)² + (1-x2)²) / 2 ))² + x3² ] / 2 )^(1/2)
55. Conclusions
- The model is quite powerful
- Its properties are interesting and might be useful
- Computation is somewhat complex
- However, distributivity does not hold for the ranking computation
  - q1 = (k1 ∨ k2) ∧ k3
  - q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)
  - sim(q1,dj) ≠ sim(q2,dj)
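The failure of distributivity can be checked numerically with the standard p-norm similarities (as in Salton, Fox, and Wu); the document weights x1, x2, x3 below are invented for the check.

```python
def sim_or(xs, p):
    # p-norm OR: ((x1^p + ... + xt^p) / t)^(1/p)
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    # p-norm AND: 1 - (((1-x1)^p + ... + (1-xt)^p) / t)^(1/p)
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

x1, x2, x3, p = 1.0, 0.0, 1.0, 2
s1 = sim_and([sim_or([x1, x2], p), x3], p)                       # (k1 OR k2) AND k3
s2 = sim_or([sim_and([x1, x3], p), sim_and([x2, x3], p)], p)     # (k1 AND k3) OR (k2 AND k3)
print(round(s1, 4), round(s2, 4))  # 0.7929 vs 0.7368: the two rankings differ
```

So two queries that are logically equivalent in Boolean algebra can rank the same document differently, which is the caveat the slide warns about.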