Title: Information Retrieval
1. Information Retrieval
- CSE 8337
- Spring 2005
- Modeling
- Material for these slides obtained from
  - Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/hearst/irbook/
  - Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983.
2. Modeling TOC
- Introduction
- Classic IR Models
  - Boolean Model
  - Vector Model
  - Probabilistic Model
- Set Theoretic Models
  - Fuzzy Set Model
  - Extended Boolean Model
- Generalized Vector Model
- Latent Semantic Indexing
- Neural Network Model
- Alternative Probabilistic Models
  - Inference Network
  - Belief Network
3. Introduction
- IR systems usually adopt index terms to process queries
- Index term
  - a keyword or group of selected words
  - any word (more general)
- Stemming might be used
  - connecting, connection, connections → connect
- An inverted file is built for the chosen index terms
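The inverted-file idea above can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' implementation: the documents, the `crude_stem` suffix-stripper (a stand-in for a real stemmer such as Porter's), and the function names are all invented for this example.

```python
from collections import defaultdict

def crude_stem(word):
    # Toy suffix stripping for the connect/connecting/connection example;
    # a real system would use a proper stemmer (e.g., Porter).
    for suffix in ("ions", "ion", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_inverted_file(docs):
    # Map each (stemmed) index term to the set of doc ids containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[crude_stem(word)].add(doc_id)
    return index

docs = ["connecting networks", "a network connection", "unrelated text"]
index = build_inverted_file(docs)
print(sorted(index["connect"]))  # [0, 1]: both docs map to the stem "connect"
```

At query time, the system looks query terms up in this index instead of scanning documents, which is what makes index-term matching fast.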
4. Introduction
- [Diagram: docs are indexed into index terms; the user's information need is expressed as a query; matching the query against the index terms produces a ranking of documents]
5. Introduction
- Matching at the index term level is quite imprecise
- No surprise that users frequently get unsatisfactory results
- Since most users have no training in query formation, the problem is even worse
  - Frequent dissatisfaction of Web users
- The issue of deciding relevance is critical for IR systems: ranking
6. Introduction
- A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the query
- A ranking is based on fundamental premises regarding the notion of relevance, such as
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model
7. IR Models
- User Task
  - Retrieval
    - Ad hoc
    - Filtering
  - Browsing
8. IR Models
9. Classic IR Models - Basic Concepts
- Each document is represented by a set of representative keywords or index terms
- An index term is a document word useful for remembering the document's main themes
- Usually, index terms are nouns because nouns have meaning by themselves
- However, search engines assume that all words are index terms (full-text representation)
10. Classic IR Models - Basic Concepts
- The importance of the index terms is represented by weights associated with them
- ki: an index term
- dj: a document
- wij: a weight associated with the pair (ki, dj)
- The weight wij quantifies the importance of the index term for describing the document's contents
11. Classic IR Models - Basic Concepts
- t is the total number of index terms
- K = {k1, k2, ..., kt} is the set of all index terms
- wij > 0 is a weight associated with (ki, dj)
- wij = 0 indicates that the term does not belong to the doc
- dj = (w1j, w2j, ..., wtj) is a weighted vector associated with the document dj
- gi(dj) = wij is a function which returns the weight associated with the pair (ki, dj)
12. The Boolean Model
- Simple model based on set theory
- Queries specified as Boolean expressions
  - precise semantics and neat formalism
- Terms are either present or absent; thus wij ∈ {0, 1}
- Consider
  - q = ka ∧ (kb ∨ ¬kc)
  - qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  - qcc = (1,1,0) is a conjunctive component
13. The Boolean Model
- q = ka ∧ (kb ∨ ¬kc)
- sim(q,dj) =
  - 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc))
  - 0 otherwise
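The Boolean similarity for the running query q = ka ∧ (kb ∨ ¬kc) can be sketched as follows. This is an illustrative sketch: a document is reduced to its binary weight vector (ga, gb, gc), and sim is 1 exactly when that vector equals one of the query's DNF conjunctive components.

```python
# DNF of q = ka AND (kb OR NOT kc), as on the slide:
# the three conjunctive components over (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_weights):
    # 1 if the document's binary term vector matches some conjunctive
    # component of the query's DNF, 0 otherwise.
    return 1 if tuple(doc_weights) in Q_DNF else 0

print(sim((1, 1, 0)))  # 1: matches the conjunctive component (1,1,0)
print(sim((0, 1, 1)))  # 0: ka is absent, so no component matches
```

Note the all-or-nothing outcome: a document containing ka and kb but also kc still matches (component (1,1,1)), while one missing only ka scores 0, with no partial credit.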
14. Drawbacks of the Boolean Model
- Retrieval based on a binary decision criterion with no notion of partial matching
- No ranking of the documents is provided
- The information need has to be translated into a Boolean expression
- The Boolean queries formulated by users are most often too simplistic
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
15. The Vector Model
- Use of binary weights is too limiting
- Non-binary weights provide consideration for partial matches
- These term weights are used to compute a degree of similarity between a query and each document
- A ranked set of documents provides for better matching
16. The Vector Model
- wij > 0 whenever ki appears in dj
- wiq > 0 is associated with the pair (ki, q)
- dj = (w1j, w2j, ..., wtj)
- q = (w1q, w2q, ..., wtq)
- To each term ki is associated a unit vector i
- The unit vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
- The t unit vectors i form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors
17. The Vector Model
- [Diagram: document vector dj and query vector q in term space, separated by angle θ]
- sim(q,dj) = cos(θ) = (dj • q) / (|dj| × |q|) = Σi wij × wiq / (|dj| × |q|)
- Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1
- A document is retrieved even if it matches the query terms only partially
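The cosine formula can be sketched directly from its definition. A minimal illustration; the example weight vectors are invented for this sketch.

```python
import math

def cosine_sim(d, q):
    # cos(theta) = (d . q) / (|d| * |q|); 0.0 if either vector is all zeros.
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

d = [0.5, 0.8, 0.3]   # document term weights (assumed data)
q = [0.0, 1.0, 0.5]   # query term weights (assumed data)
print(round(cosine_sim(d, q), 3))  # ≈ 0.858
```

Because all weights are non-negative, the result always lands in [0, 1], and a document sharing only some query terms (here, two of three) still gets a positive score, which is exactly the partial matching the slide describes.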
18. Weights wij and wiq?
- One approach is to examine the frequency of occurrence of a word in a document
- Absolute frequency
  - tf factor, the term frequency within a document
  - freq(i,j): raw frequency of ki within dj
  - Both high-frequency and low-frequency terms may not actually be significant
- Relative frequency: tf divided by the number of words in the document
- Normalized frequency
  - fi,j = freq(i,j) / maxl freq(l,j)
19. Inverse Document Frequency
- The importance of a term may depend more on how well it can distinguish between documents
- Quantification of inter-document separation
- Dissimilarity, not similarity
- idf factor, the inverse document frequency
20. IDF
- N: the total number of docs in the collection
- ni: the number of docs which contain ki
- The idf factor is computed as
  - idfi = log(N/ni)
  - The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
- IDF Example
  - N = 1000, n1 = 100, n2 = 500, n3 = 800
  - idf1 = 3 - 2 = 1
  - idf2 = 3 - 2.7 = 0.3
  - idf3 = 3 - 2.9 = 0.1
21. The Vector Model
- The best term-weighting schemes take both factors into account
  - wij = fi,j × log(N/ni)
- This strategy is called a tf-idf weighting scheme
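Putting the two factors together, a tf-idf weight can be computed as a one-liner. This sketch uses base-10 logarithms, matching the slides' IDF example (N = 1000, n1 = 100 giving idf1 = 1); the frequency values passed in below are invented for illustration.

```python
import math

def tf_idf(freq_ij, max_freq_j, N, n_i):
    # w_ij = f_ij * log(N / n_i), where f_ij is the max-normalized tf.
    f_ij = freq_ij / max_freq_j
    return f_ij * math.log10(N / n_i)

# Term 1 from the slides' example: N = 1000 docs, appears in n1 = 100 of them.
# If it is also the most frequent term in dj (f_ij = 1), its weight is the idf:
print(tf_idf(freq_ij=4, max_freq_j=4, N=1000, n_i=100))  # 1.0
```

A rare term (small ni) gets a large idf boost, while a term occurring in most documents contributes little regardless of its tf, which is the distinguishing behavior slide 19 asks for.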
22. The Vector Model
- For the query term weights, a suggestion is
  - wiq = (0.5 + 0.5 × freq(i,q) / maxl freq(l,q)) × log(N/ni)
- The vector model with tf-idf weights is a good ranking strategy for general collections
- The vector model is usually as good as any known ranking alternative
- It is also simple and fast to compute
23. The Vector Model
- Advantages
  - term weighting improves the quality of the answer set
  - partial matching allows retrieval of docs that approximate the query conditions
  - the cosine ranking formula sorts documents according to their degree of similarity to the query
- Disadvantages
  - assumes independence of index terms; not clear that this is bad, though
24. The Vector Model: Example I
25. The Vector Model: Example II
26. The Vector Model: Example III
27. Probabilistic Model
- Objective: to capture the IR problem using a probabilistic framework
- Given a user query, there is an ideal answer set
- Querying as specification of the properties of this ideal answer set (clustering)
- But what are these properties?
  - Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
  - Improve by iteration
28. Probabilistic Model
- An initial set of documents is retrieved somehow
- The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
- The IR system uses this information to refine the description of the ideal answer set
- By repeating this process, it is expected that the description of the ideal answer set will improve
- Keep in mind the need to guess the description of the ideal answer set at the very beginning
- The description of the ideal answer set is modeled in probabilistic terms
29. Probabilistic Ranking Principle
- Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
- But,
  - how to compute the probabilities?
  - what is the sample space?
30. The Ranking
- Probabilistic ranking computed as
  - sim(q,dj) = P(dj relevant to q) / P(dj non-relevant to q)
- This is the odds of the document dj being relevant
- Taking the odds minimizes the probability of an erroneous judgement
- Definitions
  - wij ∈ {0, 1}
  - P(R | dj): probability that the given doc is relevant
  - P(¬R | dj): probability that the doc is not relevant
31. The Ranking
- sim(dj,q) = P(R | dj) / P(¬R | dj)
            = [P(dj | R) × P(R)] / [P(dj | ¬R) × P(¬R)]   (by Bayes' rule)
            ~ P(dj | R) / P(dj | ¬R)   (P(R) and P(¬R) are the same for all docs, so they can be dropped for ranking)
- P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents
32. The Ranking
- Assuming independence of index terms,
  sim(dj,q) ~ [ Π(ki∈dj) P(ki | R) × Π(ki∉dj) P(¬ki | R) ] / [ Π(ki∈dj) P(ki | ¬R) × Π(ki∉dj) P(¬ki | ¬R) ]
- P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents
33. The Ranking
- sim(dj,q)
  ~ log { [ Π P(ki | R) × Π P(¬ki | R) ] / [ Π P(ki | ¬R) × Π P(¬ki | ¬R) ] }
  ~ K + Σ [ log ( P(ki | R) / P(¬ki | R) ) + log ( P(¬ki | ¬R) / P(ki | ¬R) ) ]
- where
  - P(¬ki | R) = 1 - P(ki | R)
  - P(¬ki | ¬R) = 1 - P(ki | ¬R)
34. The Initial Ranking
- sim(dj,q) ~ Σ wiq × wij × ( log [ P(ki | R) / (1 - P(ki | R)) ] + log [ (1 - P(ki | ¬R)) / P(ki | ¬R) ] )
- Probabilities P(ki | R) and P(ki | ¬R)?
- Estimates based on assumptions
  - P(ki | R) = 0.5
  - P(ki | ¬R) = ni / N
- Use this initial guess to retrieve an initial ranking
- Improve upon this initial ranking
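The initial ranking formula can be sketched as follows, under the slides' starting assumptions P(ki | R) = 0.5 and P(ki | ¬R) = ni/N. This is an illustrative sketch with binary weights (wij = wiq = 1 for terms present); the document frequencies in the example are invented.

```python
import math

def initial_sim(doc_terms, query_terms, n, N):
    # Sum, over terms present in both query and document, the log-odds
    # weights log[p_r/(1-p_r)] + log[(1-p_nr)/p_nr].
    score = 0.0
    for ki in query_terms & doc_terms:
        p_r = 0.5           # initial guess for P(ki | R)
        p_nr = n[ki] / N    # P(ki | ~R) approximated by document frequency
        score += math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)
    return score

n = {"model": 100, "fuzzy": 10}   # doc frequencies ni (assumed data)
score = initial_sim({"model", "fuzzy"}, {"fuzzy"}, n, N=1000)
print(round(score, 3))  # log(1) + log(0.99/0.01) ≈ 4.595
```

With P(ki | R) fixed at 0.5, the first term vanishes and the initial ranking is driven entirely by the idf-like second term, so rarer matching terms score higher.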
35. Improving the Initial Ranking
- Let
  - V: set of docs initially retrieved
  - Vi: subset of docs retrieved that contain ki
- Re-evaluate the estimates
  - P(ki | R) = |Vi| / |V|
  - P(ki | ¬R) = (ni - |Vi|) / (N - |V|)
- Repeat recursively
36. Improving the Initial Ranking
- To avoid problems with |V| = 1 and |Vi| = 0
  - P(ki | R) = (|Vi| + 0.5) / (|V| + 1)
  - P(ki | ¬R) = (ni - |Vi| + 0.5) / (N - |V| + 1)
- Alternatively,
  - P(ki | R) = (|Vi| + ni/N) / (|V| + 1)
  - P(ki | ¬R) = (ni - |Vi| + ni/N) / (N - |V| + 1)
37. Pluses and Minuses
- Advantages
  - docs are ranked in decreasing order of their probability of relevance
- Disadvantages
  - need to guess the initial estimates for P(ki | R)
  - method does not take into account tf and idf factors
38. Brief Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered to be the weakest classic model
- Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
- This also seems to be the view of the research community
39. Set Theoretic Models
- The Boolean model imposes a binary criterion for deciding relevance
- The question of how to extend the Boolean model to accommodate partial matching and a ranking has attracted considerable attention in the past
- We now discuss two set theoretic models
  - Fuzzy Set Model
  - Extended Boolean Model
40. Fuzzy Set Model
- The vagueness of document/query matching can be modeled using a fuzzy framework, as follows
  - with each term is associated a fuzzy set
  - each doc has a degree of membership in this fuzzy set
- Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
41. Fuzzy Set Theory
- A fuzzy subset A of U is characterized by a membership function μ(A,u): U → [0,1] which associates with each element u of U a number μ(A,u) in the interval [0,1]
- Definition
  - Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of A. Then,
    - μ(¬A,u) = 1 - μ(A,u)
    - μ(A∪B,u) = max(μ(A,u), μ(B,u))
    - μ(A∩B,u) = min(μ(A,u), μ(B,u))
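The three fuzzy-set operations are one-liners on membership degrees. A minimal sketch; the membership values are invented for the example.

```python
def f_not(mu):
    # Complement: mu(~A, u) = 1 - mu(A, u)
    return 1.0 - mu

def f_or(mu_a, mu_b):
    # Union: mu(A u B, u) = max of the memberships
    return max(mu_a, mu_b)

def f_and(mu_a, mu_b):
    # Intersection: mu(A n B, u) = min of the memberships
    return min(mu_a, mu_b)

mu_a, mu_b = 0.75, 0.5          # assumed membership degrees
print(f_not(mu_a))              # 0.25
print(f_or(mu_a, mu_b))         # 0.75
print(f_and(mu_a, mu_b))        # 0.5
```

With crisp memberships (0 or 1) these reduce to ordinary Boolean NOT/OR/AND, which is why the fuzzy model is a strict generalization of the Boolean one.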
42. Fuzzy Information Retrieval
- Fuzzy sets are modeled based on a thesaurus
- This thesaurus is built as follows
  - Let c be a term-term correlation matrix
  - Let ci,l be a normalized correlation factor for (ki, kl):
    ci,l = ni,l / (ni + nl - ni,l)
  - ni: number of docs which contain ki
  - nl: number of docs which contain kl
  - ni,l: number of docs which contain both ki and kl
- We now have a notion of proximity among index terms
43. Fuzzy Information Retrieval
- The correlation factor ci,l can be used to define fuzzy set membership for a document dj as follows:
  μi,j = 1 - Π(kl ∈ dj) (1 - ci,l)
- μi,j: membership of doc dj in the fuzzy subset associated with ki
- The above expression computes an algebraic sum over all terms in the doc dj
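Both the correlation factor and the membership computation are short enough to sketch. An illustrative sketch; the counts and correlation values below are invented.

```python
def correlation(n_i, n_l, n_il):
    # c_il = n_il / (n_i + n_l - n_il): Jaccard-style normalized co-occurrence.
    return n_il / (n_i + n_l - n_il)

def membership(correlations):
    # mu_ij = 1 - prod over terms kl in dj of (1 - c_il):
    # the algebraic sum, computed as the complement of a negated product.
    prod = 1.0
    for c in correlations:
        prod *= (1.0 - c)
    return 1.0 - prod

print(correlation(n_i=10, n_l=10, n_il=10))       # 1.0: terms always co-occur
print(round(membership([0.9, 0.2]), 3))           # 1 - 0.1*0.8 = 0.92
```

As the slide notes, one term in dj that is strongly correlated with ki (ci,l near 1) is enough to pull the membership μi,j near 1, regardless of the other terms.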
44. Fuzzy Information Retrieval
- A doc dj belongs to the fuzzy set for ki if its own terms are associated with ki
- If doc dj contains a term kl which is closely related to ki, we have
  - ci,l ≈ 1
  - μi,j ≈ 1
  - the index ki is a good fuzzy index for the doc
45. Fuzzy IR: An Example
- q = ka ∧ (kb ∨ ¬kc)
- qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) = cc1 ∨ cc2 ∨ cc3
- μq,dj = μ(cc1 ∨ cc2 ∨ cc3),j
        = 1 - (1 - μa,j × μb,j × μc,j) × (1 - μa,j × μb,j × (1 - μc,j)) × (1 - μa,j × (1 - μb,j) × (1 - μc,j))
46. Fuzzy Information Retrieval
- Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
- Experiments with standard test collections are not available
- Difficult to compare at this time
47. Extended Boolean Model
- The Boolean model is simple and elegant
- But it makes no provision for a ranking
- As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
- Extend the Boolean model with the notions of partial matching and term weighting
- Combine characteristics of the vector model with properties of Boolean algebra
48. The Idea
- The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
- Let
  - q = kx ∧ ky
  - wxj = fxj × idfx / maxi idfi, the weight associated with (kx, dj)
  - Further, let wxj = x and wyj = y
49The Idea
qand kx ? ky wxj x and wyj y
AND
50The Idea
qor kx ? ky wxj x and wyj y
(1,1)
OR
51. Generalizing the Idea
- We can extend the previous model to consider Euclidean distances in a t-dimensional space
- This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter
52. Generalizing the Idea
- A generalized disjunctive query is given by
  - qor = k1 ∨p k2 ∨p ... ∨p kt
- A generalized conjunctive query is given by
  - qand = k1 ∧p k2 ∧p ... ∧p kt
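The standard p-norm similarities for these generalized queries (from Salton, Fox, and Wu) can be sketched as follows; the weight vectors in the example are invented for illustration.

```python
def sim_or(xs, p):
    # sim(q_or, d) = ((x1^p + ... + xt^p) / t)^(1/p)
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    # sim(q_and, d) = 1 - (((1-x1)^p + ... + (1-xt)^p) / t)^(1/p)
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

xs = [1.0, 0.0]               # one query term matched perfectly, one absent
print(round(sim_or(xs, 2), 4))   # 0.7071: partial credit, unlike Boolean OR
print(round(sim_and(xs, 2), 4))  # 0.2929: penalized, but not zeroed out
```

Varying p interpolates between models: as p grows the formulas approach Boolean max/min behavior, while p = 1 makes both reduce to the same average of weights, i.e., vector-like ranking.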
53. Properties
54. Properties
- This is quite powerful and is a good argument in favor of the extended Boolean model
- Operators can be nested in a single query; for instance, k1 and k2 can contribute as in vector retrieval while k3 enters through the outer operator
- For q = (k1 ∧ k2) ∨ k3 with p = 2:
  - sim(q,dj) = ( [ (1 - √( ((1-x1)² + (1-x2)²) / 2 ))² + x3² ] / 2 )^(1/2)
55. Conclusions
- The model is quite powerful
- Its properties are interesting and might be useful
- Computation is somewhat complex
- However, distributivity does not hold for the ranking computation
  - q1 = (k1 ∨ k2) ∧ k3
  - q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)
  - sim(q1,dj) ≠ sim(q2,dj)
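The failure of distributivity can be checked numerically with the standard p-norm similarities (as in Salton, Fox, and Wu); the document weights x1, x2, x3 below are invented for the check.

```python
def sim_or(xs, p):
    # p-norm OR: ((x1^p + ... + xt^p) / t)^(1/p)
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    # p-norm AND: 1 - (((1-x1)^p + ... + (1-xt)^p) / t)^(1/p)
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

x1, x2, x3, p = 1.0, 0.0, 1.0, 2
s1 = sim_and([sim_or([x1, x2], p), x3], p)                       # (k1 OR k2) AND k3
s2 = sim_or([sim_and([x1, x3], p), sim_and([x2, x3], p)], p)     # (k1 AND k3) OR (k2 AND k3)
print(round(s1, 4), round(s2, 4))  # 0.7929 vs 0.7368: the two rankings differ
```

So two queries that are logically equivalent in Boolean algebra can rank the same document differently, which is the caveat the slide warns about.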