1
Modeling (Chap. 2)
  • Modern Information Retrieval
  • Spring 2000

2
Introduction
  • Traditional IR systems adopt index terms to index and retrieve documents
  • An index term is simply any word that appears in the text of a document
  • Retrieval based on index terms is simple
  • the premise is that the semantics of documents and of the user information need can be expressed through sets of index terms

3
  • Key Question
  • semantics of the document (and of the user request) is lost when the text is replaced with a set of words
  • matching between documents and the user request is done in the very imprecise space of index terms (low quality retrieval)
  • the problem is worsened for users with no training in properly forming queries (a frequent cause of dissatisfaction of Web users with the answers obtained)

4
Taxonomy of IR Models
  • Three classic models
  • Boolean
  • documents and queries represented as sets of
    index terms
  • Vector
  • documents and queries represented as vectors in
    t-dimensional space
  • Probabilistic
  • document and query representations based on
    probability theory

5
Basic Concepts
  • Classic models consider that each document is described by index terms
  • An index term is a (document) word that helps in remembering the document's main themes
  • index terms are used to index and summarize document content
  • in general, index terms are nouns (because nouns have meaning by themselves)
  • index terms may also be taken to be all the distinct words in a document collection

6
  • Distinct index terms have varying relevance when
    describing document contents
  • Thus numerical weights assigned to each index
    term of a document
  • Let ki be an index term, dj a document, and wi,j ≥ 0 a weight for the pair (ki, dj)
  • Weight quantifies importance of index term for
    describing document semantic contents

7
Definition (p. 25)
  • Let t be the no. of index terms in the system and ki a generic index term.
  • K = {k1, ..., kt} is the set of all index terms.
  • A weight wi,j > 0 is associated with each index term ki of document dj.
  • For an index term that does not appear in the document text, wi,j = 0.
  • Document dj is associated with an index term vector dj represented by dj = (w1,j, w2,j, ..., wt,j)
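
A minimal sketch (Python) of this representation, using a toy vocabulary and hypothetical weights: each document becomes a vector of weights over the full set of index terms K, with weight 0 for terms that do not appear in the document.

```python
# Toy vocabulary K = {k1, ..., kt}; names and weights are hypothetical.
K = ["information", "retrieval", "model", "boolean", "vector"]

def term_vector(doc_weights, vocabulary):
    """Return (w1,j, ..., wt,j); terms absent from the document get weight 0."""
    return tuple(doc_weights.get(term, 0.0) for term in vocabulary)

# Hypothetical non-zero weights for one document dj
dj_weights = {"information": 0.8, "retrieval": 0.6, "model": 0.3}
print(term_vector(dj_weights, K))   # (0.8, 0.6, 0.3, 0.0, 0.0)
```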

8
Boolean Model
  • Simple retrieval model based on set theory and
    Boolean algebra
  • framework is easy to grasp by users (concept of
    set is intuitive)
  • Queries specified as Boolean expressions which
    have precise semantics

9
Drawbacks
  • Retrieval strategy is binary decision (document
    is relevant/non-relevant)
  • prevents good retrieval performance
  • not simple to translate information need into
    Boolean expression (difficult and awkward to
    express)
  • despite these drawbacks, it is the dominant model with commercial DB systems

10
Boolean Model (Cont.)
  • Considers that index terms are present or absent
    in document
  • index term weights are binary, i.e. wi,j ∈ {0, 1}
  • query q is composed of index terms linked by not, and, or
  • a query is a Boolean expression which can be represented in disjunctive normal form (DNF)

11
Boolean Model (Cont.)
  • Query q = ka ∧ (kb ∨ ¬kc) can be written in DNF as qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  • each component is a binary weighted vector associated with the tuple (ka, kb, kc)
  • the binary weighted vectors are called the conjunctive components of qdnf

12
Boolean Model (cont.)
  • Index term weight variables are all binary, i.e. wi,j ∈ {0, 1}
  • query q is a Boolean expression
  • Let qdnf be the DNF for query q
  • Let qcc be any of the conjunctive components of qdnf
  • Similarity of document dj to query q is
  • sim(dj,q) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ ki, gi(dj) = gi(qcc)), where gi(dj) = wi,j
  • sim(dj,q) = 0 otherwise
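
A minimal sketch of this matching rule, reusing the slide 11 example q = ka ∧ (kb ∨ ¬kc); the document term sets and the helper name are assumptions for illustration.

```python
# Conjunctive components of qdnf over the tuple (ka, kb, kc), from slide 11.
QUERY_TERMS = ("ka", "kb", "kc")
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def boolean_sim(document_terms, q_dnf=Q_DNF, terms=QUERY_TERMS):
    """Return 1 if the document's binary pattern over the query terms
    equals some conjunctive component of the DNF, else 0."""
    pattern = tuple(1 if t in document_terms else 0 for t in terms)
    return 1 if pattern in q_dnf else 0

print(boolean_sim({"ka", "kb"}))   # 1 -> matches component (1, 1, 0)
print(boolean_sim({"kb", "kc"}))   # 0 -> ka missing, no component matches
```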

13
Boolean Model (Cont.)
  • If sim(dj,q) = 1 then the Boolean model predicts that document dj is relevant to query q (it might not be)
  • Otherwise, prediction is that document is not
    relevant
  • Boolean model predicts that each document is
    either relevant or non-relevant
  • no notion of partial match

14
  • Main advantages
  • clean formalism
  • simplicity
  • Main disadvantages
  • exact matching may lead to retrieval of too few or too many documents
  • no index term weighting, although term weighting can lead to improvements in retrieval performance

15
Vector Model
  • Assign non-binary weights to index terms in
    queries and documents
  • term weights used to compute degree of similarity
    between document and user query
  • by sorting retrieved documents in decreasing order of degree of similarity, the vector model also considers documents which match the query only partially
  • the ranked document answer set is a lot more precise (than the answer set retrieved by the Boolean model)

16
Vector Model (Cont.)
  • Weight wi,j for pair (ki, dj) is positive and
    non-binary
  • index terms in query are also weighted
  • Let wi,q be the weight associated with the pair (ki, q), where wi,q ≥ 0
  • the query vector q is defined as q = (w1,q, w2,q, ..., wt,q), where t is the total no. of index terms in the system
  • the vector for document dj is represented by dj = (w1,j, w2,j, ..., wt,j)

17
Vector Model (Cont.)
  • Document dj and user query q represented as
    t-dimensional vectors.
  • evaluate the degree of similarity of dj with regard to q as the correlation between the vectors dj and q
  • this correlation can be quantified by the cosine of the angle between the two vectors
  • sim(dj,q) = (dj · q) / (|dj| × |q|)
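
A minimal sketch of the cosine correlation above, with toy weight vectors assumed for illustration:

```python
import math

def cosine_sim(d, q):
    """sim(dj, q) = (dj . q) / (|dj| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

d_j = [0.8, 0.6, 0.0, 0.3]   # document weights wi,j (toy values)
q   = [1.0, 0.0, 0.0, 0.5]   # query weights wi,q (toy values)
print(round(cosine_sim(d_j, q), 3))   # a score between 0 and 1, used for ranking
```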

18
Vector Model (Cont.)
  • sim(dj,q) varies from 0 to 1.
  • Ranks documents according to degree of similarity
    to query
  • document may be retrieved even if it partially
    matches query
  • establish a threshold on sim(dj,q) and retrieve
    documents with degree of similarity above
    threshold

19
Index term weights
  • The documents in a collection C can be seen as a collection of objects
  • The user query is a (vague) specification of a set A of objects
  • The IR problem is to determine which documents are in set A and which are not (i.e. a clustering problem)
  • In the clustering problem
  • intra-cluster similarity (features which better describe the objects in set A)
  • inter-cluster dissimilarity (features which better distinguish the objects in set A from the remaining objects in collection C)

20
  • In vector model, intra-cluster similarity
    quantified by measuring raw frequency of term ki
    inside document dj (tf factor)
  • how well term describes document contents
  • inter-cluster dissimilarity quantified by
    measuring inverse of frequency of term ki among
    documents in collection (idf factor)
  • terms which appear in many documents are not very
    useful for distinguishing relevant document from
    non-relevant one

21
Definition (p. 29)
  • Let N be total no. of documents in system
  • let ni be number of documents in which index term
    ki appears
  • let freqi,j be raw frequency of term ki in
    document dj
  • no. of times term ki mentioned in text of
    document dj
  • Normalized frequency fi,j of term ki in document dj is
  • fi,j = freqi,j / maxl freql,j

22
  • Maximum computed over all terms mentioned in text
    of document dj
  • if term ki does not appear in document dj, then fi,j = 0
  • let idfi, the inverse document frequency for ki, be
  • idfi = log(N / ni)
  • the best known term weighting scheme is
  • wi,j = fi,j × log(N / ni)
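
A minimal sketch of this tf-idf weighting over a toy three-document collection (assumed for illustration):

```python
import math
from collections import Counter

docs = [
    "information retrieval model",
    "boolean model boolean algebra",
    "vector model information",
]

N = len(docs)                                   # total number of documents
doc_terms = [Counter(d.split()) for d in docs]  # raw frequencies freq i,j
n = Counter()                                   # ni: documents containing ki
for terms in doc_terms:
    n.update(terms.keys())

def tf_idf(term, j):
    """wi,j = fi,j * log(N / ni), with fi,j = freq i,j / max_l freq l,j."""
    freqs = doc_terms[j]
    if term not in freqs:
        return 0.0
    f_ij = freqs[term] / max(freqs.values())
    return f_ij * math.log(N / n[term])

print(round(tf_idf("boolean", 1), 3))   # frequent in d1 and rare in the collection
print(round(tf_idf("model", 1), 3))     # appears in every document -> idf = 0
```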

23
  • Advantages of vector model
  • term weighting scheme improves retrieval
    performance
  • retrieve documents that approximate query
    conditions
  • sorts documents according to degree of similarity
    to query
  • Disadvantage
  • index terms are assumed to be mutually independent

24
Probabilistic Model
  • Given a user query, there is a set of documents which contains exactly the relevant documents and no others
  • this is the ideal answer set
  • given description of ideal answer set, no problem
    in retrieving its documents
  • querying process is process of specifying
    properties of ideal answer set
  • the properties are not exactly known
  • there are index terms whose semantics are used to
    characterize these properties

25
Probabilistic Model (Cont.)
  • These properties not known at query time
  • an effort has to be made to initially guess what they (i.e. the properties) are
  • this initial guess generates a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents
  • user interaction initiated to improve
    probabilistic description of ideal answer set

26
  • User examines the retrieved documents and decides which ones are relevant
  • this information is used to refine the description of the ideal answer set
  • by repeating this process, the description will evolve and come closer to the ideal answer set

27
Fundamental Assumption
  • Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find document dj relevant
  • assumes that probability of relevance depends on
    query and document representations only
  • assumes that there is subset of all documents
    which user prefers as answer set for query q
  • such ideal answer set is labeled R
  • documents in set R are predicted to be relevant
    to query

28
  • Given query q, probabilistic model assigns to
    each document dj the ratio P(dj relevant-to
    q)/P(dj non-relevant-to q)
  • measure of similarity to query
  • odds of document dj being relevant to query q

29
  • Index term weight variables are all binary, i.e. wi,j ∈ {0, 1}, wi,q ∈ {0, 1}
  • query q is a subset of index terms
  • let R be the set of documents known (initially guessed) to be relevant
  • let R̄ be the complement of R
  • let P(R | dj) be the probability that document dj is relevant to query q
  • let P(R̄ | dj) be the probability that document dj is not relevant to query q.

30
  • Similarity sim(dj,q) of document dj to query q is defined as the ratio
  • sim(dj,q) = P(R | dj) / P(R̄ | dj)
  • applying Bayes' rule, sim(dj,q) ~ P(dj | R) / P(dj | R̄), since P(R) and P(R̄) are the same for all documents
  • assuming independence of index terms and taking logarithms, sim(dj,q) ~ Σi wi,q × wi,j × ( log( P(ki|R) / (1 - P(ki|R)) ) + log( (1 - P(ki|R̄)) / P(ki|R̄) ) )
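
A minimal sketch of this ranking formula, assuming the estimates P(ki|R) and P(ki|R̄) are already available (toy values):

```python
import math

def bir_sim(d_weights, q_weights, p, u):
    """sim(dj,q) ~ sum over i of wi,q * wi,j *
    ( log(p_i / (1 - p_i)) + log((1 - u_i) / u_i) ),
    where p_i = P(ki|R), u_i = P(ki|R-bar), and the weights are binary."""
    score = 0.0
    for w_ij, w_iq, p_i, u_i in zip(d_weights, q_weights, p, u):
        if w_ij and w_iq:
            score += math.log(p_i / (1 - p_i)) + math.log((1 - u_i) / u_i)
    return score

d_j = [1, 1, 0, 1]            # binary document weights wi,j
q   = [1, 0, 1, 1]            # binary query weights wi,q
p   = [0.5, 0.5, 0.5, 0.5]    # initial P(ki|R) = 0.5 (next slide)
u   = [0.2, 0.6, 0.1, 0.05]   # initial P(ki|R-bar) = ni / N (toy values)
print(round(bir_sim(d_j, q, p, u), 3))
```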

31
  • How to compute P(ki|R) and P(ki|R̄) initially?
  • assume P(ki|R) is constant for all index terms ki (typically 0.5)
  • P(ki|R) = 0.5
  • assume the distribution of index terms among non-relevant documents is approximated by the distribution of index terms among all documents in the collection
  • P(ki|R̄) = ni / N, where ni is the no. of documents containing index term ki and N is the total no. of documents

32
  • Let V be subset of documents initially retrieved
    and ranked by model
  • let Vi be subset of V composed of documents in V
    with index term ki
  • P(ki|R) is approximated by the distribution of index term ki among the documents retrieved
  • P(ki|R) = Vi / V
  • P(ki|R̄) is approximated by considering that all non-retrieved documents are not relevant
  • P(ki|R̄) = (ni - Vi) / (N - V)
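
A minimal sketch of this refinement step, with toy counts assumed for illustration (V and Vi are taken here as document counts):

```python
def refine_estimates(V, Vi, N, n_i):
    """Return updated (P(ki|R), P(ki|R-bar)) from the retrieved subset."""
    p_i = Vi / V                 # P(ki|R)     ~ Vi / V
    u_i = (n_i - Vi) / (N - V)   # P(ki|R-bar) ~ (ni - Vi) / (N - V)
    return p_i, u_i

# e.g. N = 100 documents, ni = 20 contain ki; V = 10 retrieved, Vi = 6 contain ki
print(refine_estimates(V=10, Vi=6, N=100, n_i=20))   # (0.6, 0.1555...)
```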

33
  • Advantages
  • documents ranked in decreasing order of their
    probability of being relevant
  • Disadvantages
  • need to guess initial separation of relevant and
    non-relevant sets
  • all index term weights are binary
  • index terms are assumed to be mutually independent