Information Retrieval and Organisation - PowerPoint PPT Presentation

Provided by: Christoph669

Transcript and Presenter's Notes

1
Information Retrieval and Organisation

Chapter 11
Probabilistic Information Retrieval

Dell Zhang Birkbeck, University of London
2
Why Probabilities in IR?
[Diagram: the user information need becomes a query representation
(understanding of the user need is uncertain); the documents become
document representations; the two are matched ("How to match?") by an
uncertain guess of whether a document has relevant content.]
In IR systems, matching between each document and
query is attempted in a semantically imprecise
space of index terms. Probabilities provide a
principled foundation for uncertain reasoning.
Can we use probabilities to quantify our
uncertainties?
3
Why Probabilities in IR?

Problems with vector space model
Ad-hoc term weighting schemes
Ad-hoc basis vectors
Ad-hoc similarity measurement
We need something more principled!

4
Probability Ranking Principle

The document ranking method is the core of an IR
system
We have a collection of documents. The user
issues a query. A list of documents needs to be
returned.
In what order do we present documents to the
user? We want the best document to be first,
second best second, etc.

5
Probability Ranking Principle

If a reference retrieval system's response to
each request is a ranking of the documents in the
collection in order of decreasing probability of
relevance to the user who submitted the request,
where the probabilities are estimated as
accurately as possible on the basis of whatever
data have been made available to the system for
this purpose, the overall effectiveness of the
system to its user will be the best that is
obtainable on the basis of those data.

van Rijsbergen (1979: 113-114)
6
Probability Ranking Principle

Theorem. The PRP is optimal, in the sense that it
minimizes the expected loss (also known as the
Bayes risk) under 1/0 loss.
Provable if all probabilities are known correctly.
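
As a minimal sketch of what the PRP asks of a system, the snippet below orders documents by decreasing estimated probability of relevance. The document ids and probability values are made up for illustration; how the probabilities are estimated is a separate question.

```python
def rank_by_relevance(prob_relevant):
    """Order document ids by decreasing estimated probability of relevance."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)


# Hypothetical estimates of P(relevant | document, query) for three documents.
estimates = {"d1": 0.20, "d2": 0.85, "d3": 0.55}
ranking = rank_by_relevance(estimates)
print(ranking)
```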

7
Binary Independence Model

BIM is the model that has traditionally been used
in conjunction with PRP
Binary (= Boolean): documents and queries are
represented as binary incidence vectors of terms.
Independence: terms occur in documents and
queries independently.

BIM = Bernoulli Naive Bayes model
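
The binary representation can be sketched as follows. The vocabulary, the whitespace tokenisation, and the function name incidence_vector are assumptions made for illustration only; the point is that term frequency and word order are discarded.

```python
def incidence_vector(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary terms occur in the text."""
    tokens = set(text.lower().split())  # naive whitespace tokenisation (assumption)
    return [1 if term in tokens else 0 for term in vocabulary]


# Hypothetical toy vocabulary, document and query.
vocabulary = ["bm25", "okapi", "ranking", "retrieval"]
doc_vec = incidence_vector("Okapi ranking ranking ranking", vocabulary)
query_vec = incidence_vector("okapi retrieval", vocabulary)
print(doc_vec, query_vec)
```

Note that "ranking" occurs three times in the toy document but contributes a single 1, which is exactly the information the BIM throws away.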
8
Binary Independence Model

Use Bayes Rule
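
The slide's equation is not preserved in the transcript. A standard form of this step is sketched below, under assumed notation: \vec{x} is the binary term incidence vector of a document, \vec{q} is the query vector, and R is 1 for relevant and 0 for non-relevant.

```latex
P(R=1 \mid \vec{x}, \vec{q}) =
  \frac{P(\vec{x} \mid R=1, \vec{q}) \, P(R=1 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}
\qquad
P(R=0 \mid \vec{x}, \vec{q}) =
  \frac{P(\vec{x} \mid R=0, \vec{q}) \, P(R=0 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}
```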

9
Binary Independence Model

Rank documents by their odds

[Equation not preserved in the transcript; one of its factors is annotated: constant]
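
A standard form of the odds, sketched under assumed notation (\vec{x} is the document's binary term vector, \vec{q} the query, R is 1 for relevant and 0 for non-relevant), is given below. The first factor, the prior odds O(R | \vec{q}), does not depend on the document, so it is constant for a given query and can be ignored for ranking.

```latex
O(R \mid \vec{x}, \vec{q})
  = \frac{P(R=1 \mid \vec{x}, \vec{q})}{P(R=0 \mid \vec{x}, \vec{q})}
  = \underbrace{\frac{P(R=1 \mid \vec{q})}{P(R=0 \mid \vec{q})}}_{O(R \mid \vec{q})}
    \cdot \frac{P(\vec{x} \mid R=1, \vec{q})}{P(\vec{x} \mid R=0, \vec{q})}
```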
10
Binary Independence Model

Make the Naïve Bayes conditional independence
assumption
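
Under the conditional independence assumption, the likelihood ratio over the whole term vector factorises into a product over individual terms. A sketch in assumed notation, where x_t is 1 if term t occurs in the document and 0 otherwise, and M is the vocabulary size:

```latex
\frac{P(\vec{x} \mid R=1, \vec{q})}{P(\vec{x} \mid R=0, \vec{q})}
  = \prod_{t=1}^{M} \frac{P(x_t \mid R=1, \vec{q})}{P(x_t \mid R=0, \vec{q})}
```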

11
Binary Independence Model

12
Binary Independence Model

Assume that

[Equation not preserved in the transcript; its factors are annotated: constant, constant, useful]
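
The slide's equations are not preserved, so the following is a sketch of the standard derivation rather than a reproduction of the slide. Assumed notation: p_t = P(x_t=1 | R=1, \vec{q}) and u_t = P(x_t=1 | R=0, \vec{q}). The assumption taken here is that p_t = u_t for every term that does not occur in the query (q_t = 0), so those terms cancel. The prior odds and the last product are the same for every document (constant); only the middle product depends on the document (useful for ranking).

```latex
O(R \mid \vec{x}, \vec{q})
  = O(R \mid \vec{q})
    \prod_{t:\, x_t = 1} \frac{p_t}{u_t}
    \prod_{t:\, x_t = 0} \frac{1 - p_t}{1 - u_t}

% assuming p_t = u_t whenever q_t = 0:
O(R \mid \vec{x}, \vec{q})
  = O(R \mid \vec{q})
    \cdot \prod_{t:\, x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)}
    \cdot \prod_{t:\, q_t = 1} \frac{1 - p_t}{1 - u_t}
```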
13
Binary Independence Model

Taking the logarithm, which is a monotonic
function, we get the retrieval status value (RSV)

log odds ratio
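
A sketch of the retrieval status value in assumed notation, where p_t and u_t are the probabilities that term t occurs in a relevant and in a non-relevant document respectively, and the sum runs over terms present in both the document and the query. Each term weight c_t is a log odds ratio.

```latex
RSV_d
  = \log \prod_{t:\, x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)}
  = \sum_{t:\, x_t = q_t = 1} c_t,
\qquad
c_t = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}
```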
14
Binary Independence Model

Assume that relevant documents are a very small
percentage

IDF
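
The slide's equation is not preserved. In the standard argument, assumed here, a very small relevant set means the non-relevant documents are approximately the whole collection, so u_t is approximated by df_t / N, and log((1 - u_t) / u_t) = log((N - df_t) / df_t) is close to log(N / df_t), the familiar IDF. The sketch below compares the two numerically; the collection size and document frequencies are made up.

```python
import math


def exact_weight(num_docs, doc_freq):
    """log((1 - u_t) / u_t) with u_t = df_t / N, i.e. log((N - df_t) / df_t)."""
    return math.log((num_docs - doc_freq) / doc_freq)


def idf(num_docs, doc_freq):
    """The simpler approximation log(N / df_t)."""
    return math.log(num_docs / doc_freq)


num_docs = 1_000_000  # hypothetical collection size
for doc_freq in (10, 1_000, 100_000):
    print(doc_freq, exact_weight(num_docs, doc_freq), idf(num_docs, doc_freq))
```

The two values stay close as long as the term occurs in a small fraction of the collection, and drift apart for very common terms.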
15
Okapi BM25

Assume that
Factor in the term frequencies (tf) and document
length (Ld and Lave)

Ideally the parameters k1, b should be tuned on a
validation set.
In practice, good values are 1.2 ≤ k1 ≤ 2 and b = 0.75.
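
The slide's formula is not preserved in the transcript. The sketch below assumes a commonly cited form of BM25 without the query term frequency component: each query term contributes log(N / df_t) * ((k1 + 1) * tf) / (k1 * ((1 - b) + b * Ld / Lave) + tf). The function name bm25_score, the toy corpus, and the defaults k1 = 1.2, b = 0.75 are choices made for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Score one document against a query with the assumed BM25 form."""
    tf = Counter(doc_tokens)
    # Length normalisation: 1.0 for a document of average length.
    length_norm = (1 - b) + b * (len(doc_tokens) / avg_len)
    score = 0.0
    for term in set(query_terms):
        if tf[term] == 0 or doc_freq.get(term, 0) == 0:
            continue  # term absent from the document or the collection
        idf = math.log(num_docs / doc_freq[term])
        score += idf * ((k1 + 1) * tf[term]) / (k1 * length_norm + tf[term])
    return score


# Hypothetical toy collection of tokenised documents.
docs = [
    ["okapi", "bm25", "ranking"],
    ["okapi", "okapi", "okapi", "weather"],
    ["weather", "report"],
]
num_docs = len(docs)
avg_len = sum(len(d) for d in docs) / num_docs
doc_freq = Counter(term for d in docs for term in set(d))

query = ["okapi", "bm25"]
scores = [bm25_score(query, d, doc_freq, num_docs, avg_len) for d in docs]
print(scores)
```

In this toy run the second document repeats "okapi" three times, but the saturating tf component and the length penalty keep it below the first document, which matches both query terms.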
16
Appraisal

Getting reasonable approximations of
probabilities is possible, but requires
restrictive assumptions. In the BIM, these are:
a Boolean representation of documents/queries/relevance
term independence
terms not in the query don't affect the outcome
document relevance values are independent
Problem: these methods either require partial
relevance information or can only derive somewhat
inferior term weights.

17
Appraisal

Probabilistic methods are one of the oldest but
also one of the currently hottest topics in IR.
Traditionally, the ideas were neat, but they never
won on performance.
It may be different now. For example, the Okapi
BM25 term weighting formulas have been very
successful, especially in TREC evaluations.

18
Well-Known UK Researchers
Stephen Robertson
Keith van Rijsbergen
Karen Spärck Jones
