Information Retrieval and Organisation - PowerPoint PPT Presentation

About This Presentation
Title:

Information Retrieval and Organisation

Description:

Okapi BM25. Assume that. Factor in the term frequencies (tf) and document ... For example, the Okapi BM25 term weighting formulas have been very successful, ... – PowerPoint PPT presentation

Slides: 9
Provided by: Christoph669

Transcript and Presenter's Notes


1
Information Retrieval and Organisation
  • Chapter 11
  • Probabilistic Information Retrieval

Dell Zhang, Birkbeck, University of London
2
Why Probabilities in IR?
  • User information need, expressed as a query
    representation: understanding of the user need is
    uncertain
  • Documents, expressed as document representations:
    uncertain guess of whether a document has relevant
    content
  • How to match?

In IR systems, matching between each document and
query is attempted in a semantically imprecise
space of index terms. Probabilities provide a
principled foundation for uncertain
reasoning. Can we use probabilities to quantify
our uncertainties?
3
Why Probabilities in IR?
  • Problems with vector space model
  • Ad-hoc term weighting schemes
  • Ad-hoc basis vectors
  • Ad-hoc similarity measurement
  • We need something more principled!

4
Probability Ranking Principle
  • The document ranking method is the core of an IR
    system
  • We have a collection of documents. The user
    issues a query. A list of documents needs to be
    returned.
  • In what order do we present documents to the
    user? We want the best document to be first,
    second best second, etc.

5
Probability Ranking Principle
  • If a reference retrieval system's response to
    each request is a ranking of the documents in the
    collection in order of decreasing probability of
    relevance to the user who submitted the request,
    where the probabilities are estimated as
    accurately as possible on the basis of whatever
    data have been made available to the system for
    this purpose, the overall effectiveness of the
    system to its user will be the best that is
    obtainable on the basis of those data.

van Rijsbergen (1979: 113-114)
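
The ranking step that the principle describes can be sketched in a few lines. This is only an illustration: the function name `rank_by_relevance` and the probability estimates are hypothetical, and how the probabilities are estimated is exactly what the rest of the chapter addresses.

```python
def rank_by_relevance(prob_relevant):
    """Return document ids sorted by decreasing estimated P(R=1 | d, q)."""
    # Ties are broken by document id so that the ordering is deterministic.
    return sorted(prob_relevant, key=lambda d: (-prob_relevant[d], d))


# Hypothetical probability estimates for three documents and one query.
estimates = {"d1": 0.20, "d2": 0.85, "d3": 0.55}
ranking = rank_by_relevance(estimates)
print(ranking)
```

The document with the highest estimated probability of relevance is presented first, the second highest second, and so on.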
6
Probability Ranking Principle
  • Theorem. The PRP is optimal, in the sense that it
    minimizes the expected loss (also known as the
    Bayes risk) under 1/0 loss.
  • Provable if all probabilities are known correctly.
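
For the set-retrieval case, the decision rule that minimizes expected 1/0 loss (one point lost for each nonrelevant document returned and for each relevant document missed) is usually stated as below. The notation P(R=1 | d, q) for the probability that document d is relevant to query q is an assumption made here for illustration.

```latex
d \text{ is returned as relevant} \iff P(R=1 \mid d, q) > P(R=0 \mid d, q)
```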

7
Binary Independence Model
  • BIM is the model that has traditionally been used
    in conjunction with PRP
  • Binary (= Boolean): documents and queries are
    represented as binary incidence vectors of terms.
  • Independence: terms occur in documents and
    queries independently.

BIM = Bernoulli Naive Bayes model
8
Binary Independence Model
  • Use Bayes Rule
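
A sketch of this step in the notation commonly used for the BIM, which is assumed here: x is the binary term incidence vector of a document, q that of the query, and R = 1 (R = 0) denotes relevance (nonrelevance).

```latex
P(R=1 \mid \vec{x}, \vec{q}) = \frac{P(\vec{x} \mid R=1, \vec{q})\, P(R=1 \mid \vec{q})}{P(\vec{x} \mid \vec{q})},
\qquad
P(R=0 \mid \vec{x}, \vec{q}) = \frac{P(\vec{x} \mid R=0, \vec{q})\, P(R=0 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}
```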

9
Binary Independence Model
  • Rank documents by their odds

constant
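
In the usual form of this step, the odds of relevance factor into a prior term and a likelihood ratio, and the label "constant" refers to the prior odds, which are the same for every document under a given query. The notation (x for the document's binary term vector, q for the query, R for relevance) is assumed here.

```latex
O(R \mid \vec{x}, \vec{q})
= \frac{P(R=1 \mid \vec{x}, \vec{q})}{P(R=0 \mid \vec{x}, \vec{q})}
= \underbrace{\frac{P(R=1 \mid \vec{q})}{P(R=0 \mid \vec{q})}}_{\text{constant for a given query}}
\cdot \frac{P(\vec{x} \mid R=1, \vec{q})}{P(\vec{x} \mid R=0, \vec{q})}
```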
10
Binary Independence Model
  • Make the Naïve Bayes conditional independence
    assumption
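
Under this assumption the likelihood ratio for the whole document vector factorizes over terms. A sketch, assuming x_t in {0, 1} indicates whether term t occurs in the document and M is the vocabulary size:

```latex
\frac{P(\vec{x} \mid R=1, \vec{q})}{P(\vec{x} \mid R=0, \vec{q})}
= \prod_{t=1}^{M} \frac{P(x_t \mid R=1, \vec{q})}{P(x_t \mid R=0, \vec{q})}
```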

11
Binary Independence Model
  • Let
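
The definitions usually introduced at this point are the probabilities that a term occurs in a relevant and in a nonrelevant document. The symbols p_t and u_t follow the common textbook convention and are an assumption here.

```latex
p_t = P(x_t = 1 \mid R = 1, \vec{q}), \qquad u_t = P(x_t = 1 \mid R = 0, \vec{q})
```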

12
Binary Independence Model
  • Assume that

constant
constant
useful
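
In the usual derivation, the assumption is that terms absent from the query are equally likely to occur in relevant and nonrelevant documents: if q_t = 0 then p_t = u_t, where p_t = P(x_t = 1 | R = 1, q) and u_t = P(x_t = 1 | R = 0, q). The odds can then be rearranged so that only one factor depends on the document. The following is a reconstruction under that assumption, with the labels "constant" and "useful" attached to the factors they most plausibly describe:

```latex
O(R \mid \vec{x}, \vec{q})
= \underbrace{O(R \mid \vec{q})}_{\text{constant}}
\cdot \underbrace{\prod_{t:\, x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)}}_{\text{useful}}
\cdot \underbrace{\prod_{t:\, q_t = 1} \frac{1 - p_t}{1 - u_t}}_{\text{constant}}
```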
13
Binary Independence Model
  • Taking the logarithm, which is a monotonic
    function, we get the retrieval status value (RSV)
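
A sketch of the resulting quantity, assuming p_t and u_t denote the probabilities that term t occurs in a relevant and in a nonrelevant document respectively, and that the sum runs over terms present in both document and query:

```latex
RSV_d = \sum_{t:\, x_t = q_t = 1} c_t,
\qquad
c_t = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}
= \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}
```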

log odds ratio
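
Summing log odds ratios over the terms shared by document and query can be sketched as follows. The function name `bim_rsv` and the values of p and u are hypothetical; in practice these probabilities have to be estimated, for example from relevance feedback.

```python
import math


def bim_rsv(doc_terms, query_terms, p, u):
    """Sum the log odds ratios c_t over terms present in both document and query.

    p[t]: assumed probability that term t occurs in a relevant document.
    u[t]: assumed probability that term t occurs in a nonrelevant document.
    """
    score = 0.0
    for t in set(doc_terms) & set(query_terms):
        c_t = math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
        score += c_t
    return score


# Hypothetical term probabilities for a two-term query.
p = {"okapi": 0.8, "model": 0.5}
u = {"okapi": 0.1, "model": 0.4}
query = ["okapi", "model"]

score_a = bim_rsv(["okapi", "model", "ranking"], query, p, u)
score_b = bim_rsv(["model", "vector"], query, p, u)
print(score_a > score_b)
```

Terms that occur only in the document or only in the query contribute nothing, which mirrors the binary, query-term-only nature of the model.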
14
Binary Independence Model
  • Assume that relevant documents are a very small
    percentage

IDF
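
A sketch of why this assumption leads to IDF: if nearly all documents are nonrelevant, the probability u_t that term t occurs in a nonrelevant document can be approximated by collection statistics. Here N is the number of documents and df_t the document frequency of term t (symbols assumed for illustration).

```latex
u_t \approx \frac{df_t}{N}
\quad\Longrightarrow\quad
\log \frac{1 - u_t}{u_t} = \log \frac{N - df_t}{df_t} \approx \log \frac{N}{df_t}
```

The last approximation holds when df_t is small relative to N.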
15
Okapi BM25
  • Assume that
  • Factor in the term frequencies (tf) and document
    length (Ld and Lave)
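
One commonly cited form of the BM25 score, omitting the optional query term frequency factor, is sketched below. Here tf_td is the frequency of term t in document d, L_d the document length, L_ave the average document length, and k_1 and b the tuning parameters; this form is an assumption and may differ in detail from the formula on the original slide.

```latex
RSV_d = \sum_{t \in q} \log \frac{N}{df_t} \cdot
\frac{(k_1 + 1)\, tf_{td}}{k_1 \left( (1 - b) + b \cdot \frac{L_d}{L_{ave}} \right) + tf_{td}}
```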

Ideally the parameters k1 and b should be tuned on a
validation set.
In practice, good values are 1.2 ≤ k1 ≤ 2 and b = 0.75.
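
A minimal sketch of BM25 scoring with these parameters, assuming the common form that weights each query term by log(N/df) and omits query term frequency. The function name `bm25_score` and all argument names are hypothetical, and production systems often use a smoothed IDF variant instead.

```python
import math


def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    """Score one document against a query with a basic BM25 variant.

    doc_tf: term -> frequency in the document; df: term -> document frequency.
    """
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue  # terms absent from the document contribute nothing
        idf = math.log(n_docs / df[t])
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * ((1 - b) + b * doc_len / avg_len) + tf
        score += idf * (k1 + 1) * tf / norm
    return score


# Hypothetical collection statistics.
df = {"okapi": 10}
short = bm25_score(["okapi"], {"okapi": 1}, doc_len=100, avg_len=100, df=df, n_docs=1000)
long_ = bm25_score(["okapi"], {"okapi": 1}, doc_len=400, avg_len=100, df=df, n_docs=1000)
print(short > long_)
```

With a single occurrence in a document of average length, the tf factor equals 1 and the score reduces to the plain IDF weight; the same occurrence in a longer document scores lower.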
16
Appraisal
  • Getting reasonable approximations of
    probabilities is possible, but requires
    restrictive assumptions. In the BIM, these are:
  • a Boolean representation of documents, queries
    and relevance
  • term independence
  • terms not in the query don't affect the outcome
  • document relevance values are independent
  • Problem: probabilistic models either require
    partial relevance information or can only derive
    somewhat inferior term weights.

17
Appraisal
  • Probabilistic methods are one of the oldest but
    also one of the currently hottest topics in IR.
  • Traditionally neat ideas, but they've never won
    on performance.
  • It may be different now. For example, the Okapi
    BM25 term weighting formulas have been very
    successful, especially in TREC evaluations.

18
Well-Known UK Researchers
Stephen Robertson
Keith van Rijsbergen
Karen Sparck Jones