Search and Retrieval: Term Weighting and Document Ranking

Transcript and Presenter's Notes

1
Search and Retrieval: Term Weighting and Document
Ranking
  • Prof. Marti Hearst
  • SIMS 202, Lecture 21

2
Today
  • Review Evaluation from last time
  • Term Weights and Document Ranking
  • Go over Midterm

3
Last Time
  • Relevance
  • Evaluation of IR Systems
  • Precision vs. Recall
  • Cutoff Points, E-measure, search length
  • Test Collections/TREC
  • Blair and Maron Study

4
What to Evaluate?
  • What can be measured that reflects users' ability
    to use the system? (Cleverdon 66)
  • Coverage of Information
  • Form of Presentation
  • Effort required/Ease of Use
  • Time and Space Efficiency
  • Recall
  • proportion of relevant material actually
    retrieved
  • Precision
  • proportion of retrieved material actually relevant

5
Retrieved vs. Relevant Documents
[Diagram: overlapping sets of Retrieved and Relevant documents, contrasting a hypothetical high-recall result with a hypothetical high-precision result]
6
Standard IR Evaluation
  • Precision = the fraction of retrieved documents
    that are relevant
  • Recall = the fraction of relevant documents that
    are retrieved

[Diagram: the collection with the retrieved documents and the relevant documents shown as overlapping regions; the pink/blue intersection is the set of retrieved relevant documents]
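
As a concrete illustration of the two set-based definitions above, a minimal Python sketch; the document IDs and judgments are hypothetical:

  # Hypothetical judgments: which documents are relevant to the query,
  # and which documents the system actually retrieved
  relevant = {1, 2, 3, 4, 5, 6, 7, 8}
  retrieved = {2, 4, 6, 9, 10}

  hits = relevant & retrieved                 # relevant documents that were retrieved
  precision = len(hits) / len(retrieved)      # 3 / 5 = 0.6
  recall = len(hits) / len(relevant)          # 3 / 8 = 0.375
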
7
Precision/Recall Curves
  • There is a tradeoff between Precision and Recall
  • So measure Precision at different levels of Recall

[Plot: precision on the y-axis against recall on the x-axis, with points marking precision measured at successive recall levels]
8
Expected Search Length
  • Documents are presented in order of predicted
    relevance
  • Search length = the number of non-relevant documents
    that the user must scan through before their
    information need is satisfied
  • The shorter, the better
  • In the example below: for n = 2, the search length
    is 2; for n = 3, the search length is 6

What is the correction for this?
9
Expected Search Length
Correction: n = 6, search length = 3
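
A minimal Python sketch of the search-length idea above; the ranking below is hypothetical, chosen so that n = 2 gives search length 2 and n = 3 gives search length 6, matching the example on the previous slide:

  def search_length(ranked_relevance, n):
      # Count non-relevant documents scanned before n relevant ones are found;
      # ranked_relevance is a list of booleans in ranked order (True = relevant)
      non_relevant = 0
      found = 0
      for is_relevant in ranked_relevance:
          if is_relevant:
              found += 1
              if found == n:
                  break
          else:
              non_relevant += 1
      return non_relevant

  ranking = [c == "R" for c in "NRNRNNNNR"]   # hypothetical ranked list
  print(search_length(ranking, 2))            # 2
  print(search_length(ranking, 3))            # 6
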
10
The E-Measure
  • Combine Precision and Recall into one number (van
    Rijsbergen 79)

P = precision, R = recall,
b = measure of the relative importance of P or R
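
The formula on the original slide is not reproduced in this transcript; van Rijsbergen's E-measure is usually written as (smaller values are better):

  E = 1 - \frac{(b^2 + 1) P R}{b^2 P + R}
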
11
The E and F-Measures
  • Combine Precision and Recall into one number (van
    Rijsbergen 79)

P = precision, R = recall
With the F-measure, larger values are better.
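
The F-measure formula, also not reproduced in this transcript, is the complement of E:

  F = 1 - E = \frac{(b^2 + 1) P R}{b^2 P + R}, \qquad F_{b=1} = \frac{2 P R}{P + R}
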
12
The F-Measure
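
The formula figure for this slide is not in the transcript; a minimal Python sketch of the F-measure defined above (the function name is illustrative):

  def f_measure(precision, recall, b=1.0):
      # Combine precision and recall into one number; larger is better
      if precision == 0 and recall == 0:
          return 0.0
      return (b ** 2 + 1) * precision * recall / (b ** 2 * precision + recall)

  print(f_measure(0.6, 0.375))   # balanced F (b = 1) for the earlier example, about 0.46
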
13
TREC
  • Text REtrieval Conference/Competition
  • Run by NIST (the National Institute of Standards
    and Technology)
  • 1997 was the 6th year
  • Collection: 3 gigabytes, >1 million docs
  • Newswire and full-text news (AP, WSJ, Ziff)
  • Government documents (Federal Register)
  • Queries and Relevance Judgments
  • Queries devised and judged by Information
    Specialists
  • Relevance judgments done only for those documents
    retrieved -- not the entire collection!
  • Competition
  • Various research and commercial groups compete
  • Results judged on precision and recall, going up
    to a recall level of 1000 documents

14
Other facts about TREC
  • Recall is only an estimate based on the documents
    that have actually been assigned judgments
  • Recall is judged up to a maximum of 1000 relevant
    documents per query
  • In the standard ad hoc situation, everything is
    automated; no human intervention is allowed.
  • Ad hoc = a one-shot query
  • Standing query/routing = a query that is asked
    over and over again; one can train the system to
    react to it.

15
Finding Out About
  • Three phases
  • Asking of a question
  • Construction of an answer
  • Assessment of the answer
  • Part of an iterative process

16
Problems with Boolean
  • Difficult to control
  • AND: get back too few
  • OR: get back too many
  • No inherent ordering

17
Ranking Algorithms
  • Assign weights to the terms in the query.
  • Assign weights to the terms in the documents.
  • Compare the weighted query terms to the weighted
    document terms.
  • Rank order the results.
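
A minimal Python sketch of the four steps above, using a simple dot product between term-weight dictionaries; the terms and weights are hypothetical placeholders:

  # Steps 1-2: weighted query terms and weighted document terms (term -> weight)
  query = {"term": 0.8, "weighting": 0.6}
  docs = {
      "d1": {"term": 0.5, "weighting": 0.4, "boolean": 0.1},
      "d2": {"boolean": 0.9, "query": 0.3},
  }

  # Step 3: compare the weighted query terms to the weighted document terms
  def score(q, d):
      return sum(w * d.get(t, 0.0) for t, w in q.items())

  # Step 4: rank order the results
  ranked = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
  print(ranked)   # ['d1', 'd2']
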

18
[Diagram: overview of the retrieval process, connecting the user's information need, the document collections, and the text input to the system]
19
Vector Representation (revisited; see the Salton
article in Science)
  • Documents and Queries are represented as vectors.
  • Position 1 corresponds to term 1, position 2 to
    term 2, position t to term t
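
A minimal Python sketch of the representation above: fix an ordered vocabulary, then store each document and query as a list whose position i holds the weight of term i (the vocabulary and weights are hypothetical):

  vocabulary = ["boolean", "ranking", "term", "weighting"]   # term i at position i

  doc_vector = [0.0, 0.7, 0.5, 0.5]     # weights of the four terms in one document
  query_vector = [0.0, 1.0, 0.0, 1.0]   # weights of the four terms in the query
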

20
Assigning Weights
  • Recall the Zipf distribution
  • Want to weight terms highly if they are
  • frequent in relevant documents BUT
  • infrequent in the collection as a whole

21
Assigning Weights
  • tf x idf measure
  • term frequency (tf)
  • inverse document frequency (idf)
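
A minimal Python sketch of the tf x idf weight named above, over a hypothetical toy corpus; idf is computed here as log(N / document frequency), one common variant:

  import math

  corpus = [                                      # hypothetical corpus: bags of terms
      ["term", "weighting", "ranking"],
      ["boolean", "query", "term"],
      ["ranking", "ranking", "retrieval"],
  ]
  N = len(corpus)

  def tf_idf(term, doc):
      tf = doc.count(term)                          # term frequency within this document
      df = sum(1 for d in corpus if term in d)      # number of documents containing term
      return tf * math.log(N / df) if df else 0.0   # scale by inverse document frequency

  print(tf_idf("ranking", corpus[2]))               # 2 * log(3/2), about 0.81
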

22
tf x idf
  • Normalize the term weights (so longer documents
    are not unfairly given more weight)
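
One common way to carry out the normalization above is to divide each document's weight vector by its Euclidean length, so long and short documents are compared on the same scale; a sketch with hypothetical weights:

  import math

  weights = [0.81, 0.0, 1.10]                        # hypothetical tf x idf weights
  length = math.sqrt(sum(w * w for w in weights))    # Euclidean (L2) vector length
  normalized = [w / length for w in weights]         # unit-length document vector
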

23
Vector Space Similarity Measure: combine tf x idf
into a similarity measure
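
The similarity formula on the original slide is not reproduced here; the usual vector-space choice is the cosine of the angle between the query vector and each document vector, sketched below in Python:

  import math

  def cosine(u, v):
      # Cosine similarity between two equal-length weight vectors
      dot = sum(a * b for a, b in zip(u, v))
      norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
      return dot / norm if norm else 0.0

  print(cosine([0.0, 1.0, 0.0, 1.0], [0.0, 0.7, 0.5, 0.5]))   # query vs. document
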
24
To Think About
  • How does this ranking algorithm behave?
  • Make a set of hypothetical documents consisting
    of terms and their weights
  • Create some hypothetical queries
  • How are the documents ranked, depending on the
    weights of their terms and of the query's terms?
    (A worked sketch follows below.)
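
A small worked answer to the exercise above, in Python; the documents, term weights, and queries are all made up for illustration:

  import math

  docs = {                                  # hypothetical documents: term -> weight
      "d1": {"apple": 0.8, "pie": 0.6},
      "d2": {"apple": 0.3, "computer": 0.9},
      "d3": {"computer": 0.7, "pie": 0.2},
  }

  def cosine(q, d):
      dot = sum(w * d.get(t, 0.0) for t, w in q.items())
      nq = math.sqrt(sum(w * w for w in q.values()))
      nd = math.sqrt(sum(w * w for w in d.values()))
      return dot / (nq * nd) if nq and nd else 0.0

  for query in ({"apple": 1.0}, {"computer": 1.0, "pie": 1.0}):
      ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
      print(query, ranking)   # the ranking shifts as the query terms and weights change
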