Title: Evaluation Measures
1. Evaluation Measures
- "You can't manage what you can't measure."
- Peter Drucker
2. Evaluation?
- Effectiveness?
- For whom?
- For what?
- Efficiency?
- Time?
- Computational Cost?
- Cost of missed information? Too much info?
3. Studies of Retrieval Effectiveness
- The Cranfield Experiments: Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957-1968 (hundreds of docs)
- SMART System: Gerald Salton, Cornell University, 1964-1988 (thousands of docs)
- TREC: Donna Harman, National Institute of Standards and Technology (NIST), 1992- (millions of docs, 100k to 7.5M per set, training and test questions, 150 each)
4. What can we measure?
- ???
- Algorithm (Efficiency)
- Speed of algorithm
- Update potential of indexing scheme
- Size of storage required
- Potential for distribution / parallelism
- User Experience (Effectiveness)
- How many of all relevant docs were found
- How many were missed
- How many errors in selection
- How many need to be scanned before the good ones are reached
5. Measures based on relevance
The document set is partitioned into four quadrants:
- RR: retrieved and relevant
- RN: retrieved but not relevant
- NR: not retrieved but relevant
- NN: not retrieved and not relevant
6. Relevance
- The system is always correct vis-à-vis the match!!
- Who judges relevance?
- Inter-rater reliability
- Early evaluations
- Done by panel of experts
- 1-2000 abstracts of docs
- TREC experiments
- Done automatically
- thousands of docs
- Pooling, judged by people
7. Defining the universe of relevant docs
- Manual inspection
- Manual exhaustive search
- Pooling (TREC)
- Relevant set is the union of multiple techniques
- Sampling
- Take a random sample
- Inspect
- Estimate from the sample for the whole set
8. Defining the relevant docs in a retrieved set (hit list)
- Panel of judges
- Individual users
- Automatic detection techniques
- Vocabulary overlap with known relevant docs
- metadata
9. Evaluation Measures
- Recall, Precision, and Fallout
- Single valued measures
- Macro and micro averages
- F-values
- R-precision
- E-measure
- Mean Average Precision (MAP)
10. General Approaches
- Global Approach
- Evaluation with respect to whole corpus
- unranked
- Local Approach
- Evaluation with respect to retrieved set
- Use rank
11. Recall and Precision
- Recall
- Proportion of relevant docs retrieved
- Precision
- Proportion of retrieved docs that are relevant
12. I. Global Measures based on relevance
(The same quadrant diagram of the document set as in slide 5: RR, RN, NR, NN.)
13. Measures based on relevance
- recall = RR / (RR + NR)
- precision = RR / (RR + RN)
- fallout = RN / (RN + NN)
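A minimal sketch of these three measures in code; the function name and quadrant-count arguments are mine, mirroring the labels above:

```python
def relevance_measures(rr, rn, nr, nn):
    """Recall, precision, and fallout from the four quadrant counts.

    rr: retrieved & relevant       rn: retrieved & not relevant
    nr: not retrieved & relevant   nn: not retrieved & not relevant
    """
    recall = rr / (rr + nr) if rr + nr else 0.0     # share of relevant docs that were retrieved
    precision = rr / (rr + rn) if rr + rn else 0.0  # share of retrieved docs that are relevant
    fallout = rn / (rn + nn) if rn + nn else 0.0    # share of non-relevant docs that were retrieved
    return recall, precision, fallout

# Hypothetical counts: 2 relevant retrieved, 3 non-relevant retrieved,
# 8 relevant missed, 187 non-relevant left behind.
print(relevance_measures(2, 3, 8, 187))   # (0.2, 0.4, ~0.016)
```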
14. Formula (what do we have to work with?)
- Rq = number of docs in the whole data set relevant to query q
- Rr = number of docs in the hit set (retrieved docs) that are relevant
- Rh = number of docs in the hit set
- Recall = Rr / Rq;  Precision = Rr / Rh
15. Collection
(Diagram: the document collection, with the relevant set Rq and the hit set Rh overlapping; their intersection is Rr.)
16. Recall and Precision: a worked example
- Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
- Rh = {d6, d8, d9, d84, d123}
- Rr = Rq ∩ Rh = {d9, d123}
- Recall = 2/10 = 0.2;  Precision = 2/5 = 0.4
- What does that tell us?
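The same worked example as a short Python sketch (the document IDs are the ones listed on the slide):

```python
# Recall and precision for the worked example above, using plain Python sets.
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}  # relevant docs in the collection
Rh = {"d6", "d8", "d9", "d84", "d123"}                                     # retrieved docs (hit set)

Rr = Rq & Rh                     # relevant docs that were actually retrieved -> {"d9", "d123"}
recall = len(Rr) / len(Rq)       # 2 / 10 = 0.2
precision = len(Rr) / len(Rh)    # 2 / 5  = 0.4
print(Rr, recall, precision)
```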
17. Recall-Precision Graphs for sets of queries in a given system
(Graph: precision on the y-axis vs. recall on the x-axis, both 0-100%.)
18. Typical recall-precision graph
(Graph: precision (0-1.0) vs. recall (0-1.0), with one curve labelled "narrow, specific query" and a lower one labelled "broad, general query".)
19. Recall-precision graphs to compare systems
(Graph: precision vs. recall curves for two systems.) The red system appears better than the black, but is the difference statistically significant?
20. II. Working with Ranked Hit Sets
- Document set is reduced to retrieved docs
- Assume user will start at top and stop when
satisfied
21. Recall-precision after retrieval of n documents

 n   relevant   recall   precision
 1   yes        0.2      1.0
 2   yes        0.4      1.0
 3   no         0.4      0.67
 4   yes        0.6      0.75
 5   no         0.6      0.60
 6   yes        0.8      0.67
 7   no         0.8      0.57
 8   no         0.8      0.50
 9   no         0.8      0.44
10   no         0.8      0.40
11   no         0.8      0.36
12   no         0.8      0.33
13   yes        1.0      0.38
14   no         1.0      0.36

SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.
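A sketch that regenerates the table above from per-rank relevance judgements; the flags list and the total of 5 relevant documents are read off the slide:

```python
def ranked_recall_precision(relevant_flags, total_relevant):
    """Recall and precision after each of the top-n documents is examined."""
    rows, hits = [], 0
    for n, is_rel in enumerate(relevant_flags, start=1):
        hits += is_rel
        rows.append((n, hits / total_relevant, hits / n))   # (n, recall, precision)
    return rows

# Relevance of the 14 top-ranked documents from the table above (5 relevant in total).
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
for n, recall, precision in ranked_recall_precision(flags, total_relevant=5):
    print(f"{n:2d}  recall={recall:.1f}  precision={precision:.2f}")
```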
22. Recall-precision graph for a single query
(Graph: the 14 (recall, precision) points from the table above, labelled by rank position.)
23. Consider Rank
- Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
- Rh (in rank order) = d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
- Rr = {d123, d56, d9, d25}
- Recall = 4/10 = 0.4;  Precision = 4/10 = 0.4
- What happens as we go through the hits?
24. Standard Recall Levels
- Plot precision at recall = 0%, 10%, ..., 100%
(Graph: precision (P) vs. recall (R), both 0-100%.)
25. Compare Results of two queries
- Q1
- Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
- Rh (in rank order) = d123, d84, d56, d8, d9, d511, d129, d25, d38, d3
- Q2
- Rq = {d3, d5, d56, d89, d90, d94, d129, d206, d500, d502}
- Rh (in rank order) = d12, d84, d56, d5, d8, d3, d511, d129, d44, d89
26. Comparison of Query results
27. P-R for Multiple Queries
- For each recall level, average the precision across queries
- Avg precision at recall r:  P(r) = (1/Nq) * Σ_i Pi(r)
- Nq is the number of queries
- Pi(r) is the precision at recall level r for the ith query
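A sketch of this averaging, assuming each query already has interpolated precision values at the 11 standard recall levels (the numbers below are made up for illustration):

```python
def avg_precision_at_recall_levels(per_query_precisions):
    """Average Pi(r) over queries at each standard recall level (0.0, 0.1, ..., 1.0)."""
    nq = len(per_query_precisions)
    return [sum(p[level] for p in per_query_precisions) / nq for level in range(11)]

# Hypothetical interpolated precision values at the 11 standard recall levels, one list per query.
q1 = [1.0, 1.0, 0.8, 0.8, 0.6, 0.6, 0.5, 0.4, 0.4, 0.3, 0.3]
q2 = [1.0, 0.9, 0.9, 0.7, 0.7, 0.5, 0.5, 0.4, 0.3, 0.2, 0.2]
print(avg_precision_at_recall_levels([q1, q2]))
```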
28. Comparison of two systems
29. Macro and Micro Averaging
- Micro average: average over every individual point (pool all queries' counts)
- Macro average: average of the per-query averages
- Example (see the sketch below)
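A sketch of the difference, assuming per-query counts of (relevant retrieved, total retrieved); the counts are made up for illustration:

```python
# Macro vs. micro averaging of precision over three queries.
# Each tuple is (relevant retrieved, total retrieved) for one query.
per_query = [(2, 5), (8, 10), (1, 20)]

# Macro: compute precision per query, then average -> every query counts equally.
macro = sum(rr / ret for rr, ret in per_query) / len(per_query)

# Micro: pool the counts over all retrieved documents first -> every document counts equally.
micro = sum(rr for rr, _ in per_query) / sum(ret for _, ret in per_query)

print(f"macro = {macro:.3f}   micro = {micro:.3f}")   # 0.417 vs. 0.314
```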
30. Statistical tests
Suppose that a search is carried out on systems i and j. System i is superior to system j if recall(i) > recall(j) and precision(i) > precision(j).
31. Statistical tests
The t-test is the standard statistical test for comparing two tables of numbers, but it depends on statistical assumptions of independence and normal distributions that do not apply to this data. The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples. The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but assumes independent samples.
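A sketch of how these paired tests might be run on per-query scores using SciPy; the score values are made up, and the sign test is implemented here via a binomial test (which needs a recent SciPy for `binomtest`):

```python
from scipy.stats import binomtest, ttest_rel, wilcoxon

sys_i = [0.42, 0.35, 0.61, 0.28, 0.50, 0.47, 0.33, 0.58]   # e.g. average precision per query
sys_j = [0.38, 0.31, 0.60, 0.25, 0.44, 0.49, 0.30, 0.52]

print(ttest_rel(sys_i, sys_j))    # paired t-test: assumes normally distributed differences
print(wilcoxon(sys_i, sys_j))     # Wilcoxon signed-rank: uses the ranks of the differences

wins = sum(a > b for a, b in zip(sys_i, sys_j))
ties = sum(a == b for a, b in zip(sys_i, sys_j))
print(binomtest(wins, len(sys_i) - ties, 0.5))   # sign test: uses only the sign of each difference
```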
32. II. Single-Value Measures
- E-measure / F1 measure
- Normalized Recall
- Expected Search Length
- R-Precision
- Mean Average Precision (MAP)
33. Single-Valued Measures
- Expected search length
- Average rank of the first relevant document
- Mean precision at a fixed number of documents
- Precision at 10 docs is often used for Web search
- R-precision at a fixed recall level
- Adjusts for the total number of relevant docs
- Mean breakeven point
- Value at which precision = recall
- Mean Average Precision (MAP)
- Interpolated: avg precision at recall = 0.0, 0.1, ..., 1.0
- Uninterpolated: avg precision at each relevant doc
34. Single Set-Based Measures
- Balanced F-measure
- Harmonic mean of recall and precision: F(j) = 2·p(j)·r(j) / (p(j) + r(j))
- r(j) is the recall for the jth doc retrieved
- p(j) is the precision for the jth doc retrieved
- What if no relevant documents exist?
35. E-Measure
- Complement of the F-measure: E = 1 - F (for beta = 1, E = 1 - F1)
- Can increase the importance of recall or precision
- Increasing beta gives more weight to recall
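A sketch of both measures under the usual F_beta definition (an assumption, since the slide's own formula isn't preserved); the zero-denominator question from slide 34 is handled by a simple convention here:

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta: weighted harmonic mean of precision and recall (beta > 1 favours recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0                       # convention when nothing relevant is retrieved
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def e_measure(precision, recall, beta=1.0):
    """E-measure: complement of F, so lower values are better."""
    return 1.0 - f_measure(precision, recall, beta)

print(f_measure(0.4, 0.2))             # balanced F1 ~= 0.267
print(e_measure(0.4, 0.2, beta=2.0))   # beta = 2 weights recall more heavily
```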
36. Normalized Recall
- Recall is normalized against all relevant documents.
- Suppose there are N documents in the collection, of which n are relevant. These n documents appear at ranks r1, r2, ..., rn.
- Normalized recall is the distance of the actual ranking from the ideal ranking.
37. Normalized recall measure
(Graph: recall vs. the ranks of the retrieved documents (1-200), showing the ideal, actual, and worst-case ranking curves.)
38. Normalized recall
- Normalized recall = (area between actual and worst) / (area between best and worst)
- Rnorm = 1 - (Σ ri - Σ i) / (n(N - n))
39. Example: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14
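A sketch of the computation for this example, using the Rnorm formula from slide 38:

```python
def normalized_recall(relevant_ranks, total_docs):
    """Rnorm = 1 - (sum of actual ranks - sum of ideal ranks) / (n * (N - n))."""
    n, N = len(relevant_ranks), total_docs
    ideal = sum(range(1, n + 1))          # ideal case: the n relevant docs occupy ranks 1..n
    return 1 - (sum(relevant_ranks) - ideal) / (n * (N - n))

# The example above: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14.
print(normalized_recall([1, 3, 5, 10, 14], total_docs=200))   # 1 - 18/975 ~= 0.9815
```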
40. Expected Search Length
- Average rank of the first relevant document
- Or, more generally, of the first n relevant docs
- Assume weak or no ordering
- Lq = j + i·n / (r + 1)
- n is the required number of relevant docs
- r is the number of relevant docs in the level being searched
- i is the number of non-relevant docs in that level
- j is the number of non-relevant docs examined before reaching that level
41. A query to find 1 relevant document would have a search length of 2, and a query to find 2 relevant documents would have a search length of 6 (2 for the 1st and 4 for the 2nd).
42. To get 2 relevant docs in 5 hits
- The possible search lengths are 3, 4, 5, or 6. To calculate the expected search length, we first count how many different ways two relevant documents could be distributed among 5 positions: C(5,2) = 10. Of these, 4 give a search length of 3, 3 give 4, 2 give 5, and 1 gives 6. Their corresponding probabilities are 4/10, 3/10, 2/10, and 1/10. The expected search length is then
- (4/10)·3 + (3/10)·4 + (2/10)·5 + (1/10)·6 = 4
- This method lets us average the results of many searches to estimate the expected search length.
- Most often only used for the first relevant doc found!!!
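A sketch that reproduces the expected value of 4 by enumerating all C(5,2) placements; it counts the search length as the rank at which the search stops, which yields the same expectation even though the slide tallies the individual outcomes a little differently:

```python
from itertools import combinations

def expected_search_length(hits, relevant, needed):
    """Expected rank of the `needed`-th relevant document, assuming the `relevant`
    docs are equally likely to occupy any positions among the `hits` retrieved docs."""
    placements = list(combinations(range(1, hits + 1), relevant))
    stop_ranks = [ranks[needed - 1] for ranks in placements]   # combinations() yields sorted tuples
    return sum(stop_ranks) / len(placements)

# The example above: 2 relevant docs somewhere among 5 hits, and we want both of them.
print(expected_search_length(hits=5, relevant=2, needed=2))   # 4.0, matching the slide
```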
43. R-Precision
- Measures precision (equivalently, recall) in the top R ranked documents
- Expecting R relevant documents
- Retrieve a set of R documents
- Measure the precision within those R
- Best case: precision = 1 (all R docs are relevant)
- Low R-precision -> where are the relevant docs?
44. R-Precision example
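The original example image isn't preserved, so here is a minimal stand-in using the ranked hit list and relevant set from slide 23:

```python
def r_precision(ranked_doc_ids, relevant_ids):
    """Precision over the top R retrieved documents, where R = number of relevant docs."""
    R = len(relevant_ids)
    return sum(1 for d in ranked_doc_ids[:R] if d in relevant_ids) / R

# Ranked hit list and relevant set from slide 23.
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
print(r_precision(ranking, relevant))   # 4 of the top R=10 are relevant -> 0.4
```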
45. Mean Average Precision (MAP)
- Average the precision measured after each relevant doc is retrieved
- For each query, calculate this average precision; MAP is the mean over queries
46. MAP example
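Since the example image isn't preserved, a minimal sketch of uninterpolated average precision and MAP over two hypothetical queries:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of the precision values measured at each relevant document retrieved."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_doc_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP: the mean of the per-query average precisions."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries, each a (ranked hit list, relevant set) pair.
runs = [(["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d9", "d7", "d5"], {"d5", "d6"})]
print(mean_average_precision(runs))   # (0.833 + 0.167) / 2 = 0.5
```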
47. Why Mean Average Precision?
- It is easy to trade recall against precision
- Adding related query terms improves recall
- But query expansion may decrease precision
- Restricting queries by part-of-speech, phrases, etc. improves precision
- But generally decreases recall
- Comparisons should reflect this balance
- Mean Average Precision does this
48. MAP example
49. Visualizing Mean Average Precision
(Chart: average precision (0.0-1.0) plotted per topic; MAP is the mean of these values.)
50. What MAP Hides
Adapted from a presentation by Ellen Voorhees at
the University of Maryland, March 29, 1999
51. Considerations: Single-Value Measures
- Expected search length
- Task dependent
- Mean precision at n documents
- There may not be n relevant documents
- R-Precision
- A specific fraction is rarely the user's goal
- Mean Average Precision
- Task may determine balance of precision or recall
52. Problems with testing
- Determining relevant docs
- Setting up test questions
- Comparing results
- Understanding relevance of the results
53. Thought du jour
- What evaluation metric would you use to compare
Google and Yahoo?