Title: Evaluation Measures
1. Evaluation Measures
- "You can't manage what you can't measure."
- Peter Drucker
2. Evaluation?
- Effectiveness?
- For whom?
- For what?
- Efficiency?
- Time?
- Computational Cost?
- Cost of missed information? Too much info?
3. Studies of Retrieval Effectiveness
- The Cranfield Experiments: Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957-1968 (hundreds of docs)
- SMART System: Gerald Salton, Cornell University, 1964-1988 (thousands of docs)
- TREC: Donna Harman, National Institute of Standards and Technology (NIST), 1992- (millions of docs, 100k to 7.5M per set, training and test questions, 150 each)
4. What can we measure?
- ???
- Algorithm (Efficiency)
- Speed of algorithm
- Update potential of indexing scheme
- Size of storage required
- Potential for distribution / parallelism
- User Experience (Effectiveness)
- How many of all relevant docs were found
- How many were missed
- How many errors in selection
- How many need to be scanned before the good ones are reached
5. Measures based on relevance
The document set is partitioned into four quadrants:
- RR: retrieved and relevant
- RN: retrieved but not relevant
- NR: not retrieved but relevant
- NN: not retrieved and not relevant
6. Relevance
- The system is always correct vis-à-vis the match!!
- Who judges relevance?
- Inter-rater reliability
- Early evaluations
- Done by panel of experts
- 1-2000 abstracts of docs
- TREC experiments
- Done automatically
- thousands of docs
- Pooling, judged by people
7. Defining the universe of relevant docs
- Manual inspection
- Manual exhaustive search
- Pooling (TREC)
- Relevant set is the union of multiple techniques
- Sampling
- Take a random sample
- Inspect
- Estimate from the sample for the whole set
8. Defining the relevant docs in a retrieved set (hit list)
- Panel of judges
- Individual users
- Automatic detection techniques
- Vocabulary overlap with known relevant docs
- metadata
9. Evaluation Measures
- Recall, Precision, and Fallout
- Single valued measures
- Macro and micro averages
- F-values
- R-precision
- E-measure
- Mean Average Precision (MAP)
10. General Approaches
- Global Approach
- Evaluation with respect to whole corpus
- unranked
- Local Approach
- Evaluation with respect to retrieved set
- Use rank
11. Recall and Precision
- Recall
- Proportion of relevant docs retrieved
- Precision
- Proportion of retrieved docs that are relevant
12. I. Global Measures based on relevance
(The same quadrant diagram of the document set as in slide 5: RR, RN, NR, NN.)
13. Measures based on relevance
- recall = RR / (RR + NR)
- precision = RR / (RR + RN)
- fallout = RN / (RN + NN)
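A minimal sketch of these three measures in code; the function name and quadrant-count arguments are mine, mirroring the labels above:

```python
def relevance_measures(rr, rn, nr, nn):
    """Recall, precision, and fallout from the four quadrant counts.

    rr: retrieved & relevant       rn: retrieved & not relevant
    nr: not retrieved & relevant   nn: not retrieved & not relevant
    """
    recall = rr / (rr + nr) if rr + nr else 0.0     # share of relevant docs that were retrieved
    precision = rr / (rr + rn) if rr + rn else 0.0  # share of retrieved docs that are relevant
    fallout = rn / (rn + nn) if rn + nn else 0.0    # share of non-relevant docs that were retrieved
    return recall, precision, fallout

# Hypothetical counts: 2 relevant retrieved, 3 non-relevant retrieved,
# 8 relevant missed, 187 non-relevant left behind.
print(relevance_measures(2, 3, 8, 187))   # (0.2, 0.4, ~0.016)
```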
14. Formula (what do we have to work with?)
- Rq = number of docs in the whole data set relevant to query q
- Rr = number of docs in the hit set (retrieved docs) that are relevant
- Rh = number of docs in the hit set
- Recall = Rr / Rq;  Precision = Rr / Rh
15. Collection
(Diagram: the document collection, with the relevant set Rq and the hit set Rh overlapping; their intersection is Rr.)
16. Recall and Precision: a worked example
- Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
- Rh = {d6, d8, d9, d84, d123}
- Rr = Rq ∩ Rh = {d9, d123}
- Recall = 2/10 = 0.2;  Precision = 2/5 = 0.4
- What does that tell us?
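The same worked example as a short Python sketch (the document IDs are the ones listed on the slide):

```python
# Recall and precision for the worked example above, using plain Python sets.
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}  # relevant docs in the collection
Rh = {"d6", "d8", "d9", "d84", "d123"}                                     # retrieved docs (hit set)

Rr = Rq & Rh                     # relevant docs that were actually retrieved -> {"d9", "d123"}
recall = len(Rr) / len(Rq)       # 2 / 10 = 0.2
precision = len(Rr) / len(Rh)    # 2 / 5  = 0.4
print(Rr, recall, precision)
```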
17. Recall-Precision Graphs for sets of queries in a given system
(Graph: precision on the y-axis vs. recall on the x-axis, both 0-100%.)
18. Typical recall-precision graph
(Graph: precision (0-1.0) vs. recall (0-1.0), with one curve labelled "narrow, specific query" and a lower one labelled "broad, general query".)
19. Recall-precision graphs to compare systems
(Graph: precision vs. recall curves for two systems.) The red system appears better than the black, but is the difference statistically significant?
20. II. Working with Ranked Hit Sets
- Document set is reduced to retrieved docs
- Assume user will start at top and stop when
satisfied
21. Recall-precision after retrieval of n documents

 n   relevant   recall   precision
 1   yes        0.2      1.0
 2   yes        0.4      1.0
 3   no         0.4      0.67
 4   yes        0.6      0.75
 5   no         0.6      0.60
 6   yes        0.8      0.67
 7   no         0.8      0.57
 8   no         0.8      0.50
 9   no         0.8      0.44
10   no         0.8      0.40
11   no         0.8      0.36
12   no         0.8      0.33
13   yes        1.0      0.38
14   no         1.0      0.36

SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.
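A sketch that regenerates the table above from per-rank relevance judgements; the flags list and the total of 5 relevant documents are read off the slide:

```python
def ranked_recall_precision(relevant_flags, total_relevant):
    """Recall and precision after each of the top-n documents is examined."""
    rows, hits = [], 0
    for n, is_rel in enumerate(relevant_flags, start=1):
        hits += is_rel
        rows.append((n, hits / total_relevant, hits / n))   # (n, recall, precision)
    return rows

# Relevance of the 14 top-ranked documents from the table above (5 relevant in total).
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
for n, recall, precision in ranked_recall_precision(flags, total_relevant=5):
    print(f"{n:2d}  recall={recall:.1f}  precision={precision:.2f}")
```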
22. Recall-precision graph for a single query
(Graph: the 14 (recall, precision) points from the table above, labelled by rank position.)
23. Consider Rank
- Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
- Rh (in rank order) = d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
- Rr = {d123, d56, d9, d25}
- Recall = 4/10 = 0.4;  Precision = 4/10 = 0.4
- What happens as we go through the hits?
24. Standard Recall Levels
- Plot precision at recall = 0%, 10%, ..., 100%
(Graph: precision (P) vs. recall (R), both 0-100%.)
25. Compare Results of two queries
- Q1
- Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
- Rh (in rank order) = d123, d84, d56, d8, d9, d511, d129, d25, d38, d3
- Q2
- Rq = {d3, d5, d56, d89, d90, d94, d129, d206, d500, d502}
- Rh (in rank order) = d12, d84, d56, d5, d8, d3, d511, d129, d44, d89
26. Comparison of Query results
27. P-R for Multiple Queries
- For each recall level, average the precision across queries
- Avg precision at recall r:  P(r) = (1/Nq) * Σ_i Pi(r)
- Nq is the number of queries
- Pi(r) is the precision at recall level r for the ith query
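A sketch of this averaging, assuming each query already has interpolated precision values at the 11 standard recall levels (the numbers below are made up for illustration):

```python
def avg_precision_at_recall_levels(per_query_precisions):
    """Average Pi(r) over queries at each standard recall level (0.0, 0.1, ..., 1.0)."""
    nq = len(per_query_precisions)
    return [sum(p[level] for p in per_query_precisions) / nq for level in range(11)]

# Hypothetical interpolated precision values at the 11 standard recall levels, one list per query.
q1 = [1.0, 1.0, 0.8, 0.8, 0.6, 0.6, 0.5, 0.4, 0.4, 0.3, 0.3]
q2 = [1.0, 0.9, 0.9, 0.7, 0.7, 0.5, 0.5, 0.4, 0.3, 0.2, 0.2]
print(avg_precision_at_recall_levels([q1, q2]))
```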
28. Comparison of two systems
29. Macro and Micro Averaging
- Micro average: average over every individual point (pool all queries' counts)
- Macro average: average of the per-query averages
- Example (see the sketch below)
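A sketch of the difference, assuming per-query counts of (relevant retrieved, total retrieved); the counts are made up for illustration:

```python
# Macro vs. micro averaging of precision over three queries.
# Each tuple is (relevant retrieved, total retrieved) for one query.
per_query = [(2, 5), (8, 10), (1, 20)]

# Macro: compute precision per query, then average -> every query counts equally.
macro = sum(rr / ret for rr, ret in per_query) / len(per_query)

# Micro: pool the counts over all retrieved documents first -> every document counts equally.
micro = sum(rr for rr, _ in per_query) / sum(ret for _, ret in per_query)

print(f"macro = {macro:.3f}   micro = {micro:.3f}")   # 0.417 vs. 0.314
```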
30. Statistical tests
Suppose that a search is carried out on systems i and j. System i is superior to system j if recall(i) > recall(j) and precision(i) > precision(j).
31. Statistical tests
The t-test is the standard statistical test for comparing two tables of numbers, but it depends on statistical assumptions of independence and normal distributions that do not apply to this data. The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples. The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but assumes independent samples.
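A sketch of how these paired tests might be run on per-query scores using SciPy; the score values are made up, and the sign test is implemented here via a binomial test (which needs a recent SciPy for `binomtest`):

```python
from scipy.stats import binomtest, ttest_rel, wilcoxon

sys_i = [0.42, 0.35, 0.61, 0.28, 0.50, 0.47, 0.33, 0.58]   # e.g. average precision per query
sys_j = [0.38, 0.31, 0.60, 0.25, 0.44, 0.49, 0.30, 0.52]

print(ttest_rel(sys_i, sys_j))    # paired t-test: assumes normally distributed differences
print(wilcoxon(sys_i, sys_j))     # Wilcoxon signed-rank: uses the ranks of the differences

wins = sum(a > b for a, b in zip(sys_i, sys_j))
ties = sum(a == b for a, b in zip(sys_i, sys_j))
print(binomtest(wins, len(sys_i) - ties, 0.5))   # sign test: uses only the sign of each difference
```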
32. II. Single-Value Measures
- E-measure / F1 measure
- Normalized Recall
- Expected Search Length
- R-Precision
- Mean Average Precision (MAP)
33. Single-Valued Measures
- Expected search length
- Average rank of the first relevant document
- Mean precision at a fixed number of documents
- Precision at 10 docs is often used for Web search
- R-precision at a fixed recall level
- Adjusts for the total number of relevant docs
- Mean breakeven point
- Value at which precision = recall
- Mean Average Precision (MAP)
- Interpolated: avg precision at recall = 0.0, 0.1, ..., 1.0
- Uninterpolated: avg precision at each relevant doc
34. Single Set-Based Measures
- Balanced F-measure
- Harmonic mean of recall and precision: F(j) = 2·p(j)·r(j) / (p(j) + r(j))
- r(j) is the recall for the jth doc retrieved
- p(j) is the precision for the jth doc retrieved
- What if no relevant documents exist?
35. E-Measure
- Complement of the F-measure: E = 1 - F (for beta = 1, E = 1 - F1)
- Can increase the importance of recall or precision
- Increasing beta gives more weight to recall
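A sketch of both measures under the usual F_beta definition (an assumption, since the slide's own formula isn't preserved); the zero-denominator question from slide 34 is handled by a simple convention here:

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta: weighted harmonic mean of precision and recall (beta > 1 favours recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0                       # convention when nothing relevant is retrieved
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def e_measure(precision, recall, beta=1.0):
    """E-measure: complement of F, so lower values are better."""
    return 1.0 - f_measure(precision, recall, beta)

print(f_measure(0.4, 0.2))             # balanced F1 ~= 0.267
print(e_measure(0.4, 0.2, beta=2.0))   # beta = 2 weights recall more heavily
```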
36. Normalized Recall
- Recall is normalized against all relevant documents.
- Suppose there are N documents in the collection, of which n are relevant. These n documents appear at ranks r1, r2, ..., rn.
- Normalized recall is the distance of the actual ranking from the ideal ranking.
37. Normalized recall measure
(Graph: recall vs. the ranks of the retrieved documents (1-200), showing the ideal, actual, and worst-case ranking curves.)
38. Normalized recall
- Normalized recall = (area between actual and worst) / (area between best and worst)
- Rnorm = 1 - (Σ ri - Σ i) / (n(N - n))
39. Example: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14
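A sketch of the computation for this example, using the Rnorm formula from slide 38:

```python
def normalized_recall(relevant_ranks, total_docs):
    """Rnorm = 1 - (sum of actual ranks - sum of ideal ranks) / (n * (N - n))."""
    n, N = len(relevant_ranks), total_docs
    ideal = sum(range(1, n + 1))          # ideal case: the n relevant docs occupy ranks 1..n
    return 1 - (sum(relevant_ranks) - ideal) / (n * (N - n))

# The example above: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14.
print(normalized_recall([1, 3, 5, 10, 14], total_docs=200))   # 1 - 18/975 ~= 0.9815
```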
40. Expected Search Length
- Average rank of the first relevant document
- Or, more generally, of the first n relevant docs
- Assume weak or no ordering
- Lq = j + i·n / (r + 1)
- n is the required number of relevant docs
- r is the number of relevant docs in the level being searched
- i is the number of non-relevant docs in that level
- j is the number of non-relevant docs examined before reaching that level
41. A query to find 1 relevant document would have a search length of 2, and a query to find 2 relevant documents would have a search length of 6 (2 for the 1st and 4 for the 2nd).
42. To get 2 relevant docs in 5 hits
- The possible search lengths are 3, 4, 5, or 6. To calculate the expected search length, we first count how many different ways two relevant documents could be distributed among 5 positions: C(5,2) = 10. Of these, 4 give a search length of 3, 3 give 4, 2 give 5, and 1 gives 6. Their corresponding probabilities are 4/10, 3/10, 2/10, and 1/10. The expected search length is then
- (4/10)·3 + (3/10)·4 + (2/10)·5 + (1/10)·6 = 4
- This method lets us average the results of many searches to estimate the expected search length.
- Most often only used for the first relevant doc found!!!
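A sketch that reproduces the expected value of 4 by enumerating all C(5,2) placements; it counts the search length as the rank at which the search stops, which yields the same expectation even though the slide tallies the individual outcomes a little differently:

```python
from itertools import combinations

def expected_search_length(hits, relevant, needed):
    """Expected rank of the `needed`-th relevant document, assuming the `relevant`
    docs are equally likely to occupy any positions among the `hits` retrieved docs."""
    placements = list(combinations(range(1, hits + 1), relevant))
    stop_ranks = [ranks[needed - 1] for ranks in placements]   # combinations() yields sorted tuples
    return sum(stop_ranks) / len(placements)

# The example above: 2 relevant docs somewhere among 5 hits, and we want both of them.
print(expected_search_length(hits=5, relevant=2, needed=2))   # 4.0, matching the slide
```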
43. R-Precision
- Measures precision (equivalently, recall) in the top R ranked documents
- Expecting R relevant documents
- Retrieve a set of R documents
- Measure the precision within those R
- Best case: precision = 1 (all R docs are relevant)
- Low R-precision -> where are the relevant docs?
44. R-Precision example
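The original example image isn't preserved, so here is a minimal stand-in using the ranked hit list and relevant set from slide 23:

```python
def r_precision(ranked_doc_ids, relevant_ids):
    """Precision over the top R retrieved documents, where R = number of relevant docs."""
    R = len(relevant_ids)
    return sum(1 for d in ranked_doc_ids[:R] if d in relevant_ids) / R

# Ranked hit list and relevant set from slide 23.
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
print(r_precision(ranking, relevant))   # 4 of the top R=10 are relevant -> 0.4
```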
45. Mean Average Precision (MAP)
- Average the precision measured after each relevant doc is retrieved
- For each query, calculate this average precision; MAP is the mean over queries
46. MAP example
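Since the example image isn't preserved, a minimal sketch of uninterpolated average precision and MAP over two hypothetical queries:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of the precision values measured at each relevant document retrieved."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_doc_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP: the mean of the per-query average precisions."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries, each a (ranked hit list, relevant set) pair.
runs = [(["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d9", "d7", "d5"], {"d5", "d6"})]
print(mean_average_precision(runs))   # (0.833 + 0.167) / 2 = 0.5
```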
47. Why Mean Average Precision?
- It is easy to trade recall against precision
- Adding related query terms improves recall
- But query expansion may decrease precision
- Restricting queries by part-of-speech, phrases, etc. improves precision
- But generally decreases recall
- Comparisons should reflect this balance
- Mean Average Precision does this
48. MAP example
49. Visualizing Mean Average Precision
(Chart: average precision (0.0-1.0) plotted per topic; MAP is the mean of these values.)
50. What MAP Hides
Adapted from a presentation by Ellen Voorhees at
the University of Maryland, March 29, 1999
51. Considerations: Single-Value Measures
- Expected search length
- Task dependent
- Mean precision at n documents
- There may not be n relevant documents
- R-Precision
- A specific fraction is rarely the user's goal
- Mean Average Precision
- Task may determine balance of precision or recall
52. Problems with testing
- Determining relevant docs
- Setting up test questions
- Comparing results
- Understanding relevance of the results
53. Thought du jour
- What evaluation metric would you use to compare
Google and Yahoo?