Transcript and Presenter's Notes

Title: Evaluation Measures


1
Evaluation Measures
  • "You can't manage what you can't measure."
  • Peter Drucker

2
Evaluation?
  • Effectiveness?
  • For whom?
  • For what?
  • Efficiency?
  • Time?
  • Computational Cost?
  • Cost of missed information? Too much info?

3
Studies of Retrieval Effectiveness
The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957-1968 (hundreds of docs)
SMART System, Gerald Salton, Cornell University, 1964-1988 (thousands of docs)
TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992- (millions of docs, 100k to 7.5M per set, training Qs and test Qs, 150 each)
4
What can we measure?
  • ???
  • Algorithm (Efficiency)
  • Speed of algorithm
  • Update potential of indexing scheme
  • Size of storage required
  • Potential for distribution / parallelism
  • User Experience (Effectiveness)
  • How many of all relevant docs were found
  • How many were missed
  • How many errors in selection
  • How many need to be scanned before getting good ones

5
Measures based on relevance
                 relevant    not relevant
  retrieved      RR          RN
  not retrieved  NR          NN

  (RR = retrieved relevant, RN = retrieved not relevant, NR = not retrieved relevant, NN = not retrieved not relevant; the four cells partition the whole doc set)
6
Relevance
  • The system is always "correct" vis-à-vis the match!!
  • Who judges relevance?
  • Inter-rater reliability
  • Early evaluations
  • Done by panel of experts
  • 1-2000 abstracts of docs
  • TREC experiments
  • Done automatically
  • thousands of docs
  • Pooling + people

7
Defining the universe of relevant docs
  • Manual inspection
  • Manual exhaustive search
  • Pooling (TREC)
  • Relevant set is the union of multiple techniques
  • Sampling
  • Take a random sample
  • Inspect
  • Estimate from the sample for the whole set

8
Defining the relevant docs in a retrieved set
(hit list)
  • Panel of judges
  • Individual users
  • Automatic detection techniques
  • Vocabulary overlap with known relevant docs
  • metadata

9
Evaluation Measures
  • Recall and Precision and Fallout
  • Single valued measures
  • Macro and micro averages
  • F-values
  • R-precision
  • E-measure
  • Mean Average Precision (MAP)

10
General Approaches
  • Global Approach
  • Evaluation with respect to whole corpus
  • unranked
  • Local Approach
  • Evaluation with respect to retrieved set
  • Use rank

11
Recall and Precision
  • Recall
  • Proportion of relevant docs retrieved
  • Precision
  • Proportion of retrieved docs that are relevant

12
I. Global Measures based on relevance

                 relevant    not relevant
  retrieved      RR          RN
  not retrieved  NR          NN
13
Measures based on relevance
  recall    = RR / (RR + NR)
  precision = RR / (RR + RN)
  fallout   = RN / (RN + NN)

14
Formula (what do we have to work with?)
  • Rq = the number of docs in the whole data set relevant to query q
  • Rr = the number of docs in the hit set (retrieved docs) that are relevant
  • Rh = the number of docs in the hit set
  • Recall = Rr / Rq; Precision = Rr / Rh

15
  • Recall = Rr / Rq; Precision = Rr / Rh

(Venn diagram: the collection contains the relevant set Rq and the hit set Rh; their overlap is Rr)
16
  • Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
  • Rh = {d6, d8, d9, d84, d123}
  • Rr = Rq ∩ Rh = {d9, d123}
  • Recall = 2/10 = 0.2; Precision = 2/5 = 0.4 (see the sketch below)
  • What does that tell us?
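A minimal sketch of this computation in Python (the document ids and the set names Rq/Rh come from the example above; the code itself is not part of the original deck):

    # Recall and precision for the example above.
    Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}  # relevant docs
    Rh = ["d6", "d8", "d9", "d84", "d123"]                                     # retrieved docs (hit set)

    Rr = [d for d in Rh if d in Rq]     # retrieved AND relevant -> ['d9', 'd123']
    recall = len(Rr) / len(Rq)          # 2/10 = 0.2
    precision = len(Rr) / len(Rh)       # 2/5  = 0.4
    print(recall, precision)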

17
Recall-Precision Graphs for sets of queries in a
given system
(Graph: precision (0-100%) vs. recall (0-100%))
18
Typical recall-precision graph
(Graph: precision (0-1.0) vs. recall (0-1.0); one curve is labelled "narrow, specific query", the other "broad, general query", with precision falling as recall rises)
19
Recall-precision graphs to compare systems
The red system appears better than the black, but is the difference statistically significant?

(Graph: precision vs. recall curves for two systems, red and black)
20
II. Working with Ranked Hit Sets
  • Document set is reduced to retrieved docs
  • Assume user will start at top and stop when
    satisfied

21
Recall-precision after retrieval of n documents
   n   relevant   recall   precision
   1   yes        0.2      1.0
   2   yes        0.4      1.0
   3   no         0.4      0.67
   4   yes        0.6      0.75
   5   no         0.6      0.60
   6   yes        0.8      0.67
   7   no         0.8      0.57
   8   no         0.8      0.50
   9   no         0.8      0.44
  10   no         0.8      0.40
  11   no         0.8      0.36
  12   no         0.8      0.33
  13   yes        1.0      0.38
  14   no         1.0      0.36

SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant (see the sketch below).
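The recall and precision columns can be recomputed from the yes/no judgments alone; a sketch, assuming only the 14 judgments and the 5 total relevant documents stated above:

    # Recall/precision after each retrieved document (reproduces the table above).
    judgments = ["yes", "yes", "no", "yes", "no", "yes", "no",
                 "no", "no", "no", "no", "no", "yes", "no"]   # ranks 1..14
    total_relevant = 5

    found = 0
    for n, rel in enumerate(judgments, start=1):
        found += (rel == "yes")
        print(f"{n:2d}  {rel:3s}  recall={found / total_relevant:.1f}  "
              f"precision={found / n:.2f}")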
22
Recall-precision graph single query
(Graph: the (recall, precision) pairs from the table above, plotted as points labelled by rank n = 1 ... 14)
23
Consider Rank
  • Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
  • Rh = {d123, d84, d56, d6, d8, d9, d511, d129, d187, d25} (in rank order)
  • Rr = {d123, d56, d9, d25}
  • Recall = 4/10 = 0.4; Precision = 4/10 = 0.4
  • What happens as we go through the hits?

24
Standard Recall Levels
  • Plot precision at recall = 0%, 10%, ..., 100%

(Graph: P vs. R, both axes 0-100%; the interpolation sketch below shows one way to compute these points)
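One common way to do this is the 11-point interpolation rule: the precision at a standard recall level is taken as the maximum precision observed at any recall at or above that level. A sketch, assuming (recall, precision) pairs like those of the ranked list above:

    # Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0.
    def interpolated_precision(points):
        """points: (recall, precision) pairs for one query."""
        levels = [i / 10 for i in range(11)]
        return [max((p for r, p in points if r >= level), default=0.0)
                for level in levels]

    pts = [(0.2, 1.0), (0.4, 1.0), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]
    print(interpolated_precision(pts))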
25
Compare Results of two queries
  • Q1
  • Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
  • Rh = {d123, d84, d56, d8, d9, d511, d129, d25, d38, d3}
  • Q2
  • Rq = {d3, d5, d56, d89, d90, d94, d129, d206, d500, d502}
  • Rh = {d12, d84, d56, d5, d8, d3, d511, d129, d44, d89}

26
Comparison of Query results
27
P-R for Multiple Queries
  • For each recall level, average the precision across queries
  • Avg Prec at recall r: P(r) = (1 / Nq) * Σ Pi(r), summed over i = 1 ... Nq
  • Nq is the number of queries
  • Pi(r) is the precision at recall level r for the ith query

28
Comparison of two systems
29
Macro and Micro Averaging
  • Micro: average over every data point, pooled across all queries
  • Macro: average of the per-query averages
  • Example (see the sketch below)
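A sketch of the difference on made-up numbers (two hypothetical queries; the counts are illustrative, not from the slides):

    # Macro vs. micro averaging of precision over queries (illustrative counts).
    queries = [
        {"relevant_retrieved": 2, "retrieved": 5},    # query 1: precision 0.40
        {"relevant_retrieved": 8, "retrieved": 10},   # query 2: precision 0.80
    ]

    # Macro: average the per-query precision values (each query weighs equally).
    macro = sum(q["relevant_retrieved"] / q["retrieved"] for q in queries) / len(queries)

    # Micro: pool the raw counts across queries, then compute a single precision.
    micro = (sum(q["relevant_retrieved"] for q in queries) /
             sum(q["retrieved"] for q in queries))

    print(macro, micro)   # 0.6 vs. 0.666...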

30
Statistical tests
Suppose that a search is carried out on systems i and j. System i is superior to system j if recall(i) > recall(j) and precision(i) > precision(j).
31
Statistical tests
The t-test is the standard statistical test for comparing two tables of numbers, but it depends on assumptions of independence and normal distributions that do not apply to this data. The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples. The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes, makes no assumption of normality, but assumes independent samples. (A sketch of these tests follows.)
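A sketch of how these paired tests are typically run with SciPy; the per-query scores below are hypothetical, and the sign test is built here from a binomial test on the signs of the differences (binomtest requires a recent SciPy):

    # Paired significance tests on per-query scores of systems i and j.
    from scipy.stats import ttest_rel, wilcoxon, binomtest

    system_i = [0.35, 0.50, 0.42, 0.61, 0.30, 0.55, 0.47, 0.66]   # hypothetical scores
    system_j = [0.30, 0.48, 0.45, 0.52, 0.28, 0.50, 0.40, 0.60]

    print(ttest_rel(system_i, system_j))    # paired t-test (assumes normality)
    print(wilcoxon(system_i, system_j))     # Wilcoxon signed-rank test

    # Sign test: how often does system i beat system j (ties dropped)?
    wins = sum(a > b for a, b in zip(system_i, system_j))
    decided = sum(a != b for a, b in zip(system_i, system_j))
    print(binomtest(wins, decided, 0.5))    # two-sided sign test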
32
II Single Value Measures
  • E-Measure / F1 measure
  • Normalized Recall
  • Expected Search Length
  • R-Precision
  • Mean Average Precision (MAP)

33
Single-Valued Measures
  • Expected search length
  • Average rank of the first relevant document
  • Mean precision at a fixed number of documents
  • Precision at 10 docs is often used for Web search
  • R-precision at a fixed recall level
  • Adjusts for the total number of relevant docs
  • Mean breakeven point
  • Value at which precision = recall
  • Mean Average Precision (MAP)
  • Interpolated: avg precision at recall = 0.0, 0.1, ..., 1.0
  • Uninterpolated: avg precision at each relevant doc

34
Single Set-Based Measures
  • Balanced F-measure
  • Harmonic mean of recall and precision: F(j) = 2 p(j) r(j) / (p(j) + r(j))
  • r(j) is recall for the jth doc retrieved
  • p(j) is precision for the jth doc retrieved
  • What if no relevant documents exist? (see the sketch below)
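A minimal sketch of the balanced F computation; returning 0 when both precision and recall are 0 is a common convention for the degenerate case raised in the last bullet, not something the slides specify:

    # Balanced F-measure: harmonic mean of precision and recall.
    def f1(precision, recall):
        if precision + recall == 0:          # e.g. nothing relevant was retrieved
            return 0.0                       # conventional value; strictly undefined
        return 2 * precision * recall / (precision + recall)

    print(f1(0.75, 0.6))   # 0.666...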

35
E Measure
  • Complement of the F-measure (E = 1 - F) for beta = 1
  • Can increase the importance of recall or precision
  • Increase beta -> ??? (a sketch with an explicit beta follows)
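A sketch of the weighted form, taking E = 1 - F_beta with the usual beta-squared weighting (this parameterisation, where beta > 1 emphasises recall, is the standard van Rijsbergen form and is assumed here):

    # E-measure as the complement of the weighted F-measure.
    def f_beta(precision, recall, beta=1.0):
        if precision == 0 and recall == 0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    def e_measure(precision, recall, beta=1.0):
        return 1.0 - f_beta(precision, recall, beta)

    print(e_measure(0.75, 0.6))             # beta = 1: complement of F1
    print(e_measure(0.75, 0.6, beta=2.0))   # beta > 1: recall weighted more heavily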

36
Normalized Recall
  • Recall is normalized against all relevant documents.
  • Suppose there are N documents in the collection, of which n are relevant. These n documents are retrieved at ranks r1, r2, ..., rn.
  • Normalized recall is the distance from the ideal result.

37
Normalized recall measure
(Graph: recall vs. the ranks of the retrieved documents, 1 to 200, showing three step curves: ideal ranks, actual ranks, and worst ranks)
38
Normalized recall
Normalized recall = (area between actual and worst) / (area between best and worst)

  Rnorm = 1 - Σ(ri - i) / (n(N - n))
39
Example: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14, so Rnorm = 1 - 18/975 ≈ 0.98 (see the sketch below)
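A sketch that plugs the example's numbers into the formula above:

    # Normalized recall for N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14.
    N, ranks = 200, [1, 3, 5, 10, 14]
    n = len(ranks)
    ideal = range(1, n + 1)                            # best-case ranks 1..n
    r_norm = 1 - (sum(ranks) - sum(ideal)) / (n * (N - n))
    print(r_norm)                                      # 1 - 18/975 ≈ 0.9815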
40
Expected Search Length
  • Average rank of the first relevant document
  • Average rank of the first n relevant docs
  • Assume weak or no ordering
  • Lq = j + i·n / (r + 1), averaged over the possible placements of the relevant docs among the non-relevant ones
  • n is the number of relevant docs required
  • r is the number of relevant docs in the weakly ordered level being searched
  • i is the number of non-relevant docs in that level
  • j is the number of non-relevant docs examined before reaching it
  • ??

41
A query to find 1 relevant document would have a search length of 2, and a query to find 2 relevant documents would have a search length of 6 (2 for the 1st and 4 for the 2nd).
42
To get 2 relevant docs in 5 hits
  • The possible search lengths are 3, 4, 5 or 6. To calculate the expected search length we first calculate how many different ways two relevant documents could be distributed among 5 hits: C(5,2) = 10. Of these, 4 would give a search length of 3, 3 would give 4, 2 would give 5 and 1 would give 6. Their corresponding probabilities are 4/10, 3/10, 2/10 and 1/10. The expected search length is then
  • (4/10)·3 + (3/10)·4 + (2/10)·5 + (1/10)·6 = 4
  • This method lets us average the results of many searches, giving a measure of the expected search length (see the sketch below).
  • Most often only used for the first relevant doc found!!!
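One way to check the worked example is to enumerate all C(5,2) = 10 ways of placing the two relevant documents among the five hits and average how many documents must be examined to reach the second relevant one; this sketch reproduces the expected value of 4:

    # Expected number of documents examined to find 2 relevant docs among 5 hits.
    from itertools import combinations
    from statistics import mean

    ranks = range(1, 6)                                          # hit positions 1..5
    examined = [max(combo) for combo in combinations(ranks, 2)]  # position of 2nd relevant
    print(mean(examined))                                        # 4.0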

43
R-Precision
  • Measures recall in top ranked documents
  • Expecting R relevant documents
  • Retrieve a set of R documents
  • Measure the precision in R
  • Best case: precision = 1 (all R docs are relevant)
  • A low R-precision -> where are the relevant docs? (see the sketch below)
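A minimal sketch, reusing the ranked hit list and relevant set from the earlier example:

    # R-precision: precision within the top R ranked documents, R = |relevant set|.
    def r_precision(ranked, relevant):
        R = len(relevant)
        return sum(1 for d in ranked[:R] if d in relevant) / R

    ranked = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
    relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    print(r_precision(ranked, relevant))   # 4 of the top 10 are relevant -> 0.4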

44
R-Precision example
45
Mean Average Precision (MAP)
  • Average the precision values obtained after each relevant doc
  • For each query calculate this average precision, then take the mean over all queries (see the sketch below)
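A sketch of the uninterpolated version: precision is recorded at the rank of each relevant document, averaged per query (dividing by the total number of relevant docs, so unretrieved ones count as 0), then averaged over queries. The judgment lists below are illustrative:

    # Mean Average Precision over a set of queries (uninterpolated).
    def average_precision(judgments, total_relevant):
        """judgments: booleans in rank order, one per retrieved document."""
        found, precision_sum = 0, 0.0
        for rank, rel in enumerate(judgments, start=1):
            if rel:
                found += 1
                precision_sum += found / rank       # precision at this relevant doc
        return precision_sum / total_relevant if total_relevant else 0.0

    runs = [                                         # (judgments, #relevant) per query
        ([True, False, True, False, False], 2),
        ([False, True, True, False, True], 4),
    ]
    map_score = sum(average_precision(j, r) for j, r in runs) / len(runs)
    print(map_score)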

46
MAP example
47
Why Mean Average Precision?
  • It is easy to trade between recall and precision
  • Adding related query terms improves recall
  • But query expansion may decrease precision
  • Restricting queries by part-of-speech, phrases, etc. improves precision
  • But generally decreases recall
  • Comparisons should reflect this balance
  • Mean Average Precision does this

48
MAP example
49
Visualizing Mean Average Precision
(Bar chart: average precision, 0.0 to 1.0, per topic)
50
What MAP Hides
Adapted from a presentation by Ellen Voorhees at
the University of Maryland, March 29, 1999
51
Considerations: Single-Value Measures
  • Expected search length
  • Task dependent
  • Mean precision at n documents
  • There may not be n relevant documents
  • R-Precision
  • A specific fraction is rarely the user's goal
  • Mean Average Precision
  • Task may determine balance of precision or recall

52
Problems with testing
  • Determining relevant docs
  • Setting up test questions
  • Comparing results
  • Understanding relevance of the results

53
Thought du jour
  • What evaluation metric would you use to compare
    Google and Yahoo?