Title: IR Evaluation
1. IR Evaluation
- Evaluate performance of an IR system
  - Retrieval accuracy
  - Retrieval efficiency
  - User satisfaction
- May seem a secondary issue, but crucial to progress
- Rigorous evaluations initiated by Cleverdon in the 1960s
2. IR Evaluation Concepts
- A = retrieved documents
- R = relevant documents
- FA = false alarms (retrieved but not relevant)
- T = true positives (retrieved and relevant)
- M = missed (relevant but not retrieved)
3. Primary Evaluation Metrics
- Recall = T/R
- Precision = T/A (sketched in the code after this list)
- These definitions are not always practical
  - R may not be known
  - We are more interested in the precision distribution than in the final value
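A minimal sketch of these two definitions, assuming the retrieved set A and the relevant set R are available as Python sets of document IDs (the IDs below are the ones used in the example on the next slide):

```python
# Minimal sketch: precision and recall from a retrieved set A and a
# relevant set R, using the slides' notation (T = true positives).
def precision_recall(A: set, R: set) -> tuple:
    T = len(A & R)                       # relevant AND retrieved
    precision = T / len(A) if A else 0.0
    recall = T / len(R) if R else 0.0
    return precision, recall

# First five retrieved documents of the next slide's example:
A = {"D123", "D084", "D056", "D006", "D008"}   # retrieved
R = {"D003", "D056", "D129"}                   # relevant
print(precision_recall(A, R))                  # -> (0.2, 0.333...)
```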
4. Recall-Precision Correlation
- R = {D003, D056, D129}
- A (in rank order; computed rank by rank in the sketch after this list):
  - D123  P=0.00  R=0.00
  - D084  P=0.00  R=0.00
  - D056  P=0.33  R=0.33
  - D006  P=0.25  R=0.33
  - D008  P=0.20  R=0.33
  - D009  P=0.16  R=0.33
  - D511  P=0.14  R=0.33
  - D129  P=0.25  R=0.66
  - D187  P=0.22  R=0.66
  - ...
  - D003  P=0.20  R=1.00
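A small sketch that reproduces this kind of table by computing precision and recall after each rank; the document IDs are taken from the example above, and the printed values are rounded to two decimals, so they may differ from the slide's truncated figures in the last digit:

```python
# Sketch: precision and recall after each rank of a ranked result list.
def running_pr(ranked, relevant):
    hits = 0
    rows = []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        rows.append((doc, hits / i, hits / len(relevant)))
    return rows

relevant = {"D003", "D056", "D129"}
ranked = ["D123", "D084", "D056", "D006", "D008", "D009", "D511", "D129", "D187"]
for doc, p, r in running_pr(ranked, relevant):
    print(f"{doc}  P={p:.2f}  R={r:.2f}")
```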
5. Recall-Precision Curve
[Figure: precision plotted against recall for the example above, with the highest precision point at each recall level (33, 66, 100) marked.]
6. Interpolated R-P Curves
- Measure precision at selected recall levels
  - 10%, 20%, etc. reference intervals
- Interpolation approximates maximum system capability
  - modulo ordering within reference intervals
  - maximum precision within the interval
- Measure performance over a set of queries
  - average over queries at the reference intervals
7. Interpolating Precision
- Select reference intervals
  - usually 0, 10, 20, 30, ..., 90, 100 (% recall)
  - there are 11 reference points
- Calculate precision in each interval
  - int(P, I) = maximum actual P in [I, 1] (see the sketch after this list)
  - extend left from the first non-zero value
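A sketch of this interpolation rule, assuming the run is summarized as (recall, precision) pairs like those in the first example; the reference levels are the 11 points 0.0 through 1.0:

```python
# Sketch of 11-point interpolation: at each reference recall level I take
# the maximum actual precision observed at any recall >= I, per the
# int(P, I) rule above.
def interpolate_11pt(pr_points):
    """pr_points: list of (recall, precision) pairs from the ranked run."""
    levels = [i / 10 for i in range(11)]
    interp = []
    for level in levels:
        candidates = [p for r, p in pr_points if r >= level]
        interp.append((level, max(candidates) if candidates else 0.0))
    return interp

# (recall, precision) pairs from the first example in these slides:
points = [(0.00, 0.00), (0.00, 0.00), (0.33, 0.33), (0.33, 0.25),
          (0.33, 0.20), (0.33, 0.16), (0.33, 0.14), (0.66, 0.25),
          (0.66, 0.22), (1.00, 0.20)]
for level, p in interpolate_11pt(points):
    print(f"at {level:.2f}  {p:.2f}")
```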
8. Interpolated Recall-Precision
- D123                    R=0.00
- D084                    R=0.00
-        P=0.33 at R=0.00 (extended left from first non-zero value)
-        P=0.33 at R=0.10
-        P=0.33 at R=0.20
-        P=0.33 at R=0.30
- D056   P=0.33  R=0.33   (max for R ≥ 0.33)
- D006                    R=0.33
- D008                    R=0.33
- D009                    R=0.33
- D511                    R=0.33
-        P=0.25 at R=0.40
-        P=0.25 at R=0.50
-        P=0.25 at R=0.60
- D129   P=0.25  R=0.66   (max for R ≥ 0.66)
- D187                    R=0.66
-        P=0.30 at R=0.70
-        P=0.30 at R=0.80
-        P=0.30 at R=0.90
- D003   P=0.20  R=1.00   (max for R ≥ 1.00)
9. Interpolated Recall-Precision Curve
[Figure: interpolated precision over the 11 reference recall points (0, 10, 20, ..., 100), overlaid on the highest actual precision points from the example above.]
10. Recall-Precision Example 2
- R = {D003, D056, D129}
- A (in rank order):
  - D123  P=0.00  R=0.00
  - D084  P=0.00  R=0.00
  - D056  P=0.33  R=0.33
  - D006  P=0.25  R=0.33
  - D008  P=0.20  R=0.33
  - D009  P=0.16  R=0.33
  - D511  P=0.14  R=0.33
  - D129  P=0.25  R=0.66
  - D187  P=0.22  R=0.66
  - D003  P=0.30  R=1.00
11. Interpolated Recall-Precision Curve
[Figure: interpolated recall-precision curve for Example 2 over the 11 reference recall points, with the highest precision points marked.]
12. Interpolated R-P Average
- Retrieved: 1000
- Relevant:  125    ← more relevant documents
- Rel_ret:   77
- Interpolated Recall - Precision Averages:
  - at 0.00  1.0000   ← easier query?
  - at 0.10  0.9375
  - at 0.20  0.5102
  - at 0.30  0.3125
  - at 0.40  0.2167
  - at 0.50  0.1280
  - at 0.60  0.0787
  - at 0.70  0.0000
  - at 0.80  0.0000
  - at 0.90  0.0000
  - at 1.00  0.0000
- Average precision (non-interpolated) over all rel docs: 0.2647  (11-point average)
13. Interpolated R-P Average
- Total number of documents over all queries:
  - Retrieved: 1000
  - Relevant:  17     ← few relevant documents
  - Rel_ret:   2
- Interpolated Recall - Precision Averages:
  - at 0.00  0.0556   ← tough query?
  - at 0.10  0.0556
  - at 0.20  0.0000
  - at 0.30  0.0000
  - at 0.40  0.0000
  - at 0.50  0.0000
  - at 0.60  0.0000
  - at 0.70  0.0000
  - at 0.80  0.0000
  - at 0.90  0.0000
  - at 1.00  0.0000
- Average precision (non-interpolated) over all rel docs: 0.0052
14. Recall-Precision Averages
- Total number of documents over all queries (50):
  - Retrieved: 50000
  - Relevant:  4674
  - Rel_ret:   1998
- Interpolated Recall - Precision Averages (11 points of reference):
  - at 0.00  0.8140
  - at 0.10  0.5691
  - at 0.20  0.3672
  - at 0.30  0.2665
  - at 0.40  0.1607
  - at 0.50  0.0993
  - at 0.60  0.0528
  - at 0.70  0.0292
  - at 0.80  0.0039
  - at 0.90  0.0025
  - at 1.00  0.0000
- Average precision (non-interpolated) over all rel docs: 0.1898  (11-point average, sketched below)
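The "Average precision (non-interpolated) over all rel docs" figure in these listings can be sketched as follows: precision is recorded at the rank of each relevant document that is retrieved, the values are summed, and the sum is divided by the total number of relevant documents, so missed relevants count as zero. Averaging this value over a set of queries gives the mean average precision. The ranked list below is the one from Example 2:

```python
# Sketch: non-interpolated average precision over all relevant documents.
def average_precision(ranked, relevant):
    hits = 0
    precision_sum = 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i     # precision at this relevant doc's rank
    return precision_sum / len(relevant) if relevant else 0.0

relevant = {"D003", "D056", "D129"}
ranked = ["D123", "D084", "D056", "D006", "D008", "D009",
          "D511", "D129", "D187", "D003"]
print(round(average_precision(ranked, relevant), 4))   # -> 0.2944
```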
15. Recall-Precision Curve
[Figure: recall-precision curve.]
16. Precision at Document Counts
- Measures precision after every r documents retrieved (r = 5, 10, 15, etc.)
- D123  P=0.00  R=0.00
- D084  P=0.00  R=0.00
- D056  P=0.33  R=0.33   (R-Precision = 0.33)
- D006  P=0.25  R=0.33
- D008  P=0.20  R=0.33
- D009  P=0.16  R=0.33
- D511  P=0.14  R=0.33
- D129  P=0.25  R=0.66
- D187  P=0.22  R=0.66
- D038  P=0.20  R=0.66
- ...
- D003  P=0.20  R=1.00
17. Non-interpolated R-Precision
- Retrieved: 1000
- Relevant:  125
- Rel_ret:   77
- Precision (sketched after this list):
  - At    5 docs: 1.0000
  - At   10 docs: 1.0000
  - At   15 docs: 0.9333
  - At   20 docs: 0.8500
  - At   30 docs: 0.7000
  - At  100 docs: 0.3300
  - At  200 docs: 0.2350
  - At  500 docs: 0.1260
  - At 1000 docs: 0.0770
- R-Precision (precision after R = num_rel for a query docs retrieved):
  - Exact: 0.2960
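A sketch of the cutoff measures in this listing, assuming a ranked list and a relevant set as before: precision at fixed document counts, plus R-Precision as precision after exactly R = |relevant| documents have been retrieved. The ranked list is the one from Example 2, for which R-Precision comes out to 0.33 as noted on the previous slide:

```python
# Sketch: precision at fixed document cutoffs and R-Precision.
def precision_at(ranked, relevant, k):
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def r_precision(ranked, relevant):
    # precision after exactly R = |relevant| documents have been retrieved
    return precision_at(ranked, relevant, len(relevant))

relevant = {"D003", "D056", "D129"}
ranked = ["D123", "D084", "D056", "D006", "D008", "D009",
          "D511", "D129", "D187", "D003"]
for k in (5, 10):
    print(f"At {k} docs: {precision_at(ranked, relevant, k):.4f}")
print(f"R-Precision: {r_precision(ranked, relevant):.4f}")   # -> 0.3333
```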
18. Further Evaluation Metrics
- Fallout (also false-alarm rate)
  - Measures the system's ability to filter out non-relevant documents
  - FR = FA / (all non-relevant documents)
  - More stable than precision: no dependence on the size of R
  - Can be interpolated at the recall reference points
  - Since FR is usually small, log(FR) is plotted
- F-score
  - A single-value score
  - F = 2PR / (P + R) (sketched below)
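A small sketch of these two formulas; the fallout denominator is the number of non-relevant documents in the collection, and the counts and P/R values in the usage lines are illustrative (the P and R values come from the running example):

```python
# Sketch: fallout and F-score from the quantities used in these slides.
def fallout(FA, num_nonrelevant):
    # FA = non-relevant documents retrieved (false alarms)
    return FA / num_nonrelevant if num_nonrelevant else 0.0

def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(fallout(FA=4, num_nonrelevant=997), 4))   # -> 0.004
print(round(f_score(0.25, 0.33), 2))                  # -> 0.28
```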
19. Additional Metrics
- Miss Rate
  - Measures the probability that the system misses relevant documents
  - MR = (R - A) / R
  - Can be interpolated vs. recall or false-alarm rate
- ROC Curves
  - Receiver Operating Characteristic curve
  - Plots the cost of running a system as a function of, e.g., false alarms and misses
20. ROC Curve of Signal Detection
[Figure: ROC curve from signal detection.]
21. Utility
- Combines rank information with the value of items
  - For retrieved items: value to the user
  - For missed / false-alarm items: cost incurred
  - For fallout avoided: cost savings
- U = (R+ · A) + (N+ · B) + (R- · C) + (N- · D) (sketched below)
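A sketch of this utility measure, assuming R+/N+ count the relevant/non-relevant documents retrieved, R-/N- those not retrieved, and A through D are per-document value or cost weights; the weights and counts below are purely illustrative, not from the slides:

```python
# Sketch of the linear utility measure above. Positive weights reward,
# negative weights penalize; the defaults here are illustrative only.
def utility(r_plus, n_plus, r_minus, n_minus, A=3, B=-1, C=-2, D=0):
    return r_plus * A + n_plus * B + r_minus * C + n_minus * D

print(utility(r_plus=2, n_plus=8, r_minus=1, n_minus=89))   # -> 2*3 - 8 - 2 = -4
```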
22. Reference Collections
- Classical
  - CACM, ISI, Cranfield
  - Small (1-10 MBytes): everything works
  - Complete manual judgments
- TREC (Text Retrieval Conference)
  - Created in 1992
  - Now ~5 GBytes (2 million documents), 500 queries
  - Judgments obtained through the pooling method
- TDT (Topic Detection and Tracking)
  - TDT-2: ~60K broadcast stories, 100 topics
  - High-quality judgments obtained in multiple iterations
23. CACM-3204 Collection
- 3204 abstracts of Comm. ACM papers
- Documents are short, some just a title
- Short queries (12-word avg.)
- Avg. 15 relevant documents per query
- One of the standard pre-TREC collections
- Reputation: fairly easy
24. Performance Comparisons
25. Significance of Comparison
- 11-pt precision frequently used as a single-value metric
- Sparck Jones chart:
  - < 5% difference: not noticeable (noise)
  - < 10%: noticeable, but not significant
  - > 10%: significant (material)
  - Depends upon collection characteristics
- Chi-squared test (χ²) measures deviation from expectation
  - Measured at significance level 0.05 or 0.01 (2-tailed)
26. Significance Testing
- Assume all systems' performance is identical
- Observed precision values vs. expected
  - H_A: 38, 39, 64 at recall level 40
  - H_0: 47, 47, 47 is the null hypothesis
- χ² measures deviation of the sample from H_0
- χ² = Σ (v_o - v_e)² / v_e (worked through below)
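Worked through in code for the numbers above; the expected value 47 is the mean of the three observed values:

```python
# Sketch: chi-squared statistic for the example above -- observed precision
# values 38, 39, 64 at recall level 40, expected 47 under the null
# hypothesis that all systems perform identically.
observed = [38, 39, 64]
expected = [47, 47, 47]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))   # -> 9.23
```

With 2 degrees of freedom the 0.05 critical value is about 5.99, so a χ² of 9.23 would be judged a significant deviation from the null hypothesis.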
27. How to Set Up IR Experiments
- An IR system to be tested
- A document collection
  - size, type of material
  - training sub-collection (80%) / testing (20%)
- A set of queries
  - best if obtained from users, e.g., from the Web
- Relevance judgments
  - must be done objectively
  - use multiple assessors, compare their results
28. Pooling Method for Qrels
- Take the top K (e.g., 100) documents for each query from each system (sketched after this list)
- Remove duplicates, note the overlap
- Have the pool judged manually
- Assume documents not in the pool are not relevant
  - assume most relevant documents are in the pool
  - small difference for N = 100
  - no significant difference for N = 200
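A sketch of this procedure, assuming each system's run is a dict mapping query IDs to ranked document lists; the runs and the pool depth K below are illustrative:

```python
# Sketch of pooling: union of the top K documents from each system's run
# for one query; the pool is then judged manually, and documents outside
# the pool are assumed non-relevant.
def build_pool(runs, query_id, K=100):
    pool = set()
    for system, rankings in runs.items():
        pool.update(rankings[query_id][:K])   # duplicates removed by the set
    return pool

runs = {
    "system_A": {"q1": ["D056", "D123", "D084"]},
    "system_B": {"q1": ["D129", "D056", "D003"]},
}
print(sorted(build_pool(runs, "q1", K=2)))   # -> ['D056', 'D123', 'D129']
```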
29. Formal Evaluations
- TREC: IR
- TDT: filtering
- MUC/DUC: information extraction
  - measures recall and precision in filling data templates with info extracted from text
- SUMMAC: automated summarization
  - Can a summary substitute for the original?
  - Would a human select the same sentences?
  - How much time is needed to comprehend a summary?