Title: IR Evaluation
1. IR Evaluation
- Evaluate performance of an IR system
  - Retrieval accuracy
  - Retrieval efficiency
  - User satisfaction
- May seem a secondary issue, but crucial to progress
- Rigorous evaluations initiated by Cleverdon in the 1960s
2. IR Evaluation Concepts
- A = retrieved documents
- R = relevant documents
- FA = false alarms (retrieved but not relevant)
- T = true positives (retrieved and relevant)
- M = missed (relevant but not retrieved)
3. Primary Evaluation Metrics
- Recall = T/R
- Precision = T/A (sketched in the code after this list)
- These definitions are not always practical
  - R may not be known
  - We are more interested in the precision distribution than in the final value
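A minimal sketch of these two definitions, assuming the retrieved set A and the relevant set R are available as Python sets of document IDs (the IDs below are the ones used in the example on the next slide):

```python
# Minimal sketch: precision and recall from a retrieved set A and a
# relevant set R, using the slides' notation (T = true positives).
def precision_recall(A: set, R: set) -> tuple:
    T = len(A & R)                       # relevant AND retrieved
    precision = T / len(A) if A else 0.0
    recall = T / len(R) if R else 0.0
    return precision, recall

# First five retrieved documents of the next slide's example:
A = {"D123", "D084", "D056", "D006", "D008"}   # retrieved
R = {"D003", "D056", "D129"}                   # relevant
print(precision_recall(A, R))                  # -> (0.2, 0.333...)
```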
4. Recall-Precision Correlation
- R = {D003, D056, D129}
- A (in rank order; computed rank by rank in the sketch after this list):
  - D123  P=0.00  R=0.00
  - D084  P=0.00  R=0.00
  - D056  P=0.33  R=0.33
  - D006  P=0.25  R=0.33
  - D008  P=0.20  R=0.33
  - D009  P=0.16  R=0.33
  - D511  P=0.14  R=0.33
  - D129  P=0.25  R=0.66
  - D187  P=0.22  R=0.66
  - ...
  - D003  P=0.20  R=1.00
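A small sketch that reproduces this kind of table by computing precision and recall after each rank; the document IDs are taken from the example above, and the printed values are rounded to two decimals, so they may differ from the slide's truncated figures in the last digit:

```python
# Sketch: precision and recall after each rank of a ranked result list.
def running_pr(ranked, relevant):
    hits = 0
    rows = []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        rows.append((doc, hits / i, hits / len(relevant)))
    return rows

relevant = {"D003", "D056", "D129"}
ranked = ["D123", "D084", "D056", "D006", "D008", "D009", "D511", "D129", "D187"]
for doc, p, r in running_pr(ranked, relevant):
    print(f"{doc}  P={p:.2f}  R={r:.2f}")
```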
5. Recall-Precision Curve
[Figure: precision plotted against recall for the example above, with the highest precision point at each recall level (33, 66, 100) marked.]
6. Interpolated R-P Curves
- Measure precision at selected recall levels
  - 10%, 20%, etc. reference intervals
- Interpolation approximates maximum system capability
  - modulo ordering within reference intervals
  - maximum precision within the interval
- Measure performance over a set of queries
  - average over queries at the reference intervals
7. Interpolating Precision
- Select reference intervals
  - usually 0, 10, 20, 30, ..., 90, 100 (% recall)
  - there are 11 reference points
- Calculate precision in each interval
  - int(P, I) = maximum actual P in [I, 1] (see the sketch after this list)
  - extend left from the first non-zero value
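A sketch of this interpolation rule, assuming the run is summarized as (recall, precision) pairs like those in the first example; the reference levels are the 11 points 0.0 through 1.0:

```python
# Sketch of 11-point interpolation: at each reference recall level I take
# the maximum actual precision observed at any recall >= I, per the
# int(P, I) rule above.
def interpolate_11pt(pr_points):
    """pr_points: list of (recall, precision) pairs from the ranked run."""
    levels = [i / 10 for i in range(11)]
    interp = []
    for level in levels:
        candidates = [p for r, p in pr_points if r >= level]
        interp.append((level, max(candidates) if candidates else 0.0))
    return interp

# (recall, precision) pairs from the first example in these slides:
points = [(0.00, 0.00), (0.00, 0.00), (0.33, 0.33), (0.33, 0.25),
          (0.33, 0.20), (0.33, 0.16), (0.33, 0.14), (0.66, 0.25),
          (0.66, 0.22), (1.00, 0.20)]
for level, p in interpolate_11pt(points):
    print(f"at {level:.2f}  {p:.2f}")
```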
8. Interpolated Recall-Precision
- D123                    R=0.00
- D084                    R=0.00
-        P=0.33 at R=0.00 (extended left from first non-zero value)
-        P=0.33 at R=0.10
-        P=0.33 at R=0.20
-        P=0.33 at R=0.30
- D056   P=0.33  R=0.33   (max for R ≥ 0.33)
- D006                    R=0.33
- D008                    R=0.33
- D009                    R=0.33
- D511                    R=0.33
-        P=0.25 at R=0.40
-        P=0.25 at R=0.50
-        P=0.25 at R=0.60
- D129   P=0.25  R=0.66   (max for R ≥ 0.66)
- D187                    R=0.66
-        P=0.30 at R=0.70
-        P=0.30 at R=0.80
-        P=0.30 at R=0.90
- D003   P=0.20  R=1.00   (max for R ≥ 1.00)
9. Interpolated Recall-Precision Curve
[Figure: interpolated precision over the 11 reference recall points (0, 10, 20, ..., 100), overlaid on the highest actual precision points from the example above.]
10. Recall-Precision Example 2
- R = {D003, D056, D129}
- A (in rank order):
  - D123  P=0.00  R=0.00
  - D084  P=0.00  R=0.00
  - D056  P=0.33  R=0.33
  - D006  P=0.25  R=0.33
  - D008  P=0.20  R=0.33
  - D009  P=0.16  R=0.33
  - D511  P=0.14  R=0.33
  - D129  P=0.25  R=0.66
  - D187  P=0.22  R=0.66
  - D003  P=0.30  R=1.00
11. Interpolated Recall-Precision Curve
[Figure: interpolated recall-precision curve for Example 2 over the 11 reference recall points, with the highest precision points marked.]
12. Interpolated R-P Average
- Retrieved: 1000
- Relevant:  125    ← more relevant documents
- Rel_ret:   77
- Interpolated Recall - Precision Averages:
  - at 0.00  1.0000   ← easier query?
  - at 0.10  0.9375
  - at 0.20  0.5102
  - at 0.30  0.3125
  - at 0.40  0.2167
  - at 0.50  0.1280
  - at 0.60  0.0787
  - at 0.70  0.0000
  - at 0.80  0.0000
  - at 0.90  0.0000
  - at 1.00  0.0000
- Average precision (non-interpolated) over all rel docs: 0.2647  (11-point average)
13. Interpolated R-P Average
- Total number of documents over all queries:
  - Retrieved: 1000
  - Relevant:  17     ← few relevant documents
  - Rel_ret:   2
- Interpolated Recall - Precision Averages:
  - at 0.00  0.0556   ← tough query?
  - at 0.10  0.0556
  - at 0.20  0.0000
  - at 0.30  0.0000
  - at 0.40  0.0000
  - at 0.50  0.0000
  - at 0.60  0.0000
  - at 0.70  0.0000
  - at 0.80  0.0000
  - at 0.90  0.0000
  - at 1.00  0.0000
- Average precision (non-interpolated) over all rel docs: 0.0052
14. Recall-Precision Averages
- Total number of documents over all queries (50):
  - Retrieved: 50000
  - Relevant:  4674
  - Rel_ret:   1998
- Interpolated Recall - Precision Averages (11 points of reference):
  - at 0.00  0.8140
  - at 0.10  0.5691
  - at 0.20  0.3672
  - at 0.30  0.2665
  - at 0.40  0.1607
  - at 0.50  0.0993
  - at 0.60  0.0528
  - at 0.70  0.0292
  - at 0.80  0.0039
  - at 0.90  0.0025
  - at 1.00  0.0000
- Average precision (non-interpolated) over all rel docs: 0.1898  (11-point average, sketched below)
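The "Average precision (non-interpolated) over all rel docs" figure in these listings can be sketched as follows: precision is recorded at the rank of each relevant document that is retrieved, the values are summed, and the sum is divided by the total number of relevant documents, so missed relevants count as zero. Averaging this value over a set of queries gives the mean average precision. The ranked list below is the one from Example 2:

```python
# Sketch: non-interpolated average precision over all relevant documents.
def average_precision(ranked, relevant):
    hits = 0
    precision_sum = 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i     # precision at this relevant doc's rank
    return precision_sum / len(relevant) if relevant else 0.0

relevant = {"D003", "D056", "D129"}
ranked = ["D123", "D084", "D056", "D006", "D008", "D009",
          "D511", "D129", "D187", "D003"]
print(round(average_precision(ranked, relevant), 4))   # -> 0.2944
```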
15. Recall-Precision Curve
[Figure: recall-precision curve.]
16. Precision at Document Counts
- Measures precision after every r documents retrieved (r = 5, 10, 15, etc.)
- D123  P=0.00  R=0.00
- D084  P=0.00  R=0.00
- D056  P=0.33  R=0.33   (R-Precision = 0.33)
- D006  P=0.25  R=0.33
- D008  P=0.20  R=0.33
- D009  P=0.16  R=0.33
- D511  P=0.14  R=0.33
- D129  P=0.25  R=0.66
- D187  P=0.22  R=0.66
- D038  P=0.20  R=0.66
- ...
- D003  P=0.20  R=1.00
17. Non-interpolated R-Precision
- Retrieved: 1000
- Relevant:  125
- Rel_ret:   77
- Precision (sketched after this list):
  - At    5 docs: 1.0000
  - At   10 docs: 1.0000
  - At   15 docs: 0.9333
  - At   20 docs: 0.8500
  - At   30 docs: 0.7000
  - At  100 docs: 0.3300
  - At  200 docs: 0.2350
  - At  500 docs: 0.1260
  - At 1000 docs: 0.0770
- R-Precision (precision after R = num_rel for a query docs retrieved):
  - Exact: 0.2960
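A sketch of the cutoff measures in this listing, assuming a ranked list and a relevant set as before: precision at fixed document counts, plus R-Precision as precision after exactly R = |relevant| documents have been retrieved. The ranked list is the one from Example 2, for which R-Precision comes out to 0.33 as noted on the previous slide:

```python
# Sketch: precision at fixed document cutoffs and R-Precision.
def precision_at(ranked, relevant, k):
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def r_precision(ranked, relevant):
    # precision after exactly R = |relevant| documents have been retrieved
    return precision_at(ranked, relevant, len(relevant))

relevant = {"D003", "D056", "D129"}
ranked = ["D123", "D084", "D056", "D006", "D008", "D009",
          "D511", "D129", "D187", "D003"]
for k in (5, 10):
    print(f"At {k} docs: {precision_at(ranked, relevant, k):.4f}")
print(f"R-Precision: {r_precision(ranked, relevant):.4f}")   # -> 0.3333
```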
18. Further Evaluation Metrics
- Fallout (also false-alarm rate)
  - Measures the system's ability to filter out non-relevant documents
  - FR = FA / (all non-relevant documents)
  - More stable than precision: no dependence on the size of R
  - Can be interpolated at the recall reference points
  - Since FR is usually small, log(FR) is plotted
- F-score
  - A single-value score
  - F = 2PR / (P + R) (sketched below)
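A small sketch of these two formulas; the fallout denominator is the number of non-relevant documents in the collection, and the counts and P/R values in the usage lines are illustrative (the P and R values come from the running example):

```python
# Sketch: fallout and F-score from the quantities used in these slides.
def fallout(FA, num_nonrelevant):
    # FA = non-relevant documents retrieved (false alarms)
    return FA / num_nonrelevant if num_nonrelevant else 0.0

def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(fallout(FA=4, num_nonrelevant=997), 4))   # -> 0.004
print(round(f_score(0.25, 0.33), 2))                  # -> 0.28
```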
19. Additional Metrics
- Miss Rate
  - Measures the probability that the system misses relevant documents
  - MR = (R - A) / R
  - Can be interpolated vs. recall or false-alarm rate
- ROC Curves
  - Receiver Operating Characteristic curve
  - Plots the cost of running a system as a function of, e.g., false alarms and misses
20. ROC Curve of Signal Detection
[Figure: ROC curve from signal detection.]
21. Utility
- Combines rank information with the value of items
  - For retrieved items: value to the user
  - For missed / false-alarm items: cost incurred
  - For fallout avoided: cost savings
- U = (R+ · A) + (N+ · B) + (R- · C) + (N- · D) (sketched below)
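A sketch of this utility measure, assuming R+/N+ count the relevant/non-relevant documents retrieved, R-/N- those not retrieved, and A through D are per-document value or cost weights; the weights and counts below are purely illustrative, not from the slides:

```python
# Sketch of the linear utility measure above. Positive weights reward,
# negative weights penalize; the defaults here are illustrative only.
def utility(r_plus, n_plus, r_minus, n_minus, A=3, B=-1, C=-2, D=0):
    return r_plus * A + n_plus * B + r_minus * C + n_minus * D

print(utility(r_plus=2, n_plus=8, r_minus=1, n_minus=89))   # -> 2*3 - 8 - 2 = -4
```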
22. Reference Collections
- Classical
  - CACM, ISI, Cranfield
  - Small (1-10 MBytes): everything works
  - Complete manual judgments
- TREC (Text Retrieval Conference)
  - Created in 1992
  - Now ~5 GBytes (2 million documents), 500 queries
  - Judgments obtained through the pooling method
- TDT (Topic Detection and Tracking)
  - TDT-2: ~60K broadcast stories, 100 topics
  - High-quality judgments obtained in multiple iterations
23. CACM-3204 Collection
- 3204 abstracts of Comm. ACM papers
- Documents are short, some just a title
- Short queries (12-word avg.)
- Avg. 15 relevant documents per query
- One of the standard pre-TREC collections
- Reputation: fairly easy
24. Performance Comparisons
25. Significance of Comparison
- 11-pt precision frequently used as a single-value metric
- Sparck Jones chart:
  - < 5% difference: not noticeable (noise)
  - < 10%: noticeable, but not significant
  - > 10%: significant (material)
  - Depends upon collection characteristics
- Chi-squared test (χ²) measures deviation from expectation
  - Measured at significance level 0.05 or 0.01 (2-tailed)
26. Significance Testing
- Assume all systems' performance is identical
- Observed precision values vs. expected
  - H_A: 38, 39, 64 at recall level 40
  - H_0: 47, 47, 47 is the null hypothesis
- χ² measures deviation of the sample from H_0
- χ² = Σ (v_o - v_e)² / v_e (worked through below)
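Worked through in code for the numbers above; the expected value 47 is the mean of the three observed values:

```python
# Sketch: chi-squared statistic for the example above -- observed precision
# values 38, 39, 64 at recall level 40, expected 47 under the null
# hypothesis that all systems perform identically.
observed = [38, 39, 64]
expected = [47, 47, 47]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))   # -> 9.23
```

With 2 degrees of freedom the 0.05 critical value is about 5.99, so a χ² of 9.23 would be judged a significant deviation from the null hypothesis.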
27. How to Set Up IR Experiments
- An IR system to be tested
- A document collection
  - size, type of material
  - training sub-collection (80%) / testing (20%)
- A set of queries
  - best if obtained from users, e.g., from the Web
- Relevance judgments
  - must be done objectively
  - use multiple assessors, compare their results
28. Pooling Method for Qrels
- Take the top K (e.g., 100) documents for each query from each system (sketched after this list)
- Remove duplicates, note the overlap
- Have the pool judged manually
- Assume documents not in the pool are not relevant
  - assume most relevant documents are in the pool
  - small difference for N = 100
  - no significant difference for N = 200
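A sketch of this procedure, assuming each system's run is a dict mapping query IDs to ranked document lists; the runs and the pool depth K below are illustrative:

```python
# Sketch of pooling: union of the top K documents from each system's run
# for one query; the pool is then judged manually, and documents outside
# the pool are assumed non-relevant.
def build_pool(runs, query_id, K=100):
    pool = set()
    for system, rankings in runs.items():
        pool.update(rankings[query_id][:K])   # duplicates removed by the set
    return pool

runs = {
    "system_A": {"q1": ["D056", "D123", "D084"]},
    "system_B": {"q1": ["D129", "D056", "D003"]},
}
print(sorted(build_pool(runs, "q1", K=2)))   # -> ['D056', 'D123', 'D129']
```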
29. Formal Evaluations
- TREC: IR
- TDT: filtering
- MUC/DUC: information extraction
  - measures recall and precision in filling data templates with info extracted from text
- SUMMAC: automated summarization
  - Can a summary substitute for the original?
  - Would a human select the same sentences?
  - How much time is needed to comprehend a summary?