Title: Automatically Evaluating Answers to Definition Questions

1
Automatically Evaluating Answers to Definition
Questions

Jimmy Lin and Dina Demner-Fushman
University of Maryland
HLT/EMNLP 2005
Saturday, October 8, 2005

2
Roadmap

Background
Current state of question answering
How definition questions are evaluated
Lack of automatic evaluation metrics until now
Introducing POURPRE
How well does POURPRE correlate with official
scores?
Discussion of evaluation methodology
Conclusions

3
Brief History of QA

Large-scale formal evaluations at TREC
Question Answering track, started in TREC 8
(1999)
Factoid questions
Automatic evaluation method: match answer
patterns
More complex questions
Automatic evaluation method: none available!

What's the capital of Maryland?
How many floors does the Empire State Building have?
Who won the World Series in 1986?
Who is Colin Powell?
The analyst is concerned with arms trafficking to Colombian insurgents.
Specifically, the analyst would like to know of
the different routes used for arms entering
Colombia and the entities involved.
4
Definition Questions

Previously evaluated at
TREC 2003
TREC 2004 and 2005 (as "other" questions)
Tell me about
Find interesting facts about a topic
(organization, person, term, event, etc.)
System output is a collection of passages
No limit on length of system output

Who is Aaron Copland?
What is a golden parachute?
What are fractals?
5
Evaluation Flow
1. NIST collects runs from every system
2. Assessor reads all runs and manually creates
an answer key
3. Assessor uses answer key to manually score
all runs
(Diagram: run1 through run4, each assigned score1 through score4)
6
Creating the Answer Key

Performed once per set of questions
After reading all system responses, assessor
creates a list of nuggets
Nugget: basic unit of information (a fact)
Categorized as "vital" or "okay"

What is the Cassini space probe?
1. 32 kilograms plutonium powered (vital)
2. seven year journey (vital)
3. Titan 4-B Rocket (vital)
4. send Huygens to probe atmosphere of Titan, Saturn's largest moon (vital)
5. parachute instruments to planet's surface (okay)
6. oceans of ethane or other hydrocarbons, frozen methane or water (okay)
7. carries 12 packages scientific instruments and a probe (vital)
8. NASA primary responsible for Cassini orbiter (okay)
7
Scoring Each Run

Semantically match nuggets in answer key with
system output
Compute final F-score
Recall component: fraction of vital nuggets
retrieved
Precision component: length allowance for vital
and okay nuggets retrieved
Recall favored over precision (β = 3)

NYT19990816.0266: Early in the Saturn visit,
Cassini is to send a probe named Huygens into the
smog-shrouded atmosphere of Titan, the planet's
largest moon, and parachute instruments to its
hidden surface to see if it holds oceans of
ethane or other hydrocarbons over frozen layers
of methane or water.
Nuggets found: 4, 5, 6
8
F-Score Details
Note: recall is only a function of vital nuggets
Getting okay nuggets doesn't increase a run's
score directly; it only gives more length allowance
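
The formulas on this slide did not survive in the transcript. As a minimal sketch of the scoring described on the previous slide, the function below computes a nugget F-score from counts of matched nuggets. The function name, its parameters, and the allowance of 100 non-whitespace characters per matched nugget are assumptions added for illustration, not values stated on the slides.

```python
def nugget_f_score(vital_returned, okay_returned, total_vital, length,
                   beta=3.0, allowance_per_nugget=100):
    """Nugget-based F-score for one system response (illustrative sketch).

    vital_returned / okay_returned: matched vital and okay nuggets.
    total_vital: number of vital nuggets in the answer key.
    length: non-whitespace characters in the system response.
    allowance_per_nugget: assumed value, not taken from the slides.
    """
    if total_vital <= 0:
        raise ValueError("answer key needs at least one vital nugget")

    # Recall depends on vital nuggets only.
    recall = vital_returned / total_vital

    # Every matched nugget (vital or okay) earns length allowance.
    allowance = allowance_per_nugget * (vital_returned + okay_returned)
    if length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length

    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denom


# Same response length and vital matches; one run also matched an okay nugget.
with_okay = nugget_f_score(2, 1, 4, 250)
without_okay = nugget_f_score(2, 0, 4, 250)
print(with_okay, without_okay)
```

Under these assumptions the run that also matched an okay nugget scores higher only because its extra allowance removes the length penalty; its recall is unchanged.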
9
Semantic Matching

Answer nuggets must be matched against system
output manually
Until now, there has been no method for automatic
evaluation
Researchers must wait for the yearly TREC cycle
No hill to climb!

Who is Al Sharpton?
Nugget: Harlem civil rights leader
System answer: New York civil rights activist
Who is Ari Fleischer?
Nugget: Elizabeth Dole's Press Secretary
System answer: Ari Fleischer, spokesman for Elizabeth Dole
What is the medical condition shingles?
Nugget: tropical [sic] capsaicin relieves pain of shingles
System answer: Epilepsy drug relieves pain from... shingles
10
Automatic Nugget Matching

Here's a simple idea: can we replace nugget
matching with substring co-occurrence?
Works for machine translation (BLEU/NIST)
Works for summarization (ROUGE)
Definition question answering task is closer to
summarization than MT
Most systems are extractive: no need to measure
fluency
Unigram co-occurrences should work well (from
ROUGE)
Here's a catchy name: POURPRE
Following a rainbow of colors: ROUGE, BLEU,
ORANGE, etc.

11
POURPRE

Nugget match (or "Is this nugget present in the
system response?")
Previously: a manual binary decision
Replace with unigram overlap between system
output and best-matching nugget from answer key
Calculation of F-score proceeds as before
Scoring variations
Term counts: fraction of nugget terms matched
idf: sum of idf of matched terms / sum of idf of
all terms in nugget
Other variations
Macroaveraging vs. microaveraging
Stemming vs. no stemming
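
To make the term-count variant concrete, here is a hedged sketch of soft nugget matching followed by the usual F-score. It is not the authors' implementation: the tokenizer, the names `match_score` and `pourpre_sketch`, the use of summed match scores for both recall and length allowance, and the 100-character allowance are all assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric unigrams (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_score(nugget, passages):
    """Best fraction of nugget terms found in any single passage."""
    terms = tokenize(nugget)
    if not terms or not passages:
        return 0.0
    best = 0.0
    for passage in passages:
        passage_terms = set(tokenize(passage))
        hits = sum(1 for t in terms if t in passage_terms)
        best = max(best, hits / len(terms))
    return best


def pourpre_sketch(answer_key, passages, beta=3.0, allowance_per_nugget=100):
    """answer_key: list of (nugget_text, is_vital); passages: list of str."""
    vital_total = sum(1 for _, is_vital in answer_key if is_vital)
    if vital_total == 0:
        return 0.0

    # Soft match scores replace the assessor's binary decisions.
    vital_sum = sum(match_score(t, passages) for t, v in answer_key if v)
    okay_sum = sum(match_score(t, passages) for t, v in answer_key if not v)

    recall = vital_sum / vital_total
    length = sum(len("".join(p.split())) for p in passages)  # non-whitespace
    allowance = allowance_per_nugget * (vital_sum + okay_sum)  # assumption
    if length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length

    denom = beta ** 2 * precision + recall
    return 0.0 if denom == 0 else (beta ** 2 + 1) * precision * recall / denom


key = [("Harlem civil rights leader", True)]
response = ["New York civil rights activist"]
print(match_score(key[0][0], response), pourpre_sketch(key, response))
```

In the Al Sharpton example the nugget shares two of its four terms with the system answer, so the soft match gives partial credit where a binary decision would have to choose between 0 and 1.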

12
Evaluating POURPRE

Rank previous TREC QA runs with POURPRE
54 runs from TREC 2003 (β = 5 and β = 3)
63 runs from TREC 2004 (β = 3)
POURPRE variants: idf/count, macro/micro, ±stem
Correlate with official scores and official
rankings
Coefficient of determination (R²)
Kendall's tau
Compare with ROUGE baseline
Concatenated all nuggets as the reference summary

13
Evaluation Methodology
Coefficient of determination (R²)
Kendall's tau
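
As a rough illustration of the two statistics named on this slide, the sketch below computes Kendall's tau (the simple tau-a form, with no tie correction) and R² (as the squared Pearson correlation) between two lists of per-run scores. The function names and the sample scores are hypothetical; the slides do not say which tau variant was used.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def r_squared(xs, ys):
    """Squared Pearson correlation between two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must not be constant")
    return sxy * sxy / (sxx * syy)


# Hypothetical scores for four runs (not TREC data).
official = [0.10, 0.20, 0.30, 0.40]
automatic = [0.15, 0.25, 0.20, 0.50]
print(kendall_tau(official, automatic), r_squared(official, automatic))
```

Kendall's tau looks only at the ordering of runs, while R² reflects how closely the automatic scores track the official ones linearly, which is why the two are reported side by side.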
14
Results

POURPRE: macro, term counts, no stemming
Kendall's tau
R²
Neither stemming nor idf weighting helps

15
Scatter Plot
TREC 2004 (β = 3)
16
Rank Swaps

Definition: method 1 ranks A before B, but method 2
ranks B before A
Rank swaps indicate instabilities in evaluation
metrics
However, there is inherent measurement error
Small score differences are not consequential

17
Analysis of Rank Swaps

For all pairwise comparisons between POURPRE and
official scores
Noted if a rank swap occurred
Measured the difference in official scores
Plotted a histogram of rank swaps binned by
difference in official scores
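
The procedure above can be sketched as follows, assuming each run has one official score and one automatic score. The function names, the bin width, and the sample scores are hypothetical and only illustrate the counting and binning steps.

```python
from collections import Counter
from itertools import combinations


def n_pairs(n_runs):
    """Number of pairwise comparisons among n_runs runs."""
    return n_runs * (n_runs - 1) // 2


def rank_swap_diffs(official, automatic):
    """Official-score differences for every pair the two metrics order differently."""
    diffs = []
    for i, j in combinations(range(len(official)), 2):
        d_official = official[i] - official[j]
        d_automatic = automatic[i] - automatic[j]
        if d_official * d_automatic < 0:  # opposite ordering: a rank swap
            diffs.append(abs(d_official))
    return diffs


def histogram(diffs, bin_width=0.01):
    """Count swaps per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    counts = Counter(int(d / bin_width) for d in diffs)
    return dict(sorted(counts.items()))


# Hypothetical scores for three runs (not TREC data).
official = [0.40, 0.30, 0.10]
automatic = [0.28, 0.31, 0.05]
swaps = rank_swap_diffs(official, automatic)
print(len(swaps), "swaps out of", n_pairs(len(official)), "pairs")
```

For 54 runs, `n_pairs(54)` gives 1431 pairwise comparisons.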

18
Analysis of Rank Swaps

Conclusions
Rank swaps not much to worry about
Accuracy of POURPRE is within tolerance of
measurement accuracy

81 rank swaps out of 1431 pairwise comparisons
Figure: Histogram of rank swaps between official and
POURPRE scores (TREC 2003, β = 5), binned by
difference in official score
(Voorhees, 2003)
19
Vital/Okay?

Answer nuggets are classified as "vital" or
"okay"
Distinction has large impact on scores
But hard to operationalize

Target: Viagra
vital: Has caused deaths in US.
vital: Heart patients especially warned against use.
okay: Can cause vision and headache problems.
okay: Viagra increases nitric oxide's effect.
okay: Bob Dole rcvd $45000 weekly while endorsement ran.
okay: After initial hesitation, Arab countries approved use.
okay: Many insurers cover use of drug.
okay: HMOs have labeled Viagra a lifestyle drug.
okay: Called a $10-a-pill therapy.
20
Impact of Vital/Okay Distinction

Experiments with nugget classification
Created variant answer keys
All nuggets vital
Vital/okay flipped
Random assignment of vital/okay
Scored all TREC runs using variant answer keys
Correlated results with official rankings
(Kendall's tau)
Results
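
For readers who want to reproduce the setup, the three variant answer keys listed above might be generated along these lines. This is a sketch under assumptions: an answer key is represented as a list of (nugget text, is_vital) pairs, and the random variant labels each nugget vital with probability 0.5, which the slides do not specify.

```python
import random


def all_vital(answer_key):
    """Variant 1: every nugget is treated as vital."""
    return [(text, True) for text, _ in answer_key]


def flipped(answer_key):
    """Variant 2: vital becomes okay and okay becomes vital."""
    return [(text, not is_vital) for text, is_vital in answer_key]


def random_labels(answer_key, p_vital=0.5, seed=0):
    """Variant 3: random vital/okay labels (p_vital is an assumed setting)."""
    rng = random.Random(seed)
    return [(text, rng.random() < p_vital) for text, _ in answer_key]


# Two nuggets from the Viagra answer key shown earlier.
key = [
    ("Has caused deaths in US.", True),
    ("Can cause vision and headache problems.", False),
]
for variant in (all_vital, flipped, random_labels):
    print(variant.__name__, variant(key))
```

Each variant key can then be used to rescore all runs, and the resulting ranking compared with the official one.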

21
QA and Summarization

Converging evolution of question answering and
multi-document summarization
QA: move from factoids to more complex questions
(definition, relationships, opinions, etc.)
Summarization: shift towards query-focused
multi-document information synthesis
Converging trends in evaluation: similarity
between nuggets and the pyramid scheme
Both focus on basic factual units
Both diagnostic
Pyramid scheme has a more refined sense of
importance: possible incorporation in QA?
Lots of synergy between the two communities

Nenkova et al., 2004
Lin and Demner-Fushman, 2005
22
Why POURPRE?

Fills an evaluation gap
Allows automatic evaluation of answers
Provides instant feedback on system performance
Facilitates rapid experimentation
POURPRE is preferred over ROUGE
Tailored specifically for the definition QA task
Better correlation with official scores and
rankings
More fine-grained nugget-level evaluation
Diagnostic
Nugget F-score as a general paradigm for
evaluating complex answers
Broad applicability of POURPRE to other QA tasks

23
Conclusions

No method exists to automatically evaluate
answers to complex questions
We propose a measure called POURPRE to address
this evaluation gap
We demonstrate on TREC QA data that our measure
correlates better with official scores than
direct application of metrics for other tasks

24
Questions?