Title: Automatically Evaluating Answers to Definition Questions
1Automatically Evaluating Answers to Definition Questions
- Jimmy Lin and Dina Demner-Fushman
- University of Maryland
- HLT/EMNLP 2005
- Saturday, October 8, 2005
2Roadmap
- Background
- Current state of question answering
- How definition questions are evaluated
- Lack of automatic evaluation metrics until now
- Introducing POURPRE
- How well does POURPRE correlate with official scores?
- Discussion of evaluation methodology
- Conclusions
3Brief History of QA
- Large-scale formal evaluations at TREC
- Question Answering track, started in TREC 8 (1999)
- Factoid questions
- Automatic evaluation method: match answer patterns
- More complex questions
- Automatic evaluation method: none available!
What's the capital of Maryland?
How many floors does the Empire State Building have?
Who won the World Series in 1986?
Who is Colin Powell?
The analyst is concerned with arms trafficking to Colombian insurgents. Specifically, the analyst would like to know of the different routes used for arms entering Colombia and the entities involved.
4Definition Questions
- Previously evaluated at
- TREC 2003
- TREC 2004 and 2005 (as "other" questions)
- Tell me about
- Find interesting facts about a topic (organization, person, term, event, etc.)
- System output is a collection of passages
- No limit on length of system output
Who is Aaron Copland? What is a golden
parachute? What are fractals?
5Evaluation Flow
1. NIST collects runs from every system
2. Assessor reads all runs and manually creates an answer key
3. Assessor uses answer key to manually score all runs
[Diagram: run1 through run4, each paired with its score (score1 through score4)]
6Creating the Answer Key
- Performed once per set of questions
- After reading all system responses, assessor creates a list of nuggets
- Nugget: basic unit of information (fact)
- Categorized as vital or okay
What is the Cassini space probe?
1. 32 kilograms plutonium powered (vital)
2. seven year journey (vital)
3. Titan 4-B Rocket (vital)
4. send Huygens to probe atmosphere of Titan, Saturn's largest moon (vital)
5. parachute instruments to planet's surface (okay)
6. oceans of ethane or other hydrocarbons, frozen methane or water (okay)
7. carries 12 packages scientific instruments and a probe (vital)
8. NASA primary responsible for Cassini orbiter (okay)
7Scoring Each Run
- Semantically match nuggets in answer key with system output
- Compute final F-score
- Recall component: fraction of vital nuggets retrieved
- Precision component: length allowance for vital and okay nuggets retrieved
- Recall favored over precision (β=3)
NYT19990816.0266: Early in the Saturn visit, Cassini is to send a probe named Huygens into the smog-shrouded atmosphere of Titan, the planet's largest moon, and parachute instruments to its hidden surface to see if it holds oceans of ethane or other hydrocarbons over frozen layers of methane or water.
Nuggets found: 4, 5, 6
8F-Score Details
Note: recall is only a function of vital nuggets
Getting okay nuggets doesn't increase a run's score directly; it only gives more length allowance
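The two notes above can be checked with a small sketch of the nugget F-score. This is an illustration rather than the official scoring script: the function and parameter names are made up here, and the allowance of 100 non-whitespace characters per matched nugget is an assumption taken from the TREC definition, since the slides give no number.

```python
def nugget_f_score(vital_returned, vital_total, okay_returned,
                   length, beta=3.0, allowance_per_nugget=100):
    """Nugget F-score for one question (illustrative sketch).

    length: number of non-whitespace characters in the system response.
    allowance_per_nugget: assumed to be 100, following the TREC definition.
    """
    # Recall looks at vital nuggets only.
    recall = vital_returned / vital_total if vital_total else 0.0

    # Every matched nugget, vital or okay, earns extra length allowance.
    allowance = allowance_per_nugget * (vital_returned + okay_returned)

    # Precision is 1 within the allowance and decays with excess length.
    if length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length

    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# Cassini passage: nugget 4 is vital, nuggets 5 and 6 are okay,
# and the answer key holds 5 vital nuggets. The length is hypothetical.
print(nugget_f_score(vital_returned=1, vital_total=5,
                     okay_returned=2, length=250))
```

With a response inside its allowance, adding okay nuggets leaves the score unchanged; they only matter once the response is long enough to be penalized.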
9Semantic Matching
- Answer nuggets must be matched against system output manually
- Until now, there is no method for automatic evaluation
- Researchers must wait for the yearly TREC cycle
- No hill to climb!
Who is Al Sharpton?
Nugget: Harlem civil rights leader
System answer: New York civil rights activist
Who is Ari Fleischer?
Nugget: Elizabeth Dole's Press Secretary
System answer: Ari Fleischer, spokesman for Elizabeth Dole
What is the medical condition shingles?
Nugget: tropical [sic] capsaicin relieves pain of shingles
System answer: Epilepsy drug relieves pain from... shingles
10Automatic Nugget Matching
- Here's a simple idea: can we replace nugget matching with substring co-occurrence?
- Works for machine translation (BLEU/NIST)
- Works for summarization (ROUGE)
- Definition question answering task is closer to summarization than MT
- Most systems are extractive: no need to measure fluency
- Unigram co-occurrences should work well (from ROUGE)
- Here's a catchy name: POURPRE
- Following a rainbow of colors: ROUGE, BLEU, ORANGE, etc.
11POURPRE
- Nugget match (or "Is this nugget present in the system response?")
- Previously a manual binary decision
- Replace with unigram overlap between system output and best-matching nugget from answer key
- Calculation of F-score proceeds as before
- Scoring variations
- term counts: fraction of terms matched
- idf: idf of matched terms / idf of all terms in nugget
- Other variations
- Macroaveraging vs. microaveraging
- Stemming vs. no stemming
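A minimal sketch of the term-count variant without stemming, under several assumptions: the tokenizer is a simple lowercase alphanumeric split, each nugget is scored against its best-matching response string, the soft match scores replace the binary counts in both recall and length allowance, and the allowance is 100 characters per nugget as in the TREC definition. The function names are illustrative, not those of the released POURPRE software.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric unigrams (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_score(nugget, response_strings):
    """Fraction of nugget terms found in the best-matching response string."""
    nugget_terms = set(tokenize(nugget))
    if not nugget_terms:
        return 0.0
    best = 0.0
    for response in response_strings:
        matched = nugget_terms & set(tokenize(response))
        best = max(best, len(matched) / len(nugget_terms))
    return best


def pourpre_score(answer_key, response_strings, beta=3.0):
    """Soft nugget F-score for one question.

    answer_key: list of (nugget_text, is_vital) pairs.
    """
    scores = [(match_score(text, response_strings), vital)
              for text, vital in answer_key]
    vital_total = sum(1 for _, vital in scores if vital)
    # Soft recall: summed match scores of vital nuggets.
    recall = (sum(s for s, vital in scores if vital) / vital_total
              if vital_total else 0.0)
    # Soft allowance: 100 characters per unit of match score (assumed).
    allowance = 100 * sum(s for s, _ in scores)
    # Response length in non-whitespace characters.
    length = sum(len("".join(r.split())) for r in response_strings)
    if length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# The Al Sharpton pair from the Semantic Matching slide: 2 of 4 terms overlap.
print(match_score("Harlem civil rights leader",
                  ["New York civil rights activist"]))  # 0.5
```

The idf variant would replace the term counts in `match_score` with sums of idf weights, which requires document frequencies from a reference corpus and is left out here.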
12Evaluating POURPRE
- Rank previous TREC QA runs with POURPRE
- 54 runs from TREC 2003 (β=5, 3)
- 63 runs from TREC 2004 (β=3)
- POURPRE variants: idf/count, macro/micro, ±stem
- Correlate with official scores and official rankings
- Coefficient of correlation (R²)
- Kendall's tau
- Compare with ROUGE baseline
- Concatenated all nuggets as the reference summary
13Evaluation Methodology
Coefficient of correlation (R²)
Kendall's tau
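For readers unfamiliar with Kendall's tau, the sketch below computes it directly from pairwise agreement between two score lists. It is the simple tau-a form, where tied pairs count as neither concordant nor discordant; whether the original experiments used this form or a tie-corrected one is not stated in the slides.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1   # both metrics order the pair the same way
        elif sign < 0:
            discordant += 1   # the metrics disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Identical rankings give 1.0, reversed rankings give -1.0.
print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))
```

A value near 1 means the automatic metric orders the runs almost exactly as the official scores do.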
14Results
- POURPRE: macro, term counts, no stemming
- Kendall's tau
- R²
- Neither stemming nor idf weighting helps
15Scatter Plot
TREC 2004 (β=3)
16Rank Swaps
- Definition: method1 ranks A before B, but method2 ranks B before A
- Rank swaps indicate instabilities in evaluation metrics
- However, there is inherent measurement error
- Small score differences are not consequential
17Analysis of Rank Swaps
- For all pairwise comparisons between POURPRE and official scores
- Noted if a rank swap occurred
- Measured the difference in official scores
- Plotted a histogram of rank swaps binned by
difference in official scores
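The procedure above can be sketched as follows. The run names, scores, and bin width are hypothetical; only the logic (compare every pair of runs, record a swap when the two metrics disagree on order, bin by official score difference) follows the slide.

```python
from collections import Counter
from itertools import combinations


def rank_swaps(official, automatic):
    """Return the official-score difference for every swapped pair.

    official, automatic: dicts mapping run name -> score.
    """
    diffs = []
    for a, b in combinations(sorted(official), 2):
        d_official = official[a] - official[b]
        d_automatic = automatic[a] - automatic[b]
        # A swap occurs when the two metrics order the pair oppositely.
        if d_official * d_automatic < 0:
            diffs.append(abs(d_official))
    return diffs


def histogram(diffs, bin_width):
    """Count swaps per bin of official-score difference."""
    return Counter(int(d / bin_width) for d in diffs)


# Hypothetical scores for three runs.
official = {"A": 0.50, "B": 0.40, "C": 0.39}
automatic = {"A": 0.60, "B": 0.30, "C": 0.35}
swaps = rank_swaps(official, automatic)
print(len(swaps), histogram(swaps, bin_width=0.02))
```

With 54 runs, `combinations` yields 54·53/2 = 1431 pairs, which agrees with the number of pairwise comparisons reported for TREC 2003.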
18Analysis of Rank Swaps
- Conclusions
- Rank swaps not much to worry about
- Accuracy of POURPRE is within tolerance of
measurement accuracy
81 rank swaps out of 1431 pairwise comparisons
Histogram of rank swaps between official and POURPRE scores (TREC 2003, β=5), binned by difference in official score
Voorhees, 2003
19Vital/Okay?
- Answer nuggets are classified as vital or okay
- Distinction has large impact on scores
- But hard to operationalize
Target: Viagra
vital: Has caused deaths in US.
vital: Heart patients especially warned against use.
okay: Can cause vision and headache problems.
okay: Viagra increases nitric oxide's effect.
okay: Bob Dole rcvd $45000 weekly while endorsement ran.
okay: After initial hesitation, Arab countries approved use.
okay: Many insurers cover use of drug.
okay: HMOs have labeled Viagra a lifestyle drug.
okay: Called a $10-a-pill therapy.
20Impact of Vital/Okay Distinction
- Experiments with nugget classification
- Created variant answer keys
- All nuggets vital
- Vital/okay flipped
- Random assignment of vital/okay
- Scored all TREC runs using variant answer keys
- Correlated results with official rankings (Kendall's tau)
- Results
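The three variant answer keys can be generated along these lines. This is a sketch under assumptions: the key is represented as a list of (nugget, label) pairs, and the random variant labels each nugget vital or okay with equal probability, which may differ from how the original experiment drew its random assignment.

```python
import random


def all_vital(answer_key):
    """Variant key in which every nugget is vital."""
    return [(text, "vital") for text, _ in answer_key]


def flipped(answer_key):
    """Variant key with vital and okay labels exchanged."""
    swap = {"vital": "okay", "okay": "vital"}
    return [(text, swap[label]) for text, label in answer_key]


def random_labels(answer_key, seed=0):
    """Variant key with labels drawn uniformly at random (assumed scheme)."""
    rng = random.Random(seed)
    return [(text, rng.choice(["vital", "okay"])) for text, _ in answer_key]


# Two nuggets from the Viagra example.
key = [("Has caused deaths in US.", "vital"),
       ("Many insurers cover use of drug.", "okay")]
print(all_vital(key))
print(flipped(key))
```

Each variant key is then used to rescore every run, and the resulting ranking is compared with the official one.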
21QA and Summarization
- Converging evolution of question answering and multi-document summarization
- QA: move from factoids to more complex questions (definition, relationships, opinions, etc.)
- Summarization: shift towards query-focused multi-document information synthesis
- Converging trends in evaluation: similarity between nuggets and the pyramid scheme
- Both focus on basic factual units
- Both diagnostic
- Pyramid scheme has a more refined sense of importance: possible incorporation in QA?
- Lots of synergy between the two communities
Nenkova et al., 2004
Lin and Demner-Fushman, 2005
22Why POURPRE?
- Fills an evaluation gap
- Allows automatic evaluation of answers
- Provides instant feedback on system performance
- Facilitates rapid experimentation
- POURPRE is preferred over ROUGE
- Tailored specifically for the definition QA task
- Better correlation with official scores and rankings
- More fine-grained nugget-level evaluation
- Diagnostic
- Nugget F-score as a general paradigm for evaluating complex answers
- Broad applicability of POURPRE to other QA tasks
23Conclusions
- No method exists to automatically evaluate answers to complex questions
- We propose a measure called POURPRE to address this evaluation gap
- We demonstrate on TREC QA data that our measure correlates better with official scores than direct application of metrics for other tasks
24Questions?
http://www.umiacs.umd.edu/jimmylin/downloads/