Transcript and Presenter's Notes

Title: Automatically Evaluating Answers to Definition Questions


1
Automatically Evaluating Answers to Definition
Questions
  • Jimmy Lin and Dina Demner-Fushman
  • University of Maryland
  • HLT/EMNLP 2005
  • Saturday, October 8, 2005

2
Roadmap
  • Background
  • Current state of question answering
  • How definition questions are evaluated
  • Lack of automatic evaluation metrics until now
  • Introducing POURPRE
  • How well does POURPRE correlate with official
    scores?
  • Discussion of evaluation methodology
  • Conclusions

3
Brief History of QA
  • Large-scale formal evaluations at TREC
  • Question Answering track, started in TREC 8
    (1999)
  • Factoid questions
  • Automatic evaluation method: match answer patterns
  • More complex questions
  • Automatic evaluation method: none available!

What's the capital of Maryland? How many floors
does the Empire State Building have? Who won the
World Series in 1986?
Who is Colin Powell? The analyst is concerned
with arms trafficking to Colombian insurgents.
Specifically, the analyst would like to know of
the different routes used for arms entering
Colombia and the entities involved.
4
Definition Questions
  • Previously evaluated at:
  • TREC 2003
  • TREC 2004 and 2005 (as "other" questions)
  • "Tell me about ..."
  • Find interesting facts about a topic (organization, person, term, event, etc.)
  • System output is a collection of passages
  • No limit on length of system output

Who is Aaron Copland? What is a golden
parachute? What are fractals?
5
Evaluation Flow
1. NIST collects runs from every system
2. Assessor reads all runs and manually creates an answer key
3. Assessor uses answer key to manually score all runs

[Flow diagram: run1, run2, run3, run4 are each scored against the answer key, producing score1, score2, score3, score4]
6
Creating the Answer Key
  • Performed once per set of questions
  • After reading all system responses, assessor creates a list of nuggets
  • Nugget: basic unit of information (a "fact")
  • Categorized as "vital" or "okay"

What is the Cassini space probe?
1. 32 kilograms plutonium powered (vital)
2. seven year journey (vital)
3. Titan 4-B Rocket (vital)
4. send Huygens to probe atmosphere of Titan, Saturn's largest moon (vital)
5. parachute instruments to planet's surface (okay)
6. oceans of ethane or other hydrocarbons, frozen methane or water (okay)
7. carries 12 packages scientific instruments and a probe (vital)
8. NASA primary responsible for Cassini orbiter (okay)
7
Scoring Each Run
  • Semantically match nuggets in answer key with
    system output
  • Compute final F-score
  • Recall component: fraction of vital nuggets retrieved
  • Precision component: length allowance for vital and okay nuggets retrieved
  • Recall favored over precision (β = 3)

NYT19990816.0266 Early in the Saturn visit,
Cassini is to send a probe named Huygens into the
smog-shrouded atmosphere of Titan, the planet's
largest moon, and parachute instruments to its
hidden surface to see if it holds oceans of
ethane or other hydrocarbons over frozen layers
of methane or water.

Nuggets found: 4, 5, 6
8
F-Score Details
Note: recall is only a function of vital nuggets. Getting okay nuggets doesn't increase a run's score directly; it only gives more length allowance.
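
The slide presumably displayed the official nugget F-score; for reference, here is a reconstruction of that measure as defined by Voorhees (2003), where length counts the non-whitespace characters in a run's answer strings (the exact layout of the original slide is not preserved in this transcript):

```latex
\mathcal{R} = \frac{\#\,\text{vital nuggets returned}}{\#\,\text{vital nuggets in the answer key}},
\qquad
\alpha = 100 \times \big(\#\,\text{vital} + \#\,\text{okay nuggets returned}\big)

\mathcal{P} =
\begin{cases}
1 & \text{if } \mathrm{length} \le \alpha \\[4pt]
1 - \dfrac{\mathrm{length} - \alpha}{\mathrm{length}} & \text{otherwise}
\end{cases}

F(\beta) = \frac{(\beta^2 + 1)\,\mathcal{P}\,\mathcal{R}}{\beta^2\,\mathcal{P} + \mathcal{R}},
\qquad \beta = 3 \ (\beta = 5 \text{ in TREC 2003})
```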
9
Semantic Matching
  • Answer nuggets must be matched against system
    output manually
  • Until now, there has been no method for automatic evaluation
  • Researchers must wait for the yearly TREC cycle
  • No hill to climb!

Who is Al Sharpton?
Nugget: Harlem civil rights leader
System answer: New York civil rights activist

Who is Ari Fleischer?
Nugget: Elizabeth Dole's Press Secretary
System answer: Ari Fleischer, spokesman for Elizabeth Dole

What is the medical condition shingles?
Nugget: tropical [sic] capsaicin relieves pain of shingles
System answer: Epilepsy drug relieves pain from... shingles
10
Automatic Nugget Matching
  • Here's a simple idea: can we replace nugget matching with substring co-occurrence?
  • Works for machine translation (BLEU/NIST)
  • Works for summarization (ROUGE)
  • Definition question answering task is closer to summarization than MT
  • Most systems are extractive: no need to measure fluency
  • Unigram co-occurrences should work well (from ROUGE)
  • Here's a catchy name: POURPRE
  • Following a rainbow of colors: ROUGE, BLEU, ORANGE, etc.

11
POURPRE
  • Nugget match (or: Is this nugget present in the system response?)
  • Previously a manual binary decision
  • Replace with unigram overlap between system output and best-matching nugget from answer key (a sketch follows this list)
  • Calculation of F-score proceeds as before
  • Scoring variations:
  • term counts: fraction of terms matched
  • idf: idf of matched terms / idf of all terms in nugget
  • Other variations:
  • Macroaveraging vs. microaveraging

12
Evaluating POURPRE
  • Rank previous TREC QA runs with POURPRE
  • 54 runs from TREC 2003 (β = 5, 3)
  • 63 runs from TREC 2004 (β = 3)
  • POURPRE variants: idf/count, macro/micro, ±stem
  • Correlate with official scores and official rankings
  • Coefficient of correlation (R²)
  • Kendall's tau
  • Compare with ROUGE baseline
  • Concatenated all nuggets as the reference summary

13
Evaluation Methodology
Coefficient of correlation (R²)
Kendall's tau
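
A minimal sketch of how the two correlation measures can be computed from per-run scores, assuming numpy and scipy are available (illustrative only, not the paper's actual evaluation scripts):

```python
import numpy as np
from scipy.stats import kendalltau

def compare(official_scores, pourpre_scores):
    """official_scores, pourpre_scores: per-run scores in the same run order."""
    official = np.asarray(official_scores, dtype=float)
    pourpre = np.asarray(pourpre_scores, dtype=float)

    # Pearson correlation coefficient between the two score vectors, squared (R^2).
    r = np.corrcoef(official, pourpre)[0, 1]
    r_squared = r ** 2

    # Kendall's tau over the rankings induced by the two metrics.
    tau, _ = kendalltau(official, pourpre)
    return r_squared, tau
```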
14
Results
  • POURPRE: macro, term counts, no stemming
  • Kendall's tau
  • R²
  • Neither stemming nor idf weighting helps

15
Scatter Plot
TREC 2004 (β = 3)
16
Rank Swaps
  • Definition: method1 ranks A before B, but method2 ranks B before A
  • Rank swaps indicate instabilities in evaluation
    metrics
  • However, there is inherent measurement error
  • Small score differences are not consequential

17
Analysis of Rank Swaps
  • For all pairwise comparisons between POURPRE and official scores (a counting sketch follows this list):
  • Noted whether a rank swap occurred
  • Measured the difference in official scores
  • Plotted a histogram of rank swaps, binned by difference in official scores
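
A small illustrative helper (hypothetical, not the paper's code) that counts rank swaps between two metrics and collects the official-score gaps used for the histogram:

```python
from itertools import combinations

def rank_swaps(official, pourpre):
    """official, pourpre: dicts mapping run id -> score under each metric.
    Returns the official-score differences of all pairs whose order flips."""
    gaps = []
    for a, b in combinations(official, 2):
        d_official = official[a] - official[b]
        d_pourpre = pourpre[a] - pourpre[b]
        if d_official * d_pourpre < 0:   # opposite ordering = rank swap
            gaps.append(abs(d_official))
    return gaps                          # bin these values to get the histogram
```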

18
Analysis of Rank Swaps
  • Conclusions
  • Rank swaps: not much to worry about
  • Accuracy of POURPRE is within the tolerance of measurement accuracy

81 rank swaps out of 1431 pairwise comparisons
Histogram of rank swaps between official and POURPRE scores (TREC 2003, β = 5), binned by difference in official score
(Voorhees, 2003)
19
Vital/Okay?
  • Answer nuggets are classified as "vital" or "okay"
  • Distinction has a large impact on scores
  • But hard to operationalize

Target: Viagra
vital: Has caused deaths in US.
vital: Heart patients especially warned against use.
okay: Can cause vision and headache problems.
okay: Viagra increases nitric oxide's effect.
okay: Bob Dole rcvd 45000 weekly while endorsement ran.
okay: After initial hesitation, Arab countries approved use.
okay: Many insurers cover use of drug.
okay: HMOs have labeled Viagra a lifestyle drug.
okay: Called a 10-a-pill therapy.
20
Impact of Vital/Okay Distinction
  • Experiments with nugget classification
  • Created variant answer keys (sketched after this list):
  • All nuggets vital
  • Vital/okay flipped
  • Random assignment of vital/okay
  • Scored all TREC runs using variant answer keys
  • Correlated results with official rankings (Kendall's tau)
  • Results
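
A minimal sketch (hypothetical helpers, not the authors' scripts) of the three variant answer keys; each takes a list of (nugget, label) pairs and returns a relabeled copy:

```python
import random

def all_vital(key):
    """key: list of (nugget_text, label) pairs, label in {"vital", "okay"}."""
    return [(nugget, "vital") for nugget, _ in key]

def flipped(key):
    """Swap every vital label for okay and vice versa."""
    return [(nugget, "okay" if label == "vital" else "vital") for nugget, label in key]

def random_assignment(key, seed=0):
    """Assign vital/okay uniformly at random (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [(nugget, rng.choice(["vital", "okay"])) for nugget, _ in key]
```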

21
QA and Summarization
  • Converging evolution of question answering and multi-document summarization
  • QA: move from factoids to more complex questions (definition, relationships, opinions, etc.)
  • Summarization: shift towards query-focused multi-document information synthesis
  • Converging trends in evaluation: similarity between nuggets and the pyramid scheme
  • Both focus on basic factual units
  • Both diagnostic
  • Pyramid scheme has a more refined sense of importance; possible incorporation into QA?
  • Lots of synergy between the two communities

Nenkova et al., 2004
Lin and Demner-Fushman, 2005
22
Why POURPRE?
  • Fills an evaluation gap
  • Allows automatic evaluation of answers
  • Provides instant feedback on system performance
  • Facilitates rapid experimentation
  • POURPRE is preferred over ROUGE
  • Tailored specifically for the definition QA task
  • Better correlation with official scores and
    rankings
  • More fine-grained nugget-level evaluation
  • Diagnostic
  • Nugget F-score as a general paradigm for
    evaluating complex answers
  • Broad applicability of POURPRE to other QA tasks

23
Conclusions
  • No method exists to automatically evaluate
    answers to complex questions
  • We propose a measure called POURPRE to address
    this evaluation gap
  • We demonstrate on TREC QA data that our measure
    correlates better with official scores than
    direct application of metrics for other tasks

24
Questions?
  • Download POURPRE at

http://www.umiacs.umd.edu/~jimmylin/downloads/