Transcript and Presenter's Notes

Title: Evaluating Question Answering Systems


1
Evaluating Question Answering Systems
  • Ellen M. Voorhees

2
So, enough already!
  • Evaluations in HLT proliferating
  • DARPA speech evals
  • MUC, ACE
  • TREC, NTCIR, CLEF
  • senseval, parseval, DUC, TDT...
  • Let's stop evaluating and get some real work
    done!

3
Case for Community Evaluations
  • Form/solidify a research community
  • Establish the research methodology
  • Facilitate technology transfer
  • Document the state-of-the-art
  • Amortize the costs of infrastructure

4
Of course, some downside
  • Evaluations do take resources from other efforts
  • money to defray evaluation costs
  • researcher time
  • minimize effect by keeping evaluation-only tasks
    such as result reporting simple
  • Overfitting
  • entire community trains to peculiarities of test
    set
  • minimize effect by having multiple test sets,
    evolving the evaluation task

5
What is a Good Eval Task?
  • Abstraction of real-world task so variables
    affecting performance can be controlled...
  • ...but must capture salient aspects of real task
    or exercise is pointless
  • Metrics must accurately predict relative
    effectiveness on those aspects
  • Adequate level of difficulty
  • Best if measures are diagnostic

6
TREC QA Track
  • Goal
  • encourage research into systems that return
    answers, rather than document lists
  • Motivation: bring benefits of large-scale
    evaluation to QA task
  • provide a common problem for the IR and IE
    communities
  • investigate appropriate evaluation
    methodologies for QA

7
Original QA Track Task
  • Given
  • set of fact-based, short-answer questions
  • 3 GB of newspaper/newswire text
  • Return
  • ranked list of document, answer-string pairs
  • strings limited to 50 or 250 bytes
  • document must support answer
  • Guidelines
  • completely automatic processing
  • answer guaranteed to exist in collection
  • assume documents are factual

8
Sample Questions
  • How much folic acid should an expectant mother
    get daily?
  • Who invented the paper clip?
  • What university was Woodrow Wilson president of?
  • Where is Rider College located?
  • Name a film in which Jude Law acted.

9
Evaluation
  • Human assessors judge correctness of responses
  • Score using mean reciprocal rank (sketched below)
  • score for individual question is the reciprocal
    of the rank at which the first correct response
    is returned (0 if no correct response is returned)
  • score of a run is the mean over the set of questions
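A minimal sketch of this scoring, assuming a run maps each question id to its ranked list of (docid, answer-string) pairs and that a judgment function is supplied; the names here are illustrative:

    def reciprocal_rank(ranked_responses, is_correct):
        """Reciprocal of the rank of the first correct response; 0 if none is correct."""
        for rank, response in enumerate(ranked_responses, start=1):
            if is_correct(response):
                return 1.0 / rank
        return 0.0

    def mean_reciprocal_rank(run, is_correct):
        """MRR of a run: the mean of the per-question reciprocal ranks."""
        scores = [reciprocal_rank(responses, is_correct) for responses in run.values()]
        return sum(scores) / len(scores) if scores else 0.0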

10
Question Answering Techniques
  • 250 bytes is enough for passage-retrieval
    techniques; 50 bytes is not
  • For the 50-byte limit
  • classify question type
  • look for close-by entities that match entailed
    answer type
  • fall back to passage retrieval if failure

11
Template Matching
  • Determine question type by template matching
  • Template-matching is successful when questions
    are predictable
  • Who vs. What <person-description>
  • Occasional problems even when templates match
  • Who was the first American in space?
  • As for Wilson himself, he became a senator by
    defeating Jerry Brown, who has been called the
    first American in space.
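A minimal sketch of template-based question classification; the patterns and answer-type names below are illustrative assumptions, not the actual rules any track system used:

    import re

    # Illustrative question-type templates; real systems used far richer sets.
    QUESTION_TEMPLATES = [
        (re.compile(r"^who\b", re.I), "PERSON"),
        (re.compile(r"^where\b", re.I), "LOCATION"),
        (re.compile(r"^when\b", re.I), "DATE"),
        (re.compile(r"^how\s+(much|many)\b", re.I), "QUANTITY"),
        (re.compile(r"^what\s+(city|state|country)\b", re.I), "LOCATION"),
    ]

    def classify_question(question):
        """Return the entailed answer type, or None to fall back to passage retrieval."""
        for pattern, answer_type in QUESTION_TEMPLATES:
            if pattern.search(question):
                return answer_type
        return None

    # classify_question("Who invented the paper clip?") -> "PERSON"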

12
Question Answering Techniques
  • TREC-8 approaches generally retained
  • better question classification
  • wider variety of methods for finding entailed
    answer types
  • frequent use of WordNet
  • High-quality document search still helpful
  • TREC 2001 saw onset of methods that substitute
    massive amounts of data for sophisticated
    processing

13
TREC 2001 Main Task Results
Scores for the best run of the top 8 groups using
strict evaluation
14
Source of Questions
  • TREC-8: developed for the track
  • as unambiguous as possible
  • tended to be back-formulations
  • TREC-9: ideas mined from Excite log
  • no reference to docs during creation
  • TREC 2001: literal questions from logs
  • logs donated by AskJeeves and Microsoft
  • Had major impact on task: real questions much
    harder because more ambiguous

15
Evaluation Methodology
  • Different philosophies
  • IR: the user is the sole judge of a
    satisfactory response
  • human assessors judge responses
  • flexible interpretation of correct response
  • final scores comparative, not absolute
  • IE: there exists the answer
  • answer keys developed by application expert
  • requires enumeration of all acceptable responses
    at outset
  • subsequent scoring trivial; final scores absolute

16
QA Evaluation Methodology
  • NIST assessors judge answer strings
  • binary judgment of correct/incorrect
  • document provides context for answer
  • Each question independently judged by 3 assessors
  • can build high-quality final judgment set
  • provided data for measuring effect of differences
    on final scores

17
User Evaluation Necessary
  • Even for these questions, context matters
  • Taj Mahal casino in Atlantic City
  • Legitimate differences in opinion as to whether
    string contains correct answer
  • granularity of dates
  • completeness of names
  • confusability of answer string
  • If assessors' opinions differ, so will eventual
    end-users' opinions

18
Validating the Methodology
  • User-based evaluation is appropriate, but is it
    reliable?
  • how do judgment differences affect scores?
  • Does the methodology produce the equivalent of an
    IR test collection for QA?
  • researchers able to evaluate their own runs

19
Mean Reciprocal Rank by Qrels
20
Kendall Correlations
21
Comparative Scores are Stable
  • Mean Kendall τ of .96 (18 swaps): equivalent to
    variation found in TREC IR test collections
  • Judgment sets using 1 judge's opinion equivalent
    to adjudicated judgments
  • adjudicated > 3 times the cost of 1-judge qrels
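A minimal sketch of the Kendall τ computation between two system rankings, where each ranking is a list of run names ordered best to worst and ties are not handled:

    from itertools import combinations

    def kendall_tau(ranking_a, ranking_b):
        """tau = (concordant - discordant) / total pairs; each discordant pair is one swap."""
        pos_a = {run: i for i, run in enumerate(ranking_a)}
        pos_b = {run: i for i, run in enumerate(ranking_b)}
        concordant = discordant = 0
        for x, y in combinations(ranking_a, 2):
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
                concordant += 1
            else:
                discordant += 1
        total = concordant + discordant
        return (concordant - discordant) / total if total else 1.0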

22
QA Test Collections
  • Goal is reusable test collections
  • researchers evaluate own variants
  • main source of improvement for IR systems in TREC
  • But not the case for QA
  • judged strings have little overlap across runs
  • need procedure for deciding if unjudged string is
    okay given a set of judged strings

23
Automatic Evaluation
  • In general, evaluating strings is equivalent to
    solving original problem
  • Approximations
  • U. Ottawa: accept any string that contains a
    string judged correct
  • MITRE: have a human produce an answer key and use
    a word-recall threshold
  • NIST: produce patterns from judged strings;
    accept any string with a match

24
Example Patterns
  • Who invented Silly Putty?
  • General\sElectric
  • Where is the location of the Orange Bowl?
  • \sMiami\s    \sin\sMiami\s\.?    \sto\sMiami    at\sMiami
    Miami\s'?\ss\sdowntown    Orange.\sin\s.Miami
    Orange\sBowl\s,\sMiami    Miami\s'?\ss    Orange    Dade\sCounty
  • Who was Jane Goodall?
  • naturalist    expert\son\schimps    chimpanzee\sspecialist
    chimpanzee\sresearcher    chimpanzee\s-?\sobserver    ethologists?
    pioneered.study\sof\sprimates    anthropologist    ethnologist
    primatologist    animal\sbehaviorist
  • wife.van\sLawick    scientist\sof\sunquestionable\sreputation
    most\srecognizable\sliving\sscientist

25
Judging Strings Using Patterns
  • Consider a string correct if any pattern for that
    question matches
  • Compute MRR score for each run based on the
    pattern judgments
  • Compute Kendall τ between ranking based on
    adjudicated judgments and ranking based on
    pattern judgments
  • τ = .96, equivalent to different humans, for
    TREC-8
  • τ = .89 for TREC-9
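A minimal sketch of this pattern-based scoring, assuming a run maps question ids to ranked (docid, answer-string) pairs; note that the supporting document is never consulted, which is one of the limitations raised on the next slide:

    import re

    def pattern_judge(answer_string, patterns):
        """A string is counted correct if any of the question's patterns matches it."""
        return any(re.search(p, answer_string) for p in patterns)

    def mrr_from_patterns(run, patterns_by_question):
        """Mean reciprocal rank over a run, judged by pattern match alone."""
        total = 0.0
        for qid, responses in run.items():
            patterns = patterns_by_question.get(qid, [])
            for rank, (_docid, answer_string) in enumerate(responses, start=1):
                if pattern_judge(answer_string, patterns):
                    total += 1.0 / rank
                    break
        return total / len(run) if run else 0.0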

26
Problem Solved?
  • Not really...
  • patterns were created from runs that were then
    re-scored
  • patterns don't differentiate between documents
  • patterns don't penalize answer stuffing
  • errors are correlated with system functionality
  • document pools woefully incomplete
  • Useful if limitations understood
  • Still need full solution

27
Extensions to the Original Task
  • TREC 2001
  • systems must determine if there is an answer
    present in the collection
  • list task added
  • TREC 2002
  • exact answer rather than text snippet
  • TREC 2003
  • definition questions added
  • main task included all question types
  • TREC 2004
  • question series as abstraction of dialog

28
Motivation for Exact Answers
What river in the US is known as the Big Muddy?
  • the Mississippi
  • Known as Big Muddy, the Mississippi is the
    longest
  • as Big Muddy , the Mississippi is the longest
  • messed with . Known as Big Muddy , the Mississip
  • Mississippi is the longest river in the US
  • the Mississippi is the longest river in the US,
  • the Mississippi is the longest river(Mississippi)
  • has brought the Mississippi to ist lowest
  • ipes.In Life on the Mississippi,Mark Twain wrote
    t
  • SoutheastMississippiMark Twainofficials began
  • Known Mississippi US, Minnesota Gulf Mexico
  • Mud Island,MississippiThe-- history,Memphis

29
Motivation for Exact Answers
  • Text snippets masking important differences among
    systems
  • Pinpointing precise extent of answer important to
    driving technology
  • not a statement that deployed systems should
    return only exact answers
  • exact answers may be important as component in
    larger language systems

30
Recognizing Exact Answers
  • Gave assessors guidelines
  • most minimal response possible is not the only
    exact answer
  • e.g., accept Mississippi river for What is the
    longest river in the United States?
  • ungrammatical responses not exact
  • e.g., in Mississippi vs. Mississippi in
  • justification is not exact
  • e.g., "At 2,348 miles the Mississippi river is
    the longest US river" is inexact

31
Assessors Continue to Disagree
  • 80% of judgments were Wrong
  • 50% of responses where at least one judgment was
    not W had disagreements
  • Of those, 33% involved disagreements between
    Right and ineXact
  • well-known granularity issue now reflected here
  • For dates and quantities, disagreement between
    Wrong and ineXact

32
But Comparative Results Still Stable
  • Kendall τ scores between system rankings > 0.9
  • Scores for rankings using adjudicated judgments >
    0.94

33
QA List Task
  • Goal: force systems to assemble an answer from
    multiple documents
  • Instance-finding task
  • Name 4 U.S. cities that have a Shubert theater
  • What are 9 novels written by John Updike?
  • later tracks did not give target number of
    instances
  • response is an unordered set of docid,
    answer-string pairs

34
QA List Evaluation
  • Each list judged as a unit
  • individual instances marked
    correct/unsupported/incorrect/inexact
  • subset of correct instances could be marked
    distinct
  • Evaluation metric
  • accuracy when target number of instances given
  • average F(β=1) when no target number specified
    (sketched below)
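A minimal sketch of the F(β=1) computation for one list question, assuming precision is the fraction of returned pairs judged distinct and correct and recall is taken against the assessors' final list of known instances (the bookkeeping details here are assumptions):

    def f_beta(precision, recall, beta=1.0):
        """Standard F measure; beta > 1 weights recall more heavily than precision."""
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    def list_question_f(num_distinct_correct, num_returned, num_known_instances):
        """F(beta=1) for a single list question with no target number of instances."""
        precision = num_distinct_correct / num_returned if num_returned else 0.0
        recall = num_distinct_correct / num_known_instances if num_known_instances else 0.0
        return f_beta(precision, recall, beta=1.0)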

35
TREC 2003 List Results
F score of best run for top 15 groups for list
component
36
Definition Questions
  • Represented about 25% of test set in TREC 2001
  • What is an atom? What is epilepsy?
  • What are invertebrates? Who is Colin Powell?
  • Hard for systems to answer and for assessors to
    judge
  • lack of context/user model is unrealistic
  • while real, there are better ways of finding
    definitions than looking in a large corpus
  • What is a good answer?

37
Issues
  • Have same concept-matching problem as in other
    NLP evals (e.g., summarization)
  • want to reward systems for retrieving all of the
    important concepts required; penalize systems
    for retrieving irrelevant or redundant concepts
  • Recall = |Retrieved ∩ Required| / |Required|
  • Precision = |Retrieved ∩ Required| / |Retrieved|
  • but concepts represented in English in many ways
  • no one-to-one correspondence between items and
    concepts
  • Different questions have very different sizes for
    Required

38
Definition Evaluation
  • Have assessor create list of concepts that
    definition should contain
  • indicate essential concepts
  • okay concepts
  • Mark concepts in system responses
  • mark a concept at most once
  • individual item may have multiple, one, or no
    concepts

39
What is a golden parachute?
  • Assessor nuggets
  • Agreement between companies and top executives
  • Provides remuneration to executives who lose jobs
  • Remuneration is usually very generous
  • Encourages executives not to resist takeover
    beneficial to shareholders
  • Incentive for executives to join companies
  • Arrangement for which IRS can impose excise tax

Judged system response
40
Evaluation
  • With this methodology, concept recall computable,
    but not concept precision
  • no satisfactory way to list all concepts
    retrieved
  • assessors cannot enumerate all concepts in text
  • granularity issue
  • unnatural task
  • items not well correlated with concepts, very
    easy to game
  • Rough approximation to concept precision: length
  • count (non-white-space) characters in all items
  • intuition is that users prefer shorter of two
    definitions with same concepts
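A minimal sketch combining nugget recall with this length-based precision in an F(β=5) score; the allowance of 100 non-whitespace characters per matched nugget and the use of vital nuggets only for recall follow the TREC 2003 definition-scoring convention and are assumptions beyond what this slide states:

    def definition_f(num_vital_returned, num_vital_total, num_nuggets_matched,
                     response_length_chars, beta=5.0, allowance_per_nugget=100):
        """F(beta=5) for one definition question.

        Recall counts vital nuggets only; precision gives full credit up to a
        character allowance per matched (vital or okay) nugget, then falls off
        linearly with extra length.
        """
        recall = num_vital_returned / num_vital_total if num_vital_total else 0.0
        allowance = allowance_per_nugget * num_nuggets_matched
        if response_length_chars <= allowance:
            precision = 1.0
        else:
            precision = 1.0 - (response_length_chars - allowance) / response_length_chars
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)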

41
TREC 2003 Definition Results
F(β=5) score of best run for top 15 groups for
definition component
42
Reliability of Definition Evaluation
  • Mistakes by assessors
  • exist in all evaluations
  • can be directly measured since no pooling
  • Differences of opinion
  • different assessors disagree as to correctness
  • inherent in NLP tasks
  • Sample of questions
  • different systems do relatively differently on
    different questions
  • particular sample of questions can skew results
  • more questions lead to more stable results

43
Mistakes by Assessors
  • 14 pairs of identical definition components
  • across all pairs, 19 different definition
    questions judged differently
  • (roughly) uniform across assessors
  • number of questions affected ranges from 0 to 10
  • difference in F(β=5) scores ranges from 0.0 to
    0.043, with a mean of 0.013
  • so differences in F scores of 0.043 for
    different systems clearly must be considered
    equivalent
  • New task
  • consistency improved somewhat as assessors gained
    experience
  • better training re granularity will help some
  • will never eliminate all errors

44
Differences of Opinion
  • Each question independently judged by 2 assessors
  • assessors differed in which nuggets they desired
  • assessors differed in whether nuggets were vital
  • assessors did not differ as much in whether a
    nugget was present (modulo mistakes)
  • Correlation among system rankings when questions
    judged by different assessors
  • compute Kendall τ correlation between rankings
  • τ = 0.848, representing 113/1485 pairwise swaps
  • 8 swaps among systems whose F(β=5) scores as
    judged by original assessors differed by > 0.1
  • largest F(β=5) difference with a swap was 0.123

45
Sample of Questions in Test Set

Need difference > 0.1 for error rate < 5%
46
Definition Evaluation
  • Noise within definition evaluation comparatively
    large
  • need to consider F scores within 0.1 of one
    another equivalent
  • coarse evaluation
  • large equivalence classes of runs
  • one fix is to increase number of questions
  • larger sample of questions
  • individual mistakes have less effect
  • evaluation more costly

47
TREC 2004 Task
  • Process question series
  • each series is about a specified target, and the
    goal is to gather info about, or define, the
    target
  • a series contains factoid and list questions,
    plus a final other question
  • questions are tagged as to type and scored by
    type
  • final score is a weighted average of 3 component
    scores
  • FinalScore = 1/2 Factoid + 1/4 List + 1/4 Other

48
QA Track Results
  • Solidify a community
  • enormous growth in QA community
  • world-wide interest (e.g., QA tasks in NTCIR,
    CLEF)
  • Establish the research methodology
  • showed that even facts are context-sensitive
  • first steps toward evaluating complex answers
  • Facilitate technology transfer
  • common architecture for factoid questions
  • Document the state-of-the-art
  • task for which NLP techniques show real benefit
  • rough boundary of when IR techniques are insufficient
  • Amortize the costs of infrastructure
  • patterns a partial solution; use with care

49
Where to next?
  • Synergy between QA and summarization
  • continue to explore evaluation methodologies for
    complex questions
  • reduce emphasis on factoids to a supporting role
  • Context-sensitive QA
  • personalized to user
  • takes account of implicit background as well as
    explicit cues within interaction
