1
An Evaluation Competition? Eight Reasons to be Cautious
  • Donia Scott
  • Open University
  • Johanna Moore
  • University of Edinburgh

2
1. All that Glitters is not Gold
  • Evaluation requires a gold standard, i.e.,
    clearly specified input/output pairs
  • Does this make sense for NLG?
  • For most NLG tasks, there is no one right answer
    (Walker, LREC 2005)
  • Any output that allows the user to successfully
    perform the task is acceptable
  • Using human outputs assumes they are properly
    geared to the larger purpose the outputs are
    meant for. (KSJ, p.c.)

3
2. What's good for the goose
  • Most important criterion is fitness for purpose
  • Can't compare output of systems designed for different purposes
  • NLG systems (unlike parsing and MT?) serve a
    wide range of functions

4
3. Don't count on metrics
  • Summarization and MT communities are questioning
    the usefulness of their shared metrics
  • BLEU does not correlate with human judgements of translation quality (Callison-Burch et al., EACL 2006)
  • BLEU should only be used to compare versions of the same system (Knight, EACL 2006 invited talk); see the sketch after this list
  • Will nuggets of pyramids topple over?
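To make the risk concrete, here is a minimal sketch, assuming NLTK is installed; the forecast sentences are invented for illustration and are not from the presentation. A verbatim copy of the reference scores perfectly, while an equally acceptable paraphrase scores near zero, because BLEU rewards n-gram overlap rather than fitness for purpose.

    # Minimal sketch, assuming NLTK is installed; sentences invented.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1  # avoid zero n-gram counts
    reference = "heavy rain expected in the north this afternoon".split()
    verbatim = "heavy rain expected in the north this afternoon".split()
    paraphrase = "rain will be heavy across the north later today".split()

    # A verbatim copy of the reference gets BLEU = 1.0 ...
    print(sentence_bleu([reference], verbatim, smoothing_function=smooth))
    # ... while an equally acceptable paraphrase scores near zero (only
    # "the north" and a few unigrams overlap with the reference).
    print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))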

5
4. What's the input?
  • There is no agreed input for any stage of the NLG
    pipeline
  • There is no agreement even on where the NLG problem starts, e.g., in weather report generation
  • Is the input raw weather data, or significant events determined by a weather analysis program?
  • Weather forecasting is not part of the NLG problem!
  • But the quality of the text depends on the quality of the data analysis!

6
5. What to standardize/evaluate?
  • Realization (for example)
  • Should the input contain
    • rhetorical goals (à la Hovy)?
    • information structure?
  • Should the output contain
    • prosodic markup?

7
6. Plug-and-play delusion
  • Requires agreeing on interfaces at each stage of the pipeline
  • Not "it's gonna be XML"
  • Must define the representations to be passed and their semantics (à la RAGS); a minimal illustration follows
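As an illustration of the difference between agreeing on a wire format and agreeing on semantics, here is a hypothetical sketch of a typed inter-stage representation; the class and field names are invented for this example and are not the RAGS definitions.

    # Hypothetical sketch: a typed representation passed from a document
    # planner to a microplanner; names invented, not the RAGS definitions.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Message:
        """One atomic piece of content with an agreed meaning."""
        predicate: str             # e.g. "rain-event"
        arguments: Dict[str, str]  # role -> value, e.g. {"intensity": "heavy"}

    @dataclass
    class DocumentPlan:
        """What the document planner hands to the microplanner."""
        rhetorical_relation: str   # e.g. "elaboration", "contrast"
        nucleus: Message
        satellites: List[Message] = field(default_factory=list)

    # Both stages must agree not just that this serializes to XML, but on
    # what a Message and a rhetorical relation mean.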

8
7. Who will pay the piper?
  • It's pretty clear why DARPA pays for ASR, MT, Summarization, TDT, TREC, etc.
  • What's the killer app for NLG?
  • The fact that NSF is holding this workshop and
    consulting the research community is a very good
    sign

9
8. Stifling Science
  • To push this forward, we have to agree on the
    input (and interfaces)
  • Whatever we agree on will limit the phenomena we
    study and the theories we can test
  • E.g., SPUD
  • Hard to find a task that allows study of all the phenomena the community is interested in
  • E.g., MapTask

10
What are we evaluating?
  • Is the text (or speech) generated
    • grammatical?
    • natural?
    • easy to comprehend?
    • memorable?
    • suitable for enabling the user to achieve their intended purpose?

11
Recommendations
  • Must be clear about who is going to learn what
    from the (very large) effort
  • The task chosen must
    • be realistic, i.e., reflect how effectively the generated text (or speech) enables the user to achieve their purpose
    • inform NLG research, i.e., help us learn things that enable the development of better systems

12
Evaluation competition for NLG?
Donia Scott and Johanna Moore
  • Evaluation is (obviously!) important
    • but doing this properly in NLG is very hard
  • The progress of the field is (obviously!) important
    • but NLG has always lagged behind NLU, and all signs point to this gap widening
  • There is no evidence to suggest that an evaluation competition, as described by AE, would be a remedy to either problem; it could even be further damaging
  • Read our paper-ette

13
How can we progress the field?
  • Very careful consideration should be given to why NLG is in decline
    • Science: few people are tackling the hard theoretical problems
    • Applications: no killer app has yet emerged
      • for most current applications, the NLG component is only an engineering problem
  • Very careful consideration should be given to how best to
    • meet the real evaluation needs of the field
    • enable sharing/re-use of data and components
    • progress the science
  • Do better science!
    • Where possible/suitable, use RAGS and take it further
  • Evaluation competition?
    • the notion of task as conceived is simplistic/shallow
    • who is supposed to learn from this?? It is not enough to say that we'll learn about NLG
    • the field will need to be in much better shape to benefit from such a beast

14
Thank You!