Cécile Paris, Nathalie Colineau and Ross Wilkinson

1
NLG Systems Evaluation: Establishing the Big
Picture
  • Cécile Paris, Nathalie Colineau and Ross
    Wilkinson
  • CSIRO ICT Centre
  • Sydney, Australia

2
What have we learnt from a shared task approach
from our siblings (e.g., IR)?
  • Advantages
  • Some algorithm/system comparability tests
    (e.g., inverse frequency works, length
    normalisation does not)
  • Some shared resources
  • (Recognised) Disadvantages
  • It will only tell you some of the things one
    needs to know: important elements will be missed
  • It does not allow the community to answer some
    important theoretical and practical questions
  • Too narrow
  • Note
  • No belief that there is a perfect system
  • There are some standards, but not a gold standard

3
Some beliefs?
  • Subtasks and input/output requirements need to be
    standardised to make core technologies truly
    comparable.
  • What needs to be evaluated is an approach with
    its characteristics and its application context.
  • To evaluate systems/approaches we need to compare
    them in a shared task.
  • Comparison is not a requirement for evaluation,
    nor is a shared task a requirement for comparison.
  • Quality of systems equates to quality of output
    (i.e., English text).
  • A system cannot be reduced to its output: there
    are other attributes.
  • There has to be a gold standard.
  • One measure cannot account for everything (even
    if we were to look at the quality of the output
    only).
  • One NLG technique works better than another.
  • Vive la différence! There is no such thing as
    one size fits all: typically there are pros and
    cons.

4
We can compare apples and pears
  • We do it every day for many things (e.g.,
    usefulness of comparisons as found in consumer
    reports)
  • Comparison
  • Does not require exact similarity
  • But focuses on a set of characteristics/attributes.
  • Depending on situations and needs, different
    characteristics are required or favoured over
    others: fitness for purpose
  • We propose a framework in which to describe
    characteristics of NLG systems, modules or
    approaches

5
Example: Buying a car
Hard constraint: must be between 30,000 and
40,000
RRP 36,490: Manual, 4WD, 5 doors, 7 seats,
145 kW, 3.50 L, Origin: Korea, 2007
RRP 31,990: Manual, Convertible, 2 doors, 4
seats, 82 kW, 1.60 L, Origin: Spain, 2004
Convertible (RRP 31,990), compared with the 4WD:
  • General: convertible
  • Safety: front side airbag; brake assist system;
    convertible rollover protection; rain sensor
    windshield wipers
  • Dimensions: 277 mm lower; 600 mm smaller curb to
    curb turning circle
  • Engine: clutchless manual gearbox
4WD (RRP 36,490), compared with the convertible:
  • General: 4 wheel drive; larger seating capacity
  • Safety: rear window wipers
  • Dimensions: 766 mm longer; 160 mm wider
  • Engine: 1.9 l larger engine; 2.1 faster
    acceleration 0-100 km/h; 28 l larger fuel tank
Set of attributes that characterise a car:
  • General
  • Safety
  • Dimensions
  • Engine
  • Etc.

6
How can we compare (and choose)?
  • Depends on the criteria of a person (or of a
    situation)
  • Robert's priorities
  • Sports car
  • Size: wants a smaller car
  • Safety: important
  • Bill's priorities
  • Needs 4 wheel drive for camping trips
  • Seating capacity: large
  • Does this mean that one car is better than
    another? No
  • Comparison and evaluation in the abstract is not
    necessarily meaningful
  • What is required is a way, a framework, to
    describe characteristics
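
The car example can be sketched in a few lines of Python: filter on the hard constraint first, then score the remaining cars against one person's priorities. This is only an illustration under stated assumptions. The labels "Car A" and "Car B", the field names, and the way each priority is encoded (for instance, treating "convertible" as a stand-in for "sports car" and seat count as a stand-in for size) are hypothetical and do not come from the slides; only the prices, seats, drive type and engine sizes do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    name: str              # hypothetical label; the slides do not name the cars
    price: int             # RRP as listed on the slide
    four_wheel_drive: bool
    convertible: bool
    seats: int
    engine_litres: float


CARS = [
    Car("Car A", 36490, True, False, 7, 3.5),
    Car("Car B", 31990, False, True, 4, 1.6),
]


def within_budget(car, low=30000, high=40000):
    """Hard constraint from the slide: price between 30,000 and 40,000."""
    return low <= car.price <= high


# Priorities encoded as yes/no checks (an assumed, simplified encoding).
ROBERT = [lambda c: c.convertible, lambda c: c.seats <= 4]
BILL = [lambda c: c.four_wheel_drive, lambda c: c.seats >= 7]


def score(car, priorities):
    """Count how many of a person's priorities the car satisfies."""
    return sum(1 for check in priorities if check(car))


def choose(cars, priorities):
    """Drop cars that violate the hard constraint, then pick the best fit."""
    candidates = [c for c in cars if within_budget(c)]
    return max(candidates, key=lambda c: score(c, priorities))


robert_choice = choose(CARS, ROBERT).name
bill_choice = choose(CARS, BILL).name
```

Both cars pass the hard constraint, yet the two people end up with different choices, which is the point of the slide: neither car is better in the abstract.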

7
Can we apply these ideas to generation systems
(or modules)?
Example: Generating Referring Expressions (GRE)
  • Input: type (e.g., numerical, semantic)
  • Output: type (e.g., English, logical form);
    quality; number of expressions generated
  • Fitness into other modules: place in the overall
    NLG architecture (e.g., requires a text planning
    or a grammar component)
  • Configuration: availability of parameters to
    fine-tune (e.g., user model, domain model)
  • General: execution time

8
An example: Comparison of GRE components
Hard constraint: need referring expressions in
English
System Y: GRE module, English language, Origin:
University Y
System X: GRE module, English language, Origin: Lab
X
  • Input
  • Type: numerical
  • Output
  • Type: English
  • Quality: has been shown to allow people to select
    specific objects in a landscape
  • Fitness into other modules
  • Place into overall NLG architecture
  • Requires a text planning component
  • No additional lexico-grammatical component
    needed
  • Configuration
  • Parameters to fine-tune
  • Yes, user model
  • Requirements: creation of user model
  • Input
  • Type: knowledge base
  • Output
  • Type: logical form
  • Quality: produces appropriate input to a
    functional grammar
  • Fitness into other modules
  • Place into overall NLG architecture
  • Requires a text planning component
  • Requires a functional grammar for realisation
  • Configuration
  • Parameters to fine-tune: no

9
Possible situations/criteria
  • My situation
  • My input is numerical data
  • I need parameters to fine-tune
  • Your situation
  • You have a domain model available
  • You already have a grammar component
  • You need a GRE to plug in
  • Different systems/approaches will be appropriate
  • (A similar debate has taken place for template vs
    planning: there is no best method; it depends on
    what one needs to do)
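
One way to make this matching concrete is to record each GRE module as a small description and each situation as a set of requirements, then check which descriptions fit. The sketch below is an assumption-laden illustration, not the authors' framework: the labels "first" and "second" (the transcript does not make clear which description belongs to System X or System Y), the field names, the decision to treat "domain model available" as knowledge-base input, and the assumption that a text planner is available in both situations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleDescription:
    label: str            # hypothetical label for a described GRE module
    input_type: str
    output_type: str
    requires: frozenset   # other components the module depends on
    tunable: bool         # whether it exposes parameters to fine-tune


@dataclass(frozen=True)
class Situation:
    input_type: str
    available: frozenset  # components the adopter already has
    needs_tuning: bool


MODULES = [
    ModuleDescription("first", "numerical", "English",
                      frozenset({"text planner"}), True),
    ModuleDescription("second", "knowledge base", "logical form",
                      frozenset({"text planner", "functional grammar"}),
                      False),
]


def fits(module, situation):
    """A module fits if its input matches, everything it requires is
    already available, and it is tunable whenever tuning is needed."""
    return (module.input_type == situation.input_type
            and module.requires <= situation.available
            and (module.tunable or not situation.needs_tuning))


def candidates(situation):
    """Labels of all described modules that fit the situation."""
    return [m.label for m in MODULES if fits(m, situation)]


# "My situation": numerical input, fine-tuning needed.
mine = Situation("numerical", frozenset({"text planner"}), True)
# "Your situation": domain model and grammar component already in place.
yours = Situation("knowledge base",
                  frozenset({"text planner", "functional grammar"}), False)
```

With these descriptions, `candidates(mine)` and `candidates(yours)` select different modules, without either module having to be declared better overall.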

10
What we need to develop/agree upon
  • Comprehensive set of characteristics that
    describe and specify NLG components and systems
  • How to measure them? (when they need to be
    measured)
  • Might be qualitative or quantitative
  • Might not be a gold standard
  • Might depend on the characteristics
  • (e.g., different measure for fluency, task
    effectiveness, user satisfaction or cost/ease of
    building a system)

11
A framework for evaluation
  • Inspired by other work, looking beyond
    our immediate siblings, e.g.,
  • Information systems
  • DeLone and McLean 92
  • Cornford et al. 94
  • ISO 9126
  • UM (effectiveness)

12
Need for a more general framework for evaluation
  • Enlarge the view of evaluation
  • Ensure we have a big picture (avoid dangers of
    local view)
  • Organise the possible criteria/ways to think
    about the questions to ask
  • Guides the experimental work
  • Consider NLG in its context and that of its
    stakeholders
  • Consider costs and benefits
  • Allows one to choose system/module best fit for
    purpose
  • Allows for specific evaluation tasks, placing
    them and their results into a larger context

13
A proposed framework
14
Refining the characteristics (with our work)
15
Using the framework to define characteristics:
GRE
What does this allow?
  • Choice: given a requirement, choose a system
    with characteristics that fit the environment
  • New attributes, guided by the framework
  • Comparison and evaluation: given a system/module
    for specific requirements, evaluation with
    other systems can be done for a specific
    characteristic (e.g., user satisfaction, task
    completion, ease of building required input)
16
Impact of such a framework
  • Way to describe a system (component, approach):
    better understanding of strengths and
    weaknesses.
  • Useful for evaluations and comparisons.
  • But also in general
  • Someone needing a component can choose
    appropriate one
  • Someone outside the NLG community can choose a
    module for their own purposes, without knowing
    much about it: increased visibility of the
    field in other communities
  • Way to compare systems (modules, approaches)
    without need to standardise
  • Fit-for-purpose vs generic: not an issue
  • Researchers constrained to work on a specific
    domain/application can still describe their work
    and be part of this activity: no
    exclusion

17
(Almost final) Remarks
  • Big picture
  • Funding
  • Fine-tuning a system for a specific task: no
    longer an issue
  • Attention to important theoretical problems
  • Understanding of weaknesses and strengths of
    systems (modules, approaches)
  • Orthogonal issues
  • Finding balance between talking and doing
  • Shared resources vs shared tasks

18
Moving forward as a community
  • What should we do?
  • Define a set of characteristics to:
  • Understand position and specificity of an
    approach (module, system)
  • Allow descriptions and comparisons
  • How?
  • Reflect on our own work and characterise it in
    terms of its strengths (and weaknesses!) e.g.,
    think about different stakeholders involved in
    construction, maintenance, funding, etc.
  • Use framework as guidance
  • To understand an approach (module, system) from a
    variety of perspectives (e.g., not just the
    output)
  • To know what to evaluate depending on the
    situation
  • To ensure we see the big picture

19
References
  • Cornford, T., Doukidis, G. I. & Forster, D. (1994).
    Experience with a structure, process and outcome
    framework for evaluating an information system.
    Omega, International Journal of Management
    Science, 22 (5), 491-504.
  • DeLone, W. H. & McLean, E. R. (1992). Information
    Systems Success: The Quest for the Dependent
    Variable. Information Systems Research, 3 (1),
    60-96.

20
Outline
  • Misconceptions: what we commonly think is true
  • Can we compare apples and pears to get rid of the
    lemons?
  • How does this apply to NLG?
  • Enlarging the view of evaluation: The Big
    Picture
  • Remarks
  • Moving forward