1. NLG Systems Evaluation: Establishing the Big Picture
- Cécile Paris, Nathalie Colineau and Ross Wilkinson
- CSIRO ICT Centre
- Sydney, Australia
2. What have we learnt from a shared task approach from our siblings (e.g., IR)?
- Advantages
  - Some algorithm/system comparability tests (e.g., inverse document frequency works, length normalisation does not)
  - Some shared resources
- (Recognised) Disadvantages
  - It will only tell you some of the things one needs to know; important elements will be missed
  - It does not allow the community to answer some important theoretical and practical questions
  - Too narrow
- Note
  - No belief that there is a perfect system
  - There are some standards, but not a gold standard
3. Some beliefs?
- Subtasks and input/output requirements need to be standardised to make core technologies truly comparable.
- What needs to be evaluated is an approach, with its characteristics and its application context.
- To evaluate systems/approaches we need to compare them in a shared task.
- Comparison is not a requirement for evaluation, nor is a shared task required for comparison.
- Quality of a system equates to quality of its output (i.e., English text).
- A system cannot be reduced to its output; there are other attributes.
- There has to be a gold standard.
- One measure cannot account for everything (even if we were to look at the quality of the output only).
- One NLG technique works better than another.
- Vive la différence! There is no such thing as one size fits all; typically there are pros and cons.
4. We can compare apples and pears
- We do it every day for many things (e.g., usefulness of comparisons as found in consumer reports)
- Comparison
  - Does not require exact similarity
  - But focuses on a set of characteristics/attributes
- Depending on situations and needs, different characteristics are required or favoured over others: fitness for purpose
- We propose a framework in which to describe the characteristics of NLG systems, modules or approaches
5. Example: Buying a car
- Hard constraint: price must be between 30,000 and 40,000
- RRP 36,490: manual, 4WD, 5 doors, 7 seats, 145 kW, 3.5 L, origin Korea, 2007
- RRP 31,990: manual, convertible, 2 doors, 4 seats, 82 kW, 1.6 L, origin Spain, 2004
- General
  - Convertible
- Safety
  - Front side airbag
  - Brake assist system
  - Convertible rollover protection
  - Rain-sensor windshield wipers
- Dimensions
  - 277 mm lower
  - 600 mm smaller curb-to-curb turning circle
- Engine
  - Clutchless manual gearbox
- General
  - 4-wheel drive
  - Larger seating capacity
- Safety
  - Rear window wipers
- Dimensions
  - 766 mm longer, 160 mm wider
- Engine
  - 1.9 L larger engine
  - 2.1 s faster acceleration 0-100 km/h
  - 28 L larger fuel tank
- Set of attributes that characterise a car (a minimal sketch of this attribute-based description follows this slide)
  - General
  - Safety
  - Dimensions
  - Engine
  - Etc.
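To make the attribute-based description concrete, here is a minimal Python sketch, not taken from the slides: the two example cars are encoded as nested attribute groups (field names such as `rrp` and `power_kw` are invented for illustration) and the hard price constraint is applied before any comparison is attempted.

```python
# Illustrative sketch only: items described by attribute groups, with a hard
# constraint applied before comparison. All field names are invented.

cars = {
    "4WD wagon": {
        "general": {"rrp": 36490, "body": "4WD", "doors": 5, "seats": 7},
        "engine": {"power_kw": 145, "capacity_l": 3.5},
        "origin": {"country": "Korea", "year": 2007},
    },
    "convertible": {
        "general": {"rrp": 31990, "body": "convertible", "doors": 2, "seats": 4},
        "engine": {"power_kw": 82, "capacity_l": 1.6},
        "origin": {"country": "Spain", "year": 2004},
    },
}

def within_budget(car, low=30000, high=40000):
    """Hard constraint: recommended retail price must fall within [low, high]."""
    return low <= car["general"]["rrp"] <= high

candidates = {name: car for name, car in cars.items() if within_budget(car)}
print(sorted(candidates))  # both cars pass the hard constraint
```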
6. How can we compare (and choose)?
- Depends on the criteria of a person (or of a situation)
- Robert's priorities
  - Sports car
  - Size: wants a smaller car
  - Safety: important
- Bill's priorities
  - Needs 4-wheel drive for camping trips
  - Seating capacity: large
- Does this mean that one car is better than another? No
- Comparison and evaluation in the abstract is not necessarily meaningful
- What is required is a framework in which to describe characteristics (a sketch of criteria-based choice follows this slide)
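Continuing the illustrative sketch, "better" only exists relative to someone's priorities over the attribute description. The scoring functions and attribute values below are invented for the example; the point is that the same candidates yield different winners under different criteria.

```python
# Illustrative sketch only: the same candidates, scored under different priorities.

cars = {
    "4WD wagon":   {"body": "4WD",         "seats": 7, "safety_features": 1},
    "convertible": {"body": "convertible", "seats": 4, "safety_features": 4},
}

def roberts_score(car):
    """Robert: wants a small sports car; safety is important."""
    return (2 * (car["body"] == "convertible")   # sporty
            + (car["seats"] <= 4)                # smaller car
            + car["safety_features"])            # safety matters

def bills_score(car):
    """Bill: needs 4WD for camping trips and a large seating capacity."""
    return 2 * (car["body"] == "4WD") + (car["seats"] >= 7)

def choose(candidates, score):
    """Pick the candidate that best fits one person's criteria."""
    return max(candidates, key=lambda name: score(candidates[name]))

print(choose(cars, roberts_score))  # -> convertible
print(choose(cars, bills_score))    # -> 4WD wagon
```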
7. Can we apply these ideas to generation systems (or modules)?
Example: Generating Referring Expressions (GRE) (a minimal descriptor sketch follows this slide)
- Input
  - Type (e.g., numerical, semantic)
- Output
  - Type (e.g., English, logical form)
  - Quality
  - Number of expressions generated
- Fitness into other modules
  - Place in the overall NLG architecture (e.g., requires a text planning or a grammar component)
- Configuration
  - Availability of parameters to fine-tune (e.g., user model, domain model)
- General
  - Execution time
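As a minimal sketch (not a prescribed schema), the five dimensions above could be recorded in a simple descriptor; the field names below are invented for illustration.

```python
# Illustrative sketch: one possible record of a GRE component's characteristics
# along the five dimensions (Input, Output, Fitness, Configuration, General).

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GREDescriptor:
    name: str
    input_type: str                        # Input: e.g. "numerical", "semantic"
    output_type: str                       # Output: e.g. "English", "logical form"
    output_quality: Optional[str] = None   # Output: quality evidence, if any
    architecture_needs: List[str] = field(default_factory=list)   # Fitness: required companion modules
    tunable_parameters: List[str] = field(default_factory=list)   # Configuration: e.g. "user model"
    execution_time_ms: Optional[float] = None                     # General: e.g. execution time
```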
8. An example: Comparison of GRE components (a selection sketch follows this slide)
- Hard constraint: need referring expressions in English
- System Y: GRE module, English language. Origin: University Y
- System X: GRE module, English language. Origin: Lab X
Characteristics of one of the modules:
- Input
  - Type: numerical
- Output
  - Type: English
  - Quality: has been shown to allow people to select specific objects in a landscape
- Fitness into other modules
  - Place in the overall NLG architecture
    - Requires a text planning component
    - No additional lexico-grammatical component needed
- Configuration
  - Parameters to fine-tune
    - Yes: user model
    - Requirement: creation of a user model
Characteristics of the other module:
- Input
  - Type: knowledge base
- Output
  - Type: logical form
  - Quality: produces appropriate input to a functional grammar
- Fitness into other modules
  - Place in the overall NLG architecture
    - Requires a text planning component
    - Requires a functional grammar for realisation
- Configuration
  - Parameters to fine-tune: no
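A possible selection sketch follows. It encodes the two characteristic descriptions above as plain dictionaries (the slide does not say which block belongs to System X or System Y, so the names below are placeholders) and applies the hard constraint that referring expressions must come out in English.

```python
# Illustrative sketch only: filter candidate GRE modules by hard constraints.
# "module_a"/"module_b" are placeholders; the assignment to System X/Y is not
# specified on the slide.

modules = {
    "module_a": {
        "input_type": "numerical",
        "output_type": "English",
        "architecture_needs": ["text planner"],
        "tunable_parameters": ["user model"],
    },
    "module_b": {
        "input_type": "knowledge base",
        "output_type": "logical form",
        "architecture_needs": ["text planner", "functional grammar"],
        "tunable_parameters": [],
    },
}

def satisfies(module, hard_constraints):
    """A module is a candidate only if it meets every hard constraint."""
    return all(module[key] == value for key, value in hard_constraints.items())

hard = {"output_type": "English"}
print([name for name, m in modules.items() if satisfies(m, hard)])  # -> ['module_a']
```

The same check extends to the situations on the next slide, e.g. by adding an input-type or tunable-parameter requirement to the constraint dictionary.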
9. Possible situations/criteria
- My situation
  - My input is numerical data
  - I need parameters to fine-tune
- Your situation
  - You have a domain model available
  - You already have a grammar component
  - You need a GRE module to plug in
- Different systems/approaches will be appropriate
- (A similar debate has taken place for template vs. planning: there is no best method; it depends on what one needs to do)
10. What we need to develop/agree upon
- A comprehensive set of characteristics that describe and specify NLG components and systems
- How to measure them (when they need to be measured)?
  - Might be qualitative or quantitative
  - Might not be a gold standard
  - Might depend on the characteristic (e.g., different measures for fluency, task effectiveness, user satisfaction or cost/ease of building a system)
11. A framework for evaluation
- Inspired by other work, looking beyond our immediate siblings, e.g.:
  - Information systems
    - DeLone and McLean 92
    - Cornford et al. 94
  - ISO 9126
  - UM (effectiveness)
12. Need for a more general framework for evaluation
- Enlarge the view of evaluation
  - Ensure we have a big picture (avoid the dangers of a local view)
  - Organise the possible criteria/ways to think about the questions to ask
  - Guide the experimental work
- Consider NLG in its context and that of its stakeholders
  - Consider costs and benefits
- Allows one to choose the system/module that best fits the purpose
- Allows for specific evaluation tasks, placing them and their results into a larger context
13. A proposed framework
14. Refining the characteristics (with our work)
15. Using the framework to define characteristics: GRE
What does this allow?
- Choice: given a requirement, choose the system with characteristics that fit the environment
- New attributes, guided by the framework
- Comparison and evaluation: given a system/module for specific requirements, evaluation against other systems can be done on a specific characteristic (e.g., user satisfaction, task completion, ease of building the required input)
16. Impact of such a framework
- A way to describe a system (component, approach): better understanding of strengths and weaknesses
- Useful for evaluations and comparisons
- But also in general:
  - Someone needing a component can choose an appropriate one
  - Someone outside the NLG community can choose a module for their own purposes, without knowing much about it: increases the visibility of the field in other communities
- A way to compare systems (modules, approaches) without the need to standardise
  - Fit-for-purpose vs. generic is not an issue
  - Researchers constrained to work on a specific domain/application can still describe their work and be part of this activity: no exclusion
17. (Almost final) Remarks
- Big picture
  - Funding
  - Fine-tuning a system for a specific task is no longer an issue
  - Attention to important theoretical problems
  - Understanding of the weaknesses and strengths of systems (modules, approaches)
- Orthogonal issues
  - Finding a balance between talking and doing
  - Shared resources vs. shared tasks
18. Moving forward as a community
- What should we do?
  - Define a set of characteristics to
    - Understand the position and specificity of an approach (module, system)
    - Allow descriptions and comparisons
- How?
  - Reflect on our own work and characterise it in terms of its strengths (and weaknesses!), e.g., think about the different stakeholders involved in construction, maintenance, funding, etc.
  - Use the framework as guidance
    - To understand an approach (module, system) from a variety of perspectives (e.g., not just the output)
    - To know what to evaluate depending on the situation
    - To ensure we see the big picture
19. References
- Cornford, T., Doukidis, G.I. and Forster, D. (1994). Experience with a structure, process and outcome framework for evaluating an information system. Omega, International Journal of Management Science, 22(5), 491-504.
- DeLone, W.H. and McLean, E.R. (1992). Information Systems Success: The Quest for the Dependent Variable. Information Systems Research, 3(1), 60-96.
20. Outline
- Misconceptions: what we commonly think is true
- Can we compare apples and pears to get rid of the lemons?
- How does this apply to NLG?
- Enlarging the view of evaluation: the Big Picture
- Remarks
- Moving forward