1. NLG Systems Evaluation: Establishing the Big Picture
- Cécile Paris, Nathalie Colineau and Ross Wilkinson
- CSIRO ICT Centre
- Sydney, Australia
2. What have we learnt from a shared task approach from our siblings (e.g., IR)?
- Advantages
  - Some algorithm/system comparability tests (e.g., inverse document frequency works, length normalisation does not)
  - Some shared resources
- (Recognised) Disadvantages
  - It will only tell you some of the things one needs to know; important elements will be missed
  - It does not allow the community to answer some important theoretical and practical questions
  - Too narrow
- Note
  - No belief that there is a perfect system
  - There are some standards, but not a gold standard
3. Some beliefs?
- Subtasks and input/output requirements need to be standardised to make core technologies truly comparable.
- What needs to be evaluated is an approach, with its characteristics and its application context.
- To evaluate systems/approaches we need to compare them in a shared task.
- Comparison is not a requirement for evaluation, nor is a shared task required for comparison.
- Quality of a system equates to quality of its output (i.e., English text).
- A system cannot be reduced to its output; there are other attributes.
- There has to be a gold standard.
- One measure cannot account for everything (even if we were to look at the quality of the output only).
- One NLG technique works better than another.
- Vive la différence! There is no such thing as one size fits all; typically there are pros and cons.
4. We can compare apples and pears
- We do it every day for many things (e.g., usefulness of comparisons as found in consumer reports)
- Comparison
  - Does not require exact similarity
  - But focuses on a set of characteristics/attributes
- Depending on situations and needs, different characteristics are required or favoured over others: fitness for purpose
- We propose a framework in which to describe the characteristics of NLG systems, modules or approaches
5. Example: Buying a car
- Hard constraint: price must be between 30,000 and 40,000
- RRP 36,490: manual, 4WD, 5 doors, 7 seats, 145 kW, 3.5 L, origin Korea, 2007
- RRP 31,990: manual, convertible, 2 doors, 4 seats, 82 kW, 1.6 L, origin Spain, 2004
- General
  - Convertible
- Safety
  - Front side airbag
  - Brake assist system
  - Convertible rollover protection
  - Rain-sensor windshield wipers
- Dimensions
  - 277 mm lower
  - 600 mm smaller curb-to-curb turning circle
- Engine
  - Clutchless manual gearbox
- General
  - 4-wheel drive
  - Larger seating capacity
- Safety
  - Rear window wipers
- Dimensions
  - 766 mm longer, 160 mm wider
- Engine
  - 1.9 L larger engine
  - 2.1 s faster acceleration 0-100 km/h
  - 28 L larger fuel tank
- Set of attributes that characterise a car (a minimal sketch of this attribute-based description follows this slide)
  - General
  - Safety
  - Dimensions
  - Engine
  - Etc.
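To make the attribute-based description concrete, here is a minimal Python sketch, not taken from the slides: the two example cars are encoded as nested attribute groups (field names such as `rrp` and `power_kw` are invented for illustration) and the hard price constraint is applied before any comparison is attempted.

```python
# Illustrative sketch only: items described by attribute groups, with a hard
# constraint applied before comparison. All field names are invented.

cars = {
    "4WD wagon": {
        "general": {"rrp": 36490, "body": "4WD", "doors": 5, "seats": 7},
        "engine": {"power_kw": 145, "capacity_l": 3.5},
        "origin": {"country": "Korea", "year": 2007},
    },
    "convertible": {
        "general": {"rrp": 31990, "body": "convertible", "doors": 2, "seats": 4},
        "engine": {"power_kw": 82, "capacity_l": 1.6},
        "origin": {"country": "Spain", "year": 2004},
    },
}

def within_budget(car, low=30000, high=40000):
    """Hard constraint: recommended retail price must fall within [low, high]."""
    return low <= car["general"]["rrp"] <= high

candidates = {name: car for name, car in cars.items() if within_budget(car)}
print(sorted(candidates))  # both cars pass the hard constraint
```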
6. How can we compare (and choose)?
- Depends on the criteria of a person (or of a situation)
- Robert's priorities
  - Sports car
  - Size: wants a smaller car
  - Safety: important
- Bill's priorities
  - Needs 4-wheel drive for camping trips
  - Seating capacity: large
- Does this mean that one car is better than another? No
- Comparison and evaluation in the abstract is not necessarily meaningful
- What is required is a framework in which to describe characteristics (a sketch of criteria-based choice follows this slide)
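Continuing the illustrative sketch, "better" only exists relative to someone's priorities over the attribute description. The scoring functions and attribute values below are invented for the example; the point is that the same candidates yield different winners under different criteria.

```python
# Illustrative sketch only: the same candidates, scored under different priorities.

cars = {
    "4WD wagon":   {"body": "4WD",         "seats": 7, "safety_features": 1},
    "convertible": {"body": "convertible", "seats": 4, "safety_features": 4},
}

def roberts_score(car):
    """Robert: wants a small sports car; safety is important."""
    return (2 * (car["body"] == "convertible")   # sporty
            + (car["seats"] <= 4)                # smaller car
            + car["safety_features"])            # safety matters

def bills_score(car):
    """Bill: needs 4WD for camping trips and a large seating capacity."""
    return 2 * (car["body"] == "4WD") + (car["seats"] >= 7)

def choose(candidates, score):
    """Pick the candidate that best fits one person's criteria."""
    return max(candidates, key=lambda name: score(candidates[name]))

print(choose(cars, roberts_score))  # -> convertible
print(choose(cars, bills_score))    # -> 4WD wagon
```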
7. Can we apply these ideas to generation systems (or modules)?
Example: Generating Referring Expressions (GRE) (a minimal descriptor sketch follows this slide)
- Input
  - Type (e.g., numerical, semantic)
- Output
  - Type (e.g., English, logical form)
  - Quality
  - Number of expressions generated
- Fitness into other modules
  - Place in the overall NLG architecture (e.g., requires a text planning or a grammar component)
- Configuration
  - Availability of parameters to fine-tune (e.g., user model, domain model)
- General
  - Execution time
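As a minimal sketch (not a prescribed schema), the five dimensions above could be recorded in a simple descriptor; the field names below are invented for illustration.

```python
# Illustrative sketch: one possible record of a GRE component's characteristics
# along the five dimensions (Input, Output, Fitness, Configuration, General).

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GREDescriptor:
    name: str
    input_type: str                        # Input: e.g. "numerical", "semantic"
    output_type: str                       # Output: e.g. "English", "logical form"
    output_quality: Optional[str] = None   # Output: quality evidence, if any
    architecture_needs: List[str] = field(default_factory=list)   # Fitness: required companion modules
    tunable_parameters: List[str] = field(default_factory=list)   # Configuration: e.g. "user model"
    execution_time_ms: Optional[float] = None                     # General: e.g. execution time
```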
8. An example: Comparison of GRE components (a selection sketch follows this slide)
- Hard constraint: need referring expressions in English
- System Y: GRE module, English language. Origin: University Y
- System X: GRE module, English language. Origin: Lab X
Characteristics of one of the modules:
- Input
  - Type: numerical
- Output
  - Type: English
  - Quality: has been shown to allow people to select specific objects in a landscape
- Fitness into other modules
  - Place in the overall NLG architecture
    - Requires a text planning component
    - No additional lexico-grammatical component needed
- Configuration
  - Parameters to fine-tune
    - Yes: user model
    - Requirement: creation of a user model
Characteristics of the other module:
- Input
  - Type: knowledge base
- Output
  - Type: logical form
  - Quality: produces appropriate input to a functional grammar
- Fitness into other modules
  - Place in the overall NLG architecture
    - Requires a text planning component
    - Requires a functional grammar for realisation
- Configuration
  - Parameters to fine-tune: no
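A possible selection sketch follows. It encodes the two characteristic descriptions above as plain dictionaries (the slide does not say which block belongs to System X or System Y, so the names below are placeholders) and applies the hard constraint that referring expressions must come out in English.

```python
# Illustrative sketch only: filter candidate GRE modules by hard constraints.
# "module_a"/"module_b" are placeholders; the assignment to System X/Y is not
# specified on the slide.

modules = {
    "module_a": {
        "input_type": "numerical",
        "output_type": "English",
        "architecture_needs": ["text planner"],
        "tunable_parameters": ["user model"],
    },
    "module_b": {
        "input_type": "knowledge base",
        "output_type": "logical form",
        "architecture_needs": ["text planner", "functional grammar"],
        "tunable_parameters": [],
    },
}

def satisfies(module, hard_constraints):
    """A module is a candidate only if it meets every hard constraint."""
    return all(module[key] == value for key, value in hard_constraints.items())

hard = {"output_type": "English"}
print([name for name, m in modules.items() if satisfies(m, hard)])  # -> ['module_a']
```

The same check extends to the situations on the next slide, e.g. by adding an input-type or tunable-parameter requirement to the constraint dictionary.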
9. Possible situations/criteria
- My situation
  - My input is numerical data
  - I need parameters to fine-tune
- Your situation
  - You have a domain model available
  - You already have a grammar component
  - You need a GRE module to plug in
- Different systems/approaches will be appropriate
- (A similar debate has taken place for template vs. planning: there is no best method; it depends on what one needs to do)
10. What we need to develop/agree upon
- A comprehensive set of characteristics that describe and specify NLG components and systems
- How to measure them (when they need to be measured)?
  - Might be qualitative or quantitative
  - Might not be a gold standard
  - Might depend on the characteristic (e.g., different measures for fluency, task effectiveness, user satisfaction or cost/ease of building a system)
11. A framework for evaluation
- Inspired by other work, looking beyond our immediate siblings, e.g.:
  - Information systems
    - DeLone and McLean 92
    - Cornford et al. 94
  - ISO 9126
  - UM (effectiveness)
12. Need for a more general framework for evaluation
- Enlarge the view of evaluation
  - Ensure we have a big picture (avoid the dangers of a local view)
  - Organise the possible criteria/ways to think about the questions to ask
  - Guide the experimental work
- Consider NLG in its context and that of its stakeholders
  - Consider costs and benefits
- Allows one to choose the system/module that best fits the purpose
- Allows for specific evaluation tasks, placing them and their results into a larger context
13. A proposed framework
14. Refining the characteristics (with our work)
15. Using the framework to define characteristics: GRE
What does this allow?
- Choice: given a requirement, choose the system with characteristics that fit the environment
- New attributes, guided by the framework
- Comparison and evaluation: given a system/module for specific requirements, evaluation against other systems can be done on a specific characteristic (e.g., user satisfaction, task completion, ease of building the required input)
16. Impact of such a framework
- A way to describe a system (component, approach): better understanding of strengths and weaknesses
- Useful for evaluations and comparisons
- But also in general:
  - Someone needing a component can choose an appropriate one
  - Someone outside the NLG community can choose a module for their own purposes, without knowing much about it: increases the visibility of the field in other communities
- A way to compare systems (modules, approaches) without the need to standardise
  - Fit-for-purpose vs. generic is not an issue
  - Researchers constrained to work on a specific domain/application can still describe their work and be part of this activity: no exclusion
17. (Almost final) Remarks
- Big picture
  - Funding
  - Fine-tuning a system for a specific task is no longer an issue
  - Attention to important theoretical problems
  - Understanding of the weaknesses and strengths of systems (modules, approaches)
- Orthogonal issues
  - Finding a balance between talking and doing
  - Shared resources vs. shared tasks
18. Moving forward as a community
- What should we do?
  - Define a set of characteristics to
    - Understand the position and specificity of an approach (module, system)
    - Allow descriptions and comparisons
- How?
  - Reflect on our own work and characterise it in terms of its strengths (and weaknesses!), e.g., think about the different stakeholders involved in construction, maintenance, funding, etc.
  - Use the framework as guidance
    - To understand an approach (module, system) from a variety of perspectives (e.g., not just the output)
    - To know what to evaluate depending on the situation
    - To ensure we see the big picture
19. References
- Cornford, T., Doukidis, G.I. and Forster, D. (1994). Experience with a structure, process and outcome framework for evaluating an information system. Omega, International Journal of Management Science, 22(5), 491-504.
- DeLone, W.H. and McLean, E.R. (1992). Information Systems Success: The Quest for the Dependent Variable. Information Systems Research, 3(1), 60-96.
20. Outline
- Misconceptions: what we commonly think is true
- Can we compare apples and pears to get rid of the lemons?
- How does this apply to NLG?
- Enlarging the view of evaluation: the Big Picture
- Remarks
- Moving forward