Title: Natural Language Generation
1. Natural Language Generation
- Alice Oh
- aliceo_at_cs.cmu.edu
- 18 November 2014
2. What is NLG?
- Natural Language Understanding (NLU)
- Natural Language Generation (NLG)
- NLG has been an active research topic in the machine translation and automatic summarization (e.g., of stock quotes and medical information) communities.
[Diagram: NLU maps Text to a Semantic (Syntactic) Representation; NLG maps a Semantic (Syntactic) Representation to Text]
3. Application of NLG in MT
- Generation of output
- Generation of paraphrase
- Generation of interlingua (SQL queries, summarization tables, etc.)
- Generation of controlled language
- The whole MT problem can be viewed as an NLG problem!
[Diagram: Source Language -> NLG -> Target Language]
4. Why spend time/effort on NLG?
- What is the problem here? There are many more researchers working on NLU than on NLG (esp. in the U.S.)
- Why? Because of the "squeezing out toothpaste" analogy, which is not true!
- Why is NLG important? It's the only thing users see/hear (do not ruin your otherwise great system)
- What makes NLG difficult? NLG needs to KNOW vs. UNDERSTAND
- Existing comprehension systems as a rule extract considerably less information from a text than a generator must appreciate in generating one. Examples include the reasons why a given word or syntactic construction is used rather than an alternative, what constitutes the style and rhetoric appropriate to a given genre and situation, or why information is clustered in one pattern of sentences rather than another. (McDonald, 1993)
5. Levels of NLG
6. Surface Realization
- Determining how the underlying content of a text should be mapped into a sequence of grammatically correct sentences. An NLG system has to decide which syntactic form to use, and it has to ensure that the resulting text is syntactically and morphologically correct. (Mellish and Dale, 1998)
7. Why Surface Realization?
- Relative agreement on
  - input
  - output
  - goals
- More universally needed in MT than
  - content planning
  - text planning
- Somewhat easier to compare different techniques
8. Surface Realization: Different Techniques
- Rule-based
- Templates
- Corpus-based
- Rules + Corpus
- Rules + Templates
9. Surface Realization: Rule-Based
- Generates text using generation grammar rules (similar to rule-based understanding techniques)
- Most popular in the research community
- Long development time
- Often based on a specific linguistic theory (Systemic Functional Grammar seems most popular)
- Portable across domains (when grammar coverage is good)
- Systems: FUF/SURGE, Penman, KPML
10. Surface Realization: Rule-Based
- Input: extensive semantic/syntactic features (e.g., from interlingua)
- Output: high-quality sentences
- Knowledge Sources: generation grammar, (domain-specific) lexicon
- Knowledge Acquisition: hand-crafted
- Degradation from underspecified input: default handling
- Degradation from lacking knowledge: lower-quality output
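A minimal sketch of the rule-based setup summarized above, assuming a toy hand-crafted lexicon and a single sentence pattern (determiner, noun, verb). The feature names, the `DEFAULTS` table, and the `realize` function are all hypothetical and are not taken from FUF/SURGE, Penman, or KPML; the point is only to show feature-driven realization with default handling for underspecified input.

```python
# Hypothetical defaults applied when the input leaves a feature unspecified.
DEFAULTS = {"tense": "present", "number": "singular", "definite": True}

# Tiny hand-crafted, domain-specific lexicon (illustrative only).
LEXICON = {
    "flight": {"singular": "flight", "plural": "flights"},
    "arrive": {"singular": "arrives", "plural": "arrive", "past": "arrived"},
}

def realize(spec):
    """Map a feature structure to a sentence of the form: Det Noun Verb."""
    feats = {**DEFAULTS, **spec}          # default handling for missing features
    number = feats["number"]
    noun = LEXICON[feats["subject"]][number]
    if feats["definite"]:
        det = "the"
    else:
        det = "a" if number == "singular" else "some"
    forms = LEXICON[feats["verb"]]
    # Agreement rule: past tense ignores number, present tense agrees with it.
    verb = forms["past"] if feats["tense"] == "past" else forms[number]
    sentence = f"{det} {noun} {verb}"
    return sentence[0].upper() + sentence[1:] + "."

print(realize({"subject": "flight", "verb": "arrive"}))
print(realize({"subject": "flight", "verb": "arrive",
               "number": "plural", "tense": "past", "definite": False}))
```

The first call leaves tense, number, and definiteness unspecified, so the defaults fill them in; a word missing from the lexicon would raise a `KeyError` here, whereas a fuller system would fall back to lower-quality output.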
11. Surface Realization: Templates
- Generates text using canned expressions and hand-crafted templates
- Popular in commercial applications where similar documents are produced in large quantities (e.g., customer service letter writing)
- Also popular in systems where generated output spans a narrow range (e.g., spoken dialog systems)
- Systems: CLINT (business-letter writing)
12. Surface Realization: Templates
- Input: minimally specified
- Output: limited set of sentences
- Knowledge Sources: templates
- Knowledge Acquisition: hand-crafted by looking at domain-specific corpora
- Degradation from underspecified input: N/A
- Degradation from lacking knowledge: no output
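A minimal sketch of template-based realization as summarized above, assuming a hypothetical template table keyed by message type. The template texts and the `realize_template` function are made up for illustration and do not come from CLINT or any other system.

```python
# Hypothetical hand-crafted templates, keyed by message type.
TEMPLATES = {
    "confirm_booking": "Your flight to {city} on {date} is confirmed.",
    "request_date": "What day would you like to travel?",  # canned expression
}

def realize_template(act, **slots):
    """Fill the template for `act`; return None when no template applies."""
    template = TEMPLATES.get(act)
    if template is None:
        return None                      # lacking knowledge: no output
    try:
        return template.format(**slots)
    except KeyError:
        return None                      # a required slot value was not supplied

print(realize_template("confirm_booking", city="Boston", date="Friday"))
print(realize_template("cancel_booking"))
```

The output set is limited to what the templates cover, and an unknown message type yields nothing at all, matching the "no output" degradation listed above.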
13. Surface Realization: Corpus-Based
- Developed for a task-oriented spoken dialog system
- Implemented in the CMU Communicator System
- Fast prototyping for different domains
- Natural output for spoken dialog
- A first attempt at truly corpus-based stochastic generation
- Systems: CMU, IBM
14. Surface Realization: Corpus-Based
- Input: dialog act
- Output: medium-quality sentences
- Knowledge Sources: language models
- Knowledge Acquisition: domain-specific corpora
- Degradation from underspecified input: N/A
- Degradation from lacking knowledge: no output
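A minimal sketch of corpus-based generation from a dialog act, assuming one bigram language model per dialog act trained on a tiny made-up corpus. The corpus, the act name, and the functions are hypothetical. Greedy decoding is used here only so the output is reproducible; it is a simplification, and a stochastic generator would sample from the model instead.

```python
from collections import Counter, defaultdict

# Hypothetical domain corpus: utterances grouped by dialog act.
CORPUS = {
    "query_depart_time": [
        "what time would you like to leave",
        "what time do you want to leave",
        "what time would you like to depart",
    ],
}

def train_bigrams(sentences):
    """Count word bigrams, with <s> and </s> as sentence boundaries."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

MODELS = {act: train_bigrams(sents) for act, sents in CORPUS.items()}

def generate(act, max_len=20):
    """Greedily follow the most frequent bigram until </s> or max_len words."""
    model = MODELS.get(act)
    if model is None:
        return None                      # no model for this act: no output
    word, out = "<s>", []
    for _ in range(max_len):
        # Most frequent successor; ties are broken alphabetically.
        word = min(model[word].items(), key=lambda kv: (-kv[1], kv[0]))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate("query_depart_time"))
```

All knowledge comes from the corpus, so moving to a new domain means collecting new utterances rather than writing rules, and a dialog act with no training data produces no output.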
15. Surface Realization: Rules + Corpus
- Nitrogen
  - Developed to account for underspecified input, which causes problems for rule-based techniques
  - Developed for a machine translation project at ISI/USC
- Stochastic Generation at U. Edinburgh
  - Accounts for sentence-level attributes (in addition to word-level attributes)
  - An interesting technique for applying a certain author's style to another text (applying Shakespeare's style, e.g., sentence length distribution and vocabulary diversity, to Mark Twain!)
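The two style attributes named above, sentence length distribution and vocabulary diversity, can be measured with a few lines of code. This is only a rough sketch: the `style_profile` function is hypothetical, the sentence splitting is naive, and vocabulary diversity is approximated here by the type-token ratio, which is an assumption rather than the measure used in the Edinburgh work.

```python
import re
from collections import Counter

def style_profile(text):
    """Return simple sentence-level and word-level style statistics."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    tokens = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    if not lengths or not tokens:
        raise ValueError("text contains no sentences or words")
    return {
        "sentence_lengths": Counter(lengths),           # length -> frequency
        "mean_sentence_length": sum(lengths) / len(lengths),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }

profile = style_profile("The cat sat. The cat ran away!")
print(profile["mean_sentence_length"], profile["type_token_ratio"])
```

A generator aiming at a target author's style could compare the profile of its output against the profile of that author's corpus.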
16. Surface Realization: Rules + Corpus
- Input: underspecified syntactic/semantic features
- Output: high-quality sentences
- Knowledge Sources: language models
- Knowledge Acquisition: (domain-specific) corpora
- Degradation from underspecified input: accounted for by language models
- Degradation from lacking knowledge: lower-quality output
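A minimal sketch of how language models can account for underspecified input: simple rules overgenerate every candidate the input allows, and a bigram model trained on a corpus picks the most likely one. The corpus, the word lists, and the function names are hypothetical, and add-one smoothing is an assumption made for brevity; this illustrates the general overgenerate-and-rank idea rather than any particular system's implementation.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical domain corpus used to train the ranking model.
CORPUS = [
    "the flight arrives at noon",
    "the flights arrive at noon",
    "a flight arrives at night",
]

unigrams, bigrams = Counter(), Counter()
for sentence in CORPUS:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
VOCAB_SIZE = len(unigrams)

def log_prob(sentence):
    """Bigram log-probability with add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + VOCAB_SIZE))
        for a, b in zip(tokens, tokens[1:])
    )

NOUN_FORMS = {"singular": ["flight"], "plural": ["flights"]}

def candidates(spec):
    """Rules overgenerate: try every form the input does not rule out."""
    number = spec.get("number")          # may be missing (underspecified)
    nouns = NOUN_FORMS[number] if number else ["flight", "flights"]
    for det, noun, verb in product(["the", "a"], nouns, ["arrive", "arrives"]):
        yield f"{det} {noun} {verb} at {spec['time']}"

def realize(spec):
    """Return the candidate the language model scores highest."""
    return max(candidates(spec), key=log_prob)

print(realize({"time": "noon"}))
print(realize({"time": "noon", "number": "plural"}))
```

No agreement rule is written down anywhere: ungrammatical candidates such as "the flight arrive at noon" lose simply because their bigrams are rare in the corpus.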
17. Surface Realization: Rules + Templates
- Enables more efficient development of an NLG system for a concrete application
- Combines generation grammar rules, templates, and canned expressions
- Compensates for the shortcomings of each technique (i.e., utilizes the advantages of each technique)
18. Surface Realization: Rules + Templates
- Input: underspecified syntactic/semantic features
- Output: high-quality sentences
- Knowledge Sources: generation grammar rules, lexicon, templates
- Knowledge Acquisition: hand-crafted
- Degradation from underspecified input: accounted for by templates and canned expressions
- Degradation from lacking knowledge: lower-quality output
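A minimal sketch of the hybrid approach, assuming a hypothetical flight-information domain: canned expressions cover fixed messages, templates provide the sentence frame, and a small grammar rule handles number agreement inside a slot. All names and texts are made up for illustration.

```python
# Hypothetical canned expressions and templates.
CANNED = {"greeting": "Hello, how can I help you?"}
TEMPLATES = {"report_flights": "I found {count_np} to {city}."}

def noun_phrase(count, noun):
    """Grammar rule: number agreement with regular plural formation."""
    if count == 0:
        return f"no {noun}s"
    if count == 1:
        return f"one {noun}"
    return f"{count} {noun}s"

def realize(act, **feats):
    if act in CANNED:
        return CANNED[act]               # fixed message, no processing needed
    template = TEMPLATES.get(act)
    if template is None:
        return None
    slots = dict(feats)
    if "count" in feats:
        slots["count_np"] = noun_phrase(feats["count"], "flight")
    else:
        slots["count_np"] = "some flights"   # fallback for underspecified input
    slots.setdefault("city", "your destination")
    return template.format(**slots)

print(realize("report_flights", count=1, city="Boston"))
print(realize("report_flights", count=3, city="Boston"))
print(realize("report_flights"))
```

The template keeps development fast, the rule avoids writing one template per count, and the fallback phrases keep the output fluent when features are missing.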
19. Future Directions
- Corpus-Based Techniques
  - Langkilde, USC
- Multimodal (Multimedia) Generation
  - McKeown et al., Columbia University
- Concept-to-Speech Generation
  - Hitzeman et al., U. Edinburgh
- Hypertext (Web documents) Generation
  - Dale, Macquarie University
- Reference Architecture for NLG
  - RAGS Project (U. Edinburgh, U. Brighton)