Title: Spoken Dialogue Systems
1Spoken Dialogue Systems
- Bob Carpenter and Jennifer Chu-Carroll
- June 20, 1999
2Tutorial Overview Data Flow
Part I
Part II
Discourse Interpretation
Signal Processing
Speech Recognition
Semantic Interpretation
Dialogue Management
Response Generation
Speech Synthesis
3Speech and Audio Processing
- Signal processing
- Convert the audio wave into a sequence of feature
vectors - Speech recognition
- Decode the sequence of feature vectors into a
sequence of words - Semantic interpretation
- Determine the meaning of the recognized words
- Speech synthesis
- Generate synthetic speech from a marked-up word
string
4Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
5Acoustic Waves
- Human speech generates a wave
- like a loudspeaker moving
- A wave for the words speech lab looks like
s p ee ch
l a b
l to a transition
Graphs from Simon Arnfields web tutorial on
speech, Sheffield http//lethe.leeds.ac.uk/resear
ch/cogn/speech/tutorial/
6Acoustic Sampling
- 10 ms frame (ms millisecond 1/1000 second)
- 25 ms window around frame to smooth signal
processing
25 ms
. . .
10ms
Result Acoustic Feature Vectors
a1 a2 a3
7Spectral Analysis
- Frequency gives pitch amplitude gives volume
- sampling at 8 kHz phone, 16 kHz mic (kHz1000
cycles/sec) - Fourier transform of wave yields a spectrogram
- darkness indicates energy at each frequency
- hundreds to thousands of frequency samples
s p ee ch
l a b
amplitude
frequency
8Acoustic Features Mel Scale Filterbank
- Derive Mel Scale Filterbank coefficients
- Mel scale
- models non-linearity of human audio perception
- mel(f) 2595 log10(1 f / 700)
- roughly linear to 1000Hz and then logarithmic
- Filterbank
- collapses large number of FFT parameters by
filtering with 20 triangular filters spaced on
mel scale
...
frequency
m1 m2 m3 m4 m5 m6
coefficients
9Cepstral Coefficients
- Cepstral Transform is a discrete cosine transform
of log filterbank amplitudes - Result is 12 Mel Frequency Cepstral Coefficients
(MFCC) - Almost independent (unlike mel filterbank)
- Use Delta (velocity / first derivative) and
Delta2 (acceleration / second derivative) of MFCC
( 24 features)
10Additional Signal Processing
- Pre-emphasis prior to Fourier transform to boost
high level energy - Liftering to re-scale cepstral coefficients
- Channel Adaptation to deal with line and
microphone characteristics (example cepstral
mean normalization) - Echo Cancellation to remove background noise
(including speech generated from the synthesizer) - Adding a Total (log) Energy feature (/-
normalization) - End-pointing to detect signal start and stop
11Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
12Properties of Recognizers
- Speaker Independent vs. Speaker Dependent
- Large Vocabulary (2K-200K words) vs. Limited
Vocabulary (2-200) - Continuous vs. Discrete
- Speech Recognition vs. Speech Verification
- Real Time vs. multiples of real time
- Spontaneous Speech vs. Read Speech
- Noisy Environment vs. Quiet Environment
- High Resolution Microphone vs. Telephone vs.
Cellphone - Adapt to speaker vs. non-adaptive
- Low vs. High Latency
- With online incremental results vs. final results
13The Speech Recognition Problem
- Bayes Law
- P(a,b) P(ab) P(b) P(ba) P(a)
- Joint probability of a and b probability of b
times the probability of a given b - The Recognition Problem
- Find most likely sequence w of words given the
sequence of acoustic observation vectors a - Use Bayes law to create a generative model
- ArgMaxw P(wa) ArgMaxw P(aw) P(w) / P(a)
- ArgMaxw
P(aw) P(w) - Acoustic Model P(aw)
- Language Model P(w)
14Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
15Hidden Markov Models (HMMs)
- HMMs provide generative acoustic models P(aw)
- probabilistic, non-deterministic finite-state
automaton - state n generates feature vectors with density
Pn - transitions from state j to n are probabilistic
Pj,n
P1,1
P2,2
P3,3
P3(.)
P1(.)
P2(.)
P1,2
P2,3
P3,4
16HMMs Single Gaussian Distribution
- Outgoing likelihoods
- Feature vector a generated by normal density
(Gaussian) with mean h and covariance matrix S
17HMMs Gaussian Mixtures
- To account for variable pronunciations
- Each state generates acoustic vectors according
to a linear combination of m Gaussian models,
weighted by lm
Three-component mixture model in two dimensions
18Acoustic Modeling with HMMs
- Train HMMs to represent subword units
- Units typically segmental may vary in
granularity - phonological (40 for English)
- phonetic (60 for English)
- context-dependent triphones (14,000 for
English) models temporal and spectral
transitions between phones - silence and noise are usually additional symbols
- Standard architecture is three successive states
per phone
19Pronunciation Modeling
- Needed for speech recognition and synthesis
- Maps orthographic representation of words to
sequence(s) of phones - Dictionary doesnt cover language due to
- open classes
- names
- inflectional and derivational morphology
- Pronunciation variation can be modeled with
multiple pronunciation and/or acoustic mixtures - If multiple pronunciations are given, estimate
likelihoods - Use rules (e.g. assimilation, devoicing,
flapping), or statistical transducers
20Lexical HMMs
- Create compound HMM for each lexical entry by
concatenating the phones making up the
pronunciation - example of HMM for lab (following speech for
crossword triphone) - Multiple pronunciations can be weighted by
likelihood into compound HMM for a word - (Tri)phone models are independent parts of word
models
triphone ch-la l-ab
a-b
phone l a
b
21HMM Training Baum-Welch Re-estimation
- Determines the probabilities for the acoustic HMM
models - Bootstraps from initial model
- hand aligned data, previous models or flat start
- Allows embedded training of whole utterances
- transcribe utterance to words w1,,wk and
generate a compound HMM by concatenating compound
HMMs for words m1,,mk - calculate acoustic vectors a1,,an
- Iteratively converges to a new estimate
- Re-estimates all paths because states are hidden
- Provides a maximum likelihood estimate
- model that assigns training data the highest
likelihood
22Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
23Probabilistic Language Modeling History
- Assigns probability P(w) to word sequence w w1
,w2,,wk - Bayes Law provides a history-based model
- P(w1 ,w2,,wk)
- P(w1) P(w2w1) P(w3w1,w2)
P(wkw1,,wk-1) - Cluster histories to reduce number of parameters
24N -gram Language Modeling
- n-gram assumption clusters based on last n-1
words - P(wjw1,,wj-1) P(wjwj-n-1,,wj-2 ,wj-1)
- unigrams P(wj)
- bigrams P(wjwj-1)
- trigrams P(wjwj-2 ,wj-1)
- Trigrams often interpolated with bigram and
unigram - the li typically estimated by maximum likelihood
estimation on held out data (F(..) are relative
frequencies) - many other interpolations exist (another standard
is a non-linear backoff)
25Extended Probabilistic Language Modeling
- Histories can include some indication of semantic
topic - latent-semantic indexing (vector-based
information retrieval model) - topic-spotting and blending of topic-specific
models - dialogue-state specific language models
- Language models can adapt over time
- recent history updates model through
re-estimation or blending - often done by boosting estimates for seen words
(triggers) - new words and/or pronunciations can be added
- Can estimate category tags (syntactic and/or
semantic) - Joint word/category model P(word1tag1,,wordkta
gk) - example P(wordtagHistory) P(wordtag)
P(tagHistory)
26Finite State Language Modeling
- Write a finite-state task grammar (with
non-recursive CFG) - Simple Java Speech API example (from users
guide) - public ltCommandgt ltPolitegt ltActiongt
ltObjectgt (and ltObjectgt) - ltActiongt open close delete
- ltObjectgt the window the file
- ltPolitegt please
- Typically assume that all transitions are
equi-probable - Technology used in most current applications
- Can put semantic actions in the grammar
27Information Theory Perplexity
- Perplexity is standard model of recognition
complexity given a language model - Perplexity measures the conditional likelihood of
a corpus, given a language model P(.) - Roughly the number of equi-probable choices per
word - Typically computed by taking logs and applying
history-based Bayesian decomposition - But lower perplexity doesnt guarantee better
recognition
28Zipfs Law
- Lexical frequency is inversely proportional to
rank - Frequency(n) Frequency of n-th most frequent
word - Zipfs Law Frequency(Rank) Frequency(1)/Rank
- Thus log Frequency(Rank) ? - log Rank
From G.R. Turners web site on Zipfs
law http//www.btinternet.com/g.r.turner/ZipfDoc
.htm
29Vocabulary Acquisition
- IBM personal E-mail corpus of PDB (by R.L.
Mercer) - static coverage is given by most frequent n words
- dynamic coverage is most recent n words
30Language HMMs
- Can take HMMs for each word and combine into a
single HMM for the whole language (allows
cross-word models) - Result is usually too large to expand statically
in memory - A two word example is given by
P(word1word1)
word1
P(word1)
P(word2word1)
P(word1word2)
word2
P(word2)
P(word2word2)
31Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
32HMM Decoding
- Decoding Problem is finding best word sequence
- ArgMax w1,,wm P(w1,,wm a1,,an)
- Words w1wm are fully determined by sequences of
states - Many state sequences produce the same words
- The Viterbi assumption
- the word sequence derived from the most likely
path will be the most likely word sequence (as
would be computed over all paths) -
Max over previous states r
likelihood previous state is r
Transition probability from r to s
Acoustics for state s for input a
33Visualizing Viterbi Decoding The Trellis
ai-1
ai1
ai
input
P1,1
P1,1
s1
s1
s1
fi(s1)
fi1(s1)
fi-1(s1)
P2,1
P1,2
...
s2
s2
...
s2
Pk,1
fi(s2)
fi1(s2)
fi-1(s2)
P1,k
best path
...
...
...
sk
sk
sk
fi(sk)
fi1(sk)
fi-1(sk)
time
ti-1
ti
ti1
34Viterbi Search Dynamic Programming Token Passing
- Algorithm
- Initialize all states with a token with a null
history and the likelihood that its a start
state - For each frame ak
- For each token t in state s with probability
P(t), history H - For each state r
- Add new token to s with probability P(t) Ps,r
Pr(ak), and history s.H - Time synchronous from left to right
- Allows incremental results to be evaluated
35Pruning the Search Space
- Entire search space for Viterbi search is much
too large - Solution is to prune tokens for paths whose score
is too low - Typical method is to use
- histogram only keep at most n total hypotheses
- beam only keep hypotheses whose score is a
fraction of best score - Need to balance small n and tight beam to limit
search and minimal search error (good hypotheses
falling off beam) - HMM densities are usually scaled differently than
the discrete likelihoods from the language model - typical solution boost language models dynamic
range, using P(w)n P(aw), usually with with n
15 - Often include penalty for each word to favor
hypotheses with fewer words
36N-best Hypotheses and Word Graphs
- Keep multiple tokens and return n-best
paths/scores - p1 flights from Boston today
- p2 flights from Austin today
- p3 flights for Boston to pay
- p4 lights for Boston to pay
- Can produce a packed word graph (a.k.a. lattice)
- likelihoods of paths in lattice should equal
likelihood for n-best
37Search-based Decoding
- A search
- Compute all initial hypotheses and place in
priority queue - For best hypothesis in queue
- extend by one observation, compute next state
score(s) and place into the queue - Scoring now compares derivations of different
lengths - would like to, but cant compute cost to complete
until all data is seen - instead, estimate with simple normalization for
length - usually prune with beam and/or histogram
constraints - Easy to include unbounded amounts of history
because no collapsing of histories as in dynamic
programming n-gram - Also known as stack decoder (priority queue is
stack)
38Multiple Pass Decoding
- Perform multiple passes, applying successively
more fine-grained language models - Can much more easily go beyond finite state or
n-gram - Can use for Viterbi or stack decoding
- Can use word graph as an efficient interface
- Can compute likelihood to complete hypotheses
after each pass and use in next round to tighten
beam search - First pass can even be a free phone decoder
without a word-based language model
39Measuring Recognition Accuracy
- Word Error Rate
- Example scoring
- actual utterance four six seven
nine three three seven - recognizer four oh six seven
five three seven - insert
subst delete - WER (1 1 1)/7 43
- Would like to study concept accuracy
- typically count only errors on content words
application dependent - ignore case marking (singular, plural, etc.)
- For word/concept spotting applications
- recall percentage of target words (concept)
found - precision percentage of hypothesized words
(concepts) in target
40Empirical Recognition Accuracies
- Cambridge HTK, 1997 multipass HMM w. lattice
rescoring - Top Performer in ARPAs HUB-4 Broadcast News
Task - 65,000 word vocabulary Out of Vocabulary 0.5
- Perplexities
- word bigram 240
(6.9 million bigrams) - backoff trigram of 1000 categories 238
(803K bi, 7.1G tri) - word trigram 159
(8.4 million trigrams) - word 4-gram 147
(8.6 million 4-grams) - word 4-gram category trigram 137
- Word Error Rates
- clean, read speech 9.4
- clean, spontaneous speech 15.2
- low fidelity speech 19.5
41Empirical Recognition Accuracies (contd)
- Lucent 1998, single pass HMM
- Typical of real-time telephony performance (low
fidelity) - 3,000 word vocabulary Out of Vocabulary 1.5
- Blended models from customer/operator
customer/system - Perplexities customer/op customer/system
- bigram 105.8 (27,200) 32.1
(12,808) - trigram 99.5 (68,500) 24.4
(25,700) - Word Error Rate 23
- Content Term (single, pair, triple of words)
Precision/Recall - one-word terms 93.7 / 88.4
- two-word terms 96.9 / 85.4
- three-word terms 98.5 / 84.3
42Confidence Scoring and Rejection
- Alternative to standard acoustic density scoring
- compute HMM acoustic score for word(s) in usual
way - baseline score for an anti-model
- compute hypothesis ratio (Word Score / Baseline
Score) - test hypothesis ratio vs. threshold
-
- Can be applied to
- free word spotting (given pronunciations)
- (word-by-word) acoustic confidence scoring for
later processing - verbal information verification
- existing info name, address, social security
number - password
43Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
44Semantic Interpretation Word Strings
- Content is just words
- System What is your address?
- User fourteen eleven main street
- Can also do concept extraction / keyword(s)
spotting - User My address is fourteen eleven main
street - Applications
- template filling
- directory services
- information retrieval
45Semantic Interpretation Pattern-Based
- Simple (typically regular) patterns specify
content - ATIS (Air Traffic Information System) Task
- System What are your travel plans?
- User On Monday, Im going from Boston
to San Francisco. - Content DATEMonday, ORIGINBoston,
DESTINATIONSFO - Can combine content-extraction and language
modeling - but can be too restrictive as a language model
- Java Speech API (curly brackets show semantic
actions) - public ltcommandgt ltactiongt ltobjectgt
ltpolitegt - ltactiongt open OP close
CL move MV - ltobjectgt ltthis_that_etcgt
window door - ltthis_that_etcgt a the
this that the current - ltpolitegt please kindly
- Can be generated and updated on the fly (eg. Web
Apps)
46Semantic Interpretation Parsing
- In general case, have to uncover who did what to
whom - System What would you like me to do next?
- User Put the block in the box on Platform 1.
ambiguous - System How can I help you?
- User Where is A Bugs Life playing in Summit?
- Requires some kind of parsing to produce
relations - Who did what to whom ?(where(present(in(Summit,p
lay(BugsLife))))) - This kind of representation often used for
machine translation - Often transferred to flatter frame-based
representation - Utterance type where-question
- Movie A Bugs Life
- Town Summit
47Robustness and Partiality
- Controlled Speech
- limited task vocabulary limited task grammar
- Spontaneous Speech
- Can have high out-of-vocabulary (OOV) rate
- Includes restarts, word fragments, omissions,
phrase fragments, disagreements, and other
disfluencies - Contains much grammatical variation
- Causes high word error-rate in recognizer
- Parsing is often partial, allowing
- omission
- parsing fragments
48Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
49Recorded Prompts
- The simplest (and most common) solution is to
record prompts spoken by a (trained) human - Produces human quality voice
- Limited by number of prompts that can be recorded
- Can be extended by limited cut-and-paste or
template filling
50Speech Synthesis
- Rule-based Synthesis
- Uses linguistic rules (/- training) to generate
features - Example DECTalk
- Concatenative Synthesis
- Record basic inventory of sounds
- Retrieve appropriate sequence of units at run
time - Concatenate and adjust durations and pitch
- Waveform synthesis
51Diphone and Polyphone Synthesis
- Phone sequences capture co-articulation
- Cut speech in positions that minimize context
contamination - Need single phones, diphones and sometimes
triphones - Reduce number collected by
- phonotactic constraints
- collapsing in cases of no co-articulation
- Data Collection Methods
- Collect data from a single (professional) speaker
- Select text with maximal coverage (typically with
greedy algorithm), or - Record minimal pairs in desired contexts (real
words or nonsense)
52Duration Modeling
- Must generate segments with the appropriate
duration - Segmental Identity
- /ai/ in like twice as long as /I/ in lick
- Surrounding Segments
- vowels longer following voiced fricatives than
voiceless stops - Syllable Stress
- onsets and nuclei of stressed syllables longer
than in unstressed - Word importance
- word accent with major pitch movement lengthens
- Location of Syllable in Word
- word ending longer than word starting longer than
word internal - Location of the Syllable in the Phrase
- phrase final syllables longer than same syllable
in other positions
53Intonation Tone Sequence Models
- Functional Information can be encoded via tones
- given/new information (information status)
- contrastive stress
- phrasal boundaries (clause structure)
- dialogue act (statement/question/command)
- Tone Sequence Models
- F0 contours generated from phonologically
distinctive tones/pitch accents which are locally
independent - generate a sequence of tonal targets and fit with
signal processing
54Intonation for Function
- ToBI (Tone and Break Index) System, is one
example - Pitch Accent (H, L, HL, HL, LH,
LH) - Phrase Accent - (H-, L-)
- Boundary Tone (H, L)
- Intonational Phrase
- ltPitch Accentgt ltPhrase
Accentgt ltBoundary Tonegt -
statement vs. question example
source Multilingual Text-to-Speech Synthesis, R.
Sproat, ed., Kluwer, 1998
55Text Markup for Synthesis
- Bell Labs TTS Markup
- r(0.9) LH(0.8) Humpty LH(0.8) Dumpty
r(0.85) L(0.5) sat on a H(1.2) wall. - Tones Tone(Prominence)
- Speaking Rate r(Rate) and pauses
- Top Line (highest pitch) Reference Line
(reference pitch) Base Line (lowest pitch) - SABLE is an emerging standard extending SGML
http//www.cstr.ed.ac.uk/projects/sable.html - marks emphasis(), break(), pitch(base/mid/rang
e,), rate(), volume(), semanticMode(date/time/e
mail/URL/...), speaker(age,sex) - Implemented in Festival Synthesizer (free for
research, etc.) - http//www.cstr.ed.ac.uk/projec
ts/festival.html
56Intonation in Bell Labs TTS
- Generate a sequence of F0 targets for synthesis
- Example
- We were away a year ago.
- phones w E w R w A y E r g O
- Default Declarative intonation (H) H L- L
question L H- H
source Multilingual Text-to-Speech Synthesis, R.
Sproat, ed., Kluwer, 1998
57Signal Processing for Speech Synthesis
- Diphones recorded in one context must be
generated in other contexts - Features are extracted from recorded units
- Signal processing manipulates features to smooth
boundaries where units are concatenated - Signal processing modifies signal via
interpolation - intonation
- duration
58The Source-Filter Model of Synthesis
- Model of features to be extracted and fitted
- Excitation or Voicing Source(s) to model sound
source - standard wave of glottal pulses for voiced sounds
- randomly varying noise for unvoiced sounds
- modification of airflow due to lips, etc.
- high frequency (F0 rate), quasi-periodic, choppy
- modeled with vector of glottal waveform patterns
in voiced regions - Acoustic Filter(s)
- shapes the frequency character of vocal tract and
radiation character at the lips - relatively slow (samples around 5ms suffice) and
stationary - modeled with LPC (linear predictive coding)
59Barge-in
- Technique to allow speaker to interrupt the
systems speech - Combined processing of input signal and output
signal - Signal detector runs looking for speech start and
endpoints - tests a generic speech model against noise model
- typically cancels echoes created by outgoing
speech - If speech is detected
- Any synthesized or recorded speech is cancelled
- Recognition begins and continues until end point
is detected
60Speech Application Programming Interfaces
- Abstract from recognition/synthesis engines
- Recognizer and synthesizer loading
- Acoustic and grammar model loading (dynamic
updates) - Recognition
- online
- n-best or lattice
- Synthesis
- markup
- barge in
- Acoustic control
- telephony interface
- microphone/speaker interface
61Speech API Examples
- SAPI Microsoft Speech API (recsynth)
- communicates through COM objects
- instances most systems implement all or some of
this (Dragon, IBM, Lucent, LH, etc.) - JSAPI Java Speech API (rec synth)
- communicates through Java events (like GUI)
- concurrency through threads
- instances IBM ViaVoice (rec), LH (synth)
- (J)HAPI (Java) HTK API (recognition)
- communicates through C or Java port of C
interface - eg Entropics Cambridge Research Labs HMM Tool
Kit (HTK) - Galaxy (rec synth)
- communicates through a production system
scripting language - MIT System, ported by MITRE for DARPA Communicator
62Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
63Discourse Dialogue Processing
- Discourse interpretation
- Understand what the user really intends by
interpreting utterances in context - Dialogue management
- Determine system goals in response to user
utterances based on user intention - Response generation
- Generate natural language utterances to achieve
the selected goals
64Discourse Interpretation
- Goal understand what the user really intends
- Example Can you move it?
- What does it refer to?
- Is the utterance intended as a simple yes-no
query or a request to perform an action? - Issues addressed
- Reference resolution
- Intention recognition
- Interpret user utterances in context
65Reference Resolution
U Where is A Bugs Life playing in Summit? S A
Bugs Life is playing at the Summit theater. U
When is it playing there? S Its playing at 2pm,
5pm, and 8pm. U Id like 1 adult and 2 children
for the first show. How much would that cost?
- Knowledge sources
- Domain knowledge
- Discourse knowledge
- World knowledge
66Reference Resolution In Theory
- Focus stacks
- Maintain recent objects in stack
- Select objects that satisfy semantic/pragmatic
constraints starting from top of stack - May take into account discourse structure
- Centering
- Backward-looking center (Cb) object connecting
the current sentence with the previous sentence - Forward-looking centers (Cf) potential Cb of the
next sentence - Rule-based filtering ranking of objects for
pronoun resolution
67Reference Resolution In Practice
- Non-existent does not allow the use of anaphoric
references - Allows only simple references
- utilizes the focus stack reference resolution
mechanism - does not take into account discourse structure
information - Example
U Where is A Bugs Life playing in Summit?
Summit
A Bugs Life
68 S A Bugs Life is playing at the
Summit theater.
Summit theater
A Bugs Life
69 U When is it playing there?
70Intention Recognition
B I have to wash my hair.
71 A Would you like to go to the hairdresser?
- Bs utterance should be interpreted as an
acceptance of As proposal.
72 A Whats that smell around here?
- Bs utterance should be interpreted as an answer
to As question.
73 A Would you be interested in going out to
dinner tonight?
- Bs utterance should be interpreted as a
rejection of As proposal.
74Intention Recognition (Contd)
- Goal to recognize the intent of each user
utterance as one (or more) of a set of dialogue
acts based on context - Sample dialogue actions
- Switchboard DAMSL
- Conventional-closing
- Statement-(non-)opinion
- Agree/Accept
- Acknowledgment
- Yes-No-Question/Yes-Answer
- Non-verbal
- Abandoned
- On-going standardization efforts (Discourse
Resource Initiative)
- Verbmobil
- Greet/Thank/Bye
- Suggest
- Accept/Reject
- Confirm
- Clarify-Query/Answer
- Give-Reason
- Deliberate
75Intention Recognition In Theory
- Knowledge sources
- Overall dialogue goals
- Orthographic features, e.g.
- punctuation
- cue words/phrases but, furthermore, so
- transcribed words would you please, I want
to - Dialogue history, i.e., previous dialogue act
types - Dialogue structure, e.g.
- subdialogue boundaries
- dialogue topic changes
- Prosodic features of utterance duration, pause,
F0, speaking rate
76Intention Recognition In Theory (Contd)
- Finite-state dialogue grammar
- e.g.
- Plan-based discourse understanding
- Recipes templates for performing actions
- Inference rules to construct plausible plans
- Empirical methods
- Probabilistic dialogue act classifiers HMMs
- Rule-based dialogue act recognition CART,
Transformation-based learning
S / Greet
U / Question
S / Answer
U / Bye
U / Question
77Intention Recognition In Practice
- Makes assumptions about (high-level)
task-specific intentions e.g., - Call routing giving destination information
- ATIS requesting flight information
- Movie information system movie showtime or
theater playlist information - Does not allow user-initiated complex dialogue
acts, e.g. confirmation, clarification, or
indirect responses
S1 Whats your account number? U1 Is that the
number on my ATM card? S2 Would you like to
transfer 1,500 from savings to checking? U2 If
I have enough in savings.
78Intention Recognition In Practice (Contd)
- User utterances can play one of two roles
- Identify one of a set of possible task intentions
- Provide necessary information for performing a
task - Based on either keywords in an utterance or its
syntactic/semantic representation - Maps keywords or representations to intentions
using - Template matching
- Probabilistic model
- Vector-based similarity measures
79Intention Recognition Example
U What time is A Bugs Life playing at the
Summit theater?
- Using keyword extraction and vector-based
similarity measures - Intention Ask-Reference _time
- Movie A Bugs Life
- Theater the Summit quadplex
80Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
81Dialogue Management Motivating Examples
S Would you like movie showtime or theater
playlist information? U Movie showtime. S
What movie do you want showtime information
about? U Saving Private Ryan. S At what theater
do you want to see Saving Private Ryan? U
Paramount theater. S Saving Private Ryan is not
playing at the Paramount theater.
82DM Motivating Examples (Contd)
S How may I help you? U When is Saving Private
Ryan playing? S For what theater? U The
Paramount theater. S Saving Private Ryan is not
playing at the Paramount theater, but its
playing at the Madison theater at 300, 530,
800, and 1030.
83DM Motivating Examples (Contd)
S How may I help you? U When is Saving Private
Ryan playing? S For what theater? U The
Paramount theater. S Saving Private Ryan is
playing at the Fairmont theater at 600 and
830. U I wanted to know about the Paramount
theater, not the Fairmont theater. S Saving
Private Ryan is not playing at the Paramount
theater, but its playing at the Madison
theater at 300, 530, 800, and 1030.
84Comparison of Sample Dialogues
- Dialogue 1
- System-initiative
- Implicit confirmation
- Merely informs user of failed query
- Mechanical
- Least efficient
- Dialogue 2
- Mixed-initiative
- No confirmation
- Suggests alternative when query fails
- More natural
- Most efficient
- Dialogue 3
- Mixed-initiative
- No confirmation
- Suggests alternative when query fails
- More natural
- Moderately efficient
85Dialogue Management
- Goal determine what to accomplish in response to
user utterances, e.g. - Answer user question
- Solicit further information
- Confirm/Clarify user utterance
- Notify invalid query
- Notify invalid query and suggest alternative
- Interface between user/language processing
components and system knowledge base
86Dialogue Management (Contd)
- Main design issues
- Functionality how much should the system do?
- Process how should the system do them?
- Affected by
- Task complexity how hard the task is
- Dialogue complexity what dialogue phenomena are
allowed - Affects
- robustness
- naturalness
- perceived intelligence
87Task Complexity
- Application dependent
- Examples
Weather Information ATIS
Call Routing
Travel Planning
Automatic Banking
University Course Advisement
Simple
Complex
- Directly affects
- Types and quantity of system knowledge
- Complexity of systems reasoning abilities
88Dialogue Complexity
- Determines what can be talked about
- The task only
- Subdialogues e.g., clarification, confirmation
- The dialogue itself meta-dialogues
- Could you hold on for a minute?
- What was that click? Did you hear it?
- Determines who can talk about them
- System only
- User only
- Both participants
89Dialogue Management Functionality
- Determines the set of possible goals that the
system may select at each turn - At the task level, dictated by task complexity
- At the dialogue level, determined by system
designer in terms of dialogue complexity - Are subdialogues allowed?
- Are meta-dialogues allowed?
- Only by the system, by the user, or by both
agents?
90DM Functionality In Theory
- Task complexity moderate to complex
- Travel planning
- University course advisement
- Dialogue complexity
- System/user-initiated complex subdialogues
- Embedded negotiation subdialogues
- Expressions of doubt
- Meta-dialogues
- Multiple dialogue threads
91DM Functionality In Practice
- Task complexity simple to moderate
- Call routing
- Weather information query
- Train schedule inquiry
- Dialogue complexity
- About task only
- Limited system-initiated subdialogues
92Dialogue Management Process
- Determines how the system will go about selecting
among the possible goals - At the dialogue level, determined by system
designer in terms of initiative strategies - System-initiative system always has control,
user only responds to system questions - User-initiative user always has control, system
passively answers user questions - Mixed-initiative control switches between system
and user using fixed rules - Variable-initiative control switches between
system and user dynamically based on participant
roles, dialogue history, etc.
93DM Process In Theory
- Initiative strategies
- Mixed-initiative
- Variable-initiative
- Mechanisms for modeling initiative
- Planning and reasoning
- Theorem proving
- Belief modeling
- Knowledge sources for modeling initiative
- System beliefs, user beliefs, and mutual beliefs
- System domain knowledge
- Dialogue history
- User preferences
94DM Process In Practice
- Initiative strategies
- User-initiative
- System-initiative
- Mixed-initiative
- Variable-initiative
- Mechanisms for modeling initiative
- System and mixed-initiative finite-state
machines - Variable-initiative evidential model for
computing initiative - Knowledge sources
- Dialogue history e.g. user fails to make
progress in task - Participant roles advisor/advisee vs.
collaborators - Features of current utterance e.g. ambiguous
utterance, underspecified utterance
95Dialogue Management Summary
- In theory
- Focuses on developing algorithms for addressing
difficult dialogue phenomena in complex tasks - In practice
- Focuses on developing robust, real-time dialogue
strategies for handling common dialogue phenomena
in simpler tasks
96Dialogue Management Example
- System for providing movie showtime and theater
playlist information - Functionality
- Task provide simple movie information
- Dialogue allow system-initiated subdialogues
- Process
- Variable-initiative
97Dialogue Management Example (Contd)
ltU has task dialogue initiativegt U When
is Antz playing at the Fairmont theater? ltS
has dialogue initiative (low score for theater)gt
ltgoal confirm Paramount theatergt S Did you
say the Paramount theater? U No, the Fairmont
theater. ltS has no initiativegt ltgoal
answer user questiongt S Antz is playing at the
Fairmont theater at 7pm. U When is it playing at
the Summit theater? ltS has task dialogue
initiativegt ltgoal notify failure suggest
alternativegt S Antz is not playing at the Summit
theater, but its playing at the AMC theater
in Summit at 8pm.
98Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
99Response Generation
U Is Saving Private Ryan playing at the Chatham
cinema?
100 S No, its not.
- S provides elliptical response
101 S No, Saving Private Ryan is not playing
at the Chatham cinema.
- S provides full response (which provides
grounding information)
102 S No, Saving Private Ryan is not playing
at the Chatham cinema the theaters under
renovation.
- S provides full response and supporting evidence
103Response Generation (Contd)
- Goal generate natural language utterances to
achieve goal(s) selected by the dialogue manager - Issues
- Content selection determining what to say
- Surface realization determining how to say it
- Generation gap discrepancy between the actual
output of the content selection process and the
expected input of the surface realization process
104Content Selection
- Goal determine the propositional content of
utterances to achieve goal(s) - Examples
- Antz is not playing at the Maplewood theater
Nucleus - Would you like the suite? Nucleus
- Can you get the groceries from the car?
Nucleus
105 - the theaters under renovation. (evidence)
Satellite - Its the same price as the regular room.
(motivation) Satellite - The key is on the dryer. (enablement) Satellite
106Content Selection In Theory
- Knowledge sources
- Domain knowledge base
- User beliefs
- User model user characteristics, preferences,
etc. - Dialogue history
- Content selection mechanisms
- Schemas patterns of predicates
- Rule-based generation
- Plan-based generation
- Recipes templates for performing actions
- Planner to construct plans for given goal
- Case-based reasoning
107Content Selection In Practice
- Knowledge sources
- Domain knowledge base
- Dialogue history
- Pre-determined content selection strategies
- Nucleus only, no satellite information
- Nucleus fixed satellite
108Surface Realization
- Goal determine how the selected content will be
conveyed by natural language utterances - Examples
- Antz is showing (shown) at the Maplewood theater.
- The Maplewood theater is showing Antz.
- It is at the Maplewood theater that Antz is
shown. - Antz, thats whats being shown at the Maplewood
theater. - Issues
- Clausal structure construction
- Lexical selection
109Surface Realization In Theory
- Typical surface generator requires as input
- Semantic representation to be realized
- Clausal structure for generated utterance
- Surface realization component utilizes a grammar
to generate utterance that conveys the given
semantic representation
110Surface Realization In Practice
- Canned utterances
- Pre-determined utterances for goals e.g.
- Greetings Hello, this is the ABC banks
operator. - Repeat Could you please repeat your request?
- Facilitates pre-recorded prompts for speech
output - Template-based generation
- Templates for goals e.g.
- Notification Your call is being transferred to
X. - Inform A,B,C,D, and E are playing at the F
theater. - Clarify Did you say X or Y?
- Needs cut-and-paste of pre-recorded segments or
full TTS system
111Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
112Dialogue Evaluation
- Goal determine how well a dialogue system
performs - Main difficulties
- No strict right or wrong answers
- Difficult to determine what features make a
dialogue system better than another - Difficult to select metrics that contribute to
the overall goodness of the system - Difficult to determine how the metrics compensate
for one another - Expensive to collect new data for evaluating
incremental improvement of systems
113Dialogue Evaluation (Contd)
- System-initiative, explicit confirmation
- better task success rate
- lower WER
- longer dialogues
- fewer recovery subdialogues
- less natural
- Mixed-initiative, no confirmation
- lower task success rate
- higher WER
- shorter dialogues
- more recovery subdialogues
- more natural
114Dialogue Evaluation Paradigms
- Evaluating the end result only
- Reference answers
- Evaluating both the end result and the process
toward it - Evaluation metrics
- Performance functions
115Evaluation Paradigms Reference Answers
- Evaluates the task success rate only
- Suitable for query-answering systems for which a
correct answer can be defined for each query - For each query
- Obtain answer from dialogue system
- Compare with reference answer
- Score system performance
- Advantage simple
- Disadvantage ignores many other important
factors that contribute to quality of dialogue
systems
116Evaluation Paradigms Evaluation Metrics
- Different metrics for evaluating different
components of a dialogue system - Speech recognizer word error rate / word
accuracy - Understanding component attribute value matrix
- Dialogue manager
- appropriateness of system responses
- error recovery abilities
- Overall system
- task success
- average number of turns
- elapsed time
- turn correction ratio
117Paradigms Evaluation Metrics (Contd)
- Advantage
- Takes into account the process toward completing
the task - Limitations
- Difficult to determine how different metrics
compensate for one another - Metrics may not be independent of one another
- Does not generalize across domains and tasks
118Paradigms Performance Functions
- PARADISE Walker et al. derives performance
functions using both task-based and
dialogue-based metrics - User satisfaction
- Maximize task success
- Minimize costs
- Efficiency measures e.g., number of utterances,
elapsed time - Qualitative measures e.g., repair ratio,
inappropriate utt. ratio - Performance function derivation
- Obtain user satisfaction ratings (questionnaire)
- Obtain values for various metrics (automatic or
manual) - Apply multiple linear regression to derive a
function relating user satisfaction and various
cost factors, e.g.,
119Paradigms Performance Functions (Contd)
- Advantages
- Allows for comparison of dialogue systems
performing different tasks - Specifies relative contributions of cost factors
to overall performance - Can be used to make predictions about future
versions of the dialogue system - Disadvantages
- Data collection cost for deriving performance
function is high - Cost for deriving performance function for
multiple systems to draw general conclusions is
high
120Tutorial Overview Outline
- Part I
- Signal processing
- Speech recognition
- acoustic modeling
- language modeling
- decoding
- Semantic interpretation
- Speech synthesis
- Part II
- Discourse and dialogue
- Discourse interpretation
- Dialogue management
- Response generation
- Dialogue evaluation
- Data collection
121Data Collection Wizard of Oz Paradigm
- Setup for initial data collection
- User communicates with system through telephone
(speech) or keyboard (text) - System is actually a human, typically given
instructions on how to behave like a system - Users are typically given tasks to perform in the
target domain - Subjects are the users and the system can be
played by one person - Dialogues between system and user are recorded
and transcribed - Setup for intermediate system evaluation
- Use actual running system, with wizard
supervision - Wizard can override undesirable system behavior,
e.g., correct ASR errors
122Data Collection Wizard of Oz (Contd)
- Features of collected data
- Typically much less complex than actual
human-human dialogues performing the same tasks - Captures how humans behave when they talk to
computers - Captures variations among different subjects in
both language and approach when performing the
same tasks - Use of collected data
- Particularly useful for designing the