Title: misunderstandings, corrections and beliefs in spoken language interfaces
1. misunderstandings, corrections and beliefs in spoken language interfaces
- Dan Bohus
- Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
- www.cs.cmu.edu/dbohus
- dbohus_at_cs.cmu.edu
2. problem
- spoken language interfaces lack robustness when
faced with understanding errors
- stems mostly from speech recognition
- spans most domains and interaction types
- exacerbated by operating conditions
3. more concretely
- S: What city are you leaving from?
- U: Urbana Champaign [OKAY IN THAT SAME PAY]
- S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
- U: Urbana Champaign [FOR MINUTE SINCE HEY]
- S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
- U: Chicago [CHICAGO]
- S: traveling from Chicago. Where would you like to go?
- U: Huntsville [SEOUL]
- S: traveling to Seoul. What day did you need to travel?
- U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
- S: traveling in the afternoon. Okay, what day would you be departing Chicago?
- U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
- S: a flight tomorrow... I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
- U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
- S: traveling on Saturday, August 12th. I have a flight departing Chicago at 1:40pm, arrives Seoul at
- (text in [brackets] is the speech recognizer's hypothesis)
4. some statistics
- semantic error rates: 25-35%
  SpeechActs (SRI): 25%
  CU Communicator (CU): 27%
  Jupiter (MIT): 28%
  CMU Communicator (CMU): 32%
  How May I Help You? (ATT): 36%
- corrections [Krahmer, Swerts, Litman, Levow]
  - 30% of utterances correct system mistakes
  - 2-3 times more likely to be misrecognized
5. two types of understanding errors
- misunderstandings vs. non-understandings
6. misunderstandings
- detect potential misunderstandings and do something about them
7. outline
- detecting misunderstandings
- detecting user corrections
  - late-detection of misunderstandings
- belief updating
  - construct accurate beliefs by integrating information from multiple turns
8. detecting misunderstandings
- recognition confidence scores
  S: What city are you leaving from?
  U: Birmingham [BERLIN PM] (conf = 0.63)
- traditionally [Bansal, Chase, Cox, Kemp, many others]: speech recognition confidence scores
  - use acoustic, language model and search information
  - computed at the frame, phoneme, or word level
9. semantic confidence scores
- we're interested in semantics, not words
  - YES / YEAH, NO / NO WAY
- use machine learning to build confidence annotators
  - in-domain, manually labeled data
    - utterance: BERLIN PM (user said: Birmingham)
    - labels: correct / misunderstood
  - features from different knowledge sources
  - binary classification problem
  - probability of misunderstanding: a regression problem
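The confidence-annotator idea above can be sketched end to end. This is a minimal illustration, not any system's actual annotator: the two features (ASR confidence, fraction of the utterance covered by the parse) and the toy labeled data are invented, and a real annotator would be trained on thousands of labeled in-domain utterances with many more features.

```python
import math

# Toy training set: one row of (invented) features per utterance.
# Features: [ASR confidence, fraction of words covered by the parse]
X = [[0.92, 1.0], [0.35, 0.4], [0.80, 1.0], [0.25, 0.2], [0.70, 0.8], [0.30, 0.5]]
y = [1, 0, 1, 0, 1, 0]  # 1 = correct, 0 = misunderstood

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit a logistic regression by stochastic gradient ascent on the
# log-likelihood (any stats/ML package would do this for you).
w = [0.0, 0.0, 0.0]  # bias + one weight per feature
for _ in range(2000):
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
        err = yi - p
        w[0] += 0.1 * err
        for j, xj in enumerate(xi):
            w[j + 1] += 0.1 * err * xj

def confidence(x):
    """Semantic confidence score: P(hypothesis is correct | features)."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

print(confidence([0.90, 1.0]) > 0.5)  # well-recognized utterance -> True
print(confidence([0.20, 0.3]) > 0.5)  # likely misunderstanding -> False
```

Returning the probability directly (rather than thresholding it) is what turns the binary classification task into the regression formulation mentioned above.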
10. a typical result
- "Identifying User Corrections Automatically in a Spoken Dialog System" [Walker, Wright, Langkilde]
- How May I Help You? corpus: call routing for phone services
  - 11787 turns
- features
  - ASR: recog, numwords, duration, dtmf, rg-grammar, tempo
  - understanding: confidence, context-shift, top-task, diff-conf, ...
  - dialog history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, ...
- binary classification task
  - majority baseline (error): 36.5%
  - RIPPER (error): 14%
11. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
12. detect user corrections
- is the user trying to correct the system?
  S: Where would you like to go?
  U: Huntsville [SEOUL]  ← misunderstanding
  S: traveling to Seoul. What day did you need to travel?
  U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]  ← user correction + misunderstanding
- same story: use machine learning
  - in-domain, manually labeled data
  - features from different knowledge sources
  - binary classification problem
  - probability of correction: a regression problem
13. a typical result
- "Identifying User Corrections Automatically in a Spoken Dialog System" [Hirschberg, Litman, Swerts]
- TOOT corpus: access to train information
  - 2328 turns, 152 dialogs
- features
  - prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo
  - ASR: gram, str, conf, ynstr, ...
  - dialog position: diadist
  - dialog history: preturn, prepreturn, pmeanf
- binary classification task
  - majority baseline (error): 29%
  - RIPPER (error): 15.7%
14. outline
- detecting misunderstandings
- detecting user corrections
  - late-detection of misunderstandings
- belief updating
  - construct accurate beliefs by integrating information from multiple turns
15. belief updating problem: an easy case
  S: on which day would you like to travel?
  U: on September 3rd [AN DECEMBER THIRD] (conf = 0.25)
    departure_date: Dec-03 / 0.25
  S: did you say you wanted to leave on December 3rd?
  U: no [NO] (conf = 0.88)
    departure_date: Ø
16. belief updating problem: a trickier case
  S: Where would you like to go?
  U: Huntsville [SEOUL] (conf = 0.65)
    destination: seoul / 0.65
  S: traveling to Seoul. What day did you need to travel?
  U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M] (conf = 0.60, corr = 0.35)
    destination: ?
17. belief updating problem, formalized
    destination: seoul / 0.65
  S: traveling to Seoul. What day did you need to travel?
  U: [THE TRAVELING TO BERLIN P_M] (conf = 0.60, corr = 0.35)
    destination: ?
- given
  - an initial belief P_initial(C) over concept C
  - a system action SA
  - a user response R
- construct an updated belief
  - P_updated(C) = f(P_initial(C), SA, R)
18. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
19. belief updating: current solutions
- most systems only track values, not beliefs
  - new values overwrite old values
  - explicit confirm + "yes" → trust the hypothesis
  - explicit confirm + "no" → kill the hypothesis
  - explicit confirm + other → non-understanding
  - implicit confirm → not much
- users who discover errors through incorrect implicit confirmations have a harder time getting back on track [Shin et al., 2002]
20. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
21. belief updating: general form
- given
  - an initial belief P_initial(C) over concept C
  - a system action SA
  - a user response R
- construct an updated belief
  - P_updated(C) = f(P_initial(C), SA, R)
22. restricted version: two simplifications
- compact belief
  - system unlikely to hear more than 3 or 4 values
  - single vs. multiple recognition results: in our data, max 3 values, and only 6.9% have >1 value
  - track the confidence score of the top hypothesis
- updates only after confirmation actions
- reduced problem
  - ConfTop_updated(C) = f(ConfTop_initial(C), SA, R)
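One toy instantiation of this reduced update, assuming the response has already been reduced to a yes/no feature: shift the prior log-odds of the top hypothesis by a response-dependent amount. The ±2.0 shift is an invented placeholder, not a fitted coefficient, and the learned models described later use many more response features:

```python
import math

def updated_conf(conf_initial, response_is_yes):
    """ConfTop_updated = f(ConfTop_initial, R) for an explicit confirmation."""
    logit = math.log(conf_initial / (1.0 - conf_initial))  # prior log-odds
    logit += 2.0 if response_is_yes else -2.0              # shift by response
    return 1.0 / (1.0 + math.exp(-logit))                  # back to [0, 1]

print(round(updated_conf(0.65, True), 2))   # a "yes" strengthens the belief
print(round(updated_conf(0.65, False), 2))  # a "no" weakens it
```

Working in log-odds space keeps the updated score a valid probability and makes the response's contribution additive, which is also why logistic models are a natural fit for this problem.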
23. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
24. data
- collected with RoomLine
  - a phone-based mixed-initiative spoken dialog system
  - conference room reservation: search and negotiation
  - explicit and implicit confirmations
  - confidence threshold model (+ some exploration)
- implicit confirmation example
  - "I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?"
25. user study
- 46 participants, all first-time users
- 10 scenarios, fixed order
- presented graphically (explained during briefing)
- compensated per task success
26. corpus statistics
- 449 sessions, 8848 user turns
- orthographically transcribed
- manually annotated
- misunderstandings (concept-level)
- non-understandings
- user corrections
- correct concept values
27. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
28. user response types
- following Krahmer and Swerts
  - study on a Dutch train-table information system
- 3 user response types
  - YES: "yes", "right", "that's right", "correct", etc.
  - NO: "no", "wrong", etc.
  - OTHER
- cross-tabulated against the correctness of confirmations
29. user responses to explicit confirmations
- from transcripts (bracketed numbers from Krahmer & Swerts):

              YES        NO         Other
  CORRECT     94% [93]   0% [0]     5% [7]
  INCORRECT   1% [6]     72% [57]   27% [37]

- from decoded (ASR) output:

              YES    NO     Other
  CORRECT     87%    1%     12%
  INCORRECT   1%     61%    38%
30. other responses to explicit confirmations
- 70%: users repeat the correct value
- 15%: users don't address the question
  - attempt to shift conversation focus

              User does not correct    User corrects
  CORRECT     1159                     0
  INCORRECT   29 (10% of incorrect)    250 (90% of incorrect)
31. user responses to implicit confirmations
- from transcripts (bracketed numbers from Krahmer & Swerts):

              YES        NO         Other
  CORRECT     30% [0]    7% [0]     63% [100]
  INCORRECT   6% [0]     33% [15]   61% [85]

- from decoded (ASR) output:

              YES    NO     Other
  CORRECT     28%    5%     67%
  INCORRECT   7%     27%    66%
32. ignoring errors in implicit confirmations

              User does not correct    User corrects
  CORRECT     552                      2
  INCORRECT   118 (51% of incorrect)   111 (49% of incorrect)

- users correct later (40 of the 118)
- users interact strategically
  - correct only if essential
                correct    correct later
  critical      55         2
  not critical  14         47
33. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
34. machine learning approach
- need good probability outputs
  - low cross-entropy between model predictions and reality
  - cross-entropy: the negative average log posterior
- logistic regression
  - sample efficient
  - stepwise approach → feature selection
- a logistic model tree for each action
  - root splits on response-type
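The model structure and the metric above can be sketched together: a tree whose root splits on the response type, with a separate logistic model at each leaf, evaluated by cross-entropy. The leaf weights below are invented placeholders, not fitted values, and real leaves would condition on many features besides the initial confidence:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Root of the tree splits on the user's response type; each leaf holds
# its own logistic model (here: bias + weight on the initial confidence).
LEAF_WEIGHTS = {"yes": (1.5, 2.0), "no": (-2.0, 1.0), "other": (-0.5, 1.5)}

def p_correct(response_type, conf_initial):
    """P(concept value is correct) under the leaf chosen by the response."""
    bias, weight = LEAF_WEIGHTS[response_type]
    return sigmoid(bias + weight * conf_initial)

# Evaluation metric: cross-entropy, i.e. the negative average log
# posterior the model assigns to the true labels.
def cross_entropy(predictions, labels):
    return -sum(math.log(p if label else 1.0 - p)
                for p, label in zip(predictions, labels)) / len(labels)
```

A "yes" response should raise the belief and a "no" should lower it; the cross-entropy rewards models whose probabilities are calibrated, not merely whose hard decisions are right.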
35. features. target.
- initial situation
- initial confidence score
- concept identity, dialog state, turn number
- system action
- other actions performed in parallel
- features of the user response
- acoustic / prosodic features
- lexical features
- grammatical features
- dialog-level features
- target: was the value correct?
36. baselines
- initial baseline
  - accuracy of system beliefs before the update
- heuristic baseline
  - accuracy of the heuristic rule currently used in the system
- oracle baseline
  - accuracy if we knew exactly when the user is correcting the system
37. results: explicit confirmation
- (chart: hard error (%) and soft error, learned model vs. baselines)
38. results: implicit confirmation
- (chart: hard error (%) and soft error, learned model vs. baselines)
39. results: unplanned implicit confirmation
- (chart: hard error (%) and soft error, learned model vs. baselines)
40. informative features
- initial confidence score
- prosody features
- barge-in
- expectation match
- repeated grammar slots
- concept id
- priors on concept values (not included in these results)
41. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
42. discussion
- evaluation
  - does it make sense? what would be a better evaluation?
- current limitation: belief compression
  - extending models to N hypotheses + other
- current limitation: system actions
  - extending models to cover all system actions
43. thank you!
44. a more subtle caveat
- training data was collected under: confidence annotator + heuristic update rules
- run-time data is generated under: confidence annotator + learned model
- always a problem when interacting with the world!
- hopefully, the distribution shift will not cause a large degradation in performance
  - remains to be validated empirically
  - maybe a bootstrap approach?
45. KL-divergence and cross-entropy
- KL divergence: D(p || q)
- cross-entropy: CH(p, q) = H(p) + D(p || q)
- estimated from data, this is the negative (average) log likelihood
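The identity CH(p, q) = H(p) + D(p || q) can be checked numerically on a small discrete example:

```python
import math

# Check CH(p, q) = H(p) + D(p || q) on two small discrete distributions.
p = [0.5, 0.5]
q = [0.9, 0.1]

H = -sum(pi * math.log(pi) for pi in p)                   # entropy of p
D = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))   # KL divergence
CH = -sum(pi * math.log(qi) for pi, qi in zip(p, q))      # cross-entropy

assert abs(CH - (H + D)) < 1e-12
assert D > 0  # q differs from p, so the divergence is strictly positive
```

Since H(p) is fixed by the data, minimizing cross-entropy against the true labels is equivalent to minimizing the KL divergence from the model to reality.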
46. logistic regression
- regression model for binomial (binary) dependent variables
- fit by maximum likelihood (average log-likelihood)
  - any stats package will do it for you
- no R² measure
  - test fit using a likelihood ratio test
- stepwise logistic regression
  - keep adding variables while the data likelihood increases significantly
  - use the Bayesian information criterion to avoid overfitting
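The BIC stopping rule above can be sketched concretely; the log-likelihoods for the three nested candidate models below are made up for illustration:

```python
import math

def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_likelihood + num_params * math.log(num_samples)

# Hypothetical stepwise path on n = 1000 turns: each step adds one
# feature; (log-likelihood, number of parameters) pairs are invented.
n = 1000
models = [(-520.0, 2),   # bias + 1 feature
          (-480.0, 3),   # + a prosody feature: large likelihood gain
          (-478.5, 4)]   # + another feature: tiny gain

best = min(models, key=lambda m: bic(m[0], m[1], n))
print(best)  # the 3-parameter model wins: the 4th parameter's small
             # likelihood gain does not pay its ln(n) penalty
```

Each added parameter must buy at least ln(n)/2 nats of log-likelihood, which is what keeps stepwise selection from overfitting.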
47. logistic regression
48. logistic model tree
- a regression tree, but with logistic models at the leaves
- (tree diagram: root splits on feature f into f = 0 / f = 1; one branch splits again on g > 10 vs. g < 10; each leaf holds a logistic model)