Title: misunderstandings, corrections and beliefs in spoken language interfaces
1. misunderstandings, corrections and beliefs in spoken language interfaces
- Dan Bohus
- Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
- www.cs.cmu.edu/dbohus
- dbohus_at_cs.cmu.edu
2. problem
- spoken language interfaces lack robustness when
faced with understanding errors
- stems mostly from speech recognition
- spans most domains and interaction types
- exacerbated by operating conditions
3. more concretely
- S: What city are you leaving from?
- U: Urbana Champaign [OKAY IN THAT SAME PAY]
- S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
- U: Urbana Champaign [FOR MINUTE SINCE HEY]
- S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
- U: Chicago [CHICAGO]
- S: traveling from Chicago. Where would you like to go?
- U: Huntsville [SEOUL]
- S: traveling to Seoul. What day did you need to travel?
- U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
- S: traveling in the afternoon. Okay, what day would you be departing Chicago?
- U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
- S: a flight tomorrow... I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
- U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
- S: traveling on Saturday, August 12th. I have a flight departing Chicago at 1:40pm, arrives Seoul at
- (text in [brackets] is the speech recognizer's hypothesis)
4. some statistics
- semantic error rates: 25-35%
  SpeechActs (SRI): 25%
  CU Communicator (CU): 27%
  Jupiter (MIT): 28%
  CMU Communicator (CMU): 32%
  How May I Help You? (ATT): 36%
- corrections [Krahmer, Swerts, Litman, Levow]
  - 30% of utterances correct system mistakes
  - 2-3 times more likely to be misrecognized
5. two types of understanding errors
- misunderstandings vs. non-understandings
6. misunderstandings
- detect potential misunderstandings and do something about them
7. outline
- detecting misunderstandings
- detecting user corrections
  - late-detection of misunderstandings
- belief updating
  - construct accurate beliefs by integrating information from multiple turns
8. detecting misunderstandings
- recognition confidence scores
  S: What city are you leaving from?
  U: Birmingham [BERLIN PM] (conf = 0.63)
- traditionally [Bansal, Chase, Cox, Kemp, many others]: speech recognition confidence scores
  - use acoustic, language model and search information
  - computed at the frame, phoneme, or word level
9. semantic confidence scores
- we're interested in semantics, not words
  - YES / YEAH, NO / NO WAY
- use machine learning to build confidence annotators
  - in-domain, manually labeled data
    - utterance: BERLIN PM (user said: Birmingham)
    - labels: correct / misunderstood
  - features from different knowledge sources
  - binary classification problem
  - probability of misunderstanding: a regression problem
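The confidence-annotator idea above can be sketched end to end. This is a minimal illustration, not any system's actual annotator: the two features (ASR confidence, fraction of the utterance covered by the parse) and the toy labeled data are invented, and a real annotator would be trained on thousands of labeled in-domain utterances with many more features.

```python
import math

# Toy training set: one row of (invented) features per utterance.
# Features: [ASR confidence, fraction of words covered by the parse]
X = [[0.92, 1.0], [0.35, 0.4], [0.80, 1.0], [0.25, 0.2], [0.70, 0.8], [0.30, 0.5]]
y = [1, 0, 1, 0, 1, 0]  # 1 = correct, 0 = misunderstood

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit a logistic regression by stochastic gradient ascent on the
# log-likelihood (any stats/ML package would do this for you).
w = [0.0, 0.0, 0.0]  # bias + one weight per feature
for _ in range(2000):
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
        err = yi - p
        w[0] += 0.1 * err
        for j, xj in enumerate(xi):
            w[j + 1] += 0.1 * err * xj

def confidence(x):
    """Semantic confidence score: P(hypothesis is correct | features)."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

print(confidence([0.90, 1.0]) > 0.5)  # well-recognized utterance -> True
print(confidence([0.20, 0.3]) > 0.5)  # likely misunderstanding -> False
```

Returning the probability directly (rather than thresholding it) is what turns the binary classification task into the regression formulation mentioned above.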
10. a typical result
- "Identifying User Corrections Automatically in a Spoken Dialog System" [Walker, Wright, Langkilde]
- How May I Help You? corpus: call routing for phone services
  - 11787 turns
- features
  - ASR: recog, numwords, duration, dtmf, rg-grammar, tempo
  - understanding: confidence, context-shift, top-task, diff-conf, ...
  - dialog history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, ...
- binary classification task
  - majority baseline (error): 36.5%
  - RIPPER (error): 14%
11. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
12. detect user corrections
- is the user trying to correct the system?
  S: Where would you like to go?
  U: Huntsville [SEOUL]  ← misunderstanding
  S: traveling to Seoul. What day did you need to travel?
  U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]  ← user correction + misunderstanding
- same story: use machine learning
  - in-domain, manually labeled data
  - features from different knowledge sources
  - binary classification problem
  - probability of correction: a regression problem
13. a typical result
- "Identifying User Corrections Automatically in a Spoken Dialog System" [Hirschberg, Litman, Swerts]
- TOOT corpus: access to train information
  - 2328 turns, 152 dialogs
- features
  - prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo
  - ASR: gram, str, conf, ynstr, ...
  - dialog position: diadist
  - dialog history: preturn, prepreturn, pmeanf
- binary classification task
  - majority baseline (error): 29%
  - RIPPER (error): 15.7%
14. outline
- detecting misunderstandings
- detecting user corrections
  - late-detection of misunderstandings
- belief updating
  - construct accurate beliefs by integrating information from multiple turns
15. belief updating problem: an easy case
  S: on which day would you like to travel?
  U: on September 3rd [AN DECEMBER THIRD] (conf = 0.25)
    departure_date: Dec-03 / 0.25
  S: did you say you wanted to leave on December 3rd?
  U: no [NO] (conf = 0.88)
    departure_date: Ø
16. belief updating problem: a trickier case
  S: Where would you like to go?
  U: Huntsville [SEOUL] (conf = 0.65)
    destination: seoul / 0.65
  S: traveling to Seoul. What day did you need to travel?
  U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M] (conf = 0.60, corr = 0.35)
    destination: ?
17. belief updating problem, formalized
    destination: seoul / 0.65
  S: traveling to Seoul. What day did you need to travel?
  U: [THE TRAVELING TO BERLIN P_M] (conf = 0.60, corr = 0.35)
    destination: ?
- given
  - an initial belief P_initial(C) over concept C
  - a system action SA
  - a user response R
- construct an updated belief
  - P_updated(C) = f(P_initial(C), SA, R)
18. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
19. belief updating: current solutions
- most systems only track values, not beliefs
  - new values overwrite old values
  - explicit confirm + "yes" → trust the hypothesis
  - explicit confirm + "no" → kill the hypothesis
  - explicit confirm + other → non-understanding
  - implicit confirm → not much
- users who discover errors through incorrect implicit confirmations have a harder time getting back on track [Shin et al., 2002]
20. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
21. belief updating: general form
- given
  - an initial belief P_initial(C) over concept C
  - a system action SA
  - a user response R
- construct an updated belief
  - P_updated(C) = f(P_initial(C), SA, R)
22. restricted version: two simplifications
- compact belief
  - system unlikely to hear more than 3 or 4 values
  - single vs. multiple recognition results: in our data, max 3 values, and only 6.9% have >1 value
  - track the confidence score of the top hypothesis
- updates only after confirmation actions
- reduced problem
  - ConfTop_updated(C) = f(ConfTop_initial(C), SA, R)
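One toy instantiation of this reduced update, assuming the response has already been reduced to a yes/no feature: shift the prior log-odds of the top hypothesis by a response-dependent amount. The ±2.0 shift is an invented placeholder, not a fitted coefficient, and the learned models described later use many more response features:

```python
import math

def updated_conf(conf_initial, response_is_yes):
    """ConfTop_updated = f(ConfTop_initial, R) for an explicit confirmation."""
    logit = math.log(conf_initial / (1.0 - conf_initial))  # prior log-odds
    logit += 2.0 if response_is_yes else -2.0              # shift by response
    return 1.0 / (1.0 + math.exp(-logit))                  # back to [0, 1]

print(round(updated_conf(0.65, True), 2))   # a "yes" strengthens the belief
print(round(updated_conf(0.65, False), 2))  # a "no" weakens it
```

Working in log-odds space keeps the updated score a valid probability and makes the response's contribution additive, which is also why logistic models are a natural fit for this problem.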
23. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
24. data
- collected with RoomLine
  - a phone-based mixed-initiative spoken dialog system
  - conference room reservation: search and negotiation
  - explicit and implicit confirmations
  - confidence threshold model (+ some exploration)
- implicit confirmation example
  - "I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?"
25. user study
- 46 participants, all first-time users
- 10 scenarios, fixed order
- presented graphically (explained during briefing)
- compensated per task success
26. corpus statistics
- 449 sessions, 8848 user turns
- orthographically transcribed
- manually annotated
- misunderstandings (concept-level)
- non-understandings
- user corrections
- correct concept values
27. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
28. user response types
- following Krahmer and Swerts
  - study on a Dutch train-table information system
- 3 user response types
  - YES: "yes", "right", "that's right", "correct", etc.
  - NO: "no", "wrong", etc.
  - OTHER
- cross-tabulated against the correctness of confirmations
29. user responses to explicit confirmations
- from transcripts (bracketed numbers from Krahmer & Swerts):

              YES        NO         Other
  CORRECT     94% [93]   0% [0]     5% [7]
  INCORRECT   1% [6]     72% [57]   27% [37]

- from decoded (ASR) output:

              YES    NO     Other
  CORRECT     87%    1%     12%
  INCORRECT   1%     61%    38%
30. other responses to explicit confirmations
- 70%: users repeat the correct value
- 15%: users don't address the question
  - attempt to shift conversation focus

              User does not correct    User corrects
  CORRECT     1159                     0
  INCORRECT   29 (10% of incorrect)    250 (90% of incorrect)
31. user responses to implicit confirmations
- from transcripts (bracketed numbers from Krahmer & Swerts):

              YES        NO         Other
  CORRECT     30% [0]    7% [0]     63% [100]
  INCORRECT   6% [0]     33% [15]   61% [85]

- from decoded (ASR) output:

              YES    NO     Other
  CORRECT     28%    5%     67%
  INCORRECT   7%     27%    66%
32. ignoring errors in implicit confirmations

              User does not correct    User corrects
  CORRECT     552                      2
  INCORRECT   118 (51% of incorrect)   111 (49% of incorrect)

- users correct later (40 of the 118)
- users interact strategically
  - correct only if essential
                correct    correct later
  critical      55         2
  not critical  14         47
33. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
34. machine learning approach
- need good probability outputs
  - low cross-entropy between model predictions and reality
  - cross-entropy: the negative average log posterior
- logistic regression
  - sample efficient
  - stepwise approach → feature selection
- a logistic model tree for each action
  - root splits on response-type
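The model structure and the metric above can be sketched together: a tree whose root splits on the response type, with a separate logistic model at each leaf, evaluated by cross-entropy. The leaf weights below are invented placeholders, not fitted values, and real leaves would condition on many features besides the initial confidence:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Root of the tree splits on the user's response type; each leaf holds
# its own logistic model (here: bias + weight on the initial confidence).
LEAF_WEIGHTS = {"yes": (1.5, 2.0), "no": (-2.0, 1.0), "other": (-0.5, 1.5)}

def p_correct(response_type, conf_initial):
    """P(concept value is correct) under the leaf chosen by the response."""
    bias, weight = LEAF_WEIGHTS[response_type]
    return sigmoid(bias + weight * conf_initial)

# Evaluation metric: cross-entropy, i.e. the negative average log
# posterior the model assigns to the true labels.
def cross_entropy(predictions, labels):
    return -sum(math.log(p if label else 1.0 - p)
                for p, label in zip(predictions, labels)) / len(labels)
```

A "yes" response should raise the belief and a "no" should lower it; the cross-entropy rewards models whose probabilities are calibrated, not merely whose hard decisions are right.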
35. features. target.
- initial situation
- initial confidence score
- concept identity, dialog state, turn number
- system action
- other actions performed in parallel
- features of the user response
- acoustic / prosodic features
- lexical features
- grammatical features
- dialog-level features
- target: was the value correct?
36. baselines
- initial baseline
  - accuracy of system beliefs before the update
- heuristic baseline
  - accuracy of the heuristic rule currently used in the system
- oracle baseline
  - accuracy if we knew exactly when the user is correcting the system
37. results: explicit confirmation
- (chart: hard error (%) and soft error, learned model vs. baselines)
38. results: implicit confirmation
- (chart: hard error (%) and soft error, learned model vs. baselines)
39. results: unplanned implicit confirmation
- (chart: hard error (%) and soft error, learned model vs. baselines)
40. informative features
- initial confidence score
- prosody features
- barge-in
- expectation match
- repeated grammar slots
- concept id
- priors on concept values (not included in these results)
41. outline
- detecting misunderstandings
- detecting user corrections
- late-detection of misunderstandings
- belief updating
- construct accurate beliefs by integrating
information from multiple turns
- current solutions
- a restricted version
- data
- user response analysis
- experiments and results
- discussion. caveats. future work
42. discussion
- evaluation
  - does it make sense? what would be a better evaluation?
- current limitation: belief compression
  - extending models to N hypotheses + other
- current limitation: system actions
  - extending models to cover all system actions
43. thank you!
44. a more subtle caveat
- training data was collected under: confidence annotator + heuristic update rules
- run-time data is generated under: confidence annotator + learned model
- always a problem when interacting with the world!
- hopefully, the distribution shift will not cause a large degradation in performance
  - remains to be validated empirically
  - maybe a bootstrap approach?
45. KL-divergence and cross-entropy
- KL divergence: D(p || q)
- cross-entropy: CH(p, q) = H(p) + D(p || q)
- estimated from data, this is the negative (average) log likelihood
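The identity CH(p, q) = H(p) + D(p || q) can be checked numerically on a small discrete example:

```python
import math

# Check CH(p, q) = H(p) + D(p || q) on two small discrete distributions.
p = [0.5, 0.5]
q = [0.9, 0.1]

H = -sum(pi * math.log(pi) for pi in p)                   # entropy of p
D = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))   # KL divergence
CH = -sum(pi * math.log(qi) for pi, qi in zip(p, q))      # cross-entropy

assert abs(CH - (H + D)) < 1e-12
assert D > 0  # q differs from p, so the divergence is strictly positive
```

Since H(p) is fixed by the data, minimizing cross-entropy against the true labels is equivalent to minimizing the KL divergence from the model to reality.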
46. logistic regression
- regression model for binomial (binary) dependent variables
- fit by maximum likelihood (average log-likelihood)
  - any stats package will do it for you
- no R² measure
  - test fit using a likelihood ratio test
- stepwise logistic regression
  - keep adding variables while the data likelihood increases significantly
  - use the Bayesian information criterion to avoid overfitting
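The BIC stopping rule above can be sketched concretely; the log-likelihoods for the three nested candidate models below are made up for illustration:

```python
import math

def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_likelihood + num_params * math.log(num_samples)

# Hypothetical stepwise path on n = 1000 turns: each step adds one
# feature; (log-likelihood, number of parameters) pairs are invented.
n = 1000
models = [(-520.0, 2),   # bias + 1 feature
          (-480.0, 3),   # + a prosody feature: large likelihood gain
          (-478.5, 4)]   # + another feature: tiny gain

best = min(models, key=lambda m: bic(m[0], m[1], n))
print(best)  # the 3-parameter model wins: the 4th parameter's small
             # likelihood gain does not pay its ln(n) penalty
```

Each added parameter must buy at least ln(n)/2 nats of log-likelihood, which is what keeps stepwise selection from overfitting.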
47. logistic regression
48. logistic model tree
- a regression tree, but with logistic models at the leaves
- (tree diagram: root splits on feature f into f = 0 / f = 1; one branch splits again on g > 10 vs. g < 10; each leaf holds a logistic model)