Title: Error Detection and Correction in SDS
1. Error Detection and Correction in SDS
2. Today
- Avoiding errors
- Detecting errors
  - From the user side: what cues does the user provide to indicate an error?
  - From the system side: how likely is it that the system made an error?
- Dealing with errors: what can the system do when it thinks an error has occurred?
- Evaluating SDS: evaluating problem dialogues
3. Avoiding misunderstandings
- The problem
- By imitating human performance
- Timing and grounding (Clark 03)
- Confirmation strategies
- Clarification and repair subdialogues
5. Learning from Human Behavior: Features in repetition corrections (KTH)
[Figure: bar chart of the percentage of all repetitions (y-axis, 0 to 50) showing each feature, for adults and for children. Features on the x-axis: more clearly articulated, increased loudness, shifting of focus.]
6. Learning from Human Behavior (Krahmer et al 01)
- Learning from human behavior
  - "Go on" and "go back" signals in grounding situations (implicit/explicit verification)
  - Positive: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
  - Negative: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info
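As a rough illustration of how these cues could feed a decision, the sketch below counts positive and negative cues in a user turn. The cue names mirror the lists above, but the labels, the counting scheme, and the function name `classify_signal` are assumptions made for this example, not the method of Krahmer et al.

```python
# Cue inventories follow the slide; the simple vote-counting scheme is an assumption.
POSITIVE_CUES = {"short_turn", "unmarked_word_order", "confirmation",
                 "answer", "new_info"}
NEGATIVE_CUES = {"long_turn", "marked_word_order", "disconfirmation",
                 "no_answer", "correction", "repetition", "no_new_info"}


def classify_signal(observed_cues):
    """Label a user turn as 'go on', 'go back', or 'unclear' by cue counts."""
    observed = set(observed_cues)
    positive = len(observed & POSITIVE_CUES)
    negative = len(observed & NEGATIVE_CUES)
    if negative > positive:
        return "go back"   # user is probably signalling a problem
    if positive > negative:
        return "go on"     # user is probably accepting the system's move
    return "unclear"


print(classify_signal({"long_turn", "repetition", "correction"}))
print(classify_signal({"short_turn", "confirmation", "new_info"}))
```

A real detector would have to extract these cues automatically from the recognized string and the audio, which is the open question raised on the next slide.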
7. (continued)
- Hypotheses supported, but:
  - Can these cues be identified automatically?
  - How might they affect the design of SDS?
9. Systems Have Trouble Knowing When They've Made a Mistake
- Hard for humans to correct system misconceptions (Krahmer et al 99)
  - User: I want to go to Boston.
  - System: What day do you want to go to Baltimore?
- Easier: answering explicit requests for confirmation or responding to ASR rejections
  - System: Did you say you want to go to Baltimore?
  - System: I'm sorry. I didn't understand you. Could you please repeat your utterance?
10. (continued)
- But constant confirmation or over-cautious rejection lengthens dialogue and decreases user satisfaction
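The trade-off above can be made concrete with a confidence-threshold policy. This is only a sketch: the thresholds, the assumption that ASR confidence lies in [0, 1], and the name `choose_action` are illustrative and do not come from the slides.

```python
def choose_action(confidence, reject_below=0.3, confirm_below=0.7):
    """Pick a dialogue move from an ASR confidence score in [0, 1].

    Both thresholds are hypothetical. Raising them makes the system more
    cautious (more rejections and confirmations, hence longer dialogues);
    lowering them lets more misrecognitions through unnoticed.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < reject_below:
        return "reject"              # "I'm sorry. I didn't understand you."
    if confidence < confirm_below:
        return "confirm_explicitly"  # "Did you say you want to go to ...?"
    return "accept"                  # proceed without confirmation


for score in (0.1, 0.5, 0.9):
    print(score, choose_action(score))
```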
11. And Systems Have Trouble Recognizing User Corrections
- Probability of recognition failures increases after a misrecognition (Levow 98)
- Corrections of system errors are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation), which leads to more ASR errors (Wade et al 92, Oviatt et al 96, Swerts & Ostendorf 97, Levow 98, Bell & Gustafson 99)
12. Can Prosodic Information Help Systems Perform Better?
- If errors occur where speaker turns are prosodically marked:
  - Can we recognize turns that will be misrecognized by examining their prosody?
  - Can we modify our dialogue and recognition strategies to handle corrections more appropriately?
13. Approach
- Collect corpus from interactive voice response system
- Identify speaker turns
  - incorrectly recognized (misrecognition)
  - where speakers are first aware of the error (aware site)
  - that correct misrecognitions (correction)
- Identify prosodic features of turns in each category and compare to other turns
- Use machine learning techniques to train a classifier to make these distinctions automatically
14. Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York.
[Slide callouts marking turns in this dialogue: misrecognition, correction, aware site]
15. TOOT Dialogues
- Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan 99)
  - type of initiative (system, user, mixed)
  - type of confirmation (explicit, implicit, none)
  - adaptability condition
- Subjects
  - 39 summer students
  - 16/23 (F/M)
  - 20/19 (native speaker/non-native)
16. (continued)
- Platform: combined over-the-phone ASR and TTS (Kamm et al 97) with web access to train information
- Task: find train information for 4 scenarios
- Corpus for current study
  - 2328 speaker turns
  - 52 dialogues
- Misrecognitions
  - Overall word accuracy: 61%
  - Overall concept accuracy (CA): 71%
  - "I want to go to Boston from Philadelphia" (2 domain concepts)
  - recognized as "I want to go to Boston" (one concept): 50% CA
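The Boston/Philadelphia example can be checked with a few lines of code. This is a minimal sketch: the slot names (`arrival`, `departure`) are hypothetical, and word accuracy is taken here to be 1 minus word error rate, which is one common definition and may differ from the one used for the corpus figures.

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - WER, with WER from word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return 1 - prev[-1] / len(ref)


def concept_accuracy(reference_concepts, recognized_concepts):
    """Fraction of reference slot/value pairs that were recognized correctly."""
    if not reference_concepts:
        raise ValueError("reference must contain at least one concept")
    correct = sum(1 for slot, value in reference_concepts.items()
                  if recognized_concepts.get(slot) == value)
    return correct / len(reference_concepts)


said = {"arrival": "Boston", "departure": "Philadelphia"}  # 2 domain concepts
heard = {"arrival": "Boston"}                              # 1 concept survives
print(concept_accuracy(said, heard))
print(word_accuracy("i want to go to boston from philadelphia",
                    "i want to go to boston"))
```

On this utterance the concept accuracy is 0.5, matching the slide, while the word accuracy is 0.75 (two deleted words out of eight), which shows how the two measures can diverge.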
17. A Successful Dialogue
- S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
- U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P M
- S: I heard you say "go to New York City from Philadelphia on Sunday approximately at ten thirty p.m." Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
- U: Yes.
- S: I am going to get the train schedule for you....
18. Are Misrecognitions, Aware Turns, Corrections Measurably Different from Other Turns?
- For each type of turn
  - For each speaker, for each prosodic feature, calculate mean values for, e.g., all correctly recognized speaker turns and for all incorrectly recognized turns
  - Perform paired t-tests on these speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns)
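The per-speaker pairing described above can be sketched as follows. The turn data are invented purely to exercise the code, the helper names are hypothetical, and only the t statistic and degrees of freedom are computed; a library routine such as `scipy.stats.ttest_rel` would also return the p-value.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def speaker_means(turns):
    """Map each speaker to (mean over correct turns, mean over misrecognized turns).

    `turns` is an iterable of (speaker, misrecognized, feature_value).
    Speakers lacking turns in either group are dropped, since they cannot be paired.
    """
    buckets = defaultdict(lambda: ([], []))
    for speaker, misrecognized, value in turns:
        buckets[speaker][1 if misrecognized else 0].append(value)
    return {s: (mean(ok), mean(bad))
            for s, (ok, bad) in buckets.items() if ok and bad}


def paired_t(pairs):
    """Paired t statistic and degrees of freedom for (correct, misrecognized) means."""
    diffs = [bad - ok for ok, bad in pairs]
    n = len(diffs)
    # stdev() is the sample standard deviation; identical differences would give 0.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Made-up values of one prosodic feature (say, turn duration in seconds).
TURNS = [("A", False, 1.0), ("A", False, 3.0), ("A", True, 3.0),
         ("B", False, 2.0), ("B", True, 3.0), ("B", True, 5.0),
         ("C", False, 1.0), ("C", True, 4.0)]

means = speaker_means(TURNS)
t_stat, dof = paired_t(means.values())
print(means, round(t_stat, 3), dof)
```

Pairing by speaker removes between-speaker variation (some people simply talk louder or higher), so the test asks whether each speaker's misrecognized turns differ from that same speaker's correctly recognized turns.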
19. How: Prosodic Features Examined per Turn
- Raw prosodic/acoustic features
  - f0 maximum and mean (pitch excursion/range)
  - rms maximum and mean (amplitude)
  - total duration
  - duration of preceding silence
  - amount of silence within turn
  - speaking rate (estimated from syllables of recognized string per second)
- Normalized versions of each feature (compared to first turn in task, to previous turn in task, Z scores)
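The three normalizations listed above might be computed as in this sketch. It assumes that "compared to" means a ratio, that z-scores are taken over one speaker's turns within a task, and that feature values are positive; the slides do not spell out any of these choices, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def normalize_feature(values):
    """Normalize one speaker's raw feature values, given in turn order for a task.

    Returns one dict per turn with the raw value, its ratio to the first turn,
    its ratio to the previous turn, and its z-score over the given turns.
    """
    first = values[0]
    mu, sigma = mean(values), pstdev(values)
    rows = []
    for i, value in enumerate(values):
        previous = values[i - 1] if i > 0 else value  # first turn: ratio of 1.0
        rows.append({
            "raw": value,
            "rel_first": value / first,
            "rel_prev": value / previous,
            "z": (value - mu) / sigma if sigma else 0.0,
        })
    return rows


# Made-up durations (seconds) for three consecutive turns.
rows = normalize_feature([2.0, 4.0, 6.0])
for row in rows:
    print(row)
```

Normalizing this way lets a single threshold apply across speakers: a ratio of 1.5 to the previous turn means "50% longer than this speaker's last turn" regardless of how long that speaker's turns usually are.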
20. Distinguishing Correct Recognitions from Misrecognitions (NAACL 00)
- Misrecognitions differ prosodically from correct recognitions in
  - F0 maximum (higher)
  - RMS maximum (louder)
  - turn duration (longer)
  - preceding pause (longer)
  - speaking rate (slower)
- Effect holds up across speakers and even when hyperarticulated turns are excluded
21. WER-Based Results
Misrecognitions are higher in pitch, louder, longer, with more preceding pause and less internal silence
22. Predicting Turn Types Automatically
- Ripper (Cohen 96) automatically induces rule sets for predicting turn types
  - greedy search guided by measure of information gain
  - input: vectors of feature values
  - output: ordered rules for predicting dependent variable and (cross-validated) scores for each rule set
- Independent variables
  - all prosodic features, raw and normalized
  - experimental conditions (adaptability of system, initiative type, confirmation style, subject, task)
  - gender, native/non-native status
  - ASR recognized string, grammar, and acoustic confidence score
23. ML Results: WER-defined Misrecognition
24. Best Rule-Set for Predicting WER
Using prosody, ASR conf, ASR string, ASR grammar
if (conf < -2.85) and (duration > 1.27) then F
if (conf < -4.34) then F
if (tempo < .81) then F
if (conf < -4.09) then F
if (conf < -2.46) and (str contains "help") then F
if (conf < -2.47) and (ppau > .77) and (tempo < .25) then F
if (str contains "nope") then F
if (dur > 1.71) and (tempo < 1.76) then F
else T
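An ordered rule set like the one above translates directly into a chain of early returns. The sketch below makes several assumptions: "lt" and "gt" are read as strict < and >, conditions within a rule are conjoined, "duration" and "dur" name the same feature, and F is read as "misrecognized" (which the low-confidence conditions suggest). The function and parameter names are hypothetical.

```python
def predict_correctly_recognized(conf, dur, tempo, ppau, text):
    """Apply the slide's ordered rules; True is T, False is F (assumed: misrecognized).

    conf: ASR confidence score, dur: turn duration, tempo: speaking rate,
    ppau: preceding pause, text: ASR recognized string.
    """
    recognized = text.lower()
    if conf < -2.85 and dur > 1.27:
        return False
    if conf < -4.34:
        return False
    if tempo < 0.81:
        return False
    if conf < -4.09:
        return False
    if conf < -2.46 and "help" in recognized:
        return False
    if conf < -2.47 and ppau > 0.77 and tempo < 0.25:
        return False
    if "nope" in recognized:
        return False
    if dur > 1.71 and tempo < 1.76:
        return False
    return True  # default rule: else T


# Made-up feature values for two turns.
print(predict_correctly_recognized(-1.0, 1.0, 2.0, 0.1, "new york"))
print(predict_correctly_recognized(-3.0, 2.0, 2.0, 0.1, "new york"))
```

Because the rules are ordered, the first one that fires decides the label and the final `return True` plays the role of the default class.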
26. Error Handling Strategies
- If systems can recognize their lack of recognition, how should they inform the user that they don't understand (Goldberg et al 03)?
  - System rephrasing vs. repetitions vs. statement of not understanding
  - Apologies
- What behaviors might these produce?
  - Hyperarticulation
  - User frustration
  - User repetition vs. rephrasing
27. (continued)
- What lessons do we learn?
  - When users are frustrated they are generally harder to recognize accurately
  - When users are increasingly misrecognized they tend to be misrecognized more often and become increasingly frustrated
  - Apologies combined with rephrasing of system prompts tend to decrease frustration and improve WER: don't just repeat!
  - Users are better recognized when they rephrase their input
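One way a system might act on the "apologize and rephrase" lesson is sketched below. The prompt wordings after the first, the apology text, and the function name `reprompt` are all invented for illustration; they are not taken from Goldberg et al or from TOOT.

```python
# Alternative wordings of the same question; only the first appears in the slides.
DESTINATION_PROMPTS = [
    "Which city do you want to go to?",
    "Please tell me the name of your destination city.",
    "Where would you like to travel to? For example, say 'New York'.",
]

APOLOGY = "I'm sorry, I didn't understand. "


def reprompt(failure_count, prompts=DESTINATION_PROMPTS):
    """Return the next prompt after `failure_count` consecutive non-understandings.

    The first attempt uses the plain prompt. Every later attempt apologizes and
    moves to a different wording instead of repeating the previous one, staying
    on the last wording once the list is exhausted.
    """
    if failure_count <= 0:
        return prompts[0]
    variant = prompts[min(failure_count, len(prompts) - 1)]
    return APOLOGY + variant


for failures in range(3):
    print(reprompt(failures))
```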
28. How does an SDS Recognize a Correction? (ICSLP 00)
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York.
[Slide callout marking a turn in this dialogue: correction]
29. Serious Problem for Spoken Dialogue Systems
- 29% of turns in our corpus are corrections
- 52% of corrections are hyperarticulated but only 12% of other turns
- Corrections are misrecognized at least twice as often as non-corrections (60% vs. 31%)
- But corrections are no more likely to be rejected than non-corrections (9% vs. 8%)
- Are corrections also measurably distinct from non-corrections?
30. Prosodic Indicators of Corrections
- Corrections differ from other turns prosodically: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
- ML results
  - Baseline: 30% error
  - norm'd prosody + non-prosody: 18.45% +/- 0.78
  - automatic: 21.48% +/- 0.68
32. ML Rules for Correction Prediction
- Baseline: 30% error (predict not correction)
- norm'd prosody + non-prosody: 18.45% +/- 0.78
- automatic: 21.48% +/- 0.68
- TRUE :- gram = universal, f0max > 0.96, dur > 6.55
- TRUE :- gram = universal, zeros > 0.57, asr < -2.95
- TRUE :- gram = universal, f0max < 1.98, dur < 1.10, tempo > 1.21, zeros > 0.71
- TRUE :- dur > 0.76, asr < -2.97, strat = UsrNoConf
- TRUE :- dur > 2.28, ppau < 0.86
- TRUE :- rmsav > 1.11, strat = MixedImplicit, gram = cityname, f0max > 0.70
- default FALSE
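Rules of this shape can also be stored as data and evaluated generically, which makes it easy to swap in a different induced rule set. In this sketch "gt" and "lt" are read as strict > and <, each rule is a conjunction of its conditions, and the feature names are copied from the slide; the representation and the function name `is_correction` are assumptions.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

# Each rule is a list of (feature, operator, value) conditions that must all hold.
CORRECTION_RULES = [
    [("gram", "=", "universal"), ("f0max", ">", 0.96), ("dur", ">", 6.55)],
    [("gram", "=", "universal"), ("zeros", ">", 0.57), ("asr", "<", -2.95)],
    [("gram", "=", "universal"), ("f0max", "<", 1.98), ("dur", "<", 1.10),
     ("tempo", ">", 1.21), ("zeros", ">", 0.71)],
    [("dur", ">", 0.76), ("asr", "<", -2.97), ("strat", "=", "UsrNoConf")],
    [("dur", ">", 2.28), ("ppau", "<", 0.86)],
    [("rmsav", ">", 1.11), ("strat", "=", "MixedImplicit"),
     ("gram", "=", "cityname"), ("f0max", ">", 0.70)],
]


def is_correction(turn, rules=CORRECTION_RULES):
    """Return True if any rule fires for the turn's features, else the default FALSE."""
    return any(all(OPS[op](turn[feature], value) for feature, op, value in rule)
               for rule in rules)


# Made-up feature vector for one turn.
turn = {"gram": "cityname", "f0max": 0.5, "dur": 3.0, "zeros": 0.1,
        "asr": -1.0, "tempo": 1.0, "ppau": 0.5, "rmsav": 1.0,
        "strat": "SystemExplicit"}
print(is_correction(turn))  # the fifth rule (dur > 2.28, ppau < 0.86) fires
```

Since every rule here predicts the same class, rule order does not change the outcome; it would matter only if the rule set mixed several target classes.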
33. Corrections in Context
- Similar in prosodic features, but
  - What about their form and content?
  - How do system behaviors affect the corrections users produce?
  - What sort of corrections are most, least effective?
  - When users correct the same mistake more than once, do they vary their strategy in productive ways?
34. User Correction Behavior
- Correction classes
  - omits and repetitions lead to fewer misrecognitions than adds and paraphrases
- Turns that correct rejections are more likely to be repetitions, while turns correcting misrecognitions are more likely to be omits
35. (continued)
- Type of correction is sensitive to strategy
  - users are much more likely to exactly repeat their misrecognized utterance in a system-initiative environment
  - users are much more likely to correct by omitting information if there is no system confirmation than with explicit confirmation
  - omits are used more in MixedImplicit and UserNoConfirm conditions
- Restarts are unlikely to be recognized (77% misrecognized) and skewed in distribution
  - 31% of corrections are restarts in MI (MixedImplicit) and UNC (UserNoConfirm)
36. (continued)
- No restarts for SE (SystemExplicit), where initial turns are well recognized
- It doesn't pay to start over!
38. Recognizing Problematic Dialogues
- Hastie et al, "What's the Trouble?", ACL 2002
- How to define a dialogue as problematic?
  - User satisfaction is low
  - Task is not completed
- How to recognize?
  - Train on a corpus of recorded dialogues (1242 DARPA Communicator dialogues)
  - Predict
    - User Satisfaction
    - Task Completion (0, 1, 2)
39. (continued)
- User Satisfaction features
40. Results
41. Next Class
- Speech data mining
- HW3c due