Error Detection and Correction in SDS
1
Error Detection and Correction in SDS
  • Julia Hirschberg
  • CS 4706

2
Today
  • Avoiding errors
  • Detecting errors
  • From the user side: what cues does the user provide to indicate an error?
  • From the system side: how likely is it the system made an error?
  • Dealing with Errors: what can the system do when it thinks an error has occurred?
  • Evaluating SDS: evaluating problem dialogues

3
Avoiding misunderstandings
  • The problem
  • By imitating human performance
  • Timing and grounding (Clark 03)
  • Confirmation strategies
  • Clarification and repair subdialogues

4
Today
  • Avoiding errors
  • Detecting errors
  • From the user side: what cues does the user provide to indicate an error?
  • From the system side: how likely is it the system made an error?
  • Dealing with Errors: what can the system do when it thinks an error has occurred?
  • Evaluating SDS: evaluating problem dialogues

5
Learning from Human Behavior: Features in Repetition Corrections (KTH)
[Bar chart: percentage of all repetitions (adults vs. children) for three features: more clearly articulated, shifting of focus, increased loudness]
6
Learning from Human Behavior (Krahmer et al 01)
  • Learning from human behavior
  • "Go on" and "go back" signals in grounding situations (implicit/explicit verification)
  • Positive: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
  • Negative: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info

7
  • Hypotheses supported, but:
  • Can these cues be identified automatically?
  • How might they affect the design of SDS?

8
Today
  • Avoiding errors
  • Detecting errors
  • From the user side: what cues does the user provide to indicate an error?
  • From the system side: how likely is it the system made an error?
  • Dealing with Errors: what can the system do when it thinks an error has occurred?
  • Evaluating SDS: evaluating problem dialogues

9
Systems Have Trouble Knowing When They've Made a Mistake
  • Hard for humans to correct system misconceptions (Krahmer et al 99)
  • User: I want to go to Boston.
  • System: What day do you want to go to Baltimore?
  • Easier: answering explicit requests for confirmation or responding to ASR rejections
  • System: Did you say you want to go to Baltimore?
  • System: I'm sorry. I didn't understand you. Could you please repeat your utterance?

10
  • But constant confirmation or over-cautious
    rejection lengthens dialogue and decreases user
    satisfaction

11
And Systems Have Trouble Recognizing User
Corrections
  • Probability of recognition failures increases after a misrecognition (Levow 98)
  • Corrections of system errors often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation) → more ASR error (Wade et al 92, Oviatt et al 96, Swerts & Ostendorf 97, Levow 98, Bell & Gustafson 99)

12
Can Prosodic Information Help Systems Perform
Better?
  • If errors occur where speaker turns are prosodically marked:
  • Can we recognize turns that will be misrecognized
    by examining their prosody?
  • Can we modify our dialogue and recognition
    strategies to handle corrections more
    appropriately?

13
Approach
  • Collect corpus from interactive voice response
    system
  • Identify speaker turns:
  • incorrectly recognized
  • where speakers were first aware of the error
  • that correct misrecognitions
  • Identify prosodic features of turns in each
    category and compare to other turns
  • Use Machine Learning techniques to train a
    classifier to make these distinctions
    automatically

(Turn categories studied: misrecognition, aware site, correction)
14
Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York.
(Turn types annotated on the slide: misrecognition, correction, aware site)
15
TOOT Dialogues
  • Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan 99):
  • type of initiative (system, user, mixed)
  • type of confirmation (explicit, implicit, none)
  • adaptability condition
  • Subjects:
  • 39 summer students
  • 16/23 (F/M)
  • 20/19 (native speakers/non-native speakers)

16
  • Platform: combined over-the-phone ASR and TTS (Kamm et al 97) with web access to train information
  • Task: find train information for 4 scenarios
  • Corpus for current study:
  • 2328 speaker turns
  • 52 dialogues
  • Misrecognitions:
  • Overall word accuracy: 61%
  • Overall concept accuracy (CA): 71%
  • e.g., "I want to go to Boston from Philadelphia" (2 domain concepts) recognized as "I want to go to Boston" (one concept) = 50% CA (see the sketch below)
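
A minimal sketch (not from the slides) of how concept accuracy can be computed for the Boston example; the concept names used here are illustrative only.

# Illustrative sketch: concept accuracy (CA) as the fraction of intended
# domain concepts that survive recognition. Concept names are made up here.
def concept_accuracy(intended_concepts, recognized_concepts):
    if not intended_concepts:
        return 1.0
    hits = sum(1 for c in intended_concepts if c in recognized_concepts)
    return hits / len(intended_concepts)

# "I want to go to Boston from Philadelphia" has 2 domain concepts;
# the recognizer recovered only one of them -> CA = 0.5 (50%).
intended = {"destination=Boston", "origin=Philadelphia"}
recognized = {"destination=Boston"}
print(concept_accuracy(intended, recognized))  # 0.5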

17
A Successful Dialogue
  • S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
  • U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P M.
  • S: I heard you say "go to New York City from Philadelphia on Sunday approximately at ten thirty p.m." Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
  • U: Yes.
  • S: I am going to get the train schedule for you....

18
Are Misrecognitions, Aware Turns, Corrections
Measurably Different from Other Turns?
  • For each type of turn:
  • For each speaker and each prosodic feature, calculate mean values for, e.g., all correctly recognized speaker turns and for all incorrectly recognized turns
  • Perform paired t-tests on these per-speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns); a sketch follows below
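
A rough sketch of the per-speaker paired t-test described above, assuming turn records carry a speaker id, a correct/incorrect flag, and the feature value; the record layout is an assumption, and scipy.stats.ttest_rel performs the paired test.

# Sketch, assuming each turn is a dict like
# {"speaker": "s01", "correct": True, "f0_max": 212.0}.
from collections import defaultdict
from statistics import mean
from scipy.stats import ttest_rel  # paired t-test

def paired_feature_test(turns, feature):
    by_speaker = defaultdict(list)
    for t in turns:
        by_speaker[t["speaker"]].append(t)
    correct_means, error_means = [], []
    for spk_turns in by_speaker.values():
        ok = [t[feature] for t in spk_turns if t["correct"]]
        bad = [t[feature] for t in spk_turns if not t["correct"]]
        if ok and bad:  # speaker must contribute both kinds of turns
            correct_means.append(mean(ok))
            error_means.append(mean(bad))
    # One pair of means per speaker; a small p-value suggests the feature
    # differs between correct recognitions and misrecognitions.
    return ttest_rel(correct_means, error_means)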

19
Prosodic Features Examined per Turn
  • Raw prosodic/acoustic features:
  • f0 maximum and mean (pitch excursion/range)
  • RMS maximum and mean (amplitude)
  • total duration
  • duration of preceding silence
  • amount of silence within turn
  • speaking rate (estimated from syllables of recognized string per second)
  • Normalized versions of each feature (compared to first turn in task, to previous turn in task, Z scores); see the sketch below
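
A sketch of the three normalizations named above, applied to one speaker's turns within a task in temporal order. The slide does not say whether the first-turn and previous-turn comparisons are ratios or differences; ratios are used here purely for illustration.

from statistics import mean, stdev

def normalized_versions(raw_values):
    """raw_values: one raw feature value (e.g., f0 max) per turn, in task order."""
    mu = mean(raw_values)
    sigma = stdev(raw_values) if len(raw_values) > 1 else 1.0
    first = raw_values[0]
    out = []
    for i, v in enumerate(raw_values):
        prev = raw_values[i - 1] if i > 0 else first
        out.append({
            "raw": v,
            "vs_first_turn": v / first if first else 0.0,    # compared to first turn in task
            "vs_previous_turn": v / prev if prev else 0.0,   # compared to previous turn
            "zscore": (v - mu) / sigma if sigma else 0.0,    # Z score over the speaker's turns
        })
    return out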

20
Distinguishing Correct Recognitions from
Misrecognitions (NAACL 00)
  • Misrecognitions differ prosodically from correct recognitions in:
  • F0 maximum (higher)
  • RMS maximum (louder)
  • turn duration (longer)
  • preceding pause (longer)
  • speaking rate (slower)
  • Effect holds up across speakers and even when hyperarticulated turns are excluded

21
WER-Based Results
Misrecognitions are higher in pitch, louder, longer, have more preceding pause, and contain less internal silence
22
Predicting Turn Types Automatically
  • Ripper (Cohen 96) automatically induces rule sets for predicting turn types:
  • greedy search guided by a measure of information gain (illustrated below)
  • input: vectors of feature values
  • output: ordered rules for predicting the dependent variable and (cross-validated) scores for each rule set
  • Independent variables:
  • all prosodic features, raw and normalized
  • experimental conditions (adaptability of system, initiative type, confirmation style, subject, task)
  • gender, native/non-native status
  • ASR recognized string, grammar, and acoustic confidence score
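
For intuition only: a simple entropy-based information gain for one candidate condition. RIPPER's actual criterion is a FOIL-style gain, but the greedy idea of adding the condition that best separates the classes is the same.

from math import log2

def entropy(labels):             # labels: list of 0/1 class labels
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(examples, labels, condition):
    """Gain from splitting (feature dict, 0/1 label) pairs on a boolean condition."""
    left = [l for x, l in zip(examples, labels) if condition(x)]
    right = [l for x, l in zip(examples, labels) if not condition(x)]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# e.g., how much does "ASR confidence < -2.85" tell us about misrecognition?
# information_gain(turn_features, is_misrecognized, lambda t: t["conf"] < -2.85)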

23
ML Results: WER-defined Misrecognition
24
Best Rule-Set for Predicting WER
Using prosody, ASR confidence, ASR string, and ASR grammar (transcribed as code below):
if (conf < -2.85) and (duration > 1.27) then F
if (conf < -4.34) then F
if (tempo < 0.81) then F
if (conf < -4.09) then F
if (conf < -2.46) and (str contains "help") then F
if (conf < -2.47) and (ppau > 0.77) and (tempo < 0.25) then F
if (str contains "nope") then F
if (dur > 1.71) and (tempo < 1.76) then F
else T
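
The rule set above, transcribed into an ordered, first-match-wins function; this is a readable transcription of the slide, not the system's actual code. F means the turn is predicted to be misrecognized, T correctly recognized; feature names and thresholds are copied from the slide.

def predict_wer_class(conf, duration, tempo, ppau, asr_string):
    """F = predicted misrecognition, T = predicted correct recognition."""
    s = asr_string.lower()
    if conf < -2.85 and duration > 1.27:
        return "F"
    if conf < -4.34:
        return "F"
    if tempo < 0.81:
        return "F"
    if conf < -4.09:
        return "F"
    if conf < -2.46 and "help" in s:
        return "F"
    if conf < -2.47 and ppau > 0.77 and tempo < 0.25:
        return "F"
    if "nope" in s:
        return "F"
    if duration > 1.71 and tempo < 1.76:
        return "F"
    return "T"
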
25
Today
  • Avoiding errors
  • Detecting errors
  • From the user side: what cues does the user provide to indicate an error?
  • From the system side: how likely is it the system made an error?
  • Dealing with Errors: what can the system do when it thinks an error has occurred?
  • Evaluating SDS: evaluating problem dialogues

26
Error Handling Strategies
  • If systems can recognize their lack of recognition, how should they inform the user that they don't understand (Goldberg et al 03)?
  • System rephrasing vs. repetitions vs. statement
    of not understanding
  • Apologies
  • What behaviors might these produce?
  • Hyperarticulation
  • User frustration
  • User repetition vs. rephrasing

27
  • What lessons do we learn?
  • When users are frustrated they are generally
    harder to recognize accurately
  • When users are increasingly misrecognized they
    tend to be misrecognized more often and become
    increasingly frustrated
  • Apologies combined with rephrasing of system prompts tend to decrease frustration and improve WER. Don't just repeat!
  • Users are better recognized when they rephrase
    their input

28
How does an SDS Recognize a Correction? (ICSLP 00)
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York. [correction]
29
Serious Problem for Spoken Dialogue Systems
  • 29% of turns in our corpus are corrections
  • 52% of corrections are hyperarticulated, but only 12% of other turns
  • Corrections are misrecognized at least twice as often as non-corrections (60% vs. 31%)
  • But corrections are no more likely to be rejected than non-corrections (9% vs. 8%)
  • Are corrections also measurably distinct from non-corrections?

30
Prosodic Indicators of Corrections
  • Corrections differ from other turns prosodically: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
  • ML results:
  • Baseline: 30% error
  • normalized prosody + non-prosodic features: 18.45% +/- 0.78
  • automatic features: 21.48% +/- 0.68

31
Prosodic Indicators of Corrections
  • Corrections differ from other turns prosodically: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence

32
ML Rules for Correction Prediction
  • Baseline 30 error (predict not correction)
  • normd prosody non-prosody 18.45 /- 0.78
  • automatic 21.48 /- 0.68
  • TRUE - gramuniversal, f0maxgt0.96, durgt6.55
  • TRUE - gramuniversal, zerosgt0.57, asrlt-2.95
  • TRUE - gramuniversal, f0maxlt1.98, durlt1.10,
    tempogt1.21, zerosgt0.71
  • TRUE - durgt0.76, asrlt-2.97, stratUsrNoConf
  • TRUE - durgt2.28, ppault0.86
  • TRUE - rmsavgt1.11, stratMixedImplicit,
    gramcityname, f0maxgt0.70
  • default FALSE
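
A sketch of how an ordered Ripper-style rule list like the one above can be represented and applied: rules are tried in order, the first rule whose conditions all hold fires, and the default class is returned otherwise. Only a few of the slide's rules are included, and the feature dictionary layout is an assumption.

# Each rule: (condition over a turn's feature dict, predicted label).
RULES = [
    (lambda t: t["gram"] == "universal" and t["f0max"] > 0.96 and t["dur"] > 6.55, True),
    (lambda t: t["gram"] == "universal" and t["zeros"] > 0.57 and t["asr"] < -2.95, True),
    (lambda t: t["dur"] > 0.76 and t["asr"] < -2.97 and t["strat"] == "UsrNoConf", True),
    (lambda t: t["dur"] > 2.28 and t["ppau"] < 0.86, True),
]

def is_correction(turn, default=False):
    for condition, label in RULES:
        if condition(turn):   # first matching rule wins
            return label
    return default            # default: FALSE (not a correction)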

33
Corrections in Context
  • Similar in prosodic features, but:
  • What about their form and content?
  • How do system behaviors affect the corrections users produce?
  • What sorts of corrections are most and least effective?
  • When users correct the same mistake more than
    once, do they vary their strategy in productive
    ways?

34
User Correction Behavior
  • Correction classes: adds, omits, repetitions, paraphrases
  • Omits and repetitions lead to fewer misrecognitions than adds and paraphrases
  • Turns that correct rejections are more likely to be repetitions, while turns correcting misrecognitions are more likely to be omits

35
  • Type of correction is sensitive to strategy:
  • Users are much more likely to exactly repeat their misrecognized utterance in a system-initiative environment
  • Users are much more likely to correct by omitting information if there is no system confirmation than with explicit confirmation
  • Omits are used more in MixedImplicit and UserNoConfirm conditions
  • Restarts are unlikely to be recognized (77% misrecognized) and are skewed in distribution
  • 31% of corrections are restarts in MI and UNC
36
  • No restarts for SE (SystemExplicit), where initial turns are well recognized
  • It doesn't pay to start over!

37
Today
  • Avoiding errors
  • Detecting errors
  • From the user side: what cues does the user provide to indicate an error?
  • From the system side: how likely is it the system made an error?
  • Dealing with Errors: what can the system do when it thinks an error has occurred?
  • Evaluating SDS: evaluating problem dialogues

38
Recognizing Problematic Dialogues
  • Hastie et al., "What's the Trouble?", ACL 2002
  • How to define a dialogue as problematic?
  • User satisfaction is low
  • Task is not completed
  • How to recognize?
  • Train on a corpus of recorded dialogues (1242 DARPA Communicator dialogues); a sketch of this setup follows below
  • Predict:
  • User Satisfaction
  • Task Completion (0, 1, 2)
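
A minimal sketch of the train-on-labeled-dialogues setup described above, using scikit-learn; the per-dialogue feature names and the toy numbers are hypothetical, not the Hastie et al. feature set.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One row per dialogue. Hypothetical features, e.g.
# [number of turns, mean ASR confidence, number of reprompts, number of help requests].
X = np.array([[22, -1.3, 4, 1],
              [10, -0.6, 0, 0],
              [35, -2.8, 9, 3],
              [12, -0.9, 1, 0]])
y = np.array([1, 2, 0, 2])   # Task Completion label: 0, 1, or 2

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[15, -1.0, 2, 0]]))   # predicted completion level for a new dialogue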

39
  • User Satisfaction features

40
Results
41
Next Class
  • Speech data mining
  • HW3c due