Title: baselines for recovering from non-understandings
1baselines for recovering from non-understandings
- a planned investigation
- Dan Bohus
- DoD Reading Group, May 2004
2understanding errors
- Two main types of errors
- System cannot form a valid interpretation for the user's input → non-understanding
- System forms an incorrect interpretation of the user's input → misunderstanding
3non-understandings
- Can be a big pain; there's not much to act upon
- S: When do you need a room?
- U: blah bla blahaha (3 unparsed words)
- S: I'm sorry, I didn't understand you. Can you please repeat that?
- What would you do in the system's place?
4an example
- 1 S: Ok. I will now make the reservation for Newell Simon thirty three oh five. Could I have your name please? If you are not registered with the system, just say 'guest'.
- U: guest [YES]
- 2 S: Sorry, I'm not sure I understood what you said.
- U: guest [DATE]
- 3 S: I am currently trying to make a room reservation for Newell Simon thirty three oh five. Right now I need to know your name in order to make a conference room reservation. Tell me your name, or say 'guest' if you are not registered with the system. If you do not want to make this reservation, say 'cancel reservation'.
- U: guest [THE]
- 4 S: I apologize, I still did not understand you. Could you try to rephrase that please?
- U: guest [YEAH]
- 5 S: Sorry, I didn't catch that.
- U: oh my god guest we're a guest [.WHAT. .YES. .YES. FOR TUESDAY]
- 6 S: Newell Simon thirty five oh one is the only room available Tuesday, April twentieth between ten and noon. This room can accommodate seventeen people, and has a projector, a whiteboard and network access. Do you want a reservation for Newell Simon thirty five oh one?
- U: yes [YES]
- 7 S: Ok. I will now make the reservation for Newell Simon thirty five oh one. Could I have your name please? If you are not registered with the system, just say 'guest'.
- U: guest [OKAY]
- 8 S: I'm sorry, I don't think I understood you correctly. Could you please repeat that?
- U: I'm a f guest [UP AGAIN]
- 9 S: Sorry, I still did not understand you.
6causes
- User's language does not match system's input language
- Languages match, but recognition problems
- Recognition OK, but parsing problems
- Parsing OK, but interpretation problems
- To do: a more detailed analysis!
7what can one do?
- Notify non-understanding
- Repeat system prompt
- Ask repeat
- Ask rephrase
- Help: give state-specific help
- Help: give help about what the user can say
- Help: establish the context
8what else?
- Try an alternative dialog plan to achieve the same goal
- including ignore, take default value
- Extract more information/content from the non-understanding, and do something smarter with that
- Use fall-back parses on the recognition hypothesis
- Explicit confirm turn (Antoine)
- Targeted help
- Other ideas?
9the decision process
[Diagram with three labeled elements: True causes, POLICY, Strategies]
- Handcraft a policy
- Learn it, for instance in a reinforcement learning framework
10markov decision processes
- States
- Various non-understanding states
- 1 understanding state (final)
- Actions
- Recovery strategies
- Rewards
- -10 on each transition to a non-understanding
state
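[Diagram: non-understanding states NU1, NU2, NU3 and understanding state U; example action label "Repeat"; transition rewards shown: -10, -10, 0]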
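A minimal sketch of how such an MDP could be written down and solved, offered as an illustration rather than the planned system. Only the reward scheme (-10 on entering a non-understanding state, 0 on entering the understanding state) comes from the slide. The two actions, the chain of states, the transition probabilities, and the discount factor are all invented assumptions; value iteration is used here as one standard way to obtain a policy.

```python
# Hypothetical MDP: every number in P below is an illustrative assumption.
STATES = ["NU1", "NU2", "NU3", "U"]          # "U" is the final understanding state
ACTIONS = ["Repeat", "AskRephrase"]          # two example recovery strategies

# Reward received on entering a state (from the slide: -10 for NU states, 0 for U).
REWARD = {"NU1": -10.0, "NU2": -10.0, "NU3": -10.0, "U": 0.0}

# P[state][action] = {next_state: probability} (assumed values).
P = {
    "NU1": {"Repeat": {"U": 0.5, "NU2": 0.5}, "AskRephrase": {"U": 0.4, "NU2": 0.6}},
    "NU2": {"Repeat": {"U": 0.3, "NU3": 0.7}, "AskRephrase": {"U": 0.6, "NU3": 0.4}},
    "NU3": {"Repeat": {"U": 0.2, "NU3": 0.8}, "AskRephrase": {"U": 0.5, "NU3": 0.5}},
}


def q_value(state, action, values, gamma):
    """Expected discounted return of taking `action` in `state`."""
    return sum(
        prob * (REWARD[nxt] + gamma * values[nxt])
        for nxt, prob in P[state][action].items()
    )


def value_iteration(gamma=0.95, tol=1e-9):
    """Return (state values, greedy policy) for the toy MDP above."""
    values = {s: 0.0 for s in STATES}        # U stays at 0: it is absorbing
    while True:
        delta = 0.0
        for s in P:                          # only non-understanding states have actions
            best = max(q_value(s, a, values, gamma) for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, values, gamma)) for s in P}
    return values, policy


values, policy = value_iteration()
for state in P:
    print(state, policy[state], round(values[state], 2))
```

With these made-up probabilities the greedy policy differs by state, which is the kind of adaptivity a learned policy is meant to capture; with real data the transition model would have to be estimated from logged dialogs.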
11pros and cons of learning
- Cons
- Would a heuristic be good enough?
- Is there going to be enough data?
- Pros
- Adaptive (different levels)
- Harder to devise heuristics with a large number of strategies → more justification
- Less development effort (?)
12better policy or strategies?
[Diagram with three labeled elements: True causes, POLICY, Strategies; two elements marked "?"]
- Hypothesis
- This set of strategies is sufficient, and a good
policy would make a whole lot of difference
13a checkpoint experiment
- Run an experiment
- Let a human make the non-understanding recovery decisions
- Goal: can we do significantly better than a random policy? (given a fixed set of strategies)
- Create a second, higher (upper-bound) baseline, and hence a frame for the learning approach
- Validating the set of strategies / green light for concentrating on the policy (?)
14experimental design
- Goal
- How well does random do? Preliminary results
- Variables
- System / Setup
- Participants
- Tasks
- Potential outcomes, alternatives, discussion
15random baseline (preliminary)
- 103 sessions (1040 utterances), RoomLine
- 274 non-understandings (26.3%)
- 172 non-understanding segments
- 1-6 turns (distribution on next slide)
- avg. segment length: 1.6 turns
- To do: more stats
- Identify trouble spots
- Correlation of success to various indicators
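The summary figures above can be re-derived from the raw counts. The short check below assumes that every non-understanding belongs to exactly one segment, so that the average segment length is simply non-understandings divided by segments; the per-session figure is an extra derived number that does not appear on the slide.

```python
# Counts taken from the slide.
sessions = 103
utterances = 1040
non_understandings = 274
segments = 172

# Share of utterances that ended in a non-understanding.
nu_rate = non_understandings / utterances

# Average number of consecutive non-understandings per segment
# (assumes each non-understanding falls in exactly one segment).
avg_segment_len = non_understandings / segments

# Derived figure, not on the slide: non-understandings per session.
nu_per_session = non_understandings / sessions

print(f"non-understanding rate: {nu_rate:.1%}")
print(f"avg. segment length:    {avg_segment_len:.1f} turns")
print(f"per session:            {nu_per_session:.2f}")
```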
16random baseline (preliminary)
17random baseline (preliminary)
18random baseline (preliminary)
19random baseline (preliminary)
20confidence intervals
21experimental design
- Goal
- How well does random do? Preliminary results
- Variables
- System / Setup
- Participants
- Tasks
- Potential outcomes, alternatives, discussion
22variables
- Independent variable: recovery policy
- 2 levels: random and human
- 3 levels? expert-designed policy?
- Dependent variable: recovery performance
- Evaluating efficiencies of each strategy
- Data requirements are problematic in WoZ condition
- Evaluating global, dialog-level metrics
- Task completion rates
- Various statistics of error segments
- To do: assess data requirements
23variables (2)
- Potential confounding variable: response time
- Wizard response will be slower (how much so?)
- Compensate?
- Using distribution of wait times from pilot experiments
- Conditions would be consistent, but both different from reality (lowered performance)
- Don't compensate? (it will presumably lower the performance)
- Hmm... Other ideas?
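One way the "compensate" option could be realized is to delay the system's response in the random condition by a value drawn from the wizard wait times measured in pilot runs. The sketch below is only an illustration: the function name `make_delay_sampler` and the listed wait times are hypothetical, and no pilot data is given in these slides.

```python
import random


def make_delay_sampler(pilot_wait_times, seed=None):
    """Return a function that draws one delay (in seconds) from the
    empirical distribution of pilot wait times."""
    if not pilot_wait_times:
        raise ValueError("need at least one pilot wait time")
    rng = random.Random(seed)

    def sample_delay():
        # Resampling an observed value reproduces the empirical distribution.
        return rng.choice(pilot_wait_times)

    return sample_delay


# Hypothetical pilot measurements (seconds); not real data.
pilot_wait_times = [1.2, 1.8, 2.5, 3.1, 4.0]
sample_delay = make_delay_sampler(pilot_wait_times, seed=0)
delays = [sample_delay() for _ in range(5)]
print(delays)
```

In a running system the sampled value would be passed to something like `time.sleep` before the recovery prompt is played, so that both conditions show comparable response latency.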
24experimental design
- Goal
- How well does random do? Preliminary results
- Variables
- System / Setup
- Participants
- Tasks
- Potential outcomes, alternatives, discussion
25system setup
- Random condition
- RoomLine (current system)
- Wizard condition
- RoomLine guides all interaction, except for the non-understanding recovery decisions → wizard
- Physical setup: all in speech lab, wizard at rack
- noise conditions okay?
- Alternative for random condition: call from home
- can be done for both between and within-subjects
- are there other confounding variables? (phone line?)
26system setup / strategies
- Notify non-understanding
- Repeat prompt / w. notify
- Ask repeat / w. notify
- Ask rephrase / w. notify
- Help: state-dependent / w. notify
- Help: you can say / w. notify
- Help: full help / w. notify
- To do: add alternative plans
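The random condition can be pictured as a policy that picks one of these seven strategies uniformly at random on every non-understanding. The sketch below is an assumption about how that could look, not RoomLine's actual code: the identifier names are made up, and the "w. notify" variants are ignored for simplicity.

```python
import random
from collections import Counter

# The seven strategies listed on the slide (identifiers are hypothetical).
STRATEGIES = [
    "notify",
    "repeat_prompt",
    "ask_repeat",
    "ask_rephrase",
    "help_state_dependent",
    "help_you_can_say",
    "help_full",
]


def random_policy(rng):
    """Pick a recovery strategy uniformly at random, ignoring dialog state."""
    return rng.choice(STRATEGIES)


# Simulate many non-understandings to see how trials spread over strategies.
rng = random.Random(42)
counts = Counter(random_policy(rng) for _ in range(7000))
for name in STRATEGIES:
    print(f"{name:22s} {counts[name]}")
```

Because the choice is uniform, each strategy receives roughly one seventh of the trials, which is what drives the per-strategy data requirements.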
27system setup / who is the wizard
- Me?
- Pros: already familiar with the process
- Cons: might already be biased in various ways
- does bias matter if I'm trying to do my best?
- should I avoid biasing myself?
- or should I actively try and do my homework?
- Someone else?
- Cons: will have to train, explain
- Multiple wizards?
- Would probably be the way to go, but too expensive
28system setup / what should the wizard see?
- Full Knowledge
- audio
- recognition results, conf scores, etc.
- parsing results
- non-understanding type
- System Knowledge
- no audio, only what the system knows
- that seems like a hard task for a human
29experimental design
- Goal
- How well does random do? Preliminary results
- Variables
- System / Setup
- Participants
- Tasks
- Potential outcomes, alternatives, discussion
30participants / data
- 100 trials / strategy (0.15 conf interval) → 200 sessions for each condition (this is at 7 strategies)
- Within subjects (?)
- 40 users, 5 sessions in each condition (randomized)
- Between subjects (?)
- 2x20 users, 10 sessions
- 20 random condition: can they call from home?
- System could still have simulated response delay (?)
- Balance for gender, computer-savviness (?)
- Anything else?
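As a rough illustration of how trial counts relate to interval width, the sketch below computes the half-width of a normal-approximation (Wald) interval for a success rate estimated from 100 trials. The choice of a 95% interval and of the Wald formula is an assumption; the slides do not say how the 0.15 figure was obtained, so this is not a derivation of it. The last lines check the session totals implied by the two designs.

```python
import math


def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion
    p estimated from n trials (z=1.96 corresponds to roughly 95%)."""
    return z * math.sqrt(p * (1.0 - p) / n)


n_trials = 100  # trials per strategy, from the slide
for p in (0.5, 0.7, 0.9):
    hw = ci_half_width(p, n_trials)
    print(f"success rate {p:.1f}: +/- {hw:.3f} (full width {2 * hw:.3f})")

# Sessions per condition under the two designs on the slide.
within_sessions = 40 * 5     # 40 users, 5 sessions in each condition
between_sessions = 20 * 10   # 20 users per condition, 10 sessions each
print(within_sessions, between_sessions)
```

The interval is widest at a success rate of 0.5 and narrows as the rate moves toward 0 or 1, so the data needed per strategy depends on how well that strategy performs.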
31experimental design
- Goal
- How well does random do? Preliminary results
- Variables
- System / Setup
- Participants
- Tasks
- Potential outcomes, alternatives, discussion
32tasks
- 5/10 scenarios (out of a pool of multiple?)
- How does one design those?
- Any papers? Any rules?
- Use graphical representation? to avoid lexical entrainment
- 2 free interactions, 1 at beginning, 1 at end
- Briefing
- Debriefing: SASSI
33experimental design
- Goal
- How well does random do? Preliminary results
- Variables
- System / Setup
- Participants
- Tasks
- Potential outcomes, alternatives, discussion
34outcomes /when wizard knows all
- There is a statistically significant improvement
- We have a frame for learning
- There's space for improvement given this set of strategies
- But we can't really claim an upper baseline!
- Can use data for further analysis
- correlation of indicators to strategy invocation success
- There is no statistically significant difference
- Not clear what that means
- Is the set of strategies too inefficient?
- Are strategies insensitive to conditions?
- Is task too complex for a human? (least likely)
35outcomes /when wizard knows system
- There is a statistically significant improvement
- That result is even stronger than before
- There is no statistically significant difference
- Probably task is inappropriate for a human, but
other explanations could be valid, too
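Whether the wizard condition beats the random condition could be checked, for example, with a two-proportion z-test on recovery success rates. The slides do not name a test, so the choice below is an assumption, and the counts in the usage example are hypothetical placeholders rather than results.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """One-sided test of H1: rate_b > rate_a using the pooled
    normal approximation. Returns (z, p_value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_b - p_a) / se
    # Upper-tail probability of the standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: successful recoveries out of 100 attempts
# in the random (a) and wizard (b) conditions.
z, p = two_proportion_z_test(50, 100, 65, 100)
print(f"z = {z:.2f}, one-sided p = {p:.3f}")
```

A test like this treats recovery attempts as independent, which is questionable when several attempts come from the same user or session; a per-user analysis may be more appropriate.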
36most likely plan (as of before this talk)
- wizard has full audio
- I am the wizard
- train myself
- add the alternative plan strategy
- between-subjects experiments
37most likely plan (as of now)
38alternative directions
[Diagram with labeled elements: True causes, Observables / Indicators, POLICY, Strategies]
- Concentrate more on strategies
- A comparative experiment to assess the benefits
of having more strategies
39alternative directions
[Diagram with labeled elements: True causes, Observables / Indicators, POLICY, Strategies; one element marked "?"]
- Different approach
- Infer true causes and use a simple policy
40conclusion next time