Title: USNA experiments Feb/Mar
1USNA experiments (Feb/Mar)
- Approx. 210 USNA midshipmen
- 12 sections of about 18 midshipmen each
- 6 GB of data and 15 GB of video arrived back at
Stanford on March 7!
- Simulator databases
- tutor logfiles
- Speech wavefiles to simulator and tutor
- Webcam footage (video only) of some subjects
- User questionnaires: background, satisfaction,
comments
- Pre- and post-tests
2USNA Location
- 1 hour 50 minute lab
- Time pressure
- Computer classroom: 32 dual P4 2.4 GHz machines
w/ 512 MB, 17" flat screens, nVidia GeForce FX 5200
- Subjects seated very close together
3Noise-Canceling Headphones
- Protravelgear.com
- Every 3 dB of ANR approximately doubles the noise
reduction
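The 3 dB rule of thumb follows from the decibel definition: a level difference of d dB corresponds to a power ratio of 10^(d/10). A minimal sketch, with an illustrative function name that is not from the slides:

```python
def power_ratio(db: float) -> float:
    """Return the power ratio that corresponds to a level difference in dB."""
    return 10 ** (db / 10)


# Each additional 3 dB multiplies the power ratio by roughly 2.
ratio_3db = power_ratio(3.0)
ratio_6db = power_ratio(6.0)
print(f"3 dB -> x{ratio_3db:.2f}, 6 dB -> x{ratio_6db:.2f}")
```

Note that this is the ratio in sound power; in amplitude (sound pressure) terms the doubling happens about every 6 dB.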
4ANR for speaking and listening
- Speech: Andrea ANC-700 microphones
- Listening: Protravelgear PlaneQuiet headsets
- Subjects wore both at once
- Impressionistically, Stanford students in the dry
run didn't mind this, but USNA students seemed to
be bothered by the headsets
5Graphic interaction
- What tutor and student can interact with
- Compartments
- Bulkheads
- Regions
- Labels
- Compartment groups
- Methods
- Single-click
- Click and drag
- Circling
6Active vs. Passive Tutoring
7Tutoring vs. No Tutoring
- No tutoring at all (USNA, winter 2005)
- Students played computer solitaire between
simulator sessions
- By knowledge area (Stanford, spring 2004)
- Boundaries, jurisdiction, sequencing
- Students were tutored on a single knowledge area
in each of 3 tutoring sessions
8Distribution of USNA subjects per condition
9Tutoring Beats No Tutoring
- Consider all the active and passive tutoring
conditions against the one no-tutoring condition
- Being tutored is positively correlated with
improvement in the test score.
- R = .241, with a significance (2-tailed) of .005,
with N = 132. The correlation is significant at
the 0.01 level (2-tailed).
- Being tutored is also positively correlated
with the proportion of the student's actions that
were correct.
- R = .245, with a significance of .018, N = 92. The
correlation is significant at the 0.05 level
(2-tailed).
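A correlation between a yes/no condition (tutored or not) and a numeric score is an ordinary Pearson R with the condition coded as 0/1. The sketch below shows that computation and the t statistic commonly used to test R against zero. The function names and the toy values are illustrative assumptions, not the experiment's data or the analysis tool actually used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Toy values only, NOT data from the experiment.
tutored = [1, 1, 1, 1, 0, 0, 0, 0]        # 1 = tutored, 0 = no tutoring
improvement = [3, 2, 4, 1, 1, 0, 2, 1]    # post-test minus pre-test score
r = pearson_r(tutored, improvement)
print(f"toy R = {r:.3f}, t = {t_statistic(r, len(tutored)):.2f}")
```

Applied to the reported R = .241 with N = 132, `t_statistic` gives roughly 2.8 on 130 degrees of freedom, which is consistent with the reported two-tailed significance.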
10More Test Score Improvement for Tutored Subjects
11Tutored Subjects improve more in Ordering Correct
Actions
12Passive Tutoring Better for Sequencing Test Score
Impr.
- Take out the no-tutoring condition, and compare
active vs. passive tutoring
- Active tutoring is negatively correlated with
improvement in the test score.
- R = -.239, with a significance (2-tailed) of
.009, with N = 119. The correlation is significant
at the 0.01 level (2-tailed).
- Break down the test score improvement by
knowledge area
- Active tutoring is negatively correlated with
the improvement in sequencing test score.
- R = -.198, with a significance of .031, with
N = 119. The correlation is significant at the
0.05 level (2-tailed).
- With respect to the boundaries/jurisdiction test
score improvement, active tutoring is not
significantly correlated.
- If we look at all the performance areas, there is
no significant correlation between active vs.
passive tutoring and the performance metrics.
13Passive vs. Active Test Scores
14Passive Tutoring Improves Sequencing Test Scores
More
15Completion of Tutoring Content
- Almost no active tutoring subjects completed a
tutoring session
- Almost all passive tutoring subjects completed
the tutoring session
- Does material covered make the difference?
16Test Score by Time Taken
17Mean Time for Pre and Post Tests
18Pre-Test Time Taken Score
19No correlations with graphic input or output
- Both active and passive tutoring sessions
together, graphic input vs. no graphic input
- No correlations with test score improvement
- No correlations with improvement in any
performance metrics
- Same for graphic output, with both active and
passive tutoring sessions together
- Only active tutoring, graphic input vs. those
without it
- No correlations with test score improvement
- No correlations with improvement in any
performance metrics
- Same for graphic output
20Test Scores vs. Performance Metrics
- No correlation between improvement in test score
on sequencing and improvement in performance of
sequencing actions (either correctness of action,
or amount of pending expert actions performed).
- No correlation between improvement in test score
on jurisdiction/boundaries and improvement in
performance of jurisdiction or boundary actions
(either correctness of action, or amount of
pending expert actions performed).
- No correlation between the post-test scores and
the simulator performance statistics by area.
21Test Score Improvement by Condition
22Performance Improvement by Area
23Post-Test Score by Condition
24Satisfaction with Tutor (preliminary)
25Tutor Satisfaction
- Highest rating on tutor being accurate: mean
5.07, std. dev. 1.67
- Lowest rating on tutor understanding student:
mean 3.85, std. dev. 1.81
26Speech Synthesis
- Festival (University of Edinburgh)
- Concatenative synthesis
- Can use many different standard voices
- Allows customized limited domain voices (FestVox)
- Allows markup for emphasis, phrase types
- FestVox limited domain voices allow templates and
slots
- Let's first discuss the fire in the access trunk.
- Let's discuss the flood in the cleaning gear
locker.
- Let's discuss the smoke in the fan room.
- We recorded 1,764 utterances, many duplicates
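The template-and-slot idea can be illustrated by enumerating every prompt a set of templates would produce, for example to build a recording list and drop duplicates. This is a sketch only: it does not use FestVox tooling, and the template strings, slot names and function name are assumptions modeled on the three example sentences.

```python
from itertools import product

# Hypothetical templates and slot fillers based on the example sentences.
TEMPLATES = [
    "Let's first discuss the {crisis} in the {compartment}.",
    "Let's discuss the {crisis} in the {compartment}.",
]
CRISES = ["fire", "flood", "smoke"]
COMPARTMENTS = ["access trunk", "cleaning gear locker", "fan room"]


def expand_prompts(templates, crises, compartments):
    """Fill every template with every slot combination, skipping duplicates."""
    prompts = []
    seen = set()
    for template, crisis, compartment in product(templates, crises, compartments):
        text = template.format(crisis=crisis, compartment=compartment)
        if text not in seen:
            seen.add(text)
            prompts.append(text)
    return prompts


prompts = expand_prompts(TEMPLATES, CRISES, COMPARTMENTS)
print(len(prompts), "distinct prompts")
```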
27FestVox limited domain voice
- Issues of scale
- Speed of synthesis gets noticeably slow as the
recorded voice corpus approaches 2000 utterances
- May need to break into related voices
- We cached many utterances in advance, so not a
big problem
- Could study benefits of limited domain voice vs.
general diphone voice synthesis
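Caching utterances in advance can be sketched as a lookup table that is filled before the session and consulted at run time. The class below is a hypothetical illustration: `synthesize` stands in for whatever call produces audio, and nothing here is the Festival or FestVox API.

```python
from typing import Callable, Dict, Iterable


class UtteranceCache:
    """Store synthesized audio by text so each utterance is synthesized once."""

    def __init__(self, synthesize: Callable[[str], bytes]):
        self._synthesize = synthesize      # hypothetical synthesis function
        self._audio: Dict[str, bytes] = {}

    def warm(self, texts: Iterable[str]) -> None:
        """Synthesize known utterances ahead of time."""
        for text in texts:
            self.get(text)

    def get(self, text: str) -> bytes:
        """Return cached audio, synthesizing only on a cache miss."""
        if text not in self._audio:
            self._audio[text] = self._synthesize(text)
        return self._audio[text]


# Stand-in synthesizer that records how often it is called.
calls = []


def fake_synthesize(text: str) -> bytes:
    calls.append(text)
    return text.encode("utf-8")


cache = UtteranceCache(fake_synthesize)
cache.warm(["Let's discuss the smoke in the fan room."])
audio = cache.get("Let's discuss the smoke in the fan room.")  # served from cache
```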
28Open-Ended Questions
- Definition Questions
- First of all can you tell me what primary
boundaries are?
- And now can you tell me what secondary boundaries
are?
- What did we define primary boundaries as earlier?
- What did we define secondary boundaries as
earlier?
- Why Questions
- Why is it necessary to investigate after the
alarm sounds?
- Why is it necessary to isolate when you have a
report of fire?
- Why is it necessary to set fire boundaries when
you have a report of fire?
- Why is it necessary to set flood boundaries when
you have a report of flood?
- Why is it necessary to set smoke boundaries when
you have a report of smoke?
29Some Sample Answers to Open-Ended Questions
- System: Why is it necessary to investigate after
the alarm sounds?
- Student: to see if it's a false fire
- System: Why is it necessary to set smoke
boundaries when you have a report of smoke?
- Student: prevent smoke from spreading to other
compartments
- System: First of all can you tell me what primary
boundaries are?
- Student: first two bulkheads around the crisis
30Answers to Open-Ended questions
- Did students produce longer, more complex
answers?
- Did their speech recognition experience earlier
in the session influence their answer length and
complexity?
31Statistical Language Model
- Benefits
- Smaller process size
- Quicker development cycle
- More robust coverage
- Issues
- A class-based LM generalizes beyond specific
corpus
- The tagging grammar producing classes can obscure
distinctions
- repair two, three, five vs. compartment numbers
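The trade-off in a class-based language model can be seen in a toy tagging step: replacing words with class tags lets the model generalize to unseen numbers, but it also erases whether a number named a repair party or a compartment. The class inventory and function below are illustrative assumptions, not the project's tagging grammar.

```python
# Hypothetical word class; a real tagging grammar would define many more.
NUMBER_WORDS = {"one", "two", "three", "four", "five"}


def tag_classes(utterance: str) -> str:
    """Replace every number word with the class tag NUMBER."""
    tokens = utterance.lower().split()
    return " ".join("NUMBER" if tok in NUMBER_WORDS else tok for tok in tokens)


repair = tag_classes("send repair two")
compartment = tag_classes("fire in compartment two")
print(repair)
print(compartment)
```

After tagging, both utterances contain the same NUMBER token, so a model trained on the tagged text can no longer learn that only some numbers are plausible after "repair".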
32SLM data
- 8585 utterances to SCoT tutor
- 10880 utterances to DC-Train simulator
- Data sources
- San Diego Fleet Training Center
- Spring 2004 experiment
- Summer 2004 experiment
- Fictional data for new tutoring areas
33Gemini NL Grammar Particulars (slightly outdated)
- 170 grammar rules
- 755 one-word lexical entries
- 1769 multi-word lexical entries
- Including
- 48 action verbs (some synonymous)
- 33 lexical items for ship personnel
- 391 compartment names
- 1053 frame numbers (for compartments, bulkheads,
valves, etc.) to reduce speech recognition
errors
- 13 synonyms for yes
34NL Interpretation
- First try Gemini, aiming for a logical form
interpretation
- If it doesn't parse, use Nuance slots as a backoff
robustness strategy
- Being idiosyncratic isn't so bad for the backoff
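The two-stage strategy amounts to a simple control flow: attempt the deep parse, and fall back to slot filling only when it fails. In the sketch below `deep_parse` and `slot_fill` are hypothetical callables standing in for the Gemini parser and the Nuance slot rules; their signatures and the returned dictionary shape are assumptions.

```python
from typing import Any, Callable, Dict, Optional


def interpret(
    utterance: str,
    deep_parse: Callable[[str], Optional[Any]],
    slot_fill: Callable[[str], Dict[str, str]],
) -> Dict[str, Any]:
    """Prefer a logical form; back off to slots when parsing fails."""
    logical_form = deep_parse(utterance)
    if logical_form is not None:
        return {"source": "logical_form", "interpretation": logical_form}
    return {"source": "slots", "interpretation": slot_fill(utterance)}


# Toy stand-ins so the sketch runs end to end.
def toy_parse(utterance: str) -> Optional[Any]:
    known = {"set fire boundaries": ("set", "fire-boundaries")}
    return known.get(utterance)


def toy_slots(utterance: str) -> Dict[str, str]:
    return {"crisis": "fire"} if "fire" in utterance else {}


parsed = interpret("set fire boundaries", toy_parse, toy_slots)
backed_off = interpret("um the fire thing", toy_parse, toy_slots)
```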
35Nuance NL interpretation rules
Forward_and_Aft ((forward and aft) (front and back) (before and after) forward aft (of the crisis fire smoke flood compartment casualty))
<position-adjective forward-and-aft>
Either_Side (on either side of the crisis fire smoke flood compartment casualty)
<position-adjective either-side>
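What these rules do, mapping several surface phrasings onto one value of a position-adjective slot, can be approximated with regular expressions. This is a Python analogy, not Nuance grammar syntax, and treating the trailing "of the ..." phrase as optional is an assumption, since the rule operators did not survive in the text above.

```python
import re
from typing import Optional

# Alternatives for the object of "of the ...", taken from the rules above.
CRISIS = r"(?:crisis|fire|smoke|flood|compartment|casualty)"

# Each entry pairs a slot value with the phrasings that should produce it.
RULES = [
    (
        "forward-and-aft",
        re.compile(
            rf"\b(?:forward and aft|front and back|before and after)"
            rf"(?: of the {CRISIS})?\b"
        ),
    ),
    ("either-side", re.compile(rf"\bon either side(?: of the {CRISIS})?\b")),
]


def position_adjective(utterance: str) -> Optional[str]:
    """Return the slot value of the first matching rule, or None."""
    text = utterance.lower()
    for value, pattern in RULES:
        if pattern.search(text):
            return value
    return None


print(position_adjective("set boundaries forward and aft of the fire"))
```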
36Nuance slots vs. Gemini LFs
- Nuance slots
- Allow for quick development cycle, because close
to domain representation used by dialogue manager
- Easy to include very particular, idiosyncratic
patterns
- Only one instance of a slot is filled per
sentence; must define multiple slots for
boundaries
boundaries - Allow some structure in NL representation
37Damage Control Symbology
38Highlight Action of Interest
39Highlight Generalizations
40Symbology and Coaching
- Contextualize discussion
- Shorten/eliminate spoken descriptions
- Focus attention on relevant parts
- Succinct comparison with expert actions
- Hint about actions
41Indicators of Uncertainty
- Response latency
- Pauses
- Um/uh
- Hedges: "I guess", "I think"
- Do subjects produce them?
- Can we detect them?
- What can we do with them?
42Pauses within Utterances
- Transcribers marked noticeable pauses as pause
- 35 instances in about 100 tutoring sessions
- Used Sphinx2 to perform forced alignments of
transcription with wavefile
- Appears to line up reasonably well with pauses
marked by transcribers
- We have not yet calculated mean pause length or
analyzed variation between speakers
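Once a forced alignment provides start and end times per word, within-utterance pauses are the gaps between consecutive words. The sketch below assumes a simple (word, start_ms, end_ms) tuple format and a 250 ms threshold; both are illustrative choices, not the Sphinx2 output format or a threshold used in the study.

```python
from typing import List, Tuple

Alignment = List[Tuple[str, int, int]]  # (word, start_ms, end_ms), assumed format


def find_pauses(alignment: Alignment, min_gap_ms: int = 250):
    """Return (previous_word, next_word, gap_ms) for every gap >= min_gap_ms."""
    pauses = []
    for (prev_word, _, prev_end), (next_word, next_start, _) in zip(
        alignment, alignment[1:]
    ):
        gap = next_start - prev_end
        if gap >= min_gap_ms:
            pauses.append((prev_word, next_word, gap))
    return pauses


# Made-up timings for illustration only.
example = [("set", 0, 300), ("fire", 320, 700), ("boundaries", 1400, 2100)]
pauses = find_pauses(example)
mean_pause_ms = sum(gap for _, _, gap in pauses) / len(pauses) if pauses else 0.0
```

The same list of gaps could feed the mean pause length and per-speaker comparisons that the slide lists as not yet calculated.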
43Sphinx Forced Alignment
44Uncertainty indicators
- Response Latency
- We record milliseconds between system utterance
end and start of user speech
- Um/uh
- They do occur in speech to system
- Detecting them is possible, but depends on the
accuracy of speech recognition.
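Both indicators are straightforward to compute once timestamps and a recognized transcript are available. The sketch below is a hedged illustration: the function names, the word lists and the example timings are assumptions, and counts taken from recognizer output are only as reliable as the recognition itself.

```python
FILLED_PAUSES = {"um", "uh"}
HEDGES = ("i guess", "i think")


def response_latency_ms(system_end_ms: int, user_start_ms: int) -> int:
    """Milliseconds between the end of system speech and the start of user speech."""
    return user_start_ms - system_end_ms


def uncertainty_cues(transcript: str) -> dict:
    """Count filled pauses and hedge phrases in a lowercased transcript."""
    text = transcript.lower()
    tokens = text.split()
    return {
        "filled_pauses": sum(tok in FILLED_PAUSES for tok in tokens),
        "hedges": sum(text.count(phrase) for phrase in HEDGES),
    }


latency = response_latency_ms(system_end_ms=12000, user_start_ms=12850)
cues = uncertainty_cues("um I think it is the uh fan room")
```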