Title: CS 260: Lecture 10
Slide 1: CS 260 Lecture 10
Slide 2: Speech: the Ultimate Interface?
- In the early days of HCI, people assumed that
speech/natural language would be the ultimate UI.
- Use of speech interfaces has grown, but it's
still rarely used in the office. Why?
Slide 3: Speech: the Ultimate Interface?
- Why speech hasn't succeeded in the office:
- Affordances of text:
- Visual scanning (for email or docs)
- Unambiguity of text
- Editing of text
- Disadvantages of speech:
- Noise: call center ambience
- Lack of privacy
Slide 4: Speech: the Ultimate Interface?
- Use of speech interfaces has grown, but it's
still rarely used in the office.
Slide 5: Computing is Moving
- Where are computers these days?
- Intel's breakdown (based on PC sales):
- Office
- Home
- Mobile (laptops)
- Medical
- And as we noted earlier, programmable smartphones
will soon outnumber total PCs.
- Then there are game boxes, cable boxes, Smart TVs
etc.
Slide 6: What is a good interface for
- Mobile computing (walking or driving)?
- Home computing?
- Medical computing?
Slide 7: Where is the industry now?
- After a big slump around 2002, the speech
technology/voice interface industry seems to be
growing briskly, about 30-40% per year. One
current estimate put it at about $2.5 billion.
- It would probably be more visible, except several
related industries have overtaken it: outsourced
call centers, and VoIP (Voice over IP).
- The biggest growth has been in the new markets:
- Cell phones (as a local UI)
- Medical (e.g. order entry)
- Voice services over the phone
Slide 8: Industry movement
- In January this year, Yahoo acquired a large team
of speech engineers from Nuance, the largest
speech company (which owns Dragon
NaturallySpeaking).
- Google already had some leading speech
researchers.
- So there is much interest in speech for the
portal market.
- Aside: there is a division of Nuance devoted to
medical speech recognition, and one to call
centers.
Slide 9: Industry movement
- HeyAnita: voice-based email and messaging
- BeVocal: hosted IVR (Interactive Voice Response)
for customers, e.g. MetroPCS
- Tellme: find a business service (including
restaurants) using ASR.
Slide 10: Speech: Some background
- A speech recognizer consists of 3 stages.
- A state-of-the-art recognizer requires 50-100
MFLOPS for continuous speech (no pauses between
words).
- PC continuous speech recognizers appeared in the
1990s and saved many victims of RSI.
[Figure: recognizer pipeline. Raw sound -> Acoustic Front End -> Acoustic features -> Acoustic Model -> Phonetic features -> Language/phonetic model -> Words]
Slide 11: Speech: Some background
- The first two stages are standard. The last is
not, and has a big impact on performance.
- The last box encodes knowledge of what users
might say, either as a grammar, or as a
statistical language model (LM). Grammars are
suitable for small recognition tasks with
well-known command languages.
[Figure: recognizer pipeline. Raw sound -> Acoustic Front End -> Acoustic features -> Acoustic Model -> Phonetic features -> Language/phonetic model -> Words]
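To make the grammar versus language-model distinction concrete, here is a minimal Python sketch. The command set, the training phrases, and all function names are illustrative assumptions, not part of any real recognizer. A grammar either accepts an utterance or rejects it; a statistical LM assigns every word sequence a probability, so likely phrasings score higher than unlikely ones.

```python
from collections import Counter, defaultdict

# Hypothetical fixed grammar: the only utterances the system accepts.
GRAMMAR = {"call home", "call office", "check voicemail"}

def grammar_accepts(utterance: str) -> bool:
    """A grammar is all-or-nothing: the utterance is in the language or not."""
    return utterance.lower().strip() in GRAMMAR

def train_bigram_lm(sentences):
    """Count adjacent word pairs so P(word | previous word) can be estimated."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            pair_counts[prev][word] += 1
    return pair_counts

def bigram_score(pair_counts, utterance, vocab_size):
    """Probability of an utterance under the bigram model (add-one smoothing)."""
    words = ["<s>"] + utterance.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        total = sum(pair_counts[prev].values())
        prob *= (pair_counts[prev][word] + 1) / (total + vocab_size)
    return prob

# Toy training data (assumed for illustration only).
training = ["call home", "call the office", "please call home now"]
lm = train_bigram_lm(training)
vocab_size = len({w for s in training for w in s.split()} | {"</s>"})

print(grammar_accepts("please call home"))  # False: not in the grammar
print(bigram_score(lm, "please call home", vocab_size)
      > bigram_score(lm, "home call please", vocab_size))  # True
```

The smoothing means no utterance gets probability zero, which is why an LM suits broad, open-ended input while a grammar suits a small, well-known command language.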
Slide 12: Speech UIs
- Most implement a finite-state machine.
- At each state, the system can recognize various
speech segments to take you to the next state(s).
- A segment may be anything from a single word to
a complete utterance.
- The system can also make utterances of its own
at various states.
- You can specify them using regular expressions,
or using VoiceXML.
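A minimal Python sketch of such a finite-state speech UI, with transitions specified as regular expressions over the recognized text. The state names and the table layout are assumptions made for illustration; the prompts are borrowed from the delivery-tracking dialog later in this lecture.

```python
import re

# Each state has a prompt (what the system says) and a list of
# (pattern, next_state) transitions. All names here are hypothetical.
DIALOG = {
    "ask_number": {
        "prompt": "Tell me your four-digit delivery number.",
        "transitions": [(re.compile(r"^(\d[\s-]*){4}$"), "ask_status")],
    },
    "ask_status": {
        "prompt": "What's your status? You can say arrived, departed or delayed.",
        "transitions": [(re.compile(r"\b(arrived|departed|delayed)\b"), "done")],
    },
    "done": {"prompt": "OK. It's in the system.", "transitions": []},
}

def step(state: str, utterance: str) -> str:
    """Return the next state, or stay in the current state if nothing matches."""
    text = utterance.lower().strip()
    for pattern, next_state in DIALOG[state]["transitions"]:
        if pattern.search(text):
            return next_state
    return state

state = "ask_number"
for heard in ["4-8-3-3", "I'll be delayed two days"]:
    print(DIALOG[state]["prompt"])
    state = step(state, heard)
print(DIALOG[state]["prompt"])
```

Staying in the same state on a non-match is one simple policy; a real system would also count failures and re-prompt.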
Slide 13: Speech on phones
- Speech recognition is faster and more accurate if
you limit the vocabulary to a few dozen words.
- Small-vocabulary speech recognition has been
common on phones for the last few years:
- Call a number
- Call a name (from your contacts)
- What about large vocabulary, continuous speech?
Slide 14: This year's Smartphone
- This year's Smartphone (free with service
contract):
- 150-200 MHz ARM processor
- 32 MB RAM
- 2 GB flash (not included)
- Roughly a Windows 98 PC that boots quickly!
- Plus:
- Camera
- AGPS (Qualcomm/Snaptrack)
- DSP cores, OpenGL GPU
- EV-DO (300 kb/s), Bluetooth
Slide 15: Speech on phones
- This is just the right power for high-performance
speech recognition.
- Large-vocabulary speech recognition (not
continuous) appeared on phones last year
(Samsung P207).
- LVCSR (Large-Vocabulary Continuous Speech
Recognition) should be available this year.
Slide 16: Speech in the home
- Good speech recognition used to require careful
microphone placement and a worn headset.
Slide 17: Speech in the home
- New microphones: array mics with built-in DSPs
allow recognition at greater range (several
feet).
- Users don't have to wear microphones any more
to use speech.
Slide 18: Speech in the home
- Apart from CPU and memory (which are shrinking),
speech recognition requires only a microphone and
perhaps a speaker. It is power and size
efficient.
- In a few years, it will probably be possible to
build speech recognition into Bluetooth
microphones, or other small devices.
- Compare with other interfaces.
Slide 19: Ten Guidelines for Speech Interfaces
- You can't design what you can't define
- Use user-centered design techniques
- Use the right technology, and use technology
right
- Leverage the language instinct
- Establish success criteria and test against them
- Branding in VUI is more than just a pretty voice
- How you say it is as important as what you say
- Don't block the exit
- Take care with error handling
- Establish a change process
Slide 20: 1. You can't design what you can't define
- Consider the task(s) that your users want to do,
i.e. start with standard task analysis.
- What conceptual model do they have (use
contextual inquiry)?
- What language do they use to refer to it?
- Use recordings during contextual inquiry/task
analysis.
Slide 21: 2. Use user-centered design techniques
- Great to see this advice in a trade publication.
You know a lot about this.
- Study the real context of use: especially
important for mobile devices, medical, home etc.
- Perform a needs analysis: what kinds of service
might the system provide and how valuable are
they?
- Develop personae to guide your design.
- Once again, study users' conceptual models.
Slide 22: 3. Use the right technology, and use
technology right
- In a speech interface, you have a choice between
synthesized and recorded speech for output.
- In designing the recognizer, language models will
generally give better results for routing a broad
range of user questions.
- Using technology right: speech recognizers are
fussy animals. They use many parameters to
trade off performance and accuracy. You have to
experiment with these in order to understand
them.
Slide 23: 4. Leverage the Language Instinct
- Make a voice UI resemble natural speech
- Use familiar phrasing
- Don't mimic written language
- Use conversational style (pronouns,
acknowledgements, transition words)
- Use realistic prosody (pitch etc.) in TTS
- Enable callers to speak over and interrupt the
TTS system
Slide 24: 5. Establish Success Criteria and Test
Against Them
- Standard tests: recognition accuracy, speed, CPU
- Dialog traversal tests: capture many
conversations and plot the paths through your
dialog hierarchy that users took.
- Usability testing
- Early rapid prototyping: WOZ (Wizard of Oz)
testing
- Define "call success" in a sensible way, and
track it!
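A minimal Python sketch of a dialog traversal summary, assuming each logged call is stored as the list of dialog states it visited. The log format, the state names, and the definition of success (the call ends in a designated state) are all assumptions for illustration; a real deployment would define call success in its own terms.

```python
from collections import Counter

def summarize_calls(calls, success_state="done"):
    """Count how often each path was taken and compute the call success rate.

    calls: list of calls, each a list of state names in the order visited.
    A call counts as successful if its final state is success_state.
    """
    path_counts = Counter(" > ".join(path) for path in calls)
    successes = sum(1 for path in calls if path and path[-1] == success_state)
    success_rate = successes / len(calls) if calls else 0.0
    return path_counts, success_rate

# Hypothetical call logs.
logged_calls = [
    ["ask_number", "ask_status", "done"],
    ["ask_number", "ask_number", "operator"],
    ["ask_number", "ask_status", "done"],
    ["ask_number", "hangup"],
]
paths, rate = summarize_calls(logged_calls)
for path, count in paths.most_common():
    print(count, path)
print("call success rate:", rate)
```

Paths that repeat a state (such as the second call) point at prompts users are failing on, which is where redesign effort usually pays off.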
Slide 25: 6. Branding is more than a pretty voice
- Users make strong attributions about a human
speaker (personality, education, demographics).
They do the same with speech interfaces (whether
you intend it or not).
- Design of a voice UI is as significant as design
of an attractive web site. A robot voice UI is
like a 12-point text-only web site.
- The voice interface's brand perception is a
combination of prosody and language, just like a
real speaker's. Design both explicitly.
Slide 26: 7. How you say it is as important as what
you say
- Mostly about speech constructed from recorded
voice.
- For natural speech, you need to think about the
context of each word in real speech.
- Pronunciation actually changes when words are
connected together (this is co-articulation).
- Ideally, you would include appropriate context
information in each recording (e.g. the number
"one" followed by a "t" consonant).
Slide 27: 8. Don't block the exit
- Make sure users can exit the automated system and
reach a live person.
- If you make it hard, they will get there anyway,
and be angry when they do.
- Providing feedback can help (e.g. "the estimated
time to reach a representative is ..., do you wish
to return to the automated system?").
- Make sure you transfer user data from the
automated system to the service person's console;
it looks really bad if you don't.
Slide 28: 9. Take Care with Error Handling
- Most speech dialog systems have internal state
(in a state machine) that the user can't see
except through what the system says.
- You must treat errors (e.g. unrecognized
utterances) very carefully. If you leave the
current state, make sure users can understand the
state you've gone into.
- Large changes (e.g. backtracking up to the
initial state) are extremely frustrating for
users.
- If you backtrack, take small steps, only as much
as needed.
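A minimal Python sketch of this error-handling policy: re-prompt in place first, and only after repeated failures back up a single state, always telling the user which state the system is now in. The class, its method names, the retry limit, and the message wording are assumptions for illustration.

```python
class DialogSession:
    """Tracks the states visited so errors can be handled in small steps."""

    def __init__(self, start_state, max_retries=2):
        self.history = [start_state]   # states visited, most recent last
        self.errors = 0                # consecutive unrecognized utterances
        self.max_retries = max_retries

    @property
    def state(self):
        return self.history[-1]

    def advance(self, next_state):
        """Move forward after a successfully recognized utterance."""
        self.history.append(next_state)
        self.errors = 0

    def on_unrecognized(self):
        """Re-prompt in place; after too many failures, back up ONE state.

        Never jumps to the initial state, and always reports where we are.
        """
        self.errors += 1
        if self.errors <= self.max_retries or len(self.history) == 1:
            return f"Sorry, I didn't catch that. We're still at: {self.state}."
        self.history.pop()
        self.errors = 0
        return f"Let's go back one step, to: {self.state}."

session = DialogSession("ask_number")
session.advance("ask_status")
for _ in range(3):
    print(session.on_unrecognized())
```

With the default limit, the first two failures re-prompt and the third backs up one state, from ask_status to ask_number.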
Slide 29: 10. Establish a Change Process
- Speech UIs are very complex, and very sensitive
to some small changes (esp. in the recognizer).
- Make sure you manage changes to the system,
especially low-level changes. They should be
discouraged once the system is deployed.
- Establish regression tests: representative
speech segments that the system should always
process successfully, and check them.
- Always keep several working generations of the
system.
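A minimal Python sketch of such a regression check. The recognizer is passed in as a plain function from an audio reference to a transcript; that interface, the file names, and the stand-in recognizer below are hypothetical, since the real call depends on the speech engine in use.

```python
def run_regression(recognize, cases):
    """Run every reference segment through the recognizer and list failures.

    recognize: callable taking an audio reference and returning a transcript
               (hypothetical interface; wrap your real engine to match it).
    cases:     list of (audio_ref, expected_transcript) pairs.
    Returns a list of (audio_ref, expected, actual) for each mismatch.
    """
    failures = []
    for audio_ref, expected in cases:
        actual = recognize(audio_ref)
        if actual.strip().lower() != expected.strip().lower():
            failures.append((audio_ref, expected, actual))
    return failures

def fake_recognize(audio_ref):
    """Stand-in recognizer with canned results, for demonstration only."""
    canned = {"clip_001.wav": "arrived", "clip_002.wav": "delay"}
    return canned.get(audio_ref, "")

cases = [("clip_001.wav", "arrived"), ("clip_002.wav", "delayed")]
for audio_ref, expected, actual in run_regression(fake_recognize, cases):
    print(f"REGRESSION {audio_ref}: expected {expected!r}, got {actual!r}")
```

Running this after every recognizer or parameter change gives an early warning before a change reaches callers.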
Slide 30: The State of the Art
- A few services represent the state of the art:
- United Airlines flight information:
1-800-864-8331, then 1
- FedEx package rates: 1-800-463-3339
- Tellme: 1-800-555-1212, or 411 from Cingular
wireless or Verizon landlines.
- Wildfire: speech phone services, voice dialing,
messages, etc.: 1-800-WILDFIRE
- Also Schwab, Wachovia, E-trade, B-of-A,
Fidelity, ...
Slide 31: Conversational Speech
- CLERK: Make of car?
- DRIVER: Uh, Mercedes
- CLERK: Model year?
- DRIVER: It's a 1970.
- CLERK: Color?
- Compare with this version:
- CLERK: What's the make of your car?
- DRIVER: Uh, Mercedes
- CLERK: OK. And the model year?
- DRIVER: It's a 1970.
- CLERK: Got it. What's the color?
Slide 32: Conversational Speech
- The second version is both more polite and more
usable.
- System status (i.e. that it understood the user's
responses) is always clear.
- Design of speech "character" should include
normal human styling (politeness) but not
excessive anthropomorphism.
- In particular, the system should never suggest
capabilities it does not have.
Slide 33: Conversational Speech Example
- System: This is the delivery tracking center.
Tell me your four-digit delivery number or enter
it on the keypad.
- Caller: 4-8-3-3
- System: 4-8-3-3. Is that right?
- Caller: Yes.
- System: OK, hold on... (logs into system) What's
your status? You can say arrived, departed or
delayed.
- Caller: I'll be delayed two days. There's a big
storm.
- System: Oh, sorry to hear that! Let me confirm. I
have delivery number 4-8-3-3 delayed for 48 hours
due to weather. Is that right?
- Caller: Yes it is.
- System: Great. Hold on... OK. It's in the system.
Hopefully you'll be on your way soon. I'll talk
to you when you arrive. Drive safely.
Slide 34: Conversational Speech
- Very good usability is possible through clever
design.
- It does not all depend on raw recognizer
accuracy.
- Careful design includes appropriate personality,
giving enough flexibility to the user, and
responding to errors carefully.
Slide 35: What's happening now
- Over the last half-dozen years, speech interfaces
have gotten a lot better.
- Most of the improvement seems to be due to
improvements in method, i.e. iterative design,
and heuristic guidelines like the ones just
presented.
- The field is a lot more interdisciplinary than it
used to be, including speech engineers, UI
designers and linguists.
Slide 36: The Future: Context-Awareness
- Speech interfaces are rather limited today
because they either rely on tightly constrained
utterances, or on coarse language models.
- In many cases, especially for mobile phones,
the context of use (time, location, meta-data on
the phone) strongly constrains what users might
do.
- Current research is using context data to improve
recognition all the way down. Instead of general
language models in the recognizer, you can push
down context information into it. The recognizer
can still recognize anything, but it will do
better with more likely utterances.
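One simple way to picture this is rescoring the recognizer's candidate transcripts with a context-dependent prior. The Python sketch below is an illustration under assumed inputs (log-domain acoustic scores, a dictionary of context probabilities, a small floor for unseen phrases); it is not a description of how any particular research system works, which may push context much deeper into the recognizer.

```python
import math

def rescore(hypotheses, context_prior, weight=1.0, floor=1e-6):
    """Combine acoustic scores with a context prior and sort best-first.

    hypotheses:    list of (text, acoustic_log_score) pairs.
    context_prior: dict mapping text to its probability in this context
                   (e.g. derived from contacts, time, or location).
    floor:         prior for phrases the context says nothing about, so
                   anything can still be recognized, just less readily.
    """
    scored = []
    for text, acoustic_log_score in hypotheses:
        prior = context_prior.get(text, floor)
        scored.append((acoustic_log_score + weight * math.log(prior), text))
    scored.sort(reverse=True)
    return [text for _, text in scored]

# Hypothetical example: "Mom" is a frequent contact, "Tom" is not.
candidates = [("call mom", -4.0), ("call tom", -3.8)]
print(rescore(candidates, {"call mom": 0.5, "call tom": 0.01})[0])  # call mom
print(rescore(candidates, {})[0])                                   # call tom
```

Without context the acoustically stronger candidate wins; with context the more plausible one does, which is the behavior the last bullet describes.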
Slide 37: Summary
- Speech seems like a very good option for future
computing environments.
- Small devices can support speech interfaces, and
microphone technology is getting better.
- Speech UI design requires many of the same
principles as general UI design, especially:
- Visibility of system status
- User control and freedom
- Helping users recognize and recover from errors
- Application of these principles leads to highly
usable designs.