CS 260: Lecture 10


Description:

In the early days of HCI, people assumed that speech/natural ... Noise call center ambience. Lack of privacy. 8/3/09. 4. Speech: the Ultimate Interface? ... – PowerPoint PPT presentation


Provided by: can6

Learn more at: https://people.eecs.berkeley.edu


1
CS 260: Lecture 10

Professor John Canny

2
Speech: the Ultimate Interface?

In the early days of HCI, people assumed that
speech/natural language would be the ultimate UI.
Use of speech interfaces has grown, but it's
still rarely used in the office. Why?

3
Speech: the Ultimate Interface?

Why speech hasn't succeeded in the office:
Affordances of text:
Visual scanning (for email or docs)
Unambiguity of text
Editing of text
Disadvantages of speech:
Noise: call center ambience
Lack of privacy

4
Speech: the Ultimate Interface?

Use of speech interfaces has grown, but it's
still rarely used in the office.

5
Computing is Moving

Where are computers these days?
Intel's breakdown (based on PC sales):
Office
Home
Mobile (laptops)
Medical
And as we noted earlier, programmable smartphones
will soon outnumber total PCs.
Then there are game boxes, cable boxes, Smart TVs
etc.

6
What is a good interface for:

Mobile computing (walking or driving)?
Home computing?
Medical computing?

7
Where is the industry now?

After a big slump around 2002, the speech
technology/voice interface industry seems to be
growing briskly, about 30-40% per year. One
current estimate put it at about $2.5 billion.
It would probably be more visible, except that several
related industries have overtaken it: outsourced
call centers and VoIP (Voice over IP).
The biggest growth has been in the new markets:
Cell phones (as a local UI)
Medical (e.g. order entry)
Voice services over the phone

8
Industry movement

In January this year, Yahoo acquired a large team
of speech engineers from Nuance, the largest
speech company (which owns Dragon
NaturallySpeaking).
Google already had some leading speech
researchers.
So there is much interest in speech for the
portal market.
Aside: there is a division of Nuance devoted to
medical speech recognition, and one to call
centers.

9
Industry movement

Heyanita: voice-based email and messaging
Bevocal: hosted IVR (Interactive Voice Response)
for customers, e.g. MetroPCS
Tellme: find a business service (including
restaurants) using ASR (automatic speech recognition).

10
Speech: Some background

A speech recognizer consists of 3 stages
A state-of-the-art recognizer requires 50-100
Mflops for continuous speech (no pauses between
words).
PC continuous speech recognizers appeared in the
1990s and saved many victims of RSI.

[Figure: recognizer pipeline. Raw sound -> Acoustic Front End -> Acoustic features -> Acoustic Model -> Phonetic features -> Language/phonetic model -> Words]
11
Speech: Some background

The first two stages are standard. The last is
not, and has a big impact on performance.
The last box encodes knowledge of what users
might say, either as a grammar, or as a
statistical language model (LM). Grammars are
suitable for small recognition tasks with
well-known command languages.

[Figure: recognizer pipeline. Raw sound -> Acoustic Front End -> Acoustic features -> Acoustic Model -> Phonetic features -> Language/phonetic model -> Words]
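
As a rough illustration of the two options for the last stage, the sketch below contrasts a tiny command grammar with a statistical (bigram) language model. The command list, the training sentences, and all function names are made up for this example; a real recognizer would use far larger models and would combine these scores with acoustic evidence.

```python
import math
from collections import Counter

# Hypothetical command grammar: only these utterances are in the language.
GRAMMAR = {"call home", "call office", "check voicemail"}

def grammar_accepts(utterance):
    """A grammar gives a yes/no answer: the utterance is allowed or it is not."""
    return utterance.lower() in GRAMMAR

def train_bigram_lm(sentences):
    """Count word pairs, with <s> marking the start of each sentence."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        vocab.update(words)
        contexts.update(words[:-1])            # words that are followed by something
        bigrams.update(zip(words, words[1:]))  # (previous word, word) pairs
    return contexts, bigrams, vocab

def log_prob(utterance, model):
    """Add-one smoothed bigram log-probability.

    Unseen word pairs get a low but nonzero score, so nothing is ruled out.
    """
    contexts, bigrams, vocab = model
    words = ["<s>"] + utterance.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))
    return total

# Hypothetical training sentences for the statistical model.
CORPUS = ["call home", "call the office", "check my voicemail", "call my office"]
MODEL = train_bigram_lm(CORPUS)
```

The grammar rejects "call my office" outright, while the language model merely scores it, and scores it higher than a scrambled word order such as "office my call". That is the sense in which grammars suit small command languages and language models suit open-ended input.
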
12
Speech UIs

Most implement a finite-state machine.
At each state, the system can recognize various
speech segments to take you to the next state(s).
A segment may be anything from a single word to a
complete utterance.
The system can also make utterances of its own
at various states.
You can specify them using regular expressions,
or using VoiceXML.
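
A minimal sketch of such a finite-state dialog, with the recognizable segments at each state written as regular expressions. The states, prompts, and patterns are invented for illustration, and the input is assumed to be text already produced by a recognizer.

```python
import re

# Hypothetical dialog: each state has a prompt (the system's own utterance)
# and a list of (regular expression, next state) transitions.
DIALOG = {
    "start": {
        "prompt": "Say 'call' followed by a name, or say 'voicemail'.",
        "transitions": [
            (re.compile(r"^call (?P<name>\w+)$"), "confirm_call"),
            (re.compile(r"^(check )?voicemail$"), "voicemail"),
        ],
    },
    "confirm_call": {
        "prompt": "Is that right?",
        "transitions": [
            (re.compile(r"^(yes|yeah|right)$"), "dialing"),
            (re.compile(r"^(no|nope)$"), "start"),
        ],
    },
    "dialing": {"prompt": "Dialing.", "transitions": []},
    "voicemail": {"prompt": "Opening voicemail.", "transitions": []},
}

def step(state, recognized_text):
    """Return (next_state, captured_slots); stay in the same state if nothing matches."""
    text = recognized_text.strip().lower()
    for pattern, next_state in DIALOG[state]["transitions"]:
        match = pattern.match(text)
        if match:
            return next_state, match.groupdict()
    return state, {}

def run(utterances, state="start"):
    """Feed recognized utterances through the machine and return the states visited."""
    visited = [state]
    for text in utterances:
        state, _ = step(state, text)
        visited.append(state)
    return visited
```

For example, `run(["call bob", "yes"])` visits `start`, `confirm_call`, and `dialing`, while an unmatched utterance leaves the machine where it was.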

13
Speech on phones

Speech recognition is faster and more accurate if
you limit the vocabulary to a few dozen words.
Small-vocabulary speech recognition has been
common on phones for the last few years:
Call a number
Call a name (from your contacts)
What about large vocabulary, continuous speech?

14
This year's Smartphone

This year's Smartphone (free with service
contract):
150-200 MHz ARM processor
32 MB RAM
2 GB flash (not included)
In effect, a Windows-98 PC that boots quickly!
Plus:
Camera
AGPS (Qualcomm/Snaptrack)
DSP cores, OpenGL GPU
EV-DO (300 kb/s), Bluetooth

15
Speech on phones

This is just the right power for high-performance
speech recognition.
Large-vocabulary speech recognition (not
continuous) appeared on phones last year:
Samsung P207
LVCSR (Large-Vocabulary Continuous Speech
Recognition) should be available this year.

16
Speech in the home

Good speech recognition used to require careful
microphone placement and a worn headset.

17
Speech in the home

New microphones: array mics with built-in DSPs
allow recognition at greater range (several
feet).
Users don't have to wear microphones any more
to use speech.

18
Speech in the home

Apart from CPU and memory (which are shrinking),
speech recognition requires only a microphone and
perhaps a speaker. It is power- and size-efficient.
In a few years, it will probably be possible to
build speech recognition into Bluetooth
microphones, or other small devices. Compare with
other interfaces.

19
Ten Guidelines for Speech Interfaces

1. You can't design what you can't define
2. Use user-centered design techniques
3. Use the right technology, and use technology right
4. Leverage the language instinct
5. Establish success criteria and test against them
6. Branding in VUI is more than just a pretty voice
7. How you say it is as important as what you say
8. Don't block the exit
9. Take care with error handling
10. Establish a change process

20
1. You can't design what you can't define

Consider the task(s) that your users want to do,
i.e. start with standard task analysis.
What conceptual model do they have (use
contextual inquiry)?
What language do they use to refer to it?
Use recordings during contextual inquiry/task
analysis.

21
2. Use user-centered design techniques

Great to see this advice in a trade publication.
You know a lot about this:
Study real use context: especially important for
mobile devices, medical, home, etc.
Perform needs analysis: what kinds of service
might the system provide and how valuable are
they?
Develop personae to guide your design.
Once again, study users' conceptual models.

22
3. Use the right technology, and use technology
right

In a speech interface, you have a choice between
synthesized and recorded speech for output.
In designing the recognizer, language models will
generally give better results for routing a broad
range of user questions.
Using technology right: speech recognizers are
fussy animals. They use many parameters to
trade off performance and accuracy. You have to
experiment with these in order to understand
them.

23
4. Leverage the Language Instinct

Make a voice UI resemble natural speech
Use familiar phrasing
Don't mimic written language
Use conversational style (pronouns,
acknowledgements, transition words)
Use realistic prosody (pitch etc.) in TTS
Enable callers to speak over and interrupt the
TTS system

24
5. Establish Success Criteria and Test Against
Them

Standard tests: recognition accuracy, speed, CPU.
Dialog traversal tests: capture many
conversations and plot the paths through your
dialog hierarchy that users took.
Usability testing.
Early rapid prototyping, WOZ (Wizard of Oz) testing.
Define "call success" in a sensible way, and
track it!
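
A small sketch of what a dialog traversal test and a call-success metric might look like in practice. The call logs, state names, and the definition of success (the call ends in a `done` state) are assumptions made for this example, not part of the lecture.

```python
from collections import Counter

# Hypothetical call logs: each call is the sequence of dialog states visited.
CALL_LOGS = [
    ["main", "flight_info", "confirm", "done"],
    ["main", "flight_info", "confirm", "done"],
    ["main", "flight_info", "main", "operator"],
    ["main", "baggage", "operator"],
    ["main", "hangup"],
]

# Assumed definition of success for this sketch: the call ended in "done".
SUCCESS_STATES = {"done"}

def path_counts(logs):
    """Count how many calls took each complete path through the dialog."""
    return Counter(tuple(path) for path in logs)

def transition_counts(logs):
    """Count how often callers moved from one state to the next."""
    counts = Counter()
    for path in logs:
        counts.update(zip(path, path[1:]))
    return counts

def success_rate(logs):
    """Fraction of calls whose final state counts as a success."""
    if not logs:
        return 0.0
    return sum(path[-1] in SUCCESS_STATES for path in logs) / len(logs)
```

Transition counts show where callers loop back or bail out to an operator. The success rate is only as meaningful as the definition behind `SUCCESS_STATES`, which is why the definition deserves care.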

25
6. Branding is more than a pretty voice

Users make strong attributions about a human
speaker (personality, education, demographics).
They do the same with speech interfaces (whether
you intend it or not).
Design of a voice UI is as significant as design
of an attractive web site. A robot voice UI is
like a 12-point text-only web site.
The voice interface's brand perception is a
combination of prosody and language, just like a
real speaker's. Design both explicitly.

26
7. How you say it is as important as what you say

This is mostly about speech constructed from recorded
voice.
For natural speech, you need to think about the
context of each word in real speech.
Pronunciation actually changes when words are
connected together (this is co-articulation).
Ideally, you would include appropriate context
information in each recording (e.g. the number
"one" followed by a "t" consonant).

27
8. Don't block the exit

Make sure users can exit the automated system and
reach a live person.
If you make it hard, they will get there anyway,
and be angry when they do.
Providing feedback can help (e.g. "the estimated
time to reach a representative is ...; do you wish
to return to the automated system?").
Make sure you transfer user data from the
automated system to the service person's console;
it looks really bad if you don't.

28
9. Take Care with Error Handling

Most speech dialog systems have internal state
(in a state machine) that the user can't see
except through what the system says.
You must treat errors (e.g. unrecognized
utterances) very carefully. If you leave the
current state, make sure users can understand the
state you've gone into.
Large changes (e.g. backtracking up to the
initial state) are extremely frustrating for
users.
If you backtrack, take small steps, only as much
as needed.
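
One way to read this advice in code: keep the path of visited states, re-prompt in place on the first failures, and back up a single state only after repeated failures, always telling the caller where they are. The class name, retry limit, and wording of the messages are assumptions of this sketch.

```python
class DialogSession:
    """Tracks the path of visited states so backtracking can be one step at a time."""

    def __init__(self, start_state, max_retries=2):
        self.history = [start_state]  # path so far; the last entry is the current state
        self.errors_here = 0          # consecutive recognition failures in this state
        self.max_retries = max_retries

    @property
    def state(self):
        return self.history[-1]

    def advance(self, next_state):
        """A successful recognition moves forward and clears the error count."""
        self.history.append(next_state)
        self.errors_here = 0
        return f"OK. Now at {next_state}."

    def recognition_failed(self):
        """Re-prompt first; after repeated failures, step back exactly one state.

        The returned message always names the state, since the caller cannot
        see it any other way.
        """
        self.errors_here += 1
        if self.errors_here <= self.max_retries or len(self.history) == 1:
            return f"Sorry, I didn't catch that. We're still at {self.state}."
        self.history.pop()
        self.errors_here = 0
        return f"Let's go back one step, to {self.state}."
```

With the default limit, two failures in `confirm` produce re-prompts and the third moves back to the previous state only, never all the way to the initial state in one jump.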

29
10. Establish a Change Process

Speech UIs are very complex, and very sensitive
to some small changes (esp. in the recognizer).
Make sure you manage changes to the system,
especially low-level changes. They should be
discouraged once the system is deployed.
Establish regression tests: representative
speech segments that the system should always
process successfully, and check them.
Always keep several working generations of the
system.
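
A sketch of such a regression check. The `recognize` argument stands in for whatever recognizer is deployed; here two lookup tables play the role of an old and a new build, and the file names and transcripts are invented.

```python
def run_regression(recognize, cases):
    """Run every (audio_id, expected_text) case and return the failures.

    `recognize` is any callable mapping an audio identifier to recognized
    text. It is a placeholder for the real recognizer, which this sketch
    does not model.
    """
    failures = []
    for audio_id, expected in cases:
        actual = recognize(audio_id)
        if actual.strip().lower() != expected.strip().lower():
            failures.append((audio_id, expected, actual))
    return failures

# Hypothetical regression set: recordings the system should always get right.
REGRESSION_CASES = [
    ("rec_001.wav", "flight information"),
    ("rec_002.wav", "four eight three three"),
    ("rec_003.wav", "operator"),
]

# Stand-ins for two generations of the system (lookup tables, not real ASR).
OLD_BUILD = {
    "rec_001.wav": "flight information",
    "rec_002.wav": "four eight three three",
    "rec_003.wav": "operator",
}
NEW_BUILD = {**OLD_BUILD, "rec_002.wav": "for eight three three"}
```

Running `run_regression(NEW_BUILD.get, REGRESSION_CASES)` reports the one recording the new build gets wrong, which is the signal to fall back to a previous working generation.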

30
The State of the Art

A few services represent the state of the art:
United Airlines flight information at
1-800-864-8331, then 1
Fedex package rates: 1-800-463-3339
Tellme: 1-800-555-1212, or 411 from Cingular
wireless or Verizon landlines.
Wildfire: speech phone services, voice dialing,
messages, etc. 1-800-WILDFIRE
Also: Schwab, Wachovia, E-trade, B-of-A,
Fidelity, ...

31
Conversational Speech

CLERK: Make of car?
DRIVER: Uh, Mercedes.
CLERK: Model year?
DRIVER: It's a 1970.
CLERK: Color?
Compare with this version:
CLERK: What's the make of your car?
DRIVER: Uh, Mercedes.
CLERK: OK. And the model year?
DRIVER: It's a 1970.
CLERK: Got it. What's the color?

32
Conversational Speech

The second version is both more polite and more
usable.
System status (i.e. that it understood the user's
responses) is always clear.
Design of speech character should include
normal human styling (politeness) but not
excessive anthropomorphism.
In particular, the system should never suggest
capabilities it does not have.

33
Conversational Speech Example

System: This is the delivery tracking center.
Tell me your four-digit delivery number or enter
it on the keypad.
Caller: 4-8-3-3.
System: 4-8-3-3. Is that right?
Caller: Yes.
System: OK, hold on... (logs into system) What's
your status? You can say "arrived", "departed" or
"delayed".
Caller: I'll be delayed two days. There's a big
storm.
System: Oh, sorry to hear that! Let me confirm. I
have delivery number 4-8-3-3 delayed for 48 hours
due to weather. Is that right?
Caller: Yes it is.
System: Great. Hold on... OK. It's in the system.
Hopefully you'll be on your way soon. I'll talk
to you when you arrive. Drive safely.

34
Conversational Speech

Very good usability is possible through clever
design.
It does not all depend on raw recognizer
accuracy.
Careful design includes appropriate personality,
giving enough flexibility to the user, and
responding to errors carefully.

35
What's happening now

Over the last half-dozen years, speech interfaces
have gotten a lot better.
Most of the improvement seems to be due to
improvements in method, i.e. iterative design,
and heuristic guidelines like the ones just
presented.
The field is a lot more interdisciplinary than it
used to be, including speech engineers, UI
designers and linguists.

36
The Future: Context-Awareness

Speech interfaces are rather limited today
because they either rely on tightly constrained
utterances, or on coarse language models.
In many cases, especially for mobile phones,
the context of use (time, location, metadata on
the phone) strongly constrains what users might
do.
Current research is using context data to improve
recognition all the way down. Instead of general
language models in the recognizer, you can push
down context information into it. The recognizer
can still recognize anything, but it will do
better with more likely utterances.
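
A simplified sketch of the idea. Instead of pushing context into the recognizer itself, this example rescores a list of candidate transcriptions after the fact, which is a cruder stand-in for the approach described. The candidate list, scores, and context probabilities are all invented.

```python
import math

def rescore(hypotheses, context_prior, floor=1e-3):
    """Combine recognizer scores with a context prior; return hypotheses best first.

    hypotheses: list of (text, log_score) pairs from the recognizer.
    context_prior: dict mapping text to its probability given the context.
    Anything not listed gets `floor`, so it can still be recognized but is
    treated as less likely.
    """
    scored = [
        (text, score + math.log(context_prior.get(text, floor)))
        for text, score in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidate list: the recognizer slightly prefers the wrong name.
N_BEST = [("call john", -4.0), ("call joan", -4.2), ("call jon", -4.5)]

# Hypothetical context: Joan is in the phone's contacts and was called recently.
CONTEXT_PRIOR = {"call joan": 0.6, "call john": 0.1}
```

Without context the top candidate is "call john". With the prior applied, "call joan" moves to the top, while "call jon" remains possible at a low score.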

37
Summary

Speech seems like a very good option for future
computing environments.
Small devices can support speech interfaces, and
microphone technology is getting better.
Speech UI design requires many of the same
principles as general UI design, especially:
Visibility of system status
User control and freedom
Helping users recognize and recover from errors
Application of these principles leads to highly
usable designs.