CS 260: Lecture 10

1
CS 260 Lecture 10
  • Professor John Canny

2
Speech: the Ultimate Interface?
  • In the early days of HCI, people assumed that
    speech/natural language would be the ultimate UI.
  • Use of speech interfaces has grown, but it is
    still rarely used in the office. Why?

3
Speech: the Ultimate Interface?
  • Why speech hasn't succeeded in the office:
  • Affordances of text:
      • Visual scanning (for email or docs)
      • Unambiguity of text
      • Editing of text
  • Disadvantages of speech:
      • Noise: call-center ambience
      • Lack of privacy

4
Speech: the Ultimate Interface?
  • Use of speech interfaces has grown, but it is
    still rarely used in the office.

5
Computing is Moving
  • Where are computers these days?
  • Intel's breakdown (based on PC sales):
      • Office
      • Home
      • Mobile (laptops)
      • Medical
  • And as we noted earlier, programmable smartphones
    will soon outnumber total PCs.
  • Then there are game boxes, cable boxes, Smart TVs
    etc.

6
What is a good interface for
  • Mobile computing (walking or driving)?
  • Home computing?
  • Medical computing?

7
Where is the industry now?
  • After a big slump around 2002, the speech
    technology/voice interface industry seems to be
    growing briskly, about 30-40% per year. One
    current estimate put it at about $2.5 billion.
  • It would probably be more visible, except several
    related industries have overtaken it: outsourced
    call centers and VoIP (Voice over IP).
  • The biggest growth has been in the new markets:
      • Cell phones (as a local UI)
      • Medical (e.g. order entry)
      • Voice services over the phone

8
Industry movement
  • In January this year, Yahoo acquired a large team
    of speech engineers from Nuance, the largest
    speech company (which owns Dragon
    NaturallySpeaking).
  • Google already had some leading speech
    researchers.
  • So there is much interest in speech for the
    portal market.
  • Aside: there is a division of Nuance devoted to
    medical speech recognition, and one to call
    centers.

9
Industry movement
  • Heyanita: voice-based email and messaging
  • Bevocal: hosted IVR (Interactive Voice Response)
    for customers, e.g. MetroPCS
  • Tellme: find a business service (including
    restaurants) using ASR (automatic speech
    recognition).

10
Speech: Some background
  • A speech recognizer consists of three stages.
  • A state-of-the-art recognizer requires 50-100
    Mflops for continuous speech (no pauses between
    words).
  • PC continuous speech recognizers appeared in the
    1990s and saved many victims of RSI.

[Figure: recognizer pipeline. Raw sound -> Acoustic Front End -> Acoustic features -> Acoustic Model -> Phonetic features -> Language/phonetic model -> Words]
11
Speech: Some background
  • The first two stages are standard. The last is
    not, and has a big impact on performance.
  • The last box encodes knowledge of what users
    might say, either as a grammar, or as a
    statistical language model (LM). Grammars are
    suitable for small recognition tasks with
    well-known command languages.

[Figure: recognizer pipeline. Raw sound -> Acoustic Front End -> Acoustic features -> Acoustic Model -> Phonetic features -> Language/phonetic model -> Words]
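
The difference between the two options for the last stage can be illustrated with a minimal sketch. The command set, the tiny corpus, and all function names here are hypothetical and not from the lecture; real recognizers use far larger models. A grammar accepts or rejects a word sequence outright, while a statistical language model (here a bigram model with add-one smoothing) gives every sequence some probability.

```python
import re
from collections import Counter

# Grammar: a regular expression over a small, well-known command language
# (hypothetical commands chosen for illustration).
COMMAND_GRAMMAR = re.compile(r"^(call|text) (home|office|voicemail)$")

def grammar_accepts(utterance: str) -> bool:
    """A grammar gives a yes/no answer: the utterance is in the language or not."""
    return COMMAND_GRAMMAR.match(utterance) is not None

# Statistical LM: bigram counts from a toy corpus (hypothetical data).
CORPUS = ["call home", "call the office", "text home", "call home now"]

def train_bigram(sentences):
    """Count how often each word follows each other word."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, len(vocab)

def bigram_probability(utterance: str, model) -> float:
    """Probability of the utterance; add-one smoothing keeps unseen pairs above zero."""
    unigrams, bigrams, vocab_size = model
    words = ["<s>"] + utterance.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return p
```

With this sketch, `grammar_accepts("call the office")` is false because the phrase is outside the command language, while the bigram model still assigns it a nonzero probability and simply ranks likely word orders above unlikely ones.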
12
Speech UIs
  • Most implement a finite-state machine.
  • At each state, the system can recognize various
    speech segments to take you to the next state(s).
  • A segment may be anything from a single word to
    a complete utterance.
  • The system can also make utterances of its own
    at various states.
  • You can specify these state machines using
    regular expressions, or using VoiceXML.
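
A finite-state speech UI of this kind might be sketched as follows. This is only an illustration under stated assumptions: the state names, prompts, and patterns are hypothetical, and plain text strings stand in for the recognizer's output.

```python
import re

# Each state has a prompt (the system's own utterance) and a list of
# (pattern, next_state) transitions. All names here are hypothetical.
DIALOG = {
    "ask_number": {
        "prompt": "Tell me your four-digit delivery number.",
        "transitions": [(r"^\d{4}$", "ask_status")],
    },
    "ask_status": {
        "prompt": "What's your status? You can say arrived, departed or delayed.",
        "transitions": [(r"\b(arrived|departed|delayed)\b", "done")],
    },
    "done": {"prompt": "OK. It's in the system.", "transitions": []},
}

def step(state: str, utterance: str) -> str:
    """Return the next state, or stay in the same state if nothing matched."""
    for pattern, next_state in DIALOG[state]["transitions"]:
        if re.search(pattern, utterance.lower()):
            return next_state
    return state

def run(utterances, start: str = "ask_number"):
    """Feed recognized utterances through the machine; collect what the system says."""
    state = start
    spoken = [DIALOG[state]["prompt"]]
    for utterance in utterances:
        state = step(state, utterance)
        spoken.append(DIALOG[state]["prompt"])
    return state, spoken
```

For example, `run(["4833", "I'll be delayed two days"])` ends in the `done` state, while an unmatched utterance leaves the machine where it was and repeats the prompt.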

13
Speech on phones
  • Speech recognition is faster and more accurate if
    you limit the vocabulary to a few dozen words.
  • Small-vocabulary speech recognition has been
    common on phones for the last few years:
      • Call a number
      • Call a name (from your contacts)
  • What about large vocabulary, continuous speech?

14
This year's smartphone
  • This year's smartphone (free with service
    contract):
      • 150-200 MHz ARM processor
      • 32 MB RAM
      • 2 GB flash (not included)
  • In effect, a Windows 98-class PC that boots
    quickly!
  • Plus:
      • Camera
      • AGPS (Qualcomm/Snaptrack)
      • DSP cores, OpenGL GPU
      • EV-DO (300 kb/s), Bluetooth

15
Speech on phones
  • This is just the right power for high-performance
    speech recognition.
  • Large-vocabulary speech recognition (not
    continuous) appeared on phones last year:
    Samsung P207.
  • LVCSR (Large-Vocabulary Continuous Speech
    Recognition) should be available this year.

16
Speech in the home
  • Good speech recognition used to require careful
    microphone placement and a worn headset.

17
Speech in the home
  • New microphones: array mics with built-in DSPs
    allow recognition at greater range (several
    feet).
  • Users don't have to wear microphones any more
    to use speech.

18
Speech in the home
  • Apart from CPU and memory (which are shrinking),
    speech recognition requires only a microphone and
    perhaps a speaker. It is power- and
    size-efficient.
  • In a few years, it will probably be possible to
    build speech recognition into Bluetooth
    microphones, or other small devices. Compare with
    other interfaces.

19
Ten Guidelines for Speech Interfaces
  1. You can't design what you can't define
  2. Use user-centered design techniques
  3. Use the right technology, and use technology
    right
  4. Leverage the language instinct
  5. Establish success criteria and test against them
  6. Branding in VUI is more than just a pretty voice
  7. How you say it is as important as what you say
  8. Don't block the exit
  9. Take care with error handling
  10. Establish a change process

20
1. You can't design what you can't define
  • Consider the task(s) that your users want to do,
    i.e. start with standard task analysis.
  • What conceptual model do they have (use
    contextual inquiry)?
  • What language do they use to refer to it?
  • Use recordings during contextual inquiry/task
    analysis.

21
2. Use user-centered design techniques
  • Great to see this advice in a trade publication.
    You know a lot about this:
  • Study the real use context: especially important
    for mobile devices, medical, home etc.
  • Perform needs analysis: what kinds of service
    might the system provide and how valuable are
    they?
  • Develop personae to guide your design.
  • Once again, study users' conceptual models.

22
3. Use the right technology, and use technology
right
  • In a speech interface, you have a choice between
    synthesized and recorded speech for output.
  • In designing the recognizer, language models will
    generally give better results for routing a broad
    range of user questions.
  • Using technology right: speech recognizers are
    fussy animals. They use many parameters to
    trade off performance and accuracy. You have to
    experiment with these in order to understand
    them.

23
4. Leverage the Language Instinct
  • Make a voice UI resemble natural speech:
      • Use familiar phrasing
      • Don't mimic written language
      • Use conversational style (pronouns,
        acknowledgements, transition words)
      • Use realistic prosody (pitch etc.) in TTS
      • Enable callers to speak over and interrupt the
        TTS system

24
5. Establish Success Criteria and Test Against
them
  • Standard tests: recognition accuracy, speed, CPU
  • Dialog traversal tests: capture many
    conversations and plot the paths through your
    dialog hierarchy that users took.
  • Usability testing
  • Early rapid prototyping: WOZ (Wizard of Oz)
    testing
  • Define call success in a sensible way, and
    track it!
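
The traversal and call-success bullets above can be sketched as follows, assuming each logged call is simply the ordered list of dialog states it visited. The log data, state names, and the choice of "done" as the success state are all hypothetical; what counts as success is a design decision for each service.

```python
from collections import Counter

# Hypothetical call logs: one list of visited dialog states per call.
CALLS = [
    ["main", "flight_info", "confirm", "done"],
    ["main", "flight_info", "confirm", "done"],
    ["main", "flight_info", "main", "operator"],
    ["main", "operator"],
]

def edge_counts(calls):
    """Count how often callers moved from one dialog state to the next."""
    edges = Counter()
    for path in calls:
        edges.update(zip(path, path[1:]))
    return edges

def success_rate(calls, success_states=frozenset({"done"})):
    """Fraction of calls that ended in a state defined as success."""
    if not calls:
        return 0.0
    ok = sum(1 for path in calls if path and path[-1] in success_states)
    return ok / len(calls)
```

The edge counts are the raw material for plotting paths through the dialog hierarchy: a heavily used edge back to `main` or out to `operator` points at a state where callers get stuck.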

25
6. Branding is more than a pretty voice
  • Users make strong attributions about a human
    speaker (personality, education, demographics).
    They do the same with speech interfaces (whether
    you intend it or not).
  • Design of a voice UI is as significant as design
    of an attractive web site. A robot voice UI is
    like a 12-point text-only web site.
  • The voice interface's brand perception is a
    combination of prosody and language, just like a
    real speaker's. Design both explicitly.

26
7. How you say it is as important as what you say
  • Mostly about speech constructed from recorded
    voice.
  • For natural speech, you need to think about the
    context of each word in real speech.
  • Pronunciation actually changes when words are
    connected together (this is co-articulation).
  • Ideally, you would include appropriate context
    information in each recording (e.g. the number
    "one" followed by a "t" consonant).

27
8. Don't block the exit
  • Make sure users can exit the automated system and
    reach a live person.
  • If you make it hard, they will get there anyway,
    and be angry when they do.
  • Providing feedback can help (e.g. "The estimated
    time to reach a representative is ...; do you
    wish to return to the automated system?").
  • Make sure you transfer user data from the
    automated system to the service person's console:
    it looks really bad if you don't.

28
9. Take Care with Error Handling
  • Most speech dialog systems have internal state
    (in a state machine) that the user can't see
    except through what the system says.
  • You must treat errors (e.g. unrecognized
    utterances) very carefully. If you leave the
    current state, make sure users can understand the
    state you've gone into.
  • Large changes (e.g. backtracking up to the
    initial state) are extremely frustrating for
    users.
  • If you backtrack, take small steps, only as much
    as needed.
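
One way to realize small-step backtracking is to keep a history of visited states and pop a single state only after repeated errors. The sketch below makes several assumptions that are not in the lecture: the class name, the two-error threshold, and the wording of the system's replies are hypothetical.

```python
class DialogSession:
    """Tracks dialog history so that error recovery backs up one step at a time."""

    def __init__(self, start: str, max_errors: int = 2):
        self.history = [start]        # states visited so far, oldest first
        self.errors = 0               # consecutive errors in the current state
        self.max_errors = max_errors  # hypothetical threshold before backing up

    @property
    def state(self) -> str:
        return self.history[-1]

    def advance(self, next_state: str) -> None:
        """Move forward after a successfully recognized utterance."""
        self.history.append(next_state)
        self.errors = 0

    def on_error(self) -> str:
        """Handle an unrecognized utterance and return what the system should say."""
        self.errors += 1
        if self.errors < self.max_errors or len(self.history) == 1:
            # Stay put and tell the user where they are.
            return f"Sorry, I didn't catch that. We're still at {self.state}."
        # Back up exactly one state, never all the way to the start.
        self.history.pop()
        self.errors = 0
        return f"Let's go back one step, to {self.state}."
```

Every reply names the state the user is in, which is the only way the user can see the system's internal state.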

29
10. Establish a Change Process
  • Speech UIs are very complex, and very sensitive
    to some small changes (esp. in the recognizer).
  • Make sure you manage changes to the system,
    especially low-level changes. They should be
    discouraged once the system is deployed.
  • Establish regression tests: representative
    speech segments that the system should always
    process successfully, and check them.
  • Always keep several working generations of the
    system.
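
A regression suite of this kind might look like the sketch below. The clip names, expected transcripts, and the stub recognizer are all hypothetical; in practice `recognize` would wrap whatever recognizer build is under test.

```python
# Hypothetical regression set: recorded clips paired with the transcript
# the deployed system is expected to keep producing.
REGRESSION_CASES = [
    ("clip_001.wav", "flight information"),
    ("clip_002.wav", "operator"),
]

def run_regression(recognize, cases):
    """Return (clip, expected, actual) for every clip that no longer passes."""
    failures = []
    for clip, expected in cases:
        actual = recognize(clip)
        if actual != expected:
            failures.append((clip, expected, actual))
    return failures

def stub_recognizer(clip: str) -> str:
    """Stand-in for a real recognizer build; it misrecognizes the second clip."""
    canned = {
        "clip_001.wav": "flight information",
        "clip_002.wav": "operate her",
    }
    return canned.get(clip, "")
```

Running `run_regression(stub_recognizer, REGRESSION_CASES)` reports the second clip as a failure, which is the signal to fall back to an earlier working generation of the system.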

30
The state of the Art
  • A few services represent the state of the art:
      • United Airlines: flight information at
        1-800-864-8331, then 1
      • Fedex: package rates, 1-800-463-3339
      • Tellme: 1-800-555-1212, or 411 from Cingular
        wireless or Verizon landlines
      • Wildfire: speech phone services, voice dialing,
        messages, etc., 1-800-WILDFIRE
  • Also: Schwab, Wachovia, E-trade, B-of-A,
    Fidelity, ...

31
Conversational Speech
  • CLERK: Make of car?
  • DRIVER: Uh, Mercedes.
  • CLERK: Model year?
  • DRIVER: It's a 1970.
  • CLERK: Color?
  • Compare with this version:
  • CLERK: What's the make of your car?
  • DRIVER: Uh, Mercedes.
  • CLERK: OK. And the model year?
  • DRIVER: It's a 1970.
  • CLERK: Got it. What's the color?

32
Conversational Speech
  • The second version is both more polite and more
    usable.
  • System status (i.e. that it understood the user's
    responses) is always clear.
  • Design of speech character should include
    normal human styling (politeness) but not
    excessive anthropomorphism.
  • In particular, the system should never suggest
    capabilities it does not have.

33
Conversational Speech Example
  • System: This is the delivery tracking center.
    Tell me your four-digit delivery number or enter
    it on the keypad.
  • Caller: 4-8-3-3.
  • System: 4-8-3-3. Is that right?
  • Caller: Yes.
  • System: OK, hold on... (logs into system) What's
    your status? You can say "arrived", "departed" or
    "delayed".
  • Caller: I'll be delayed two days. There's a big
    storm.
  • System: Oh, sorry to hear that! Let me confirm. I
    have delivery number 4-8-3-3 delayed for 48 hours
    due to weather. Is that right?
  • Caller: Yes it is.
  • System: Great. Hold on... OK. It's in the system.
    Hopefully you'll be on your way soon. I'll talk
    to you when you arrive. Drive safely.

34
Conversational Speech
  • Very good usability is possible through clever
    design.
  • It does not all depend on raw recognizer
    accuracy.
  • Careful design includes appropriate personality,
    giving enough flexibility to the user, and
    responding to errors carefully.

35
What's happening now
  • Over the last half-dozen years, speech interfaces
    have gotten a lot better.
  • Most of the improvement seems to be due to
    improvements in method, i.e. iterative design,
    and heuristic guidelines like the ones just
    presented.
  • The field is a lot more interdisciplinary than it
    used to be, including speech engineers, UI
    designers and linguists.

36
The Future: Context-Awareness
  • Speech interfaces are rather limited today
    because they either rely on tightly constrained
    utterances, or on coarse language models.
  • In many cases, especially for mobile phones,
    the context of use (time, location, metadata on
    the phone) strongly constrains what users might
    do.
  • Current research is using context data to improve
    recognition all the way down. Instead of general
    language models in the recognizer, you can push
    down context information into it. The recognizer
    can still recognize anything, but it will do
    better with more likely utterances.
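
The idea of pushing context into the recognizer can be approximated, in a much simplified form, by rescoring recognizer hypotheses with a context-dependent prior. The sketch below is an assumption-laden illustration, not the method used in the research mentioned: the scores, the prior, the `weight` and `floor` parameters, and the function name are all hypothetical, and a real system would apply context inside the search rather than after it.

```python
import math

def rescore(hypotheses, context_prior, weight=1.0, floor=1e-6):
    """Pick the best hypothesis after adding a context log-prior.

    hypotheses:    utterance -> recognizer log-score (higher is better)
    context_prior: utterance -> probability given the context of use
    floor:         small probability for utterances the context says nothing
                   about, so that anything can still be recognized
    """
    scored = {
        text: log_score + weight * math.log(context_prior.get(text, floor))
        for text, log_score in hypotheses.items()
    }
    return max(scored, key=scored.get)

# Hypothetical example: two acoustically similar hypotheses.
HYPOTHESES = {"call mom": -4.1, "call tom": -4.0}
# Hypothetical prior derived from phone metadata, e.g. call history.
CONTEXT_PRIOR = {"call mom": 0.3, "call tom": 0.01}
```

Without context the slightly better acoustic score wins ("call tom"); with the prior, the utterance that the context makes more likely is chosen ("call mom"). The floor keeps unlikely utterances recognizable, matching the point that the recognizer can still recognize anything.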

37
Summary
  • Speech seems like a very good option for future
    computing environments.
  • Small devices can support speech interfaces, and
    microphone technology is getting better.
  • Speech UI design requires many of the same
    principles as general UI design, especially:
      • Visibility of system status
      • User control and freedom
      • Helping users recognize and recover from errors
  • Application of these principles leads to highly
    usable designs.