1
From Voice Browsers to Multimodal Systems
With thanks to Jim Larson
The W3C Speech Interface Framework
http://www.w3.org/Voice
  • Dave Raggett
  • W3C Lead for Voice/Multimodal
  • W3C & Openwave
  • dsr@w3.org

2
Voice: The Natural Interface, available from over
a billion phones
  • Personal assistant functions
  • Name dialing and Search
  • Personal Information Management
  • Unified Messaging (mail, fax & IM)
  • Call screening & call routing
  • Voice Portals
  • Access to news, information, entertainment,
    customer service and V-commerce (e.g. Find a
    friend, Wine Tips, Flight info, Find a hotel room,
    Buy ringing tones, Track a shipment)
  • Front-ends for Call Centers
  • 90% cost savings over human agents
  • Reduced call abandonment rates (IVR)
  • Increased customer satisfaction

(Portal Demo)
3
W3C Voice Browser Working Group
http://www.w3.org/Voice/Group
  • Founded May 1999 following workshop in October
    1998
  • Mission
  • Prepare and review markup languages to enable
    Internet-based speech applications
  • Has published requirements and specifications for
    languages in the W3C Speech Interface Framework
  • Is now due to be re-chartered with clarified IP
    policy

4
Voice Browser WG Membership
5
W3C Speech Interface Framework
(Architecture diagram) Components: ASR with the Speech
Recognition Grammar ML and N-gram Grammar ML; Language
Understanding with the Natural Language Semantics ML;
Context Interpretation; Dialog Manager with VoiceXML 2.0,
connected to the World Wide Web; DTMF Tone Recognizer;
Lexicon; Telephone System; Media Planning; Language
Generation; TTS with the Speech Synthesis ML; Prerecorded
Audio Player; Reusable Components; Call Control; User
6
W3C Speech Interface Framework Published Documents
Documents available at http://www.w3.org/Voice
(Status table) Specifications: Dialog (VoiceXML), Speech
Grammar, Speech Synthesis, N-gram, NL Semantics, Reusable
Components, Lexicon, and Call Control. Status stages: REQ,
WD, Last Call WD, CR, PR, REC. Requirements were published
between 12-99 and 5-00; working drafts followed between
1-01 and 4-01, with further drafts due soon.
7
Voice User Interfaces and VoiceXML
  • Why use voice as a user interface?
  • Far more phones than PCs
  • More wireless phones than PCs
  • Hands and eyes free operation
  • Why do we need a language for specifying voice
    dialogs?
  • High-level language simplifies application
    development
  • Separates Voice interface from Application server
  • Leverage existing Web application development
    tools
  • What does VoiceXML describe?
  • Conversational dialogs: system and user take turns
    to speak
  • Dialogs based on a form-filling metaphor plus
    events and links
  • W3C is standardizing VoiceXML based upon the VoiceXML
    1.0 submission by AT&T, IBM, Lucent and Motorola

8
VoiceXML Architecture
Brings the power of the Web to Voice
(Architecture diagram) Any phone, over the PSTN or VoIP,
reaches a VoiceXML gateway hosted by the carrier; the
gateway exchanges speech and DTMF with the caller and
fetches VoiceXML, grammars and audio files from a consumer
or corporate web site.
9
Reaching Out to Multiple Channels
Applications & Database
XML, Images, Audio
Content Adaptation: adjust as needed for each device & user
XHTML
VoiceXML
WML/HDML
10
VoiceXML Features
  • Menus, Forms, Sub-dialogs
  • <menu>, <form>, <subdialog>
  • Inputs
  • Speech Recognition: <grammar>
  • Recording: <record>
  • Keypad: <dtmf>
  • Output
  • Audio files: <audio>
  • Text-To-Speech
  • Variables
  • <var>, <script>
  • Events (see the sketch after this list)
  • <nomatch>, <noinput>, <help>, <catch>, <throw>
  • Transition & submission
  • <goto>, <submit>
  • Telephony
  • Call transfer
  • Telephony information
  • Platform
  • Objects
  • Performance
  • Fetch
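A minimal sketch (not from the original slides) of how the
event elements might be used inside a field; the prompt and
handler wording is illustrative only:

<field name="city">
  <prompt> Which city? </prompt>
  <grammar src="city.gram" type="application/x-jsgf"/>
  <noinput> Sorry, I didn't hear you. <reprompt/> </noinput>
  <nomatch> Sorry, I didn't understand. <reprompt/> </nomatch>
  <help> Please say the name of a city. </help>
</field>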

11
Example VoiceXML
<menu>
  <prompt> <speak>
    Welcome to Ajax Travel. Do you want to fly to
    <emphasis> New York </emphasis>
    or
    <emphasis> Washington </emphasis>
  </speak> </prompt>

  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>

  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
12
Example VoiceXML
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
13
VoiceXML Implementations
See http://www.w3.org/Voice
  • BeVocal
  • General Magic
  • HeyAnita
  • IBM
  • Lucent
  • Motorola
  • Nuance
  • PipeBeach
  • SpeechWorks
  • Telera
  • Tellme
  • Voice Genie

These are the companies that asked to be listed
on the W3C Voice page.
14
Reusable Components
(Diagram) The Voice Application Developer builds VoiceXML
scripts for the Dialog Manager, drawing on a library of
Reusable Components.
15
Reusable Dialog Modules
  • Express application at task level rather than
    interaction level
  • Save development time by reusing tried and
    effective modules
  • Increase consistency among applications
  • Examples include

Credit card number, Date, Name, Address, Telephone number,
Yes/No question, Shopping cart, Order status, Weather,
Stock quotes, Sport scores, Word games
(a minimal invocation sketch follows)
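In VoiceXML, a reusable dialog module can be invoked as a
sub-dialog. A minimal sketch, assuming a hypothetical hosted
credit-card component at the URI shown and hypothetical
return fields:

<form id="payment">
  <var name="cardnumber"/>
  <subdialog name="card"
             src="http://components.example.com/creditcard.vxml">
    <filled>
      <!-- copy the sub-dialog's returned value and submit it -->
      <assign name="cardnumber" expr="card.number"/>
      <submit next="/servlet/order" namelist="cardnumber"/>
    </filled>
  </subdialog>
</form>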
16
Speech Grammar ML
  • Specifies the words and patterns of words that a
    speaker-independent recognizer listens for
  • May be specified (see the sketch after this list):
  • Inline as part of a VoiceXML page
  • Referenced and stored separately on Web servers
  • Three variants: XML, ABNF, N-gram
  • Action Tags for semantic processing
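A minimal sketch of the two options; the external grammar
URI and file name are hypothetical, and the inline rule
content is illustrative only:

<!-- Inline grammar in a VoiceXML field -->
<field name="state">
  <grammar root="state">
    <rule id="state">
      <one-of> <item> Oregon </item> <item> Maine </item> </one-of>
    </rule>
  </grammar>
</field>

<!-- The same grammar referenced from a Web server -->
<field name="state">
  <grammar src="http://example.com/grammars/state.grxml"/>
</field>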

17
Three forms of the Grammar ML
  • XML
  • Modeled after Java Speech Grammar Format
  • Mandatory for Dialog ML interpreters
  • Manually specified by developer
  • Augmented BNF syntax (ABNF)
  • Modeled after Java Speech Grammar Format
  • Optional for Dialog ML interpreters
  • May be mapped to and from XML grammars
  • Manually specified by developer
  • N-grams
  • Optional for Dialog ML interpreters
  • Used for larger vocabularies
  • Generated statistically

<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>

public $state = Oregon | Maine;
18
Action Tags
  • Specify what VoiceXML variables to set when
    grammar rules are matched to user input
  • Based upon subset of ECMAScript

$drink = coke | pepsi | coca cola {"coke"};

// medium is default if nothing said
$size = {"medium"} [ small | medium | large | regular {"medium"} ];
19
N-Gram Language Models
  • Likelihood of a given word following certain
    others
  • Used as a linguistic model to identify most
    likely sequence of words that matches the spoken
    input
  • N-Grams are computed automatically from a corpus
    of many inputs
  • The N-Gram Markup Language is used as an interchange
    format for passing statistically derived word and
    phrase models to a dictation ASR engine (a worked
    example of the underlying estimate follows)
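As a reminder of the underlying statistics (a standard
definition, not from the slides), a trigram model estimates
the probability of a word from counts over a corpus:

P(w3 | w1, w2) ≈ count(w1 w2 w3) / count(w1 w2)

For example, if "want to fly" occurs 40 times in the corpus
and "want to" occurs 200 times, the model assigns
P(fly | want, to) ≈ 0.2.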

20
Speech synthesis process
Modeled after Sun's Java Speech Markup Language
(Pipeline) IN → Structure Analysis → Text Normalization →
Text-to-Phoneme Conversion → Prosody Analysis → Waveform
Production → OUT
  • Dr. Jones lives at 175 Park Dr. He weighs 175
    lb. He plays bass in a blues band. He also likes
    to fish; last week he caught a 20 lb. bass.
  • Doctor Jones lives at one seventy-five Park
    Drive. He weighs one hundred and seventy-five
    pounds. He plays base in a blues band. He also likes
    to fish; last week he caught a twenty-pound bass.

21
Speech Synthesis ML
(Pipeline stage: Structure Analysis)
<paragraph>
  <sentence> This is the first sentence. </sentence>
  <sentence> This is the second sentence. </sentence>
</paragraph>
Non-markup behavior: infer structure by automated text analysis
Markup support: paragraph, sentence
22
Speech Synthesis ML
(Pipeline stage: Text Normalization)
Non-markup behavior: automatically identify and convert constructs
Markup support: sayas for dates, times, etc.
Examples:
<sayas sub="World Wide Web Consortium"> W3C </sayas>
<sayas type="number:digits"> 175 </sayas>
23
Speech Synthesis ML
(Pipeline stage: Text-to-Phoneme Conversion)
Non-markup behavior: look up in a pronunciation dictionary
Markup support: phoneme, sayas
  • Phonetic Alphabets
  • International Phonetic Alphabet
  • Worldbet
  • X-SAMPA

International Phonetic Alphabet (IPA) using character entities
Example:
<phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>
24
Speech Synthesis ML
(Pipeline stage: Prosody Analysis)
Examples:
<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>
Prosody element attributes:
  pitch: high, medium, low, default
  contour
  range: high, medium, low, default
  rate: fast, medium, slow, default
  volume: silent, soft, medium, loud, default
Non-markup behavior: automatically generates prosody through
analysis of document structure and sentence syntax
Markup support: emphasis, break, prosody
25
Speech Synthesis ML
(Pipeline stage: Waveform Production)
Examples:
<audio src="laughter.wav"> laughter </audio>
<voice age="child"> Mary had a little lamb </voice>
Attributes:
  gender: male, female, neutral
  age: child, teenager, adult, elder, (integer)
  variant: different, (integer)
  name: default, (voice-name)
Markup support: voice, audio
26
LexiconML - Why?
  • Accurate pronunciations are essential in EVERY
    speech application
  • Platform default lexicons do not give 100%
    coverage of user speech

(Diagram) The Voice Application Developer supplies a
Pronunciation Lexicon shared by ASR and TTS; for example
the word "either" has two pronunciations, /iy th r/ and
/ay th r/.
27
LexiconML - Key Requirements
  • Meets both synthesis and recognition requirements
  • Pronunciations for any language (including tonal)
  • reuse standard alphabets, support for
    suprasegmentals
  • Multiple pronunciations per word (a hypothetical
    entry sketch follows this list)
  • Alternate orthographies
  • Spelling variations: colour and color
  • Alternative writing systems: Japanese Kanji and
    Kana
  • Abbreviations and acronyms, e.g. Dr., BT,
  • Homophones, e.g. read and reed (same sound)
  • Homographs, e.g. read and read (same spelling)
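The lexicon markup itself had not yet been specified at this
point. Purely as an illustration of the requirements above,
an entry for "either" with two pronunciations might look
something like this (element names are hypothetical, and the
pronunciations reuse the informal notation from the previous
slide):

<lexeme>
  <grapheme> either </grapheme>
  <phoneme> iy th r </phoneme>   <!-- "ee-ther" -->
  <phoneme> ay th r </phoneme>   <!-- "eye-ther" -->
</lexeme>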

28
Interaction Style
  • Voice user interfaces needn't be dull
  • Choose prompts to reflect an explicit choice of
    personality
  • Introduce variety in prompts rather than always
    repeating the same thing (see the sketch below)
  • Politeness, helpfulness and a sense of humor
  • Target different groups of users, e.g. Gen Y
  • Allow users to select a personality ("skin")
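VoiceXML supports tapered prompts, so a field can vary what
it says on successive attempts. A minimal sketch; the prompt
wording is illustrative only:

<field name="city">
  <prompt count="1"> Where would you like to fly to? </prompt>
  <prompt count="2"> Please tell me the city you want to fly to,
    for example Boston. </prompt>
  <grammar src="city.gram"/>
</field>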

(Personality Demo)
29
Call Control
(Diagram) The Voice Application Developer supplies VoiceXML
to the Dialog Manager and call-control instructions to the
Call Control component, which together handle the User's call.
(Call control Demo)
30
Call Control Requirements
  • Call management: place outbound call,
    conditionally answer inbound call, outbound fax
  • Call leg management: create, redirect, interact
    while on hold
  • Conference management: create, join, exit
  • Intersession communication: asynchronous events
  • Interpreter context: invoke, terminate

31
Natural Language Semantics ML
(Diagram) The Voice Application Developer supplies a grammar
with semantic tags to the ASR; recognized text flows through
Language Understanding, which produces NL Semantics for
Context Interpretation.
32
Natural Language Semantics ML
  • Represents semantic interpretations of an
    utterance
  • Speech
  • Natural language text
  • Other forms (e.g. handwriting, OCR, DTMF)
  • Used primarily as an interchange format among
    voice browser components
  • Usually generated automatically and not authored
    directly by developers
  • Goal is to use XForms as a data model

33
NL Semantics ML structure
(Structure diagram) Result → Interpretation (attributes:
confidence, grammar, x-model, xmlns) → Meaning, carried as
an xf:model plus an xf:instance whose application-specific
elements are defined by the XForms data model, together with
the incoming data as an Input element (attributes: mode,
timestamp-start, timestamp-end, confidence); input content
may be Text, or the special cases Nomatch and Noinput.
34
What toppings do you have?
<interpretation grammar="http://toppings"
                xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>

35
Richer Natural Language
  • Most current voice apps restrict users to
    keywords or short phrases
  • The application does most of the talking
  • The alternative is to use open grammars with word
    spotting and let the user do the talking
  • Rules for figuring out what the user said and why,
    as the basis for asking the next question

(GM/AskJeeves Demo)
36
Multimodal: Voice + Displays
What is the weather in San Francisco?
  • Say which City you want weather for and see the
    information on your phone
  • Say which bands/CDs you want to buy and confirm
    the choices visually

I want to place an order for Hotshot by Shaggy.
37
Multimodal Interaction
  • Multimodal applications
  • Voice + Display + Keypad + Stylus, etc.
  • User is free to switch between voice interaction
    and use of display/keypad/clicking/handwriting
  • July 2000: published Multimodal Requirements draft
  • Demonstrations of multimodal prototypes at the Paris
    face-to-face meeting of the Voice Browser WG
  • Joint W3C/WAP Forum workshop on Multimodal,
    Hong Kong, September 2000
  • February 2001: W3C publishes Multimodal Request
    for Proposals
  • Plan to set up a Multimodal Working Group later
    this year, assuming we get appropriate
    submission(s)

38
Multimodal Interaction
  • Primary market is mobile wireless:
  • cell phones, personal digital assistants and cars
  • Timescale is driven by deployment of 3G networks
  • Input modes
  • speech, keypads, pointing devices, and electronic
    ink
  • Output modes
  • speech, audio, and bitmapped or character cell
    displays
  • Architecture should allow for both local and
    remote speech processing

39
Some Ideas
W3C is seeking detailed proposals with broad
industry support as the basis for chartering a
multimodal working group
  • Speech-enabling XHTML (and WML) without requiring
    changes to the markup language
  • New ECMAScript Speech Object?
  • Loose coupling of VoiceXML with externally
    defined pages written in XHTML, SMIL, etc.
  • Turn-driven synchronization protocol based on
    SIP?
  • Distributed Speech Processing
  • Reduce load on wireless network and speech
    servers
  • Increase recognition accuracy in presence of
    noise
  • ETSI work on Aurora
  • Using pen-based gestures to constrain ASR (click
    and speak)

40
VoiceXML IP Issues
  • Technical work on VoiceXML 2.0 is proceeding well
  • Publication of VoiceXML 2.0 working draft held up
    over IP issues (although internal version is
    accessible to W3C Members)
  • Related specifications for grammar, speech
    synthesis, natural language semantics, lexicon,
    and call control have been or will shortly be
    published
  • W3C and VoiceXML Forum management are in the
    process of developing a formal Memorandum of
    Understanding
  • W3C is convening a Patent Advisory Group to
    recommend IP Policy for re-chartering the Voice
    Browser Activity
  • Draw inspiration from IETF, ECTF, ETSI and other
    bodies, e.g. require all WG members to license
    essential IP under openly specified RAND terms,
    with operational criteria for effective terms
    expressed as exit criteria for the Candidate
    Recommendation phase; no requirement for advance
    disclosure of IP

41
Discussion?