Title: From Voice Browsers to Multimodal Systems
1. From Voice Browsers to Multimodal Systems
With thanks to Jim Larson
The W3C Speech Interface Framework
http://www.w3.org/Voice
- Dave Raggett
- W3C Lead for Voice/Multimodal
- W3C / Openwave
- dsr@w3.org
2. Voice: The Natural Interface, available from over a billion phones
- Personal assistant functions
- Name dialing and Search
- Personal Information Management
- Unified messaging (mail, fax, IM)
- Call screening and call routing
- Voice Portals
- Access to news, information, entertainment, customer service and V-commerce (e.g. find a friend, wine tips, flight info, find a hotel room, buy ringing tones, track a shipment)
- Front-ends for call centers
- 90% cost savings over human agents
- Reduced call abandonment rates (IVR)
- Increased customer satisfaction
(Portal Demo)
3. W3C Voice Browser Working Group - http://www.w3.org/Voice/Group
- Founded May 1999 following a workshop in October 1998
- Mission
  - Prepare and review markup languages to enable Internet-based speech applications
- Has published requirements and specifications for languages in the W3C Speech Interface Framework
- Is now due to be re-chartered with a clarified IP policy
4. Voice Browser WG Membership
5. W3C Speech Interface Framework
[Architecture diagram: the user speaks to the system through the telephone system; input passes through the DTMF tone recognizer and the ASR (driven by the Speech Recognition Grammar ML and N-gram Grammar ML, with a shared Lexicon), then through Language Understanding and Context Interpretation (Natural Language Semantics ML) to the Dialog Manager (VoiceXML 2.0, Reusable Components, Call Control), which is connected to the World Wide Web; output flows through Media Planning and Language Generation to TTS (Speech Synthesis ML) and the Prerecorded Audio Player, and back to the user.]
6. W3C Speech Interface Framework: Published Documents
Documents available at http://www.w3.org/Voice
[Status table: the Dialog (VoiceXML), Speech Grammar, Speech Synthesis, N-gram, NL Semantics, Reusable Components, Lexicon and Call Control documents are each tracked through the W3C stages REQ, WD, LCWD, CR, PR and REC; requirements and drafts date from 12-99 through 4-01, with later stages marked "Soon".]
7. Voice User Interfaces and VoiceXML
- Why use voice as a user interface?
  - Far more phones than PCs
  - More wireless phones than PCs
  - Hands- and eyes-free operation
- Why do we need a language for specifying voice dialogs?
  - A high-level language simplifies application development
  - Separates the voice interface from the application server
  - Leverages existing Web application development tools
- What does VoiceXML describe?
  - Conversational dialogs: system and user take turns to speak
  - Dialogs based on a form-filling metaphor, plus events and links
- W3C is standardizing VoiceXML based upon the VoiceXML 1.0 submission by AT&T, IBM, Lucent and Motorola
8. VoiceXML Architecture
Brings the power of the Web to voice
[Architecture diagram: any phone connects over the PSTN or VoIP to a carrier-hosted VoiceXML gateway; the gateway fetches VoiceXML, grammars and audio files from a consumer or corporate web site, and the user interacts by speech or DTMF.]
9. Reaching Out to Multiple Channels
[Diagram: an applications database holding XML, images and audio feeds a content adaptation layer, which adjusts output as needed for each device and user and delivers XHTML, VoiceXML or WML/HDML.]
10. VoiceXML Features
- Menus, forms, sub-dialogs
  - <menu>, <form>, <subdialog>
- Inputs
  - Speech recognition: <grammar>
  - Recording: <record>
  - Keypad: <dtmf>
- Output
  - Audio files: <audio>
  - Text-to-speech
- Variables
  - <var>, <script>
- Events
  - <nomatch>, <noinput>, <help>, <catch>, <throw>
- Transition and submission
  - <goto>, <submit>
- Telephony
  - Call transfer
  - Telephony information
- Platform
  - Objects
- Performance
  - Fetch
11. Example VoiceXML
<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel. Do you want to fly to
      <emphasis> New York </emphasis>
      or
      <emphasis> Washington </emphasis>
    </speak>
  </prompt>
  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
12. Example VoiceXML
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
13. VoiceXML Implementations
See http://www.w3.org/Voice
- BeVocal
- General Magic
- HeyAnita
- IBM
- Lucent
- Motorola
- Nuance
- PipeBeach
- SpeechWorks
- Telera
- Tellme
- Voice Genie
These are the companies that asked to be listed on the W3C Voice page.
14. Reusable Components
[Diagram: the voice application developer writes VoiceXML scripts and draws on reusable components, both of which are executed by the dialog manager.]
15. Reusable Dialog Modules
- Express the application at the task level rather than the interaction level (see the sketch after this list)
- Save development time by reusing tried and effective modules
- Increase consistency among applications
- Examples include:
  - Credit card number, date, name, address, telephone number, yes/no question
  - Shopping cart, order status, weather, stock quotes, sport scores, word games
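In VoiceXML, a reusable module is typically packaged as a separately authored dialog and invoked with <subdialog>. A minimal sketch, assuming a hypothetical date.vxml module that returns its result in a variable named date (the /servlet/trip target is also hypothetical):

<form id="get_trip">
  <var name="depart_date"/>
  <!-- invoke a reusable date-collection module -->
  <subdialog name="depart" src="date.vxml">
    <filled>
      <assign name="depart_date" expr="depart.date"/>
      <prompt>Leaving on <value expr="depart_date"/>.</prompt>
      <submit next="/servlet/trip" namelist="depart_date"/>
    </filled>
  </subdialog>
</form>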
16. Speech Grammar ML
- Specifies the words and patterns of words for which a speaker-independent recognizer listens
- May be specified (see the sketch after this list)
  - Inline, as part of a VoiceXML page
  - By reference, stored separately on Web servers
- Three variants: XML, ABNF, N-gram
- Action tags for semantic processing
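A minimal sketch of both options inside a VoiceXML field (the external grammar URL is hypothetical, and the inline rule follows the draft XML syntax shown on the next slide):

<field name="state">
  <prompt>Which state?</prompt>
  <!-- inline grammar, written directly in the VoiceXML page -->
  <grammar>
    <rule id="state" scope="public">
      <one-of>
        <item> Oregon </item>
        <item> Maine </item>
      </one-of>
    </rule>
  </grammar>
  <!-- or reference a grammar stored separately on a Web server -->
  <grammar src="http://example.com/state-grammar.grxml"/>
</field>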
17. Three Forms of the Grammar ML
- XML
  - Modeled after the Java Speech Grammar Format
  - Mandatory for Dialog ML interpreters
  - Manually specified by the developer
- Augmented BNF syntax (ABNF)
  - Modeled after the Java Speech Grammar Format
  - Optional for Dialog ML interpreters
  - May be mapped to and from XML grammars
  - Manually specified by the developer
- N-grams
  - Optional for Dialog ML interpreters
  - Used for larger vocabularies
  - Generated statistically

XML form:
<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>

ABNF form:
public $state = Oregon | Maine;
18. Action Tags
- Specify which VoiceXML variables to set when grammar rules are matched against user input (a usage sketch follows the example below)
- Based upon a subset of ECMAScript
$drink = coke | pepsi | coca cola {"coke"};
// medium is default if nothing said
$size = small | medium | large | regular {"medium"};
19. N-Gram Language Models
- Give the likelihood of a given word following certain others
- Used as a linguistic model to identify the most likely sequence of words that matches the spoken input
- N-grams are computed automatically from a corpus of many inputs
- The N-Gram Markup Language is used as an interchange format for supplying automatically analyzed word and phrase statistics to a dictation ASR engine (see the example below)
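As an illustration of the statistics involved (not part of the markup language itself), a trigram model estimates the probability of a word from the two words that precede it, using counts taken from the corpus:

  P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)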
20. Speech Synthesis Process
Modeled after Sun's Java Speech Markup Language
Pipeline: Structure Analysis -> Text Normalization -> Text-to-Phoneme Conversion -> Prosody Analysis -> Waveform Production

IN: Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays bass in a blues band. He also likes to fish; last week he caught a 20 lb. bass.

OUT: Doctor Jones lives at one seventy-five Park Drive. He weighs one hundred and seventy-five pounds. He plays base in a blues band. He likes to fish; last week he caught a twenty-pound bass.
21. Speech Synthesis ML
(Pipeline stage: Structure Analysis)
Non-markup behavior: infer structure by automated text analysis
Markup support: paragraph, sentence

<paragraph>
  <sentence> This is the first sentence. </sentence>
  <sentence> This is the second sentence. </sentence>
</paragraph>
22. Speech Synthesis ML
(Pipeline stage: Text Normalization)
Non-markup behavior: automatically identify and convert constructs
Markup support: sayas for dates, times, etc.

Examples:
<sayas sub="World Wide Web Consortium"> W3C </sayas>
<sayas type="number:digits"> 175 </sayas>
23. Speech Synthesis ML
(Pipeline stage: Text-to-Phoneme Conversion)
Non-markup behavior: look up in a pronunciation dictionary
Markup support: phoneme, sayas
- Phonetic alphabets
  - International Phonetic Alphabet (IPA)
  - Worldbet
  - X-SAMPA

Example (IPA written using character entities):
<phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>
24. Speech Synthesis ML
(Pipeline stage: Prosody Analysis)
Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
Markup support: emphasis, break, prosody

Examples:
<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>

Prosody element attributes:
- pitch: high, medium, low, default
- contour
- range: high, medium, low, default
- rate: fast, medium, slow, default
- volume: silent, soft, medium, loud, default
25. Speech Synthesis ML
(Pipeline stage: Waveform Production)
Markup support: voice, audio

Examples:
<audio src="laughter.wav"> laughter </audio>
<voice age="child"> Mary had a little lamb </voice>

Voice element attributes:
- gender: male, female, neutral
- age: child, teenager, adult, elder, (integer)
- variant: different, (integer)
- name: default, (voice-name)
26. LexiconML - Why?
- Accurate pronunciations are essential in EVERY speech application
- Platform default lexicons do not give 100% coverage of user speech
[Diagram: the developer supplies a pronunciation lexicon shared by ASR and TTS; for example, "either" may be heard by the recognizer as /ay th r/ or /iy th r/ and rendered by the synthesizer as /ay th r/.]
27. LexiconML - Key Requirements
- Meets both synthesis and recognition requirements
- Pronunciations for any language (including tonal languages)
  - Reuse standard alphabets, support for suprasegmentals
- Multiple pronunciations per word (see the sketch after this list)
- Alternate orthographies
  - Spelling variations: colour and color
  - Alternative writing systems: Japanese Kanji and Kana
- Abbreviations and acronyms, e.g. Dr., BT
- Homophones, e.g. read and reed (same sound)
- Homographs, e.g. read and read (same spelling)
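The lexicon markup language had not yet been specified at the time of this talk; the fragment below is only a sketch in the style of what later became the W3C Pronunciation Lexicon Specification, showing two pronunciations for "either":

<lexicon version="1.0" alphabet="ipa" xml:lang="en-US"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme>either</grapheme>
    <!-- multiple pronunciations for the same word -->
    <phoneme>ˈaɪðər</phoneme>
    <phoneme>ˈiːðər</phoneme>
  </lexeme>
</lexicon>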
28. Interaction Style
- Voice user interfaces needn't be dull
- Choose prompts to reflect an explicit choice of personality
- Introduce variety in prompts rather than always repeating the same thing
- Politeness, helpfulness and a sense of humor
- Target different groups of users, e.g. Gen Y
- Allow users to select a personality ("skin")
(Personality Demo)
29. Call Control
[Diagram: the voice application developer drives both the dialog manager (VoiceXML) and the call control component, which together serve the user.]
(Call control Demo)
30. Call Control Requirements
- Call management: place outbound calls, conditionally answer inbound calls, send outbound faxes (see the sketch after this list)
- Call leg management: create, redirect, interact while on hold
- Conference management: create, join, exit
- Intersession communication: asynchronous events
- Interpreter context: invoke, terminate
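These requirements were later taken up in the Call Control XML (CCXML) work; the fragment below is only a rough sketch in that style, not taken from this talk, showing an outbound call handed to a VoiceXML dialog (the phone number and hello.vxml are hypothetical):

<ccxml version="1.0">
  <eventprocessor>
    <!-- place an outbound call once the CCXML session has loaded -->
    <transition event="ccxml.loaded">
      <createcall dest="'tel:+15551230000'"/>
    </transition>
    <!-- when the call is answered, run a VoiceXML dialog on it -->
    <transition event="connection.connected">
      <dialogstart src="'hello.vxml'"/>
    </transition>
  </eventprocessor>
</ccxml>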
31. Natural Language Semantics ML
[Diagram: the developer supplies grammars with semantic tags to the ASR; recognized text flows through Language Understanding and Context Interpretation to produce the NL Semantics representation.]
32. Natural Language Semantics ML
- Represents semantic interpretations of an utterance, whether from
  - Speech
  - Natural language text
  - Other forms (e.g. handwriting, OCR, DTMF)
- Used primarily as an interchange format among voice browser components
- Usually generated automatically and not authored directly by developers
- Goal is to use XForms as a data model
33. NL Semantics ML Structure
[Structure diagram: a result element (attributes: grammar, x-model, xmlns) contains one or more interpretation elements (attributes: confidence, grammar, x-model, xmlns). Each interpretation carries the meaning as an xf:x-model (the XForms definition) and an xf:instance holding application-specific elements defined by the XForms data model, together with the incoming data as an input element (attributes: mode, timestamp-start, timestamp-end, confidence) whose content is the recognized text, nomatch or noinput.]
34. What toppings do you have?
<interpretation grammar="http://toppings"
                xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>
35. Richer Natural Language
- Most current voice apps restrict users to keywords or short phrases
  - The application does most of the talking
- An alternative is to use open grammars with word spotting and let the user do the talking
- Rules for figuring out what the user said, and why, serve as the basis for asking the next question
(GM/AskJeeves Demo)
36. Multimodal: Voice + Displays
"What is the weather in San Francisco?"
- Say which city you want weather for and see the information on your phone
- Say which bands/CDs you want to buy and confirm the choices visually
"I want to place an order for Hotshot by Shaggy."
37. Multimodal Interaction
- Multimodal applications
  - Voice + display + keypad + stylus, etc.
  - The user is free to switch between voice interaction and use of the display, keypad, clicking or handwriting
- July 2000: published Multimodal Requirements draft
- Demonstrations of multimodal prototypes at the Paris face-to-face meeting of the Voice Browser WG
- Joint W3C/WAP Forum workshop on multimodal, Hong Kong, September 2000
- February 2001: W3C publishes Multimodal Request for Proposals
- Plan to set up a Multimodal Working Group later this year, assuming we get appropriate submission(s)
38. Multimodal Interaction
- Primary market is mobile wireless
  - Cell phones, personal digital assistants and cars
- Timescale is driven by deployment of 3G networks
- Input modes
  - Speech, keypads, pointing devices, and electronic ink
- Output modes
  - Speech, audio, and bitmapped or character-cell displays
- Architecture should allow for both local and remote speech processing
39. Some Ideas
W3C is seeking detailed proposals with broad industry support as a basis for chartering a multimodal working group
- Speech-enabling XHTML (and WML) without requiring changes to the markup language
- New ECMAScript speech object?
- Loose coupling of VoiceXML with externally defined pages written in XHTML, SMIL, etc.
- Turn-driven synchronization protocol based on SIP?
- Distributed speech processing
  - Reduce load on the wireless network and speech servers
  - Increase recognition accuracy in the presence of noise
  - ETSI work on Aurora
- Using pen-based gestures to constrain ASR (click and speak)
40. VoiceXML IP Issues
- Technical work on VoiceXML 2.0 is proceeding well
- Publication of the VoiceXML 2.0 working draft is held up over IP issues (although an internal version is accessible to W3C Members)
- Related specifications for grammar, speech synthesis, natural language semantics, lexicon, and call control have been or will shortly be published
- W3C and VoiceXML Forum management are in the process of developing a formal Memorandum of Understanding
- W3C is convening a Patent Advisory Group to recommend an IP policy for re-chartering the Voice Browser Activity
  - Draw inspiration from IETF, ECTF, ETSI and other bodies, e.g. require all WG members to license essential IP under openly specified RAND terms, with operational criteria for effectiveness expressed as exit criteria for the Candidate Recommendation phase
  - No requirement for advance disclosure of IP
41. Discussion?