1
From Voice Browsers to Multimodal Systems
With thanks to Jim Larson
The W3C Speech Interface Framework
http://www.w3.org/Voice
  • Dave Raggett
  • W3C Lead for Voice/Multimodal
  • W3C & Openwave
  • dsr@w3.org

2
Voice: The Natural Interface, available from over
a billion phones
  • Personal assistant functions
  • Name dialing and Search
  • Personal Information Management
  • Unified Messaging (mail, fax & IM)
  • Call screening & call routing
  • Voice Portals
  • Access to news, information, entertainment,
    customer service and V-commerce (e.g. Find a
    friend, Wine Tips, Flight info, Find a hotel room,
    Buy ringing tones, Track a shipment)
  • Front-ends for Call Centers
  • 90% cost savings over human agents
  • Reduced call abandonment rates (IVR)
  • Increased customer satisfaction

(Portal Demo)
3
W3C Voice Browser Working Group
http://www.w3.org/Voice/Group
  • Founded May 1999 following workshop in October
    1998
  • Mission
  • Prepare and review markup languages to enable
    Internet-based speech applications
  • Has published requirements and specifications for
    languages in the W3C Speech Interface Framework
  • Is now due to be re-chartered with clarified IP
    policy

4
Voice Browser WG Membership
5
W3C Speech Interface Framework
(Architecture diagram) Components: ASR with the Speech
Recognition Grammar ML and N-gram Grammar ML; Language
Understanding with the Natural Language Semantics ML;
Context Interpretation; Dialog Manager with VoiceXML 2.0,
connected to the World Wide Web; DTMF Tone Recognizer;
Lexicon; Telephone System; Media Planning; Language
Generation; TTS with the Speech Synthesis ML; Prerecorded
Audio Player; Reusable Components; Call Control; User
6
W3C Speech Interface Framework Published Documents
Documents available at http://www.w3.org/Voice
(Status table) Specifications: Dialog (VoiceXML), Speech
Grammar, Speech Synthesis, N-gram, NL Semantics, Reusable
Components, Lexicon, and Call Control. Status stages: REQ,
WD, Last Call WD, CR, PR, REC. Requirements were published
between 12-99 and 5-00; working drafts followed between
1-01 and 4-01, with further drafts due soon.
7
Voice User Interfaces and VoiceXML
  • Why use voice as a user interface?
  • Far more phones than PCs
  • More wireless phones than PCs
  • Hands and eyes free operation
  • Why do we need a language for specifying voice
    dialogs?
  • High-level language simplifies application
    development
  • Separates Voice interface from Application server
  • Leverage existing Web application development
    tools
  • What does VoiceXML describe?
  • Conversational dialogs: system and user take turns
    to speak
  • Dialogs based on a form-filling metaphor plus
    events and links
  • W3C is standardizing VoiceXML based upon the VoiceXML
    1.0 submission by AT&T, IBM, Lucent and Motorola

8
VoiceXML Architecture
Brings the power of the Web to Voice
(Architecture diagram) Any phone, over the PSTN or VoIP,
reaches a VoiceXML gateway hosted by the carrier; the
gateway exchanges speech and DTMF with the caller and
fetches VoiceXML, grammars and audio files from a consumer
or corporate web site.
9
Reaching Out to Multiple Channels
Applications & Database
XML, Images, Audio
Content Adaptation: adjust as needed for each device & user
XHTML
VoiceXML
WML/HDML
10
VoiceXML Features
  • Menus, Forms, Sub-dialogs
  • <menu>, <form>, <subdialog>
  • Inputs
  • Speech Recognition: <grammar>
  • Recording: <record>
  • Keypad: <dtmf>
  • Output
  • Audio files: <audio>
  • Text-To-Speech
  • Variables
  • <var>, <script>
  • Events (see the sketch after this list)
  • <nomatch>, <noinput>, <help>, <catch>, <throw>
  • Transition & submission
  • <goto>, <submit>
  • Telephony
  • Call transfer
  • Telephony information
  • Platform
  • Objects
  • Performance
  • Fetch
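A minimal sketch (not from the original slides) of how the
event elements might be used inside a field; the prompt and
handler wording is illustrative only:

<field name="city">
  <prompt> Which city? </prompt>
  <grammar src="city.gram" type="application/x-jsgf"/>
  <noinput> Sorry, I didn't hear you. <reprompt/> </noinput>
  <nomatch> Sorry, I didn't understand. <reprompt/> </nomatch>
  <help> Please say the name of a city. </help>
</field>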

11
Example VoiceXML
<menu>
  <prompt> <speak>
    Welcome to Ajax Travel. Do you want to fly to
    <emphasis> New York </emphasis>
    or
    <emphasis> Washington </emphasis>
  </speak> </prompt>

  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>

  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
12
Example VoiceXML
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
13
VoiceXML Implementations
See http://www.w3.org/Voice
  • BeVocal
  • General Magic
  • HeyAnita
  • IBM
  • Lucent
  • Motorola
  • Nuance
  • PipeBeach
  • SpeechWorks
  • Telera
  • Tellme
  • Voice Genie

These are the companies that asked to be listed
on the W3C Voice page.
14
Reusable Components
(Diagram) The Voice Application Developer builds VoiceXML
scripts for the Dialog Manager, drawing on a library of
Reusable Components.
15
Reusable Dialog Modules
  • Express application at task level rather than
    interaction level
  • Save development time by reusing tried and
    effective modules
  • Increase consistency among applications
  • Examples include

Credit card number, Date, Name, Address, Telephone number,
Yes/No question, Shopping cart, Order status, Weather,
Stock quotes, Sport scores, Word games
(a minimal invocation sketch follows)
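In VoiceXML, a reusable dialog module can be invoked as a
sub-dialog. A minimal sketch, assuming a hypothetical hosted
credit-card component at the URI shown and hypothetical
return fields:

<form id="payment">
  <var name="cardnumber"/>
  <subdialog name="card"
             src="http://components.example.com/creditcard.vxml">
    <filled>
      <!-- copy the sub-dialog's returned value and submit it -->
      <assign name="cardnumber" expr="card.number"/>
      <submit next="/servlet/order" namelist="cardnumber"/>
    </filled>
  </subdialog>
</form>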
16
Speech Grammar ML
  • Specifies the words and patterns of words that a
    speaker-independent recognizer listens for
  • May be specified (see the sketch after this list):
  • Inline as part of a VoiceXML page
  • Referenced and stored separately on Web servers
  • Three variants: XML, ABNF, N-gram
  • Action Tags for semantic processing
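A minimal sketch of the two options; the external grammar
URI and file name are hypothetical, and the inline rule
content is illustrative only:

<!-- Inline grammar in a VoiceXML field -->
<field name="state">
  <grammar root="state">
    <rule id="state">
      <one-of> <item> Oregon </item> <item> Maine </item> </one-of>
    </rule>
  </grammar>
</field>

<!-- The same grammar referenced from a Web server -->
<field name="state">
  <grammar src="http://example.com/grammars/state.grxml"/>
</field>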

17
Three forms of the Grammar ML
  • XML
  • Modeled after Java Speech Grammar Format
  • Mandatory for Dialog ML interpreters
  • Manually specified by developer
  • Augmented BNF syntax (ABNF)
  • Modeled after Java Speech Grammar Format
  • Optional for Dialog ML interpreters
  • May be mapped to and from XML grammars
  • Manually specified by developer
  • N-grams
  • Optional for Dialog ML interpreters
  • Used for larger vocabularies
  • Generated statistically

<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>

public $state = Oregon | Maine;
18
Action Tags
  • Specify what VoiceXML variables to set when
    grammar rules are matched to user input
  • Based upon subset of ECMAScript

$drink = coke | pepsi | coca cola {"coke"};

// medium is default if nothing said
$size = {"medium"} [ small | medium | large | regular {"medium"} ];
19
N-Gram Language Models
  • Likelihood of a given word following certain
    others
  • Used as a linguistic model to identify most
    likely sequence of words that matches the spoken
    input
  • N-Grams are computed automatically from a corpus
    of many inputs
  • The N-Gram Markup Language is used as an interchange
    format for passing statistically derived word and
    phrase models to a dictation ASR engine (a worked
    example of the underlying estimate follows)
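As a reminder of the underlying statistics (a standard
definition, not from the slides), a trigram model estimates
the probability of a word from counts over a corpus:

P(w3 | w1, w2) ≈ count(w1 w2 w3) / count(w1 w2)

For example, if "want to fly" occurs 40 times in the corpus
and "want to" occurs 200 times, the model assigns
P(fly | want, to) ≈ 0.2.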

20
Speech synthesis process
Modeled after Sun's Java Speech Markup Language
(Pipeline) IN → Structure Analysis → Text Normalization →
Text-to-Phoneme Conversion → Prosody Analysis → Waveform
Production → OUT
  • Dr. Jones lives at 175 Park Dr. He weighs 175
    lb. He plays bass in a blues band. He also likes
    to fish; last week he caught a 20 lb. bass.
  • Doctor Jones lives at one seventy-five Park
    Drive. He weighs one hundred and seventy-five
    pounds. He plays base in a blues band. He also likes
    to fish; last week he caught a twenty-pound bass.

21
Speech Synthesis ML
(Pipeline stage: Structure Analysis)
<paragraph>
  <sentence> This is the first sentence. </sentence>
  <sentence> This is the second sentence. </sentence>
</paragraph>
Non-markup behavior: infer structure by automated text analysis
Markup support: paragraph, sentence
22
Speech Synthesis ML
(Pipeline stage: Text Normalization)
Non-markup behavior: automatically identify and convert constructs
Markup support: sayas for dates, times, etc.
Examples:
<sayas sub="World Wide Web Consortium"> W3C </sayas>
<sayas type="number:digits"> 175 </sayas>
23
Speech Synthesis ML
(Pipeline stage: Text-to-Phoneme Conversion)
Non-markup behavior: look up in a pronunciation dictionary
Markup support: phoneme, sayas
  • Phonetic Alphabets
  • International Phonetic Alphabet
  • Worldbet
  • X-SAMPA

International Phonetic Alphabet (IPA) using character entities
Example:
<phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>
24
Speech Synthesis ML
(Pipeline stage: Prosody Analysis)
Examples:
<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>
Prosody element attributes:
  pitch: high, medium, low, default
  contour
  range: high, medium, low, default
  rate: fast, medium, slow, default
  volume: silent, soft, medium, loud, default
Non-markup behavior: automatically generates prosody through
analysis of document structure and sentence syntax
Markup support: emphasis, break, prosody
25
Speech Synthesis ML
(Pipeline stage: Waveform Production)
Examples:
<audio src="laughter.wav"> laughter </audio>
<voice age="child"> Mary had a little lamb </voice>
Attributes:
  gender: male, female, neutral
  age: child, teenager, adult, elder, (integer)
  variant: different, (integer)
  name: default, (voice-name)
Markup support: voice, audio
26
LexiconML - Why?
  • Accurate pronunciations are essential in EVERY
    speech application
  • Platform default lexicons do not give 100%
    coverage of user speech

(Diagram) The Voice Application Developer supplies a
Pronunciation Lexicon shared by ASR and TTS; for example
the word "either" has two pronunciations, /iy th r/ and
/ay th r/.
27
LexiconML - Key Requirements
  • Meets both synthesis and recognition requirements
  • Pronunciations for any language (including tonal)
  • reuse standard alphabets, support for
    suprasegmentals
  • Multiple pronunciations per word (a hypothetical
    entry sketch follows this list)
  • Alternate orthographies
  • Spelling variations: colour and color
  • Alternative writing systems: Japanese Kanji and
    Kana
  • Abbreviations and acronyms, e.g. Dr., BT,
  • Homophones, e.g. read and reed (same sound)
  • Homographs, e.g. read and read (same spelling)
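The lexicon markup itself had not yet been specified at this
point. Purely as an illustration of the requirements above,
an entry for "either" with two pronunciations might look
something like this (element names are hypothetical, and the
pronunciations reuse the informal notation from the previous
slide):

<lexeme>
  <grapheme> either </grapheme>
  <phoneme> iy th r </phoneme>   <!-- "ee-ther" -->
  <phoneme> ay th r </phoneme>   <!-- "eye-ther" -->
</lexeme>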

28
Interaction Style
  • Voice user interfaces needn't be dull
  • Choose prompts to reflect an explicit choice of
    personality
  • Introduce variety in prompts rather than always
    repeating the same thing (see the sketch below)
  • Politeness, helpfulness and a sense of humor
  • Target different groups of users, e.g. Gen Y
  • Allow users to select a personality ("skin")
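VoiceXML supports tapered prompts, so a field can vary what
it says on successive attempts. A minimal sketch; the prompt
wording is illustrative only:

<field name="city">
  <prompt count="1"> Where would you like to fly to? </prompt>
  <prompt count="2"> Please tell me the city you want to fly to,
    for example Boston. </prompt>
  <grammar src="city.gram"/>
</field>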

(Personality Demo)
29
Call Control
(Diagram) The Voice Application Developer supplies VoiceXML
to the Dialog Manager and call-control instructions to the
Call Control component, which together handle the User's call.
(Call control Demo)
30
Call Control Requirements
  • Call management: place outbound call,
    conditionally answer inbound call, outbound fax
  • Call leg management: create, redirect, interact
    while on hold
  • Conference management: create, join, exit
  • Intersession communication: asynchronous events
  • Interpreter context: invoke, terminate

31
Natural Language Semantics ML
(Diagram) The Voice Application Developer supplies a grammar
with semantic tags to the ASR; recognized text flows through
Language Understanding, which produces NL Semantics for
Context Interpretation.
32
Natural Language Semantics ML
  • Represents semantic interpretations of an
    utterance
  • Speech
  • Natural language text
  • Other forms (e.g. handwriting, OCR, DTMF)
  • Used primarily as an interchange format among
    voice browser components
  • Usually generated automatically and not authored
    directly by developers
  • Goal is to use XForms as a data model

33
NL Semantics ML structure
(Structure diagram) Result → Interpretation (attributes:
confidence, grammar, x-model, xmlns) → Meaning, carried as
an xf:model plus an xf:instance whose application-specific
elements are defined by the XForms data model, together with
the incoming data as an Input element (attributes: mode,
timestamp-start, timestamp-end, confidence); input content
may be Text, or the special cases Nomatch and Noinput.
34
What toppings do you have?
<interpretation grammar="http://toppings"
                xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>

35
Richer Natural Language
  • Most current voice apps restrict users to
    keywords or short phrases
  • The application does most of the talking
  • The alternative is to use open grammars with word
    spotting and let the user do the talking
  • Rules for figuring out what the user said and why,
    as the basis for asking the next question

(GM/AskJeeves Demo)
36
Multimodal: Voice + Displays
What is the weather in San Francisco?
  • Say which City you want weather for and see the
    information on your phone
  • Say which bands/CDs you want to buy and confirm
    the choices visually

I want to place an order for Hotshot by Shaggy.
37
Multimodal Interaction
  • Multimodal applications
  • Voice + Display + Keypad + Stylus, etc.
  • User is free to switch between voice interaction
    and use of display/keypad/clicking/handwriting
  • July 2000: published Multimodal Requirements draft
  • Demonstrations of multimodal prototypes at the Paris
    face-to-face meeting of the Voice Browser WG
  • Joint W3C/WAP Forum workshop on Multimodal,
    Hong Kong, September 2000
  • February 2001: W3C publishes Multimodal Request
    for Proposals
  • Plan to set up a Multimodal Working Group later
    this year, assuming we get appropriate
    submission(s)

38
Multimodal Interaction
  • Primary market is mobile wireless:
  • cell phones, personal digital assistants and cars
  • Timescale is driven by deployment of 3G networks
  • Input modes
  • speech, keypads, pointing devices, and electronic
    ink
  • Output modes
  • speech, audio, and bitmapped or character cell
    displays
  • Architecture should allow for both local and
    remote speech processing

39
Some Ideas
W3C is seeking detailed proposals with broad
industry support as the basis for chartering a
multimodal working group
  • Speech-enabling XHTML (and WML) without requiring
    changes to the markup language
  • New ECMAScript Speech Object?
  • Loose coupling of VoiceXML with externally
    defined pages written in XHTML, SMIL, etc.
  • Turn-driven synchronization protocol based on
    SIP?
  • Distributed Speech Processing
  • Reduce load on wireless network and speech
    servers
  • Increase recognition accuracy in presence of
    noise
  • ETSI work on Aurora
  • Using pen-based gestures to constrain ASR (click
    and speak)

40
VoiceXML IP Issues
  • Technical work on VoiceXML 2.0 is proceeding well
  • Publication of VoiceXML 2.0 working draft held up
    over IP issues (although internal version is
    accessible to W3C Members)
  • Related specifications for grammar, speech
    synthesis, natural language semantics, lexicon,
    and call control have been or will shortly be
    published
  • W3C and VoiceXML Forum management are in the
    process of developing a formal Memorandum of
    Understanding
  • W3C is convening a Patent Advisory Group to
    recommend IP Policy for re-chartering the Voice
    Browser Activity
  • Draw inspiration from IETF, ECTF, ETSI and other
    bodies, e.g. require all WG members to license
    essential IP under openly specified RAND terms,
    with operational criteria for effective terms
    expressed as exit criteria for the Candidate
    Recommendation phase; no requirement for advance
    disclosure of IP

41
Discussion?