Title: Speech Recognition and Understanding
1 Speech Recognition and Understanding
- Alex Acero
- Microsoft Research
Thanks to Mazin Rahim (AT&T)
2 A Vision into the 21st Century
3 Milestones in Speech Recognition
(Timeline: 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2003)
- Small Vocabulary, Acoustic Phonetics-based: isolated words (filter-bank analysis, time-normalization, dynamic programming)
- Medium Vocabulary, Template-based: connected words, continuous speech (pattern recognition, LPC analysis, clustering algorithms, level building)
- Large Vocabulary, Statistical-based: connected digits, continuous speech (hidden Markov models, stochastic language modeling)
- Large Vocabulary, Syntax and Semantics: continuous speech, speech understanding (stochastic language understanding, finite-state machines, statistical learning)
- Very Large Vocabulary, Semantics and Multimodal Dialog: spoken dialog, multiple modalities (concatenative synthesis, machine learning, mixed-initiative dialog)
4 Multimodal System Technology Components
(Block diagram) Speech, pen gesture and visual input → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
5 Voice-enabled System Technology Components
(Block diagram) Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
6 Automatic Speech Recognition
- Goal
- Convert a speech signal into a text message
- Accurately and efficiently
- Independent of the device, speaker or the environment
- Applications
- Accessibility
- Eyes-busy hands-busy (automobile, doctors, etc)
- Call Centers for customer care
- Dictation
7 Basic Formulation
- The basic equation of speech recognition is W* = argmax_W P(W|X) = argmax_W P(X|W) P(W)
- X = X1, X2, ..., Xn is the acoustic observation
- W = w1, w2, ..., wm is the word sequence
- P(X|W) is the acoustic model
- P(W) is the language model
8 Speech Recognition Process
(Block diagram) Input speech → Feature Extraction → Pattern Classification (decoding, search) → Confidence Scoring → recognized words, e.g. "Hello World" (0.9) (0.8); knowledge sources: Acoustic Model, Word Lexicon, Language Model
9 Feature Extraction
- Goal
- Extract robust features relevant for ASR
- Method
- Spectral analysis
- Result
- A feature vector every 10ms
- Challenges
- Robustness to environment (office, airport, car)
- Devices (speakerphones, cellphones)
- Speakers (accents, dialect, style, speaking defects)
10 Spectral Analysis
- Female speech (/aa/, pitch of 200 Hz)
- Fourier transform
- 30 ms Hamming window
- x[n]: time signal
- X[k]: Fourier transform
11 Spectrograms
- Short-time Fourier transform
- Pitch and formant structure
12 Feature Extraction Process
- Filtering, noise removal and normalization; quantization
- Preemphasis
- Segmentation (blocking into frames of M samples with shift N)
- Windowing
- Spectral analysis (energy, zero-crossings, pitch, formants)
- Cepstral analysis
- Equalization (bias removal or normalization)
- Temporal derivatives (delta cepstrum, delta-delta cepstrum)
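The pipeline above can be made concrete with a small, numpy-only sketch of an MFCC-style front end. The frame sizes, filter counts and the 0.97 preemphasis constant below are common illustrative choices, not values taken from the slides.

# Minimal MFCC-style front end, following the pipeline on slide 12.
# All function names and parameter values here are illustrative choices.
import numpy as np

def mfcc_like(signal, fs=8000, frame_ms=25, hop_ms=10, n_filt=24, n_ceps=13):
    # Preemphasis: boost high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(sig) - frame) // hop)
    window = np.hamming(frame)
    nfft = 512
    # Triangular mel filterbank.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    edges = imel(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = []
    for t in range(n_frames):
        x = sig[t * hop:t * hop + frame] * window            # blocking + windowing
        spec = np.abs(np.fft.rfft(x, nfft)) ** 2              # spectral analysis
        logmel = np.log(fbank @ spec + 1e-10)                 # filterbank energies
        # Cepstral analysis: DCT-II of the log filterbank energies.
        k = np.arange(n_filt)
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (k + 0.5)) / n_filt)
        feats.append(dct @ logmel)
    return np.array(feats)   # one feature vector every hop_ms (10 ms)

# Temporal derivative (delta) features, appended in practice.
def deltas(feats):
    return np.gradient(feats, axis=0)

Calling mfcc_like(signal) on 8 kHz audio returns roughly one 13-dimensional cepstral vector every 10 ms; deltas() gives the temporal-derivative features the diagram ends with.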
13 Robust Speech Recognition
A mismatch in the speech signal between the
training phase and testing phase results in
performance degradation.
(Diagram) The mismatch can be compensated at the signal level (enhancement), the feature level (normalization), or the model level (adaptation), in both training and testing
14 Noise and Channel Distortion
- Distorted speech y(t) is the clean speech s(t) convolved with the channel h(t), plus additive noise n(t): y(t) = s(t) * h(t) + n(t)
(Figure: Fourier transforms of the clean and distorted signals vs. frequency, with levels labeled 5 dB and 50 dB)
15 Speaker Variations
- Vocal tract length varies from 15 to 20 cm
- Longer vocal tracts => lower frequency content
- Maximum Likelihood Speaker Normalization
- Warp the frequency axis of the signal (see below)
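In the maximum-likelihood formulation usually used for this normalization, the warp factor is chosen to maximize the likelihood of the warped features under the current models. A generic way to write it (standard notation, not necessarily the slide's) is

\hat{\alpha} = \arg\max_{\alpha} P\left(X^{(\alpha)} \mid \lambda, W\right)

where X^(α) is the utterance with its frequency axis warped by α, λ the acoustic model, and W the transcription.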
16 Acoustic Modeling
- Goal
- Map acoustic features into distinct subword units, such as phones, syllables, words, etc.
- Hidden Markov Model (HMM)
- Spectral properties are modeled by a parametric random process
- A collection of HMMs is associated with each subword unit
- HMMs are also assigned for modeling extraneous events
- Advantages
- A powerful statistical method for a wide range of data and conditions
- Highly reliable for recognizing speech
17 Discrete-Time Markov Process
- The Dow Jones Industrial Average
- A discrete-time, first-order Markov chain
18 Hidden Markov Models
19 Example
- I observe (up, up, down, up, unchanged, up)
- Is it a bull market? A bear market?
- P(bull) = 0.7·0.7·0.1·0.7·0.2·0.7·0.5·(0.6)^5 = 1.867×10^-4
- P(bear) = 0.1·0.1·0.6·0.1·0.3·0.1·0.2·(0.3)^5 = 8.748×10^-9
- P(steady) = 0.3·0.3·0.3·0.3·0.4·0.3·0.3·(0.5)^5 = 9.1125×10^-6
- It is about 20 times more likely that we are in a bull market than in a steady market!
- How about P(bull, bull, bear, bull, steady, bull)?
- (0.7·0.7·0.6·0.7·0.4·0.7)·(0.5·0.6·0.2·0.5·0.2·0.4) = 1.382976×10^-4
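The arithmetic above can be checked with a few lines of code. The initial, transition and output probabilities below are the values implied by the products on this slide; the variable names are illustrative.

# Reproduces the arithmetic on slide 19.
import numpy as np

states = ["bull", "bear", "steady"]
obs_ids = {"up": 0, "down": 1, "unchanged": 2}
pi = np.array([0.5, 0.2, 0.3])                    # initial state probabilities
A = np.array([[0.6, 0.2, 0.2],                    # transition probabilities
              [0.5, 0.3, 0.2],
              [0.4, 0.1, 0.5]])
B = np.array([[0.7, 0.1, 0.2],                    # P(up/down/unchanged | state)
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
X = [obs_ids[o] for o in ("up", "up", "down", "up", "unchanged", "up")]

# Probability of the observations with the market staying in one state.
for s, name in enumerate(states):
    p = pi[s] * A[s, s] ** (len(X) - 1) * np.prod(B[s, X])
    print(f"P(X, all-{name}) = {p:.4g}")          # 1.867e-04, 8.748e-09, 9.113e-06

# Probability of a specific state sequence together with the observations.
seq = [0, 0, 1, 0, 2, 0]                          # bull,bull,bear,bull,steady,bull
p = pi[seq[0]] * B[seq[0], X[0]]
for t in range(1, len(X)):
    p *= A[seq[t - 1], seq[t]] * B[seq[t], X[t]]
print(f"P(X, bull,bull,bear,bull,steady,bull) = {p:.4g}")   # 1.383e-04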
20 Basic Problems in HMMs
- Given acoustic observation X and model λ
- Evaluation: compute P(X | λ)
- Decoding: choose the optimal state sequence
- Re-estimation: adjust λ to maximize P(X | λ)
21 Evaluation: Forward-Backward Algorithm
Forward
Backward
22Decoding Viterbi Algorithm
Step 1 Initialization D1(i)pibi(x1),
B1(i)0 j1,N Step 2
Iterations for t2,,T
for j1,,N Vt(j)minVt-1(i)aij
bj(xt) Bt(j)argminVt-1(i)aij Step 3
Backtracking The optimal score is VT
max Vt(i) Final state is sT argmax
Vt(i) Optimal path is (s1,s2,,sT)
where stBt1(st1) tT-1,1
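A compact sketch of the same recursion in code, using log probabilities for numerical stability (an implementation choice, not something specified on the slide):

# Viterbi decoding for a discrete-output HMM (pi, A, B as in the Dow Jones example).
import numpy as np

def viterbi(pi, A, B, X):
    """Return the most likely state sequence and its log probability."""
    N, T = len(pi), len(X)
    logV = np.full((T, N), -np.inf)       # best log score ending in state j at time t
    back = np.zeros((T, N), dtype=int)    # backpointers B_t(j)
    logV[0] = np.log(pi) + np.log(B[:, X[0]])             # initialization
    for t in range(1, T):                                  # iteration
        for j in range(N):
            cand = logV[t - 1] + np.log(A[:, j])
            back[t, j] = np.argmax(cand)
            logV[t, j] = cand[back[t, j]] + np.log(B[j, X[t]])
    path = [int(np.argmax(logV[T - 1]))]                   # backtracking
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, float(np.max(logV[T - 1]))

# e.g. viterbi(pi, A, B, X) on the slide-19 model returns the best
# bull/bear/steady sequence for (up, up, down, up, unchanged, up).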
23 Re-estimation: Baum-Welch Algorithm
- Find λ = (A, B, π) that maximizes P(X | λ)
- No closed-form solution => EM algorithm
- Start with an old parameter value λ
- Obtain a new parameter value that maximizes the EM auxiliary (Q) function
- EM is guaranteed not to decrease the likelihood
24 Continuous Densities
- The output distribution is a mixture of Gaussians
- Posterior probabilities γt(j, k) of state j and mixture k at time t, and ξt(i, j) of state i at time t-1 and state j at time t
- Re-estimation formulae (written out below)
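The equations referred to here are the standard ones for continuous-density HMMs. In generic notation, the output distribution and, as one example, the mean re-estimation formula are

b_j(\mathbf{x}_t) = \sum_{k=1}^{K} c_{jk}\, \mathcal{N}\!\left(\mathbf{x}_t;\, \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}\right),
\qquad
\hat{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{x}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

with γt(j, k) the posterior probability of being in state j and mixture component k at time t.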
25 EM Training Procedure
(Diagram) Input speech database + old HMM model → estimate posteriors γt(j,k), ξt(i,j) → maximize parameters aij, cjk, μjk, Σjk → updated HMM model (iterate)
26 Design Issues
- Continuous vs. discrete HMM
- Whole-word vs. subword (phone) units
- Number of states, number of Gaussians
- Ergodic vs. Bakis (left-to-right) topology
- Context-dependent vs. context-independent
27 Training with continuous speech
- No segmentation is needed
- Composed HMM
28 Context Variability in Speech
- At the word/sentence level
- "Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda."
- At the phone level
- /iy/ in the words "peat" and "wheel"
- Triphones capture
- Coarticulation
- Phonetic context
29 Context-dependent models
- The triphone IY(P, CH) captures coarticulation and phonetic context
- Stress: Italy vs. Italian
30 Clustering similar triphones
- /iy/ with two different left contexts /r/ and /w/
- Similar effects on /iy/
- Cluster those triphones together
31 Clustering with decision trees
32 Other variability in Speech
- Style
- discrete vs. continuous speech,
- read vs spontaneous
- slow vs fast
- Speaker
- speaker independent
- speaker dependent
- speaker adapted
- Environment
- additive noise (cocktail party effect)
- telephone channel
33 Acoustic Adaptation
- Model adaptation is needed if
- Test conditions are mismatched
- We desire to tune to a given speaker
- Maximum a Posteriori (MAP)
- Adds a prior for the parameters λ
- Maximum Likelihood Linear Regression (MLLR)
- Transforms mean vectors (see below)
- Can have more than one MLLR transform
- Speaker Adaptive Training (SAT) applies MLLR to the training data as well
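In its usual form (generic notation, not taken from the slide), the MLLR mean transform is

\hat{\boldsymbol{\mu}} = \mathbf{A}\, \boldsymbol{\mu} + \mathbf{b}

where A and b are estimated to maximize the likelihood of the adaptation data, and separate transforms can be tied to different regression classes of Gaussians.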
34 MAP vs. MLLR
Speaker-dependent system is trained with 1000
sentences
35 Discriminative Training
- Maximum likelihood training
- Parameters are estimated from the true classes
- Discriminative training
- Maximize discrimination between classes
- Discriminative feature transformation
- Maximize the ratio of inter-class to intra-class difference
- Done at the state level
- Linear Discriminant Analysis (LDA)
- Discriminative model training
- Maximize the posterior probability
- The correct class and competing classes are used
- Maximum Mutual Information (MMI), Minimum Classification Error (MCE), Minimum Phone Error (MPE)
36 Word Lexicon
- Goal
- Map legal phone sequences into words, according to phonotactic rules
- David: /d/ /ey/ /v/ /ih/ /d/
- Multiple pronunciations
- Several words may have multiple pronunciations
- Data: /d/ /ae/ /t/ /ax/
- Data: /d/ /ey/ /t/ /ax/
- Challenges
- How do you generate a word lexicon automatically?
- How do you add new variant dialects and word pronunciations?
37 The Lexicon
- An entry per word (> 100K words for dictation)
- Multiple pronunciations (e.g., tomato)
- Done by hand or with letter-to-sound (LTS) rules
- LTS rules can be automatically trained with decision trees (CART)
- Less than 8% errors, but proper nouns are hard!
38 Language Model
- Goal: model acceptable spoken phrases, constrained by task syntax
- Rule-based: deterministic, knowledge-driven grammars
- Statistical: compute estimates of word probabilities (N-gram, class-based, CFG)
Example: "flying from <city> to <city> on <date>" → "flying from Newark to Boston tomorrow" (with arc probabilities such as 0.4 and 0.6)
39 Formal grammars
40 Chomsky Grammar Hierarchy
41 Ngrams
42 Understanding Bigrams
- Training data
- "John read her book"
- "I read a different book"
- "John read a book by Mark"
- But we have a problem here (see the sketch below)
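A small sketch of maximum-likelihood bigram estimation on exactly these three sentences makes the problem explicit; the sentence-boundary markers <s> and </s> are the usual convention, not something shown on the slide.

# Maximum-likelihood bigram estimates from the slide's three sentences.
from collections import Counter

corpus = ["john read her book",
          "i read a different book",
          "john read a book by mark"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words[:-1])                 # history counts
    bigrams.update(zip(words[:-1], words[1:]))  # bigram counts

def p(w, prev):
    # P(w | prev) = count(prev, w) / count(prev); zero if the bigram is unseen.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p("read", "john"))   # 1.0  (both occurrences of "john" are followed by "read")
print(p("book", "a"))      # 0.5
print(p("read", "mark"))   # 0.0  -> unseen bigram

Any sentence containing an unseen bigram gets probability zero, which is the problem the next slide addresses with smoothing.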
43 Ngram Smoothing
- Data sparseness: in millions of words, more than 50% of trigrams occur only once
- We can't assign p(wi | wi-1, wi-2) = 0
- Solution: assign a non-zero probability to each unseen ngram by lowering the probability mass of seen ngrams
44 Perplexity
- The cross-entropy of a language model on a word sequence W, and its perplexity, are defined below
- Perplexity measures the complexity of a language model (the geometric mean of the branching factor)
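In standard notation, for a test word sequence W = w1 w2 ... wN:

H(W) = -\frac{1}{N} \log_2 P(w_1 w_2 \cdots w_N), \qquad PP(W) = 2^{H(W)}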
45 Perplexity
- The digit recognition task (TIDIGITS) has 10 words, PP = 10 and a 0.2% error rate
- The Airline Travel Information System (ATIS) has 2000 words, PP = 20 and a 2.5% error rate
- The Wall Street Journal task has 5000 words, PP = 130 with a bigram and a 5% error rate
- In general, lower perplexity => lower error rate, but perplexity does not take acoustic confusability into account: the E-set (B, C, D, E, G, P, T) has PP = 7 and about a 5% error rate
46 Ngram Smoothing
- The deleted interpolation algorithm estimates weights λ that maximize the probability of a held-out data set (see below)
- We can also map all out-of-vocabulary words to the unknown word
- Other backoff and smoothing algorithms are possible: Katz, Kneser-Ney, Good-Turing, class ngrams
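For a trigram model, deleted interpolation is usually written as

P_{DI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\, P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, P_{ML}(w_i \mid w_{i-1}) + \lambda_1\, P_{ML}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1

with the λ's estimated (typically by EM) on the held-out set.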
47 Adaptive Language Models
- Cache Language Models
- Topic Language Models
- Maximum Entropy Language Models
48 Bigram Perplexity
- Trained on 500 million words and tested on
Encarta Encyclopedia
49 OOV Rate
- OOV rate measured on the Encarta Encyclopedia; trained on 500 million words
50 WSJ Results
- Perplexity and word error rate on the 60,000-word Wall Street Journal continuous speech recognition task
- Unigrams, bigrams and trigrams were trained from 260 million words
- Smoothing mechanism: Kneser-Ney
51 Pattern Classification
- Goal
- Find the optimal word sequence
- Combine information (probabilities) from
- The acoustic model
- The word lexicon
- The language model
- Method
- The decoder searches through all possible recognition choices using a Viterbi decoding algorithm
- Challenge
- Efficient search through a large network space is computationally expensive for large-vocabulary ASR
52 The Problem of Large Vocabulary ASR
- The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping: features → HMM states → HMM units → phones → words → sentences
- For the WSJ 20K-word vocabulary, this results in a network of 10^22 bytes!
- State-of-the-art methods include fast match, multi-pass decoding, A* stack decoding, and finite state transducers; all provide tremendous speed-up by searching through the network and finding the best path that maximizes the likelihood function
53 Weighted Finite State Transducers (WFST)
- A unified mathematical framework for ASR
- Efficiency in time and space
- A cascade of WFSTs (state→HMM, HMM→phone, phone→word, word→phrase) is combined by composition and optimization into a single search network
- The WFST can compile the network down to 10^8 states, 14 orders of magnitude more efficient
54 Word Pronunciation WFST
(Example transducer for the word "data", with arcs d:ε/1, then ey:ε/.4 or ae:ε/.6, then t:ε/.2 or dx:ε/.8, then ax:data/1)
55 Confidence Scoring
- Goal: identify possible recognition errors and out-of-vocabulary events; potentially improves the performance of ASR, SLU and DM
- Method: a confidence score, based on a hypothesis likelihood ratio test, is associated with each recognized word
- Label: "credit please"; Recognized: "credit fees"; Confidence: (0.9) (0.3)
- Challenges: rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech
56 Speech Recognition Process
(Block diagram, as on slide 8) Input speech → Feature Extraction → Pattern Classification (decoding, search) → Confidence Scoring → "Hello World" (0.9) (0.8); knowledge sources: Acoustic Model, Word Lexicon, Language Model
57 How to evaluate performance?
- Dictation applications: insertions, substitutions and deletions
- Command-and-control: false rejection and false acceptance => ROC curves
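The dictation metric combines the three error types into the word error rate:

\mathrm{WER} = \frac{S + D + I}{N} \times 100\%

where S, D and I are the numbers of substitutions, deletions and insertions against a reference transcription of N words.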
58 ASR Performance: The State of the Art
59 Growth in Effective Recognition Vocabulary Size
60 Improvement in Word Accuracy
10-20% relative reduction in word error rate per year.
Switchboard/Call Home: vocabulary 40,000 words, perplexity 85
61 Human Speech Recognition vs. ASR
(Chart comparing machine and human error rates, with ratio markers x1, x10 and x100 and a region labeled "Machines Outperform Humans")
62 Challenges in ASR
- System performance
- Accuracy
- Efficiency (speed, memory)
- Robustness
- Operational performance
- End-point detection
- User barge-in
- Utterance rejection
- Confidence scoring
- Machines are 10-100 times less accurate than humans
63 Large-Vocabulary ASR Demo
64 Multimedia Customer Care
Courtesy of AT&T
65 Voice-enabled System Technology Components
(Block diagram) Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
66 Spoken Language Understanding (SLU)
- Goal: extract and interpret the meaning of the recognized speech so as to identify the user's request
- Accurate understanding can often be achieved without correctly recognizing every word
- SLU makes it possible to offer natural-language services where the customer can speak openly without learning a specific set of terms
- Methodology: exploit the task grammar (syntax) and task semantics to restrict the range of meanings associated with the recognized word string
- Applications: automation of complex operator-based tasks, e.g., customer care, customer help lines, etc.
67 SLU Formulation
- Let W be a sequence of words and C be its underlying meaning (conceptual structure); then apply Bayes' rule (see below)
- Finding the best conceptual structure can be done by parsing and ranking, using a combination of acoustic, linguistic and semantic scores
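Written out in the usual way, the Bayes-rule formulation referred to above is

\hat{C} = \arg\max_{C} P(C \mid W) = \arg\max_{C} P(W \mid C)\, P(C)

and, when the acoustics X are folded in, the parser ranks candidate structures by combining P(X|W), P(W|C) and P(C), i.e. the acoustic, linguistic and semantic scores.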
68 Knowledge Sources for Speech Understanding
- Acoustic/phonetic (acoustic model): the relationship between speech sounds and English phonemes
- Phonotactic (word lexicon): rules for phoneme sequences and pronunciation
- Syntactic (language model): the structure of words and phrases in a sentence
- Semantic (understanding model): relationships and meanings among words
- Pragmatic (dialog manager): discourse, interaction history, world knowledge
69SLU Components
From ASR (text, lattices, n-best, history)
Text Normalization
Morphology, Synonyms
Database Access
Extracting named entities, semantic concepts,
syntactic tree
Parsing/ Decoding
Interpretation
Slot filling, reasoning, task knowledge
representation
To DM (concepts, entities, parse tree)
70 Text Normalization
- Goal: reduce language variation (lexical analysis)
- Morphology
- Decomposing words into their minimal units of meaning (or grammatical analysis)
- Synonyms
- Finding words that mean the same (hello, hi, how do you do)
- Equalization
- Disfluencies, non-alphanumeric characters, capitals, non-white-space characters, etc.
71 Parsing/Decoding
- Goal: map the textual representation into semantic concepts using knowledge rules and/or data
- Entity extraction (matcher)
- Finite state machines (FSM)
- Context-free grammars (CFG)
- Semantic parsing
- Tree-structured meaning representation allowing arbitrary nesting of concepts
- Decoding
- Segment an utterance into phrases, each representing a concept (e.g., using HMMs)
- Classification
- Categorizing an utterance into one or more semantic concepts
72 Interpretation
- Goal: interpret the user's utterance in the context of the dialog
- Ellipsis and anaphora
- "What about for New York?"
- History mechanism
- Communicated messages are differential
- Removing ambiguities
- Semantic frames (slots)
- Associating entities and relations (attribute/value pairs)
- Rule-based
- Database look-up
- Retrieving or checking entities
73 DARPA Communicator
- DARPA-sponsored research and development of mixed-initiative dialogue systems
- Travel task involving airline, hotel and car information and reservations
- "Yeah I uh I would like to go from New York to Boston tomorrow night with United"
- SLU output (concept decoding), XML schema:
<itinerary>
  <origin> <city></city> <state></state> </origin>
  <destination> <city></city> <state></state> </destination>
  <date></date>
  <time></time>
  <airline></airline>
</itinerary>
- Filled frame: Topic: Itinerary; Origin: New York; Destination: Boston; Day of the week: Sunday; Date: May 25th, 2002; Time: > 6pm; Airline: United
74 Semantic CFG
<rule name="itinerary">
  <o>Show me flights</o> <ruleref name="origin"/> <ruleref name="destination"/>
</rule>
<rule name="origin">
  from <ruleref name="city"/>
</rule>
<rule name="destination">
  to <ruleref name="city"/>
</rule>
<rule name="city">
  New York | San Francisco | Boston
</rule>
75 AT&T "How May I Help You?" - Customer Care Services
- The user responds with unconstrained, fluent speech
- The system recognizes salient phrases, determines the meaning of the user's speech, and routes the call
- Example: "There is a number on my bill I didn't make" → unrecognized number
- Prompt: "How May I Help You?"; routes include account balance, unrecognized number, billing credit, combined bill, and agent
76 Voice-enabled System Technology Components
(Block diagram) Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
77 Dialog Management (DM)
- Goal
- Manage elaborate exchanges with the user
- Provide automated access to information
- Implementation
- A mathematical model programmed from rules and/or data
78 Computational Models for DM
- Structure-based approach
- Assumes that there is a regular structure in dialog that can be modeled as a state transition network or dialog grammars
- Dialog strategies need to be predefined, which limits flexibility
- Several real-time deployed systems exist today which have inference engines and knowledge representation
- Plan-based approach
- Considers communication as acting; dialog acts and actions are oriented toward goal achievement
- Motivated by human/human interaction, in which humans generally have goals and plans when interacting with others
- Aims to account for general models and theories of discourse
79 DM Technology Components
(Diagram) The DM sits between the ASR/SLU and the SLG/TTS; its components are context interpretation, dialog strategies (modules), and backend (database) access
80 Context Interpretation
(Diagram) User input + current context, State(t) → context interpretation → dialog strategies → backend → action and new context, State(t+1)
- A formal representation of the context history is necessary so that the DM can interpret a user's utterance given previously exchanged utterances and identify a new action
- Natural communication is a differential process
81 Dialog Strategies
- Completion (continuation): elicits missing input information from the user
- Constraining (disambiguation): reduces the scope of the request when multiple pieces of information have been retrieved
- Relaxation: increases the scope of the request when no information has been retrieved
- Confirmation: verifies that the system understood correctly; the strategy may differ depending on the SLU confidence measure
- Reprompting: used when the system expected input but did not receive any, or did not understand what it received
- Context help: provides the user with help during periods of misunderstanding by either the system or the user
- Greeting/closing: maintains social protocol at the beginning and end of an interaction
- Mixed-initiative: allows users to manage the dialog
82 Mixed-initiative Dialog
- Who manages the dialog, the system or the user?
- System initiative: "Please say collect, calling card, third number."
- User initiative: "How can I help you?"
83 Example: Mixed-Initiative Dialog
- System initiative (long dialogs, but easier to design): System: "Please say just your departure city." User: "Chicago." System: "Please say just your arrival city." User: "Newark."
- Mixed initiative: System: "Please say your departure city." User: "I need to travel from Chicago to Newark tomorrow."
- User initiative (shorter dialogs and a better user experience, but more difficult to design): System: "How may I help you?" User: "I need to travel from Chicago to Newark tomorrow."
84 Dialog Evaluation
- Why?
- Identify problematic situations (task failure)
- Minimize user hang-ups and routing to an operator
- Improve the user experience
- Perform data analysis and monitoring
- Conduct data mining for sales and marketing
- How?
- Log exchange features from ASR, SLU, DM, ANI
- Define and automatically optimize a task objective function
- Compute service statistics (instant and delayed)
85 Optimizing Task Objective Function
- User satisfaction is the ultimate objective of a dialog system
- Task completion rate
- Efficiency of the dialog interaction
- Usability of the system
- Perceived intelligibility of the system
- Quality of the audio output
- Perplexity and quality of the response generation
- Machine learning techniques can be applied to predict user satisfaction (e.g., the PARADISE framework)
86 Example of Plan-based Dialog Systems
TRIPS System Architecture
87 Example of Structure-based Dialog Systems
Help Desk
- AT&T Labs Natural Voices™
- A finite state engine is used to control the action of the interpreter output
- The system routes customers to appropriate agents or departments
- Provides information about products, services, costs and the business
- Shows demonstrations of various voice fonts and languages
- Troubleshoots problems and concerns raised by customers (in progress)
88 Voice-enabled System Technology Components
(Block diagram) Speech → ASR (Automatic Speech Recognition) → words spoken → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action/Meaning → SLG (Spoken Language Generation) → words to be synthesized → TTS (Text-to-Speech Synthesis) → Speech
89 A Good User Interface
- Makes the application easy to use
- Makes the application robust to the kinds of confusion that arise in human-machine communication by voice
- Keeps the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
- A great UI cannot save a system with poor ASR and NLU, but the UI can make or break a system, even with excellent ASR and NLU
- Effective UI design is based on a set of elementary principles: common widgets, sequenced screen presentations, simple error-trap dialogs, a user manual
90 Correction UI
91 Multimodal System Technology Components
(Block diagram) Speech, pen gesture and visual input → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
92 Multimodal Experience
- Access to information anytime and anywhere, using any device
- The most appropriate UI mode, or combination of modes, depends on the device, the task, the environment, and the user's abilities and preferences
93 Multimodal Architecture
(Diagram) x (multimodal inputs) → Parsing, using a semantic model of what the user might do or say, P(F | x, Sn-1) → F (surface semantics) → Understanding, using a language model of what that means, P(Sn | F, Sn-1) → Sn (discourse semantics) → Application Logic → Rendering, using a behavior model of what to show or say to the user, P(A | Sn) → A (multimedia outputs)
94 MIPad
- Multimodal Interactive Pad
- Usability studies show double throughput for
English - Speech is mostly useful in cases with lots of
alternatives
95 MiPad video
96 Speech-enabled MapPoint
97 Other Research Areas not covered
- Speaker recognition (verification and identification)
- Language identification
- Human/human and human/machine translation
- Multimedia and document retrieval
- Speech coding
- Microphone array processing
- Multilingual speech recognition
98 Application Development
99 Speech APIs
- Open-standard APIs provide separation of the ASR/TTS engines and platform from the application layer
- The application is engine-independent and contains the SLU, DM, content server and host interface
100 VoiceXML Architecture
(Diagram) A web server delivers multimedia, HTML and scripts to conventional clients, and VoiceXML documents, audio and grammars to a VoiceXML gateway running a voice browser (sample spoken output: "The temperature is ...")
101 A VoiceXML example
<?xml version="1.0"?>
<vxml application="tutorial.vxml" version="1.0">
  <form id="getPhoneNumber">
    <field name="PhoneNumber">
      <prompt>What's your phone number?</prompt>
      <grammar src="../grammars/phone.gram" type="application/x-jsgf" />
      <help>Please say your ten digit phone number.</help>
      <if cond="PhoneNumber &lt; 1000000"> ... </if>
    </field>
  </form>
</vxml>
102 Speech Application Language Tags (SALT)
- Adds 4 tags to HTML/XHTML/cHTML/WML: <listen>, <prompt>, <dtmf>, <smex>
<html>
  <form action="nextpage.html">
    <input type="text" id="txtBoxCity" />
  </form>
  <listen id="reco1">
    <grammar src="cities.gram" />
    <bind targetElement="txtBoxCity" value="city" />
  </listen>
</html>
103 The Speech Advantage!
- Reduce costs
- Reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
- New revenue opportunities
- 24x7 high-quality customer care automation
- Access to information without a keyboard or touch-tone pad
- Customer retention
- "Stickiness" of services
- Add personalization
104 Just the Beginning
105 Market Opportunities
- Consumer communication
- Voice portals
- Desktop dictation
- Telematic applications
- Disabilities
- Call center automation
- Help lines
- Customer care
- Voice-assisted e-commerce
- B2B
- Enterprise communication
- Unified messaging
- Enterprise sales
106 A look into the future...
- 1990 (Directory Assistance, VRCP): keyword spotting, handcrafted grammars, no dialogue; constrained speech, minimal data collection, manual design
- 1995 (airline reservation, banking): medium-size ASR, handcrafted grammars, system initiative; constrained speech, moderate data collection, some automation
- 2000 (call centers, e-commerce): large-size ASR, limited NLU, mixed initiative; spontaneous speech, extensive data collection, semi-automation
- 2005 (multimodal, multilingual help desks, e-commerce; e.g., MATCH: Multimodal Access To City Help): unlimited ASR, deeper NLU, adaptive systems; spontaneous speech/pen, fully automated systems
107 Spoken Dialog Interfaces vs. Touch-Tone IVR?
108 Reading Materials
- Spoken Language Processing, X. Huang, A. Acero and H.-W. Hon, Prentice Hall, 2001.
- Fundamentals of Speech Recognition, L. Rabiner and B.-H. Juang, Prentice Hall, 1993.
- Automatic Speech and Speaker Recognition, C.-H. Lee, F. Soong and K. Paliwal (eds.), Kluwer Academic Publishers, 1996.
- Spoken Dialogues with Computers, R. De Mori (ed.), Academic Press, 1998.