Title: Speech Recognition and Understanding
1 Speech Recognition and Understanding
- Alex Acero
- Microsoft Research
Thanks to Mazin Rahim (AT&T)
2 A Vision into the 21st Century
3 Milestones in Speech Recognition
(Timeline: 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2003)
- Small Vocabulary, Acoustic Phonetics-based: isolated words (filter-bank analysis, time-normalization, dynamic programming)
- Medium Vocabulary, Template-based: connected words, continuous speech (pattern recognition, LPC analysis, clustering algorithms, level building)
- Large Vocabulary, Statistical-based: connected digits, continuous speech (hidden Markov models, stochastic language modeling)
- Large Vocabulary, Syntax and Semantics: continuous speech, speech understanding (stochastic language understanding, finite-state machines, statistical learning)
- Very Large Vocabulary, Semantics and Multimodal Dialog: spoken dialog, multiple modalities (concatenative synthesis, machine learning, mixed-initiative dialog)
4 Multimodal System Technology Components
(Block diagram) Speech, pen gesture and visual input → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
5 Voice-enabled System Technology Components
(Block diagram) Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
6 Automatic Speech Recognition
- Goal
- Convert a speech signal into a text message
- Accurately and efficiently
- Independent of the device, speaker or the environment
- Applications
- Accessibility
- Eyes-busy hands-busy (automobile, doctors, etc)
- Call Centers for customer care
- Dictation
7 Basic Formulation
- The basic equation of speech recognition is W* = argmax_W P(W|X) = argmax_W P(X|W) P(W)
- X = X1, X2, ..., Xn is the acoustic observation
- W = w1, w2, ..., wm is the word sequence
- P(X|W) is the acoustic model
- P(W) is the language model
8 Speech Recognition Process
(Block diagram) Input speech → Feature Extraction → Pattern Classification (decoding, search) → Confidence Scoring → recognized words, e.g. "Hello World" (0.9) (0.8); knowledge sources: Acoustic Model, Word Lexicon, Language Model
9 Feature Extraction
- Goal
- Extract robust features relevant for ASR
- Method
- Spectral analysis
- Result
- A feature vector every 10ms
- Challenges
- Robustness to environment (office, airport, car)
- Devices (speakerphones, cellphones)
- Speakers (accents, dialect, style, speaking defects)
10 Spectral Analysis
- Female speech (/aa/, pitch of 200 Hz)
- Fourier transform
- 30 ms Hamming window
- x[n]: time signal
- X[k]: Fourier transform
11 Spectrograms
- Short-time Fourier transform
- Pitch and formant structure
12 Feature Extraction Process
- Filtering, noise removal and normalization; quantization
- Preemphasis
- Segmentation (blocking into frames of M samples with shift N)
- Windowing
- Spectral analysis (energy, zero-crossings, pitch, formants)
- Cepstral analysis
- Equalization (bias removal or normalization)
- Temporal derivatives (delta cepstrum, delta-delta cepstrum)
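The pipeline above can be made concrete with a small, numpy-only sketch of an MFCC-style front end. The frame sizes, filter counts and the 0.97 preemphasis constant below are common illustrative choices, not values taken from the slides.

# Minimal MFCC-style front end, following the pipeline on slide 12.
# All function names and parameter values here are illustrative choices.
import numpy as np

def mfcc_like(signal, fs=8000, frame_ms=25, hop_ms=10, n_filt=24, n_ceps=13):
    # Preemphasis: boost high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(sig) - frame) // hop)
    window = np.hamming(frame)
    nfft = 512
    # Triangular mel filterbank.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    edges = imel(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = []
    for t in range(n_frames):
        x = sig[t * hop:t * hop + frame] * window            # blocking + windowing
        spec = np.abs(np.fft.rfft(x, nfft)) ** 2              # spectral analysis
        logmel = np.log(fbank @ spec + 1e-10)                 # filterbank energies
        # Cepstral analysis: DCT-II of the log filterbank energies.
        k = np.arange(n_filt)
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (k + 0.5)) / n_filt)
        feats.append(dct @ logmel)
    return np.array(feats)   # one feature vector every hop_ms (10 ms)

# Temporal derivative (delta) features, appended in practice.
def deltas(feats):
    return np.gradient(feats, axis=0)

Calling mfcc_like(signal) on 8 kHz audio returns roughly one 13-dimensional cepstral vector every 10 ms; deltas() gives the temporal-derivative features the diagram ends with.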
13 Robust Speech Recognition
A mismatch in the speech signal between the
training phase and testing phase results in
performance degradation.
(Diagram) The mismatch can be compensated at the signal level (enhancement), the feature level (normalization), or the model level (adaptation), in both training and testing
14 Noise and Channel Distortion
- Distorted speech y(t) is the clean speech s(t) convolved with the channel h(t), plus additive noise n(t): y(t) = s(t) * h(t) + n(t)
(Figure: Fourier transforms of the clean and distorted signals vs. frequency, with levels labeled 5 dB and 50 dB)
15 Speaker Variations
- Vocal tract length varies from 15 to 20 cm
- Longer vocal tracts => lower frequency content
- Maximum Likelihood Speaker Normalization
- Warp the frequency axis of the signal (see below)
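In the maximum-likelihood formulation usually used for this normalization, the warp factor is chosen to maximize the likelihood of the warped features under the current models. A generic way to write it (standard notation, not necessarily the slide's) is

\hat{\alpha} = \arg\max_{\alpha} P\left(X^{(\alpha)} \mid \lambda, W\right)

where X^(α) is the utterance with its frequency axis warped by α, λ the acoustic model, and W the transcription.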
16 Acoustic Modeling
- Goal
- Map acoustic features into distinct subword units, such as phones, syllables, words, etc.
- Hidden Markov Model (HMM)
- Spectral properties are modeled by a parametric random process
- A collection of HMMs is associated with each subword unit
- HMMs are also assigned for modeling extraneous events
- Advantages
- A powerful statistical method for a wide range of data and conditions
- Highly reliable for recognizing speech
17 Discrete-Time Markov Process
- The Dow Jones Industrial Average
- A discrete-time, first-order Markov chain
18 Hidden Markov Models
19 Example
- I observe (up, up, down, up, unchanged, up)
- Is it a bull market? A bear market?
- P(bull) = 0.7·0.7·0.1·0.7·0.2·0.7·0.5·(0.6)^5 = 1.867×10^-4
- P(bear) = 0.1·0.1·0.6·0.1·0.3·0.1·0.2·(0.3)^5 = 8.748×10^-9
- P(steady) = 0.3·0.3·0.3·0.3·0.4·0.3·0.3·(0.5)^5 = 9.1125×10^-6
- It is about 20 times more likely that we are in a bull market than in a steady market!
- How about P(bull, bull, bear, bull, steady, bull)?
- (0.7·0.7·0.6·0.7·0.4·0.7)·(0.5·0.6·0.2·0.5·0.2·0.4) = 1.382976×10^-4
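The arithmetic above can be checked with a few lines of code. The initial, transition and output probabilities below are the values implied by the products on this slide; the variable names are illustrative.

# Reproduces the arithmetic on slide 19.
import numpy as np

states = ["bull", "bear", "steady"]
obs_ids = {"up": 0, "down": 1, "unchanged": 2}
pi = np.array([0.5, 0.2, 0.3])                    # initial state probabilities
A = np.array([[0.6, 0.2, 0.2],                    # transition probabilities
              [0.5, 0.3, 0.2],
              [0.4, 0.1, 0.5]])
B = np.array([[0.7, 0.1, 0.2],                    # P(up/down/unchanged | state)
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
X = [obs_ids[o] for o in ("up", "up", "down", "up", "unchanged", "up")]

# Probability of the observations with the market staying in one state.
for s, name in enumerate(states):
    p = pi[s] * A[s, s] ** (len(X) - 1) * np.prod(B[s, X])
    print(f"P(X, all-{name}) = {p:.4g}")          # 1.867e-04, 8.748e-09, 9.113e-06

# Probability of a specific state sequence together with the observations.
seq = [0, 0, 1, 0, 2, 0]                          # bull,bull,bear,bull,steady,bull
p = pi[seq[0]] * B[seq[0], X[0]]
for t in range(1, len(X)):
    p *= A[seq[t - 1], seq[t]] * B[seq[t], X[t]]
print(f"P(X, bull,bull,bear,bull,steady,bull) = {p:.4g}")   # 1.383e-04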
20 Basic Problems in HMMs
- Given acoustic observation X and model λ
- Evaluation: compute P(X | λ)
- Decoding: choose the optimal state sequence
- Re-estimation: adjust λ to maximize P(X | λ)
21 Evaluation: Forward-Backward Algorithm
Forward
Backward
22Decoding Viterbi Algorithm
Step 1 Initialization D1(i)pibi(x1),
B1(i)0 j1,N Step 2
Iterations for t2,,T
for j1,,N Vt(j)minVt-1(i)aij
bj(xt) Bt(j)argminVt-1(i)aij Step 3
Backtracking The optimal score is VT
max Vt(i) Final state is sT argmax
Vt(i) Optimal path is (s1,s2,,sT)
where stBt1(st1) tT-1,1
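A compact sketch of the same recursion in code, using log probabilities for numerical stability (an implementation choice, not something specified on the slide):

# Viterbi decoding for a discrete-output HMM (pi, A, B as in the Dow Jones example).
import numpy as np

def viterbi(pi, A, B, X):
    """Return the most likely state sequence and its log probability."""
    N, T = len(pi), len(X)
    logV = np.full((T, N), -np.inf)       # best log score ending in state j at time t
    back = np.zeros((T, N), dtype=int)    # backpointers B_t(j)
    logV[0] = np.log(pi) + np.log(B[:, X[0]])             # initialization
    for t in range(1, T):                                  # iteration
        for j in range(N):
            cand = logV[t - 1] + np.log(A[:, j])
            back[t, j] = np.argmax(cand)
            logV[t, j] = cand[back[t, j]] + np.log(B[j, X[t]])
    path = [int(np.argmax(logV[T - 1]))]                   # backtracking
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, float(np.max(logV[T - 1]))

# e.g. viterbi(pi, A, B, X) on the slide-19 model returns the best
# bull/bear/steady sequence for (up, up, down, up, unchanged, up).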
23 Re-estimation: Baum-Welch Algorithm
- Find λ = (A, B, π) that maximizes P(X | λ)
- No closed-form solution => EM algorithm
- Start with an old parameter value λ
- Obtain a new parameter value that maximizes the EM auxiliary (Q) function
- EM is guaranteed not to decrease the likelihood
24 Continuous Densities
- The output distribution is a mixture of Gaussians
- Posterior probabilities γt(j, k) of state j and mixture k at time t, and ξt(i, j) of state i at time t-1 and state j at time t
- Re-estimation formulae (written out below)
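The equations referred to here are the standard ones for continuous-density HMMs. In generic notation, the output distribution and, as one example, the mean re-estimation formula are

b_j(\mathbf{x}_t) = \sum_{k=1}^{K} c_{jk}\, \mathcal{N}\!\left(\mathbf{x}_t;\, \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}\right),
\qquad
\hat{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{x}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

with γt(j, k) the posterior probability of being in state j and mixture component k at time t.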
25 EM Training Procedure
(Diagram) Input speech database + old HMM model → estimate posteriors γt(j,k), ξt(i,j) → maximize parameters aij, cjk, μjk, Σjk → updated HMM model (iterate)
26 Design Issues
- Continuous vs. discrete HMM
- Whole-word vs. subword (phone) units
- Number of states, number of Gaussians
- Ergodic vs. Bakis (left-to-right) topology
- Context-dependent vs. context-independent
27 Training with continuous speech
- No segmentation is needed
- Composed HMM
28 Context Variability in Speech
- At the word/sentence level
- "Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda."
- At the phone level
- /iy/ in the words "peat" and "wheel"
- Triphones capture
- Coarticulation
- Phonetic context
29 Context-dependent models
- The triphone IY(P, CH) captures coarticulation and phonetic context
- Stress: Italy vs. Italian
30 Clustering similar triphones
- /iy/ with two different left contexts /r/ and /w/
- Similar effects on /iy/
- Cluster those triphones together
31 Clustering with decision trees
32 Other variability in Speech
- Style
- discrete vs. continuous speech,
- read vs spontaneous
- slow vs fast
- Speaker
- speaker independent
- speaker dependent
- speaker adapted
- Environment
- additive noise (cocktail party effect)
- telephone channel
33 Acoustic Adaptation
- Model adaptation is needed if
- Test conditions are mismatched
- We desire to tune to a given speaker
- Maximum a Posteriori (MAP)
- Adds a prior for the parameters λ
- Maximum Likelihood Linear Regression (MLLR)
- Transforms mean vectors (see below)
- Can have more than one MLLR transform
- Speaker Adaptive Training (SAT) applies MLLR to the training data as well
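In its usual form (generic notation, not taken from the slide), the MLLR mean transform is

\hat{\boldsymbol{\mu}} = \mathbf{A}\, \boldsymbol{\mu} + \mathbf{b}

where A and b are estimated to maximize the likelihood of the adaptation data, and separate transforms can be tied to different regression classes of Gaussians.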
34 MAP vs. MLLR
Speaker-dependent system is trained with 1000
sentences
35 Discriminative Training
- Maximum likelihood training
- Parameters are estimated from the true classes
- Discriminative training
- Maximize discrimination between classes
- Discriminative feature transformation
- Maximize the ratio of inter-class to intra-class difference
- Done at the state level
- Linear Discriminant Analysis (LDA)
- Discriminative model training
- Maximize the posterior probability
- The correct class and competing classes are used
- Maximum Mutual Information (MMI), Minimum Classification Error (MCE), Minimum Phone Error (MPE)
36 Word Lexicon
- Goal
- Map legal phone sequences into words, according to phonotactic rules
- David: /d/ /ey/ /v/ /ih/ /d/
- Multiple pronunciations
- Several words may have multiple pronunciations
- Data: /d/ /ae/ /t/ /ax/
- Data: /d/ /ey/ /t/ /ax/
- Challenges
- How do you generate a word lexicon automatically?
- How do you add new variant dialects and word pronunciations?
37 The Lexicon
- An entry per word (> 100K words for dictation)
- Multiple pronunciations (e.g., tomato)
- Done by hand or with letter-to-sound (LTS) rules
- LTS rules can be automatically trained with decision trees (CART)
- Less than 8% errors, but proper nouns are hard!
38 Language Model
- Goal: model acceptable spoken phrases, constrained by task syntax
- Rule-based: deterministic, knowledge-driven grammars
- Statistical: compute estimates of word probabilities (N-gram, class-based, CFG)
Example: "flying from <city> to <city> on <date>" → "flying from Newark to Boston tomorrow" (with arc probabilities such as 0.4 and 0.6)
39 Formal grammars
40 Chomsky Grammar Hierarchy
41 Ngrams
42 Understanding Bigrams
- Training data
- "John read her book"
- "I read a different book"
- "John read a book by Mark"
- But we have a problem here (see the sketch below)
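A small sketch of maximum-likelihood bigram estimation on exactly these three sentences makes the problem explicit; the sentence-boundary markers <s> and </s> are the usual convention, not something shown on the slide.

# Maximum-likelihood bigram estimates from the slide's three sentences.
from collections import Counter

corpus = ["john read her book",
          "i read a different book",
          "john read a book by mark"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words[:-1])                 # history counts
    bigrams.update(zip(words[:-1], words[1:]))  # bigram counts

def p(w, prev):
    # P(w | prev) = count(prev, w) / count(prev); zero if the bigram is unseen.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p("read", "john"))   # 1.0  (both occurrences of "john" are followed by "read")
print(p("book", "a"))      # 0.5
print(p("read", "mark"))   # 0.0  -> unseen bigram

Any sentence containing an unseen bigram gets probability zero, which is the problem the next slide addresses with smoothing.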
43 Ngram Smoothing
- Data sparseness: in millions of words, more than 50% of trigrams occur only once
- We can't assign p(wi | wi-1, wi-2) = 0
- Solution: assign a non-zero probability to each unseen ngram by lowering the probability mass of seen ngrams
44 Perplexity
- The cross-entropy of a language model on a word sequence W, and its perplexity, are defined below
- Perplexity measures the complexity of a language model (the geometric mean of the branching factor)
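In standard notation, for a test word sequence W = w1 w2 ... wN:

H(W) = -\frac{1}{N} \log_2 P(w_1 w_2 \cdots w_N), \qquad PP(W) = 2^{H(W)}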
45 Perplexity
- The digit recognition task (TIDIGITS) has 10 words, PP = 10 and a 0.2% error rate
- The Airline Travel Information System (ATIS) has 2000 words, PP = 20 and a 2.5% error rate
- The Wall Street Journal task has 5000 words, PP = 130 with a bigram and a 5% error rate
- In general, lower perplexity => lower error rate, but perplexity does not take acoustic confusability into account: the E-set (B, C, D, E, G, P, T) has PP = 7 and about a 5% error rate
46 Ngram Smoothing
- The deleted interpolation algorithm estimates weights λ that maximize the probability of a held-out data set (see below)
- We can also map all out-of-vocabulary words to the unknown word
- Other backoff and smoothing algorithms are possible: Katz, Kneser-Ney, Good-Turing, class ngrams
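For a trigram model, deleted interpolation is usually written as

P_{DI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\, P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, P_{ML}(w_i \mid w_{i-1}) + \lambda_1\, P_{ML}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1

with the λ's estimated (typically by EM) on the held-out set.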
47 Adaptive Language Models
- Cache Language Models
- Topic Language Models
- Maximum Entropy Language Models
48 Bigram Perplexity
- Trained on 500 million words and tested on
Encarta Encyclopedia
49 OOV Rate
- OOV rate measured on the Encarta Encyclopedia; trained on 500 million words
50 WSJ Results
- Perplexity and word error rate on the 60,000-word Wall Street Journal continuous speech recognition task
- Unigrams, bigrams and trigrams were trained from 260 million words
- Smoothing mechanism: Kneser-Ney
51 Pattern Classification
- Goal
- Find the optimal word sequence
- Combine information (probabilities) from
- The acoustic model
- The word lexicon
- The language model
- Method
- The decoder searches through all possible recognition choices using a Viterbi decoding algorithm
- Challenge
- Efficient search through a large network space is computationally expensive for large-vocabulary ASR
52 The Problem of Large Vocabulary ASR
- The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping: features → HMM states → HMM units → phones → words → sentences
- For the WSJ 20K-word vocabulary, this results in a network of 10^22 bytes!
- State-of-the-art methods include fast match, multi-pass decoding, A* stack decoding, and finite state transducers; all provide tremendous speed-up by searching through the network and finding the best path that maximizes the likelihood function
53 Weighted Finite State Transducers (WFST)
- A unified mathematical framework for ASR
- Efficiency in time and space
- A cascade of WFSTs (state→HMM, HMM→phone, phone→word, word→phrase) is combined by composition and optimization into a single search network
- The WFST can compile the network down to 10^8 states, 14 orders of magnitude more efficient
54 Word Pronunciation WFST
(Example transducer for the word "data", with arcs d:ε/1, then ey:ε/.4 or ae:ε/.6, then t:ε/.2 or dx:ε/.8, then ax:data/1)
55 Confidence Scoring
- Goal: identify possible recognition errors and out-of-vocabulary events; potentially improves the performance of ASR, SLU and DM
- Method: a confidence score, based on a hypothesis likelihood ratio test, is associated with each recognized word
- Label: "credit please"; Recognized: "credit fees"; Confidence: (0.9) (0.3)
- Challenges: rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech
56 Speech Recognition Process
(Block diagram, as on slide 8) Input speech → Feature Extraction → Pattern Classification (decoding, search) → Confidence Scoring → "Hello World" (0.9) (0.8); knowledge sources: Acoustic Model, Word Lexicon, Language Model
57 How to evaluate performance?
- Dictation applications: insertions, substitutions and deletions
- Command-and-control: false rejection and false acceptance => ROC curves
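The dictation metric combines the three error types into the word error rate:

\mathrm{WER} = \frac{S + D + I}{N} \times 100\%

where S, D and I are the numbers of substitutions, deletions and insertions against a reference transcription of N words.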
58 ASR Performance: The State of the Art
59 Growth in Effective Recognition Vocabulary Size
60 Improvement in Word Accuracy
10-20% relative reduction in word error rate per year.
Switchboard/Call Home: vocabulary 40,000 words, perplexity 85
61 Human Speech Recognition vs. ASR
(Chart comparing machine and human error rates, with ratio markers x1, x10 and x100 and a region labeled "Machines Outperform Humans")
62 Challenges in ASR
- System performance
- Accuracy
- Efficiency (speed, memory)
- Robustness
- Operational performance
- End-point detection
- User barge-in
- Utterance rejection
- Confidence scoring
- Machines are 10-100 times less accurate than humans
63 Large-Vocabulary ASR Demo
64 Multimedia Customer Care
Courtesy of AT&T
65 Voice-enabled System Technology Components
(Block diagram) Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
66 Spoken Language Understanding (SLU)
- Goal: extract and interpret the meaning of the recognized speech so as to identify the user's request
- Accurate understanding can often be achieved without correctly recognizing every word
- SLU makes it possible to offer natural-language services where the customer can speak openly without learning a specific set of terms
- Methodology: exploit the task grammar (syntax) and task semantics to restrict the range of meanings associated with the recognized word string
- Applications: automation of complex operator-based tasks, e.g., customer care, customer help lines, etc.
67 SLU Formulation
- Let W be a sequence of words and C be its underlying meaning (conceptual structure); then apply Bayes' rule (see below)
- Finding the best conceptual structure can be done by parsing and ranking, using a combination of acoustic, linguistic and semantic scores
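Written out in the usual way, the Bayes-rule formulation referred to above is

\hat{C} = \arg\max_{C} P(C \mid W) = \arg\max_{C} P(W \mid C)\, P(C)

and, when the acoustics X are folded in, the parser ranks candidate structures by combining P(X|W), P(W|C) and P(C), i.e. the acoustic, linguistic and semantic scores.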
68 Knowledge Sources for Speech Understanding
- Acoustic/phonetic (acoustic model): the relationship between speech sounds and English phonemes
- Phonotactic (word lexicon): rules for phoneme sequences and pronunciation
- Syntactic (language model): the structure of words and phrases in a sentence
- Semantic (understanding model): relationships and meanings among words
- Pragmatic (dialog manager): discourse, interaction history, world knowledge
69SLU Components
From ASR (text, lattices, n-best, history)
Text Normalization
Morphology, Synonyms
Database Access
Extracting named entities, semantic concepts,
syntactic tree
Parsing/ Decoding
Interpretation
Slot filling, reasoning, task knowledge
representation
To DM (concepts, entities, parse tree)
70 Text Normalization
- Goal: reduce language variation (lexical analysis)
- Morphology
- Decomposing words into their minimal units of meaning (or grammatical analysis)
- Synonyms
- Finding words that mean the same (hello, hi, how do you do)
- Equalization
- Disfluencies, non-alphanumeric characters, capitals, non-white-space characters, etc.
71 Parsing/Decoding
- Goal: map the textual representation into semantic concepts using knowledge rules and/or data
- Entity extraction (matcher)
- Finite state machines (FSM)
- Context-free grammars (CFG)
- Semantic parsing
- Tree-structured meaning representation allowing arbitrary nesting of concepts
- Decoding
- Segment an utterance into phrases, each representing a concept (e.g., using HMMs)
- Classification
- Categorizing an utterance into one or more semantic concepts
72 Interpretation
- Goal: interpret the user's utterance in the context of the dialog
- Ellipsis and anaphora
- "What about for New York?"
- History mechanism
- Communicated messages are differential
- Removing ambiguities
- Semantic frames (slots)
- Associating entities and relations (attribute/value pairs)
- Rule-based
- Database look-up
- Retrieving or checking entities
73 DARPA Communicator
- DARPA-sponsored research and development of mixed-initiative dialogue systems
- Travel task involving airline, hotel and car information and reservations
- "Yeah I uh I would like to go from New York to Boston tomorrow night with United"
- SLU output (concept decoding), XML schema:
<itinerary>
  <origin> <city></city> <state></state> </origin>
  <destination> <city></city> <state></state> </destination>
  <date></date>
  <time></time>
  <airline></airline>
</itinerary>
- Filled frame: Topic: Itinerary; Origin: New York; Destination: Boston; Day of the week: Sunday; Date: May 25th, 2002; Time: > 6pm; Airline: United
74 Semantic CFG
<rule name="itinerary">
  <o>Show me flights</o> <ruleref name="origin"/> <ruleref name="destination"/>
</rule>
<rule name="origin">
  from <ruleref name="city"/>
</rule>
<rule name="destination">
  to <ruleref name="city"/>
</rule>
<rule name="city">
  New York | San Francisco | Boston
</rule>
75 AT&T "How May I Help You?" - Customer Care Services
- The user responds with unconstrained, fluent speech
- The system recognizes salient phrases, determines the meaning of the user's speech, and routes the call
- Example: "There is a number on my bill I didn't make" → unrecognized number
- Prompt: "How May I Help You?"; routes include account balance, unrecognized number, billing credit, combined bill, and agent
76 Voice-enabled System Technology Components
(Block diagram) Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
77 Dialog Management (DM)
- Goal
- Manage elaborate exchanges with the user
- Provide automated access to information
- Implementation
- A mathematical model programmed from rules and/or data
78 Computational Models for DM
- Structure-based approach
- Assumes that there is a regular structure in dialog that can be modeled as a state transition network or dialog grammars
- Dialog strategies need to be predefined, which limits flexibility
- Several real-time deployed systems exist today which have inference engines and knowledge representation
- Plan-based approach
- Considers communication as acting; dialog acts and actions are oriented toward goal achievement
- Motivated by human/human interaction, in which humans generally have goals and plans when interacting with others
- Aims to account for general models and theories of discourse
79 DM Technology Components
(Diagram) The DM sits between the ASR/SLU and the SLG/TTS; its components are context interpretation, dialog strategies (modules), and backend (database) access
80 Context Interpretation
(Diagram) User input + current context, State(t) → context interpretation → dialog strategies → backend → action and new context, State(t+1)
- A formal representation of the context history is necessary so that the DM can interpret a user's utterance given previously exchanged utterances and identify a new action
- Natural communication is a differential process
81 Dialog Strategies
- Completion (continuation): elicits missing input information from the user
- Constraining (disambiguation): reduces the scope of the request when multiple pieces of information have been retrieved
- Relaxation: increases the scope of the request when no information has been retrieved
- Confirmation: verifies that the system understood correctly; the strategy may differ depending on the SLU confidence measure
- Reprompting: used when the system expected input but did not receive any, or did not understand what it received
- Context help: provides the user with help during periods of misunderstanding by either the system or the user
- Greeting/closing: maintains social protocol at the beginning and end of an interaction
- Mixed-initiative: allows users to manage the dialog
82 Mixed-initiative Dialog
- Who manages the dialog, the system or the user?
- System initiative: "Please say collect, calling card, third number."
- User initiative: "How can I help you?"
83 Example: Mixed-Initiative Dialog
- System initiative (long dialogs, but easier to design): System: "Please say just your departure city." User: "Chicago." System: "Please say just your arrival city." User: "Newark."
- Mixed initiative: System: "Please say your departure city." User: "I need to travel from Chicago to Newark tomorrow."
- User initiative (shorter dialogs and a better user experience, but more difficult to design): System: "How may I help you?" User: "I need to travel from Chicago to Newark tomorrow."
84 Dialog Evaluation
- Why?
- Identify problematic situations (task failure)
- Minimize user hang-ups and routing to an operator
- Improve the user experience
- Perform data analysis and monitoring
- Conduct data mining for sales and marketing
- How?
- Log exchange features from ASR, SLU, DM, ANI
- Define and automatically optimize a task objective function
- Compute service statistics (instant and delayed)
85 Optimizing Task Objective Function
- User satisfaction is the ultimate objective of a dialog system
- Task completion rate
- Efficiency of the dialog interaction
- Usability of the system
- Perceived intelligibility of the system
- Quality of the audio output
- Perplexity and quality of the response generation
- Machine learning techniques can be applied to predict user satisfaction (e.g., the PARADISE framework)
86 Example of Plan-based Dialog Systems
TRIPS System Architecture
87 Example of Structure-based Dialog Systems
Help Desk
- AT&T Labs Natural Voices™
- A finite state engine is used to control the action of the interpreter output
- The system routes customers to appropriate agents or departments
- Provides information about products, services, costs and the business
- Shows demonstrations of various voice fonts and languages
- Troubleshoots problems and concerns raised by customers (in progress)
88 Voice-enabled System Technology Components
(Block diagram) Speech → ASR (Automatic Speech Recognition) → words spoken → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action/Meaning → SLG (Spoken Language Generation) → words to be synthesized → TTS (Text-to-Speech Synthesis) → Speech
89 A Good User Interface
- Makes the application easy to use
- Makes the application robust to the kinds of confusion that arise in human-machine communication by voice
- Keeps the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
- A great UI cannot save a system with poor ASR and NLU, but the UI can make or break a system, even with excellent ASR and NLU
- Effective UI design is based on a set of elementary principles: common widgets, sequenced screen presentations, simple error-trap dialogs, a user manual
90 Correction UI
91 Multimodal System Technology Components
(Block diagram) Speech, pen gesture and visual input → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management, driven by data and rules) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech
92 Multimodal Experience
- Access to information anytime and anywhere, using any device
- The most appropriate UI mode, or combination of modes, depends on the device, the task, the environment, and the user's abilities and preferences
93 Multimodal Architecture
(Diagram) x (multimodal inputs) → Parsing, using a semantic model of what the user might do or say, P(F | x, Sn-1) → F (surface semantics) → Understanding, using a language model of what that means, P(Sn | F, Sn-1) → Sn (discourse semantics) → Application Logic → Rendering, using a behavior model of what to show or say to the user, P(A | Sn) → A (multimedia outputs)
94 MIPad
- Multimodal Interactive Pad
- Usability studies show double throughput for
English - Speech is mostly useful in cases with lots of
alternatives
95 MiPad video
96 Speech-enabled MapPoint
97 Other Research Areas not covered
- Speaker recognition (verification and identification)
- Language identification
- Human/human and human/machine translation
- Multimedia and document retrieval
- Speech coding
- Microphone array processing
- Multilingual speech recognition
98 Application Development
99 Speech APIs
- Open-standard APIs provide separation of the ASR/TTS engines and platform from the application layer
- The application is engine-independent and contains the SLU, DM, content server and host interface
100 VoiceXML Architecture
(Diagram) A web server delivers multimedia, HTML and scripts to conventional clients, and VoiceXML documents, audio and grammars to a VoiceXML gateway running a voice browser (sample spoken output: "The temperature is ...")
101 A VoiceXML example
<?xml version="1.0"?>
<vxml application="tutorial.vxml" version="1.0">
  <form id="getPhoneNumber">
    <field name="PhoneNumber">
      <prompt>What's your phone number?</prompt>
      <grammar src="../grammars/phone.gram" type="application/x-jsgf" />
      <help>Please say your ten digit phone number.</help>
      <if cond="PhoneNumber &lt; 1000000"> ... </if>
    </field>
  </form>
</vxml>
102 Speech Application Language Tags (SALT)
- Adds 4 tags to HTML/XHTML/cHTML/WML: <listen>, <prompt>, <dtmf>, <smex>
<html>
  <form action="nextpage.html">
    <input type="text" id="txtBoxCity" />
  </form>
  <listen id="reco1">
    <grammar src="cities.gram" />
    <bind targetElement="txtBoxCity" value="city" />
  </listen>
</html>
103 The Speech Advantage!
- Reduce costs
- Reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
- New revenue opportunities
- 24x7 high-quality customer care automation
- Access to information without a keyboard or touch-tone pad
- Customer retention
- "Stickiness" of services
- Add personalization
104 Just the Beginning
105 Market Opportunities
- Consumer communication
- Voice portals
- Desktop dictation
- Telematic applications
- Disabilities
- Call center automation
- Help lines
- Customer care
- Voice-assisted e-commerce
- B2B
- Enterprise communication
- Unified messaging
- Enterprise sales
106 A look into the future...
- 1990 (Directory Assistance, VRCP): keyword spotting, handcrafted grammars, no dialogue; constrained speech, minimal data collection, manual design
- 1995 (airline reservation, banking): medium-size ASR, handcrafted grammars, system initiative; constrained speech, moderate data collection, some automation
- 2000 (call centers, e-commerce): large-size ASR, limited NLU, mixed initiative; spontaneous speech, extensive data collection, semi-automation
- 2005 (multimodal, multilingual help desks, e-commerce; e.g., MATCH: Multimodal Access To City Help): unlimited ASR, deeper NLU, adaptive systems; spontaneous speech/pen, fully automated systems
107 Spoken Dialog Interfaces vs. Touch-Tone IVR?
108 Reading Materials
- Spoken Language Processing, X. Huang, A. Acero and H.-W. Hon, Prentice Hall, 2001.
- Fundamentals of Speech Recognition, L. Rabiner and B.-H. Juang, Prentice Hall, 1993.
- Automatic Speech and Speaker Recognition, C.-H. Lee, F. Soong and K. Paliwal (eds.), Kluwer Academic Publishers, 1996.
- Spoken Dialogues with Computers, R. De Mori (ed.), Academic Press, 1998.