Speech Recognition and Understanding

1
Speech Recognition and Understanding
  • Alex Acero
  • Microsoft Research

Thanks to Mazin Rahim (AT&T)
2
A Vision into the 21st Century
3
Milestones in Speech Recognition
Small Vocabulary, Acoustic Phonetics-based
Large Vocabulary Syntax, Semantics,
Very Large Vocabulary Semantics, Multimodal
Dialog
Medium Vocabulary, Template-based
Large Vocabulary, Statistical-based
Isolated Words Connected Digits Continuous Speech
Continuous Speech Speech Understanding
Spoken dialog Multiple modalities
Connected Words Continuous Speech
Isolated Words
Stochastic language understanding
Finite-state machines Statistical learning
Pattern recognition LPC analysis Clustering
algorithms Level building
Filter-bank analysis Time-normalization Dynamic
programming
Concatenative synthesis Machine learning
Mixed-initiative dialog
Hidden Markov models Stochastic
Language modeling
1962 1967 1972 1977 1982 1987 1992 1997 2003
Year
4
Multimodal System Technology Components
Speech
Speech
Pen Gesture
Visual

TTS
ASR
Automatic SpeechRecognition
Text-to-SpeechSynthesis
Data, Rules
Words
Words
SLU
SLG
Spoken Language Generation
Spoken LanguageUnderstanding
Action
Meaning
DM
DialogManagement
5
Voice-enabled System Technology Components
Speech
Speech

TTS
ASR
Automatic SpeechRecognition
Text-to-SpeechSynthesis
Data, Rules
Words
Words
SLU
SLG
Spoken Language Generation
Spoken LanguageUnderstanding
Meaning
Action
DM
DialogManagement
6
Automatic Speech Recognition
  • Goal
  • convert a speech signal into a text message
  • Accurately and efficiently
  • independent of the device, speaker or the
    environment.
  • Applications
  • Accessibility
  • Eyes-busy hands-busy (automobile, doctors, etc)
  • Call Centers for customer care
  • Dictation

7
Basic Formulation
  • The basic equation of speech recognition is
    W* = argmax_W P(W|X) = argmax_W P(X|W) P(W)
  • X = X1, X2, ..., Xn is the acoustic observation
  • W = w1, w2, ..., wm is the word sequence
  • P(X|W) is the acoustic model
  • P(W) is the language model
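As a toy illustration of this equation (the scores below are hypothetical, not a real recognizer), the decoder picks W* = argmax_W P(X|W) P(W), usually in log space:

```python
import math

def best_hypothesis(acoustic_logprob, lm_logprob, candidates):
    """Pick W* = argmax_W P(X|W) P(W), working with log probabilities."""
    return max(candidates,
               key=lambda W: acoustic_logprob[W] + lm_logprob[W])

# Hypothetical scores for two acoustically confusable transcriptions.
acoustic = {"recognize speech": math.log(0.60), "wreck a nice beach": math.log(0.62)}
lm       = {"recognize speech": math.log(0.010), "wreck a nice beach": math.log(0.001)}

print(best_hypothesis(acoustic, lm, acoustic.keys()))
```

Here the acoustic model slightly prefers the wrong string, but the language model prior tips the decision to "recognize speech".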

8
Speech Recognition Process
TTS
ASR
SLU
SLG
DM
Acoustic Model
Input Speech
Pattern Classification (Decoding, Search)
Hello World
Feature Extraction
Confidence Scoring
(0.9) (0.8)
Language Model
Word Lexicon
9
Feature Extraction
Pattern Classification
Confidence Scoring
Feature Extraction
  • Goal
  • Extract robust features relevant for ASR
  • Method
  • Spectral analysis
  • Result
  • A feature vector every 10ms
  • Challenges
  • Robustness to environment (office, airport, car)
  • Devices (speakerphones, cellphones)
  • Speakers (accents, dialect, style, speaking
    defects)

Language Model
Word Lexicon
10
Spectral Analysis
  • Female speech (/aa/, pitch of 200Hz)
  • Fourier transform
  • 30ms Hamming Window

x[n]: time signal
X[k]: Fourier transform
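A minimal sketch of this analysis step, assuming an 8 kHz sampling rate and a synthetic 200 Hz tone in place of real speech (the slide's female /aa/ pitch): window a 30 ms frame with a Hamming window and take its Fourier transform.

```python
import cmath
import math

def hamming(N):
    # w[n] = 0.54 - 0.46 cos(2*pi*n / (N-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def dft(x):
    # X[k] = sum_n x[n] exp(-j 2 pi k n / N)  (direct DFT; an FFT is used in practice)
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

fs = 8000                    # sampling rate in Hz (assumed)
N = int(0.030 * fs)          # 30 ms frame -> 240 samples
f0 = 200                     # 200 Hz pitch, as on the slide
x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(N)]    # x[n]: time signal
w = hamming(N)
X = dft([xi * wi for xi, wi in zip(x, w)])                     # X[k]: Fourier transform

peak = max(range(N // 2), key=lambda k: abs(X[k]))
print(peak * fs / N)         # spectral peak near 200 Hz
```

The peak lands in the DFT bin closest to 200 Hz (bin spacing fs/N ≈ 33 Hz here).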
11
Spectrograms
  • Short-time Fourier transform
  • Pitch and formant structure

12
Feature Extraction Process
Pipeline (from the slide's block diagram; stage order as best recoverable): Quantization -> Noise removal, Normalization -> Preemphasis -> Segmentation (blocking into frames of M samples with shift N) -> Windowing -> Spectral Analysis -> Filtering -> Cepstral Analysis -> Equalization (bias removal or normalization) -> Temporal Derivative (delta cepstrum, delta² cepstrum)
Side outputs: Energy, Zero-crossing, Pitch, Formants
13
Robust Speech Recognition
A mismatch in the speech signal between the
training phase and testing phase results in
performance degradation.
Signal
Features
Model
Training
Enhancement
Normalization
Adaptation
Signal
Features
Model
Testing
14
Noise and Channel Distortion
y(t) = s(t) * h(t) + n(t)
(Diagram: speech s(t) passes through a channel h(t) and picks up additive noise n(t), yielding distorted speech y(t); Fourier transforms of the clean and distorted signals are compared, with SNR dropping from 50dB to 5dB.)
15
Speaker Variations
  • Vocal tract length varies from 15-20cm
  • Longer vocal tracts => lower frequency content
  • Maximum Likelihood Speaker Normalization
  • Warp the frequency of a signal

16
Acoustic Modeling
  • Goal
  • Map acoustic features into distinct subword units
  • Such as phones, syllables, words, etc.
  • Hidden Markov Model (HMM)
  • Spectral properties modeled by a parametric
    random process
  • A collection of HMMs is associated with a subword
    unit
  • HMMs are also assigned for modeling extraneous
    events
  • Advantages
  • Powerful statistical method for a wide range of
    data and conditions
  • Highly reliable for recognizing speech

17
Discrete-Time Markov Process
  • The Dow Jones Industrial Average

Discrete-time first order Markov chain
18
Hidden Markov Models

19
Example
  • I observe (up, up, down, up, unchanged, up)
  • Is it a bull market? bear market?
  • P(bull) = (0.7·0.7·0.1·0.7·0.2·0.7) · 0.5·(0.6)^5 = 1.867×10^-4
  • P(bear) = (0.1·0.1·0.6·0.1·0.3·0.1) · 0.2·(0.3)^5 = 8.748×10^-9
  • P(steady) = (0.3·0.3·0.3·0.3·0.4·0.3) · 0.3·(0.5)^5 = 9.1125×10^-6
  • It's about 20 times more likely that we are in a bull
    market than a steady market!
  • How about
  • P(bull, bull, bear, bull, steady, bull)?
  • = (0.7·0.7·0.6·0.7·0.4·0.7)·(0.5·0.6·0.2·0.5·0.2·0.4)
    = 1.382976×10^-4
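These computations can be reproduced directly. The initial, self-transition and observation probabilities below are read off the slide's arithmetic; the probability of staying in one state for the whole sequence is the initial probability times the self-transition probability to the fifth power, times the six emission probabilities.

```python
# Dow Jones Markov chain: observe up/down/unchanged, states bull/bear/steady.
obs = ["up", "up", "down", "up", "unchanged", "up"]

emit = {  # P(observation | state)
    "bull":   {"up": 0.7, "down": 0.1, "unchanged": 0.2},
    "bear":   {"up": 0.1, "down": 0.6, "unchanged": 0.3},
    "steady": {"up": 0.3, "down": 0.3, "unchanged": 0.4},
}
init = {"bull": 0.5, "bear": 0.2, "steady": 0.3}       # initial state probabilities
self_loop = {"bull": 0.6, "bear": 0.3, "steady": 0.5}  # P(state stays the same)

def seq_prob(state):
    """P(obs with the state held fixed) = pi_s * a_ss^(T-1) * prod_t b_s(o_t)."""
    p = init[state] * self_loop[state] ** (len(obs) - 1)
    for o in obs:
        p *= emit[state][o]
    return p

print(seq_prob("bull"))    # ~1.867e-4
print(seq_prob("bear"))    # ~8.748e-9
print(seq_prob("steady"))  # ~9.1125e-6
```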

20
Basic Problems in HMMs
  • Given acoustic observation X and model λ
  • Evaluation: compute P(X|λ)
  • Decoding: choose the optimal state sequence
  • Re-estimation: adjust λ to maximize P(X|λ)

21
Evaluation Forward-Backward algorithm

Forward
Backward
22
Decoding Viterbi Algorithm
Step 1 (Initialization): V1(i) = πi·bi(x1), B1(i) = 0, i = 1,...,N
Step 2 (Iteration): for t = 2,...,T and j = 1,...,N:
  Vt(j) = max_i [Vt-1(i)·aij] · bj(xt)
  Bt(j) = argmax_i [Vt-1(i)·aij]
Step 3 (Backtracking): the optimal score is V* = max_i VT(i),
  the final state is s*T = argmax_i VT(i), and the optimal
  path (s*1, s*2,...,s*T) follows s*t = Bt+1(s*t+1), t = T-1,...,1
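A runnable sketch of the three steps, applied to the Dow Jones HMM; the emission and initial probabilities are from the earlier slide, while the full transition matrix A is an assumed illustration (only the self-transitions appear on the slides).

```python
def viterbi(obs, states, pi, A, B):
    """Viterbi: init V_1(i) = pi_i b_i(x_1); iterate
    V_t(j) = max_i V_{t-1}(i) a_ij * b_j(x_t); then backtrack."""
    V = [{s: pi[s] * B[s][obs[0]] for s in states}]      # Step 1: initialization
    back = [{}]
    for t in range(1, len(obs)):                          # Step 2: iterations
        V.append({})
        back.append({})
        for j in states:
            best = max(states, key=lambda i: V[t - 1][i] * A[i][j])
            V[t][j] = V[t - 1][best] * A[best][j] * B[j][obs[t]]
            back[t][j] = best
    last = max(states, key=lambda s: V[-1][s])            # Step 3: backtracking
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], V[-1][last]

states = ["bull", "bear", "steady"]
pi = {"bull": 0.5, "bear": 0.2, "steady": 0.3}
A = {"bull":   {"bull": 0.6, "bear": 0.2, "steady": 0.2},   # assumed transitions
     "bear":   {"bull": 0.5, "bear": 0.3, "steady": 0.2},
     "steady": {"bull": 0.4, "bear": 0.1, "steady": 0.5}}
B = {"bull":   {"up": 0.7, "down": 0.1, "unchanged": 0.2},
     "bear":   {"up": 0.1, "down": 0.6, "unchanged": 0.3},
     "steady": {"up": 0.3, "down": 0.3, "unchanged": 0.4}}

path, prob = viterbi(["up", "up", "down"], states, pi, A, B)
print(path, prob)  # ['bull', 'bull', 'bear'] ~0.0176
```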

23
ReestimationBaum-Welch Algorithm
  • Find λ = (A, B, π) that maximizes P(X|λ)
  • No closed-form solution =>
  • EM algorithm
  • Start with old parameter value λ
  • Obtain a new parameter value λ' that maximizes the
    auxiliary function Q(λ, λ')
  • EM guarantees the likelihood does not decrease

24
Continuous Densities
  • Output distribution is mixture of Gaussians
  • Posterior probabilities of state j at time t,
    (mixture k and state i at time t - 1)
  • Reestimation Formulae

25
EM Training Procedure
Input speech database
Updated HMM Model
Estimate posteriors γt(j,k), ξt(i,j)
Maximize parameters aij, cjk, μjk, Σjk
Old HMM Model
26
Design Issues
  • Continuous vs. Discrete HMM
  • Whole-word vs. subword (phone units)
  • Number of states, number of Gaussians
  • Ergodic vs. Bakis
  • Context-dependent vs. context-independent

27
Training with continuous speech
  • No segmentation is needed
  • Composed HMM

28
Context Variability in Speech
  • At word/sentence level
  • Mr. Wright should write to Ms. Wright right away
    about his Ford or four door Honda.

Peat
Wheel
  • At phone level
  • /iy/ for the words peat and wheel
  • Triphones capture
  • Coarticulation
  • phonetic context

29
Context-dependent models
Triphone IY(P, CH) captures coarticulation,
phonetic context
Stress: Italy vs. Italian
30
Clustering similar triphones
  • /iy/ with two different left contexts /r/ and /w/
  • Similar effects on /iy/
  • Cluster those triphones together

31
Clustering with decision trees

32
Other variability in Speech
  • Style
  • discrete vs. continuous speech,
  • read vs spontaneous
  • slow vs fast
  • Speaker
  • speaker independent
  • speaker dependent
  • speaker adapted
  • Environment
  • additive noise (cocktail party effect)
  • telephone channel

33
Acoustic Adaptation
  • Model adaptation needed if
  • Mismatched test conditions
  • Desire to tune to a given speaker
  • Maximum a Posteriori (MAP)
  • adds a prior for the parameters λ
  • Maximum Likelihood Linear Regression (MLLR)
  • Transform mean vectors
  • Can have more than one MLLR transform
  • Speaker Adaptive Training (SAT) applies MLLR to
    training data as well
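MLLR transforms the Gaussian mean vectors as μ' = Aμ + b, with one (A, b) shared per regression class. A minimal sketch; the 2-D matrix and vectors are hypothetical numbers, not estimated from data:

```python
def mllr_transform(A, b, mu):
    """Apply an MLLR transform to a mean vector: mu' = A @ mu + b."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]

# Hypothetical 2-D transform shared by all Gaussians in one regression class.
A = [[1.1, 0.0],
     [0.0, 0.9]]
b = [0.5, -0.2]
print(mllr_transform(A, b, [2.0, 4.0]))  # ~[2.7, 3.4]
```

In a real system (A, b) is estimated from a small amount of adaptation data by maximizing likelihood; with several transforms, each regression class of Gaussians gets its own.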

34
MAP vs MLLR
Speaker-dependent system is trained with 1000
sentences
35
Discriminative Training
  • Maximum Likelihood Training
  • Parameters obtained from true classes
  • Discriminative Training
  • maximize discrimination between classes
  • Discriminative Feature Transformation
  • Maximize inter-class difference to intra-class
    difference
  • Done at the state level
  • Linear Discriminant Analysis (LDA)
  • Discriminative Model Training
  • Maximize posterior probability
  • Correct class and competing classes are used
  • Maximum Mutual Information (MMI), Minimum
    Classification Error (MCE), Minimum Phone Error
    (MPE)

36
Word Lexicon
  • Goal
  • Map legal phone sequences into words
  • according to phonotactic rules
  • David /d/ /ey/ /v/ /ih/ /d/
  • Multiple Pronunciation
  • Several words may have multiple pronunciations
  • Data /d/ /ae/ /t/ /ax/
  • Data /d/ /ey/ /t/ /ax/
  • Challenges
  • How do you generate a word lexicon
    automatically?
  • How do you add new variant dialects and word
    pronunciations?

Pattern Classification
Confidence Scoring
Feature Extraction
Language Model
Word Lexicon
37
The Lexicon
  • An entry per word (> 100K words for dictation)
  • Multiple pronunciations (tomato)
  • Done by hand or with letter-to-sound rules (LTS)
  • LTS rules can be automatically trained with
    decision trees (CART)
  • Less than 8% errors, but proper nouns are hard!

38
Language Model
Goal: Model acceptable spoken phrases, constrained by
task syntax.
Rule-based: Deterministic, knowledge-driven grammars.
Statistical: Compute estimates of word probabilities
(N-gram, class-based, CFG).
Pattern Classification
Confidence Scoring
Feature Extraction
Language Model
Word Lexicon
flying from city to city on date
flying from Newark to Boston tomorrow
0.4
0.6
39
Formal grammars

40
Chomsky Grammar Hierarchy

41
Ngrams
  • Trigram Estimation
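The trigram estimate referred to here is the relative frequency P(wi | wi-2, wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1). A minimal sketch, with a hypothetical two-sentence corpus:

```python
from collections import Counter

def trigram_mle(corpus):
    """MLE trigram estimate: P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2)."""
    tri, bi = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(len(words) - 2):
            tri[tuple(words[i:i + 3])] += 1
            bi[tuple(words[i:i + 2])] += 1   # context counts for each trigram start
    def p(w1, w2, w3):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return p

p = trigram_mle(["john read a book", "john read a letter"])
print(p("john", "read", "a"))   # 1.0
print(p("read", "a", "book"))   # 0.5
```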

42
Understanding Bigrams
  • Training data
  • John read her book
  • I read a different book
  • John read a book by Mark
  • But we have a problem here: any bigram unseen in
    the training data gets probability zero
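The problem can be made concrete with the three training sentences above: a sentence built entirely from seen bigrams gets a reasonable probability, but a single unseen bigram zeroes out the whole product.

```python
from collections import Counter

sentences = ["john read her book",
             "i read a different book",
             "john read a book by mark"]

bi, uni = Counter(), Counter()
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]
    uni.update(words[:-1])              # history counts C(w1)
    bi.update(zip(words, words[1:]))    # bigram counts C(w1, w2)

def p_bigram(sentence):
    """P(sentence) = prod_i C(w_{i-1}, w_i) / C(w_{i-1})."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0
    return p

print(p_bigram("john read a book"))   # ~0.148
print(p_bigram("mark read a book"))   # 0.0: the bigram (<s>, mark) was never seen
```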

43
Ngram Smoothing
  • Data sparseness: in millions of words, more than
    50% of trigrams occur only once.
  • Can't assign P(wi|wi-1, wi-2) = 0
  • Solution: assign a non-zero probability to each
    unseen ngram by lowering the probability mass of
    seen ngrams

44
Perplexity
  • The cross-entropy of a language model on word
    sequence W is H(W) = -(1/N) log2 P(W)
  • And its perplexity is PP(W) = 2^H(W)
  • It measures the complexity of a language model
    (geometric mean of the branching factor).
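A minimal sketch of the computation PP = 2^H with H = -(1/N) Σ log2 P(wi | history); with a uniform model over a 10-word digit vocabulary, the perplexity equals the branching factor, 10 (the words passed in are placeholders).

```python
import math

def perplexity(model, words):
    """PP = 2^H, where H = -(1/N) * sum_i log2 P(w_i | history)."""
    H = -sum(math.log2(model(w)) for w in words) / len(words)
    return 2 ** H

uniform = lambda w: 1 / 10        # uniform model over a 10-word vocabulary
print(perplexity(uniform, ["one", "two", "three"]))  # ~10
```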

45
Perplexity
  • The digit recognition task (TIDIGITS) has 10
    words, PP = 10 and a 0.2% error rate
  • The Airline Travel Information System (ATIS) has
    2000 words, PP = 20 and a 2.5% error rate
  • The Wall Street Journal task has 5000 words,
    PP = 130 with a bigram, and a 5% error rate
  • In general, lower perplexity => lower error rate,
    but it does not take acoustic confusability into
    account: the E-set (B, C, D, E, G, P, T) has
    PP = 7 and a 5% error rate.

46
Ngram Smoothing
  • The Deleted Interpolation algorithm estimates the
    λ that maximizes probability on a held-out data set
  • We can also map all out-of-vocabulary words to
    the unknown word
  • Other backoff smoothing algorithms are possible:
    Katz, Kneser-Ney, Good-Turing, class ngrams
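A sketch of the interpolation idea: mix the bigram estimate with a unigram backoff, P(w|h) = λ·P_bigram(w|h) + (1-λ)·P_unigram(w), and pick the λ that maximizes held-out likelihood. The toy models and held-out pairs below are hypothetical, and a single held-out set stands in for the rotating partitions of true deleted interpolation.

```python
import math

def interpolated(p_bigram, p_unigram, lam):
    """P(w|h) = lam * P_bigram(w|h) + (1 - lam) * P_unigram(w)."""
    return lambda h, w: lam * p_bigram(h, w) + (1 - lam) * p_unigram(w)

def tune_lambda(p_bigram, p_unigram, heldout,
                grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the lam on the grid that maximizes held-out log-likelihood."""
    def loglik(lam):
        p = interpolated(p_bigram, p_unigram, lam)
        return sum(math.log(p(h, w)) for h, w in heldout)
    return max(grid, key=loglik)

# Toy models: the bigram model has only ever seen ("a", "b").
p_bi = lambda h, w: 0.9 if (h, w) == ("a", "b") else 0.0
p_uni = lambda w: 0.5
heldout = [("a", "b"), ("a", "c")]   # ("a", "c") is unseen by the bigram model
print(tune_lambda(p_bi, p_uni, heldout))
```

The unseen held-out event pushes the optimal weight toward the unigram backoff, which is exactly why λ must be estimated on held-out rather than training data.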

47
Adaptive Language Models
  • Cache Language Models
  • Topic Language Models
  • Maximum Entropy Language Models

48
Bigram Perplexity
  • Trained on 500 million words and tested on
    Encarta Encyclopedia

49
OOV Rate
  • OOV rate measured on Encarta Encyclopedia.
    Trained on 500 million words.

50
WSJ Results
  • Perplexity and word error rate on the 60000-word
    Wall Street Journal continuous speech recognition
    task
  • Unigrams, bigrams and trigrams were trained from
    260 million words
  • Smoothing mechanisms Kneser-Ney

51
Pattern Classification
  • Goal
  • Find optimal word sequence
  • Combine information (probabilities) from
  • Acoustic model
  • Word lexicon
  • Language model
  • Method
  • Decoder searches through all possible recognition
  • choices using a Viterbi decoding algorithm
  • Challenge
  • Efficient search through a large network space is
    computationally expensive for large vocabulary ASR

Pattern Classification
Confidence Scoring
Feature Extraction
Language Model
Word Lexicon
52
The Problem of Large Vocabulary ASR
  • The basic problem in ASR is to find the sequence
    of words
  • that explain the input signal. This implies the
    following mapping

Features -> HMM states -> HMM units -> Phones -> Words -> Sentences
For the WSJ 20K-word vocabulary, this results in a
network of 10^22 bytes!
  • State-of-the-art methods include fast match,
    multi-pass decoding, A* stack decoding, and finite
    state transducers; all provide tremendous speed-up
    by searching through the network and finding the
    best path that maximizes the likelihood function

53
Weighted Finite State Transducers (WFST)
  • Unified Mathematical framework to ASR
  • Efficiency in time and space

WFST
WordPhrase
WFST
Search Network
PhoneWord
Composition
Optimization
WFST
HMMPhone
A WFST can compile the network down to 10^8 states,
14 orders of magnitude more efficient
WFST
StateHMM
54
Word Pronunciation WFST
Pronunciation network for "Data" (arcs labeled
phone:output/weight, as best recoverable from the slide):
d:ε/1 -> ey:ε/.4 or ae:ε/.6 -> dx:ε/.8 or t:ε/.2 -> ax:data/1
55
Confidence Scoring
Goal: Identify possible recognition errors and
out-of-vocabulary events; potentially improves the
performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis
likelihood ratio test is associated with each
recognized word:
  Label: credit please
  Recognized: credit fees
  Confidence: (0.9) (0.3)
Challenges: Rejection of extraneous acoustic events
(noise, background speech, door slams) without
rejection of valid user input speech.
Pattern Classification
ConfidenceScoring
Feature Extraction
Language Model
Word Lexicon
56
Speech Recognition Process
TTS
ASR
SLU
SLG
DM
Acoustic Model
Input Speech
Pattern Classification (Decoding, Search)
Hello World
Feature Extraction
Confidence Scoring
(0.9) (0.8)
Language Model
Word Lexicon
57
How to evaluate performance?
  • Dictation applications Insertions, substitutions
    and deletions
  • Command-and-control: false rejection and false
    acceptance => ROC curves.

58
ASR Performance The State-of-the-art
59
Growth in Effective Recognition Vocabulary Size
60
Improvement in Word Accuracy
10-20% relative reduction per year.
Switchboard/Call Home Vocabulary 40,000 words
Perplexity 85
61
Human Speech Recognition vs ASR
(Chart: ratio of machine to human word error rates, on a
scale from x1 to x100; the region where machines would
outperform humans is marked.)
62
Challenges in ASR
  • System Performance
  • - Accuracy
  • - Efficiency (speed, memory)
  • - Robustness
  • Operational Performance
  • - End-point detection
  • - User barge-in
  • - Utterance rejection
  • - Confidence scoring

Machines are 10-100 times less accurate than
humans
63
Large-Vocabulary ASR Demo
64
Multimedia Customer Care
Courtesy of AT&T
65
Voice-enabled System Technology Components
Speech
Speech

TTS
ASR
Automatic SpeechRecognition
Text-to-SpeechSynthesis
Data, Rules
Words
Words
SLU
SLG
Spoken Language Generation
Spoken LanguageUnderstanding
Meaning
Action
DM
DialogManagement
66
Spoken Language Understanding (SLU)
  • Goal: Extract and interpret the meaning of the
    recognized speech so as to identify a user's
    request.
  • accurate understanding can often be achieved
    without correctly recognizing every word
  • SLU makes it possible to offer natural-language
    based services where the customer can speak
    openly without learning a specific set of
    terms.
  • Methodology Exploit task grammar (syntax) and
    task semantics to restrict the range of meaning
    associated with the recognized word string.
  • Applications Automation of complex
    operator-based tasks, e.g., customer care,
    customer help lines, etc.

67
SLU Formulation
  • Let W be a sequence of words and C be its
    underlying meaning (conceptual structure); then,
    using Bayes' rule:
    C* = argmax_C P(C|W) = argmax_C P(W|C) P(C)
  • Finding the best conceptual structure can be done
    by parsing and ranking using a combination of
    acoustic, linguistic and semantic scores.
68
Knowledge Sources for Speech Understanding
DM
ASR
SLU
Phonotactic
Syntactic
Pragmatic
Acoustic/ Phonetic
Semantic
Relationship of speech sounds and English
phonemes
Rules for phoneme sequences and pronunciation
Structure of words, phrases in a sentence
Relationship and meanings among words
Discourse, interaction history, world knowledge
Acoustic Model
Word Lexicon
Language Model
Understanding Model
Dialog Manager
69
SLU Components
From ASR (text, lattices, n-best, history)
Text Normalization
Morphology, Synonyms
Database Access
Extracting named entities, semantic concepts,
syntactic tree
Parsing/ Decoding
Interpretation
Slot filling, reasoning, task knowledge
representation
To DM (concepts, entities, parse tree)
70
Text Normalization
Text Normalization
Goal: Reduce language variation (lexical analysis)
Parsing/ Decoding
  • Morphology
  • Decomposing words into their minimal unit
  • of meaning (or grammatical analysis)

Interpretation
  • Synonyms
  • Finding words that mean the same (hello, hi, how
    do you do)
  • Equalization
  • - disfluencies, non-alphanumeric, capitals,
    non-white space characters, etc.

71
Parsing/Decoding
Text Normalization
Goal: Map textual representations
into semantic concepts using knowledge
rules and/or data.
Parsing/ Decoding
Interpretation
  • Entity Extraction (Matcher)
  • Finite state machines (FSM)
  • Context Free Grammar (CFG)
  • Semantic Parsing
  • Tree structured meaning representation allowing
  • arbitrary nesting of concepts
  • Decoding
  • Segment an utterance into phrases each
    representing a
  • concept (e.g. using HMMs)
  • Classification
  • Categorizing an utterance into one or more
    semantic concepts.

72
Interpretation
Text Normalization
Goal: Interpret the user's utterance in the context
of the dialog.
Parsing/ Decoding

Interpretation
  • Ellipses and anaphora
  • What about for New York?
  • History Mechanism
  • Communicated messages are differential.
  • Removing ambiguities
  • Semantic Frames (slots)
  • Associating entities and relations
    (attribute/value pairs)
  • Rule-based
  • Database Look-up
  • Retrieving or checking entities

73
DARPA Communicator
  • Darpa sponsored research and development of
    mixed-initiative dialogue systems
  • Travel task involving airline, hotel and car
    information and reservation
  • Yeah I uh I would like to go from New York to
    Boston tomorrow night with United
  • SLU output (Concept decoding)

XML Schema:
<itinerary>
  <origin> <city></city> <state></state> </origin>
  <destination> <city></city> <state></state> </destination>
  <date></date>
  <time></time>
  <airline></airline>
</itinerary>

Topic: Itinerary
Origin: New York
Destination: Boston
Day of the week: Sunday
Date: May 25th, 2002
Time: >6pm
Airline: United
74
Semantic CFG
  <rule name="itinerary">
    <o>Show me flights</o> <ruleref name="origin"/>
    <ruleref name="destination"/>
  </rule>
  <rule name="origin">
    from <ruleref name="city"/>
  </rule>
  <rule name="destination">
    to <ruleref name="city"/>
  </rule>
  <rule name="city">
    New York | San Francisco | Boston
  </rule>

75
AT&T How May I Help You? - Customer Care Services
  • User responds with unconstrained fluent speech
  • System recognizes salient phrases and determines
    meaning of users speech, then routes the call

There is a number on my bill I didn't make
unrecognized number
How May I help You?
Account balance
Unrecognized number
billing credit
Agent
Combined bill
76
Voice-enabled System Technology Components
Speech
Speech

TTS
ASR
Automatic SpeechRecognition
Text-to-SpeechSynthesis
Data, Rules
Words
Words
SLU
SLG
Spoken Language Generation
Spoken LanguageUnderstanding
Meaning
Action
DM
DialogManagement
77
Dialog Management (DM)
  • Goal
  • Manage elaborate exchanges with the user
  • Provide automated access to information
  • Implementation
  • Mathematical model programmed from rules and/or
    data

78
Computational Models for DM
  • Structural-based Approach
  • Assumes that there exists a regular structure in
    dialog that can be modeled as a state transition
    network or dialog grammars
  • Dialog strategies need to be predefined, thus
    limiting flexibility
  • Several real-time deployed systems exist today
    which have inference engines and knowledge
    representation
  • Plan-based Approach
  • Considers communication as acting. Dialog acts
    and actions are oriented toward goal achievement
  • Motivated by human/human interaction in which
    humans generally have goals and plans when
    interacting with others
  • Aims to account for general models and theories
    of discourse

79
DM Technology Components
TTS
ASR
SLU
SLG
DM
Context Interpretation
Dialog strategies (modules)
Backend (Database access)
80
Context Interpretation
Context Interpretation
Dialog Strategies
User Input
Backend
Action
Current Context
New Context
State(t1)
State(t)
  • A formal representation of the context history
    is necessary for the DM to interpret a user's
    utterance given previously exchanged utterances
    and to identify a new action.
  • Natural communication is a differential process

81
Dialog Strategies
Context Interpretation
Dialog Strategies
  • Completion (continuation) elicits missing input
    information from the user
  • Constraining (disambiguation) reduces the scope
    of the request when multiple information has been
    retrieved
  • Relaxation increases the scope of the request
    when no information has been retrieved
  • Confirmation: verifies that the system understood
    correctly; the strategy may differ depending on
    the SLU confidence measure.
  • Reprompting used when the system expected input
    but did not receive any or did not understand
    what it received.
  • Context Help provides user with help during
    periods of misunderstanding of either the system
    or the user.
  • Greeting/Closing maintains social protocol at
    the beginning and end of an interaction
  • Mixed-initiative allows users to manage the
    dialog.

Backend
82
Mixed-initiative Dialog
Who manages the dialog?
System
User
Initiative
Please say collect, calling card, third number.
How can I help you?
83
Example Mixed-Initiative Dialog
  • System Initiative

System: Please say just your departure city.
User: Chicago
System: Please say just your arrival city.
User: Newark
Long dialogs but easier to design
  • Mixed Initiative

System: Please say your departure city.
User: I need to travel from Chicago to Newark tomorrow.
  • User Initiative

Shorter dialogs (better user experience) but more
difficult to design
System: How may I help you?
User: I need to travel from Chicago to Newark tomorrow.
84
Dialog Evaluation
  • Why?
  • Identify problematic situations (task failure)
  • Minimize user hang-ups and routing to an
    operator
  • Improve user experience
  • Perform data analysis and monitoring
  • Conduct data mining for sales and marketing
  • How?
  • Log exchange features from ASR, SLU, DM, ANI
  • Define and automatically optimize a task
    objective function
  • Compute service statistics (Instant and delayed)

85
Optimizing Task Objective Function
  • User Satisfaction is the ultimate objective of a
    dialog system
  • Task completion rate
  • Efficiency of the dialog interaction
  • Usability of the system
  • Perceived intelligibility of the system
  • Quality of the audio output
  • Perplexity and quality of the response
    generation
  • Machine learning techniques can be applied to
    predict user satisfaction (e.g., Paradise
    framework)

86
Example of Plan-based Dialog Systems.
TRIPS System Architecture
87

Example of Structure-based Dialog Systems:
Help Desk (AT&T Labs Natural Voices™)
  • A finite state engine is used to control the
    action of the interpreter output.
  • The system routes customers to appropriate agents
    or departments
  • Provide information about products, services,
    costs and the business.
  • Show demonstrations of various voice fonts and
    languages.
  • Troubleshoot problems and concerns raised by
    customers (in progress)

88
Voice-enabled System Technology Components
Speech
Speech

TTS
ASR
Automatic SpeechRecognition
Text-to-SpeechSynthesis
Data, Rules
Words to be synthesized
Words spoken
Words
Words
SLU
SLG
Spoken Language Generation
Spoken LanguageUnderstanding
Meaning
DM
Action
Meaning
DialogManagement
89
A Good User Interface
  • Makes the application easy to use
  • Makes application robust to the kinds of
    confusion that arise in human-machine
    communications by voice
  • keeps the conversation moving forward, even in
    periods of great uncertainty on the parts of
    either the user or the machine
  • A great UI cannot save a system with poor ASR
    and NLU, but UI can make or break a system, even
    with excellent ASR and NLU
  • Effective UI design is based on a set of
    elementary principles: common widgets, sequenced
    screen presentations, simple error-trap dialogs,
    a user manual.

90
Correction UI
  • N-best list

91
Multimodal System Technology Components
Speech
Speech
Pen Gesture
Visual

TTS
ASR
Automatic SpeechRecognition
Text-to-SpeechSynthesis
Data, Rules
Words
Words
SLU
SLG
Spoken Language Generation
Spoken LanguageUnderstanding
Action
Meaning
DM
DialogManagement
92
Multimodal Experience
  • Access to information anytime and anywhere using
    any device
  • Most appropriate UI mode or combination of
    modes depends on the device, the task, the
    environment, and the users abilities and
    preferences.

93
Multimodal Architecture
Language model
What does that mean? P(Sn|F, Sn-1)
What the user might do (say): P(F|x, Sn-1)
Semantic model
Parsing
x (Multimodal Inputs)
F (Surface Semantics)
Understanding
Application Logic
Sn (Discourse Semantics)
Rendering
A (Multimedia Outputs)
Behavior model
What to show (say) to the user: P(A|Sn)
94
MIPad
  • Multimodal Interactive Pad
  • Usability studies show double throughput for
    English
  • Speech is mostly useful in cases with lots of
    alternatives

95
MiPad video
96
Speech-enabled MapPoint
97
Other Research Areas not covered
  • Speaker recognition (verification and
    identification)
  • Language identification
  • Human/Human and Human/Machine translation
  • Multimedia and document retrieval
  • Speech coding
  • Microphone array processing
  • Multilingual speech recognition

98
Application Development
99
Speech APIs
  • Open-standard APIs provide separation of the
    ASR/TTS engines and platform from the application
    layer.
  • Application is engine independent and contains
    the SLU, DM, Content Server and Host Interface

100
Voice XML Architecture

Multimedia
HTML
Scripts
Voice XML Gateway
VoiceXML
Audio/Grammars
Voice Browser
Web Server
The temperature is
101
A VoiceXML example

<?xml version="1.0"?>
<vxml application="tutorial.vxml" version="1.0">
  <form id="getPhoneNumber">
    <field name="PhoneNumber">
      <prompt>What's your phone number?</prompt>
      <grammar src="../grammars/phone.gram"
               type="application/x-jsgf" />
      <help>Please say your ten digit phone number.</help>
      <if cond="PhoneNumber &lt; 1000000"/>
    </field>
  </form>
</vxml>
102
Speech Application Language Tags (SALT)
  • Add 4 tags to HTML/XHTML/cHTML/WML: <listen>,
    <prompt>, <dtmf>, <smex>

<html>
  <form action="nextpage.html">
    <input type="text" id="txtBoxCity" />
  </form>
  <listen id="reco1">
    <grammar src="cities.gram" />
    <bind targetElement="txtBoxCity" value="city" />
  </listen>
</html>
103
The Speech Advantage!
  • Reduce costs
  • Reduce labor expenses while still providing
    customers an easy-to-use and natural way to
    access information and services
  • New revenue opportunities
  • 24x7 high-quality customer care automation
  • Access to information without a keyboard or
    touch-tone
  • Customer retention
  • Stickiness of services
  • Add personalization

104
Just the Beginning
105
Market Opportunities
  • Consumer communication
  • Voice portals
  • Desktop dictation
  • - Telematic applications
  • - Disabilities
  • Call center automation
  • - Help Lines
  • - Customer care
  • Voice-assisted E-commerce
  • - B2B
  • Enterprise Communication
  • - Unified messaging
  • - Enterprise sales

106
A look into the future..
Keyword spotting Handcrafted grammars
No dialogue
  • Constrained speech
  • Minimal data collection
  • Manual design

Directory Assistance VRCP
1990
Airline reservation Banking
  • Constrained speech
  • Moderate data collection
  • Some automation

Medium size ASR Handcrafted Grammars
System Initiative
1995
MATCH Multimodal Access To City Help
  • Spontaneous speech
  • Extensive data collection
  • Semi-automation

Large size ASR Limited NLU
Mixed-initiative
Call centers, E-commerce
2000
  • Spontaneous speech/pen
  • Fully automated systems

Multimodal, Multilingual Help Desks, E-commerce
Unlimited ASR Deeper NLU Adaptive
systems
2005
107
Spoken Dialog Interfaces Vs. Touch-Tone IVR?
108
Reading Materials
  • Spoken Language Processing, X. Huang, A. Acero,
    H-W. Hon, Prentice Hall, 2001.
  • Fundamentals of Speech Recognition, L. Rabiner
    and B.H. Juang, Prentice Hall, 1993.
  • Automatic Speech and Speaker Recognition, C.H.
    Lee, F. Soong, K. Paliwal, Kluwer Academic
    Publishers, 1996.
  • Spoken Dialogues with Computers, Editor R.
    De-Mori, Academic Press 1998.