Seminar

1

Seminar
Speech Recognition
2003
E.M. Bakker
LIACS Media Lab
Leiden University

2
Outline

Introduction and State of the Art
A Speech Recognition Architecture
Acoustic modeling
Language modeling
Practical issues
Applications
NB: Some of the slides are adapted from the
presentation "Can Advances in Speech Recognition
make Spoken Language as Convenient and as
Accessible as Online Text?", an excellent
presentation by Dr. Patti Price (Speech
Technology Consulting, Menlo Park, California
94025) and Dr. Joseph Picone (Institute for Signal
and Information Processing, Dept. of Elect. and
Comp. Eng., Mississippi State University).

3
Research Areas

Speech Analysis (Production, Perception,
Parameter Estimation)
Speech Coding/Compression
Speech Synthesis (TTS)
Speaker Identification/Recognition/Verification
(Sprint, TI)
Language Identification (Transparent Dialogue)
Speech Recognition (Dragon, IBM, AT&T)
Speech recognition sub-categories
Discrete/Connected/Continuous Speech/Word
Spotting
Speaker Dependent/Independent
Small/Medium/Large/Unlimited Vocabulary
Speaker-Independent Large Vocabulary Continuous
Speech Recognition (LVCSR for short)

4
Introduction: What is Speech Recognition?
Goal: Automatically extract the string of words
spoken from the speech signal.

Other interesting areas
Who the talker is (speaker recognition,
identification)
Speech output (speech synthesis)
What the words mean (speech understanding,
semantics)

5
Introduction: Applications

Command and control
Manufacturing
Consumer products

http://www.speech.philips.com

Database query
Resource management
Air travel information
Stock quote

Nuance, American Airlines 1-800-433-7300, touch 1

Dictation
http://www.lhsl.com/contacts/
http://www-4.ibm.com/software/speech
http://www.microsoft.com/speech/

6
Introduction: State of the Art

Speech-recognition software
IBM (Via Voice, Voice Server Applications,...)
Speaker independent, continuous command
recognition
Large vocabulary recognition
Text-to-speech confirmation
Barge in (The ability to interrupt an audio
prompt as it is playing)
Dragon Systems, Lernout & Hauspie (L&H Voice
Xpress)
Philips
Dictation
Telephone
Voice Control (SpeechWave, VoCon SDK, chip-sets)
Microsoft (Whisper, Dr Who)

7
Introduction: State of the Art

Speech over the telephone.
AT&T Bell Labs pioneered the use of
speech-recognition systems for telephone
transactions.
Companies such as Nuance, Philips, and SpeechWorks
have been active in this field for some years now.
IBM: applications over the telephone
request news, internet pages, e-mail
stock quotes, traveling info
weather information

8
Introduction: State of the Art

Speech over the telephone (Philips)
SpeechPearl: large vocabulary natural language
recognition (up to 200,000 words).
SpeechMania: mixed-initiative dialog gives the
caller the impression of a truly natural
dialogue; full replacement of the human operator.
SpeechWave: relatively small vocabularies (up to
hundreds of words), available in nearly 40
languages.
Voice ReQuest: the system recognizes the request
and routes the call to the appropriate extension,
all without the intervention of an operator.

9
Introduction: State of the Art

Speech over the telephone.
Brokerage: E*Trade, ...
Banking
Travel: United Airlines, Continental Airlines
BellSouth
HP
Federal Express
Foodline

10
Introduction: State of the Art

Speech over the telephone.

11
Introduction: State of the Art

The following is a transcript of a telephone
conversation between Victor Zue and the Jupiter
system for weather information. Jupiter's
toll-free number is 1-888-573-8255.
[Phone ringing]
Jupiter: Hi. Welcome to Jupiter, the M.I.T. Lab
for Computer Science Weather Information System.
This call is being recorded for system
development. You may hang up or ask for help at
any time. How can I help you?
V.Z.: What is the forecast for Boston today?
Jupiter: In Boston this afternoon, partly sunny
and windy. High around 50. Tonight, clear, lows
around 35. What other information can I give you?
V.Z.: Will it rain tomorrow?
Jupiter: To my knowledge, the forecast calls for
no rain in Boston tomorrow. Is there anything
else?

12
Introduction: State of the Art

V.Z.: How about Seattle?
Jupiter: To my knowledge, the forecast calls for
no rain in Seattle tomorrow. Is there something
else you would like to know?
V.Z.: Is it sunny anywhere in the Caribbean?
Jupiter: Sorry, I am not sure what you said.
V.Z.: Is it sunny anywhere in the Caribbean?
Jupiter: I am sorry, I may not know the city you
are asking about. Try asking for the state or
country. For example, what cities do you know
about in Massachusetts?
V.Z.: What is the temperature in Montego Bay?
Jupiter: In Montego Bay today, high 86 and low
73. Is there something else?
V.Z.: Good-bye.

13
Factors that Affect Performance of Speech
Recognition Systems
14
How Do You Measure the Performance?

USC, October 15, 1999: "the world's first machine
system that can recognize spoken words better
than humans can."
In benchmark testing using just a few spoken
words, USC's Berger-Liaw System not only bested
all existing computer speech recognition systems
but outperformed the keenest human ears.
What benchmarks?
What was training?
What was the test?
Were they independent?
How large was the vocabulary and the sample size?
Did they really test all existing systems?
Is that different from chance?
Was the noise added or coincident with speech?
What kind of noise? Was it independent of the
speech?

15
Evaluation Metrics

Spontaneous telephone speech is still a grand
challenge.
Telephone-quality speech is still central to the
problem.
Broadcast news is a very dynamic domain.

[Figure: Word Error Rate (WER, axis 0 to 40) versus Level Of Difficulty. Tasks, listed from the top of the chart down: Conversational Speech, Broadcast News, Read Speech, Continuous Digits, Letters and Numbers, Digits, Command and Control.]
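
WER is commonly computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the scoring tool used for the benchmarks on this slide; the function name `word_error_rate` and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "what is the forecast for boston today"
    hyp = "what is the forecast for austin today"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.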
16
Evaluation Metrics: Human Performance

Human performance exceeds machine
performance by a factor ranging from
4x to 10x depending on the task.
On some tasks, such as credit card number
recognition, machine performance exceeds humans
due to human memory retrieval capacity.
The nature of the noise is as important as the
SNR (e.g., cellular phones).
A primary failure mode for humans is inattention.
A second major failure mode is the lack of
familiarity with the domain (i.e., business
terms and corporation names).

[Figure: Word Error Rate (axis 0 to 20) versus Speech-To-Noise Ratio (Quiet, 10 dB, 16 dB, 22 dB) for Wall Street Journal (Additive Noise). Curves: Machines, Human Listeners (Committee).]
17
Evaluation Metrics: Machine Performance
[Figure: machine performance by year, 1988 to 2003; vertical axis ticks 1, 10, 100. Labels: Read Speech, Conversational Speech, Broadcast Speech, Spontaneous Speech, Varied Microphones, Noisy, Foreign, and 1k, 5k, and 20k vocabularies. Annotation: 10 X.]
18
What does a speech signal look like?
19
Spectrogram
20
Speech Recognition
21
Recognition Architectures: Why Is Speech
Recognition So Difficult?
[Figure: phone classes Ph_1, Ph_2, and Ph_3 plotted in a two-dimensional feature space (Feature No. 1 versus Feature No. 2).]

Measurements of the
signal are ambiguous.
Region of overlap represents classification
errors.
Reduce overlap by introducing acoustic and
linguistic context (e.g., context-dependent
phones).
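
The link between overlap and classification error can be made concrete in one dimension. The sketch below assumes two phone classes whose single feature is Gaussian with equal variance and equal priors (a simplification chosen for illustration, not taken from the slides); in that case the best decision boundary is the midpoint between the means, and the error rate is the tail mass beyond it. The function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def overlap_error(mean_a: float, mean_b: float, sigma: float) -> float:
    """Minimum error rate for two equally likely classes with Gaussian
    features of equal standard deviation sigma.

    The decision boundary sits halfway between the means, so each class
    loses the probability mass lying on the wrong side of it.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    separation = abs(mean_a - mean_b) / sigma
    return normal_cdf(-separation / 2.0)


if __name__ == "__main__":
    # Pulling the class means apart (or shrinking the variance) reduces
    # the overlap and therefore the error.
    for gap in (0.0, 1.0, 2.0, 4.0):
        print(f"separation {gap:.1f} sigma -> error {overlap_error(0.0, gap, 1.0):.4f}")
```

Under these assumptions, anything that sharpens the class distributions, such as conditioning on context, shows up directly as a lower error rate.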

22
Overlap in the cepstral space (alphadigits)
Female iy
Female aa
Male iy
Male aa
23
Overlap in the cepstral space (alphadigits)
Male iy (blue) vs. Female iy (red)
Male aa (green) vs. Female aa (black)

Combined Comparisons
Male "aa" (green)
Female "aa" (black)
Male "iy" (blue)
Female "iy" (red)

24
OVERLAP IN THE CEPSTRAL SPACE (SWB-All)
The following plots demonstrate overlap of
recognition features in the cepstral space. These
plots consist of all vowels excised from tokens
in the SWITCHBOARD conversational speech corpus.
All Male Vowels
All Vowels
All Female Vowels
25
Recognition Architectures: A Communication
Theoretic Approach
[Figure: Message Source -> Linguistic Channel -> Articulatory Channel -> Acoustic Channel. Observables: Message, Words, Sounds, Features.]

Bayesian formulation for speech recognition
(W: word sequence, A: acoustic observations):
P(W|A) = P(A|W) P(W) / P(A)

Objective: minimize the word error rate.
Approach: maximize P(W|A) during training.

Components:
P(A|W): acoustic model (hidden Markov models,
mixtures)
P(W): language model (statistical, finite
state networks, etc.)
The language model typically predicts a small set
of next words based on
knowledge of a finite number of previous words
(N-grams).
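
A toy sketch of how the two components combine at decoding time: since P(A) is the same for every candidate word sequence, the recognizer can pick the W that maximizes log P(A|W) + log P(W). Everything here is assumed for illustration: the tiny training corpus, the add-alpha smoothing, and the acoustic log-likelihoods, which are made-up numbers standing in for an acoustic model's output.

```python
import math
from collections import Counter


def train_bigram(sentences, alpha=1.0):
    """Return a function giving add-alpha smoothed log P(word | previous)."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    vocab_size = len(vocab)

    def log_prob(prev, word):
        numerator = bigram_counts[(prev, word)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        return math.log(numerator / denominator)

    return log_prob


def sentence_log_prob(log_prob, sentence):
    """log P(W) under the bigram model, with sentence boundary markers."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(log_prob(prev, word) for prev, word in zip(words, words[1:]))


def decode(acoustic_log_likelihoods, log_prob):
    """Choose argmax over W of log P(A|W) + log P(W)."""
    return max(
        acoustic_log_likelihoods,
        key=lambda w: acoustic_log_likelihoods[w] + sentence_log_prob(log_prob, w),
    )


if __name__ == "__main__":
    lm = train_bigram(["we recognize speech", "they recognize speech", "a nice beach"])
    # Hypothetical acoustic scores: the acoustics slightly prefer the wrong string.
    acoustic = {"recognize speech": -10.0, "wreck a nice beach": -9.5}
    print(decode(acoustic, lm))
```

In this example the language model overrides a small acoustic preference for the less plausible word string. Real systems use far larger N-gram models and usually weight the two scores differently.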

26
Recognition Architectures: Incorporating Multiple
Knowledge Sources
[Figure: block diagram; surviving labels: Input Speech, Language Model P(W).]
27
Acoustic Modeling: Feature Extraction
[Figure: feature extraction pipeline. Input Speech -> Fourier Transform -> Cepstral Analysis -> Perceptual Weighting, producing Energy + Mel-Spaced Cepstrum. A Time Derivative stage produces Delta Energy + Delta Cepstrum; a second Time Derivative stage produces Delta-Delta Energy + Delta-Delta Cepstrum.]
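
The time-derivative stages of this pipeline can be sketched in a few lines. The example below assumes the static features (energy and cepstral coefficients) have already been computed for each frame, and it approximates each derivative with a simple central difference; production front ends typically use a regression over a wider window. The function names `delta` and `add_dynamic_features` are illustrative.

```python
def delta(frames):
    """Approximate the per-frame time derivative of each feature.

    Uses a central difference (next frame minus previous frame, halved);
    the first and last frames are replicated at the edges.
    """
    n = len(frames)
    result = []
    for t in range(n):
        previous = frames[max(t - 1, 0)]
        following = frames[min(t + 1, n - 1)]
        result.append([(b - a) / 2.0 for a, b in zip(previous, following)])
    return result


def add_dynamic_features(frames):
    """Append delta and delta-delta features to each static feature vector."""
    first = delta(frames)
    second = delta(first)
    return [list(f) + d1 + d2 for f, d1, d2 in zip(frames, first, second)]


if __name__ == "__main__":
    # Four frames with one static feature each (for example, log energy).
    static = [[0.0], [1.0], [2.0], [3.0]]
    for vector in add_dynamic_features(static):
        print(vector)
```

Each output vector is three times the length of the static vector: static, delta, and delta-delta features side by side.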
28
Acoustic Modeling: Hidden Markov Models

Acoustic models encode the temporal evolution of
the features (spectrum).
Gaussian mixture distributions are used to
account for variations in speaker, accent, and
pronunciation.
Phonetic model topologies are simple
left-to-right structures.
Skip states (time-warping) and multiple paths
(alternate pronunciations) are also common
features of models.
Sharing model parameters is a common strategy to
reduce complexity.
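
As a rough sketch of these ideas, the code below scores an observation sequence against a three-state left-to-right HMM with one skip transition, using the forward algorithm in the log domain. To stay short it assumes a one-dimensional feature and a single Gaussian per state rather than a mixture; the transition probabilities, means, and variances are invented for illustration.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def to_log(matrix):
    """Convert a probability matrix to logs; zero becomes -infinity."""
    return [[math.log(p) if p > 0 else -math.inf for p in row] for row in matrix]


def forward_log_likelihood(observations, log_trans, means, variances):
    """log P(observations | model) for an HMM that starts in state 0 and
    must finish in the last state."""
    n = len(means)
    alpha = [-math.inf] * n
    alpha[0] = log_gaussian(observations[0], means[0], variances[0])
    for x in observations[1:]:
        alpha = [
            log_sum_exp([alpha[i] + log_trans[i][j] for i in range(n)])
            + log_gaussian(x, means[j], variances[j])
            for j in range(n)
        ]
    return alpha[-1]


# Left-to-right topology: self-loops, forward steps, and one skip (0 -> 2).
LOG_TRANS = to_log([
    [0.6, 0.3, 0.1],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])
MEANS = [0.0, 5.0, 10.0]
VARIANCES = [1.0, 1.0, 1.0]

if __name__ == "__main__":
    in_order = [0.1, 0.2, 5.1, 4.9, 10.2, 9.8]
    print(forward_log_likelihood(in_order, LOG_TRANS, MEANS, VARIANCES))
    print(forward_log_likelihood(in_order[::-1], LOG_TRANS, MEANS, VARIANCES))
```

The zero entries below the diagonal are what make the topology left-to-right: an observation sequence that moves through the states in order scores much higher than the same values reversed.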

29
Acoustic Modeling: Parameter Estimation

Word-level transcription
Supervises closed-loop, data-driven modeling
Initial parameter estimation
The expectation/maximization (EM) algorithm is
used to improve our parameter estimates.
Computationally efficient training algorithms
(Forward-Backward) are crucial.
Batch mode parameter updates are typically
preferred.
Decision trees and additional
linguistic knowledge are used to optimize
parameter sharing and system complexity.
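
The EM idea can be illustrated on a much smaller problem than HMM training. The sketch below fits a one-dimensional Gaussian mixture with batch updates: the E-step computes each component's responsibility for each data point, and the M-step re-estimates weights, means, and variances from those responsibilities. This is plain mixture EM, not the Forward-Backward procedure used for full HMMs; the data, initial values, and variance floor are assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=20, var_floor=1e-6):
    """Batch EM for a 1-D Gaussian mixture.

    Returns the updated parameters plus the data log-likelihood measured
    at the start of each iteration.
    """
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    log_likelihoods = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        responsibilities = []
        total = 0.0
        for x in data:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            norm = sum(joint)
            total += math.log(norm)
            responsibilities.append([p / norm for p in joint])
        log_likelihoods.append(total)

        # M-step: re-estimate all parameters from the whole batch at once.
        for c in range(k):
            mass = sum(r[c] for r in responsibilities)
            weights[c] = mass / len(data)
            means[c] = sum(r[c] * x for r, x in zip(responsibilities, data)) / mass
            spread = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(responsibilities, data)) / mass
            variances[c] = max(spread, var_floor)
    return means, variances, weights, log_likelihoods


if __name__ == "__main__":
    data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 4.8, 5.1]
    means, variances, weights, history = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
    print("means:", [round(m, 3) for m in means])
    print("log-likelihood, first and last iteration:", history[0], history[-1])
```

A useful sanity check when implementing EM is that the log-likelihood never decreases from one iteration to the next, provided the variance floor is not hit.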

30
Language Modeling: Is A Lot Like Wheel of Fortune
31
Language Modeling: N-Grams: The Good, The Bad, and
The Ugly
32
Language Modeling: Integration of Natural Language
33
Implementation Issues: Dynamic Programming-Based
Search
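
A common dynamic-programming search in HMM-based recognizers is the Viterbi algorithm, which keeps only the best-scoring path into each state at each time step. The sketch below is a generic textbook version on a two-state toy model, not the decoder described in the original slides; the state names, observation symbols, and probabilities are invented for illustration.

```python
import math


def logs(table):
    """Convert a dict of probabilities (possibly nested) to log values."""
    return {
        key: logs(value) if isinstance(value, dict) else math.log(value)
        for key, value in table.items()
    }


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log probability."""
    # best[s]: log probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


STATES = ["sil", "speech"]
LOG_START = logs({"sil": 0.8, "speech": 0.2})
LOG_TRANS = logs({
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
})
LOG_EMIT = logs({
    "sil": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
})

if __name__ == "__main__":
    path, score = viterbi(["low", "high", "high", "low"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print(path, math.exp(score))
```

The cost grows with the number of states times the number of incoming transitions per frame, which is why large-vocabulary decoders add pruning on top of this basic recursion.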
34
Implementation Issues: Cross-Word Decoding Is
Expensive

Cross-word decoding: since word boundaries don't
occur in spontaneous speech, we must allow for
sequences of sounds that span word boundaries.
Cross-word decoding significantly increases
memory requirements.

35
Implementation Issues: Search Is Resource
Intensive

Typical LVCSR systems have about 10M free
parameters, which makes training a challenge.
Large speech databases are required (several
hundred hours of speech).
Tying, smoothing, and interpolation are required.

36
Applications: Conversational Speech

Conversational speech collected over the
telephone contains background noise, music,
fluctuations in the speech rate, laughter,
partial words, hesitations, mouth noises, etc.
WER (Word Error Rate) has decreased from 100% to
30% in six years.

Laughter
Singing
Unintelligible
Spoonerism
Background Speech
No pauses
Restarts
Vocalized Noise
Coinage

37
Applications: Audio Indexing of Broadcast News

Broadcast news offers some unique
challenges:
Lexicon: important information in
infrequently occurring words
Acoustic Modeling: variations in channel,
particularly within the same segment (in the
studio vs. on location)
Language Model: must adapt (Bush,
Clinton, Bush, McCain, ???)
Language: multilingual systems?
language-independent acoustic modeling?

38
Applications: Automatic Phone Centers

Portals: BeVocal, Tellme, HeyAnita
VoiceXML 2.0
Automatic Information Desk
Reservation Desk
Automatic Help-Desk
With Speaker identification
bank account services
e-mail services
corporate services

39
Applications: Real-Time Translation

From President Clinton's State of the Union
address (January 27, 2000):
"These kinds of innovations are also propelling
our remarkable prosperity...
Soon researchers will bring us devices that can
translate foreign languages as fast as you can
talk... molecular computers the size of a tear
drop with the power of today's fastest
supercomputers."

Human Language Engineering: a sophisticated
integration of many speech and language related
technologies... a science for the next
millennium.

40
Technology: Future Directions

The algorithmic issues for the next decade:
Better features by extracting articulatory
information?
Bayesian statistics? Bayesian networks?
Decision Trees? Information-theoretic measures?
Nonlinear dynamics? Chaos?
