Seminar - PowerPoint PPT Presentation

About This Presentation
Title:

Seminar

Description:

Speech Recognition 2003 E.M. Bakker LIACS Media Lab Leiden University Outline Introduction and State of the Art A Speech Recognition Architecture Acoustic modeling ... – PowerPoint PPT presentation

Transcript and Presenter's Notes

1
  • Seminar
  • Speech Recognition
  • 2003
  • E.M. Bakker
  • LIACS Media Lab
  • Leiden University

2
Outline
  • Introduction and State of the Art
  • A Speech Recognition Architecture
  • Acoustic modeling
  • Language modeling
  • Practical issues
  • Applications
  • NB: Some of the slides are adapted from the
    presentation "Can Advances in Speech Recognition
    make Spoken Language as Convenient and as
    Accessible as Online Text?", an excellent
    presentation by Dr. Patti Price (Speech
    Technology Consulting, Menlo Park, California
    94025) and Dr. Joseph Picone (Institute for Signal
    and Information Processing, Dept. of Elect. and
    Comp. Eng., Mississippi State University).

3
Research Areas
  • Speech Analysis (Production, Perception,
    Parameter Estimation)
  • Speech Coding/Compression
  • Speech Synthesis (TTS)
  • Speaker Identification/Recognition/Verification
    (Sprint, TI)
  • Language Identification (Transparent Dialogue)
  • Speech Recognition (Dragon, IBM, AT&T)
  • Speech recognition sub-categories
  • Discrete/Connected/Continuous Speech/Word
    Spotting
  • Speaker Dependent/Independent
  • Small/Medium/Large/Unlimited Vocabulary
  • Speaker-Independent Large Vocabulary Continuous
    Speech Recognition (LVCSR for short)

4
Introduction: What is Speech Recognition?
Goal: Automatically extract the string of words
spoken from the speech signal.
  • Other interesting areas:
  • Who the talker is (speaker recognition,
    identification)
  • Speech output (speech synthesis)
  • What the words mean (speech understanding,
    semantics)

5
Introduction: Applications
  • Command and control
  • Manufacturing
  • Consumer products

http://www.speech.philips.com
  • Database query
  • Resource management
  • Air travel information
  • Stock quote

Nuance, American Airlines 1-800-433-7300, touch 1
  • Dictation
  • http://www.lhsl.com/contacts/
  • http://www-4.ibm.com/software/speech
  • http://www.microsoft.com/speech/

6
Introduction: State of the Art
  • Speech-recognition software
  • IBM (ViaVoice, Voice Server Applications,...)
  • Speaker independent, continuous command
    recognition
  • Large vocabulary recognition
  • Text-to-speech confirmation
  • Barge in (the ability to interrupt an audio
    prompt as it is playing)
  • Dragon Systems, Lernout & Hauspie (L&H Voice
    Xpress)
  • Philips
  • Dictation
  • Telephone
  • Voice Control (SpeechWave, VoCon SDK, chip-sets)
  • Microsoft (Whisper, Dr Who)

7
Introduction: State of the Art
  • Speech over the telephone.
  • AT&T Bell Labs pioneered the use of
    speech-recognition systems for telephone
    transactions.
  • Companies such as Nuance, Philips and SpeechWorks
    have been active in this field for some years now.
  • IBM: applications over the telephone
  • request news, internet pages, e-mail
  • stock quotes, traveling info
  • weather information

8
Introduction: State of the Art
  • Speech over the telephone (Philips)
  • SpeechPearl: large vocabulary natural language
    recognition (up to 200,000 words)
  • SpeechMania: mixed initiative dialog gives the
    caller the impression of a truly natural
    dialogue; full replacement of the human operator.
  • SpeechWave: relatively small vocabularies (up to
    hundreds of words), available in nearly 40
    languages
  • Voice ReQuest: the system recognizes the request
    and routes the call to the appropriate extension,
    all without the intervention of an operator.

9
Introduction: State of the Art
  • Speech over the telephone.
  • Brokerage: E*Trade, ...
  • Banking
  • Travel: United Airlines, Continental Airlines
  • BellSouth
  • HP
  • Federal Express
  • Foodline

10
Introduction: State of the Art
  • Speech over the telephone.

11
Introduction: State of the Art
  • The following is a transcript of a telephone
    conversation between Victor Zue and the Jupiter
    system for weather information. Jupiter's
    toll-free number is 1-888-573-8255.
  • (Phone ringing)
  • Jupiter: Hi. Welcome to Jupiter, the M.I.T. Lab
    for Computer Science Weather Information System.
    This call is being recorded for system
    development. You may hang up or ask for help at
    any time. How can I help you?
  • V.Z.: What is the forecast for Boston today?
  • Jupiter: In Boston this afternoon, partly sunny
    and windy. High around 50. Tonight, clear, lows
    around 35. What other information can I give you?
  • V.Z.: Will it rain tomorrow?
  • Jupiter: To my knowledge, the forecast calls for
    no rain in Boston tomorrow. Is there anything
    else?

12
Introduction: State of the Art
  • V.Z.: How about Seattle?
  • Jupiter: To my knowledge, the forecast calls for
    no rain in Seattle tomorrow. Is there something
    else you would like to know?
  • V.Z.: Is it sunny anywhere in the Caribbean?
  • Jupiter: Sorry, I am not sure what you said.
  • V.Z.: Is it sunny anywhere in the Caribbean?
  • Jupiter: I am sorry, I may not know the city you
    are asking about. Try asking for the state or
    country. For example, what cities do you know
    about in Massachusetts?
  • V.Z.: What is the temperature in Montego Bay?
  • Jupiter: In Montego Bay today, high 86 and low
    73. Is there something else?
  • V.Z.: Good-bye.

13
Factors that Affect Performance of Speech
Recognition Systems
14
How Do You Measure the Performance?
  • USC, October 15, 1999: the world's first machine
    system that can recognize spoken words better
    than humans can.
  • In benchmark testing using just a few spoken
    words, USC's Berger-Liaw System not only bested
    all existing computer speech recognition systems
    but outperformed the keenest human ears.
  • What benchmarks?
  • What was training?
  • What was the test?
  • Were they independent?
  • How large was the vocabulary and the sample size?
  • Did they really test all existing systems?
  • Is that different from chance?
  • Was the noise added or coincident with speech?
  • What kind of noise? Was it independent of the
    speech?

15
Evaluation Metrics
[Figure: Word Error Rate (WER, axis from 0 to 40)
plotted against Level of Difficulty. Tasks shown:
Command and Control, Digits, Letters and Numbers,
Continuous Digits, Read Speech, Broadcast News,
Conversational Speech.]
  • Spontaneous telephone speech is still a grand
    challenge.
  • Telephone-quality speech is still central to the
    problem.
  • Broadcast news is a very dynamic domain.
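
The word error rate on this slide can be illustrated with a minimal sketch. It assumes the usual definition (substitutions plus deletions plus insertions, divided by the number of reference words) and whitespace tokenization; the function name and the example sentences are hypothetical, not taken from the slides.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a 7-word reference.
print(word_error_rate("what is the forecast for boston today",
                      "what is the forecast for austin"))
```

Because insertions count as errors, the ratio can exceed 1.0; the slides report it as a percentage.
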
16
Evaluation Metrics: Human Performance
  • Human performance exceeds machine performance by
    a factor ranging from 4x to 10x depending on the
    task.
  • On some tasks, such as credit card number
    recognition, machine performance exceeds humans
    because of limits on human memory retrieval
    capacity.
  • The nature of the noise is as important as the
    SNR (e.g., cellular phones).
  • A primary failure mode for humans is inattention.
  • A second major failure mode is the lack of
    familiarity with the domain (e.g., business
    terms and corporation names).

[Figure: Word Error Rate (axis from 0 to 20) versus
Speech-To-Noise Ratio (Quiet, 10 dB, 16 dB, 22 dB)
on Wall Street Journal with additive noise, comparing
Machines with Human Listeners (Committee).]
17
Evaluation Metrics: Machine Performance
[Figure: vertical axis marked 1, 10, 100; horizontal
axis: years 1988 to 2003. Labels in the figure: Read
Speech, 1k, 5k, 20k vocabularies, Noisy, Varied
Microphones, Spontaneous Speech, Broadcast Speech,
Conversational Speech, (Foreign), 10 X.]
18
What does a speech signal look like?
19
Spectrogram
20
Speech Recognition
21
Recognition Architectures: Why Is Speech
Recognition So Difficult?
[Figure: three phone classes (Ph_1, Ph_2, Ph_3)
plotted in a two-dimensional feature space
(Feature No. 1 versus Feature No. 2).]
  • Measurements of the signal are ambiguous.
  • Region of overlap represents classification
    errors.
  • Reduce overlap by introducing acoustic and
    linguistic context (e.g., context-dependent
    phones).
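
To make the link between overlap and classification errors concrete, here is a minimal sketch under strong simplifying assumptions that are not in the slides: one feature instead of two, two classes instead of three, Gaussian class distributions with a shared standard deviation, and equal priors. The function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bayes_error_equal_variance(mu_a: float, mu_b: float, sigma: float) -> float:
    """Smallest achievable error for two equally likely 1-D Gaussian classes.

    The best decision boundary is the midpoint between the means, and the
    error is the probability mass each class leaves on the wrong side.
    """
    distance = abs(mu_a - mu_b) / sigma
    return normal_cdf(-distance / 2.0)


# The error shrinks as the class means move apart (less overlap).
for separation in (0.0, 1.0, 2.0, 4.0):
    print(separation, round(bayes_error_equal_variance(0.0, separation, 1.0), 4))
```

Adding context, as the last bullet suggests, can be read as moving to features in which the class distributions sit further apart.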

22
Overlap in the cepstral space (alphadigits)
Female iy
Female aa
Male iy
Male aa
23
Overlap in the cepstral space (alphadigits)
Male iy (blue) vs. Female iy (red)
Male aa (green) vs. Female aa (black)
  • Combined Comparisons
  • Male "aa" (green)
  • Female "aa" (black)
  • Male "iy" (blue)
  • Female "iy" (red)

24
OVERLAP IN THE CEPSTRAL SPACE (SWB-All)
The following plots demonstrate overlap of
recognition features in the cepstral space. These
plots consist of all vowels excised from tokens
in the SWITCHBOARD conversational speech corpus.
All Male Vowels
All Vowels
All Female Vowels
25
Recognition Architectures: A Communication
Theoretic Approach
[Diagram: Message Source, Linguistic Channel,
Articulatory Channel, Acoustic Channel. Observables:
Message, Words, Sounds, Features.]
  • Bayesian formulation for speech recognition,
    with W the word sequence and A the acoustic
    observations:
  • $P(W \mid A) = P(A \mid W)\,P(W) / P(A)$

Objective: minimize the word error rate.
Approach: maximize $P(W \mid A)$ during training.
  • Components:
  • $P(A \mid W)$: acoustic model (hidden Markov models,
    mixtures)
  • $P(W)$: language model (statistical, finite
    state networks, etc.)
  • The language model typically predicts a small set
    of next words based on knowledge of a finite
    number of previous words (N-grams).
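
The Bayesian formulation can be sketched as a toy decoder. The sketch assumes a bigram language model estimated by plain relative frequencies (no smoothing) and takes the acoustic log scores log P(A|W) as given numbers; the training sentences, the scores, the `[BOS]` start token, and all function names are made up for illustration. P(A) is dropped because it is the same for every candidate W.

```python
import math
from collections import Counter

BOS = "[BOS]"  # hypothetical sentence-start token


def train_bigram(sentences):
    """Return a function giving the relative-frequency P(word | previous)."""
    context_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = [BOS] + sentence.split()
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob


def log_p_w(prob, sentence):
    """log P(W) under the bigram model; -inf if any bigram is unseen."""
    words = [BOS] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        p = prob(prev, word)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def decode(acoustic_log_scores, prob):
    """Pick the W maximizing log P(A|W) + log P(W)."""
    return max(acoustic_log_scores,
               key=lambda w: acoustic_log_scores[w] + log_p_w(prob, w))


lm = train_bigram(["recognize speech", "recognize speech", "wreck a nice beach"])
# Made-up acoustic scores that slightly prefer the wrong word string.
scores = {"recognize speech": -10.2, "wreck a nice beach": -10.0}
print(decode(scores, lm))
```

In this toy case the language model outweighs the small acoustic preference for the second candidate, which is the role the slide assigns to P(W).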

26
Recognition Architectures: Incorporating Multiple
Knowledge Sources
[Diagram labels: Input Speech; Language Model P(W)]
27
Acoustic Modeling: Feature Extraction
[Diagram blocks: Input Speech, Fourier Transform,
Cepstral Analysis, Perceptual Weighting, Time
Derivative, Time Derivative. Outputs: Energy and
Mel-Spaced Cepstrum; Delta Energy and Delta
Cepstrum; Delta-Delta Energy and Delta-Delta
Cepstrum.]
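
The two time-derivative stages can be sketched as follows. This assumes the commonly used regression formula for delta coefficients over a window of two frames on each side, with the first and last frames repeated at the edges; the slides do not state which formula or window was used, and the function names are hypothetical. The static frames (energy and cepstrum) are taken as given.

```python
def deltas(frames, window=2):
    """Regression-based time derivative of a sequence of feature vectors.

    frames: list of equal-length lists, one feature vector per frame.
    Edges are handled by repeating the first and last frame.
    """
    if not frames:
        return []
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        vec = []
        for k in range(len(frames[0])):
            num = sum(
                n * (frames[min(t + n, last)][k] - frames[max(t - n, 0)][k])
                for n in range(1, window + 1)
            )
            vec.append(num / denom)
        out.append(vec)
    return out


def add_dynamic_features(frames, window=2):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = deltas(frames, window)
    d2 = deltas(d1, window)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]


# A one-dimensional "feature" that rises by one unit per frame.
ramp = [[float(t)] for t in range(7)]
print(add_dynamic_features(ramp)[3])
```

Each output vector is three times as long as the static vector: the static values, their first derivative, and their second derivative.
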
28
Acoustic Modeling: Hidden Markov Models
  • Acoustic models encode the temporal evolution of
    the features (spectrum).
  • Gaussian mixture distributions are used to
    account for variations in speaker, accent, and
    pronunciation.
  • Phonetic model topologies are simple
    left-to-right structures.
  • Skip states (time-warping) and multiple paths
    (alternate pronunciations) are also common
    features of models.
  • Sharing model parameters is a common strategy to
    reduce complexity.
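
A minimal sketch of a left-to-right model, scored with the forward algorithm in the log domain. To stay short it makes assumptions the slide does not: one-dimensional observations, a single Gaussian per state instead of a mixture, no skip states, and the same self-loop probability for every state. All names and numbers are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(obs, means, variances, self_loop=0.6):
    """log P(obs | model) for a left-to-right HMM without skip states.

    State i either stays (probability self_loop) or moves to state i + 1.
    Paths must start in the first state and end in the last one.
    """
    n = len(means)
    log_stay = math.log(self_loop)
    log_next = math.log(1.0 - self_loop)
    alpha = [float("-inf")] * n
    alpha[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        new_alpha = []
        for j in range(n):
            incoming = [alpha[j] + log_stay]
            if j > 0:
                incoming.append(alpha[j - 1] + log_next)
            new_alpha.append(logsumexp(incoming)
                             + log_gauss(x, means[j], variances[j]))
        alpha = new_alpha
    return alpha[-1]


means, variances = [0.0, 5.0, 10.0], [1.0, 1.0, 1.0]
rising = [0.1, 0.2, 4.9, 5.1, 9.8, 10.0]
print(forward_log_likelihood(rising, means, variances))
print(forward_log_likelihood(rising[::-1], means, variances))
```

The rising sequence follows the state order and scores far higher than its reversal, which is the sense in which the model encodes temporal evolution.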

29
Acoustic Modeling: Parameter Estimation
  • Word level transcription
  • Supervises closed-loop, data-driven modeling
  • Initial parameter estimation
  • The expectation/maximization (EM) algorithm is
    used to improve our parameter estimates.
  • Computationally efficient training algorithms
    (Forward-Backward) are crucial.
  • Batch mode parameter updates are typically
    preferred.
  • Decision trees and additional linguistic
    knowledge are used to optimize parameter sharing
    and system complexity.
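
The batch EM update can be sketched on the simplest related case: a one-dimensional Gaussian mixture rather than a full HMM. Forward-Backward training of HMMs follows the same pattern of an expectation step that collects soft counts and a maximization step that re-estimates parameters from them. The data, the initial values, the variance floor, and the function names are assumptions made for this sketch.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances, var_floor=1e-3):
    """One batch EM update; also returns the log-likelihood of the data
    under the parameters passed in (before the update)."""
    k = len(weights)
    resp_sum, x_sum, xx_sum = [0.0] * k, [0.0] * k, [0.0] * k
    log_lik = 0.0
    for x in data:
        # E-step: responsibility of each component for this sample.
        joint = [weights[j] * gauss_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        log_lik += math.log(total)
        for j in range(k):
            r = joint[j] / total
            resp_sum[j] += r
            x_sum[j] += r * x
            xx_sum[j] += r * x * x
    # M-step: re-estimate all parameters from the accumulated statistics.
    new_weights = [resp_sum[j] / len(data) for j in range(k)]
    new_means = [x_sum[j] / resp_sum[j] for j in range(k)]
    new_vars = [max(xx_sum[j] / resp_sum[j] - new_means[j] ** 2, var_floor)
                for j in range(k)]
    return new_weights, new_means, new_vars, log_lik


def fit(data, weights, means, variances, iterations=20):
    """Run several EM updates and record the log-likelihood after each pass."""
    history = []
    for _ in range(iterations):
        weights, means, variances, log_lik = em_step(data, weights, means, variances)
        history.append(log_lik)
    return weights, means, variances, history


data = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
w, m, v, history = fit(data, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
print([round(x, 3) for x in m], round(history[-1], 3))
```

The statistics are accumulated over the whole data set before any parameter changes, which is the batch-mode update the slide refers to.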

30
Language Modeling: Is A Lot Like Wheel of Fortune
31
Language Modeling: N-Grams: The Good, The Bad, and
The Ugly
32
Language Modeling: Integration of Natural Language
33
Implementation Issues: Dynamic Programming-Based
Search
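
The slide body did not survive in this transcript, so the following is only a generic illustration of dynamic-programming search, not the deck's own algorithm: a Viterbi search over a small discrete HMM. The two states, the symbols, and all probabilities are invented for the example.

```python
import math


def _log(p):
    """Log of a probability, mapping zero to minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(obs, states, start, trans, emit):
    """Return (log score, state path) of the single best path."""
    score = {s: _log(start[s]) + _log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Keep only the best way of reaching state s at this frame.
            best_prev = max(states, key=lambda p: score[p] + _log(trans[p][s]))
            new_score[s] = (score[best_prev] + _log(trans[best_prev][s])
                            + _log(emit[s][o]))
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return score[last], path


states = ["aa", "iy"]
start = {"aa": 0.5, "iy": 0.5}
trans = {"aa": {"aa": 0.9, "iy": 0.1}, "iy": {"aa": 0.1, "iy": 0.9}}
emit = {"aa": {"x": 0.8, "y": 0.2}, "iy": {"x": 0.2, "y": 0.8}}
print(viterbi(["x", "x", "y", "y"], states, start, trans, emit))
```

The work per frame grows with the square of the number of states, which is why large-vocabulary decoders prune the set of active states.
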
34
Implementation Issues: Cross-Word Decoding Is
Expensive
  • Cross-word decoding: since word boundaries don't
    occur in spontaneous speech, we must allow for
    sequences of sounds that span word boundaries.
  • Cross-word decoding significantly increases
    memory requirements.

35
Implementation Issues: Search Is Resource
Intensive
  • Typical LVCSR systems have about 10M free
    parameters, which makes training a challenge.
  • Large speech databases are required (several
    hundred hours of speech).
  • Tying, smoothing, and interpolation are required.

36
Applications: Conversational Speech
  • Conversational speech collected over the
    telephone contains background noise, music,
    fluctuations in the speech rate, laughter,
    partial words, hesitations, mouth noises, etc.
  • WER (Word Error Rate) has decreased from 100% to
    30% in six years.
  • Laughter
  • Singing
  • Unintelligible
  • Spoonerism
  • Background Speech
  • No pauses
  • Restarts
  • Vocalized Noise
  • Coinage

37
Applications: Audio Indexing of Broadcast News
  • Broadcast news offers some unique challenges:
  • Lexicon: important information in infrequently
    occurring words
  • Acoustic Modeling: variations in channel,
    particularly within the same segment (in the
    studio vs. on location)
  • Language Model: must adapt (Bush,
    Clinton, Bush, McCain, ???)
  • Language: multilingual systems?
    language-independent acoustic modeling?

38
Applications: Automatic Phone Centers
  • Portals: BeVocal, TellMe, HeyAnita
  • VoiceXML 2.0
  • Automatic Information Desk
  • Reservation Desk
  • Automatic Help-Desk
  • With Speaker identification
  • bank account services
  • e-mail services
  • corporate services

39
Applications: Real-Time Translation
  • From President Clinton's State of the Union
    address (January 27, 2000):
  • "These kinds of innovations are also propelling
    our remarkable prosperity... Soon researchers
    will bring us devices that can translate foreign
    languages as fast as you can talk... molecular
    computers the size of a tear drop with the power
    of today's fastest supercomputers."
  • Human Language Engineering: a sophisticated
    integration of many speech and language related
    technologies... a science for the next
    millennium.

40
Technology: Future Directions
  • The algorithmic issues for the next decade:
  • Better features by extracting articulatory
    information?
  • Bayesian statistics? Bayesian networks?
  • Decision Trees? Information-theoretic measures?
  • Nonlinear dynamics? Chaos?