Title: Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends
1. Seminar
- Speech Recognition: A Short Overview
- E.M. Bakker
- LIACS Media Lab
- Leiden University
2. Introduction: What is Speech Recognition?
- Goal: automatically extract the string of words spoken from the speech signal.
- Other interesting areas:
  - Who is talking (speaker recognition, identification)
  - Speech output (speech synthesis)
  - What the words mean (speech understanding, semantics)
3. Recognition Architectures: A Communication-Theoretic Approach
- Channel model: Message Source -> Linguistic Channel -> Articulatory Channel -> Acoustic Channel -> Features
- Observables at each stage: Message, Words, Sounds, Features
- Bayesian formulation for speech recognition:
  P(W|A) = P(A|W) P(W) / P(A)
- Objective: minimize the word error rate. Approach: maximize P(W|A) during training.
- Components:
  - P(A|W): acoustic model (hidden Markov models, mixtures)
  - P(W): language model (statistical, finite-state networks, etc.)
- The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
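The decision rule above can be sketched in the log domain, where decoders actually work: since P(A) does not depend on W, maximizing P(W|A) reduces to maximizing log P(A|W) + log P(W). The candidate word sequences and all scores below are invented for illustration.

```python
import math

# Hypothetical scores for two candidate word sequences W given acoustics A.
# log P(A|W): acoustic model log-likelihood; log P(W): language model log-probability.
candidates = {
    "recognize speech": {"log_p_a_given_w": -120.0, "log_p_w": math.log(1e-4)},
    "wreck a nice beach": {"log_p_a_given_w": -118.0, "log_p_w": math.log(1e-7)},
}

def decode(cands):
    # P(A) is constant over W, so argmax P(W|A) = argmax log P(A|W) + log P(W).
    return max(cands, key=lambda w: cands[w]["log_p_a_given_w"] + cands[w]["log_p_w"])

best = decode(candidates)
```

Note how the language model overrules a slightly better acoustic score: the acoustically preferred hypothesis loses because it is far less probable as a word sequence.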
4. Recognition Architectures: Incorporating Multiple Knowledge Sources
- Input Speech
- Language Model P(W)
5. Acoustic Modeling: Feature Extraction
- Pipeline: Input Speech -> Fourier Transform -> Cepstral Analysis -> Perceptual Weighting
- Static features: Energy, Mel-Spaced Cepstrum
- First Time Derivative: Delta Energy, Delta Cepstrum
- Second Time Derivative: Delta-Delta Energy, Delta-Delta Cepstrum
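The pipeline above (Fourier transform, mel-spaced filterbank, cepstral analysis, time derivatives) can be sketched as follows. This is a minimal illustration, not the exact front end of any particular system; the sample rate, frame size, and filter counts are assumed values.

```python
import numpy as np

def mel(f):  # Hz -> mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr=8000, n_filters=24, n_ceps=12):
    spec = np.abs(np.fft.rfft(frame)) ** 2              # Fourier transform -> power spectrum
    fbank = mel_filterbank(n_filters, len(frame), sr)
    energies = np.log(fbank @ spec + 1e-10)             # mel-spaced log energies
    # Cepstral analysis: DCT-II of the log filterbank energies
    # (row 0 here is the c0/energy term).
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ energies                               # mel-spaced cepstrum

# One synthetic 32 ms frame of a 440 Hz tone at 8 kHz.
frame = np.sin(2 * np.pi * 440 * np.arange(256) / 8000.0)
ceps = mfcc_frame(frame)

# Delta features: the time derivative, approximated by differencing the
# cepstra of neighboring frames (here three artificially scaled copies).
ceps_seq = np.stack([mfcc_frame(frame * g) for g in (0.9, 1.0, 1.1)])
delta = (ceps_seq[2] - ceps_seq[0]) / 2.0
```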
6. Acoustic Modeling: Hidden Markov Models
- Acoustic models encode the temporal evolution of the features (spectrum).
- Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
- Phonetic model topologies are simple left-to-right structures.
- Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
- Sharing model parameters is a common strategy to reduce complexity.
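A left-to-right phonetic topology with skip states, as described above, can be written down as a transition matrix. The probabilities below are illustrative, not trained values.

```python
import numpy as np

# A 3-emitting-state left-to-right topology with skip transitions.
# Rows: from-state; columns: to-state. State 3 is a non-emitting exit state.
A = np.array([
    [0.6, 0.3, 0.1, 0.0],   # self-loop (time-warping), advance, or skip a state
    [0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],
])

# Left-to-right means no backward transitions: the matrix is upper triangular,
# and every row is a proper probability distribution.
assert np.allclose(A, np.triu(A))
assert np.allclose(A.sum(axis=1), 1.0)
```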
7. Acoustic Modeling: Parameter Estimation
- Closed-loop, data-driven modeling supervised only by a word-level transcription.
- The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
- Computationally efficient training algorithms (Forward-Backward) have been crucial.
- Batch-mode parameter updates are typically preferred.
- Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.
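The Forward-Backward computation mentioned above can be sketched by its forward pass, which yields the data likelihood that EM iteratively improves; per-frame scaling keeps the recursion numerically stable. The two-state toy model at the end is invented for illustration.

```python
import numpy as np

def forward(pi, A, B):
    """Forward pass of the forward-backward algorithm.
    pi: (S,) initial state probabilities; A: (S, S) transition matrix;
    B: (T, S) state output likelihoods b_s(o_t) for each frame t.
    Returns log P(O | model), computed with per-frame scaling."""
    alpha = pi * B[0]
    scale = alpha.sum()
    log_prob = np.log(scale)
    alpha = alpha / scale
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]      # propagate, then weight by output likelihood
        scale = alpha.sum()
        log_prob += np.log(scale)       # accumulate log-likelihood from the scales
        alpha = alpha / scale
    return log_prob

# Toy model: each frame contributes likelihood 0.5 whatever the state,
# so P(O) = 0.5^T and the result is checkable by hand.
pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5], [0.0, 1.0]])
B = np.array([[0.5, 0.5], [0.5, 0.5]])
logp = forward(pi, A, B)
```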
8. Language Modeling: Is a Lot Like Wheel of Fortune
9. Language Modeling: N-Grams, The Good, The Bad, and The Ugly
10. Language Modeling: Integration of Natural Language
11. Implementation Issues: Search Is Resource Intensive
- Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
- Large speech databases are required (several hundred hours of speech).
- Tying, smoothing, and interpolation are required.
12. Implementation Issues: Dynamic Programming-Based Search
13. Implementation Issues: Cross-Word Decoding Is Expensive
- Cross-word decoding: since word boundaries don't occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
- Cross-word decoding significantly increases memory requirements.
14. General Specification
15. Applications: Conversational Speech
- Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
- WER (Word Error Rate) has decreased from 100% to 30% in six years.
- Phenomena encountered:
- Laughter
- Singing
- Unintelligible
- Spoonerism
- Background Speech
- No pauses
- Restarts
- Vocalized Noise
- Coinage
16. Applications: Audio Indexing of Broadcast News
- Broadcast news offers some unique challenges:
- Lexicon: important information is in infrequently occurring words.
- Acoustic modeling: variations in channel, particularly within the same segment (in the studio vs. on location).
- Language model: must adapt (Bush, Clinton, Bush, McCain, ???).
- Language: multilingual systems? Language-independent acoustic modeling?
17. Applications: Real-Time Translation
- From President Clinton's State of the Union address (January 27, 2000):
- "These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today's fastest supercomputers."
- Human language engineering: a sophisticated integration of many speech- and language-related technologies... a science for the next millennium.
18. A Generic Solution
19. A Pattern Recognition Formulation
20. Solution: Signal Modeling
21. Speech Recognition
- Erwin M. Bakker
- Leiden University
22. THE SPEECH RECOGNITION PROBLEM
- Boundaries between words or phonemes
- Large variations in speaking rates
- In fluent speech, words and word endings are less pronounced
- A great deal of inter- as well as intra-speaker variability
- Quality of the speech signal
- Task-inherent syntactic-semantic constraints should be exploited
23. SEARCH ALGORITHMS
24. STATISTICAL METHODS IN SPEECH RECOGNITION
- The Bayesian Approach
- Acoustic Models
- Language Models
25. A statistical speech recognition system
26. Acoustic Models (HMM)
Some typical HMM topologies used for acoustic modeling in large vocabulary speech recognition: (a) typical triphone, (b) short pause, (c) silence. The shaded states denote the start and stop states for each model.
27. Language Models
28. SEARCH ALGORITHMS
- The Complexity of Search
- Typical Search Algorithms:
  - Viterbi Search
  - Stack Decoders
  - Multi-Pass Search
  - Forward-Backward Search
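Of the algorithms listed, time-synchronous Viterbi search can be sketched as a dynamic program over states, carried out in the log domain as decoders typically do. The two-state toy model at the end is invented for illustration.

```python
import numpy as np

def viterbi(pi, A, B):
    """Best state sequence for observations with state likelihoods B[t, s],
    transition matrix A, and initial distribution pi."""
    eps = 1e-300                                 # avoid log(0) for forbidden moves
    T, S = B.shape
    logA = np.log(A + eps)
    logd = np.log(pi + eps) + np.log(B[0] + eps)  # best score ending in each state
    back = np.zeros((T, S), dtype=int)            # backpointers for the trace-back
    for t in range(1, T):
        cand = logd[:, None] + logA               # cand[i, j]: extend best path in i to j
        back[t] = np.argmax(cand, axis=0)
        logd = cand[back[t], np.arange(S)] + np.log(B[t] + eps)
    # Trace back from the best final state.
    path = [int(np.argmax(logd))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy left-to-right model: the observations favor state 0 early and state 1 late.
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
path = viterbi(pi, A, B)
```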
29. Hierarchical representation of the search space
30. An outline of the Viterbi search algorithm
31. A simple overview of the stack decoding algorithm
32. Multi-Pass Search
33. Complexity of Search
- Lexicon: contains all the words in the system's vocabulary along with their pronunciations (often there are multiple pronunciations per word).
- Acoustic models: HMMs that represent the basic sound units the system is capable of recognizing.
- Language model: determines the possible word sequences allowed by the system (encodes knowledge of the syntax and semantics of the language).
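The lexicon described above is commonly organized as a lexical tree (slide 42): a prefix tree over pronunciations, so that words sharing initial phones share nodes and the decoder evaluates a shared prefix only once. A minimal sketch with a hypothetical three-word lexicon:

```python
# Hypothetical mini-lexicon: word -> phone-string pronunciation.
lexicon = {
    "start": "s t aa r t",
    "star":  "s t aa r",
    "stop":  "s t aa p",
}

def build_lexical_tree(lex):
    """Build a prefix tree over pronunciations as nested dicts.
    The special key '#words' marks the word identities ending at a node."""
    root = {}
    for word, pron in lex.items():
        node = root
        for phone in pron.split():
            node = node.setdefault(phone, {})
        node.setdefault("#words", []).append(word)
    return root

tree = build_lexical_tree(lexicon)
# All three words share the branch s -> t -> aa; they only diverge afterward.
```

Note that "star" ends at an interior node that also continues to "start", which is why word identities live on nodes rather than only on leaves.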
36. References
- Neeraj Deshmukh, Aravind Ganapathiraju and Joseph Picone, "Hierarchical Search for Large Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, September 1999.
- H. Ney and S. Ortmanns, "Dynamic Programming Search for Continuous Speech Recognition," IEEE Signal Processing Magazine, September 1999.
- V. Zue, "Talking with Your Computer," Scientific American, August 1999.
37. Relative complexity of the search problem for large vocabulary conversational speech recognition
38. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- Complexity of Search
- Network Decoding
- N-Gram Decoding
- Cross-Word Acoustic Models
- Search Space Organization
- Lexical Trees
- Language Model Lookahead
- Acoustic Evaluation
39. Network decoding using word-internal context-dependent models
- The word network providing linguistic constraints
- The pronunciation lexicon for the words involved
- The network expanded using the corresponding word-internal triphones derived from the pronunciations of the words
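The expansion into word-internal triphones can be sketched as follows. The "l-p+r" notation (center phone p with left context l and right context r) and the example pronunciation of "speech" are assumptions for illustration; word-internal models do not look across word boundaries, so boundary contexts are left open (written "*" here).

```python
def word_internal_triphones(pron):
    """Expand a phone string into word-internal context-dependent units.
    The first and last phones have unknown (word-boundary) context."""
    phones = pron.split()
    if len(phones) == 1:
        return ["*-%s+*" % phones[0]]
    units = ["*-%s+%s" % (phones[0], phones[1])]
    for i in range(1, len(phones) - 1):
        units.append("%s-%s+%s" % (phones[i - 1], phones[i], phones[i + 1]))
    units.append("%s-%s+*" % (phones[-2], phones[-1]))
    return units

# A hypothetical pronunciation of "speech": s p iy ch
units = word_internal_triphones("s p iy ch")
```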
41. Search Space Organization
42. Lexical Tree
43. Generation of triphones
44. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- Search Space Reduction
- Pruning:
  - setting pruning beams based on the hypothesis score
  - limiting the total number of model instances active at a given time
  - setting an upper bound on the number of words allowed to end at a given frame
- Path Merging
- Word Graph Compaction
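The first two pruning methods above (a score-based beam and a cap on active instances) can be sketched together. The hypothesis scores below are invented for illustration.

```python
def beam_prune(hyps, beam, max_active):
    """Prune active hypotheses: (1) keep only those within 'beam' of the
    best log score, (2) cap the number of active instances at 'max_active'.
    hyps: dict mapping a state/instance id to its log score."""
    best = max(hyps.values())
    survivors = {s: v for s, v in hyps.items() if v >= best - beam}
    if len(survivors) > max_active:
        keep = sorted(survivors, key=survivors.get, reverse=True)[:max_active]
        survivors = {s: survivors[s] for s in keep}
    return survivors

# Four active hypotheses: s3 falls outside the beam, s2 is cut by the cap.
active = {"s1": -10.0, "s2": -12.5, "s3": -30.0, "s4": -11.0}
pruned = beam_prune(active, beam=5.0, max_active=2)
```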
46. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- System Architecture
- PERFORMANCE ANALYSIS
- A substitution error refers to the case where the decoder mis-recognizes a word in the reference sequence as another in the hypothesis.
- A deletion error occurs when there is no word recognized corresponding to a word in the reference transcription.
- An insertion error corresponds to the case where the hypothesis contains an extra word that has no counterpart in the reference.
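These three error types define the word error rate: WER = (substitutions + deletions + insertions) / reference length, counted over a minimum-edit-distance alignment of the two word sequences. A minimal sketch, with made-up example sentences:

```python
def wer_counts(ref, hyp):
    """Levenshtein alignment of reference and hypothesis word lists,
    returning (substitutions, deletions, insertions)."""
    R, H = len(ref), len(hyp)
    # d[i][j]: (total cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    d = [[None] * (H + 1) for _ in range(R + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        d[i][0] = (i, 0, i, 0)          # delete everything
    for j in range(1, H + 1):
        d[0][j] = (j, 0, 0, j)          # insert everything
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                match = d[i - 1][j - 1]                       # correct word
            else:
                c, s, dl, ins = d[i - 1][j - 1]
                match = (c + 1, s + 1, dl, ins)               # substitution
            c, s, dl, ins = d[i - 1][j]
            delete = (c + 1, s, dl + 1, ins)                  # deletion
            c, s, dl, ins = d[i][j - 1]
            insert = (c + 1, s, dl, ins + 1)                  # insertion
            d[i][j] = min(match, delete, insert)
    _, s, dl, ins = d[R][H]
    return s, dl, ins

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat today".split()
subs, dels, ins = wer_counts(ref, hyp)
wer = (subs + dels + ins) / len(ref)    # 1 substitution + 1 insertion over 6 words
```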
47. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- Scalability: Can the algorithm scale gracefully from small constrained tasks to large unconstrained tasks?
- Recognition accuracy: How accurate is the best word sequence found by the system?
- Word graph accuracy: Can the system generate alternate choices that contain the correct word sequence? How large must this list of choices be?
- Memory: What memory is required to achieve optimal performance? How does performance vary with the amount of memory required?
- Run-time: How many seconds of CPU time per second of speech are required (xRT) to achieve optimal performance? How does run-time vary with performance (run-time should decrease significantly as error rates increase)?
48. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- Alphadigits
- Switchboard
- Beam Pruning
- MAPMI Pruning
50. Comparisons performed on a 333 MHz Pentium II processor with 512 MB RAM.
51. Forward-Backward Search
52. Introduction: Speech in the Information Age
- Speech and text were revolutionary because of information access.
- New media and connectivity yield information overload.
- Can speech technology help?
- Figure: access to information over time, from "listen and remember" through reading books, computer typing, and careful spoken or written input, to conversational language.
53. Conclusion and Future Directions: Trends
- We need new technology to help with information overload.
- Speech information sources are everywhere:
  - Voice mail messages
  - Professional talk
  - Lectures, broadcasts
- Speech sources of information will increase:
  - As devices shrink
  - As mobility increases
  - New uses: annotation, documentation
54. Conclusion and Future Directions: Applications on the Horizon
- Beginnings of speech as a source of information:
  - ISLIP: http://www.mediasite.net/info/frames.htm
  - Virage: http://www.virage.com
- Speech technology in education and training:
  - Cliff Stoll, High Tech Heretic:
    - Good schools need no computers
    - Bad schools won't be improved by them
  - BravoBrava: co-evolving technology and people can:
    - Dramatically reduce the cost of delivery of content
    - Increase its timeliness, quality and appropriateness
    - Target needs of individual and/or group
  - Reading Pal demo
55. OVERLAP IN THE CEPSTRAL SPACE (ALPHADIGITS)
- The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of the vowels "aa" (as in "lock") and "iy" (as in "beat") excised from tokens in the OGI Alphadigit speech corpus. In these plots, the first two cepstral coefficients are shown (c1 and c2; energy, which is c0, is not shown). Comparisons are provided as a function of the vowel spoken and the gender of the speaker:
- Vowel comparison: male "aa" vs. male "iy"
- Vowel comparison: female "aa" vs. female "iy"
- Vowel comparison: a combined plot of the above conditions
- Gender comparison: males vs. females for the vowels "aa" and "iy"
- Combined comparison: "aa" vs. "iy" for both genders
- The Alphadigit vowel data used to generate these plots is available for classification experiments.
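The kind of overlap these plots show can be reproduced with synthetic data: two overlapping Gaussian clouds standing in for the (c1, c2) measurements of "aa" and "iy". The means and spreads below are invented, not taken from the Alphadigit corpus. Class overlap shows up as a nonzero error rate even for a simple classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the (c1, c2) vowel clouds (parameters invented).
aa = rng.normal(loc=[1.0, 0.0], scale=0.8, size=(200, 2))
iy = rng.normal(loc=[-1.0, 0.5], scale=0.8, size=(200, 2))

# Nearest-class-mean classifier in the 2-D cepstral space.
mu_aa, mu_iy = aa.mean(axis=0), iy.mean(axis=0)

def classify(x):
    return "aa" if np.linalg.norm(x - mu_aa) < np.linalg.norm(x - mu_iy) else "iy"

# Points falling on the wrong side of the boundary are exactly the overlap.
errors = sum(classify(x) != "aa" for x in aa) + sum(classify(x) != "iy" for x in iy)
error_rate = errors / 400.0
```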
56. OVERLAP IN THE CEPSTRAL SPACE (SWB)
- The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of the vowels "aa" (as in "lock") and "iy" (as in "beat") excised from tokens in the SWITCHBOARD conversational speech corpus. In these plots, the first two cepstral coefficients are shown (c1 and c2; energy, which is c0, is not shown). Comparisons are provided as a function of the vowel spoken and the gender of the speaker:
- Vowel comparison: male "aa" vs. male "iy"
- Vowel comparison: female "aa" vs. female "iy"
- Vowel comparison: a combined plot of the above conditions
- Gender comparison: males vs. females for the vowels "aa" and "iy"
- Combined comparison: "aa" vs. "iy" for both genders
- The Switchboard vowel data used to generate these plots is available for classification experiments.
57. Implementation Issues: Decoding Example
58. Implementation Issues: Internet-Based Speech Recognition