Title: Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends
1. Seminar
- Speech Recognition: A Short Overview
- E.M. Bakker
- LIACS Media Lab
- Leiden University
2. Introduction: What is Speech Recognition?
- Goal: automatically extract the string of words spoken from the speech signal.
- Other interesting areas:
  - Who is talking (speaker recognition, identification)
  - Speech output (speech synthesis)
  - What the words mean (speech understanding, semantics)
3. Recognition Architectures: A Communication-Theoretic Approach
- Channel model: Message Source -> Linguistic Channel -> Articulatory Channel -> Acoustic Channel -> Features
- Observables at each stage: Message, Words, Sounds, Features
- Bayesian formulation for speech recognition:
  P(W|A) = P(A|W) P(W) / P(A)
- Objective: minimize the word error rate. Approach: maximize P(W|A) during training.
- Components:
  - P(A|W): acoustic model (hidden Markov models, mixtures)
  - P(W): language model (statistical, finite-state networks, etc.)
- The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
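The decision rule above can be sketched in the log domain, where decoders actually work: since P(A) does not depend on W, maximizing P(W|A) reduces to maximizing log P(A|W) + log P(W). The candidate word sequences and all scores below are invented for illustration.

```python
import math

# Hypothetical scores for two candidate word sequences W given acoustics A.
# log P(A|W): acoustic model log-likelihood; log P(W): language model log-probability.
candidates = {
    "recognize speech": {"log_p_a_given_w": -120.0, "log_p_w": math.log(1e-4)},
    "wreck a nice beach": {"log_p_a_given_w": -118.0, "log_p_w": math.log(1e-7)},
}

def decode(cands):
    # P(A) is constant over W, so argmax P(W|A) = argmax log P(A|W) + log P(W).
    return max(cands, key=lambda w: cands[w]["log_p_a_given_w"] + cands[w]["log_p_w"])

best = decode(candidates)
```

Note how the language model overrules a slightly better acoustic score: the acoustically preferred hypothesis loses because it is far less probable as a word sequence.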
4. Recognition Architectures: Incorporating Multiple Knowledge Sources
- Input Speech
- Language Model P(W)
5. Acoustic Modeling: Feature Extraction
- Pipeline: Input Speech -> Fourier Transform -> Cepstral Analysis -> Perceptual Weighting
- Static features: Energy, Mel-Spaced Cepstrum
- First Time Derivative: Delta Energy, Delta Cepstrum
- Second Time Derivative: Delta-Delta Energy, Delta-Delta Cepstrum
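The pipeline above (Fourier transform, mel-spaced filterbank, cepstral analysis, time derivatives) can be sketched as follows. This is a minimal illustration, not the exact front end of any particular system; the sample rate, frame size, and filter counts are assumed values.

```python
import numpy as np

def mel(f):  # Hz -> mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr=8000, n_filters=24, n_ceps=12):
    spec = np.abs(np.fft.rfft(frame)) ** 2              # Fourier transform -> power spectrum
    fbank = mel_filterbank(n_filters, len(frame), sr)
    energies = np.log(fbank @ spec + 1e-10)             # mel-spaced log energies
    # Cepstral analysis: DCT-II of the log filterbank energies
    # (row 0 here is the c0/energy term).
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ energies                               # mel-spaced cepstrum

# One synthetic 32 ms frame of a 440 Hz tone at 8 kHz.
frame = np.sin(2 * np.pi * 440 * np.arange(256) / 8000.0)
ceps = mfcc_frame(frame)

# Delta features: the time derivative, approximated by differencing the
# cepstra of neighboring frames (here three artificially scaled copies).
ceps_seq = np.stack([mfcc_frame(frame * g) for g in (0.9, 1.0, 1.1)])
delta = (ceps_seq[2] - ceps_seq[0]) / 2.0
```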
6. Acoustic Modeling: Hidden Markov Models
- Acoustic models encode the temporal evolution of the features (spectrum).
- Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
- Phonetic model topologies are simple left-to-right structures.
- Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
- Sharing model parameters is a common strategy to reduce complexity.
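A left-to-right phonetic topology with skip states, as described above, can be written down as a transition matrix. The probabilities below are illustrative, not trained values.

```python
import numpy as np

# A 3-emitting-state left-to-right topology with skip transitions.
# Rows: from-state; columns: to-state. State 3 is a non-emitting exit state.
A = np.array([
    [0.6, 0.3, 0.1, 0.0],   # self-loop (time-warping), advance, or skip a state
    [0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],
])

# Left-to-right means no backward transitions: the matrix is upper triangular,
# and every row is a proper probability distribution.
assert np.allclose(A, np.triu(A))
assert np.allclose(A.sum(axis=1), 1.0)
```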
7. Acoustic Modeling: Parameter Estimation
- Closed-loop, data-driven modeling supervised only by a word-level transcription.
- The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
- Computationally efficient training algorithms (Forward-Backward) have been crucial.
- Batch-mode parameter updates are typically preferred.
- Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.
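The Forward-Backward computation mentioned above can be sketched by its forward pass, which yields the data likelihood that EM iteratively improves; per-frame scaling keeps the recursion numerically stable. The two-state toy model at the end is invented for illustration.

```python
import numpy as np

def forward(pi, A, B):
    """Forward pass of the forward-backward algorithm.
    pi: (S,) initial state probabilities; A: (S, S) transition matrix;
    B: (T, S) state output likelihoods b_s(o_t) for each frame t.
    Returns log P(O | model), computed with per-frame scaling."""
    alpha = pi * B[0]
    scale = alpha.sum()
    log_prob = np.log(scale)
    alpha = alpha / scale
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]      # propagate, then weight by output likelihood
        scale = alpha.sum()
        log_prob += np.log(scale)       # accumulate log-likelihood from the scales
        alpha = alpha / scale
    return log_prob

# Toy model: each frame contributes likelihood 0.5 whatever the state,
# so P(O) = 0.5^T and the result is checkable by hand.
pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5], [0.0, 1.0]])
B = np.array([[0.5, 0.5], [0.5, 0.5]])
logp = forward(pi, A, B)
```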
8. Language Modeling: Is a Lot Like Wheel of Fortune
9. Language Modeling: N-Grams, The Good, The Bad, and The Ugly
10. Language Modeling: Integration of Natural Language
11. Implementation Issues: Search Is Resource Intensive
- Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
- Large speech databases are required (several hundred hours of speech).
- Tying, smoothing, and interpolation are required.
12. Implementation Issues: Dynamic Programming-Based Search
13. Implementation Issues: Cross-Word Decoding Is Expensive
- Cross-word decoding: since word boundaries don't occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
- Cross-word decoding significantly increases memory requirements.
14. General Specification
15. Applications: Conversational Speech
- Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
- WER (Word Error Rate) has decreased from 100% to 30% in six years.
- Phenomena encountered:
- Laughter
- Singing
- Unintelligible
- Spoonerism
- Background Speech
- No pauses
- Restarts
- Vocalized Noise
- Coinage
16. Applications: Audio Indexing of Broadcast News
- Broadcast news offers some unique challenges:
- Lexicon: important information is in infrequently occurring words.
- Acoustic modeling: variations in channel, particularly within the same segment (in the studio vs. on location).
- Language model: must adapt (Bush, Clinton, Bush, McCain, ???).
- Language: multilingual systems? Language-independent acoustic modeling?
17. Applications: Real-Time Translation
- From President Clinton's State of the Union address (January 27, 2000):
- "These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today's fastest supercomputers."
- Human language engineering: a sophisticated integration of many speech- and language-related technologies... a science for the next millennium.
18. A Generic Solution
19. A Pattern Recognition Formulation
20. Solution: Signal Modeling
21. Speech Recognition
- Erwin M. Bakker
- Leiden University
22. THE SPEECH RECOGNITION PROBLEM
- Boundaries between words or phonemes
- Large variations in speaking rates
- In fluent speech, words and word endings are less pronounced
- A great deal of inter- as well as intra-speaker variability
- Quality of the speech signal
- Task-inherent syntactic-semantic constraints should be exploited
23. SEARCH ALGORITHMS
24. STATISTICAL METHODS IN SPEECH RECOGNITION
- The Bayesian Approach
- Acoustic Models
- Language Models
25. A statistical speech recognition system
26. Acoustic Models (HMM)
Some typical HMM topologies used for acoustic modeling in large vocabulary speech recognition: (a) typical triphone, (b) short pause, (c) silence. The shaded states denote the start and stop states for each model.
27. Language Models
28. SEARCH ALGORITHMS
- The Complexity of Search
- Typical Search Algorithms:
  - Viterbi Search
  - Stack Decoders
  - Multi-Pass Search
  - Forward-Backward Search
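Of the algorithms listed, time-synchronous Viterbi search can be sketched as a dynamic program over states, carried out in the log domain as decoders typically do. The two-state toy model at the end is invented for illustration.

```python
import numpy as np

def viterbi(pi, A, B):
    """Best state sequence for observations with state likelihoods B[t, s],
    transition matrix A, and initial distribution pi."""
    eps = 1e-300                                 # avoid log(0) for forbidden moves
    T, S = B.shape
    logA = np.log(A + eps)
    logd = np.log(pi + eps) + np.log(B[0] + eps)  # best score ending in each state
    back = np.zeros((T, S), dtype=int)            # backpointers for the trace-back
    for t in range(1, T):
        cand = logd[:, None] + logA               # cand[i, j]: extend best path in i to j
        back[t] = np.argmax(cand, axis=0)
        logd = cand[back[t], np.arange(S)] + np.log(B[t] + eps)
    # Trace back from the best final state.
    path = [int(np.argmax(logd))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy left-to-right model: the observations favor state 0 early and state 1 late.
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
path = viterbi(pi, A, B)
```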
29. Hierarchical representation of the search space
30. An outline of the Viterbi search algorithm
31. A simple overview of the stack decoding algorithm
32. Multi-Pass Search
33. Complexity of Search
- Lexicon: contains all the words in the system's vocabulary along with their pronunciations (often there are multiple pronunciations per word).
- Acoustic models: HMMs that represent the basic sound units the system is capable of recognizing.
- Language model: determines the possible word sequences allowed by the system (encodes knowledge of the syntax and semantics of the language).
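The lexicon described above is commonly organized as a lexical tree (slide 42): a prefix tree over pronunciations, so that words sharing initial phones share nodes and the decoder evaluates a shared prefix only once. A minimal sketch with a hypothetical three-word lexicon:

```python
# Hypothetical mini-lexicon: word -> phone-string pronunciation.
lexicon = {
    "start": "s t aa r t",
    "star":  "s t aa r",
    "stop":  "s t aa p",
}

def build_lexical_tree(lex):
    """Build a prefix tree over pronunciations as nested dicts.
    The special key '#words' marks the word identities ending at a node."""
    root = {}
    for word, pron in lex.items():
        node = root
        for phone in pron.split():
            node = node.setdefault(phone, {})
        node.setdefault("#words", []).append(word)
    return root

tree = build_lexical_tree(lexicon)
# All three words share the branch s -> t -> aa; they only diverge afterward.
```

Note that "star" ends at an interior node that also continues to "start", which is why word identities live on nodes rather than only on leaves.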
36. References
- Neeraj Deshmukh, Aravind Ganapathiraju and Joseph Picone, "Hierarchical Search for Large Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, September 1999.
- H. Ney and S. Ortmanns, "Dynamic Programming Search for Continuous Speech Recognition," IEEE Signal Processing Magazine, September 1999.
- V. Zue, "Talking with Your Computer," Scientific American, August 1999.
37. Relative complexity of the search problem for large vocabulary conversational speech recognition
38. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- Complexity of Search
- Network Decoding
- N-Gram Decoding
- Cross-Word Acoustic Models
- Search Space Organization
- Lexical Trees
- Language Model Lookahead
- Acoustic Evaluation
39. Network decoding using word-internal context-dependent models
- The word network providing linguistic constraints
- The pronunciation lexicon for the words involved
- The network expanded using the corresponding word-internal triphones derived from the pronunciations of the words
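The expansion into word-internal triphones can be sketched as follows. The "l-p+r" notation (center phone p with left context l and right context r) and the example pronunciation of "speech" are assumptions for illustration; word-internal models do not look across word boundaries, so boundary contexts are left open (written "*" here).

```python
def word_internal_triphones(pron):
    """Expand a phone string into word-internal context-dependent units.
    The first and last phones have unknown (word-boundary) context."""
    phones = pron.split()
    if len(phones) == 1:
        return ["*-%s+*" % phones[0]]
    units = ["*-%s+%s" % (phones[0], phones[1])]
    for i in range(1, len(phones) - 1):
        units.append("%s-%s+%s" % (phones[i - 1], phones[i], phones[i + 1]))
    units.append("%s-%s+*" % (phones[-2], phones[-1]))
    return units

# A hypothetical pronunciation of "speech": s p iy ch
units = word_internal_triphones("s p iy ch")
```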
41. Search Space Organization
42. Lexical Tree
43. Generation of triphones
44. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- Search Space Reduction
- Pruning:
  - setting pruning beams based on the hypothesis score
  - limiting the total number of model instances active at a given time
  - setting an upper bound on the number of words allowed to end at a given frame
- Path Merging
- Word Graph Compaction
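The first two pruning methods above (a score-based beam and a cap on active instances) can be sketched together. The hypothesis scores below are invented for illustration.

```python
def beam_prune(hyps, beam, max_active):
    """Prune active hypotheses: (1) keep only those within 'beam' of the
    best log score, (2) cap the number of active instances at 'max_active'.
    hyps: dict mapping a state/instance id to its log score."""
    best = max(hyps.values())
    survivors = {s: v for s, v in hyps.items() if v >= best - beam}
    if len(survivors) > max_active:
        keep = sorted(survivors, key=survivors.get, reverse=True)[:max_active]
        survivors = {s: survivors[s] for s in keep}
    return survivors

# Four active hypotheses: s3 falls outside the beam, s2 is cut by the cap.
active = {"s1": -10.0, "s2": -12.5, "s3": -30.0, "s4": -11.0}
pruned = beam_prune(active, beam=5.0, max_active=2)
```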
46. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- System Architecture
- PERFORMANCE ANALYSIS
- A substitution error refers to the case where the decoder mis-recognizes a word in the reference sequence as another in the hypothesis.
- A deletion error occurs when there is no word recognized corresponding to a word in the reference transcription.
- An insertion error corresponds to the case where the hypothesis contains an extra word that has no counterpart in the reference.
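These three error types define the word error rate: WER = (substitutions + deletions + insertions) / reference length, counted over a minimum-edit-distance alignment of the two word sequences. A minimal sketch, with made-up example sentences:

```python
def wer_counts(ref, hyp):
    """Levenshtein alignment of reference and hypothesis word lists,
    returning (substitutions, deletions, insertions)."""
    R, H = len(ref), len(hyp)
    # d[i][j]: (total cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    d = [[None] * (H + 1) for _ in range(R + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        d[i][0] = (i, 0, i, 0)          # delete everything
    for j in range(1, H + 1):
        d[0][j] = (j, 0, 0, j)          # insert everything
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                match = d[i - 1][j - 1]                       # correct word
            else:
                c, s, dl, ins = d[i - 1][j - 1]
                match = (c + 1, s + 1, dl, ins)               # substitution
            c, s, dl, ins = d[i - 1][j]
            delete = (c + 1, s, dl + 1, ins)                  # deletion
            c, s, dl, ins = d[i][j - 1]
            insert = (c + 1, s, dl, ins + 1)                  # insertion
            d[i][j] = min(match, delete, insert)
    _, s, dl, ins = d[R][H]
    return s, dl, ins

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat today".split()
subs, dels, ins = wer_counts(ref, hyp)
wer = (subs + dels + ins) / len(ref)    # 1 substitution + 1 insertion over 6 words
```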
47. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- Scalability: Can the algorithm scale gracefully from small constrained tasks to large unconstrained tasks?
- Recognition accuracy: How accurate is the best word sequence found by the system?
- Word graph accuracy: Can the system generate alternate choices that contain the correct word sequence? How large must this list of choices be?
- Memory: What memory is required to achieve optimal performance? How does performance vary with the amount of memory required?
- Run-time: How many seconds of CPU time per second of speech are required (xRT) to achieve optimal performance? How does run-time vary with performance (run-time should decrease significantly as error rates increase)?
48. A TIME-SYNCHRONOUS VITERBI-BASED DECODER
- Alphadigits
- Switchboard
- Beam Pruning
- MAPMI Pruning
50. Comparisons performed on a 333 MHz Pentium II processor with 512 MB RAM.
51. Forward-Backward Search
52. Introduction: Speech in the Information Age
- Speech and text were revolutionary because of information access.
- New media and connectivity yield information overload.
- Can speech technology help?
- Figure: access to information over time, from "listen and remember" through reading books, computer typing, and careful spoken or written input, to conversational language.
53. Conclusion and Future Directions: Trends
- We need new technology to help with information overload.
- Speech information sources are everywhere:
  - Voice mail messages
  - Professional talk
  - Lectures, broadcasts
- Speech sources of information will increase:
  - As devices shrink
  - As mobility increases
  - New uses: annotation, documentation
54. Conclusion and Future Directions: Applications on the Horizon
- Beginnings of speech as a source of information:
  - ISLIP: http://www.mediasite.net/info/frames.htm
  - Virage: http://www.virage.com
- Speech technology in education and training:
  - Cliff Stoll, High Tech Heretic:
    - Good schools need no computers
    - Bad schools won't be improved by them
  - BravoBrava: co-evolving technology and people can:
    - Dramatically reduce the cost of delivery of content
    - Increase its timeliness, quality and appropriateness
    - Target needs of individual and/or group
  - Reading Pal demo
55. OVERLAP IN THE CEPSTRAL SPACE (ALPHADIGITS)
- The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of the vowels "aa" (as in "lock") and "iy" (as in "beat") excised from tokens in the OGI Alphadigit speech corpus. In these plots, the first two cepstral coefficients are shown (c1 and c2; energy, which is c0, is not shown). Comparisons are provided as a function of the vowel spoken and the gender of the speaker:
- Vowel comparison: male "aa" vs. male "iy"
- Vowel comparison: female "aa" vs. female "iy"
- Vowel comparison: a combined plot of the above conditions
- Gender comparison: males vs. females for the vowels "aa" and "iy"
- Combined comparison: "aa" vs. "iy" for both genders
- The Alphadigit vowel data used to generate these plots is available for classification experiments.
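The kind of overlap these plots show can be reproduced with synthetic data: two overlapping Gaussian clouds standing in for the (c1, c2) measurements of "aa" and "iy". The means and spreads below are invented, not taken from the Alphadigit corpus. Class overlap shows up as a nonzero error rate even for a simple classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the (c1, c2) vowel clouds (parameters invented).
aa = rng.normal(loc=[1.0, 0.0], scale=0.8, size=(200, 2))
iy = rng.normal(loc=[-1.0, 0.5], scale=0.8, size=(200, 2))

# Nearest-class-mean classifier in the 2-D cepstral space.
mu_aa, mu_iy = aa.mean(axis=0), iy.mean(axis=0)

def classify(x):
    return "aa" if np.linalg.norm(x - mu_aa) < np.linalg.norm(x - mu_iy) else "iy"

# Points falling on the wrong side of the boundary are exactly the overlap.
errors = sum(classify(x) != "aa" for x in aa) + sum(classify(x) != "iy" for x in iy)
error_rate = errors / 400.0
```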
56. OVERLAP IN THE CEPSTRAL SPACE (SWB)
- The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of the vowels "aa" (as in "lock") and "iy" (as in "beat") excised from tokens in the SWITCHBOARD conversational speech corpus. In these plots, the first two cepstral coefficients are shown (c1 and c2; energy, which is c0, is not shown). Comparisons are provided as a function of the vowel spoken and the gender of the speaker:
- Vowel comparison: male "aa" vs. male "iy"
- Vowel comparison: female "aa" vs. female "iy"
- Vowel comparison: a combined plot of the above conditions
- Gender comparison: males vs. females for the vowels "aa" and "iy"
- Combined comparison: "aa" vs. "iy" for both genders
- The Switchboard vowel data used to generate these plots is available for classification experiments.
57. Implementation Issues: Decoding Example
58. Implementation Issues: Internet-Based Speech Recognition