Title: Isolated-Word Speech Recognition Using Hidden Markov Models
1. Isolated-Word Speech Recognition Using Hidden Markov Models
- 6.962 Week 10 Presentation
- Irina Medvedev, Massachusetts Institute of Technology
- April 19, 2001
2. Outline
- Markov Processes, Chains, Models
- Isolated-Word Speech Recognition
- Feature Analysis
- Unit Matching
- Training
- Recognition
- Conclusions
3. Markov Process
- The process x(t) is first-order Markov if, for any set of ordered times t_1 < t_2 < ... < t_n,
  p(x(t_n) | x(t_{n-1}), ..., x(t_1)) = p(x(t_n) | x(t_{n-1}))
- The current value of a Markov process carries all of the memory necessary to predict the future
- The past does not add any additional information about the future
4. Markov Process
- The transition probability density provides an important statistical description of a Markov process and is defined as p(x(t_n) | x(t_{n-1})), the density of the current value conditioned on the most recent past value
- A complete specification of a Markov process consists of
- the first-order density p(x(t))
- the transition density p(x(t_n) | x(t_{n-1}))
5. Markov Chains
- A Markov chain can be used to describe a system which, at any time, belongs to one of N distinct states, S_1, S_2, ..., S_N
- At regularly spaced times, the system may stay in the same state or transition to a different state
- The state at time t is denoted by q_t
Figure: fully connected Markov model
6. Markov Chains
- State transitions are made according to a set of probabilities associated with each state
- These probabilities are stored in the state transition matrix A = {a_ij}, where N is the number of states in the Markov chain
- The state transition probabilities and their properties are written out below
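In standard notation, with S_i denoting state i and q_t the state at time t, the transition probabilities and their constraints are:

$$ a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \qquad 1 \le i, j \le N $$

$$ a_{ij} \ge 0, \qquad \sum_{j=1}^{N} a_{ij} = 1 \quad \text{for every } i $$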
7. Hidden Markov Models
- Hidden Markov Models (HMMs) are used when the states are not observable events
- Instead, the observation is a probabilistic function of the state rather than the state itself
- The states are described by a probability model
- The HMM is a doubly embedded stochastic process
8. HMM Example: Coin Toss
- How do we build an HMM to explain an observed sequence of heads and tails?
- Choose a 2-state model
- Several possibilities exist
- 1-coin model: the states are observable
- 2-coin model: the states are hidden
9. Hidden Markov Models
- Hidden Markov Models are characterized by
- N, the number of states in the model
- A, the state transition matrix
- B, the observation probability distribution for each state
- π, the initial state distribution
- Model parameter set: λ = (A, B, π)
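Written out in the standard HMM notation, the parameter set collects the three distributions above:

$$ \lambda = (A, B, \pi), \qquad A = \{a_{ij}\}, \qquad B = \{ b_j(\mathbf{o}_t) \} = \{ P(\mathbf{o}_t \mid q_t = j) \}, \qquad \pi = \{ \pi_i \} = \{ P(q_1 = i) \} $$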
10. Left-Right HMM
- Can only transition to a higher state or stay in the same state
- The no-skip constraint allows states to transition only to the next state or remain in the same state
- Zeros in the state transition matrix represent illegal state transitions
Figure: 4-state left-right HMM with no skip transitions
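As a concrete illustration, the transition matrix of a 4-state left-right, no-skip HMM has nonzero entries only on the diagonal and the first superdiagonal. A minimal sketch (the probability values are illustrative, not taken from the slides):

```python
import numpy as np

# 4-state left-right HMM with no skip transitions:
# each state may either stay where it is or move to the next state.
# The probability values are illustrative placeholders.
A = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],   # the final state can only self-loop
])

# Zeros mark illegal transitions; each row must sum to one.
assert np.allclose(A.sum(axis=1), 1.0)
```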
11. Isolated-Word Speech Recognition
- Recognize one word at a time
- Assume the incoming signal is of the form: silence, speech, silence
- System components: feature analysis, unit matching, training, recognition
12. Feature Analysis
- We perform feature analysis to extract the observation vectors upon which all processing will be performed
- The discrete-time speech signal s(n) is analyzed through its discrete Fourier transform S(k)
- To reduce the dimensionality of the V-dimensional speech vector, we use cepstral coefficients, which serve as the feature observation vector for all further processing
13. Cepstral Coefficients
- Feature vectors are cepstral coefficients obtained from the sampled speech vector as the inverse DFT of the logarithm of the periodogram, where the periodogram is an estimate of the power spectral density of the speech
- We eliminate the zeroth component and keep cepstral coefficients 1 through L-1
- Dimensionality reduction
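A minimal sketch of this computation for a single frame, assuming a Hamming window and a frame already cut from the signal (the function name and the default L are illustrative choices, not details from the slides):

```python
import numpy as np

def cepstral_features(frame, L=13):
    """Cepstral coefficients of one speech frame: inverse DFT of the
    log periodogram, dropping the zeroth coefficient."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed)
    periodogram = np.abs(spectrum) ** 2 / len(frame)   # PSD estimate
    log_psd = np.log(periodogram + 1e-12)              # avoid log(0)
    cepstrum = np.fft.irfft(log_psd)
    return cepstrum[1:L]   # drop c[0]; keep coefficients 1 .. L-1
```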
14. Properties of Cepstral Coefficients
- Serve to undo the convolution between the pitch excitation and the vocal tract response
- High-order cepstral components carry speaker-dependent pitch information, which is not relevant for speech recognition
- Cepstral coefficients are well approximated by a Gaussian probability density function (pdf)
- Correlation values of cepstral coefficients are very low
15. Modeling of Cepstral Coefficients
- The HMM assumes that the Markovian states generate the cepstral vectors
- Each state represents a Gaussian source with its own mean vector and covariance matrix
- Each feature vector of cepstral coefficients can be modeled as a sample vector of an L-dimensional Gaussian random vector with mean vector μ_j and diagonal covariance matrix Σ_j for state j
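Because the covariance matrix is diagonal, the per-state log-likelihood of a feature vector factors over its components. A minimal sketch (the function name is illustrative; var holds the diagonal of the covariance matrix):

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log-likelihood of an L-dimensional feature vector x under a Gaussian
    with mean `mean` and diagonal covariance whose diagonal is `var`."""
    L = x.shape[0]
    return -0.5 * (L * np.log(2.0 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((x - mean) ** 2 / var))
```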
16. Formulation of the Feature Vectors
17. Unit Matching
- Initial goal: obtain an HMM for each speech recognition unit
- Large vocabulary (300 words): recognition units are phonemes
- Small vocabulary (10 words): recognition units are words
- We will consider an isolated-word speech recognition system for a small vocabulary of M words
18. Notation
- The observation is O = (o_1, o_2, ..., o_T), where each o_t is a cepstral feature vector and T is the number of feature vectors in the observation
- The state sequence is q = (q_1, q_2, ..., q_T), where each q_t ∈ {1, ..., N}
- State index: j = 1, ..., N
- Word index: v = 1, ..., M
- Time index: t = 1, ..., T
- The term model will be used for both the HMM and the parameter set describing the HMM, λ
19. Training
- We need to obtain an HMM for each of the M words
- The process of building the HMMs is called training
- Each HMM is characterized by the number of states, N, and the model parameter set, λ
- Each cepstral feature vector o_t in state j can be modeled by an L-dimensional Gaussian pdf, where μ_j is the mean vector and Σ_j is the covariance matrix in state j
20. Training
- A Gaussian pdf is completely characterized by its mean vector and covariance matrix
- The model parameter set can therefore be written as λ = (A, {μ_j, Σ_j}, π)
- The training procedure is the same for each word, so for convenience we will drop the word subscript v from λ_v and write λ
21. Building the HMM
- To build the HMM, we need to determine the parameter set that maximizes the likelihood of the observation for that word
- Objective: the joint maximization written out below
- The double maximization can be performed by optimizing over the state sequence and over the model individually
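One standard way to write this objective in the notation introduced earlier:

$$ \lambda^{*} = \arg\max_{\lambda} \; \max_{\mathbf{q}} \; P(\mathbf{O}, \mathbf{q} \mid \lambda) $$

The inner maximization picks the best state sequence for a given model; the outer maximization re-estimates the model for that state sequence.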
22. Uniform Segmentation
- The initial state sequence is determined by dividing the observation uniformly among the states
- Figure: 50 segments mapped onto 8 states
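A minimal sketch of this mapping, spreading T feature vectors evenly over N states (the function name is illustrative):

```python
import numpy as np

def uniform_segmentation(T, N):
    """Initial state sequence: assign T feature vectors evenly to N states,
    in order (states are numbered 0 .. N-1)."""
    return (np.arange(T) * N) // T

# Example corresponding to the slide: 50 feature vectors, 8 states
q_init = uniform_segmentation(50, 8)
```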
23. Maximization over the Model
- Given the initial state sequence, we maximize over the model
- The maximization entails estimating the model parameters from the observation, given the state sequence
- Estimation is performed using the Baum-Welch re-estimation formulas
24. Re-estimation Formulas
- The re-estimation formulas for the state means, covariances, and transition probabilities are given below, where N_j is the number of feature vectors in state j
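Given a fixed state sequence, the re-estimates reduce to sample averages over the feature vectors assigned to each state. A hedged reconstruction of the formulas the slide refers to, in this segmental form (N_j is the number of feature vectors in state j):

$$ \hat{\boldsymbol{\mu}}_j = \frac{1}{N_j} \sum_{t : q_t = j} \mathbf{o}_t $$

$$ \hat{\boldsymbol{\Sigma}}_j = \frac{1}{N_j} \sum_{t : q_t = j} (\mathbf{o}_t - \hat{\boldsymbol{\mu}}_j)(\mathbf{o}_t - \hat{\boldsymbol{\mu}}_j)^{T} $$

$$ \hat{a}_{ij} = \frac{\text{number of transitions from state } i \text{ to state } j}{\text{number of transitions out of state } i} $$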
25. Model Estimation
26. Maximization over the State Sequence
- Given the model, we maximize over the state sequence
- The probability can be rewritten as a product of transition probabilities and per-state observation likelihoods along the state sequence:
  P(O, q | λ) = π_{q_1} b_{q_1}(o_1) ∏_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)
27. Maximization over the State Sequence
- Applying the logarithm transforms the maximization of a product into the maximization of a sum
- We are still looking for the state sequence that maximizes the expression
- The optimal state sequence can be determined using the Viterbi algorithm
28. Trellis Structure of HMMs
- Redrawing the HMM as a trellis makes it easy to see the state sequence as a path through the trellis
- The optimal state sequence is determined by the Viterbi algorithm as the single best path through the trellis, i.e. the path that maximizes P(O, q | λ); a sketch of the algorithm follows
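A minimal log-domain sketch of the Viterbi search over the trellis (array names and conventions are illustrative; log_b[t, j] is the log observation likelihood of frame t in state j, e.g. from log_gaussian_diag above):

```python
import numpy as np

def viterbi(log_A, log_pi, log_b):
    """Best state path through the HMM trellis.

    log_A : (N, N) log transition matrix
    log_pi: (N,)   log initial state distribution
    log_b : (T, N) log observation likelihoods per frame and state
    Returns the optimal state sequence and its log probability.
    """
    T, N = log_b.shape
    delta = np.full((T, N), -np.inf)    # best log score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # backpointers

    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A          # scores[i, j]: from i to j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_b[t]

    # Backtrack along the stored pointers
    q = np.zeros(T, dtype=int)
    q[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, float(delta[-1, q[-1]])
```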
29. Training Procedure
- Cepstral calculation
- Uniform segmentation (initial state sequence)
- Estimation of λ (Baum-Welch re-estimation)
- State sequence segmentation (Viterbi)
- Converged? If no, repeat the estimation and segmentation steps; if yes, the model is trained
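A sketch of this loop for a single training utterance, reusing the uniform_segmentation, log_gaussian_diag, and viterbi sketches above (all names, the numerical floors, and the single-utterance simplification are assumptions, not details from the slides):

```python
import numpy as np

def train_word_hmm(features, N, n_iter=10):
    """Segmental training loop for one word (single-utterance sketch).

    features: (T, L) array of cepstral feature vectors for one utterance
    N       : number of HMM states
    Returns the state means, diagonal variances, and transition matrix.
    """
    T, _ = features.shape
    q = uniform_segmentation(T, N)                     # initial state sequence

    for _ in range(n_iter):
        # Maximize over the model given the state sequence (re-estimation)
        means, variances = [], []
        for j in range(N):
            frames = features[q == j]
            if len(frames) == 0:                       # guard: state left empty
                frames = features
            means.append(frames.mean(axis=0))
            variances.append(frames.var(axis=0) + 1e-6)
        means, variances = np.array(means), np.array(variances)

        A = np.full((N, N), 1e-12)                     # small floor avoids log(0)
        for t in range(1, T):
            A[q[t - 1], q[t]] += 1.0
        A /= A.sum(axis=1, keepdims=True)

        # Maximize over the state sequence given the model (Viterbi segmentation)
        log_b = np.array([[log_gaussian_diag(x, means[j], variances[j])
                           for j in range(N)] for x in features])
        log_pi = np.log(np.full(N, 1e-12))
        log_pi[0] = 0.0                                # left-right model starts in state 0
        q_new, _ = viterbi(np.log(A), log_pi, log_b)
        if np.array_equal(q_new, q):                   # segmentation stopped changing
            break
        q = q_new

    return means, variances, A
```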
30. Recognition
- We have a set of HMMs, one for each word
- Objective: choose the word model that maximizes the probability of the observation given the model (maximum-likelihood detection rule)
- The classifier for observation O is v* = argmax_v P(O | λ_v)
- The likelihood can be written as a summation over all state sequences: P(O | λ_v) = Σ_q P(O, q | λ_v)
31. Recognition
- Replace the full likelihood by an approximation that takes into account only the most probable state sequence capable of producing the observation
- Treating the most probable state sequence as the best path in the HMM trellis allows us to use the Viterbi algorithm to maximize this probability
- The best-path classifier for observation O is v* = argmax_v max_q P(O, q | λ_v), as in the sketch below
32. Recognition
Figure: recognition block diagram (cepstral calculation, score each word model, select maximum, output the index of the recognized word)
33. Conclusion
- Introduced hidden Markov models
- Described the process of isolated-word speech recognition
- Feature analysis, unit matching, training, recognition
- Other considerations
- Artificial Neural Networks (ANNs) for speech recognition
- Hybrid HMM/ANN models
- Minimum classification error HMM design