Title: HMM - Part 2
1. HMM - Part 2
- The EM algorithm
- Continuous density HMM
2. The EM Algorithm
- EM: Expectation Maximization
- Why EM?
  - Simple optimization algorithms for likelihood functions rely on intermediate variables, called latent data
    - For an HMM, the state sequence is the latent data
  - Direct access to the data necessary to estimate the parameters is impossible or difficult
    - For an HMM, it is almost impossible to estimate (A, B, π) without considering the state sequence
- Two major steps (a small runnable sketch follows this slide)
  - E step: calculate the expectation with respect to the latent data, given the current estimate of the parameters and the observations
  - M step: estimate a new set of parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion
  (ML vs. MAP: see slide 29)
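To make the E/M alternation concrete before the HMM-specific derivation, here is a minimal runnable sketch of EM on a simpler latent-variable model, a two-component 1-D Gaussian mixture. The data, initial values, and variable names are hypothetical illustrations, not taken from the slides; the unobserved component label of each sample plays the role the state sequence plays for an HMM.

```python
import numpy as np

# Illustration only: EM for a 2-component 1-D Gaussian mixture.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# Initial guesses (hypothetical values)
w = np.array([0.5, 0.5])        # mixture weights
mu = np.array([-1.0, 1.0])      # means
var = np.array([1.0, 1.0])      # variances

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E step: posterior probability of each component given each sample
    # and the current parameter estimates (expectation over the latent data).
    p = w * gauss(x[:, None], mu, var)          # shape (N, 2)
    gamma = p / p.sum(axis=1, keepdims=True)

    # M step: ML re-estimation of the parameters from the expected counts.
    nk = gamma.sum(axis=0)
    w = nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, var)   # approaches the true weights, means, variances
```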
3. The EM Algorithm (cont.)
- The EM algorithm is important to HMMs and many other model-learning techniques
- Basic idea
  - Assume we have λ and the probability that each Q = q occurred in the generation of O = o
  - i.e., we have in fact observed a complete data pair (o, q) with frequency proportional to the probability P(O = o, Q = q | λ)
  - We then find a new λ̄ that maximizes the expected complete-data log-likelihood Q(λ, λ̄) (written out after this list)
  - It can be guaranteed that P(O = o | λ̄) ≥ P(O = o | λ)
- EM can discover parameters of the model λ that maximize the log-likelihood of the incomplete data, log P(O = o | λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O = o, Q = q | λ)
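Written out, the quantity maximized in the M step (the auxiliary function introduced on slide 6) and the resulting guarantee are:

$$
Q(\lambda,\bar{\lambda}) \;=\; \sum_{q} P(O=o, Q=q \mid \lambda)\,\log P(O=o, Q=q \mid \bar{\lambda}),
\qquad
Q(\lambda,\bar{\lambda}) \ge Q(\lambda,\lambda) \;\Rightarrow\; P(O=o \mid \bar{\lambda}) \ge P(O=o \mid \lambda).
$$

This form uses the unnormalized weights P(O, q | λ); dividing by P(O | λ) gives the equivalent form with posterior weights P(q | O, λ), and either way the maximizing λ̄ is the same.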
4. The EM Algorithm (cont.)
5. The EM Algorithm (cont.)
1. Jensen's inequality: if f is a concave function and X is a random variable (r.v.), then E[f(X)] ≤ f(E[X])
2. log x ≤ x - 1, for x > 0
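These two facts are used to show that increasing Q(λ, λ̄) cannot decrease the data log-likelihood. A sketch of the standard argument, with Q defined using the weights P(O, q | λ) as above:

$$
\log P(O\mid\bar{\lambda}) - \log P(O\mid\lambda)
= \log \sum_{q} P(q \mid O,\lambda)\,\frac{P(O,q\mid\bar{\lambda})}{P(O,q\mid\lambda)}
\;\ge\; \sum_{q} P(q\mid O,\lambda)\,\log\frac{P(O,q\mid\bar{\lambda})}{P(O,q\mid\lambda)}
= \frac{Q(\lambda,\bar{\lambda}) - Q(\lambda,\lambda)}{P(O\mid\lambda)} .
$$

Since P(O | λ) > 0, any λ̄ with Q(λ, λ̄) ≥ Q(λ, λ) satisfies log P(O | λ̄) ≥ log P(O | λ).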
6. Solution to Problem 3 - The EM Algorithm
- The auxiliary function
  Q(λ, λ̄) = Σ_q P(O, q | λ) log P(O, q | λ̄)
- where P(O, q | λ) and log P(O, q | λ̄) can be expressed as
  P(O, q | λ) = π_{q_1} b_{q_1}(o_1) Π_{t=2..T} a_{q_{t-1} q_t} b_{q_t}(o_t)
  log P(O, q | λ̄) = log π̄_{q_1} + Σ_{t=2..T} log ā_{q_{t-1} q_t} + Σ_{t=1..T} log b̄_{q_t}(o_t)
7. Solution to Problem 3 - The EM Algorithm (cont.)
- Substituting the expression for log P(O, q | λ̄), the auxiliary function can be rewritten as
  Q(λ, λ̄) = Σ_q P(O, q | λ) log π̄_{q_1}
           + Σ_q P(O, q | λ) Σ_{t=2..T} log ā_{q_{t-1} q_t}
           + Σ_q P(O, q | λ) Σ_{t=1..T} log b̄_{q_t}(o_t)
8. Solution to Problem 3 - The EM Algorithm (cont.)
- The auxiliary function separates into three independent terms, corresponding respectively to π̄_i, ā_ij, and b̄_j(k)
- Maximization of Q(λ, λ̄) can therefore be done by maximizing the individual terms separately, subject to the probability constraints
- All of these terms have the following form:
  F(y) = Σ_j w_j log y_j, subject to Σ_j y_j = 1, y_j ≥ 0
9. Solution to Problem 3 - The EM Algorithm (cont.)
- Claim: F(y) = Σ_j w_j log y_j, subject to the constraint Σ_j y_j = 1, is maximized by y_j = w_j / Σ_k w_k
- Proof: apply a Lagrange multiplier to enforce the constraint Σ_j y_j = 1 (see the derivation below)
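A standard way to carry out the Lagrange-multiplier step:

$$
F(y,\varepsilon) = \sum_j w_j \log y_j + \varepsilon\Big(\sum_j y_j - 1\Big),\qquad
\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \varepsilon = 0
\;\Rightarrow\; y_j = -\frac{w_j}{\varepsilon}.
$$

Substituting into the constraint gives ε = -Σ_k w_k, hence y_j = w_j / Σ_k w_k; concavity of the logarithm guarantees this stationary point is the maximum.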
10. Solution to Problem 3 - The EM Algorithm (cont.)
11. Solution to Problem 3 - The EM Algorithm (cont.)
12. Solution to Problem 3 - The EM Algorithm (cont.)
13. Solution to Problem 3 - The EM Algorithm (cont.)
- The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as shown below
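In the usual notation, with γ_t(i) = P(q_t = i | O, λ) and ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) computed from the forward/backward variables, these are the familiar Baum-Welch re-estimation formulas:

$$
\bar{\pi}_i = \gamma_1(i),\qquad
\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)},\qquad
\bar{b}_j(k) = \frac{\sum_{t:\,o_t=v_k}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)} .
$$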
14. Discrete vs. Continuous Density HMMs
- Two major types of HMMs, according to the observations
- Discrete and finite observations
  - The observations that all distinct states generate are finite in number, i.e., V = {v1, v2, v3, ..., vM}, vk ∈ R^L
  - In this case, the observation probability distribution in state j, B = {bj(k)}, is defined as
    bj(k) = P(ot = vk | qt = j), 1 ≤ k ≤ M, 1 ≤ j ≤ N
    (ot: observation at time t; qt: state at time t)
  - → bj(k) consists of only M probability values
- Continuous and infinite observations
  - The observations that all distinct states generate are infinite and continuous, i.e., V = {v | v ∈ R^L}
  - In this case, the observation probability distribution in state j, B = {bj(v)}, is defined as
    bj(v) = f(ot = v | qt = j), 1 ≤ j ≤ N
    (ot: observation at time t; qt: state at time t)
  - → bj(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian (normal) distributions
15. Gaussian Distribution
- A continuous random variable X is said to have a Gaussian distribution with mean μ and variance σ² (σ > 0) if X has a continuous pdf of the following form (see below)
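For reference, the standard univariate Gaussian pdf:

$$
f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big),\qquad -\infty < x < \infty .
$$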
16. Multivariate Gaussian Distribution
- If X = (X1, X2, X3, ..., XL) is an L-dimensional random vector with a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, then the pdf can be expressed as shown below
- If X1, X2, X3, ..., XL are independent random variables, the covariance matrix reduces to a diagonal matrix, i.e., Σ = diag(σ1², σ2², ..., σL²)
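The multivariate Gaussian density, and its simplification when Σ is diagonal:

$$
f_X(x) = \frac{1}{(2\pi)^{L/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\Big),
\qquad
f_X(x) = \prod_{l=1}^{L}\frac{1}{\sqrt{2\pi}\,\sigma_l}\exp\!\Big(-\frac{(x_l-\mu_l)^2}{2\sigma_l^2}\Big)\ \ (\text{diagonal }\Sigma).
$$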
17. Multivariate Mixture Gaussian Distribution
- An L-dimensional random vector X = (X1, X2, X3, ..., XL) has a multivariate mixture Gaussian distribution if its pdf is a weighted sum of multivariate Gaussian densities (see below)
- In a CDHMM, bj(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian distributions
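The mixture density, and the form it takes as the state-j observation pdf of a CDHMM with M mixture components:

$$
f_X(x) = \sum_{k=1}^{M} c_k\,N(x;\mu_k,\Sigma_k),\quad \sum_{k=1}^{M} c_k = 1,\ c_k \ge 0;
\qquad
b_j(v) = \sum_{k=1}^{M} c_{jk}\,N(v;\mu_{jk},\Sigma_{jk}).
$$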
18. Solution to Problem 3 - The Intuitive View (CDHMM)
- Define a new variable γt(j, k) (written out below)
  - the probability of being in state j at time t with the k-th mixture component accounting for ot
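In terms of the forward and backward variables αt(j) and βt(j), the standard definition is:

$$
\gamma_t(j,k) =
\left[\frac{\alpha_t(j)\,\beta_t(j)}{\sum_{i=1}^{N}\alpha_t(i)\,\beta_t(i)}\right]
\left[\frac{c_{jk}\,N(o_t;\mu_{jk},\Sigma_{jk})}{\sum_{m=1}^{M}c_{jm}\,N(o_t;\mu_{jm},\Sigma_{jm})}\right].
$$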
19. Solution to Problem 3 - The Intuitive View (CDHMM) (cont.)
- The re-estimation formulae for the mixture weights c̄jk, mean vectors μ̄jk, and covariance matrices Σ̄jk are shown below
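The standard re-estimation formulas in terms of γt(j, k):

$$
\bar{c}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M}\gamma_t(j,m)},\qquad
\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,o_t}{\sum_{t=1}^{T}\gamma_t(j,k)},\qquad
\bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(o_t-\bar{\mu}_{jk})(o_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)} .
$$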
20. Solution to Problem 3 - The EM Algorithm (CDHMM)
- Express the likelihood with respect to each single mixture component
- K: one of the possible mixture-component sequences accompanying the state sequence Q
21. Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)
- The auxiliary function can be written as a sum over both the state sequences Q and the mixture-component sequences K (see below)
- Compared with the DHMM case, we further need to solve for the mixture weights, mean vectors, and covariance matrices
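One common way of writing the CDHMM auxiliary function, and the extra term (beyond the DHMM case) that has to be maximized, up to a constant factor that does not affect the maximization:

$$
Q(\lambda,\bar{\lambda}) = \sum_{Q}\sum_{K} P(O,Q,K\mid\lambda)\,\log P(O,Q,K\mid\bar{\lambda}),
\qquad
\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}\gamma_t(j,k)\big[\log\bar{c}_{jk} + \log N(o_t;\bar{\mu}_{jk},\bar{\Sigma}_{jk})\big].
$$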
22. Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)
- The new model parameter set can be
derived as
23. Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)
- The new model parameter sets can
be derived as
24. Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)
We thus solve
25. Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)
26. Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)
27. HMM Topology
- Speech is a time-evolving, non-stationary signal
  - Each HMM state can capture some quasi-stationary segment of the non-stationary speech signal
  - A left-to-right topology is a natural candidate for modeling the speech signal (see the sketch after this list)
- Each state has a state-dependent output probability distribution that can be used to interpret the observable speech signal
- It is common to represent a phone using 3 to 5 states (English) and a syllable using 6 to 8 states (Mandarin Chinese)
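A minimal sketch of what a left-to-right topology looks like as a transition matrix; the three states and the probability values are hypothetical, not taken from the slides.

```python
import numpy as np

# Left-to-right HMM: each state may only loop on itself or move forward,
# so the transition matrix is upper triangular.
A = np.array([
    [0.6, 0.4, 0.0],   # state 1: self-loop or advance to state 2
    [0.0, 0.7, 0.3],   # state 2: self-loop or advance to state 3
    [0.0, 0.0, 1.0],   # state 3: final (absorbing) state
])
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a probability distribution
```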
28. HMM Limitations
- HMMs have proved to be a good model of speech variability in time and in feature space simultaneously
- There are, however, a number of limitations in conventional HMMs
  - The state duration implicitly follows an exponentially decaying (geometric) distribution, which does not provide an adequate representation of the temporal structure of speech (see the expression after this list)
  - First-order (Markov) assumption: the state transition depends only on the previous state
  - Output-independence assumption: each observation frame depends only on the state that generated it, not on neighboring observation frames
  - HMMs are well defined only for processes that are a function of a single independent variable, such as time or one-dimensional position
- Although speech recognition remains the dominant field in which HMMs are applied, their use has been spreading steadily to other fields
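The duration distribution implied by a state's self-transition probability a_ii, which is the source of the first limitation above:

$$
P(\text{state } i \text{ lasts exactly } d \text{ frames}) = a_{ii}^{\,d-1}\,(1-a_{ii}),\qquad d = 1,2,\dots
$$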
29. ML vs. MAP
- Estimation principles based on the observations O = {o1, o2, ..., oT}
- The Maximum Likelihood (ML) principle: find the model parameters λ so that the likelihood P(O | λ) is maximum
  - For example, if λ = {μ, Σ} are the parameters of a multivariate normal distribution and the observations in O are i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are the sample mean and sample covariance (see below)
- The Maximum A Posteriori (MAP) principle: find the model parameters λ so that the posterior probability P(λ | O) is maximum
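Writing these out for the Gaussian example, and relating the two criteria through Bayes' rule:

$$
\hat{\mu}_{ML} = \frac{1}{T}\sum_{t=1}^{T} o_t,\qquad
\hat{\Sigma}_{ML} = \frac{1}{T}\sum_{t=1}^{T}(o_t-\hat{\mu}_{ML})(o_t-\hat{\mu}_{ML})^{T};
\qquad
\lambda_{MAP} = \arg\max_{\lambda} P(\lambda\mid O) = \arg\max_{\lambda} P(O\mid\lambda)\,P(\lambda).
$$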
30. A Simple Example - The Forward/Backward Procedure
- [Trellis diagram: two states, S1 and S2, on the state axis; time steps 1, 2, 3 on the time axis; observations o1, o2, o3 emitted at each time step]
31. A Simple Example (cont.)
- Enumerating the state sequences: q = (1, 1, 1), q = (1, 1, 2), ..., q = (2, 2, 2)
- Total: 8 paths (2 states over 3 time steps, 2^3 = 8; see the sketch below)
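A runnable sketch of this example with hypothetical model parameters (the π, A, and b values below are illustrations, not the numbers from the slides): it sums P(O, q | λ) over all 8 paths and checks that the forward procedure gives the same P(O | λ) with far less work.

```python
import numpy as np
from itertools import product

# Toy setting in the spirit of slides 30-32: 2 states, 3 observation frames.
pi = np.array([0.8, 0.2])                 # initial state probabilities
A  = np.array([[0.6, 0.4],                # a_ij = P(q_{t+1}=j | q_t=i)
               [0.3, 0.7]])
# b[j, t] = b_j(o_t): probability of the observed symbol at time t in state j
b  = np.array([[0.7, 0.4, 0.1],
               [0.2, 0.5, 0.6]])
T, N = 3, 2

# Brute force: sum P(O, q | lambda) over all 2^3 = 8 state paths
p_brute = 0.0
for q in product(range(N), repeat=T):
    p = pi[q[0]] * b[q[0], 0]
    for t in range(1, T):
        p *= A[q[t - 1], q[t]] * b[q[t], t]
    p_brute += p

# Forward procedure: alpha_t(j) = P(o_1..o_t, q_t = j | lambda)
alpha = pi * b[:, 0]
for t in range(1, T):
    alpha = (alpha @ A) * b[:, t]
p_forward = alpha.sum()

print(p_brute, p_forward)   # identical values, but O(N^2 T) instead of O(N^T)
```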
32. A Simple Example (cont.)
33. Appendix - Matrix Calculus
34. Appendix - Matrix Calculus (cont.)
35. Appendix - Matrix Calculus (cont.)
- Property 1 - Extension
- Proof
36. Appendix - Matrix Calculus (cont.)
37. Appendix - Matrix Calculus (cont.)
38. Appendix - Matrix Calculus (cont.)
39. Appendix - Matrix Calculus (cont.)
40. Appendix - Matrix Calculus (cont.)
41. Appendix - Matrix Calculus (cont.)
42. Appendix - Matrix Calculus (cont.)
43. Appendix - Matrix Calculus (cont.)