1
  • 14.0 Some Fundamental Problem-solving Approaches

References
  1. Sections 4.3.2 and 4.4.2 of Huang, or Sections 9.1-9.3 of Jelinek
  2. Section 6.4.3 of Rabiner and Juang
  3. Section 9.8 of Gold and Morgan, Speech and Audio Signal Processing
  4. http://www.stanford.edu/class/cs229/materials.html
  5. "Minimum Classification Error Rate Methods for Speech Recognition," IEEE Trans. Speech and Audio Processing, May 1997
  6. "Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition," IEEE Trans. Speech and Audio Processing, 2004
  7. "Minimum Phone Error and I-smoothing for Improved Discriminative Training," International Conference on Acoustics, Speech and Signal Processing, 2002
2
EM (Expectation and Maximization) Algorithm
  • Goal: estimating the parameters of some probabilistic models based on some criteria
  • Parameter Estimation Principles: given some observations X = {x1, x2, ..., xN}
  • Maximum Likelihood (ML) Principle: find the model parameter set θ such that the likelihood function is maximized, P(X|θ) → max
  • For example, if θ = (μ, σ) are the parameters of a normal distribution and X is i.i.d., then the ML estimate of (μ, σ) is μ = (1/N) Σi xi, σ² = (1/N) Σi (xi − μ)², i.e. the sample mean and sample variance (see the sketch below)
  • Maximum A Posteriori (MAP) Principle: find the model parameter set θ such that the a posteriori probability is maximized
  • i.e. P(θ|X) = P(X|θ) P(θ) / P(X) → max
  • ⇔ P(X|θ) P(θ) → max
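As a concrete illustration of the ML principle above, here is a minimal sketch (assuming i.i.d. Gaussian data generated with NumPy; the data and values are made up): the closed-form ML estimates are simply the sample mean and the 1/N sample variance.

```python
import numpy as np

# Hypothetical i.i.d. observations X = {x1, ..., xN} drawn from N(mu, sigma^2)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1000)

# ML estimates for a Gaussian: the (mu, sigma^2) maximizing P(X | mu, sigma^2)
# are the sample mean and the (biased, 1/N) sample variance
mu_ml = X.mean()
sigma2_ml = ((X - mu_ml) ** 2).mean()

print(f"ML estimate: mu = {mu_ml:.3f}, sigma^2 = {sigma2_ml:.3f}")
```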

3
EM (Expectation and Maximization) Algorithm
  • Why EM?
  • In some cases the evaluation of the objective function (e.g. the likelihood function) depends on some intermediate variables (latent data) which are not observable (e.g. the state sequence for HMM parameter training)
  • direct estimation of the desired parameters without such latent data is impossible or difficult, e.g. it is almost impossible to estimate (A, B, π) for an HMM without considering the state sequence
  • Iterative Procedure with Two Steps in Each Iteration
  • E (Expectation): take the expectation with respect to the possible distribution (values and probabilities) of the latent data, based on the current estimates of the desired parameters and conditioned on the given observations
  • M (Maximization): generate a new set of estimates of the desired parameters by maximizing the objective function (e.g. according to ML or MAP)
  • the objective function increases after each iteration and eventually converges

4
EM Algorithm: An Example
  • First, randomly assign θ(0) = {P(0)(A), P(0)(B), P(0)(R|A), P(0)(G|A), P(0)(R|B), P(0)(G|B)}, for example P(0)(A) = 0.4, P(0)(B) = 0.6, P(0)(R|A) = 0.5, P(0)(G|A) = 0.5, P(0)(R|B) = 0.5, P(0)(G|B) = 0.5
  • Expectation Step: find the expectation of log P(O|θ) over the 8 possible state sequences qi: AAA, BBB, AAB, BBA, ABA, BAB, ABB, BAA
  • Maximization Step: find θ(1) to maximize the expectation function Eq[log P(O|θ)]
  • Iterations: θ(0) → θ(1) → θ(2) → ... (a minimal numerical sketch of this example follows below)

For example, when qi = AAB
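A minimal numerical sketch of this urn example, assuming the standard mixture-model EM updates and a made-up observation sequence O = (R, G, R); it enumerates all 2³ = 8 urn sequences exactly as the slide suggests.

```python
import itertools
import numpy as np

# Hypothetical observation sequence for the urn example on this slide:
# each draw picks urn A or B with P(A)/P(B), then a Red or Green ball.
O = ['R', 'G', 'R']

# Initial guess theta(0), as given on the slide
P_urn  = {'A': 0.4, 'B': 0.6}
P_ball = {'A': {'R': 0.5, 'G': 0.5}, 'B': {'R': 0.5, 'G': 0.5}}

for _ in range(10):
    # E-step: posterior P(q | O, theta(k)) over all 2^3 = 8 urn sequences q
    seqs = list(itertools.product('AB', repeat=len(O)))
    joint = np.array([np.prod([P_urn[s] * P_ball[s][o] for s, o in zip(q, O)])
                      for q in seqs])
    post = joint / joint.sum()

    # M-step: re-estimate theta(k+1) from expected counts under the posterior,
    # which maximizes E_q[log P(O, q | theta)]
    cnt_urn  = {'A': 0.0, 'B': 0.0}
    cnt_ball = {'A': {'R': 0.0, 'G': 0.0}, 'B': {'R': 0.0, 'G': 0.0}}
    for w, q in zip(post, seqs):
        for s, o in zip(q, O):
            cnt_urn[s] += w
            cnt_ball[s][o] += w
    P_urn  = {s: cnt_urn[s] / len(O) for s in 'AB'}
    P_ball = {s: {o: cnt_ball[s][o] / cnt_urn[s] for o in 'RG'} for s in 'AB'}

print(P_urn, P_ball)
```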
5
EM Algorithm
  • In Each Iteration (assuming log P(x|θ) is the objective function)
  • E step: express the log-likelihood log P(x|θ) in terms of the distribution of the latent data conditioned on (x, θ(k))
  • M step: find a way to maximize the above function, such that it increases monotonically, i.e. log P(x|θ(k+1)) ≥ log P(x|θ(k))
  • The Conditions for Each Iteration to Proceed Based on the Criterion
  • x: observed (incomplete) data, z: latent data, (x, z): complete data

6
EM Algorithm
  • For the EM Iterations to Proceed Based on the Criterion
  • to make sure log P(x|θ(k+1)) ≥ log P(x|θ(k))
  • H(θ(k+1), θ(k)) ≤ H(θ(k), θ(k)) due to Jensen's Inequality
  • the only requirement is to have θ(k+1) such that Q(θ(k+1), θ(k)) − Q(θ(k), θ(k)) ≥ 0
  • E-step: estimate Q(θ, θ(k)) (the auxiliary function, or Q-function): the expectation of the objective function in terms of the distribution of the latent data conditioned on (x, θ(k))
  • M-step: θ(k+1) = arg maxθ Q(θ, θ(k)) (a compact form of this argument is sketched below)
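A compact sketch of the argument behind these bullets, written in the usual Q/H notation (x: observed data, z: latent data); this follows the standard EM convergence argument as in refs. 1-2.

```latex
\begin{align*}
\log P(x \mid \theta)
  &= Q(\theta, \theta^{(k)}) - H(\theta, \theta^{(k)}), \quad\text{where}\\
Q(\theta, \theta^{(k)})
  &= \sum_{z} P(z \mid x, \theta^{(k)}) \log P(x, z \mid \theta),\\
H(\theta, \theta^{(k)})
  &= \sum_{z} P(z \mid x, \theta^{(k)}) \log P(z \mid x, \theta).
\end{align*}
% Jensen's inequality gives H(\theta, \theta^{(k)}) \le H(\theta^{(k)}, \theta^{(k)}) for all \theta,
% so any \theta^{(k+1)} with Q(\theta^{(k+1)}, \theta^{(k)}) - Q(\theta^{(k)}, \theta^{(k)}) \ge 0
% satisfies \log P(x \mid \theta^{(k+1)}) \ge \log P(x \mid \theta^{(k)}).
```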
7
Example: Use of the EM Algorithm in Solving Problem 3 of HMM
  • Observed data: observations O; latent data: state sequence q
  • The probability of the complete data is P(O, q|θ) = P(O|q, θ) P(q|θ)
  • E-Step: Q(θ, θ(k)) = E[log P(O, q|θ) | O, θ(k)] = Σq P(q|O, θ(k)) log P(O, q|θ)
  • θ(k): k-th estimate of θ (known); θ: unknown parameter set to be estimated
  • M-Step:
  • Find θ(k+1) such that θ(k+1) = arg maxθ Q(θ, θ(k))
  • Given the various constraints (e.g. the probabilities in each distribution summing to 1), it can be shown that the above maximization leads to the formulas obtained previously (sketched below)
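For reference, a sketch of what "the formulas obtained previously" look like in the usual Baum-Welch notation, assuming a discrete-observation HMM with state posteriors γ_t(i) = P(q_t = i | O, θ(k)) and pair posteriors ε_t(i,j) = P(q_t = i, q_{t+1} = j | O, θ(k)) computed in the E-step (refs. 1-2).

```latex
\begin{align*}
\bar{\pi}_i  &= \gamma_1(i),\\
\bar{a}_{ij} &= \frac{\sum_{t=1}^{T-1} \varepsilon_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},\\
\bar{b}_j(k) &= \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.
\end{align*}
```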

8
Minimum-Classification-Error (MCE) and Discriminative Training
  • General Objective: find an optimal set of parameters (e.g. for recognition models) to minimize the expected error of classification
  • the statistics of the test data may be quite different from those of the training data
  • training data is never enough
  • Assume the recognizer operates with the following classification principles: Ci, i = 1, 2, ..., M: the M classes
  • λ(i): statistical model for Ci
  • Λ = {λ(i), i = 1, ..., M}: the set of all models for all classes
  • X: observations; gi(X, Λ): class-conditioned likelihood function, for example gi(X, Λ) = P(X|λ(i))
  • C(X) = Ci if gi(X, Λ) = maxj gj(X, Λ) (classification principle)
  • an error happens when P(X|λ(i)) is maximum but X ∉ Ci
  • Conventional Training Criterion: find λ(i) such that P(X|λ(i)) is maximum (Maximum Likelihood) if X ∈ Ci
  • This does not always lead to minimum classification error, since it does not consider the mutual relationship among competing classes
  • The competing classes may give a higher likelihood for the test data

9
Minimum-Classification-Error (MCE) Training
  • One form of the misclassification measure
  • a comparison between the likelihood functions of the correct class and the competing classes
  • A continuous loss function l(d) is defined:
  • l(d) → 0 when d → −∞
  • l(d) → 1 when d → +∞
  • θ: the point near which l(d) switches from 0 to 1
  • γ: determines the slope at the switching point
  • Overall Classification Performance Measure: the total smoothed loss over the training set (one standard form is sketched below)
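Since the formulas themselves do not survive in this transcript, here is one standard form from the MCE literature [5], with gi taken as a log-likelihood-type discriminant; the exact parameterization on the original slide may differ.

```latex
\begin{align*}
d_i(X; \Lambda) &= -\,g_i(X; \Lambda)
   + \frac{1}{\eta}\log\Bigl[\frac{1}{M-1}\sum_{j \ne i} \exp\bigl(\eta\, g_j(X; \Lambda)\bigr)\Bigr],\\
\ell\bigl(d_i(X; \Lambda)\bigr) &= \frac{1}{1 + \exp\bigl(-\gamma\,(d_i(X; \Lambda) - \theta)\bigr)},\\
L(\Lambda) &= \sum_{i=1}^{M} \sum_{X \in C_i} \ell\bigl(d_i(X; \Lambda)\bigr)
   \quad\text{(overall classification performance measure)}.
\end{align*}
```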

10
Minimum-Classification-Error (MCE) Training
  • Find Λ such that the overall classification performance measure above is minimized
  • the above objective function is in general difficult to minimize directly
  • a local minimum can be obtained iteratively using a gradient (steepest) descent algorithm (a toy one-step sketch follows below)
  • every training observation may change the parameters of ALL models, not only the model for its own class
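A toy sketch of one steepest-descent MCE step, under assumed toy models (each class represented by a single spherical-Gaussian mean) and the sigmoid loss sketched on the previous slide; note that the gradient touches every class model, not only the correct one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mce_step(m, x, label, gamma=2.0, theta=0.0, eta=1.0, lr=0.1):
    """One steepest-descent MCE update for a single training observation x.

    m: (M, D) array of class means (toy models), updated in place.
    """
    M = m.shape[0]
    g = -0.5 * ((x - m) ** 2).sum(axis=1)     # g_j(x): log-likelihood up to a constant
    others = np.delete(np.arange(M), label)
    # misclassification measure: d = -g_correct + soft-max over competing classes
    lse = np.log(np.mean(np.exp(eta * g[others]))) / eta
    d = -g[label] + lse
    loss = sigmoid(gamma * (d - theta))       # smooth 0/1 loss l(d)
    dloss_dd = gamma * loss * (1.0 - loss)
    # d depends on g_j of ALL classes, so every model receives a gradient
    dd_dg = np.zeros(M)
    dd_dg[label] = -1.0
    w = np.exp(eta * g[others])
    dd_dg[others] = w / w.sum()
    dg_dm = x - m                             # d g_j / d m_j for each class j
    m -= lr * (dloss_dd * dd_dg)[:, None] * dg_dm
    return loss

# Hypothetical usage: 3 classes, 2-dimensional features
rng = np.random.default_rng(1)
means = rng.normal(size=(3, 2))
print("smoothed loss:", mce_step(means, np.array([0.5, -0.2]), label=0))
```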

11
Discriminative Training for Large-Vocabulary Speech Recognition
  • Minimum Bayes Risk (MBR): adjusting all model parameters to minimize the Bayes risk
  • Λ = {λi, i = 1, 2, ..., N}: acoustic models
  • Γ: language model parameters
  • Or: r-th training utterance
  • sr: correct transcription of Or
  • Bayes Risk: the expected loss of the recognition output under the posterior PΛ,Γ(u|Or), summed over the training utterances (see the sketch below)
  • u: a possible recognition output
  • L(u, sr): loss function
  • PΛ,Γ(u|Or): posterior probability of u given Or based on Λ, Γ
  • Other definitions of L(u, sr) are possible
  • Minimum Phone Error (MPE) Training
  • Acc(u, sr): phone accuracy
  • Better features are obtainable in the same way
  • e.g. yt = xt + M ht (feature-space MPE)
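A sketch of the objectives described on this slide, written in the usual notation of refs. 6-7; the exact loss definitions on the original slide may differ slightly, and ht is an assumed high-dimensional auxiliary feature vector with M a matrix trained under the same criterion.

```latex
\begin{align*}
\text{MBR:}\quad
  &\{\Lambda, \Gamma\}^{*} = \arg\min_{\Lambda, \Gamma}
     \sum_{r} \sum_{u} P_{\Lambda,\Gamma}(u \mid O_r)\, L(u, s_r),\\
\text{MPE:}\quad
  &L(u, s_r) = -\,\mathrm{Acc}(u, s_r)
   \;\Longleftrightarrow\;
   \max_{\Lambda,\Gamma} \sum_{r} \sum_{u} P_{\Lambda,\Gamma}(u \mid O_r)\, \mathrm{Acc}(u, s_r),\\
\text{fMPE:}\quad
  &y_t = x_t + M h_t
   \quad\text{(feature-space MPE: the projection } M \text{ is trained with the MPE criterion)}.
\end{align*}
```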