Title: 14.0 Some Fundamental Problem-solving Approaches
1. 14.0 Some Fundamental Problem-solving Approaches
References:
1. Sections 4.3.2 and 4.4.2 of Huang, or Sections 9.1-9.3 of Jelinek
2. Section 6.4.3 of Rabiner and Juang
3. Section 9.8 of Gold and Morgan, Speech and Audio Signal Processing
4. http://www.stanford.edu/class/cs229/materials.html
5. "Minimum Classification Error Rate Methods for Speech Recognition," IEEE Trans. Speech and Audio Processing, May 1997
6. "Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition," IEEE Trans. Speech and Audio Processing, 2004
7. "Minimum Phone Error and I-smoothing for Improved Discriminative Training," International Conference on Acoustics, Speech and Signal Processing, 2002
2. EM (Expectation and Maximization) Algorithm
- Goal: estimate the parameters of some probabilistic models based on some criterion
- Parameter Estimation Principles: given some observations X = {x1, x2, ..., xN}
  - Maximum Likelihood (ML) Principle: find the model parameter set θ such that the likelihood function is maximized, P(X|θ) → max
    - For example, if θ = (μ, σ²) are the parameters of a normal distribution and X is i.i.d., then the ML estimates are μ = (1/N) Σn xn and σ² = (1/N) Σn (xn - μ)² (see the sketch below)
  - Maximum A Posteriori (MAP) Principle: find the model parameter θ so that the a posteriori probability is maximized
    - i.e. P(θ|X) = P(X|θ) P(θ) / P(X) → max
    - equivalently, P(X|θ) P(θ) → max
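As a quick illustration of the ML principle above, the following is a minimal sketch (in Python with NumPy, not part of the original slides) that computes the ML estimates of a 1-D Gaussian from i.i.d. samples; the synthetic data and the random seed are arbitrary.

```python
import numpy as np

# Minimal sketch: ML estimates of a 1-D Gaussian from i.i.d. samples
# (sample mean and the biased sample variance, i.e. divide by N, not N-1).
def ml_gaussian(x):
    x = np.asarray(x, dtype=float)
    mu = x.mean()                    # ML estimate of the mean
    var = ((x - mu) ** 2).mean()     # ML estimate of the variance
    return mu, var

# Example usage with synthetic data
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=1000)
mu_hat, var_hat = ml_gaussian(samples)
print(mu_hat, var_hat)   # close to 2.0 and 1.5**2 = 2.25
```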
3. EM (Expectation and Maximization) Algorithm
- Why EM?
  - In some cases the evaluation of the objective function (e.g. the likelihood function) depends on some intermediate variables (latent data) which are not observable (e.g. the state sequence in HMM parameter training)
  - Direct estimation of the desired parameters without such latent data is impossible or difficult, e.g. it is almost impossible to estimate (A, B, π) for an HMM without considering the state sequence
- Iterative Procedure with Two Steps in Each Iteration (see the sketch below)
  - E (Expectation): take the expectation with respect to the possible distribution (values and probabilities) of the latent data, based on the current estimates of the desired parameters and conditioned on the given observations
  - M (Maximization): generate a new set of estimates of the desired parameters by maximizing the objective function (e.g. according to ML or MAP)
  - The objective function increases after each iteration and eventually converges
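The following schematic loop (an illustrative sketch, not from the slides) shows the generic shape of the two-step iteration; e_step, m_step and log_likelihood are placeholder callables to be supplied for a concrete model such as the examples on the next slides.

```python
# Schematic EM loop matching the two steps above; e_step, m_step and
# log_likelihood are placeholders to be filled in for a concrete model.
def em(observations, theta0, e_step, m_step, log_likelihood,
       max_iters=100, tol=1e-6):
    theta = theta0
    prev_ll = log_likelihood(observations, theta)
    for _ in range(max_iters):
        stats = e_step(observations, theta)   # E: posterior over latent data
        theta = m_step(observations, stats)   # M: maximize expected objective
        ll = log_likelihood(observations, theta)
        if ll - prev_ll < tol:                # objective is non-decreasing
            break
        prev_ll = ll
    return theta
```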
4. EM Algorithm: An Example
- First, randomly assign θ(0) = {P(0)(A), P(0)(B), P(0)(R|A), P(0)(G|A), P(0)(R|B), P(0)(G|B)}, for example P(0)(A) = 0.4, P(0)(B) = 0.6, P(0)(R|A) = 0.5, P(0)(G|A) = 0.5, P(0)(R|B) = 0.5, P(0)(G|B) = 0.5
- Expectation Step: find the expectation of log P(O|θ) over the 8 possible state sequences qi: AAA, BBB, AAB, BBA, ABA, BAB, ABB, BAA
- Maximization Step: find θ(1) to maximize the expectation function Eq[log P(O|θ)]
- Iterations: θ(0) → θ(1) → θ(2) → ...
- For example, when qi = AAB, the corresponding term in the expectation can be evaluated directly from θ(0) (a code sketch of one full iteration follows)
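A minimal sketch of one full E/M iteration for this toy example, enumerating the 8 state sequences explicitly; the observation sequence O = (R, G, R) is an assumed example, since the slide does not list the actual observations, and each draw is assumed to pick urn A or B independently.

```python
from itertools import product

# Assumption (not in the slides): O = (R, G, R), and each observation
# independently comes from urn A or B with probabilities P(A), P(B).
O = ['R', 'G', 'R']
theta = {'A': 0.4, 'B': 0.6,
         ('R', 'A'): 0.5, ('G', 'A'): 0.5,
         ('R', 'B'): 0.5, ('G', 'B'): 0.5}

def joint(q, O, th):
    """P(O, q | theta) for one of the 8 state sequences q."""
    p = 1.0
    for o, s in zip(O, q):
        p *= th[s] * th[(o, s)]
    return p

def em_step(O, th):
    seqs = list(product('AB', repeat=len(O)))       # the 8 state sequences
    joints = {q: joint(q, O, th) for q in seqs}
    total = sum(joints.values())                    # P(O | theta)
    post = {q: joints[q] / total for q in seqs}     # E-step: P(q | O, theta)

    new = {}                                        # M-step: expected counts
    for s in 'AB':
        n_s = sum(p * q.count(s) for q, p in post.items())
        new[s] = n_s / len(O)
        for sym in 'RG':
            n_so = sum(p * sum(1 for o, st in zip(O, q) if st == s and o == sym)
                       for q, p in post.items())
            new[(sym, s)] = n_so / n_s if n_s > 0 else 0.0
    return new

theta1 = em_step(O, theta)   # theta(0) -> theta(1); iterate for theta(2), ...
print(theta1)
```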
5. EM Algorithm
- In Each Iteration (assuming log P(x|θ) is the objective function)
  - E step: express the log-likelihood log P(x|θ) in terms of the distribution of the latent data conditioned on (x, θ(k))
  - M step: find a way to maximize the above function so that it increases monotonically, i.e., log P(x|θ(k+1)) ≥ log P(x|θ(k))
- The Conditions for Each Iteration to Proceed based on the Criterion
  - x: observed (incomplete) data, z: latent data, (x, z): complete data
6. EM Algorithm
- For the EM Iterations to Proceed based on the Criterion
  - Decompose the objective: log P(x|θ) = Q(θ, θk) - H(θ, θk), where Q(θ, θk) = Σz P(z|x, θk) log P(x, z|θ) and H(θ, θk) = Σz P(z|x, θk) log P(z|x, θ)
  - To make sure log P(x|θk+1) ≥ log P(x|θk): H(θk+1, θk) ≤ H(θk, θk) due to Jensen's Inequality, so the only requirement is to have θk+1 such that Q(θk+1, θk) - Q(θk, θk) ≥ 0
- E-step: estimate Q(θ, θk), the auxiliary function (or Q-function), i.e. the expectation of the objective function with respect to the distribution of the latent data conditioned on (x, θk)
- M-step: θk+1 = arg maxθ Q(θ, θk) (a numerical check of the resulting monotone increase follows)
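The guarantee above can be checked numerically. The sketch below runs EM for a 2-component 1-D Gaussian mixture (an illustrative model, not from the slides) and asserts that log P(x|θ) never decreases across iterations; the data and initial parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])

w   = np.array([0.5, 0.5])      # mixture weights
mu  = np.array([-1.0, 1.0])     # component means
var = np.array([1.0, 1.0])      # component variances

def log_lik(x, w, mu, var):
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(comp.sum(axis=1)).sum()

prev = log_lik(x, w, mu, var)
for _ in range(50):
    # E-step: posterior responsibility of each component for each sample
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = comp / comp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the expected (soft) counts
    n = gamma.sum(axis=0)
    w = n / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / n
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n
    cur = log_lik(x, w, mu, var)
    assert cur >= prev - 1e-9    # objective is monotonically non-decreasing
    prev = cur
print(w, mu, var)
```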
7. Example: Use of the EM Algorithm in Solving Problem 3 of HMM
- Observed data: the observations O; latent data: the state sequence q
- The probability of the complete data is P(O, q|θ) = P(O|q, θ) P(q|θ)
- E-Step: Q(θ, θk) = E[log P(O, q|θ) | O, θk] = Σq P(q|O, θk) log P(O, q|θ)
  - θk: the k-th estimate of θ (known); θ: the unknown parameter to be estimated
- M-Step: find θk+1 such that θk+1 = arg maxθ Q(θ, θk)
  - Given the various constraints (e.g. Σj aij = 1, Σk bj(k) = 1, Σi πi = 1), it can be shown that the above maximization leads to the re-estimation formulas obtained previously (a code sketch follows)
8. Minimum-Classification-Error (MCE) and Discriminative Training
- General Objective: find an optimal set of parameters (e.g. for the recognition models) to minimize the expected classification error
  - the statistics of the test data may be quite different from those of the training data
  - training data is never enough
- Assume the recognizer is operated with the following classification principles: Ci, i = 1, 2, ..., M, the M classes
  - Λ(i): statistical model for Ci
  - Λ = {Λ(i)}, i = 1, ..., M: the set of all models for all classes
  - X: observations; gi(X, Λ): class-conditioned likelihood function, for example gi(X, Λ) = P(X|Λ(i))
  - C(X) = Ci if gi(X, Λ) = maxj gj(X, Λ): the classification principle
  - an error happens when P(X|Λ(i)) is maximum but X ∉ Ci
- Conventional Training Criterion: find Λ(i) such that P(X|Λ(i)) is maximum (Maximum Likelihood) if X ∈ Ci
  - This does not always lead to minimum classification error, since it does not consider the mutual relationship among competing classes
  - The competing classes may give a higher likelihood function for the test data
9. Minimum-Classification-Error (MCE) Training
- One form of the misclassification measure: a comparison between the likelihood function for the correct class and those of the competing classes, e.g.
  di(X) = -gi(X, Λ) + log [ (1/(M-1)) Σj≠i exp(η gj(X, Λ)) ]^(1/η)
- A continuous loss function is then defined, e.g. the sigmoid l(d) = 1 / (1 + exp(-γ (d - θ)))
  - l(d) → 0 when d → -∞
  - l(d) → 1 when d → +∞
  - θ: the loss switches from 0 to 1 near d = θ
  - γ: determines the slope at the switching point
- Overall Classification Performance Measure: the total loss over the training set, L(Λ) = Σi ΣX∈Ci l(di(X)) (see the sketch below)
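A small sketch of the quantities just defined: the class scores gi, the misclassification measure di with the soft-max over competing classes, and the sigmoid loss. The score values and the constants η, γ, θ are illustrative assumptions.

```python
import numpy as np

def misclassification_measure(g, i, eta=2.0):
    """d_i(X): correct-class score vs. a soft-max over the competing classes."""
    g = np.asarray(g, dtype=float)
    competitors = np.delete(g, i)
    anti = np.log(np.mean(np.exp(eta * competitors))) / eta
    return -g[i] + anti

def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """l(d): ~0 for very negative d (correct), ~1 for very positive d (error)."""
    return 1.0 / (1.0 + np.exp(-gamma * (d - theta)))

g = [-12.3, -15.1, -14.0]           # g_i(X, Lambda) for M = 3 classes
d = misclassification_measure(g, i=0)
print(d, sigmoid_loss(d))           # d < 0 here, so the loss is small
```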
10. Minimum-Classification-Error (MCE) Training
- Find Λ such that the overall performance measure L(Λ) is minimized
  - the above objective function is in general difficult to minimize directly
  - a local minimum can be obtained iteratively using the gradient (steepest) descent algorithm (see the sketch below)
  - every training observation may change the parameters of ALL models, not only the model for its own class
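A sketch of one gradient-descent step on the MCE loss for a toy parameterization in which each class i is a unit-variance Gaussian with mean mu[i], so gi(X) = -0.5 ||X - mu[i]||² up to a constant; the gradient is taken by finite differences just to keep the sketch short, and the data, means, and learning rate are illustrative. Note how a single training token moves the means of all classes.

```python
import numpy as np

def g(X, mu):                        # class scores g_i(X, Lambda)
    return -0.5 * np.sum((X - mu) ** 2, axis=1)

def mce_loss(X, label, mu, eta=2.0, gamma=1.0):
    scores = g(X, mu)
    comp = np.delete(scores, label)
    d = -scores[label] + np.log(np.mean(np.exp(eta * comp))) / eta
    return 1.0 / (1.0 + np.exp(-gamma * d))

def sgd_step(X, label, mu, lr=0.1, eps=1e-5):
    grad = np.zeros_like(mu)
    for idx in np.ndindex(mu.shape):          # finite-difference gradient
        mu_p = mu.copy(); mu_p[idx] += eps
        mu_m = mu.copy(); mu_m[idx] -= eps
        grad[idx] = (mce_loss(X, label, mu_p) - mce_loss(X, label, mu_m)) / (2 * eps)
    return mu - lr * grad                     # move against the gradient

mu = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])   # 3 class means
X, label = np.array([0.5, 0.3]), 0
mu = sgd_step(X, label, mu)
print(mu)     # all three rows change, not just the row for class 0
```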
11. Discriminative Training for Large-Vocabulary Speech Recognition
- Minimum Bayes Risk (MBR): adjust all model parameters to minimize the Bayes risk
  - Λ = {Λi}, i = 1, 2, ..., N: acoustic models
  - Γ: language model parameters
  - Or: the r-th training utterance
  - sr: the correct transcription of Or
  - Bayes risk: F(Λ, Γ) = Σr Σu PΛ,Γ(u|Or) L(u, sr) → min
    - u: a possible recognition output
    - L(u, sr): loss function
    - PΛ,Γ(u|Or): posterior probability of u given Or based on (Λ, Γ)
  - Other definitions of L(u, sr) are possible
- Minimum Phone Error (MPE) Training
  - L(u, sr) = -Acc(u, sr), where Acc(u, sr) is the phone accuracy of u against sr, so minimizing the risk maximizes the expected phone accuracy (see the sketch below)
- Better features are obtainable in the same way
  - e.g. yt = xt + M ht: feature-space MPE (fMPE)
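A sketch of the MBR/MPE objective for one training utterance using an assumed N-best list of hypotheses with posteriors; Acc(u, sr) is approximated here with a simple alignment from Python's difflib, which is only a stand-in for the exact phone-accuracy computation used in MPE, and the reference and hypotheses are illustrative.

```python
from difflib import SequenceMatcher

def phone_accuracy(u, ref):
    """Approximate Acc(u, s_r): number of matched phones in an alignment."""
    return sum(b.size for b in SequenceMatcher(None, u, ref).get_matching_blocks())

ref = ["sil", "b", "a", "t", "sil"]                 # correct transcription s_r
hyps = [(["sil", "b", "a", "t", "sil"], 0.5),       # (hypothesis u, P(u | O_r))
        (["sil", "p", "a", "t", "sil"], 0.3),
        (["sil", "b", "a", "d", "sil"], 0.2)]

# Expected phone accuracy = sum_u P(u | O_r) * Acc(u, s_r); MPE training
# adjusts the acoustic model to increase this quantity, i.e. to decrease
# the Bayes risk with L(u, s_r) = -Acc(u, s_r).
expected_acc = sum(p * phone_accuracy(u, ref) for u, p in hyps)
print(expected_acc)
```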