1
  • 14.0 Some Fundamental Problem-solving Approaches

References
  1. Sections 4.3.2 and 4.4.2 of Huang, or Sections 9.1-9.3 of Jelinek
  2. Section 6.4.3 of Rabiner and Juang
  3. Section 9.8 of Gold and Morgan, Speech and Audio Signal Processing
  4. http://www.stanford.edu/class/cs229/materials.html
  5. "Minimum Classification Error Rate Methods for Speech Recognition," IEEE Trans. Speech and Audio Processing, May 1997
  6. "Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition," IEEE Trans. Speech and Audio Processing, 2004
  7. "Minimum Phone Error and I-smoothing for Improved Discriminative Training," International Conference on Acoustics, Speech and Signal Processing, 2002
2
EM (Expectation and Maximization) Algorithm
  • Goal: estimating the parameters of some probabilistic models based on some criteria
  • Parameter Estimation Principles: given some observations X = {x1, x2, ..., xN}
  • Maximum Likelihood (ML) Principle: find the model parameter set θ such that the likelihood function is maximized, P(X|θ) → max
  • For example, if θ = (μ, σ) are the parameters of a normal distribution and X is i.i.d., then the ML estimate of (μ, σ) is μ = (1/N) Σi xi, σ² = (1/N) Σi (xi − μ)², i.e. the sample mean and sample variance (see the sketch below)
  • Maximum A Posteriori (MAP) Principle: find the model parameter set θ such that the a posteriori probability is maximized
  • i.e. P(θ|X) = P(X|θ) P(θ) / P(X) → max
  • ⇔ P(X|θ) P(θ) → max
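As a concrete illustration of the ML principle above, here is a minimal sketch (assuming i.i.d. Gaussian data generated with NumPy; the data and values are made up): the closed-form ML estimates are simply the sample mean and the 1/N sample variance.

```python
import numpy as np

# Hypothetical i.i.d. observations X = {x1, ..., xN} drawn from N(mu, sigma^2)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1000)

# ML estimates for a Gaussian: the (mu, sigma^2) maximizing P(X | mu, sigma^2)
# are the sample mean and the (biased, 1/N) sample variance
mu_ml = X.mean()
sigma2_ml = ((X - mu_ml) ** 2).mean()

print(f"ML estimate: mu = {mu_ml:.3f}, sigma^2 = {sigma2_ml:.3f}")
```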

3
EM (Expectation and Maximization) Algorithm
  • Why EM?
  • In some cases the evaluation of the objective function (e.g. the likelihood function) depends on some intermediate variables (latent data) which are not observable (e.g. the state sequence for HMM parameter training)
  • direct estimation of the desired parameters without such latent data is impossible or difficult, e.g. it is almost impossible to estimate (A, B, π) for an HMM without considering the state sequence
  • Iterative Procedure with Two Steps in Each Iteration
  • E (Expectation): take the expectation with respect to the possible distribution (values and probabilities) of the latent data, based on the current estimates of the desired parameters and conditioned on the given observations
  • M (Maximization): generate a new set of estimates of the desired parameters by maximizing the objective function (e.g. according to ML or MAP)
  • the objective function increases after each iteration and eventually converges

4
EM Algorithm: An Example
  • First, randomly assign θ(0) = {P(0)(A), P(0)(B), P(0)(R|A), P(0)(G|A), P(0)(R|B), P(0)(G|B)}, for example P(0)(A) = 0.4, P(0)(B) = 0.6, P(0)(R|A) = 0.5, P(0)(G|A) = 0.5, P(0)(R|B) = 0.5, P(0)(G|B) = 0.5
  • Expectation Step: find the expectation of log P(O|θ) over the 8 possible state sequences qi: AAA, BBB, AAB, BBA, ABA, BAB, ABB, BAA
  • Maximization Step: find θ(1) to maximize the expectation function Eq[log P(O|θ)]
  • Iterations: θ(0) → θ(1) → θ(2) → ... (a minimal numerical sketch of this example follows below)

For example, when qi = AAB
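A minimal numerical sketch of this urn example, assuming the standard mixture-model EM updates and a made-up observation sequence O = (R, G, R); it enumerates all 2³ = 8 urn sequences exactly as the slide suggests.

```python
import itertools
import numpy as np

# Hypothetical observation sequence for the urn example on this slide:
# each draw picks urn A or B with P(A)/P(B), then a Red or Green ball.
O = ['R', 'G', 'R']

# Initial guess theta(0), as given on the slide
P_urn  = {'A': 0.4, 'B': 0.6}
P_ball = {'A': {'R': 0.5, 'G': 0.5}, 'B': {'R': 0.5, 'G': 0.5}}

for _ in range(10):
    # E-step: posterior P(q | O, theta(k)) over all 2^3 = 8 urn sequences q
    seqs = list(itertools.product('AB', repeat=len(O)))
    joint = np.array([np.prod([P_urn[s] * P_ball[s][o] for s, o in zip(q, O)])
                      for q in seqs])
    post = joint / joint.sum()

    # M-step: re-estimate theta(k+1) from expected counts under the posterior,
    # which maximizes E_q[log P(O, q | theta)]
    cnt_urn  = {'A': 0.0, 'B': 0.0}
    cnt_ball = {'A': {'R': 0.0, 'G': 0.0}, 'B': {'R': 0.0, 'G': 0.0}}
    for w, q in zip(post, seqs):
        for s, o in zip(q, O):
            cnt_urn[s] += w
            cnt_ball[s][o] += w
    P_urn  = {s: cnt_urn[s] / len(O) for s in 'AB'}
    P_ball = {s: {o: cnt_ball[s][o] / cnt_urn[s] for o in 'RG'} for s in 'AB'}

print(P_urn, P_ball)
```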
5
EM Algorithm
  • In Each Iteration (assuming log P(x|θ) is the objective function)
  • E step: express the log-likelihood log P(x|θ) in terms of the distribution of the latent data conditioned on (x, θ(k))
  • M step: find a way to maximize the above function, such that it increases monotonically, i.e. log P(x|θ(k+1)) ≥ log P(x|θ(k))
  • The Conditions for Each Iteration to Proceed Based on the Criterion
  • x: observed (incomplete) data, z: latent data, (x, z): complete data

6
EM Algorithm
  • For the EM Iterations to Proceed Based on the Criterion
  • to make sure log P(x|θ(k+1)) ≥ log P(x|θ(k))
  • H(θ(k+1), θ(k)) ≤ H(θ(k), θ(k)) due to Jensen's Inequality
  • the only requirement is to have θ(k+1) such that Q(θ(k+1), θ(k)) − Q(θ(k), θ(k)) ≥ 0
  • E-step: estimate Q(θ, θ(k)) (the auxiliary function, or Q-function): the expectation of the objective function in terms of the distribution of the latent data conditioned on (x, θ(k))
  • M-step: θ(k+1) = arg maxθ Q(θ, θ(k)) (a compact form of this argument is sketched below)
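A compact sketch of the argument behind these bullets, written in the usual Q/H notation (x: observed data, z: latent data); this follows the standard EM convergence argument as in refs. 1-2.

```latex
\begin{align*}
\log P(x \mid \theta)
  &= Q(\theta, \theta^{(k)}) - H(\theta, \theta^{(k)}), \quad\text{where}\\
Q(\theta, \theta^{(k)})
  &= \sum_{z} P(z \mid x, \theta^{(k)}) \log P(x, z \mid \theta),\\
H(\theta, \theta^{(k)})
  &= \sum_{z} P(z \mid x, \theta^{(k)}) \log P(z \mid x, \theta).
\end{align*}
% Jensen's inequality gives H(\theta, \theta^{(k)}) \le H(\theta^{(k)}, \theta^{(k)}) for all \theta,
% so any \theta^{(k+1)} with Q(\theta^{(k+1)}, \theta^{(k)}) - Q(\theta^{(k)}, \theta^{(k)}) \ge 0
% satisfies \log P(x \mid \theta^{(k+1)}) \ge \log P(x \mid \theta^{(k)}).
```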
7
Example: Use of the EM Algorithm in Solving Problem 3 of HMM
  • Observed data: observations O; latent data: state sequence q
  • The probability of the complete data is P(O, q|θ) = P(O|q, θ) P(q|θ)
  • E-Step: Q(θ, θ(k)) = E[log P(O, q|θ) | O, θ(k)] = Σq P(q|O, θ(k)) log P(O, q|θ)
  • θ(k): k-th estimate of θ (known); θ: unknown parameter set to be estimated
  • M-Step:
  • Find θ(k+1) such that θ(k+1) = arg maxθ Q(θ, θ(k))
  • Given the various constraints (e.g. the probabilities in each distribution summing to 1), it can be shown that the above maximization leads to the formulas obtained previously (sketched below)
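For reference, a sketch of what "the formulas obtained previously" look like in the usual Baum-Welch notation, assuming a discrete-observation HMM with state posteriors γ_t(i) = P(q_t = i | O, θ(k)) and pair posteriors ε_t(i,j) = P(q_t = i, q_{t+1} = j | O, θ(k)) computed in the E-step (refs. 1-2).

```latex
\begin{align*}
\bar{\pi}_i  &= \gamma_1(i),\\
\bar{a}_{ij} &= \frac{\sum_{t=1}^{T-1} \varepsilon_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},\\
\bar{b}_j(k) &= \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.
\end{align*}
```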

8
Minimum-Classification-Error (MCE) and Discriminative Training
  • General Objective: find an optimal set of parameters (e.g. for recognition models) to minimize the expected error of classification
  • the statistics of the test data may be quite different from those of the training data
  • training data is never enough
  • Assume the recognizer operates with the following classification principles: Ci, i = 1, 2, ..., M: the M classes
  • λ(i): statistical model for Ci
  • Λ = {λ(i), i = 1, ..., M}: the set of all models for all classes
  • X: observations; gi(X, Λ): class-conditioned likelihood function, for example gi(X, Λ) = P(X|λ(i))
  • C(X) = Ci if gi(X, Λ) = maxj gj(X, Λ) (classification principle)
  • an error happens when P(X|λ(i)) is maximum but X ∉ Ci
  • Conventional Training Criterion: find λ(i) such that P(X|λ(i)) is maximum (Maximum Likelihood) if X ∈ Ci
  • This does not always lead to minimum classification error, since it does not consider the mutual relationship among competing classes
  • The competing classes may give a higher likelihood for the test data

9
Minimum-Classification-Error (MCE) Training
  • One form of the misclassification measure
  • a comparison between the likelihood functions of the correct class and the competing classes
  • A continuous loss function l(d) is defined:
  • l(d) → 0 when d → −∞
  • l(d) → 1 when d → +∞
  • θ: the point near which l(d) switches from 0 to 1
  • γ: determines the slope at the switching point
  • Overall Classification Performance Measure: the total smoothed loss over the training set (one standard form is sketched below)
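Since the formulas themselves do not survive in this transcript, here is one standard form from the MCE literature [5], with gi taken as a log-likelihood-type discriminant; the exact parameterization on the original slide may differ.

```latex
\begin{align*}
d_i(X; \Lambda) &= -\,g_i(X; \Lambda)
   + \frac{1}{\eta}\log\Bigl[\frac{1}{M-1}\sum_{j \ne i} \exp\bigl(\eta\, g_j(X; \Lambda)\bigr)\Bigr],\\
\ell\bigl(d_i(X; \Lambda)\bigr) &= \frac{1}{1 + \exp\bigl(-\gamma\,(d_i(X; \Lambda) - \theta)\bigr)},\\
L(\Lambda) &= \sum_{i=1}^{M} \sum_{X \in C_i} \ell\bigl(d_i(X; \Lambda)\bigr)
   \quad\text{(overall classification performance measure)}.
\end{align*}
```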

10
Minimum-Classification-Error (MCE) Training
  • Find Λ such that the overall classification performance measure above is minimized
  • the above objective function is in general difficult to minimize directly
  • a local minimum can be obtained iteratively using a gradient (steepest) descent algorithm (a toy one-step sketch follows below)
  • every training observation may change the parameters of ALL models, not only the model for its own class
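A toy sketch of one steepest-descent MCE step, under assumed toy models (each class represented by a single spherical-Gaussian mean) and the sigmoid loss sketched on the previous slide; note that the gradient touches every class model, not only the correct one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mce_step(m, x, label, gamma=2.0, theta=0.0, eta=1.0, lr=0.1):
    """One steepest-descent MCE update for a single training observation x.

    m: (M, D) array of class means (toy models), updated in place.
    """
    M = m.shape[0]
    g = -0.5 * ((x - m) ** 2).sum(axis=1)     # g_j(x): log-likelihood up to a constant
    others = np.delete(np.arange(M), label)
    # misclassification measure: d = -g_correct + soft-max over competing classes
    lse = np.log(np.mean(np.exp(eta * g[others]))) / eta
    d = -g[label] + lse
    loss = sigmoid(gamma * (d - theta))       # smooth 0/1 loss l(d)
    dloss_dd = gamma * loss * (1.0 - loss)
    # d depends on g_j of ALL classes, so every model receives a gradient
    dd_dg = np.zeros(M)
    dd_dg[label] = -1.0
    w = np.exp(eta * g[others])
    dd_dg[others] = w / w.sum()
    dg_dm = x - m                             # d g_j / d m_j for each class j
    m -= lr * (dloss_dd * dd_dg)[:, None] * dg_dm
    return loss

# Hypothetical usage: 3 classes, 2-dimensional features
rng = np.random.default_rng(1)
means = rng.normal(size=(3, 2))
print("smoothed loss:", mce_step(means, np.array([0.5, -0.2]), label=0))
```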

11
Discriminative Training for Large-Vocabulary Speech Recognition
  • Minimum Bayes Risk (MBR): adjusting all model parameters to minimize the Bayes risk
  • Λ = {λi, i = 1, 2, ..., N}: acoustic models
  • Γ: language model parameters
  • Or: r-th training utterance
  • sr: correct transcription of Or
  • Bayes Risk: the expected loss of the recognition output under the posterior PΛ,Γ(u|Or), summed over the training utterances (see the sketch below)
  • u: a possible recognition output
  • L(u, sr): loss function
  • PΛ,Γ(u|Or): posterior probability of u given Or based on Λ, Γ
  • Other definitions of L(u, sr) are possible
  • Minimum Phone Error (MPE) Training
  • Acc(u, sr): phone accuracy
  • Better features are obtainable in the same way
  • e.g. yt = xt + M ht (feature-space MPE)
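A sketch of the objectives described on this slide, written in the usual notation of refs. 6-7; the exact loss definitions on the original slide may differ slightly, and ht is an assumed high-dimensional auxiliary feature vector with M a matrix trained under the same criterion.

```latex
\begin{align*}
\text{MBR:}\quad
  &\{\Lambda, \Gamma\}^{*} = \arg\min_{\Lambda, \Gamma}
     \sum_{r} \sum_{u} P_{\Lambda,\Gamma}(u \mid O_r)\, L(u, s_r),\\
\text{MPE:}\quad
  &L(u, s_r) = -\,\mathrm{Acc}(u, s_r)
   \;\Longleftrightarrow\;
   \max_{\Lambda,\Gamma} \sum_{r} \sum_{u} P_{\Lambda,\Gamma}(u \mid O_r)\, \mathrm{Acc}(u, s_r),\\
\text{fMPE:}\quad
  &y_t = x_t + M h_t
   \quad\text{(feature-space MPE: the projection } M \text{ is trained with the MPE criterion)}.
\end{align*}
```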