1
Discriminative Models for Speech Recognition
  • M.J.F. Gales
  • Cambridge University Engineering Department
  • 2007

2
Outline
  • Introduction
  • Hidden Markov Models
  • Discriminative Training Criteria
  • Maximum Mutual Information
  • Minimum Classification Error
  • Minimum Bayes Risk
  • Techniques to improve generalization
  • Large Margin HMMs
  • Maximum Entropy Markov Models
  • Conditional Random Field
  • Dynamic Kernels
  • Conditional Augmented Models
  • Conclusions

3
Automatic Speech Recognition
  • The task of speech recognition is to determine
    the identity of a given observation sequence by
    assigning the recognized word sequence to it
  • The decision is to find the identity with maximum
    a posteriori (MAP) probability
  • This is the so-called Bayes decision (or
    minimum-error-rate) rule
  • A certain parametric representation of these
    distributions is needed
  • HMMs are widely adopted for acoustic modeling

(Equation omitted in the transcript; the slide labels its component
distributions as Multinomial and Gaussian)
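For reference, the MAP decision rule described above can be written in
its standard form (a reconstruction; the slide's own equation is not
preserved in this transcript):

    \hat{w} = \arg\max_{w} P(w \mid O)
            = \arg\max_{w} \frac{p(O \mid w)\, P(w)}{p(O)}
            = \arg\max_{w} p(O \mid w)\, P(w)

Here P(w) is the (multinomial) language-model prior and p(O \mid w) is
the acoustic likelihood, typically modeled with Gaussian-mixture HMMs.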
4
Acoustic Modeling (1/2)
  • In the development of an ASR system, acoustic
    modeling is an indispensable ingredient
  • The purpose of acoustic modeling is to provide a
    method for calculating the likelihood of a speech
    utterance given a word sequence
  • In principle, the word sequence can be decomposed
    into a sequence of phone-like units (acoustic
    models)
  • Each of these is normally represented by an HMM,
    whose parameters can be estimated from a corpus
    of training utterances
  • Traditionally, maximum likelihood (ML) training
    is employed for this estimation

5
Acoustic Modeling (2/2)
  • Besides ML training, the acoustic model can
    alternatively be trained with discriminative
    training criteria
  • MCE training, MMI training, MPE training, etc.
  • In MCE training, an approximation to the error
    rate on the training data is optimized
  • The MMI and MPE algorithms were developed in an
    attempt to correctly discriminate among the
    recognition hypotheses for the best recognition
    results
  • However...
  • The underlying acoustic model is still
    generative, with the associated constraints on
    the state and transition probability
    distributions
  • Classification is still based on the Bayes
    decision rule

6
Introduction
  • Initially these discriminative criteria were
    applied to small vocabulary speech recognition
    tasks
  • A number of techniques were then developed to
    enable their use for LVCSR tasks
  • I-smoothing
  • Language model weakening
  • The use of lattices to compactly represent the
    denominator score
  • But the performance on LVCSR tasks is still not
    satisfactory for many speech-enabled applications
  • This has led to interest in discriminative (or
    direct) models for speech recognition, where the
    posterior of the word sequence given the
    observation sequence, P(w|O), is directly modeled

7
Hidden Markov Models
  • HMMs are the standard acoustic model used in
    speech recognition
  • The likelihood function is given in the sketch
    below
  • The standard training of HMMs is based on maximum
    likelihood (ML) estimation
  • This optimization is normally performed using
    Expectation Maximization (EM)
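A standard form of the HMM likelihood and the ML training criterion, in
notation commonly used in this literature (a reconstruction, not
necessarily the slide's exact equations):

    p(O \mid w; \lambda) = \sum_{\theta \in \Theta_{w}}
        \prod_{t=1}^{T} P(\theta_t \mid \theta_{t-1})\, p(o_t \mid \theta_t)

    \mathcal{F}_{\mathrm{ML}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R}
        \log p(O^{(r)} \mid w_{\mathrm{ref}}^{(r)}; \lambda)

where \Theta_w is the set of state sequences consistent with the word
sequence w, and R is the number of training utterances.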

8
Discriminative Training Criteria
  • The discriminative training criteria are more
    closely linked to minimizing the error rate,
    rather than maximizing the likelihood of
    generating the training data
  • Three main forms of discriminative training have
    been examined
  • Maximum Mutual Information (MMI)
  • Minimum Classification Error (MCE)
  • Minimum Bayes Risk (MBR)
  • Minimum Phone Error (MPE)

9
Discriminative Training Criteria
  • Maximum Mutual Information
  • Maximizes the mutual information between the
    observed sequences and the models
  • Minimum Classification Error
  • Based on a smooth function of the difference
    between the log-likelihood of the correct word
    sequence and that of all competing word sequences
    (standard forms of both criteria are sketched
    below)
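A sketch of standard forms of the MMI and MCE criteria (reconstructions;
the slide's own equations are not reproduced in the transcript):

    \mathcal{F}_{\mathrm{MMI}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R}
        \log \frac{p(O^{(r)} \mid w_{\mathrm{ref}}^{(r)}; \lambda)\,
        P(w_{\mathrm{ref}}^{(r)})}
        {\sum_{w} p(O^{(r)} \mid w; \lambda)\, P(w)}

    \mathcal{F}_{\mathrm{MCE}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R}
        \left( 1 + \left[ \frac{p(O^{(r)} \mid w_{\mathrm{ref}}^{(r)};
        \lambda)\, P(w_{\mathrm{ref}}^{(r)})}
        {\sum_{w \neq w_{\mathrm{ref}}^{(r)}} p(O^{(r)} \mid w; \lambda)\,
        P(w)} \right]^{\varrho} \right)^{-1}

where \varrho is a smoothing constant. The MMI criterion is maximized,
while the MCE criterion acts as a smooth approximation to the
training-set error rate and is minimized.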

10
Discriminative Training Criteria
  • Minimum Bayes Risk
  • Rather than trying to model the correct
    distribution, the expected loss during inference
    is minimized (see the sketch below)
  • A number of loss functions may be used
  • 1/0 function
  • equivalent to a sentence-level loss function
  • Word
  • a loss function directly related to minimizing
    the expected Word Error Rate (WER)
  • Phone
  • the loss function used in Minimum Phone Error
    (MPE) training
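A standard form of the MBR criterion (a reconstruction):

    \mathcal{F}_{\mathrm{MBR}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R}
        \sum_{w} P(w \mid O^{(r)}; \lambda)\,
        \mathcal{L}(w, w_{\mathrm{ref}}^{(r)})

where \mathcal{L}(w, w_{\mathrm{ref}}) is the chosen loss between a
hypothesis and the reference (sentence-level 0/1, word-level, or
phone-level).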

11
Large Margin HMMs
  • The simplest form of large-margin training
    criterion [Li et al. 2005] can be expressed as
    maximizing the margin sketched below
  • This aims to maximize the minimum distance
    between the log-posterior of the correct label
    and that of all incorrect labels
  • It has properties related to both the MMI and MCE
    criteria
  • A log-posterior cost function is used, as in the
    MMI criterion
  • The denominator term used with this approach does
    not include an element from the correct label, in
    a similar fashion to the MCE criterion
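A sketch of this margin-based criterion in a standard form (a
reconstruction, not necessarily the slide's notation):

    \mathcal{F}_{\mathrm{LM}}(\lambda) = \min_{r} \left\{
        \log P(w_{\mathrm{ref}}^{(r)} \mid O^{(r)}; \lambda)
        - \max_{w \neq w_{\mathrm{ref}}^{(r)}}
        \log P(w \mid O^{(r)}; \lambda) \right\}

i.e. the smallest, over all training utterances, of the gap between the
log-posterior of the correct transcription and that of its closest
competitor.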

12
Large Margin HMMs
  • A couple of variants of large-margin training
    have been proposed
  • Soft margin training [Jinyu Li et al. 2006]
  • Large margin GMMs [F. Sha and L.K. Saul 2007]
  • In these, the size of the margin is specified in
    terms of a loss function between the two sets of
    sequences
13
Direct Models
  • Direct modeling attempts to model the posterior
    probability directly
  • There are many potential advantages as well as
    challenges for direct modeling
  • The direct model can potentially make decoding
    simpler
  • The direct model allows for the potential
    combination of multiple sources of data in a
    unified fashion
  • Asynchronous and overlapping features can be
    incorporated formally
  • It will be possible to take advantage of
    supra-segmental features such as prosodic
    features, acoustic-phonetic features, speaker
    style, rate of speech, and channel differences
  • However, joint estimation would require a large
    amount of parallel speech and text data (a
    challenge for data collection)

14
Direct Models
  • The relationship between observations and states
    is reversed
  • Separate transition and observation probabilities
    are replaced with one function (see the sketch
    below)
  • Modeling the posterior directly makes direct
    computation of P(w|O) possible
  • The model can also be conditioned flexibly on a
    variety of contextual features
  • Any computable property of the observation
    sequence can be used as a feature
  • The number of features at each time frame need
    not be the same

15
Maximum Entropy Markov Models
  • Recently, McCallum et al. (ICML 2000) modeled
    sequential processes using a direct model similar
    to the HMM in graphical structure and used
    exponential models for transition-observation
    probabilities
  • Called Maximum Entropy Markov Model (MEMM)
  • Maximum Entropy modeling is used to model the
    conditional distributions
  • ME modeling is based on the principle of avoiding
    unnecessary assumptions
  • The principle states that the modeled probability
    distribution should be consistent with the given
    collection of facts about itself and otherwise be
    as uniform as possible

16
Maximum Entropy Markov Models
  • The mathematical interpretation of this principle
    results in a constrained optimization problem
  • Maximize the entropy of the conditional
    distribution P(s|o), subject to the given
    constraints
  • The constraints represent the known facts about
    the model, derived from statistics of the
    training data

Definition 1 and Definition 2 (equations omitted in the transcript)
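The two definitions referred to on this slide are most likely the
standard empirical and model feature expectations of maximum-entropy
modeling; a sketch under that assumption:

    \tilde{E}[f_i] = \sum_{o, s} \tilde{P}(o, s)\, f_i(o, s)
        \quad \text{(empirical expectation of feature } f_i\text{)}

    E[f_i] = \sum_{o, s} \tilde{P}(o)\, P(s \mid o)\, f_i(o, s)
        \quad \text{(model expectation of feature } f_i\text{)}

The constraints then require E[f_i] = \tilde{E}[f_i] for every feature
f_i.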
17
Maximum Entropy Markov Models
  • These definitions allow us to introduce the
    constraints of the model
  • The expected value of each feature with respect
    to the model is constrained to equal its
    empirical expectation
  • Using Lagrange multipliers for constrained
    optimization, the desired probability
    distribution is given by the maximum of the
    Lagrangian sketched below
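A sketch of the resulting Lagrangian in its standard form (a
reconstruction; the multipliers \lambda_i correspond to the feature
constraints):

    \Lambda(P, \boldsymbol{\lambda}) = H(P) + \sum_{i} \lambda_i
        \left( E[f_i] - \tilde{E}[f_i] \right),
    \qquad
    H(P) = -\sum_{o, s} \tilde{P}(o)\, P(s \mid o)\, \log P(s \mid o)

Maximizing \Lambda over P yields the exponential form given on the next
slide.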

18
Maximum Entropy Markov Models
  • Finally, the solution of the objective function
    is given by an exponential model (sketched below)
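The standard maximum-entropy solution has the exponential (log-linear)
form (a reconstruction of the result the slide states):

    P(s \mid o) = \frac{1}{Z(o)}
        \exp\!\left( \sum_{i} \lambda_i f_i(o, s) \right),
    \qquad
    Z(o) = \sum_{s'} \exp\!\left( \sum_{i} \lambda_i f_i(o, s') \right)

where Z(o) is the per-observation normalization term and the weights
\lambda_i are the Lagrange multipliers.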

19
Reference
  • [SAP06] Jeff Kuo and Yuqing Gao, "Maximum Entropy
    Direct Models for Speech Recognition," IEEE
    Transactions on Audio, Speech, and Language
    Processing, 2006