1
Discriminative Models for Speech Recognition
  • M.J.F. Gales
  • Cambridge University Engineering Department
  • 2007

2
Outline
  • Introduction
  • Hidden Markov Models
  • Discriminative Training Criteria
  • Maximum Mutual Information
  • Minimum Classification Error
  • Minimum Bayes Risk
  • Techniques to improve generalization
  • Large Margin HMMs
  • Maximum Entropy Markov Models
  • Conditional Random Field
  • Dynamic Kernels
  • Conditional Augmented Models
  • Conclusions

3
Automatic Speech Recognition
  • The task of speech recognition is to determine
    the identity of a given observation sequence by
    assigning the recognized word sequence to it
  • The decision is to find the identity with maximum
    a posteriori (MAP) probability
  • This is the so-called Bayes decision (or
    minimum-error-rate) rule
  • A certain parametric representation of these
    distributions is needed
  • HMMs are widely adopted for acoustic modeling

(Equation omitted in the transcript; the slide labels its component
distributions as Multinomial and Gaussian)
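For reference, the MAP decision rule described above can be written in
its standard form (a reconstruction; the slide's own equation is not
preserved in this transcript):

    \hat{w} = \arg\max_{w} P(w \mid O)
            = \arg\max_{w} \frac{p(O \mid w)\, P(w)}{p(O)}
            = \arg\max_{w} p(O \mid w)\, P(w)

Here P(w) is the (multinomial) language-model prior and p(O \mid w) is
the acoustic likelihood, typically modeled with Gaussian-mixture HMMs.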
4
Acoustic Modeling (1/2)
  • In the development of an ASR system, acoustic
    modeling is an indispensable ingredient
  • The purpose of acoustic modeling is to provide a
    method for calculating the likelihood of a speech
    utterance given a word sequence
  • In principle, the word sequence can be decomposed
    into a sequence of phone-like units (acoustic
    models)
  • Each of these is normally represented by an HMM,
    whose parameters can be estimated from a corpus
    of training utterances
  • Traditionally, maximum likelihood (ML) training
    is employed for this estimation

5
Acoustic Modeling (2/2)
  • Besides ML training, the acoustic model can
    alternatively be trained with discriminative
    training criteria
  • MCE training, MMI training, MPE training, etc.
  • In MCE training, an approximation to the error
    rate on the training data is optimized
  • The MMI and MPE algorithms were developed in an
    attempt to correctly discriminate among the
    recognition hypotheses for the best recognition
    results
  • However...
  • The underlying acoustic model is still
    generative, with the associated constraints on
    the state and transition probability
    distributions
  • Classification is still based on the Bayes
    decision rule

6
Introduction
  • Initially these discriminative criteria were
    applied to small vocabulary speech recognition
    tasks
  • A number of techniques were then developed to
    enable their use for LVCSR tasks
  • I-smoothing
  • Language model weakening
  • The use of lattices to compactly represent the
    denominator score
  • But the performance on LVCSR tasks is still not
    satisfactory for many speech-enabled applications
  • This has led to interest in discriminative (or
    direct) models for speech recognition, where the
    posterior of the word sequence given the
    observation sequence, P(w|O), is directly modeled

7
Hidden Markov Models
  • HMMs are the standard acoustic model used in
    speech recognition
  • The likelihood function is given in the sketch
    below
  • The standard training of HMMs is based on maximum
    likelihood (ML) estimation
  • This optimization is normally performed using
    Expectation Maximization (EM)
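A standard form of the HMM likelihood and the ML training criterion, in
notation commonly used in this literature (a reconstruction, not
necessarily the slide's exact equations):

    p(O \mid w; \lambda) = \sum_{\theta \in \Theta_{w}}
        \prod_{t=1}^{T} P(\theta_t \mid \theta_{t-1})\, p(o_t \mid \theta_t)

    \mathcal{F}_{\mathrm{ML}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R}
        \log p(O^{(r)} \mid w_{\mathrm{ref}}^{(r)}; \lambda)

where \Theta_w is the set of state sequences consistent with the word
sequence w, and R is the number of training utterances.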

8
Discriminative Training Criteria
  • The discriminative training criteria are more
    closely linked to minimizing the error rate,
    rather than maximizing the likelihood of
    generating the training data
  • Three main forms of discriminative training have
    been examined
  • Maximum Mutual Information (MMI)
  • Minimum Classification Error (MCE)
  • Minimum Bayes Risk (MBR)
  • Minimum Phone Error (MPE)

9
Discriminative Training Criteria
  • Maximum Mutual Information
  • Maximizes the mutual information between the
    observed sequences and the models
  • Minimum Classification Error
  • Based on a smooth function of the difference
    between the log-likelihood of the correct word
    sequence and that of all competing word sequences
    (standard forms of both criteria are sketched
    below)
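A sketch of standard forms of the MMI and MCE criteria (reconstructions;
the slide's own equations are not reproduced in the transcript):

    \mathcal{F}_{\mathrm{MMI}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R}
        \log \frac{p(O^{(r)} \mid w_{\mathrm{ref}}^{(r)}; \lambda)\,
        P(w_{\mathrm{ref}}^{(r)})}
        {\sum_{w} p(O^{(r)} \mid w; \lambda)\, P(w)}

    \mathcal{F}_{\mathrm{MCE}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R}
        \left( 1 + \left[ \frac{p(O^{(r)} \mid w_{\mathrm{ref}}^{(r)};
        \lambda)\, P(w_{\mathrm{ref}}^{(r)})}
        {\sum_{w \neq w_{\mathrm{ref}}^{(r)}} p(O^{(r)} \mid w; \lambda)\,
        P(w)} \right]^{\varrho} \right)^{-1}

where \varrho is a smoothing constant. The MMI criterion is maximized,
while the MCE criterion acts as a smooth approximation to the
training-set error rate and is minimized.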

10
Discriminative Training Criteria
  • Minimum Bayes Risk
  • Rather than trying to model the correct
    distribution, the expected loss during inference
    is minimized (see the sketch below)
  • A number of loss functions may be used
  • 1/0 function
  • equivalent to a sentence-level loss function
  • Word
  • a loss function directly related to minimizing
    the expected Word Error Rate (WER)
  • Phone
  • the loss function used in Minimum Phone Error
    (MPE) training
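A standard form of the MBR criterion (a reconstruction):

    \mathcal{F}_{\mathrm{MBR}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R}
        \sum_{w} P(w \mid O^{(r)}; \lambda)\,
        \mathcal{L}(w, w_{\mathrm{ref}}^{(r)})

where \mathcal{L}(w, w_{\mathrm{ref}}) is the chosen loss between a
hypothesis and the reference (sentence-level 0/1, word-level, or
phone-level).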

11
Large Margin HMMs
  • The simplest form of large-margin training
    criterion [Li et al. 2005] can be expressed as
    maximizing the margin sketched below
  • This aims to maximize the minimum distance
    between the log-posterior of the correct label
    and that of all incorrect labels
  • It has properties related to both the MMI and MCE
    criteria
  • A log-posterior cost function is used, as in the
    MMI criterion
  • The denominator term used with this approach does
    not include an element from the correct label, in
    a similar fashion to the MCE criterion
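A sketch of this margin-based criterion in a standard form (a
reconstruction, not necessarily the slide's notation):

    \mathcal{F}_{\mathrm{LM}}(\lambda) = \min_{r} \left\{
        \log P(w_{\mathrm{ref}}^{(r)} \mid O^{(r)}; \lambda)
        - \max_{w \neq w_{\mathrm{ref}}^{(r)}}
        \log P(w \mid O^{(r)}; \lambda) \right\}

i.e. the smallest, over all training utterances, of the gap between the
log-posterior of the correct transcription and that of its closest
competitor.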

12
Large Margin HMMs
  • A couple of variants of large-margin training
    have been proposed
  • Soft margin training [Jinyu Li et al. 2006]
  • Large margin GMMs [F. Sha and L.K. Saul 2007]
  • In these, the size of the margin is specified in
    terms of a loss function between the two sets of
    sequences
13
Direct Models
  • Direct modeling attempts to model the posterior
    probability directly
  • There are many potential advantages as well as
    challenges for direct modeling
  • The direct model can potentially make decoding
    simpler
  • The direct model allows for the potential
    combination of multiple sources of data in a
    unified fashion
  • Asynchronous and overlapping features can be
    incorporated formally
  • It will be possible to take advantage of
    supra-segmental features such as prosodic
    features, acoustic-phonetic features, speaker
    style, rate of speech, and channel differences
  • However, joint estimation would require a large
    amount of parallel speech and text data (a
    challenge for data collection)

14
Direct Models
  • The relationship between observations and states
    is reversed
  • Separate transition and observation probabilities
    are replaced with one function (see the sketch
    below)
  • Modeling the posterior directly makes direct
    computation of P(w|O) possible
  • The model can also be conditioned flexibly on a
    variety of contextual features
  • Any computable property of the observation
    sequence can be used as a feature
  • The number of features at each time frame need
    not be the same

15
Maximum Entropy Markov Models
  • Recently, McCallum et al. (ICML 2000) modeled
    sequential processes using a direct model similar
    to the HMM in graphical structure and used
    exponential models for transition-observation
    probabilities
  • Called Maximum Entropy Markov Model (MEMM)
  • Maximum Entropy modeling is used to model the
    conditional distributions
  • ME modeling is based on the principle of avoiding
    unnecessary assumptions
  • The principle states that the modeled probability
    distribution should be consistent with the given
    collection of facts about itself and otherwise be
    as uniform as possible

16
Maximum Entropy Markov Models
  • The mathematical interpretation of this principle
    results in a constrained optimization problem
  • Maximize the entropy of the conditional
    distribution P(s|o), subject to the given
    constraints
  • The constraints represent the known facts about
    the model, derived from statistics of the
    training data

Definition 1 and Definition 2 (equations omitted in the transcript)
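The two definitions referred to on this slide are most likely the
standard empirical and model feature expectations of maximum-entropy
modeling; a sketch under that assumption:

    \tilde{E}[f_i] = \sum_{o, s} \tilde{P}(o, s)\, f_i(o, s)
        \quad \text{(empirical expectation of feature } f_i\text{)}

    E[f_i] = \sum_{o, s} \tilde{P}(o)\, P(s \mid o)\, f_i(o, s)
        \quad \text{(model expectation of feature } f_i\text{)}

The constraints then require E[f_i] = \tilde{E}[f_i] for every feature
f_i.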
17
Maximum Entropy Markov Models
  • These definitions allow us to introduce the
    constraints of the model
  • The expected value of each feature with respect
    to the model is constrained to equal its
    empirical expectation
  • Using Lagrange multipliers for constrained
    optimization, the desired probability
    distribution is given by the maximum of the
    Lagrangian sketched below
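A sketch of the resulting Lagrangian in its standard form (a
reconstruction; the multipliers \lambda_i correspond to the feature
constraints):

    \Lambda(P, \boldsymbol{\lambda}) = H(P) + \sum_{i} \lambda_i
        \left( E[f_i] - \tilde{E}[f_i] \right),
    \qquad
    H(P) = -\sum_{o, s} \tilde{P}(o)\, P(s \mid o)\, \log P(s \mid o)

Maximizing \Lambda over P yields the exponential form given on the next
slide.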

18
Maximum Entropy Markov Models
  • Finally, the solution of the objective function
    is given by an exponential model (sketched below)
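The standard maximum-entropy solution has the exponential (log-linear)
form (a reconstruction of the result the slide states):

    P(s \mid o) = \frac{1}{Z(o)}
        \exp\!\left( \sum_{i} \lambda_i f_i(o, s) \right),
    \qquad
    Z(o) = \sum_{s'} \exp\!\left( \sum_{i} \lambda_i f_i(o, s') \right)

where Z(o) is the per-observation normalization term and the weights
\lambda_i are the Lagrange multipliers.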

19
Reference
  • [SAP06] Jeff Kuo and Yuqing Gao, "Maximum Entropy
    Direct Models for Speech Recognition," IEEE
    Transactions on Audio, Speech, and Language
    Processing, 2006