1
Discriminative Learning for Hidden Markov Models
  • Li Deng

Microsoft Research
EE 516 UW Spring 2009
2
Minimum Classification Error (MCE)
  • The objective function of MCE training is a
    smoothed recognition error rate.
  • Traditionally, the MCE criterion is optimized
    through stochastic gradient descent (e.g., GPD)
  • In this work, we propose a Growth
    Transformation (GT) based method for MCE model
    estimation

3
Automatic Speech Recognition (ASR)
[Diagram: overview of a speech recognition system]
4
Models (feature functions) in ASR
ASR in the log-linear framework
Λ is the parameter set of the acoustic model
(HMM), which is the quantity estimated by MCE
training in this work.
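As a reconstruction of the log-linear view (the slide's own equation did not survive extraction; the feature-function notation is an assumption based on standard log-linear ASR practice), the recognizer's decision rule can be written as
\[
\hat{S} = \arg\max_{S} \sum_i \lambda_i f_i(X, S),
\]
where one feature function is the acoustic log-likelihood \(\log p(X \mid S; \Lambda)\) and another is the language-model score \(\log P(S)\).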
5
MCE Misclassification measure
Define the misclassification measure (here using
the correct string and the top one incorrect
competing token):
s_r,1 is the top one incorrect competing string
(not equal to S_r)
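The measure itself was lost in extraction; the standard MCE definition for this correct-vs-top-competitor case (a reconstruction, consistent with the definitions above) is
\[
d_r(X_r, \Lambda) = -\log p(X_r, S_r; \Lambda) + \log p(X_r, s_{r,1}; \Lambda),
\]
which is positive exactly when the competing string scores higher than the correct transcription.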
6
MCE Loss function
Classification errors are counted from the sign
of the misclassification measure:
d_r(X_r, Λ) > 0 ⇒ 1 classification error
d_r(X_r, Λ) < 0 ⇒ 0 classification errors
Loss function: a smoothed error-count function
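The smoothing function did not survive extraction; MCE conventionally uses the sigmoid (a reconstruction; the slope parameter α is standard notation, not necessarily the slide's):
\[
l_r(d_r) = \frac{1}{1 + e^{-\alpha d_r}}, \qquad \alpha > 0,
\]
which approaches the 0/1 error count as \(\alpha \to \infty\) while remaining differentiable in \(\Lambda\).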
7
MCE Objective function
MCE objective function
L_MCE(Λ) is the smoothed recognition error rate
at the string (token) level. The acoustic model
is trained to minimize L_MCE(Λ), i.e.,
Λ* = argmin_Λ L_MCE(Λ)
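Written out (a reconstruction from the definitions on the preceding slides), the objective sums the smoothed per-token errors over the R training tokens:
\[
L_{\mathrm{MCE}}(\Lambda) = \sum_{r=1}^{R} l_r\big(d_r(X_r, \Lambda)\big), \qquad \Lambda^{*} = \arg\min_{\Lambda} L_{\mathrm{MCE}}(\Lambda).
\]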
8
MCE Optimization
9
MCE Optimization
  • Growth Transformation based MCE

If Λ = T(Λ') ensures P(Λ) > P(Λ'), i.e., P(Λ)
grows, then T(·) is called a growth
transformation of Λ for P(Λ).
The chain of reformulations:
Minimizing L_MCE(Λ) = Σ_r l(d_r(Λ))
⇔ Maximizing P(Λ) = G(Λ)/H(Λ)
⇔ Maximizing F(Λ; Λ') = G(Λ) − P(Λ')H(Λ) + D = Σ f(Λ)
⇔ Maximizing U(Λ; Λ') = Σ f(Λ') log f(Λ)
GT formula: ∂U(Λ; Λ')/∂Λ = 0 ⇒ Λ = T(Λ')
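Why maximizing F increases P (a one-line reconstruction of the standard growth-transformation argument, assuming H(Λ) > 0): since F(Λ'; Λ') = D,
\[
P(\Lambda) - P(\Lambda') = \frac{G(\Lambda) - P(\Lambda')H(\Lambda)}{H(\Lambda)} = \frac{F(\Lambda;\Lambda') - F(\Lambda';\Lambda')}{H(\Lambda)},
\]
so any Λ with \(F(\Lambda;\Lambda') > F(\Lambda';\Lambda')\) strictly increases P.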
10
MCE Optimization
Rewrite the MCE loss function so that minimizing
L_MCE(Λ) is equivalent to maximizing Q(Λ), where
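The explicit form was lost in extraction; following the published GT-MCE derivation (a reconstruction, using the sigmoid slope α from above), one may take
\[
Q(\Lambda) = \sum_{r=1}^{R} \big[1 - l_r(d_r(X_r,\Lambda))\big], \qquad 1 - l_r(d_r) = \frac{p^{\alpha}(X_r, S_r; \Lambda)}{p^{\alpha}(X_r, S_r; \Lambda) + p^{\alpha}(X_r, s_{r,1}; \Lambda)},
\]
i.e., Q sums the smoothed per-token correctness, so \(Q(\Lambda) = R - L_{\mathrm{MCE}}(\Lambda)\).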
11
MCE Optimization
Q(Λ) is further reformulated into a single
rational function P(Λ) = G(Λ)/H(Λ), where G(Λ)
and H(Λ) combine the per-token numerator and
denominator terms over all training tokens.
12
MCE Optimization
Increasing P(Λ) can be achieved by maximizing
F(Λ; Λ') = G(Λ) − P(Λ')H(Λ) + D,
as long as D is a Λ-independent constant
(Λ' is the parameter set obtained from the last
iteration).
Substituting G(·) and H(·) into F(·),
13
MCE Optimization
Reformulate F(Λ; Λ') as a sum over competing
strings s and hidden state sequences q;
F(Λ; Λ') is then ready for EM-style optimization.
Note that G(Λ') is a constant, and
log p(X, q | s; Λ) is easy to decompose.
14
MCE Optimization
Increasing F(Λ; Λ') can be achieved by maximizing
the auxiliary function U(Λ; Λ').
Use extended Baum-Welch for the E-step; since
log f(X, q, s; Λ) is decomposable w.r.t. Λ, the
M-step is easy to compute.
The growth transformation of Λ for the CDHMM is
then:
15
MCE Model estimation formulas
For a Gaussian-mixture CDHMM, the growth
transformation of the mean and covariance of
Gaussian m is given below.
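The update equations themselves did not survive extraction; the following is a reconstruction of the standard GT/extended-Baum-Welch re-estimates from the GT-MCE literature, where Δγ_m(r,t) denotes the difference between the occupancies of Gaussian m computed against the correct and the competing strings:
\[
\mu_m = \frac{\sum_{r}\sum_{t} \Delta\gamma_m(r,t)\, x_{r,t} + D_m\,\mu'_m}{\sum_{r}\sum_{t} \Delta\gamma_m(r,t) + D_m}
\]
\[
\Sigma_m = \frac{\sum_{r}\sum_{t} \Delta\gamma_m(r,t)\,(x_{r,t}-\mu_m)(x_{r,t}-\mu_m)^{\top} + D_m\big[\Sigma'_m + (\mu_m-\mu'_m)(\mu_m-\mu'_m)^{\top}\big]}{\sum_{r}\sum_{t} \Delta\gamma_m(r,t) + D_m}
\]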
16
MCE Model estimation formulas
Setting of D_m:
Theoretically, set D_m large enough that
f(X, q, s; Λ) > 0.
Empirically:
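The empirical rule on the slide was lost; one common heuristic in extended-Baum-Welch practice (an assumption here, not necessarily the slide's exact rule) ties D_m to the competing-string occupancy,
\[
D_m = E \cdot \sum_{r}\sum_{t} \gamma^{\mathrm{den}}_m(r,t),
\]
with the constant \(E\) (typically 1–2) tuned for stable convergence.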
17
MCE Workflow
Training utterances + last iteration's model Λ'
→ Recognition → competing strings
Competing strings + training transcripts
→ GT-MCE → new model Λ → next iteration
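A minimal numpy sketch of the per-Gaussian mean update inside the GT-MCE box (the function name, toy data, and the D_m rule are assumptions; only the update formula follows the reconstruction above):
```python
import numpy as np

def gt_mean_update(x, gamma_num, gamma_den, mu_old, E=2.0):
    """One growth-transformation update of a Gaussian mean.

    x:         (T, d) observation vectors
    gamma_num: (T,) occupancies of this Gaussian under the correct
               transcription (numerator statistics)
    gamma_den: (T,) occupancies under the competing string
               (denominator statistics)
    mu_old:    (d,) mean from the previous iteration (Lambda')
    E:         constant for the D_m heuristic (an assumption; the
               slide's exact rule did not survive extraction)
    """
    dgamma = gamma_num - gamma_den     # differential occupancy
    D_m = E * gamma_den.sum()          # smoothing constant D_m
    num = dgamma @ x + D_m * mu_old    # weighted data + smoothing term
    den = dgamma.sum() + D_m
    return num / den

# Toy usage: 5 frames of 3-dim features
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
mu_new = gt_mean_update(x,
                        gamma_num=np.full(5, 0.6),
                        gamma_den=np.full(5, 0.4),
                        mu_old=np.zeros(3))
print(mu_new)  # moves from mu_old toward the occupancy-weighted data mean
```
In the full workflow above, these statistics come from a forward-backward pass over the correct transcript and over the recognized competing string, respectively, and the same pattern re-estimates the covariances.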
18
Experiment TI-DIGITS
  • Vocabulary: 1 to 9, plus "oh" and "zero"
  • Training set: 8623 utterances / 28,329 words
  • Test set: 8700 utterances / 28,583 words
  • 33-dimensional spectrum features: energy + 10
    MFCCs, plus Δ and ΔΔ features
  • Model: Continuous-Density HMMs (CDHMMs)
  • Total number of Gaussian components: 3284

19
Experiment TI-DIGITS
GT-MCE vs. the ML (maximum likelihood) baseline:
Obtains the lowest error rate reported on this
task; reduces the word error rate (WER) by 23%;
fast and stable convergence.
20
Experiment Microsoft Tele. ASR
  • Microsoft Speech Server (ENUTEL)
  • A telephony speech recognition system
  • Training set: 2000 hours of speech / 2.7
    million utterances
  • 33-dim spectrum features: energy + MFCCs,
    plus Δ and ΔΔ
  • Acoustic model: Gaussian-mixture HMM
  • Total number of Gaussian components: 100K
  • Vocabulary: 120K (delivered vendor lexicon)
  • CPU cluster: 100 CPUs @ 1.8GHz–3.4GHz
  • Training cost: 45 hours per iteration

21
Experiment Microsoft Tele. ASR
  • Evaluated on four corpus-independent test sets
  • Collected from sites other than the training
    data providers
  • Covering major commercial telephony ASR
    scenarios

22
Experiment Microsoft Tele. ASR
Significant performance improvements across the
board. This is the first time MCE has been
successfully applied to a 2000-hour speech
database. Growth Transformation based MCE
training is well suited to large-scale modeling
tasks.