Title: Discriminative Learning for Hidden Markov Models
1. Discriminative Learning for Hidden Markov Models
Microsoft Research
EE 516 UW Spring 2009
2. Minimum Classification Error (MCE)
- The objective function of MCE training is a smoothed recognition error rate.
- Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., GPD).
- In this work we propose a Growth Transformation (GT) based method for MCE model estimation.
3. Automatic Speech Recognition (ASR)
(Figure: speech recognition system overview)
4. Models (feature functions) in ASR
ASR in the log-linear framework:
Λ is the parameter set of the acoustic model (HMM), which is the quantity of interest in the MCE training of this work.
5. MCE: Misclassification measure
Define the misclassification measure (in the case of using the correct token and the top-one incorrect competing token):
s_{r,1} is the top-one incorrect competing string (not equal to the correct string S_r).
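The defining equation did not survive extraction; for the correct-string-plus-top-competitor case described here, the standard MCE discriminant is

    d_r(X_r, \Lambda) = -\log p(X_r, S_r; \Lambda) + \log p(X_r, s_{r,1}; \Lambda)

so that d_r > 0 exactly when the competing string outscores the correct transcription S_r.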
6. MCE: Loss function
Classification error:
  d_r(X_r, Λ) > 0 → 1 classification error
  d_r(X_r, Λ) < 0 → 0 classification error
Loss function: a smoothed error-count function.
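The smoothing function itself was a slide graphic; the customary choice in MCE, consistent with the 0/1 behavior above, is the sigmoid

    l_r(d_r) = \frac{1}{1 + e^{-\alpha d_r}}, \qquad \alpha > 0

which approaches a hard error count as the slope α grows.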
7. MCE: Objective function
MCE objective function:
L_MCE(Λ) is the smoothed recognition error rate at the string (token) level. The acoustic model is trained to minimize L_MCE(Λ), i.e.,

    \Lambda^{*} = \arg\min_{\Lambda} L_{MCE}(\Lambda)
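Reconstructing the lost formula from the definitions above, the objective is the sum of the smoothed per-token losses over the R training tokens:

    L_{MCE}(\Lambda) = \sum_{r=1}^{R} l_r\big( d_r(X_r, \Lambda) \big)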
8. MCE: Optimization
9. MCE: Optimization
- Growth Transformation based MCE:
If Λ = T(Λ') ensures P(Λ) > P(Λ'), i.e., P(Λ) grows, then T(·) is called a growth transformation of Λ for P(Λ).
The chain of reductions (reconstructed from the slide's diagram):

    Minimizing L_{MCE}(\Lambda) = \sum_r l_r(d_r(\cdot))
    ⇒ Maximizing P(\Lambda) = G(\Lambda)/H(\Lambda)
    ⇒ Maximizing F(\Lambda; \Lambda') = G(\Lambda) - P(\Lambda')H(\Lambda) + D
    ⇒ Maximizing F(\Lambda; \Lambda') = \sum f(\Lambda)
    ⇒ Maximizing U(\Lambda; \Lambda') = \sum f(\Lambda') \log f(\Lambda)
    ⇒ GT formula: \partial U / \partial \Lambda = 0 \Rightarrow \Lambda = T(\Lambda')
10. MCE: Optimization
Re-write the MCE loss function.
Then, minimizing L_MCE(Λ) ⟺ maximizing Q(Λ), where:
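The rewritten forms were slide graphics; with the sigmoid loss at slope α = 1 they reduce, following the GT-MCE literature this deck summarizes, to

    l_r = \frac{p(X_r, s_{r,1}; \Lambda)}{p(X_r, S_r; \Lambda) + p(X_r, s_{r,1}; \Lambda)},
    \qquad
    Q(\Lambda) = R - L_{MCE}(\Lambda) = \sum_{r=1}^{R} \frac{p(X_r, S_r; \Lambda)}{p(X_r, S_r; \Lambda) + p(X_r, s_{r,1}; \Lambda)}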
11. MCE: Optimization
Q(Λ) is further re-formulated into a single rational function P(Λ), where:
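The definitions of G and H were lost with the slide graphics; one consistent reconstruction puts the sum of ratios in Q(Λ) over a common denominator, giving

    P(\Lambda) = \frac{G(\Lambda)}{H(\Lambda)}, \qquad
    H(\Lambda) = \prod_r \big[ p(X_r, S_r; \Lambda) + p(X_r, s_{r,1}; \Lambda) \big],

    G(\Lambda) = \sum_r p(X_r, S_r; \Lambda) \prod_{r' \neq r} \big[ p(X_{r'}, S_{r'}; \Lambda) + p(X_{r'}, s_{r',1}; \Lambda) \big]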
12. MCE: Optimization
Increasing P(Λ) can be achieved by maximizing

    F(\Lambda; \Lambda') = G(\Lambda) - P(\Lambda')H(\Lambda) + D

as long as D is a Λ-independent constant. (Λ' is the parameter set obtained from the last iteration.)
Substituting G(·) and H(·) into F(·):
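A one-line argument, filling in the step the slide leaves implicit: since G(Λ) = P(Λ)H(Λ) and F(Λ'; Λ') = D,

    F(\Lambda; \Lambda') - F(\Lambda'; \Lambda') = H(\Lambda)\big[ P(\Lambda) - P(\Lambda') \big]

and H(Λ) > 0 (a product of probabilities), so any Λ that increases F over its value at Λ' necessarily increases P.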
13. MCE: Optimization
Reformulate F(Λ; Λ') so that it is ready for EM-style optimization.
Note that G(Λ') is a constant, and log p(X, q | s; Λ) is easy to decompose.
14. MCE: Optimization
Increasing F(Λ; Λ') can be achieved by maximizing the auxiliary function U(Λ; Λ').
Use extended Baum-Welch for the E-step; log f(X, q, s; Λ, Λ') is decomposable w.r.t. Λ, so the M-step is easy to compute.
The resulting growth transformation of Λ for the CDHMM follows on the next slide.
15. MCE: Model estimation formulas
For a Gaussian-mixture CDHMM, the GT re-estimates of the mean and covariance of Gaussian m are given below.
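The update formulas themselves were slide images; the growth-transformation (extended Baum-Welch) re-estimates reported for GT-MCE in the literature this deck follows are

    \mu_m = \frac{\sum_r \sum_t \Delta\gamma_{m,r}(t)\, x_{r,t} + D_m \mu_m'}{\sum_r \sum_t \Delta\gamma_{m,r}(t) + D_m}

    \Sigma_m = \frac{\sum_r \sum_t \Delta\gamma_{m,r}(t)\,(x_{r,t} - \mu_m)(x_{r,t} - \mu_m)^{\top} + D_m \Sigma_m' + D_m (\mu_m - \mu_m')(\mu_m - \mu_m')^{\top}}{\sum_r \sum_t \Delta\gamma_{m,r}(t) + D_m}

where Δγ_{m,r}(t) is the loss-weighted difference between the occupancy of Gaussian m under the correct string S_r and under the competitor s_{r,1}, and primes denote last-iteration values.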
16. MCE: Model estimation formulas
Setting of D_m:
- Theoretically, set D_m so that f(X, q, s; Λ, Λ') > 0.
- Empirically, a larger constant is used, tuned for stable convergence (one common heuristic appears in the sketch below).
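A minimal sketch of the per-Gaussian update in Python, assuming diagonal covariances and the common EBW heuristic D_m = E × (competing-string occupancy); the function name gt_update_gaussian, the constant E, and the variance floor are illustrative assumptions, not the slide's exact recipe.

    import numpy as np

    def gt_update_gaussian(dgamma, x, mu_old, var_old, E=2.0):
        """One GT-MCE re-estimate for a single diagonal-covariance Gaussian.

        dgamma : (T,)   differential occupancies Delta-gamma_m(t), positive
                 for the correct string, negative for the competing string
        x      : (T, D) feature frames aligned with dgamma
        mu_old, var_old : (D,) last-iteration mean and diagonal covariance
        """
        den_occ = np.clip(-dgamma, 0.0, None).sum()  # competing-string occupancy
        Dm = E * den_occ                             # heuristic smoothing constant
        denom = dgamma.sum() + Dm
        mu_new = (dgamma @ x + Dm * mu_old) / denom
        # Second-order form of the covariance update (diagonal case),
        # algebraically equivalent to the (x - mu)(x - mu)^T formula above.
        var_new = (dgamma @ (x * x) + Dm * (var_old + mu_old**2)) / denom - mu_new**2
        return mu_new, np.maximum(var_new, 1e-4)     # floor keeps variances positive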
17. MCE Workflow
(Workflow diagram:) Training utterances + last-iteration model Λ' → Recognition → Competing strings; competing strings + training transcripts → GT-MCE → new model Λ → next iteration.
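A sketch of the outer loop the diagram describes; recognize, accumulate_mce_stats, and gt_update are hypothetical helpers standing in for the decoding, statistics, and re-estimation stages named on the slide.

    def gt_mce_training(utterances, transcripts, model, n_iters=10):
        # Each iteration mirrors the slide's workflow: decode competing
        # strings with the last-iteration model, gather MCE statistics
        # against the training transcripts, then apply the GT update.
        for _ in range(n_iters):
            competing = [recognize(model, x) for x in utterances]
            stats = accumulate_mce_stats(model, utterances, transcripts, competing)
            model = gt_update(model, stats)  # new model for the next iteration
        return model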
18. Experiment: TI-DIGITS
- Vocabulary: 1 to 9, plus "oh" and "zero"
- Training set: 8623 utterances / 28329 words
- Test set: 8700 utterances / 28583 words
- Features: 33-dimensional spectral features (energy + 10 MFCCs, plus Δ and ΔΔ)
- Model: continuous-density HMMs
- Total number of Gaussian components: 3284
19. Experiment: TI-DIGITS
GT-MCE vs. the ML (maximum likelihood) baseline:
- Obtains the lowest error rate on this task
- Reduces the recognition word error rate (WER) by 23%
- Fast and stable convergence
20. Experiment: Microsoft Tele. ASR
- Microsoft Speech Server (ENUTEL): a telephony speech recognition system
- Training set: 2000 hours of speech / 2.7 million utterances
- Features: 33-dim spectral features (energy + MFCCs, plus Δ and ΔΔ)
- Acoustic model: Gaussian-mixture HMM
- Total number of Gaussian components: 100K
- Vocabulary: 120K (as-delivered vendor lexicon)
- CPU cluster: 100 CPUs @ 1.8-3.4 GHz
- Training cost: 45 hours per iteration
21. Experiment: Microsoft Tele. ASR
- Evaluated on four corpus-independent test sets
- Collected from sites other than the training-data providers
- Covering major commercial telephony ASR scenarios
22. Experiment: Microsoft Tele. ASR
- Significant performance improvements across the board
- The first time MCE has been successfully applied to a 2000-hour speech database
- Growth Transformation based MCE training is well suited to large-scale modeling tasks