Title: A Survey of Large Margin Hidden Markov Model
- Xinwei Li, Hui Jiang
- York University
Reference Papers
- Xinwei Li, M.S. thesis, Sep. 2005: Large Margin HMMs for SR
- Xinwei Li, ICASSP 2005: Large Margin HMMs for SR
- Chaojun Liu, ICASSP 2005: Discriminative Training of CDHMMs for Maximum Relative Separation Margin
- Xinwei Li, ASRU 2005: A Constrained Joint Optimization Method for LME
- Hui Jiang, SAP 2006: Large Margin HMMs for SR
- Jinyu Li, ICSLP 2006: Soft Margin Estimation of HMM Parameters
Outline
- Large Margin HMMs
- Analysis of Margin in CDHMM
- Optimization methods for Large Margin HMM estimation
- Soft Margin Estimation for HMM
Large Margin HMMs for ASR
- In ASR, given any speech utterance X, a speech recognizer chooses the word W as output based on the plug-in MAP decision rule given below.
- For a speech utterance Xi, assuming its true word identity is Wi, the multiclass separation margin for Xi is defined in terms of the discriminant function, where Ω denotes the set of all possible words.
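- The decision rule, discriminant function, and margin equations were images in the original slides; the following is a sketch of the standard definitions used in the cited LME papers (the notation λ_W for the HMM of word W and Λ for the whole parameter set is assumed here):

  \hat{W} = \arg\max_{W \in \Omega} \; p(W)\, p(X \mid \lambda_W)   (plug-in MAP decision rule)

  F(X_i \mid \lambda_W) = \log\big[ p(W)\, p(X_i \mid \lambda_W) \big]   (discriminant function)

  d(X_i) = F(X_i \mid \lambda_{W_i}) \;-\; \max_{W \in \Omega,\, W \neq W_i} F(X_i \mid \lambda_W)   (separation margin)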
Large Margin HMMs for ASR
- According to statistical learning theory (Vapnik), the generalization error rate of a classifier on new test sets is theoretically bounded by a quantity related to its margin.
- Motivated by the large margin principle, even for those utterances in the training set that all have positive margin, we may still want to maximize the minimum margin to build an HMM-based large margin classifier for ASR.
Large Margin HMMs for ASR
- Given a set of training data D = {X1, X2, ..., XT}, we usually know the true word identities of all utterances in D, denoted L = {W1, W2, ..., WT}.
- First, from all utterances in D, we identify a subset of utterances S, where ε > 0 is a preset positive number (see the sketch below).
- We call S the support vector set; each utterance in S is called a support token and has a relatively small positive margin among all utterances in the training set D.
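- A sketch of the support set definition implied above (the exact slide equation is not recoverable):

  S = \{\, X_i \;\mid\; X_i \in D,\; 0 \le d(X_i) \le \varepsilon \,\}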
Large Margin HMMs for ASR
- This idea leads to estimating the HMM models Λ based on the criterion of maximizing the minimum margin of all support tokens, which is named large margin estimation (LME) of HMMs.
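- A sketch of the LME criterion stated above (Λ denotes the whole set of CDHMM parameters):

  \tilde{\Lambda} = \arg\max_{\Lambda} \; \min_{X_i \in S} \; d(X_i \mid \Lambda)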
Analysis of Margin in CDHMM
- Adopting the Viterbi method to approximate the summation over all paths by the single optimal Viterbi path, the discriminant function can be expressed as sketched below.
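- The expanded form was an equation image; a sketch of the usual Viterbi-approximated discriminant function (transition probabilities a, mixture weights ω, and the optimal state/mixture sequences s*, l* are assumed notation):

  F(X_i \mid \lambda_W) \approx \log p(W) + \sum_{t=1}^{T_i} \Big[ \log a_{s_{t-1}^{*} s_t^{*}} + \log \omega_{s_t^{*} l_t^{*}} + \log \mathcal{N}\big(x_{it} ;\, \mu_{s_t^{*} l_t^{*}},\, \Sigma_{s_t^{*} l_t^{*}}\big) \Big]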
Analysis of Margin in CDHMM
- Here, we only consider estimating the mean vectors. In this case, the discriminant function can be represented as a summation of quadratic terms related to the mean values of the CDHMMs.
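- A sketch of this representation, assuming diagonal-covariance Gaussian mixtures as in the cited papers; "const" collects every term that does not depend on the means:

  F(X_i \mid \lambda_W) \approx \text{const} - \frac{1}{2} \sum_{t=1}^{T_i} \sum_{d=1}^{D} \frac{\big(x_{itd} - \mu_{s_t^{*} l_t^{*} d}\big)^{2}}{\sigma_{s_t^{*} l_t^{*} d}^{2}}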
Analysis of Margin in CDHMM
- As a result, the decision margin can be represented in a standard diagonal quadratic form.
- Thus, for each feature vector xit, we can divide all of its dimensions into two parts; each feature dimension contributes to the decision margin separately.
Analysis of Margin in CDHMM
- After some algebraic manipulation, the margin separates into a linear function and a quadratic function of the mean parameters, as sketched below.
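- A sketch of the decomposition referred to above (μ_td, σ²_td denote the mean/variance aligned to frame t and dimension d by the true model, ν_td, τ²_td those aligned by the best competing model; these symbol names are assumptions):

  d(X_i) \approx \sum_{t=1}^{T_i} \sum_{d=1}^{D} \left[ \frac{(x_{itd} - \nu_{td})^{2}}{2\,\tau_{td}^{2}} - \frac{(x_{itd} - \mu_{td})^{2}}{2\,\sigma_{td}^{2}} \right] + \text{const}

- Expanding the squares, each dimension contributes a term that is linear in the means, x_{itd}\,\mu_{td}/\sigma_{td}^{2} - x_{itd}\,\nu_{td}/\tau_{td}^{2}, plus a term that is quadratic in the means, \nu_{td}^{2}/(2\tau_{td}^{2}) - \mu_{td}^{2}/(2\sigma_{td}^{2}).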
Optimization methods for LM HMM estimation
- An iterative localized optimization method
- A constrained joint optimization method
- A semidefinite programming method
Iterative localized optimization
- To increase the margin without limit while keeping the margins positive for all samples, both of the competing models must be moved together.
- If we keep one of the models fixed, the other model cannot be moved very far under the constraint that all samples must have positive margin; otherwise the margin of some tokens would become negative.
- Instead of optimizing the parameters of all models at the same time, only one selected model is adjusted in each optimization step.
- The process then iterates, updating another model, until the optimal margin is achieved.
Iterative localized optimization
- How do we select the target model in each step?
- The model should be the one relevant to the support token with the minimum margin.
- The minimax optimization can then be re-formulated as sketched below.
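- The re-formulated objective was an equation image; a sketch of the localized update consistent with the description (λ_ν is the single model selected at iteration n, all other models are held fixed at their current values):

  \lambda_{\nu}^{(n+1)} = \arg\max_{\lambda_{\nu}} \; \min_{X_i \in S} \; d\big(X_i \mid \lambda_{\nu},\, \{\lambda_{W}^{(n)}\}_{W \neq \nu}\big)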
Iterative localized optimization
- The minimum-margin objective is approximated by a summation of exponential functions.
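- A sketch of this smoothing (the standard "softmin" approximation; η > 0 is an assumed scaling constant controlling the tightness of the approximation):

  \min_{X_i \in S} d(X_i) \approx -\frac{1}{\eta} \log \sum_{X_i \in S} \exp\big(-\eta\, d(X_i)\big)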
Constrained Joint optimization
- Introduce some constraints to make the optimization problem bounded.
- In this way, the optimization can be performed jointly with respect to all model parameters.
Constrained Joint optimization
- One constraint is introduced to bound the margin contribution from the linear part.
- A second constraint is introduced to bound the margin contribution from the quadratic part (see the sketch below).
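- The exact constraints are given in the ASRU'05 paper and did not survive this export; one illustrative construction (the radius R and the variance normalization are assumptions) is a norm ball on the variance-normalized means, which bounds the quadratic contribution directly and the linear contribution via the Cauchy-Schwarz inequality:

  \sum_{k} \sum_{d} \frac{\mu_{kd}^{2}}{\sigma_{kd}^{2}} \le R^{2}, \quad k \text{ ranging over all Gaussian components being updated}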
Constrained Joint optimization
- Reformulate the large margin estimation as the following constrained minimax optimization problem.
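- A sketch of the constrained problem (with the constraints introduced above):

  \tilde{\Lambda} = \arg\max_{\Lambda} \; \min_{X_i \in S} \; d(X_i \mid \Lambda) \quad \text{subject to the constraints,} \qquad \text{equivalently} \quad \arg\min_{\Lambda} \; \max_{X_i \in S} \big[ -d(X_i \mid \Lambda) \big]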
Constrained Joint optimization
- The constrained minimization problem can be transformed into an unconstrained minimization problem.
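- One common device for this step (a sketch; the paper's exact transformation may differ) is to fold the constraints into the smoothed objective as penalty terms, with β > 0 an assumed penalty weight and g_j(Λ) ≤ 0 standing for the j-th constraint:

  \tilde{\Lambda} = \arg\min_{\Lambda} \; \frac{1}{\eta} \log \sum_{X_i \in S} \exp\big(-\eta\, d(X_i \mid \Lambda)\big) + \beta \sum_{j} \max\big(0,\, g_j(\Lambda)\big)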
Soft Margin estimation
- Model separation measure and frame selection
- SME objective function and sample selection
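- A sketch of the SME objective as commonly stated in the cited ICSLP'06 paper (ρ is the soft margin, λ a balancing coefficient, d(X_i) the frame-selected separation measure, N the number of training utterances):

  \min_{\Lambda,\, \rho} \; \frac{\lambda}{\rho} + \frac{1}{N} \sum_{i=1}^{N} \max\big(0,\; \rho - d(X_i \mid \Lambda)\big)

- Only samples with d(X_i) ≤ ρ incur a loss, which is the sample selection referred to above; in practice ρ is often preset rather than optimized jointly (see the next slide).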
Soft Margin estimation
- Difference between SME and LME:
- LME neglects the misclassified samples; consequently, LME often needs a very good preliminary estimate from the training set.
- SME works on all the training data, both the correctly classified and the misclassified samples.
- SME, however, must first choose a margin ρ heuristically.