Title: A Survey of Boosting HMM Acoustic Model Training
1. A Survey of Boosting HMM Acoustic Model Training
2. Introduction
- The No Free Lunch Theorem states that
- There is no single learning algorithm that always induces the most accurate learner in every domain
- Learning is an ill-posed problem: with finite data, each algorithm converges to a different solution and fails under different circumstances
- Even though the performance of a learner may be fine-tuned, there are still instances on which even the best learner is not accurate enough
- The idea is:
- There may be another learner that is accurate on these instances
- By suitably combining multiple learners, accuracy can be improved
3. Introduction
- Since there is no point in combining learners that always make similar decisions, the aim is to find a set of base-learners that differ in their decisions so that they complement each other
- There are different ways the multiple base-learners can be combined to generate the final output
- Multiexpert combination methods
- Voting and its variants
- Mixture of experts
- Stacked generalization
- Multistage combination methods
- Cascading
4. Voting
- The simplest way to combine multiple classifiers, which corresponds to taking a linear combination of the learners
- This is also known as ensembles and linear opinion pools
- The name voting comes from its use in classification (see the sketch below)
- If all learners are weighted equally and the class receiving the most votes wins, this is called plurality voting
- If there are only two classes, this reduces to majority voting
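To make the linear-combination view concrete, here is a minimal Python sketch of weighted/plurality voting; the function name `vote` and the equal-weight default are illustrative choices, not from the survey.

```python
# Sketch: combining L classifiers by a weighted vote over their predicted labels.
from collections import defaultdict

def vote(predictions, weights=None):
    """predictions: list of class labels, one per base learner."""
    if weights is None:                     # equal weights -> plurality voting
        weights = [1.0 / len(predictions)] * len(predictions)
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w                  # linear combination of the votes
    return max(scores, key=scores.get)      # class with the largest total weight

# With two classes and equal weights this reduces to majority voting.
print(vote(["yes", "no", "yes"]))           # -> "yes"
```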
5. Bagging
- Bagging is a voting method whereby base-learners are made different by training them over slightly different training sets
- This is done by bootstrap: given a training set X of size N, we draw N instances randomly from X with replacement (sketched below)
- In bagging, generating complementary base-learners is left to chance and to the instability of the learning method
- A learning algorithm is unstable if small changes in the training set cause a large difference in the generated learner
- Examples: decision trees, multilayer perceptrons, condensed nearest neighbor
- Bagging is short for bootstrap aggregating
- Breiman, L. 1996. Bagging Predictors. Machine Learning 26, 123-140
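A minimal sketch of the bootstrap step described above, assuming a caller supplies some `train` function for the base learner; the names and the number of learners are illustrative.

```python
# Sketch: bagging draws N instances with replacement for each base learner.
import random

def bootstrap_sample(X, rng):
    """Draw len(X) instances from X with replacement (the bootstrap)."""
    return [X[rng.randrange(len(X))] for _ in range(len(X))]

def bagging(X, train, n_learners=10):
    # Each base learner is fit on a slightly different bootstrap replicate;
    # their outputs would later be combined by voting/averaging.
    return [train(bootstrap_sample(X, random.Random(seed)))
            for seed in range(n_learners)]
```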
6. Boosting
- In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners
- The original boosting algorithm (Schapire 1990) combines three weak learners to generate a strong learner, in the sense of the probably approximately correct (PAC) learning model
- Disadvantage: it requires a very large training sample
- Schapire, R.E. 1990. The Strength of Weak Learnability. Machine Learning 5, 197-227
7. AdaBoost
- AdaBoost, short for adaptive boosting, uses the same training set over and over (so the set need not be large) and can combine an arbitrary number of base-learners, not just three
- The idea is to modify the probabilities of drawing the instances as a function of the error
- The probability of a correctly classified instance is decreased, and a new sample set is drawn from the original sample according to these modified probabilities, so training focuses more on instances misclassified by the previous learner (see the sketch below)
- Schapire et al. explain that the success of AdaBoost is due to its property of increasing the margin
- Schapire, R.E. et al. 1998. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. Annals of Statistics 26, 1651-1686
- Freund, Y. and Schapire, R.E. 1996. Experiments with a New Boosting Algorithm. In ICML 13, 148-156
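The reweighting idea can be sketched as follows for binary labels in {-1, +1}; `weak_learn` is an assumed weak-learner callback, and this is the generic AdaBoost recipe rather than any of the speech-specific variants surveyed later.

```python
# Sketch: AdaBoost reweighting -- correctly classified instances lose weight,
# misclassified ones gain weight, so the next learner focuses on the mistakes.
import math

def adaboost(X, y, weak_learn, T=10):
    N = len(X)
    D = [1.0 / N] * N                      # start from a uniform distribution
    ensemble = []
    for _ in range(T):
        h = weak_learn(X, y, D)            # base classifier fit under D
        err = sum(d for d, x, t in zip(D, X, y) if h(x) != t)
        if err == 0 or err >= 0.5:         # stop if perfect or no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # down-weight correct instances, up-weight errors, then renormalize
        D = [d * math.exp(-alpha * t * h(x)) for d, x, t in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, h))        # weighted vote of the h's at test time
    return ensemble
```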
8. AdaBoost.M2 (Freund and Schapire, 1997)
- Freund, Y. and Schapire, R.E. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119-139
9. Evolution of Boosting Algorithms
(Timeline figure in the original slides, spanning 1996-2006 and grouped by model type: neural network, GMM, HMM. The papers it charts, in chronological order:)
- ICSLP 96, G. Cook & T. Robinson: Boosting the Performance of Connectionist LVSR (neural network)
- EuroSpeech 97, G. Cook et al.: Ensemble Methods for Connectionist Acoustic Modeling
- ICASSP 99, H. Schwenk: Using Boosting to Improve a Hybrid HMM/Neural Network Speech Recognizer
- ICASSP 00, G. Zweig & M. Padmanabhan: Boosting Gaussian Mixtures in an LVCSR System (GMM)
- ICASSP 02, C. Meyer: Utterance-Level Boosting of HMM Speech Recognition (HMM)
- ICASSP 02, I. Zitouni et al.: Combination of Boosting and Discriminative Training for Natural Language Call Steering Systems
- ICASSP 03, R. Zhang & A. Rudnicky: Improving the Performance of an LVCSR System Through Ensembles of Acoustic Models
- EuroSpeech 03, R. Zhang & A. Rudnicky: Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models
- ICASSP 04, C. Dimitrakakis & S. Bengio: Boosting HMMs with an Application to Speech Recognition
- ICSLP 04, R. Zhang & A. Rudnicky: A Frame Level Boosting Training Scheme for Acoustic Modeling
- ICSLP 04, R. Zhang & A. Rudnicky: Apply N-Best List Re-Ranking to Acoustic Model Combinations of Boosting Training
- ICSLP 04, R. Zhang & A. Rudnicky: Optimizing Boosting with Discriminative Criteria
- EuroSpeech 05, R. Zhang et al.: Investigations on Ensemble Based Semi-Supervised Acoustic Model Training
- ICSLP 06, R. Zhang & A. Rudnicky: Investigations of Issues for Using Multiple Acoustic Models to Improve CSR
- SpeechCom 06, C. Meyer & H. Schramm: Boosting HMM Acoustic Models in LVCSR
10. Improving the Performance of an LVCSR System Through Ensembles of Acoustic Models
- ICASSP 2003
- Rong Zhang and Alexander I. Rudnicky
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University
11. Bagging vs. Boosting
- Bagging
- In each round, bagging randomly selects a number of examples from the original training set and produces a new single classifier based on the selected subset
- The final classifier is built by choosing the hypothesis best agreed on by the single classifiers
- Boosting
- In boosting, the single classifiers are iteratively trained in such a fashion that hard-to-classify examples are given increasing emphasis
- A parameter that measures each classifier's importance is determined with respect to its classification accuracy
- The final hypothesis is the weighted majority vote of the single classifiers
12. Algorithms
- The first algorithm is based on the intuition that an incorrectly recognized utterance should receive more attention in training
- For example, if the weight of an utterance is 2.6, we first add two copies of the utterance to the new training set, and then add a third copy with probability 0.6 (see the sketch below)
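A minimal sketch of this duplication rule (names are illustrative; the paper's exact bookkeeping may differ):

```python
# Sketch: an utterance with weight w contributes floor(w) copies to the new
# training set, plus one extra copy with probability equal to frac(w).
import math, random

def resample_by_weight(utterances, weights, rng=None):
    rng = rng or random.Random(0)
    new_set = []
    for utt, w in zip(utterances, weights):
        copies = int(math.floor(w))
        if rng.random() < w - copies:       # e.g. probability 0.6 when w = 2.6
            copies += 1
        new_set.extend([utt] * copies)
    return new_set
```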
13. Algorithms
- The exponential increase in the size of the training set is a severe problem for Algorithm 1
- Algorithm 2 is proposed to address this problem
14. Algorithms
- In Algorithms 1 and 2, there is no attempt to measure how important a model is relative to the others
- A good model should play a more important role than a bad one
15. Experiments
- Corpus: CMU Communicator system
- Experimental results
16. Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models
- Rong Zhang and Alexander I. Rudnicky
- Language Technologies Institute, CMU
- EuroSpeech 2003
17. Non-Boosting method
- Bagging
- is a commonly used method in the machine learning field
- randomly selects a number of examples from the original training set and produces a new single classifier
- in this paper it is called a non-Boosting method
- Boosting, in contrast, is based on the intuition that a misrecognized utterance should receive more attention in the successive training
18. Algorithms
- The algorithm includes a parameter that prevents the size of the training set from growing too large
19. Experiments
- The corpus
- Training set: 31,248 utterances; test set: 1,689 utterances
20. A Frame Level Boosting Training Scheme for Acoustic Modeling
- ICSLP 2004
- Rong Zhang and Alexander I. Rudnicky
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University
21. Introduction
- In the current Boosting algorithm, the utterance is the basic unit used for acoustic model training
- Our analysis shows that there are two notable weaknesses in this setting:
- First, the objective function of the current Boosting algorithm is designed to minimize utterance error instead of word error
- Second, in the current algorithm, an utterance is treated as a single unit for resampling
- This paper proposes a frame level Boosting training scheme for acoustic modeling to address these two problems
22. Frame Level Boosting Training Scheme
- The metric used in Boosting training is the frame level conditional probability, rather than the word/utterance level measure used before
- Objective function: built from the pseudo-loss of each frame t, which describes the degree of confusion of this frame for recognition
23. Frame Level Boosting Training Scheme
- Training scheme: how to resample the frame level training data?
- Each frame is duplicated a number of times according to its weight, and the duplicated frames form a new utterance for acoustic model training (see the sketch below)
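A rough sketch of such frame-level resampling, under the assumption that each frame is repeated roughly in proportion to its weight; the rounding rule and the `scale` parameter are illustrative, not taken from the paper.

```python
# Sketch: duplicate each frame according to its weight and assemble the
# duplicated frames into a new "utterance" for acoustic model training.
def resample_frames(frames, frame_weights, scale=10):
    new_utterance = []
    for frame, w in zip(frames, frame_weights):
        repeats = max(1, round(w * scale))   # hypothetical rounding rule
        new_utterance.extend([frame] * repeats)
    return new_utterance
```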
24. Experiments
- Corpus: CMU Communicator system
- Experimental results
25. Boosting HMM acoustic models in large vocabulary speech recognition
- Carsten Meyer, Hauke Schramm
- Philips Research Laboratories, Germany
- Speech Communication, 2006
26. Utterance approach for boosting in ASR
- An intuitive way of applying boosting to HMM speech recognition is at the utterance level
- Thus, boosting is used to improve upon an initial ranking of candidate word sequences
- The utterance approach has two advantages
- First, it is directly related to the sentence error rate
- Second, it is computationally much less expensive than boosting applied at the level of feature vectors
27. Utterance approach for boosting in ASR
- In the utterance approach, we define the input patterns $x_i$ to be the sequences of feature vectors corresponding to entire utterances
- $y$ denotes one possible candidate word sequence of the speech recognizer, $y_i$ being the correct word sequence for utterance $i$
- The a posteriori confidence measure is calculated on the basis of the N-best list for utterance $i$
28. Utterance approach for boosting in ASR
- Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight for each training utterance (a rough sketch follows)
- Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models
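As a rough, hedged sketch of how such weights might be derived from N-best scores: the softmax scaling and the `1 - confidence` rule below are simplifying assumptions for illustration, not the paper's exact AdaBoost.M2 formulas.

```python
# Sketch: an utterance weight grows when the posterior confidence of its
# correct transcription, estimated from the N-best list, is low.
import math

def nbest_posteriors(scores, scale=1.0):
    """scores: {hypothesis: log-score}; softmax over the N-best list."""
    m = max(scores.values())
    exps = {h: math.exp(scale * (s - m)) for h, s in scores.items()}
    Z = sum(exps.values())
    return {h: e / Z for h, e in exps.items()}

def utterance_weight(scores, correct):
    conf = nbest_posteriors(scores).get(correct, 0.0)
    return 1.0 - conf          # low confidence -> high weight (illustrative)
```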
29. Utterance approach for boosting in ASR
- Some problems are encountered when applying the approach to large-scale continuous speech applications
- N-best lists of reasonable length (e.g. N = 100) generally contain only a tiny fraction of the possible classification results
- This has two consequences
- In training, it may lead to sub-optimal utterance weights
- In recognition, Eq. (1) cannot be applied appropriately
30. Utterance approach for CSR -- Training
- Training
- A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in chopping the training data
- For long sentences, this simply means inserting additional sentence break symbols at silence intervals with a given minimum length
- This reduces the number of possible classifications of each sentence fragment, so that the resulting N-best lists should cover a sufficiently large fraction of hypotheses
31. Utterance approach for CSR -- Decoding
- Decoding: lexical approach for model combination
- A single pass decoding setup, where the combination of the boosted acoustic models is realized at the lexical level
- The basic idea is to add a new pronunciation model by replicating the set of phoneme symbols in each boosting iteration (e.g. by appending the suffix _t to the phoneme symbols)
- The new phoneme symbols represent the underlying acoustic model of boosting iteration t (e.g. au, au_1, au_2, ...)
32. Utterance approach for CSR -- Decoding
- Decoding: lexical approach for model combination (cont.)
- Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding suffixed phoneme set (see the sketch below)
- Use the reweighted training data to train the boosted classifier
- Decoding is then performed using the extended lexicon and the set of acoustic models, weighted by their unigram prior probabilities estimated on the training data (a weighted summation at the word level)
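A small sketch of the lexicon extension itself; the dictionary layout and the example word "sic" are illustrative assumptions.

```python
# Sketch: in boosting iteration t, each phoneme symbol gets a copy with the
# suffix "_t", and every word receives an additional pronunciation written
# in that new symbol set, alongside its original transcription.
def extend_lexicon(lexicon, t):
    """lexicon: {word: [pronunciations, each a list of phoneme symbols]}"""
    extended = {}
    for word, prons in lexicon.items():
        suffixed = [[f"{ph}_{t}" for ph in pron] for pron in prons]
        extended[word] = prons + suffixed   # keep old variants, add new ones
    return extended

# e.g. {"sic": [["s", "i", "c"]]} gains ["s_1", "i_1", "c_1"] in iteration 1
```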
33. In more detail
(Flow diagram in the original slide: in boosting iteration t, the reweighted, phonetically transcribed training corpus is used for ML/MMI training of acoustic model Mt; the phoneme set is replicated with the suffix _t and added to the lexicon as new pronunciation variants. For decoding, the lexicon is extended accordingly and the models M1, M2, ..., Mt are combined, either unweighted or with weighted model combination.)
34. In more detail
35. Weighted model combination
- Word level model combination
36. Experiments
- Isolated word recognition
- Telephone-bandwidth large vocabulary isolated word recognition
- SpeechDat(II) German material
- Continuous speech recognition
- Professional dictation and Switchboard
37. Isolated word recognition
- Database
- Training corpus consists of 18k utterances (4.3h) of city, company, first and family names
- Evaluations
- LILI test corpus: 10k single word utterances (3.5h); 10k word lexicon (matched conditions)
- Names corpus: an in-house collection of 676 utterances (0.5h); two different decoding lexica, 10k and 190k (acoustic conditions are matched, whereas there is a lexical mismatch)
- Office corpus: 3.2k utterances (1.5h), recorded over microphone in clean conditions; 20k lexicon (an acoustic mismatch to the training conditions)
38. Isolated word recognition
39. Isolated word recognition
- Combining boosting and discriminative training
- The experiments in isolated word recognition showed that boosting may improve the best test error rates
40. Continuous speech recognition
- Database
- Professional dictation
- An in-house data collection of real-life recordings of medical reports
- The acoustic training corpus consists of about 58h of data
- Evaluations were carried out on two test corpora
- Development corpus: 5.0h of speech
- Evaluation corpus: 3.3h of speech
- Switchboard
- Spontaneous conversations recorded over telephone lines; 57h (73h) of male (female) training data
- Evaluation corpus containing about 1h (0.5h) of male (female) speech
41. Continuous speech recognition
42.
43. Conclusions
- In this paper, a boosting approach which can be applied to any HMM based speech recognizer was presented and evaluated
- The increased recognizer complexity, and thus decoding effort, of the boosted systems is a major drawback compared to other training techniques like discriminative training
44. Probably Approximately Correct Learning
- We would like our hypothesis to be approximately correct, namely, that the error probability be bounded by some value $\epsilon$
- We also would like to be confident in our hypothesis: we want to know that it will be correct most of the time, so we want to be probably correct as well
- Given a class $C$ and examples drawn from some unknown but fixed probability distribution, we want a hypothesis $h$ such that, with probability at least $1 - \delta$, $h$ has error at most $\epsilon$, for arbitrary $\epsilon$ and $\delta$
45. Probably Approximately Correct Learning
- How many training examples N should we have, such that with probability at least $1 - \delta$, $h$ has error at most $\epsilon$? (A numeric check follows.)
- Each $h \in H$ between $S$ (the most specific hypothesis) and $G$ (the most general hypothesis) is consistent; together they make up the version space
- Each strip is at most $\epsilon/4$
- Pr that we miss a strip: $1 - \epsilon/4$
- Pr that $N$ instances miss a strip: $(1 - \epsilon/4)^N$
- Pr that $N$ instances miss 4 strips: $4(1 - \epsilon/4)^N$
- Require $4(1 - \epsilon/4)^N \le \delta$; using $(1 - x) \le \exp(-x)$, $4\exp(-\epsilon N/4) \le \delta$, which holds when $N \ge (4/\epsilon)\log(4/\delta)$
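A quick numeric check of the resulting sample bound, with illustrative values for epsilon and delta.

```python
# Check of the bound derived above: N >= (4/eps) * ln(4/delta) guarantees
# 4 * (1 - eps/4)**N <= delta in the axis-aligned rectangle example.
import math

def pac_sample_size(eps, delta):
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

eps, delta = 0.1, 0.05
N = pac_sample_size(eps, delta)              # -> 176
print(N, 4 * (1 - eps / 4) ** N <= delta)    # True: miss probability below delta
```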
46. The Boosting Approach to Machine Learning: An Overview
- Robert E. Schapire
- AT&T Labs, USA
- MSRI Workshop on Nonlinear Estimation and Classification, 2002
47. Abstract
- This paper overviews some recent work on boosting, including
- Analyses of AdaBoost's training error and generalization error
- Boosting's connection to game theory and linear programming
- The relationship between boosting and logistic regression
- Extensions of AdaBoost for multiclass classification problems
- Methods of incorporating human knowledge into boosting
48. References
- Freund, Y. and Schapire, R.E. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119-139
- Meir, R. and Rätsch, G. 2003. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 118-183
49. Introduction
- Boosting is based on the observation that finding many rough rules of thumb can be a lot easier than finding a single, highly accurate prediction rule
- Two fundamental questions
- How should each distribution be chosen on each round?
- How should the weak rules be combined into a single rule?
- A method for finding rough rules of thumb is called a weak or base learning algorithm
50. AdaBoost algorithm
51. AdaBoost algorithm (cont.)
- The base learner's job is to find a base classifier $h_t$ appropriate for the distribution $D_t$
- In the binary case, the base learner's aim is to minimize the error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$
- AdaBoost chooses a parameter $\alpha_t$ that intuitively measures the importance it assigns to $h_t$
52. Analyzing the training error
- The most basic theoretical property of AdaBoost concerns its ability to reduce the training error
- The training error of the final classifier is bounded as follows: $\frac{1}{m}\,\lvert\{i : H(x_i) \ne y_i\}\rvert \le \prod_t Z_t$
- where we define $Z_t = \sum_i D_t(i)\exp(-\alpha_t y_i h_t(x_i))$ as the normalization factor and $f(x) = \sum_t \alpha_t h_t(x)$
53. Detailed derivation
54. Analyzing the training error (cont.)
- The training error can be reduced most rapidly by choosing $\alpha_t$ and $h_t$ on each round to minimize $Z_t$
- In the case of binary classifiers, this gives $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{1 - 4\gamma_t^2}$, where $\gamma_t = 1/2 - \epsilon_t$
55. Analyzing the training error (cont.)
- Thus, if each base classifier is slightly better than random, so that $\gamma_t \ge \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast in $T$, since $\prod_t Z_t \le \exp(-2\gamma^2 T)$ (a numeric illustration follows)
- In fact, AdaBoost is a procedure for finding a linear combination $f$ of base classifiers which attempts to minimize $\sum_i \exp(-y_i f(x_i))$
- AdaBoost is doing a kind of steepest descent search to minimize the above quantity, where the search is constrained at each step to follow coordinate directions
- Mason et al. 1999. Boosting Algorithms as Gradient Descent. In Advances in Neural Information Processing Systems 12, 2000
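A short numeric illustration of this bound; the values of gamma and T are arbitrary.

```python
# If every base classifier has error eps_t <= 1/2 - gamma, the training error
# of the ensemble is at most prod_t 2*sqrt(eps_t*(1-eps_t)) <= exp(-2*gamma**2*T).
import math

def training_error_bound(eps_list):
    bound = 1.0
    for eps in eps_list:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))   # Z_t at the optimal alpha_t
    return bound

gamma, T = 0.1, 50
print(training_error_bound([0.5 - gamma] * T))        # ~ 0.36
print(math.exp(-2 * gamma**2 * T))                    # ~ 0.37 (the looser bound)
```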
56. Detailed derivation