Title: Boosting HMM acoustic models in large vocabulary speech recognition
1. Boosting HMM acoustic models in large vocabulary speech recognition
- Carsten Meyer, Hauke Schramm
- Philips Research Laboratories, Germany
- Speech Communication, 2006
2. AdaBoost introduction
- The AdaBoost algorithm was introduced to transform a weak learning rule into a strong one.
- The basic idea is to train a series of classifiers, each based on the classification performance of the previous classifier on the training data.
- In multi-class classification, a popular variant is the AdaBoost.M2 algorithm.
- AdaBoost.M2 is applicable when a mapping h: X x Y -> [0, 1], related to the classification criterion, can be defined for the classifier.
3. AdaBoost.M2 (Freund and Schapire, 1997)
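The algorithm box shown on this slide is not recoverable, so here is a compact Python sketch of AdaBoost.M2 as defined by Freund and Schapire (1997). The weak_learner callback and the data handling are illustrative assumptions; in the paper, the weak learner is an HMM system and h(x, y) is the N-best confidence measure introduced later.

```python
import math

def adaboost_m2(X, y, labels, weak_learner, T):
    """X: inputs, y: true labels, labels: label set, T: boosting rounds.
    weak_learner(X, y, D, q) must return a hypothesis h(x, label) in [0, 1]."""
    n, k = len(X), len(labels)
    # Mislabel weights w[i][yy] for every incorrect label yy != y[i].
    w = [{yy: 1.0 / (n * (k - 1)) for yy in labels if yy != y[i]}
         for i in range(n)]
    hypotheses, betas = [], []
    for _ in range(T):
        W = [sum(w[i].values()) for i in range(n)]   # per-pattern weights W_i
        total = sum(W)
        D = [W_i / total for W_i in W]               # pattern distribution
        q = [{yy: w[i][yy] / W[i] for yy in w[i]} for i in range(n)]
        h = weak_learner(X, y, D, q)
        # Pseudo-loss of the new hypothesis.
        eps = 0.5 * sum(
            D[i] * (1.0 - h(X[i], y[i])
                    + sum(q[i][yy] * h(X[i], yy) for yy in w[i]))
            for i in range(n))
        beta = max(eps, 1e-12) / (1.0 - eps)
        # Update rule: mislabels that h handles well get exponentially smaller.
        for i in range(n):
            for yy in w[i]:
                w[i][yy] *= beta ** (0.5 * (1.0 + h(X[i], y[i]) - h(X[i], yy)))
        hypotheses.append(h)
        betas.append(beta)

    def combined(x):
        # Final classifier: weighted vote with weights log(1 / beta_t).
        return max(labels, key=lambda lab: sum(
            math.log(1.0 / b) * h(x, lab) for h, b in zip(hypotheses, betas)))
    return combined
```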
4. AdaBoost introduction
- The update rule is designed to guarantee an upper bound on the training error of the combined classifier which decreases exponentially with the number of individual classifiers.
- In multi-class problems, the mislabel weights w_{i,y} are summed over the incorrect labels y to give a single weight W_i for each training pattern i.
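In formulas (the standard statement of these two points from Freund and Schapire, 1997, not recoverable verbatim from the slide): with pseudo-loss \epsilon_t of classifier t and k classes,

\[
\mathrm{err}_{\mathrm{train}} \;\le\; (k-1) \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)},
\qquad
W_i \;=\; \sum_{y \neq y_i} w_{i,y},
\]

so the bound decreases exponentially in T as long as every \epsilon_t stays below 1/2.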
5. Introduction
- Why are there only a few studies so far applying boosting to acoustic model training?
- Speech recognition is an extremely complex, large-scale classification problem.
- The main motivation to apply AdaBoost to speech recognition is its theoretical foundation, which provides explicit bounds on the training error and, in terms of margins, on the generalization error.
6. Introduction
- In most previous applications to speech recognition, boosting was applied to classifying each individual feature vector into a phoneme symbol (Dimitrakakis, ICASSP 2004), which requires phoneme posterior probabilities.
- But the problem is that conventional HMM speech recognizers do not involve an intermediate phoneme classification step for individual feature vectors.
- So the frame-level boosting approach cannot be applied straightforwardly.
7. Utterance approach for boosting in ASR
- An intuitive way of applying boosting to HMM speech recognition is at the utterance level: boosting is used to improve upon an initial ranking of candidate word sequences.
- The utterance approach has two advantages:
  - First, it is directly related to the sentence error rate.
  - Second, it is computationally much less expensive than boosting applied at the level of feature vectors.
8. Utterance approach for boosting in ASR
- In the utterance approach, we define the input pattern x_i to be the sequence of feature vectors corresponding to the entire utterance i.
- y denotes one possible candidate word sequence of the speech recognizer, y_i being the correct word sequence for utterance i.
- The a posteriori confidence measure h(x_i, y) is calculated on the basis of the N-best list for utterance i (a sketch follows).
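A minimal sketch of such an N-best-based confidence: an exponentially scaled posterior over the total log-scores in the list. The scaling factor lam and the score interface are assumptions for illustration, not the paper's exact parametrization.

```python
import math

def nbest_confidence(nbest, lam=0.1):
    """nbest: list of (word_sequence, total_log_score) pairs for one utterance.
    Returns h(x_i, y): an approximate posterior p(y | x_i) for each hypothesis,
    normalized over the N-best list (hypotheses outside the list get no mass)."""
    m = max(score for _, score in nbest)        # shift for numerical stability
    expd = {y: math.exp(lam * (s - m)) for y, s in nbest}
    z = sum(expd.values())
    return {y: e / z for y, e in expd.items()}
```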
9. Utterance approach for boosting in ASR
- Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight W_i for each training utterance.
- Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models, as sketched below.
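To illustrate how the utterance weights enter ML training, here is a toy re-estimation of a single one-dimensional Gaussian with weighted sufficient statistics. The actual system re-estimates full GMM/HMM parameters (and their MMI analogues), so this only demonstrates the weighting mechanism.

```python
def weighted_ml_gaussian(frames_per_utt, utt_weights):
    """frames_per_utt: one list of scalar features per training utterance.
    utt_weights: boosting weights W_i, one per utterance.
    Every frame's contribution to the statistics is scaled by the
    weight of its utterance."""
    sw = swx = swxx = 0.0
    for frames, w in zip(frames_per_utt, utt_weights):
        for x in frames:
            sw += w
            swx += w * x
            swxx += w * x * x
    mean = swx / sw
    var = swxx / sw - mean * mean
    return mean, var
```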
10. Utterance approach for boosting in ASR
- Some problems are encountered when applying the approach to large-scale continuous speech applications.
- N-best lists of reasonable length (e.g. N = 100) generally contain only a tiny fraction of the possible classification results.
- This has two consequences:
  - In training, it may lead to sub-optimal utterance weights.
  - In recognition, Eq. (1) cannot be applied appropriately.
11. Utterance approach for CSR -- Training
- A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in chopping the training data.
- For long sentences, this simply means inserting additional sentence break symbols at silence intervals of a given minimum length, as sketched below.
- This reduces the number of possible classifications of each sentence fragment, so that the resulting N-best lists cover a sufficiently large fraction of the hypotheses.
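A toy sketch of the chopping step, assuming silence intervals have already been detected and are given as (start, end) times in seconds: a sentence break is inserted at the midpoint of every silence longer than min_sil.

```python
def chop_utterance(utt_duration, silences, min_sil=0.3):
    """utt_duration: length of the utterance in seconds.
    silences: list of (start, end) silence intervals in seconds.
    Returns the fragment boundaries [(t0, t1), ...] after chopping."""
    cuts = [0.0]
    for start, end in sorted(silences):
        if end - start >= min_sil:
            cuts.append((start + end) / 2.0)  # break in the middle of silence
    cuts.append(utt_duration)
    return list(zip(cuts[:-1], cuts[1:]))

# Example: chop_utterance(10.0, [(3.0, 3.5), (7.0, 7.1)]) keeps the short
# second pause intact and yields [(0.0, 3.25), (3.25, 10.0)].
```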
12. Utterance approach for CSR -- Decoding
- Decoding: a lexical approach for model combination.
- A single-pass decoding setup, where the combination of the boosted acoustic models is realized at the lexical level.
- The basic idea is to add a new pronunciation model by replicating the set of phoneme symbols in each boosting iteration (e.g. by appending the suffix _t to each phoneme symbol).
- The new phoneme symbols represent the underlying acoustic model of boosting iteration t (e.g. au, au_1, au_2, ...).
13. Utterance approach for CSR -- Decoding
- Lexical approach for model combination (cont.)
- Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme set (see the sketch after this list).
- Use the reweighted training data to train the boosted classifier.
- Decoding is then performed using the extended lexicon and the set of acoustic models, weighted by their unigram prior probabilities, which are estimated on the training data.
- [Figure: example lexicon entry with the original transcription and its replicated variant (phonemes suffixed with _1); the variants' acoustic scores enter by weighted summation.]
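A sketch of the lexicon extension, assuming the lexicon maps each word to a list of (phoneme sequence, prior) variants. The suffixing follows the slide's convention (au -> au_t); the illustrative 50/50 prior split stands in for the unigram priors that are actually estimated on the training data.

```python
def extend_lexicon(lexicon, t, boosted_prior=0.5):
    """lexicon: {word: [(phonemes_tuple, prior), ...]}.
    For boosting iteration t, every transcription gets a copy whose phoneme
    symbols carry the suffix _t, selecting the acoustic model M_t."""
    extended = {}
    for word, variants in lexicon.items():
        new_variants = []
        for phones, prior in variants:
            boosted = tuple(f"{ph}_{t}" for ph in phones)
            new_variants.append((phones, prior * (1.0 - boosted_prior)))
            new_variants.append((boosted, prior * boosted_prior))
        extended[word] = new_variants
    return extended

# Example: {"auto": [(("au", "t", "o"), 1.0)]} gains the variant
# (("au_1", "t_1", "o_1"), 0.5) in iteration t = 1.
```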
14. In more detail
[Flow diagram: in each boosting iteration t, the reweighted, phonetically transcribed training corpus is used for ML/MMI training of acoustic model M_t; the decoding lexicon is extended with pronunciation variants whose phonemes carry the suffix _t; decoding then combines the models M_1, M_2, ..., M_t by unweighted or weighted model combination.]
15. In more detail
[Figure]
16. Weighted model combination
- Word-level model combination
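The formula on this slide is not recoverable; a plausible reconstruction, following the multiple-pronunciation framework of Schramm and Aubert (ICASSP 2000) listed in the references: the decoder scores word w by a weighted sum over the pronunciation variants v_t contributed by the boosted models,

\[
p(x \mid w) \;\approx\; \sum_{t} p(v_t \mid w)\, p_t(x \mid v_t),
\]

where p_t is the acoustic model of boosting iteration t and the priors p(v_t | w) are the unigram probabilities estimated on the training data.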
17. Experiments
- Isolated word recognition
  - Telephone-bandwidth large-vocabulary isolated word recognition
  - SpeechDat(II) German material
- Continuous speech recognition
  - Professional dictation and Switchboard
18. Isolated word recognition
- Database
  - The training corpus consists of 18k utterances (4.3 h) of city, company, first and family names.
- Evaluations
  - LILI test corpus: 10k single-word utterances (3.5 h); 10k-word lexicon (matched conditions).
  - Names corpus: an in-house collection of 676 utterances (0.5 h); two different decoding lexica, 10k and 190k (acoustic conditions are matched, whereas there is a lexical mismatch).
  - Office corpus: 3.2k utterances (1.5 h), recorded over a microphone in clean conditions; 20k lexicon (an acoustic mismatch to the training conditions).
19. Isolated word recognition
[Results figure]
20. Isolated word recognition
- Combining boosting and discriminative training
- The experiments in isolated word recognition showed that boosting can improve upon the best test error rates.
21. Continuous speech recognition
- Database
  - Professional dictation
    - An in-house data collection of real-life recordings of medical reports.
    - The acoustic training corpus consists of about 58 h of data.
    - Evaluations were carried out on two test corpora: a development corpus of 5.0 h of speech and an evaluation corpus of 3.3 h of speech.
  - Switchboard
    - Spontaneous conversations recorded over telephone lines; 57 h (73 h) of male (female) training data.
    - The evaluation corpus contains about 1 h (0.5 h) of male (female) speech.
22. Continuous speech recognition
[Results figures, slides 22-23]
24. Conclusions
- In this paper, a boosting approach which can be applied to any HMM-based speech recognizer was presented and evaluated.
- The increased recognizer complexity, and thus decoding effort, of the boosted systems is a major drawback compared to other training techniques such as discriminative training.
25. References
- C. Meyer, "Utterance-Level Boosting of HMM Speech Recognizers," ICASSP 2002.
- C. Meyer, "Towards Large Margin Speech Recognizers by Boosting and Discriminative Training," ICML 2002.
- C. Meyer, "Rival Training: Efficient Use of Data in Discriminative Training," ICSLP 2000.
- H. Schramm and X. Aubert, "Efficient Integration of Multiple Pronunciations in a Large Vocabulary Decoder," ICASSP 2000.