Title: Boosting HMM acoustic models in large vocabulary speech recognition
1. Boosting HMM acoustic models in large vocabulary speech recognition
- Carsten Meyer, Hauke Schramm
- Philips Research Laboratories, Germany
- Speech Communication, 2006
2. AdaBoost introduction
- The AdaBoost algorithm was introduced to transform a weak learning rule into a strong one.
- The basic idea is to train a series of classifiers, each based on the classification performance of the previous classifier on the training data.
- In multi-class classification, a popular variant is the AdaBoost.M2 algorithm.
- AdaBoost.M2 is applicable when a mapping h: X x Y -> [0, 1], related to the classification criterion, can be defined for the classifier.
3. AdaBoost.M2 (Freund and Schapire, 1997)
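The algorithm box shown on this slide is not recoverable, so here is a compact Python sketch of AdaBoost.M2 as defined by Freund and Schapire (1997). The weak_learner callback and the data handling are illustrative assumptions; in the paper, the weak learner is an HMM system and h(x, y) is the N-best confidence measure introduced later.

```python
import math

def adaboost_m2(X, y, labels, weak_learner, T):
    """X: inputs, y: true labels, labels: label set, T: boosting rounds.
    weak_learner(X, y, D, q) must return a hypothesis h(x, label) in [0, 1]."""
    n, k = len(X), len(labels)
    # Mislabel weights w[i][yy] for every incorrect label yy != y[i].
    w = [{yy: 1.0 / (n * (k - 1)) for yy in labels if yy != y[i]}
         for i in range(n)]
    hypotheses, betas = [], []
    for _ in range(T):
        W = [sum(w[i].values()) for i in range(n)]   # per-pattern weights W_i
        total = sum(W)
        D = [W_i / total for W_i in W]               # pattern distribution
        q = [{yy: w[i][yy] / W[i] for yy in w[i]} for i in range(n)]
        h = weak_learner(X, y, D, q)
        # Pseudo-loss of the new hypothesis.
        eps = 0.5 * sum(
            D[i] * (1.0 - h(X[i], y[i])
                    + sum(q[i][yy] * h(X[i], yy) for yy in w[i]))
            for i in range(n))
        beta = max(eps, 1e-12) / (1.0 - eps)
        # Update rule: mislabels that h handles well get exponentially smaller.
        for i in range(n):
            for yy in w[i]:
                w[i][yy] *= beta ** (0.5 * (1.0 + h(X[i], y[i]) - h(X[i], yy)))
        hypotheses.append(h)
        betas.append(beta)

    def combined(x):
        # Final classifier: weighted vote with weights log(1 / beta_t).
        return max(labels, key=lambda lab: sum(
            math.log(1.0 / b) * h(x, lab) for h, b in zip(hypotheses, betas)))
    return combined
```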
4. AdaBoost introduction
- The update rule is designed to guarantee an upper bound on the training error of the combined classifier which decreases exponentially with the number of individual classifiers.
- In multi-class problems, the mislabel weights w_{i,y} are summed over the incorrect labels y to give a single weight W_i for each training pattern i.
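In formulas (the standard statement of these two points from Freund and Schapire, 1997, not recoverable verbatim from the slide): with pseudo-loss \epsilon_t of classifier t and k classes,

\[
\mathrm{err}_{\mathrm{train}} \;\le\; (k-1) \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)},
\qquad
W_i \;=\; \sum_{y \neq y_i} w_{i,y},
\]

so the bound decreases exponentially in T as long as every \epsilon_t stays below 1/2.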
5. Introduction
- Why are there only a few studies so far applying boosting to acoustic model training?
- Speech recognition is an extremely complex, large-scale classification problem.
- The main motivation to apply AdaBoost to speech recognition is its theoretical foundation, which provides explicit bounds on the training error and, in terms of margins, on the generalization error.
6. Introduction
- In most previous applications to speech recognition, boosting was applied to classifying each individual feature vector into a phoneme symbol (Dimitrakakis, ICASSP 2004), which requires phoneme posterior probabilities.
- But the problem is that conventional HMM speech recognizers do not involve an intermediate phoneme classification step for individual feature vectors.
- So the frame-level boosting approach cannot be applied straightforwardly.
7. Utterance approach for boosting in ASR
- An intuitive way of applying boosting to HMM speech recognition is at the utterance level: boosting is used to improve upon an initial ranking of candidate word sequences.
- The utterance approach has two advantages:
  - First, it is directly related to the sentence error rate.
  - Second, it is computationally much less expensive than boosting applied at the level of feature vectors.
8. Utterance approach for boosting in ASR
- In the utterance approach, we define the input pattern x_i to be the sequence of feature vectors corresponding to the entire utterance i.
- y denotes one possible candidate word sequence of the speech recognizer, y_i being the correct word sequence for utterance i.
- The a posteriori confidence measure h(x_i, y) is calculated on the basis of the N-best list for utterance i (a sketch follows).
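A minimal sketch of such an N-best-based confidence: an exponentially scaled posterior over the total log-scores in the list. The scaling factor lam and the score interface are assumptions for illustration, not the paper's exact parametrization.

```python
import math

def nbest_confidence(nbest, lam=0.1):
    """nbest: list of (word_sequence, total_log_score) pairs for one utterance.
    Returns h(x_i, y): an approximate posterior p(y | x_i) for each hypothesis,
    normalized over the N-best list (hypotheses outside the list get no mass)."""
    m = max(score for _, score in nbest)        # shift for numerical stability
    expd = {y: math.exp(lam * (s - m)) for y, s in nbest}
    z = sum(expd.values())
    return {y: e / z for y, e in expd.items()}
```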
9. Utterance approach for boosting in ASR
- Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight W_i for each training utterance.
- Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models, as sketched below.
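To illustrate how the utterance weights enter ML training, here is a toy re-estimation of a single one-dimensional Gaussian with weighted sufficient statistics. The actual system re-estimates full GMM/HMM parameters (and their MMI analogues), so this only demonstrates the weighting mechanism.

```python
def weighted_ml_gaussian(frames_per_utt, utt_weights):
    """frames_per_utt: one list of scalar features per training utterance.
    utt_weights: boosting weights W_i, one per utterance.
    Every frame's contribution to the statistics is scaled by the
    weight of its utterance."""
    sw = swx = swxx = 0.0
    for frames, w in zip(frames_per_utt, utt_weights):
        for x in frames:
            sw += w
            swx += w * x
            swxx += w * x * x
    mean = swx / sw
    var = swxx / sw - mean * mean
    return mean, var
```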
10. Utterance approach for boosting in ASR
- Some problems are encountered when applying the approach to large-scale continuous speech applications.
- N-best lists of reasonable length (e.g. N = 100) generally contain only a tiny fraction of the possible classification results.
- This has two consequences:
  - In training, it may lead to sub-optimal utterance weights.
  - In recognition, Eq. (1) cannot be applied appropriately.
11. Utterance approach for CSR -- Training
- A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in chopping the training data.
- For long sentences, this simply means inserting additional sentence break symbols at silence intervals of a given minimum length, as sketched below.
- This reduces the number of possible classifications of each sentence fragment, so that the resulting N-best lists cover a sufficiently large fraction of the hypotheses.
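A toy sketch of the chopping step, assuming silence intervals have already been detected and are given as (start, end) times in seconds: a sentence break is inserted at the midpoint of every silence longer than min_sil.

```python
def chop_utterance(utt_duration, silences, min_sil=0.3):
    """utt_duration: length of the utterance in seconds.
    silences: list of (start, end) silence intervals in seconds.
    Returns the fragment boundaries [(t0, t1), ...] after chopping."""
    cuts = [0.0]
    for start, end in sorted(silences):
        if end - start >= min_sil:
            cuts.append((start + end) / 2.0)  # break in the middle of silence
    cuts.append(utt_duration)
    return list(zip(cuts[:-1], cuts[1:]))

# Example: chop_utterance(10.0, [(3.0, 3.5), (7.0, 7.1)]) keeps the short
# second pause intact and yields [(0.0, 3.25), (3.25, 10.0)].
```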
12. Utterance approach for CSR -- Decoding
- Decoding: a lexical approach for model combination.
- A single-pass decoding setup, where the combination of the boosted acoustic models is realized at the lexical level.
- The basic idea is to add a new pronunciation model by replicating the set of phoneme symbols in each boosting iteration (e.g. by appending the suffix _t to each phoneme symbol).
- The new phoneme symbols represent the underlying acoustic model of boosting iteration t (e.g. au, au_1, au_2, ...).
13. Utterance approach for CSR -- Decoding
- Lexical approach for model combination (cont.)
- Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme set (see the sketch after this list).
- Use the reweighted training data to train the boosted classifier.
- Decoding is then performed using the extended lexicon and the set of acoustic models, weighted by their unigram prior probabilities, which are estimated on the training data.
- [Figure: example lexicon entry with the original transcription and its replicated variant (phonemes suffixed with _1); the variants' acoustic scores enter by weighted summation.]
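A sketch of the lexicon extension, assuming the lexicon maps each word to a list of (phoneme sequence, prior) variants. The suffixing follows the slide's convention (au -> au_t); the illustrative 50/50 prior split stands in for the unigram priors that are actually estimated on the training data.

```python
def extend_lexicon(lexicon, t, boosted_prior=0.5):
    """lexicon: {word: [(phonemes_tuple, prior), ...]}.
    For boosting iteration t, every transcription gets a copy whose phoneme
    symbols carry the suffix _t, selecting the acoustic model M_t."""
    extended = {}
    for word, variants in lexicon.items():
        new_variants = []
        for phones, prior in variants:
            boosted = tuple(f"{ph}_{t}" for ph in phones)
            new_variants.append((phones, prior * (1.0 - boosted_prior)))
            new_variants.append((boosted, prior * boosted_prior))
        extended[word] = new_variants
    return extended

# Example: {"auto": [(("au", "t", "o"), 1.0)]} gains the variant
# (("au_1", "t_1", "o_1"), 0.5) in iteration t = 1.
```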
14. In more detail
[Flow diagram: in each boosting iteration t, the reweighted, phonetically transcribed training corpus is used for ML/MMI training of acoustic model M_t; the decoding lexicon is extended with pronunciation variants whose phonemes carry the suffix _t; decoding then combines the models M_1, M_2, ..., M_t by unweighted or weighted model combination.]
15. In more detail
[Figure]
16. Weighted model combination
- Word-level model combination
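The formula on this slide is not recoverable; a plausible reconstruction, following the multiple-pronunciation framework of Schramm and Aubert (ICASSP 2000) listed in the references: the decoder scores word w by a weighted sum over the pronunciation variants v_t contributed by the boosted models,

\[
p(x \mid w) \;\approx\; \sum_{t} p(v_t \mid w)\, p_t(x \mid v_t),
\]

where p_t is the acoustic model of boosting iteration t and the priors p(v_t | w) are the unigram probabilities estimated on the training data.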
17. Experiments
- Isolated word recognition
  - Telephone-bandwidth large-vocabulary isolated word recognition
  - SpeechDat(II) German material
- Continuous speech recognition
  - Professional dictation and Switchboard
18. Isolated word recognition
- Database
  - The training corpus consists of 18k utterances (4.3 h) of city, company, first and family names.
- Evaluations
  - LILI test corpus: 10k single-word utterances (3.5 h); 10k-word lexicon (matched conditions).
  - Names corpus: an in-house collection of 676 utterances (0.5 h); two different decoding lexica, 10k and 190k (acoustic conditions are matched, whereas there is a lexical mismatch).
  - Office corpus: 3.2k utterances (1.5 h), recorded over a microphone in clean conditions; 20k lexicon (an acoustic mismatch to the training conditions).
19. Isolated word recognition
[Results figure]
20. Isolated word recognition
- Combining boosting and discriminative training
- The experiments in isolated word recognition showed that boosting can improve upon the best test error rates.
21. Continuous speech recognition
- Database
  - Professional dictation
    - An in-house data collection of real-life recordings of medical reports.
    - The acoustic training corpus consists of about 58 h of data.
    - Evaluations were carried out on two test corpora: a development corpus of 5.0 h of speech and an evaluation corpus of 3.3 h of speech.
  - Switchboard
    - Spontaneous conversations recorded over telephone lines; 57 h (73 h) of male (female) training data.
    - The evaluation corpus contains about 1 h (0.5 h) of male (female) speech.
22. Continuous speech recognition
[Results figures, slides 22-23]
24. Conclusions
- In this paper, a boosting approach which can be applied to any HMM-based speech recognizer was presented and evaluated.
- The increased recognizer complexity, and thus decoding effort, of the boosted systems is a major drawback compared to other training techniques such as discriminative training.
25. References
- C. Meyer, "Utterance-Level Boosting of HMM Speech Recognizers," ICASSP 2002.
- C. Meyer, "Towards Large Margin Speech Recognizers by Boosting and Discriminative Training," ICML 2002.
- C. Meyer, "Rival Training: Efficient Use of Data in Discriminative Training," ICSLP 2000.
- H. Schramm and X. Aubert, "Efficient Integration of Multiple Pronunciations in a Large Vocabulary Decoder," ICASSP 2000.