1
Boosting HMM acoustic models in large vocabulary
speech recognition
  • Carsten Meyer, Hauke Schramm
  • Philips Research Laboratories, Germany
  • Speech Communication, 2006

2
AdaBoost introduction
  • The AdaBoost algorithm was presented for
    transforming a weak learning rule into a
    strong one
  • The basic idea is to train a series of
    classifiers based on the classification
    performance of the previous classifier on the
    training data
  • In multi-class classification, a popular variant
    is the AdaBoost.M2 algorithm
  • AdaBoost.M2 is applicable when a confidence mapping $h_t : X \times Y \rightarrow [0, 1]$ can be defined for each classifier, which is related to the classification criterion

3
AdaBoost.M2 (Freund and Schapire, 1997)
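(The algorithm itself appeared on this slide only as a figure. A compact restatement of AdaBoost.M2 from Freund and Schapire (1997), for $N$ training patterns and $k$ classes:)

Initialize the mislabel weights $w^1_{i,y} = \frac{1}{N(k-1)}$ for every pattern $i$ and every incorrect label $y \neq y_i$. For $t = 1, \dots, T$:
1. Set $W^t_i = \sum_{y \neq y_i} w^t_{i,y}$, $q_t(i,y) = w^t_{i,y} / W^t_i$ and $D_t(i) = W^t_i / \sum_{i'} W^t_{i'}$
2. Train the weak classifier on $D_t$, obtaining $h_t : X \times Y \rightarrow [0,1]$
3. Compute the pseudo-loss $\varepsilon_t = \frac{1}{2} \sum_i D_t(i) \bigl(1 - h_t(x_i, y_i) + \sum_{y \neq y_i} q_t(i,y)\, h_t(x_i, y)\bigr)$ and set $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$
4. Update $w^{t+1}_{i,y} = w^t_{i,y}\, \beta_t^{\frac{1}{2}(1 + h_t(x_i, y_i) - h_t(x_i, y))}$
Output: $h_f(x) = \arg\max_y \sum_{t=1}^T \bigl(\log \frac{1}{\beta_t}\bigr)\, h_t(x, y)$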
4
AdaBoost introduction
  • The update rule is designed to guarantee an upper bound on the training error of the combined classifier which decreases exponentially with the number of individual classifiers
  • In multi-class problems, the mislabel weights $w^t_{i,y}$ are summed over the incorrect labels, $W^t_i = \sum_{y \neq y_i} w^t_{i,y}$, to give a single weight for each training pattern $i$
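For AdaBoost.M2, this bound (Freund and Schapire, 1997) reads: the training error of the combined classifier is at most
$$(k-1)\, 2^T \prod_{t=1}^{T} \sqrt{\varepsilon_t (1 - \varepsilon_t)},$$
which decreases exponentially in $T$ as long as every pseudo-loss satisfies $\varepsilon_t < 1/2$.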

5
Introduction
  • Why are there only a few studies so far applying boosting to acoustic model training?
  • Speech recognition is an extremely complex, large-scale classification problem
  • The main motivation for applying AdaBoost to speech recognition is
  • Its theoretical foundation, providing explicit bounds on the training error and, in terms of margins, on the generalization error

6
Introduction
  • In most previous applications to speech recognition, boosting was applied to classifying each individual feature vector into a phoneme symbol [Dimitrakakis, ICASSP'04]
  • This requires phoneme posterior probabilities for each frame
  • But the problem is:
  • Conventional HMM speech recognizers do not involve an intermediate phoneme classification step for individual feature vectors
  • So the frame-level boosting approach cannot be applied straightforwardly

7
Utterance approach for boosting in ASR
  • An intuitive way of applying boosting to HMM
    speech recognition is at the utterance level
  • Thus, boosting is used to improve upon an initial
    ranking of candidate word sequences
  • The utterance approach has two advantages
  • First, it is directly related to the sentence
    error rate
  • Second, it is computationally much less expensive
    than boosting applied at the level of feature
    vectors

8
Utterance approach for boosting in ASR
  • In the utterance approach, we define the input patterns $x_i$ to be the sequences of feature vectors corresponding to entire utterances
  • $y$ denotes one possible candidate word sequence of the speech recognizer, $y_i$ being the correct word sequence for utterance $i$
  • The a posteriori confidence measure $h_t(x_i, y)$ is calculated on the basis of the N-best list for utterance $i$
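A sketch of such an a posteriori confidence, assuming scaled acoustic and language-model scores normalized over the N-best list (the scaling exponent $\lambda$ and this exact normalization are assumptions, not shown on the slide):
$$h_t(x_i, y) = \frac{\bigl[p(x_i \mid y)\, p(y)\bigr]^{\lambda}}{\sum_{y' \in \mathrm{Nbest}(x_i)} \bigl[p(x_i \mid y')\, p(y')\bigr]^{\lambda}}$$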

9
Utterance approach for boosting in ASR
  • Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight for each training utterance (see the sketch below)
  • Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models
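A minimal sketch of one such iteration in Python, restricting AdaBoost.M2 to the hypotheses in the N-best lists; the data layout and function name are illustrative assumptions, not the authors' implementation:

def boosting_iteration(w, h, ref):
    """One AdaBoost.M2 step at the utterance level.

    w:   w[i][y], mislabel weight of incorrect N-best hypothesis y
         for utterance i (initialized uniformly before iteration 1)
    h:   h[i][y], a posteriori confidence h_t(x_i, y) computed from
         the N-best list (must also contain the correct key ref[i])
    ref: ref[i], key of the correct word sequence y_i
    Returns the per-utterance weights D (used to reweight ML/MMI
    training), the updated mislabel weights and beta_t.
    """
    # per-utterance weight: mislabel weights summed over hypotheses
    W = {i: sum(wy.values()) for i, wy in w.items()}
    total = sum(W.values())
    D = {i: Wi / total for i, Wi in W.items()}

    # pseudo-loss of the current classifier
    eps = 0.5 * sum(
        D[i] * (1.0 - h[i][ref[i]]
                + sum((w[i][y] / W[i]) * h[i][y] for y in w[i]))
        for i in w)
    beta = eps / (1.0 - eps)

    # AdaBoost.M2 mislabel-weight update
    new_w = {i: {y: w[i][y] * beta ** (0.5 * (1.0 + h[i][ref[i]] - h[i][y]))
                 for y in w[i]}
             for i in w}
    return D, new_w, beta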

10
Utterance approach for boosting in ASR
  • Some problems are encountered when applying it to large-scale continuous speech applications
  • N-best lists of reasonable length (e.g. N = 100) generally contain only a tiny fraction of the possible classification results
  • This has two consequences:
  • In training, it may lead to sub-optimal utterance weights
  • In recognition, Eq. (1) cannot be applied appropriately

11
Utterance approach for CSR: Training
  • Training
  • A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in 'chopping' the training data (see the sketch below)
  • For long sentences, this simply means inserting additional sentence-break symbols at silence intervals of a given minimum length
  • This reduces the number of possible
    classifications of each sentence fragment, so
    that the resulting N-best lists should cover a
    sufficiently large fraction of hypotheses
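A minimal sketch of this chopping in Python, assuming word and silence timings come from a forced alignment; the data layout and the 0.3s threshold are assumptions:

def chop_transcript(words, silences, min_sil=0.3):
    """Split a transcript at long silences.

    words:    list of (word, start_time, end_time) from a forced
              alignment of the training utterance
    silences: list of (start_time, end_time) silence intervals
    min_sil:  minimum silence duration (seconds) triggering a break
    Returns a list of word-sequence fragments, each then treated as
    an independent training 'sentence' for N-best generation.
    """
    # break after any word whose end coincides with the start of a
    # sufficiently long silence (real alignments would need a time
    # tolerance instead of exact float equality)
    break_points = {start for start, end in silences
                    if end - start >= min_sil}
    fragments, current = [], []
    for word, _, w_end in words:
        current.append(word)
        if w_end in break_points:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments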

12
Utterance approach for CSR: Decoding
  • Decoding: lexical approach for model combination
  • A single-pass decoding setup, where the combination of the boosted acoustic models is realized at the lexical level
  • The basic idea is to add a new pronunciation model by replicating the set of phoneme symbols in each boosting iteration t, e.g. by appending the suffix _t to each phoneme symbol (see the sketch below)
  • The new phoneme symbols represent the underlying acoustic model of boosting iteration t

  (e.g. au, au_1, au_2, ...)
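A minimal sketch of this lexicon extension in Python, assuming the lexicon maps each word to a list of phoneme-string transcriptions; the names are illustrative:

def extend_lexicon(lexicon, t):
    """Add, for every word, a pronunciation variant written in the
    replicated phoneme set of boosting iteration t.

    lexicon: dict mapping word -> list of transcriptions, each a
             list of phoneme symbols
    """
    suffix = "_%d" % t
    for variants in lexicon.values():
        for transcription in list(variants):  # copy: we append below
            # skip variants that already belong to a replicated set
            if any("_" in ph for ph in transcription):
                continue
            variants.append([ph + suffix for ph in transcription])
    return lexicon

# e.g. extend_lexicon({"house": [["h", "au", "s"]]}, 1)
# adds the variant ["h_1", "au_1", "s_1"]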
13
Utterance approach for CSR: Decoding
  • Decoding: lexical approach for model combination (cont.)
  • Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme set
  • Use the reweighted training data to train the boosted classifier
  • Decoding is then performed using the extended lexicon and the set of acoustic models, weighted by their unigram prior probabilities, which are estimated on the training data

  (slide example: sic a, sic_1 a_1, ...; weighted summation over the models)
14
In more detail
[Flow diagram: in boosting iteration t, the utterance weights are used for ML/MMI training of acoustic model Mt on the phonetically transcribed training corpus; the decoding lexicon is extended with pronunciation variants in the _t phoneme set (e.g. sic a, sic_1 a_1, ...); decoding then combines the models M1, M2, ..., Mt by unweighted or weighted model combination.]
15
In more detail
16
Weighted model combination
  • Word level model combination
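The combination formula was shown on this slide only as an image. A plausible form of the word-level weighted combination, assuming unigram model priors $\pi_t$ estimated on the training data (this exact formula is an assumption based on the surrounding slides):
$$p(x \mid w) = \sum_{t=1}^{T} \pi_t \, p_t(x \mid w_t), \qquad \sum_{t=1}^{T} \pi_t = 1,$$
where $w_t$ is the pronunciation variant of word $w$ in the phoneme set of iteration $t$ and $p_t(x \mid w_t)$ is the acoustic likelihood under model $M_t$.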

17
Experiments
  • Isolated word recognition
  • Telephone-bandwidth large vocabulary isolated
    word recognition
  • SpeechDat(II) German material
  • Continuous speech recognition
  • Professional dictation and Switchboard

18
Isolated word recognition
  • Database
  • Training corpus: 18k utterances (4.3h) of city, company, first and family names
  • Evaluations
  • LILI test corpus: 10k single-word utterances (3.5h); 10k-word lexicon (matched conditions)
  • Names corpus: an in-house collection of 676 utterances (0.5h); two different decoding lexica, 10k and 190k words (acoustic conditions are matched, whereas there is a lexical mismatch)
  • Office corpus: 3.2k utterances (1.5h), recorded over a microphone in clean conditions; 20k lexicon (an acoustic mismatch to the training conditions)

19
Isolated word recognition
  • Boosting ML models

20
Isolated word recognition
  • Combining boosting and discriminative training
  • The experiments in isolated word recognition
    showed that boosting may improve the best test
    error rates

21
Continuous speech recognition
  • Database
  • Professional dictation
  • An in-house data collection of real-life recordings of medical reports
  • The acoustic training corpus consists of about 58h of data
  • Evaluations were carried out on two test corpora:
  • Development corpus: 5.0h of speech
  • Evaluation corpus: 3.3h of speech
  • Switchboard
  • Spontaneous conversations recorded over telephone lines; 57h (73h) of male (female) speech
  • Evaluation corpus
  • Containing about 1h (0.5h) of male (female) speech

22
Continuous speech recognition
  • Professional dictation

23
  • Switchboard

24
Conclusions
  • In this paper, a boosting approach which can be applied to any HMM-based speech recognizer was presented and evaluated
  • The increased recognizer complexity and thus
    decoding effort of the boosted systems is a major
    drawback compared to other training techniques
    like discriminative training

25
References
  • C. Meyer, "Utterance-Level Boosting of HMM Speech Recognizers", ICASSP 2002
  • C. Meyer, "Towards Large Margin Speech Recognizers by Boosting and Discriminative Training", ICML 2002
  • C. Meyer, "Rival Training: Efficient Use of Data in Discriminative Training", ICSLP 2000
  • Schramm and Aubert, "Efficient Integration of Multiple Pronunciations in a Large Vocabulary Decoder", ICASSP 2000