1
Network Training for Continuous Speech Recognition
  • Author
  • Issac John Alphonso
  • Inst. for Signal and Info. Processing
  • Dept. Electrical and Computer Eng.
  • Mississippi State University
  • Contact Information
  • Box 0452
  • Mississippi State University
  • Mississippi State, Mississippi 39762
  • Tel 662-325-8335
  • Fax 662-325-2298

  • Email: alphonso_at_isip.msstate.edu
  • URL: isip.msstate.edu/publications/books/msstate_theses/2003/network_training/

2
INTRODUCTION
ORGANIZATION
  • Motivation: Why do we need a new training
    paradigm?
  • Theory: Review the EM-based supervised
    training framework.
  • Network Training: The differences between
    network training and traditional training.
  • Experiments: Verification of the approach on
    industry-standard databases (TIDigits, OGI
    Alphadigits, and Resource Management).

Motivation
Network Training
Experiments
Conclusions
3
INTRODUCTION
MOTIVATION
  • A traditional trainer uses an EM-based framework
    to estimate the parameters of a speech
    recognition system.
  • EM-based parameter estimation is performed in
    several complicated stages which are prone to
    human error.
  • A network trainer reduces the complexity of the
    training process by employing a soft decision
    criterion.
  • A network trainer achieves comparable performance
    and retains the robustness of the EM-based
    framework.

4
NETWORK TRAINER
TRAINING RECIPE
[Flowchart: Flat Start → CI (Context-Independent) Training → State Tying → CD (Context-Dependent) Training]
  • The flat start stage segments the acoustic signal
    and seeds the speech and non-speech models.
  • The context-independent stage inserts an
    optional silence model between words.
  • The state-tying stage clusters the model
    parameters via linguistic rules to compensate for
    sparse training data.
  • The context-dependent stage is similar to the
    context-independent stage (words are modeled
    using context).
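The four-stage recipe above can be sketched as a simple pipeline. This is an illustrative outline only (the stage functions are placeholders, not the authors' trainer); each stage's comment restates what the corresponding slide bullet says it does.

```python
# Sketch of the four-stage training recipe:
# flat start -> CI training -> state tying -> CD training.

def flat_start(state):
    # Uniformly segment the signal and seed speech/non-speech models.
    state["stages"].append("flat_start")
    return state

def ci_training(state):
    # Reestimate context-independent models; an optional silence
    # model is inserted between words.
    state["stages"].append("ci_training")
    return state

def state_tying(state):
    # Cluster model parameters via linguistic rules to compensate
    # for sparse training data.
    state["stages"].append("state_tying")
    return state

def cd_training(state):
    # Reestimate context-dependent (tied) models; like CI training,
    # but words are modeled in context.
    state["stages"].append("cd_training")
    return state

def run_recipe():
    state = {"stages": []}
    for stage in (flat_start, ci_training, state_tying, cd_training):
        state = stage(state)
    return state["stages"]
```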

5
NETWORK TRAINER
FLEXIBLE TRANSCRIPTIONS
[Diagram: network trainer word-level transcription "SILENCE HAVE SILENCE"]
  • The network trainer uses word-level
    transcriptions, which do not impose restrictions
    on word pronunciation.
  • The traditional trainer uses phone-level
    transcriptions, which fix the canonical
    pronunciation of each word.
  • Using orthographic transcriptions removes the
    need for directly dealing with phonetic contexts
    during training.
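One way to picture this: a word-level transcription is expanded through a pronunciation lexicon, so every allowed pronunciation of each word remains available to the trainer. The toy lexicon and phone symbols below are illustrative assumptions, not the system's actual dictionary.

```python
# Toy pronunciation lexicon: each word maps to a list of allowed
# phone sequences (alternate pronunciations included).
LEXICON = {
    "SILENCE": [["sil"]],
    "HAVE": [["hh", "ae", "v"], ["hh", "ax", "v"]],
}

def expand(words):
    """Expand a word-level transcription into, per word, the list of
    allowed phone sequences. Because every pronunciation survives,
    no single canonical pronunciation is imposed."""
    return [LEXICON[w] for w in words]
```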

6
NETWORK TRAINER
FLEXIBLE TRANSCRIPTIONS
[Diagram: network trainer transcription with silence word vs. traditional trainer transcription with explicit silence phones]
  • The network trainer uses a silence word which
    precludes the need for inserting it into the
    phonetic pronunciation.
  • The traditional trainer deals with silence
    between words by explicitly specifying it in the
    phonetic pronunciation.

7
NETWORK TRAINER
DUAL SILENCE MODELLING
[Diagram: multi-path vs. single-path silence model topologies]
  • The multi-path silence model is used between
    words.
  • The single-path silence model is used at
    utterance ends.
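The distinction between the two topologies can be sketched with toy transition sets (assumed state counts and arcs, purely illustrative): the single-path model forces traversal of every state, while the multi-path model adds a skip arc that makes silence optional.

```python
# Toy silence-model topologies as (from_state, to_state) arcs.
# State 0 is the entry state, state 3 the exit state.

SINGLE_PATH = [(0, 1), (1, 2), (2, 3)]           # every state must be visited
MULTI_PATH = [(0, 1), (1, 2), (2, 3), (0, 3)]    # skip arc: silence is optional

def allows_skip(arcs, entry, exit_state):
    # True if the model can be traversed directly from entry to exit,
    # i.e. without emitting any silence at all.
    return (entry, exit_state) in arcs
```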

8
NETWORK TRAINER
DUAL SILENCE MODELLING
  • The network trainer uses a fixed silence at
    utterance bounds and an optional silence between
    words.
  • We use a fixed silence at utterance bounds to
    avoid an underestimated silence model.

9
NETWORK TRAINER
DUAL SILENCE MODELLING
  • Network training uses a single-path silence at
    utterance bounds and a multi-path silence between
    words.
  • We use a single-path silence at utterance bounds
    to avoid uncertainty in modeling silence.

10
EXPERIMENTS
TIDIGITS WER COMPARISON
  • The network trainer achieves comparable
    performance to the traditional trainer.
  • The network trainer converges in word error rate
    to the traditional trainer.
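The comparisons in these experiments use the standard word error rate. As a reference, here is a minimal WER computation by Levenshtein alignment (the textbook definition, not the authors' scoring tool).

```python
# Word error rate: edit distance between reference and hypothesis
# word sequences, as a percentage of reference length.

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return 100.0 * d[len(r)][len(h)] / len(r)
```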

11
EXPERIMENTS
AD WER COMPARISON
  • The network trainer achieves comparable
    performance to the traditional trainer.
  • The network trainer converges in word error rate
    to the traditional trainer.

12
EXPERIMENTS
RM WER COMPARISON
  • The network trainer achieves comparable
    performance to the traditional trainer.
  • It is important to note that the 1.8% degradation
    in WER is not statistically significant (MAPSSWE
    test).

13
CONCLUSIONS
SUMMARY
  • Explored the effectiveness of a novel training
    recipe in the reestimation process for speech
    recognition.
  • Analyzed performance on three databases.
  • For TIDigits, at 7.6% WER, the network trainer
    performed better by about 0.1%.
  • For OGI Alphadigits, at 35.3% WER, the network
    trainer performed better by about 2.7%.
  • For Resource Management, at 27.5% WER,
    performance degraded by about 1.8% (not
    statistically significant).