1
Network Training for Continuous Speech Recognition
  • Author
  • Issac John Alphonso
  • Inst. for Signal and Info. Processing
  • Dept. Electrical and Computer Eng.
  • Mississippi State University
  • Contact Information
  • Box 0452
  • Mississippi State University
  • Mississippi State, Mississippi 39762
  • Tel 662-325-8335
  • Fax 662-325-2298

  • Email: alphonso_at_isip.msstate.edu
  • URL: isip.msstate.edu/publications/books/msstate_theses/2003/network_training/

2
INTRODUCTION
ORGANIZATION
  • Motivation: Why do we need a new training
    paradigm?
  • Theory: Review the EM-based supervised
    training framework.
  • Network Training: The differences between
    network training and traditional training.
  • Experiments: Verification of the approach on
    industry-standard databases (TIDigits, OGI
    Alphadigits, and Resource Management).

Motivation
Network Training
Experiments
Conclusions
3
INTRODUCTION
MOTIVATION
  • A traditional trainer uses an EM-based framework
    to estimate the parameters of a speech
    recognition system.
  • EM-based parameter estimation is performed in
    several complicated stages which are prone to
    human error.
  • A network trainer reduces the complexity of the
    training process by employing a soft decision
    criterion.
  • A network trainer achieves comparable performance
    and retains the robustness of the EM-based
    framework.

4
NETWORK TRAINER
TRAINING RECIPE
[Flowchart: Flat Start → CI (Context-Independent) Training → State Tying → CD (Context-Dependent) Training]
  • The flat start stage segments the acoustic signal
    and seeds the speech and non-speech models.
  • The context-independent stage inserts an
    optional silence model between words.
  • The state-tying stage clusters the model
    parameters via linguistic rules to compensate for
    sparse training data.
  • The context-dependent stage is similar to the
    context-independent stage (words are modeled
    using context).
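The four-stage recipe above can be sketched as a simple pipeline. This is an illustrative outline only (the stage functions are placeholders, not the authors' trainer); each stage's comment restates what the corresponding slide bullet says it does.

```python
# Sketch of the four-stage training recipe:
# flat start -> CI training -> state tying -> CD training.

def flat_start(state):
    # Uniformly segment the signal and seed speech/non-speech models.
    state["stages"].append("flat_start")
    return state

def ci_training(state):
    # Reestimate context-independent models; an optional silence
    # model is inserted between words.
    state["stages"].append("ci_training")
    return state

def state_tying(state):
    # Cluster model parameters via linguistic rules to compensate
    # for sparse training data.
    state["stages"].append("state_tying")
    return state

def cd_training(state):
    # Reestimate context-dependent (tied) models; like CI training,
    # but words are modeled in context.
    state["stages"].append("cd_training")
    return state

def run_recipe():
    state = {"stages": []}
    for stage in (flat_start, ci_training, state_tying, cd_training):
        state = stage(state)
    return state["stages"]
```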

5
NETWORK TRAINER
FLEXIBLE TRANSCRIPTIONS
[Diagram: network trainer word-level transcription "SILENCE HAVE SILENCE"]
  • The network trainer uses word-level
    transcriptions, which do not impose restrictions
    on word pronunciation.
  • The traditional trainer uses phone-level
    transcriptions, which fix the canonical
    pronunciation of each word.
  • Using orthographic transcriptions removes the
    need for directly dealing with phonetic contexts
    during training.
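One way to picture this: a word-level transcription is expanded through a pronunciation lexicon, so every allowed pronunciation of each word remains available to the trainer. The toy lexicon and phone symbols below are illustrative assumptions, not the system's actual dictionary.

```python
# Toy pronunciation lexicon: each word maps to a list of allowed
# phone sequences (alternate pronunciations included).
LEXICON = {
    "SILENCE": [["sil"]],
    "HAVE": [["hh", "ae", "v"], ["hh", "ax", "v"]],
}

def expand(words):
    """Expand a word-level transcription into, per word, the list of
    allowed phone sequences. Because every pronunciation survives,
    no single canonical pronunciation is imposed."""
    return [LEXICON[w] for w in words]
```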

6
NETWORK TRAINER
FLEXIBLE TRANSCRIPTIONS
[Diagram: network trainer transcription with silence word vs. traditional trainer transcription with explicit silence phones]
  • The network trainer uses a silence word which
    precludes the need for inserting it into the
    phonetic pronunciation.
  • The traditional trainer deals with silence
    between words by explicitly specifying it in the
    phonetic pronunciation.

7
NETWORK TRAINER
DUAL SILENCE MODELLING
[Diagram: multi-path vs. single-path silence model topologies]
  • The multi-path silence model is used between
    words.
  • The single-path silence model is used at
    utterance ends.
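The distinction between the two topologies can be sketched with toy transition sets (assumed state counts and arcs, purely illustrative): the single-path model forces traversal of every state, while the multi-path model adds a skip arc that makes silence optional.

```python
# Toy silence-model topologies as (from_state, to_state) arcs.
# State 0 is the entry state, state 3 the exit state.

SINGLE_PATH = [(0, 1), (1, 2), (2, 3)]           # every state must be visited
MULTI_PATH = [(0, 1), (1, 2), (2, 3), (0, 3)]    # skip arc: silence is optional

def allows_skip(arcs, entry, exit_state):
    # True if the model can be traversed directly from entry to exit,
    # i.e. without emitting any silence at all.
    return (entry, exit_state) in arcs
```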

8
NETWORK TRAINER
DUAL SILENCE MODELLING
  • The network trainer uses a fixed silence at
    utterance bounds and an optional silence between
    words.
  • We use a fixed silence at utterance bounds to
    avoid an underestimated silence model.

9
NETWORK TRAINER
DUAL SILENCE MODELLING
  • Network training uses a single-path silence at
    utterance bounds and a multi-path silence between
    words.
  • We use a single-path silence at utterance bounds
    to avoid uncertainty in modeling silence.

10
EXPERIMENTS
TIDIGITS WER COMPARISON
  • The network trainer achieves comparable
    performance to the traditional trainer.
  • The network trainer converges in word error rate
    to the traditional trainer.
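The comparisons in these experiments use the standard word error rate. As a reference, here is a minimal WER computation by Levenshtein alignment (the textbook definition, not the authors' scoring tool).

```python
# Word error rate: edit distance between reference and hypothesis
# word sequences, as a percentage of reference length.

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return 100.0 * d[len(r)][len(h)] / len(r)
```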

11
EXPERIMENTS
AD WER COMPARISON
  • The network trainer achieves comparable
    performance to the traditional trainer.
  • The network trainer converges in word error rate
    to the traditional trainer.

12
EXPERIMENTS
RM WER COMPARISON
  • The network trainer achieves comparable
    performance to the traditional trainer.
  • It is important to note that the 1.8% degradation
    in WER is not statistically significant (MAPSSWE
    test).

13
CONCLUSIONS
SUMMARY
  • Explored the effectiveness of a novel training
    recipe in the reestimation process for speech
    recognition.
  • Analyzed performance on three databases.
  • For TIDigits, at 7.6% WER, the network trainer
    performed better by about 0.1%.
  • For OGI Alphadigits, at 35.3% WER, the network
    trainer performed better by about 2.7%.
  • For Resource Management, at 27.5% WER,
    performance degraded by about 1.8% (not
    statistically significant).