Discriminative Feature Optimization for Speech Recognition presentation

1
Discriminative Feature Optimization for Speech Recognition

Bing Zhang
College of Computer and Information Science
Northeastern University

2
Outline

Introduction
Problem to attack
Methodology
Region-dependent feature transform
Discriminative optimization of the feature transform
Implementation
System description and results
Conclusions

3
Introduction

Speech recognition
Goal: transcribe speech into text
Performance measurement: word error rate (WER)
Typical approach
Training: statistically model the acoustic and linguistic knowledge
Recognition: search for the most probable word sequence using the models
Speech feature extraction
Reason: raw signals cannot be robustly modeled because of their high dimensionality, so compact features have to be extracted
Two stages of feature extraction
Speech analysis → cepstral coefficients
Speech feature transformation
In this thesis: a better feature transformation approach is developed to reduce the WER of the speech recognition system

4
Introduction (cont.)
A typical speech recognition system
[Figure: block diagram with the labels Features, Acoustic Model, Language Model, and Word Sequence]
5
Language Model

N-grams
Models the conditional probability of any word given the N-1 preceding words
The product of N-gram probabilities can be used to approximate the probability of a sequence of words:
$$P(w_1, w_2, \dots, w_k) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, \dots, w_{N-1}) \cdots P(w_{k-1} \mid w_{k-N}, \dots, w_{k-2})\, P(w_k \mid w_{k-(N-1)}, \dots, w_{k-1})$$
Special cases
Unigram: $P(w_i)$
Bigram: $P(w_i \mid w_{i-1})$
Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
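
The bigram special case can be sketched in a few lines. This is a minimal illustration, assuming unsmoothed relative-frequency estimates and a hypothetical three-sentence corpus; the function names and the `<s>` start token are choices made for this example, not part of the presentation.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences, with a <s> start token."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts

def sequence_probability(words, history_counts, bigram_counts):
    """Approximate P(w1..wk) as the product of P(wi | wi-1), without smoothing."""
    prob = 1.0
    padded = ["<s>"] + words
    for prev, cur in zip(padded, padded[1:]):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: the unsmoothed estimate is undefined, treat as 0
        prob *= bigram_counts[(prev, cur)] / history_counts[prev]
    return prob

# Hypothetical toy corpus.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
hist, bi = train_bigram(corpus)
p_seen = sequence_probability(["the", "cat", "sat"], hist, bi)
p_unseen = sequence_probability(["dog", "ran"], hist, bi)
```

A real language model would add smoothing or back-off so that unseen N-grams do not force the whole sequence probability to zero.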

6
HMM-based Acoustic Model

Repository of unit HMMs (hidden Markov models)
Each HMM is a probabilistic finite state machine with outputs at each hidden state
Transition probabilities
Observation probabilities (modeled by a mixture of Gaussians for each state)
Each HMM represents a basic unit of speech, e.g., phoneme, crossword/non-crossword multiphones
HMM state-clusters specify which HMM states can share which parameters
Pronunciation dictionary: phonetic spelling of the words

7
Example of an HMM
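[Figure: a four-state HMM with states 1 to 4 between Start and End. Self-loop transitions a11, a22, a33, a44; forward transitions a12, a23, a34; skip transitions a13, a24. The states emit the observations.]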
8
Example of an HMM
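[Figure: two example state paths through the HMM for the observations o1 to o6.
Path 1 visits states 1, 1, 2, 3, 3, 4 (transitions a11, a12, a23, a33, a34) with observation probabilities b1(o1), b1(o2), b2(o3), b3(o4), b3(o5), b4(o6).
Path 2 visits states 1, 2, 2, 2, 4, 4 (transitions a12, a22, a24, a44) with observation probabilities b1(o1), b2(o2), b2(o3), b2(o4), b4(o5), b4(o6).]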
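
The probability of one state path is the product of the transition probabilities along the path and the observation probabilities at each visited state. The sketch below assumes discrete observation symbols and made-up probability values purely for illustration (the acoustic model in this presentation uses Gaussian mixtures), and it ignores the Start and End transitions.

```python
def path_probability(states, observations, trans, emit):
    """Joint probability of one state path and its observations.

    trans[(i, j)] plays the role of a_ij and emit[j][o] the role of b_j(o).
    """
    prob = emit[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        prob *= trans[(prev, cur)] * emit[cur][obs]
    return prob

# Hypothetical parameters for a four-state left-to-right HMM.
trans = {
    (1, 1): 0.5, (1, 2): 0.3, (1, 3): 0.2,
    (2, 2): 0.4, (2, 3): 0.4, (2, 4): 0.2,
    (3, 3): 0.6, (3, 4): 0.4,
    (4, 4): 1.0,
}
emit = {
    1: {"x": 0.8, "y": 0.2},
    2: {"x": 0.5, "y": 0.5},
    3: {"x": 0.3, "y": 0.7},
    4: {"x": 0.1, "y": 0.9},
}
observations = ["x", "x", "x", "y", "y", "y"]

# The two paths described on this slide.
p_path1 = path_probability([1, 1, 2, 3, 3, 4], observations, trans, emit)
p_path2 = path_probability([1, 2, 2, 2, 4, 4], observations, trans, emit)
```

Summing such products over all admissible paths gives the likelihood of the observations under the HMM; in practice this sum is computed with the forward algorithm rather than by enumeration.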
10
Acoustic Training

Maximum likelihood (ML) training
Objective: maximize the conditional likelihood of the observed features given the model
Algorithm: expectation-maximization (EM)
Discriminative training
Objective: train the model to distinguish the correct word sequence from other hypotheses
Criterion: minimum phoneme error (MPE)
Representation of hypotheses: lattices
Algorithm: extended EM

11
Feature Extraction

Speech analysis
Deals with the problem of extracting distinguishing characteristics (e.g., formant locations) of speech from digital signals
Examples: MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction)
Resulting features: cepstral coefficients
Speech feature transformation
Applied on top of the cepstral coefficients
Transforms the cepstral features to better fit the model
Helps the HMM to model the trajectory of the cepstral features
Fits the diagonal covariance assumption of the Gaussian components

12
Commonly Used Feature Transforms

LDA (linear discriminant analysis)
Transforms the features to maximize the distance between different classes while keeping each class as compact as possible
Assumes that all classes have equal covariance
HLDA (heteroscedastic linear discriminant analysis)
Removes the equal-covariance assumption of LDA
Finds the feature transform that maximizes the likelihood of the data with respect to the acoustic model in the transformed space
Others
HDA (heteroscedastic discriminant analysis)
MLLT (maximum likelihood linear transform)

13
Drawbacks of Traditional Feature Transforms

Inaccurate assumptions about the acoustic model
LDA assumes equal class covariances
HDA and LDA ignore the diagonal covariance assumption
Linear transform
A linear transform has limited power for feature extraction
Using more powerful transforms can be risky when the criterion does not correlate with the WER
The criteria do not correlate with the WER
Performance degrades on high-dimensional input features
Experimental results in the thesis
Performance degrades on highly correlated input features
Example on the next slide

14
Example
The data has a linear dependency between two dimensions such that Z = 2X
[Figure: plots of the data with axes X, Y, and Z]

If projected to 1-D
HLDA will map all samples to one single point
LDA will fail to find an answer at all because the covariance matrix of each class is singular
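
The singularity can be checked directly. The sketch below builds a hypothetical class of four samples in which the third dimension is exactly twice the first, and uses exact rational arithmetic so that the determinant comes out as exactly zero rather than as a small floating-point number. The sample values and helper names are invented for this illustration.

```python
from fractions import Fraction

def covariance(samples):
    """Sample covariance matrix (divide by n) as nested lists of Fractions."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
             for j in range(dim)]
            for i in range(dim)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical (X, Y) values for one class; Z is forced to equal 2X.
xy = [(1, 2), (2, 1), (3, 5), (4, 3)]
samples = [(Fraction(x), Fraction(y), Fraction(2 * x)) for x, y in xy]

cov = covariance(samples)
is_singular = det3(cov) == 0
```

Because the Z row of the covariance matrix is twice the X row, the matrix has no inverse, which is what blocks the within-class normalization step in LDA.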

15
A Better Approach

Region-dependent transform
Nonlinear
Computationally inexpensive to train
Discriminative training of the feature transform
Criterion correlates well with the WER
Detailed acoustic model in feature training

16
Region Dependent Transform (RDT)

RDT
Divides the acoustic space into multiple regions
e.g., r1, r2, ..., rN
Applies a different transform based on which region the input feature vector belongs to
e.g., f1, f2, ..., fN

To avoid making hard decisions when choosing which transform to apply, the posterior probabilities of the regions are used to interpolate the transformed results
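
The soft interpolation can be written as F(o) = sum over i of P(r_i | o) f_i(o). The sketch below is one way to read that description, assuming unit-variance spherical Gaussians for the regions and linear region transforms with offsets; the function names, the two-region setup, and all numbers are hypothetical.

```python
import math

def region_posteriors(o, means, weights):
    """P(r_i | o) under a mixture of unit-variance spherical Gaussians (an assumption)."""
    logs = [math.log(w) - 0.5 * sum((x - m) ** 2 for x, m in zip(o, mu))
            for w, mu in zip(weights, means)]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def apply_linear(matrix, offset, o):
    """One region transform f_i(o) = A_i o + b_i."""
    return [sum(a * x for a, x in zip(row, o)) + b for row, b in zip(matrix, offset)]

def rdt(o, means, weights, transforms):
    """Interpolate the region-specific outputs with the region posteriors."""
    post = region_posteriors(o, means, weights)
    outputs = [apply_linear(A, b, o) for A, b in transforms]
    dim = len(outputs[0])
    return [sum(p * out[d] for p, out in zip(post, outputs)) for d in range(dim)]

# Two hypothetical regions in 2-D, each projecting to 1-D.
means = [(-2.0, 0.0), (2.0, 0.0)]
weights = [0.5, 0.5]
transforms = [([[1.0, 0.0]], [0.0]), ([[0.0, 1.0]], [1.0])]

# This point is equally far from both region centers, so each posterior is 0.5.
y = rdt((0.0, 3.0), means, weights, transforms)
```

A point deep inside one region gets a posterior close to 1 for that region, so the output approaches that region's own transform, while points near a boundary receive a blend.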
17
More Details of RDT

Input features: long-span features
A long-span feature vector is formed by concatenating the cepstral features from consecutive frames, centered at the current frame
Advantage: contains information about the acoustic context of the current frame
Division of the regions: global Gaussian mixture model (GMM)
Trained via unsupervised clustering
Each Gaussian component in the GMM corresponds to a region
Region-specific transforms
In general, they can be any projections of long-span feature vectors
In this thesis, linear projections are studied
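
Forming long-span features amounts to stacking neighboring frames around each frame. The sketch below assumes that the utterance edges are padded by repeating the first or last frame; the presentation does not say how edges are handled, so that choice and the function name are assumptions.

```python
def long_span_features(frames, context):
    """Concatenate each frame with `context` frames on either side.

    Frames beyond the utterance boundary are replaced by the nearest
    boundary frame (an assumed padding policy).
    """
    last = len(frames) - 1
    spans = []
    for t in range(len(frames)):
        span = []
        for k in range(t - context, t + context + 1):
            span.extend(frames[min(max(k, 0), last)])
        spans.append(span)
    return spans

# Four hypothetical 2-dimensional frames, stacked with one frame of context per side.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
spans = long_span_features(frames, context=1)
```

With `context=7` the same function produces 15-frame spans centered at the current frame.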

18
Special Cases of RDT
[Figure: hierarchy of transforms. RDT (generic projection) includes RDLT (linear projection), which in turn includes SPLICE, fMPE (*), MPE-HLDA, and mean-offset fMPE. The diagram labels the special cases with: only one region; only offset; rotation matrix plus offset; P is not region-dependent.]
Note (*): fMPE also includes a context-expansion layer, which does not fit this categorization (see thesis for details).
19
Projections vs. Offsets in RDT
The projection and the offset in RDT
Different regions can share the same projections and/or offsets, so the number of unique projections/offsets can be smaller than the number of regions.
[Equation not recovered from the slide; its two terms are labeled "Projection" and "Offset"]
20
Optimization Criterion of RDT

Minimum Phoneme Error (MPE) criterion
Gives significant gains when used to train the
HMM
Correlates well with WER
Can be rewritten as a function of the feature
transform

[Figure: plot labeled WER and MPE Score]
Notation
O, O_r: original feature vectors
λ: the HMM
F_RDT: the feature transform
a(W_rk): the accuracy score of the hypothesized word sequence W_rk
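
In its commonly used form, the MPE objective is the expected accuracy over competing hypotheses: for each utterance r, sum over hypotheses k of P(W_rk | O_r) times a(W_rk), then sum over utterances. The sketch below assumes N-best lists in place of lattices, and a single scale factor applied to the joint log score; both are simplifications made for this example, and the scores are invented.

```python
import math

def mpe_score(utterances, scale=1.0):
    """Expected accuracy summed over utterances.

    Each utterance is a list of (log_joint_score, accuracy) pairs, one per
    hypothesized word sequence. Posteriors are obtained by normalizing the
    scaled scores within the utterance.
    """
    total = 0.0
    for hyps in utterances:
        top = max(score for score, _ in hyps)
        weights = [math.exp(scale * (score - top)) for score, _ in hyps]
        norm = sum(weights)
        total += sum(w * acc for w, (_, acc) in zip(weights, hyps)) / norm
    return total

# One hypothetical utterance with three hypotheses whose posteriors are
# 0.6, 0.3, and 0.1, and whose phoneme accuracies are 3, 2, and 0.
utterance = [(math.log(0.6), 3.0), (math.log(0.3), 2.0), (math.log(0.1), 0.0)]
score = mpe_score([utterance])
```

Moving probability mass toward the more accurate hypotheses raises the score, which is why the criterion tracks the error rate more closely than likelihood does.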
21
HMM Updating Methods

In MPE, the HMM depends on the transformed features, so we must define how it is updated
When choosing the HMM updating method, the goal is to make the trained transform generic, i.e., reusable across different training setups, including
both ML and MPE training
different types of HMMs
This goal can be achieved if the feature transform focuses on separating the data
To ensure that, the HMM should describe the data rather than anything else

22
HMM Updating Methods (cont.)

If the HMM is updated discriminatively, e.g.,
under MPE
Some Gaussians in the HMM will model decision
boundaries, being away from the mass of the data
The feature transform will be misled from
separating the real data
The resulting transform is less generic
This method is OK if there is only one HMM to
train
If the HMM is updated under ML
The Gaussians will stay on the data
The feature transform will also focus on the data
The resulting transform is more generic
This method is preferred if there are different
HMMs to train
We assume ML updating of the HMM in this thesis

23
Example
[Figure: a discriminative model and an ML model, each shown before and after the transform. Caption for the discriminative model: "Since the model is already discriminative, nothing needs to be done here."]
24
Training the Feature Transform

The transform is trained using a numerical optimization algorithm
Derivative of MPE with respect to the transform
Two terms in the derivative
MPE depends on the transformed features directly → direct derivative
MPE depends on the transform through the HMM, which in turn depends on the transformed features → indirect derivative
Two passes of data processing
The first pass computes the direct derivative using lattices
The second pass computes the indirect derivative using reference transcripts
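
The split into a direct and an indirect term is the chain rule applied to an objective whose model parameters are themselves re-estimated from the transformed features. The toy below is not MPE: it assumes a one-parameter transform y = theta * x, a model that is just the mean of the transformed features (standing in for the ML update), and an arbitrary weighted quadratic objective. It only shows that the two terms together match a finite-difference derivative.

```python
def objective(theta, xs, cs):
    """Toy objective of the transformed features and an ML-re-estimated mean."""
    ys = [theta * x for x in xs]
    mu = sum(ys) / len(ys)  # the "model" is re-estimated from transformed data
    return sum(c * (y - mu) ** 2 for c, y in zip(cs, ys))

def derivative_terms(theta, xs, cs):
    """Return (direct, indirect) derivatives of the objective w.r.t. theta."""
    ys = [theta * x for x in xs]
    mu = sum(ys) / len(ys)
    # Direct term: through the transformed features, holding the model fixed.
    direct = sum(2 * c * (y - mu) * x for c, y, x in zip(cs, ys, xs))
    # Indirect term: through the model, which depends on the transformed features.
    d_obj_d_mu = sum(-2 * c * (y - mu) for c, y in zip(cs, ys))
    d_mu_d_theta = sum(xs) / len(xs)
    indirect = d_obj_d_mu * d_mu_d_theta
    return direct, indirect

# Hypothetical data and weights.
xs = [1.0, 2.0, 4.0]
cs = [1.0, -0.5, 2.0]
theta = 0.7

direct, indirect = derivative_terms(theta, xs, cs)
eps = 1e-5
numeric = (objective(theta + eps, xs, cs) - objective(theta - eps, xs, cs)) / (2 * eps)
```

Dropping the indirect term would give a biased gradient here, since it is far from zero for this data.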

25
Training Procedure
Iterative update of RDT using numerical
optimization
26
Implementation

Feature transform network
A directed acyclic network of primitive components
Design goals
Reuse primitive components (e.g., linear projection, frame concatenation)
Reuse the algorithm that applies the transform or computes the derivative
Easy to extend to other transforms
Efficient usage of CPU time and memory
Impact
Enables numerical optimization of any differentiable components, including but not limited to RDT
Simplifies the BBN system by providing a unified representation of various transforms
Adds flexibility to the front-end processing in the BBN system

[Figure: feature transform network with the components Cepstra, Concatenation, Projection, Gauss. Mixture, and RDT]
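
A directed acyclic network of reusable components can be sketched as nodes that name their upstream inputs, plus one generic evaluation routine. Everything below is hypothetical: the `Node` class, the wiring, and the toy component functions are stand-ins, not the BBN implementation, and only the forward pass is shown (no derivative computation).

```python
class Node:
    """One primitive component; `inputs` are the upstream nodes it reads from."""
    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = list(inputs)

def evaluate(node, source, cache=None):
    """Evaluate `node`, computing each upstream component at most once."""
    if cache is None:
        cache = {}
    if node.name not in cache:
        if node.inputs:
            args = [evaluate(parent, source, cache) for parent in node.inputs]
            cache[node.name] = node.func(*args)
        else:
            cache[node.name] = node.func(source)
    return cache[node.name]

calls = []  # records how often the shared component actually runs

def concatenate(frames):
    calls.append("concatenation")
    return [x for frame in frames for x in frame]

def hard_posteriors(vector):
    # Stand-in for Gaussian mixture posteriors: a hard two-region decision.
    return [1.0, 0.0] if sum(vector) < 0 else [0.0, 1.0]

def toy_rdt(vector, posteriors):
    # Two toy region "transforms" (sum and max), interpolated by the posteriors.
    return [posteriors[0] * sum(vector) + posteriors[1] * max(vector)]

cepstra = Node("cepstra", lambda frames: frames)
concat = Node("concatenation", concatenate, [cepstra])
mixture = Node("gauss_mixture", hard_posteriors, [concat])
rdt = Node("rdt", toy_rdt, [concat, mixture])

output = evaluate(rdt, [[1.0, 2.0], [3.0, -1.0]])
```

The cache is what makes sharing cheap: the concatenation feeds two downstream components but runs only once per evaluation.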
27
RDT and the State-of-the-art System

The state-of-the-art system at BBN
Two sub-systems
Speaker-independent (SI) system
Speaker-adaptive (SA) system
Two phases of training
ML (initializes MPE training)
MPE
Three-pass decoding
Three tied-mixture acoustic models
How RDT interacts with the system
Trained once, used in three types of acoustic
models
Integrated with speaker adaptation

28
RDT in Speaker-independent (SI) Training
[Figure: flowchart comparing the SI training baseline with SI training with RDT. Labels, in order: Bootstrapping; LDA+MLLT; Initial Transform; ML Training; ML-SI HMM; Lattice Generation; Lattices; MPE Training; MPE-SI HMM.]
29
Experimental Setup

Data
Training: English Conversational Telephone Speech (CTS), 2300 hours, SWB+Fisher
Testing: Eval03+Dev04; 3 hours SWB-II, 6 hours Fisher
Analysis
14 Perceptual Linear Prediction (PLP) cepstral coefficients and normalized energy
Vocal Tract Length Normalization (VTLN)
RDT
15-frame long-span features projected to 60 dimensions
Initialized from LDA+MLLT
1000 regions, one linear projection per region
Crossword state-cluster tied model (SCTM), 7K clusters
The number of Gaussians per state-cluster in the HMM varies across experiments
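
A rough size estimate for the transform follows from these numbers. The arithmetic below assumes that each frame contributes 15 values (14 cepstral coefficients plus energy, with no derivatives), that every region has its own unshared projection matrix, and that offsets are not counted; none of these assumptions is confirmed by the slide.

```python
frames_per_span = 15
base_dim = 14 + 1   # assumed: 14 PLP cepstra plus normalized energy per frame
output_dim = 60
num_regions = 1000

input_dim = frames_per_span * base_dim           # long-span vector length
params_per_projection = output_dim * input_dim   # one 60 x input_dim matrix
total_params = num_regions * params_per_projection
```

Under these assumptions the long-span vector has 225 dimensions and the transform has 13.5 million free parameters, which is one reason sharing projections across regions can be attractive.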

30
SI Results (ML)

Description
Two RDTs were trained using HMMs with 12 Gaussians per state-cluster (GPS) and 44 GPS, respectively
For decoding, several ML crossword SCTM models of different sizes were trained using either LDA+MLLT or RDT
For simplicity, only the lattice-rescoring pass was run in decoding
(*) After the other two models (STM, SCTM-NX) were retrained, the WER was further reduced to 20.4%, i.e., a 9.3% relative improvement over the LDA+MLLT result
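
Relative WER reduction is the absolute reduction divided by the baseline WER. The helper names below are made up for this example, and the implied baseline is only back-computed from the two rounded figures quoted above (read as a WER of 20.4% and a 9.3% relative improvement), so it is approximate.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

def implied_baseline(new_wer, relative_percent):
    """Baseline WER implied by a new WER and a relative reduction in percent."""
    return new_wer / (1.0 - relative_percent / 100.0)

baseline = implied_baseline(20.4, 9.3)          # roughly 22.5
check = relative_reduction(baseline, 20.4)      # recovers 9.3
```

The same two functions apply to the other relative reductions quoted in the results.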

31
SI Results (MPE)

Description
Same as the ML experiments, except that the final models were trained under MPE
(*) After the other two models (STM, SCTM-NX) were trained, the WER was further reduced to 19.2%, i.e., a 5.8% relative improvement over the LDA+MLLT result

32
Speaker Adaptation

Speaker adaptation (figure)
Assumption: the speaker-dependent models are linearly transformed from an SI model
Variations
MLLR: assumes that only the Gaussian means are transformed
CMLLR: both means and covariances are transformed → equivalent to applying the inverse transform to the features while keeping the model fixed
Speaker-Adaptive Training (SAT)
The SI model is not optimal for adaptation
SAT tries to estimate a better model that, when transformed, gives the best likelihood of the data
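
The CMLLR equivalence can be verified numerically in one dimension. If the model mean becomes a*mean + b and the variance becomes a^2 * var, then evaluating the adapted Gaussian at x gives the same value as evaluating the original Gaussian at the inverse-transformed feature (x - b) / a, scaled by the Jacobian factor 1/|a|. The numbers below are arbitrary.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

a, b = 1.5, -0.4        # hypothetical speaker transform
mean, var = 0.3, 2.0    # hypothetical SI Gaussian
x = 1.1                 # observed feature

# Adapt the model: transform both the mean and the variance.
model_side = gaussian(x, a * mean + b, a * a * var)

# Adapt the feature instead: apply the inverse transform, keep the model fixed.
feature_side = gaussian((x - b) / a, mean, var) / abs(a)
```

Working on the feature side is convenient because a single per-speaker transform is applied to the features while the shared model stays untouched.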

33
RDT in Speaker-adaptive Training (SAT)
Straightforward approach
Use SI-RDT transparently
Simple
But RDT is not optimized for SAT
[Figure: flowchart with the steps Train SI RDT; SI RDT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; MPE Training; MPE-SAT HMM]
34
RDT in Speaker-adaptive Training (SAT)
Iterative approach (SA-RDT)
Alternately update RDT and the speaker-dependent (SD) transforms
Back-propagation is used to compute the derivative, since the SD transforms are applied on top of RDT
RDT is optimized for SAT
[Figure: flowchart with the steps Train SI RDT; SI RDT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; Update RDT; SA RDT HMM; MPE Training; MPE-SAT HMM]
35
Adapted Results

Description
Same training and testing data, state-clusters, and LM as the unadapted experiments
10.9% relative WER reduction for the ML system
7.0% relative WER reduction for the MPE system

36
Alternative Procedure for SA-RDT
Simplified SA-RDT
Similar to the original SA-RDT
But the speaker-dependent transforms are estimated using the baseline model and features
[Figure: flowchart with the steps SI LDA+MLLT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; Update RDT; SA RDT HMM; MPE Training; MPE-SAT HMM]
37
Adapted Results

Description
500 hours of training data
Another set of SD transforms was used before LDA/RDT
SA-RDT1 used the simplified procedure
SA-RDT2 used the original procedure
The simplified procedure gave 2/3 of the gain while training the RDT only once

38
Conclusions

Original work
Region-dependent transform
Improved discriminative feature training that leads to a more generic feature transform
Improved SAT procedure using RDT
Impact
RDT encompasses several other feature transforms, including MPE-HLDA, SPLICE, and the core of fMPE and mean-offset fMPE
The method gives a significant WER reduction: 7% relative for the SAT-MPE English CTS system
The method is potentially helpful for exploring novel acoustic features
We do not have to worry about negative effects when adding new features to the input of the feature transform, because the training decides whether and how to use the new features based on a criterion that is correlated with WER

39
Publications

B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz. Long span features and minimum phoneme error heteroscedastic linear discriminant analysis. In Proceedings of EARS RT-04 Workshop, 2004.
B. Zhang and S. Matsoukas. Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition. In Proceedings of ICASSP, 2005.
B. Zhang, S. Matsoukas, and R. Schwartz. Discriminatively trained region-dependent transform for speech recognition. In Proceedings of ICASSP, 2006.
Nominated for the Student Paper Award
Awarded the Spoken Language Processing Grant by the IEEE Signal Processing Society
B. Zhang, S. Matsoukas, and R. Schwartz. Recent progress on the discriminative region-dependent transform for speech feature extraction. In Proceedings of ICSLP, 2006.
