Title: Discriminative Feature Optimization for Speech Recognition
1 Discriminative Feature Optimization for Speech Recognition
- Bing Zhang
- College of Computer and Information Science, Northeastern University
2 Outline
- Introduction
  - Problem to attack
- Methodology
  - Region-dependent feature transform
  - Discriminative optimization of the feature transform
- Implementation
- System description and results
- Conclusions
3 Introduction
- Speech recognition
  - Goal: transcribe speech into text
  - Performance measurement: word error rate (WER)
  - Typical approach
    - Training: statistically model the acoustic and linguistic knowledge
    - Recognition: search for the most probable word sequence using the models
- Speech feature extraction
  - Reason: raw signals cannot be robustly modeled because of their high dimensionality, so compact features have to be extracted
  - Two stages of feature extraction
    - speech analysis → cepstral coefficients
    - speech feature transformation
- In this thesis: a better feature transformation approach is developed to reduce the WER of the speech recognition system
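As a reminder of how the performance measure is usually computed, here is a minimal sketch of WER as word-level edit distance divided by the number of reference words. The function name and the example sentences are illustrative assumptions, not part of the thesis, and scoring tools used in practice also apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one row behind) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(wer)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.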
4 Introduction (cont.)
[Figure: a typical speech recognition system. Block labels: Features, Acoustic Model, Language Model, Word Sequence.]
5 Language Model
- N-grams
  - Model the conditional probability of any word given the N-1 words in its history
  - The product of N-gram probabilities can be used to approximate the probability of a sequence of words:
    $P(w_1, w_2, \ldots, w_k) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, \ldots, w_{N-1}) \cdots P(w_{k-1} \mid w_{k-N}, \ldots, w_{k-2})\, P(w_k \mid w_{k-(N-1)}, \ldots, w_{k-1})$
- Special cases
  - Unigram: $P(w_i)$
  - Bigram: $P(w_i \mid w_{i-1})$
  - Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
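The product of N-gram probabilities can be illustrated with a bigram model. The sketch below is only an illustration: the probability table, the `<s>` start symbol, and the floor value for unseen bigrams are all assumptions, and real language models use proper smoothing or back-off instead of a fixed floor.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks the sentence start.
BIGRAM = {
    ("<s>", "recognize"): 0.2,
    ("recognize", "speech"): 0.5,
    ("speech", "now"): 0.1,
}


def sentence_log_prob(words, bigram, floor=1e-6):
    """Sum log P(w_i | w_{i-1}) over the sentence; unseen bigrams get a small floor."""
    history = "<s>"
    total = 0.0
    for word in words:
        total += math.log(bigram.get((history, word), floor))
        history = word
    return total


log_p = sentence_log_prob(["recognize", "speech", "now"], BIGRAM)
print(math.exp(log_p))  # product of the three bigram probabilities
```

Working in the log domain avoids numerical underflow when the word sequence is long.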
6 HMM-based Acoustic Model
- Repository of unit HMMs (hidden Markov models)
  - Each HMM is a probabilistic finite state machine with outputs at each hidden state
    - Transition probabilities
    - Observation probabilities (modeled by a mixture of Gaussians for each state)
  - Each HMM represents a basic unit of speech, e.g., phoneme, crossword/non-crossword multiphones
- HMM state-clusters: specify which HMM states can share which parameters
- Pronunciation dictionary: phonetic spelling of the words
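7 Example of an HMM
[Figure: an HMM with states 1 to 4 between Start and End, self-loops a11, a22, a33, a44, forward transitions a12, a23, a34, skip transitions a13, a24, and the observations emitted by the states.]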
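8 Example of an HMM
[Figure: two alignments of the observations o1 to o6 to the states of the HMM.
First alignment: state sequence 1, 1, 2, 3, 3, 4 using transitions a11, a12, a23, a33, a34, with output probabilities b1(o1), b1(o2), b2(o3), b3(o4), b3(o5), b4(o6).
Second alignment: state sequence 1, 2, 2, 2, 4, 4 using transitions a12, a22, a24, a44, with output probabilities b1(o1), b2(o2), b2(o3), b2(o4), b4(o5), b4(o6).]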
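The two alignments on the "Example of an HMM" slides can be scored by multiplying the transition probabilities a_ij and the output probabilities b_j(o_t) along each state sequence. The following is a minimal sketch under stated assumptions: every number is hypothetical, each state has a single 1-D Gaussian (the slides use Gaussian mixtures), and Start is assumed to enter state 1 with probability 1.

```python
import math

# Hypothetical transition probabilities for the four-state topology on the slides
# (self-loops, forward steps, skips); each row sums to 1 and state 4 exits to End.
A = {
    1: {1: 0.5, 2: 0.4, 3: 0.1},
    2: {2: 0.5, 3: 0.4, 4: 0.1},
    3: {3: 0.6, 4: 0.4},
    4: {4: 0.7, "End": 0.3},
}
# Hypothetical (mean, variance) of a single Gaussian per state.
B = {1: (0.0, 1.0), 2: (1.0, 1.0), 3: (2.0, 1.0), 4: (3.0, 1.0)}


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def path_log_prob(path, obs):
    """log P(obs, path): each frame's output probability plus the transition that follows it."""
    if len(path) != len(obs):
        raise ValueError("path and observations must have the same length")
    total = 0.0
    for t, (state, o) in enumerate(zip(path, obs)):
        total += log_gauss(o, *B[state])
        nxt = path[t + 1] if t + 1 < len(path) else "End"
        total += math.log(A[state][nxt])
    return total


obs = [0.1, 0.2, 1.1, 1.9, 2.2, 3.0]   # hypothetical observations o1..o6
first = path_log_prob([1, 1, 2, 3, 3, 4], obs)
second = path_log_prob([1, 2, 2, 2, 4, 4], obs)
print(first, second)
```

With these toy numbers the first alignment scores higher, because its states track the observations more closely. Recognition searches over all such alignments rather than scoring two by hand.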
10 Acoustic Training
- Maximum likelihood (ML) training
  - Objective: maximize the conditional likelihood of the observed features given the model
  - Algorithm: expectation-maximization (EM)
- Discriminative training
  - Objective: train the model to distinguish the correct word sequence from other hypotheses
  - Criterion: minimum phoneme error (MPE)
  - Representation of hypotheses: lattices
  - Algorithm: extended EM
11 Feature Extraction
- Speech analysis
  - Deals with the problem of extracting distinguishing characteristics (e.g., formant locations) of speech from digital signals
  - Examples: MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction)
  - Resulting features: cepstral coefficients
- Speech feature transformation
  - Applied on top of the cepstral coefficients
  - Transforms the cepstral features to better fit the model
    - helps the HMM to model the trajectory of the cepstral features
    - fits the diagonal covariance assumption of the Gaussian components
12 Commonly Used Feature Transforms
- LDA (linear discriminant analysis)
  - Transforms the features to maximize the distance between different classes while keeping each class as compact as possible
  - Assumes that all classes have equal covariance
- HLDA (heteroscedastic linear discriminant analysis)
  - Removes the equal covariance assumption of LDA
  - Finds the feature transform that maximizes the likelihood of the data with respect to the acoustic model in the transformed space
- Others
  - HDA (heteroscedastic discriminant analysis)
  - MLLT (maximum likelihood linear transform)
13 Drawbacks of Traditional Feature Transforms
- Inaccurate assumptions about the acoustic model
  - LDA assumes equal class covariance
  - HDA and LDA ignore the diagonal covariance assumption
- Linear transform
  - A linear transform has limited power for feature extraction
  - Using more powerful transforms can be risky when the criterion does not correlate with the WER
- The criteria do not correlate with the WER
  - Performance degrades on high-dimensional input features
    - Experimental results in the thesis
  - Performance degrades on highly correlated input features
    - Example on the next slide
14 Example
The data has a linear dependency between two dimensions such that Z = 2X.
[Figure: plots of the data, with axes labeled X, Y, and Z.]
- If projected to 1-D
  - HLDA will map all samples to one single point
  - LDA will fail to find an answer at all because the covariance matrix of each class is singular
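The singularity claim is easy to check numerically: when Z = 2X holds exactly, the covariance matrix of (X, Z) has zero determinant, so it cannot be inverted as LDA requires. The sketch below uses hypothetical sample values and only the two dependent dimensions.

```python
def covariance_2d(points):
    """Sample covariance matrix [[sxx, sxz], [sxz, szz]] of (x, z) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    mz = sum(z for _, z in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    szz = sum((z - mz) ** 2 for _, z in points) / (n - 1)
    sxz = sum((x - mx) * (z - mz) for x, z in points) / (n - 1)
    return [[sxx, sxz], [sxz, szz]]


def determinant(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


xs = [0.5, 1.0, 2.0, 3.5]                        # hypothetical samples of X for one class
dependent = [(x, 2.0 * x) for x in xs]           # Z = 2X exactly
unrelated = [(0.5, 1.2), (1.0, 0.3), (2.0, 2.5), (3.5, 1.0)]  # no linear dependency

det_dependent = determinant(covariance_2d(dependent))
det_unrelated = determinant(covariance_2d(unrelated))
print(det_dependent, det_unrelated)
```

The first determinant is zero up to floating-point round-off, while the second is clearly positive.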
15 A Better Approach
- Region-dependent transform
  - Nonlinear
  - Computationally inexpensive to train
- Discriminative training of the feature transform
  - Criterion correlates well with the WER
  - Detailed acoustic model in feature training
16 Region Dependent Transform (RDT)
- RDT
  - Divides the acoustic space into multiple regions, e.g., r1, r2, ..., rN
  - Applies a different transform based on which region the input feature vector belongs to, e.g., f1, f2, ..., fN
To avoid making hard decisions when choosing which transform to apply, the posterior probabilities of the regions are used to interpolate the transformed results.
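Written out, the interpolation described above takes the following form. The notation is an assumption chosen for illustration: $o_t$ is the input feature vector at frame $t$, $p(r_i \mid o_t)$ is the posterior probability of region $r_i$, and $f_i$ is the transform attached to that region.

\[
F_{\mathrm{RDT}}(o_t) = \sum_{i=1}^{N} p(r_i \mid o_t)\, f_i(o_t)
\]

A hard decision corresponds to the special case where one posterior is 1 and all others are 0.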
17 More Details of RDT
- Input features: long-span features
  - A long-span feature vector is formed by concatenating the cepstral features from consecutive frames, centered at the current frame
  - Advantage: contains information about the acoustic context of the current frame
- Division of the regions: global Gaussian mixture model (GMM)
  - Trained via unsupervised clustering
  - Each Gaussian component in the GMM corresponds to a region
- Region-specific transforms
  - In general, they can be any projections of long-span feature vectors
  - In this thesis, linear projections are studied
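The three ingredients above (long-span input, GMM regions, per-region linear projections) can be combined in a few lines. This is a minimal sketch, not the thesis implementation: the function names are hypothetical, the GMM is assumed to have diagonal covariances, the posteriors are computed on the long-span vector itself, edge frames are repeated at utterance boundaries, and each region's transform is assumed to be a projection matrix plus an offset.

```python
import math


def long_span(frames, t, context):
    """Concatenate frames t-context..t+context, repeating edge frames at the boundaries."""
    last = len(frames) - 1
    out = []
    for k in range(t - context, t + context + 1):
        out.extend(frames[min(max(k, 0), last)])
    return out


def region_posteriors(x, weights, means, variances):
    """Posterior p(r_i | x) under a diagonal-covariance GMM, computed in the log domain."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            ll -= 0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
        logs.append(ll)
    top = max(logs)
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def rdlt(x, posteriors, projections, offsets):
    """Posterior-weighted sum of the per-region linear transforms P_i x + b_i."""
    dim_out = len(offsets[0])
    y = [0.0] * dim_out
    for post, proj, b in zip(posteriors, projections, offsets):
        for row in range(dim_out):
            projected = sum(p * xd for p, xd in zip(proj[row], x)) + b[row]
            y[row] += post * projected
    return y


# Toy setup: 1-D frames, a 3-frame window, two regions, 1-D output (all values hypothetical).
frames = [[0.0], [1.0], [2.0], [3.0]]
x = long_span(frames, 0, 1)
posts = region_posteriors(x, [0.5, 0.5], [[0.0] * 3, [3.0] * 3], [[1.0] * 3, [1.0] * 3])
y = rdlt(x, posts, [[[1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]]], [[0.0], [1.0]])
print(x, posts, y)
```

Because the posteriors vary smoothly with the input, the overall mapping is nonlinear even though every per-region transform is linear.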
18 Special Cases of RDT
[Figure: diagram of the special cases of RDT. RDT uses a generic projection; RDLT uses a linear projection. Special cases shown: SPLICE, fMPE (*), MPE-HLDA, mean-offset fMPE. Condition labels in the diagram: only one region; only offset; rotation matrix plus offset; P is not region-dependent.]
Note (*): fMPE also includes a context-expansion layer, which does not fit this categorization (see the thesis for details).
19 Projections vs. Offsets in RDT
The projection and the offset in RDT
[Figure: the RDT formula with its projection term and its offset term labeled.]
Different regions can share the same projections and/or offsets, so the number of unique projections/offsets can be less than the number of regions.
20 Optimization Criterion of RDT
- Minimum Phoneme Error (MPE) criterion
  - Gives significant gains when used to train the HMM
  - Correlates well with WER
  - Can be rewritten as a function of the feature transform
[Figure: plot relating WER to the MPE score.]
Symbols: $O$, $O_r$: original feature vectors; $\lambda$: the HMM; $F_{\mathrm{RDT}}$: the feature transform; $a(W_{rk})$: the accuracy score of the hypothesized word sequence $W_{rk}$
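The slide's own formula is not preserved in this text. For orientation, the MPE objective is commonly written as an expected accuracy over the hypotheses of each training utterance $r$. The form below follows that common definition using the symbols listed above, with $\lambda$ denoting the HMM; the acoustic scale $\kappa$ and the hypothesis index $j$ are assumptions introduced here.

\[
\mathcal{F}_{\mathrm{MPE}}(F_{\mathrm{RDT}}) = \sum_{r} \sum_{k} P_{\lambda}\big(W_{rk} \mid F_{\mathrm{RDT}}(O_r)\big)\, a(W_{rk}),
\qquad
P_{\lambda}\big(W_{rk} \mid F_{\mathrm{RDT}}(O_r)\big) = \frac{p_{\lambda}\big(F_{\mathrm{RDT}}(O_r) \mid W_{rk}\big)^{\kappa}\, P(W_{rk})}{\sum_{j} p_{\lambda}\big(F_{\mathrm{RDT}}(O_r) \mid W_{rj}\big)^{\kappa}\, P(W_{rj})}
\]

Since the transformed features $F_{\mathrm{RDT}}(O_r)$ appear inside every acoustic likelihood, the objective can be treated as a function of the transform parameters.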
21 HMM Updating Methods
- In MPE, the HMM depends on the transformed features, so we should define how it is updated
- When choosing the HMM updating method, the concern is to make the trained transform more generic, i.e., reusable for different training setups, including
  - both ML and MPE training
  - different types of HMMs
- If we can make the feature transform focus on separating the data, this goal can be achieved
- To ensure that, the HMM should describe the data rather than anything else
22 HMM Updating Methods (cont.)
- If the HMM is updated discriminatively, e.g., under MPE
  - Some Gaussians in the HMM will model decision boundaries, lying away from the mass of the data
  - The feature transform will be misled away from separating the real data
  - The resulting transform is less generic
  - This method is OK if there is only one HMM to train
- If the HMM is updated under ML
  - The Gaussians will stay on the data
  - The feature transform will also focus on the data
  - The resulting transform is more generic
  - This method is preferred if there are different HMMs to train
- We assume ML updating of the HMM in this thesis
23 Example
[Figure: panels comparing a discriminative model and an ML model, before and after the transform. Annotation: "Since the model is already discriminative, nothing needs to be done here."]
24 Training the Feature Transform
- The transform is trained using a numerical optimization algorithm
  - Requires the derivative of MPE with respect to the transform
- Two terms in the derivative
  - MPE depends on the transformed features directly → direct derivative
  - MPE depends on the transform through the HMM, which in turn depends on the transformed features → indirect derivative
- Two passes of data processing
  - The first pass computes the direct derivative using lattices
  - The second pass computes the indirect derivative using reference transcripts
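The split into a direct and an indirect derivative is an application of the chain rule. The toy sketch below illustrates only that mechanism and is not the MPE derivative itself: the "model" is a single mean re-estimated from the features by ML, the criterion is a hypothetical weighted sum of squared deviations, and a finite-difference check confirms that both terms are needed.

```python
def model_mean(ys):
    """ML update of a one-Gaussian 'model': its mean is re-estimated from the features."""
    return sum(ys) / len(ys)


def objective(ys, weights):
    """Toy criterion that depends on the features directly and through the model mean."""
    mu = model_mean(ys)
    return sum(w * (y - mu) ** 2 for w, y in zip(weights, ys))


def direct_gradient(ys, weights):
    """Partial derivative with respect to each feature, holding the model fixed."""
    mu = model_mean(ys)
    return [2.0 * w * (y - mu) for w, y in zip(weights, ys)]


def indirect_gradient(ys, weights):
    """(dJ/dmu) * (dmu/dy_i), where dmu/dy_i = 1/n for every feature."""
    mu = model_mean(ys)
    d_mu = -2.0 * sum(w * (y - mu) for w, y in zip(weights, ys))
    return [d_mu / len(ys)] * len(ys)


def numeric_gradient(ys, weights, eps=1e-6):
    """Central finite differences of the full objective, for checking."""
    grads = []
    for i in range(len(ys)):
        up = ys[:i] + [ys[i] + eps] + ys[i + 1:]
        down = ys[:i] + [ys[i] - eps] + ys[i + 1:]
        grads.append((objective(up, weights) - objective(down, weights)) / (2 * eps))
    return grads


ys = [0.2, 1.5, -0.7]          # hypothetical transformed features
weights = [1.0, -0.5, 2.0]     # hypothetical signed weights
total = [d + i for d, i in zip(direct_gradient(ys, weights), indirect_gradient(ys, weights))]
print(total, numeric_gradient(ys, weights))
```

Dropping the indirect term would give a gradient that disagrees with the finite-difference check, which is why the second data pass is needed when the HMM is re-estimated from the transformed features.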
25 Training Procedure
Iterative update of RDT using numerical optimization
26 Implementation
- Feature transform network
  - A directed acyclic network of primitive components
- Design goals
  - reuse primitive components (e.g., linear projection, frame concatenation)
  - reuse the algorithm that applies the transform or computes the derivative
  - easy to extend to other transforms
  - efficient usage of CPU time and memory
- Impact
  - enables numerical optimization of any differentiable components, including but not limited to RDT
  - simplifies the BBN system by providing a unified representation of various transforms
  - adds flexibility to the front-end processing in the BBN system
[Figure: example transform network with components Cepstra, Concatenation, Projection, Gauss. Mixture, and RDT.]
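The design goal of reusing one algorithm to apply a transform and to compute its derivative can be sketched as components that each expose a forward and a backward pass. The class and method names below are hypothetical and do not describe the BBN code; for brevity the network is a simple chain, which is the simplest directed acyclic network.

```python
class Linear:
    """y = W x: a primitive component with a forward and a backward pass."""

    def __init__(self, weight):
        self.weight = weight
        self.grad = [[0.0] * len(weight[0]) for _ in weight]
        self.x = None

    def forward(self, x):
        self.x = x  # remembered for the backward pass
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]

    def backward(self, grad_out):
        # Accumulate dJ/dW, then return dJ/dx = W^T grad_out.
        for r, g in enumerate(grad_out):
            for c, v in enumerate(self.x):
                self.grad[r][c] += g * v
        return [sum(self.weight[r][c] * grad_out[r] for r in range(len(grad_out)))
                for c in range(len(self.x))]


class Offset:
    """y = x + b."""

    def __init__(self, bias):
        self.bias = bias
        self.grad = [0.0] * len(bias)

    def forward(self, x):
        return [v + b for v, b in zip(x, self.bias)]

    def backward(self, grad_out):
        for i, g in enumerate(grad_out):
            self.grad[i] += g
        return list(grad_out)


class Chain:
    """Components stored in topological order; the same loops serve every component type."""

    def __init__(self, components):
        self.components = components

    def forward(self, x):
        for component in self.components:
            x = component.forward(x)
        return x

    def backward(self, grad_out):
        for component in reversed(self.components):
            grad_out = component.backward(grad_out)
        return grad_out


first = Linear([[1.0, 2.0], [0.0, 1.0]])
shift = Offset([0.5, -0.5])
net = Chain([first, shift, Linear([[1.0, 1.0]])])
y = net.forward([1.0, 3.0])
grad_in = net.backward([1.0])
print(y, grad_in, first.grad, shift.grad)
```

Adding a new transform type then only requires a new class with the same two methods; the code that applies the network and propagates derivatives stays unchanged.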
27 RDT and the State-of-the-art System
- The state-of-the-art system at BBN
  - Two sub-systems
    - Speaker-independent (SI) system
    - Speaker-adaptive (SA) system
  - Two phases of training
    - ML (initializes MPE training)
    - MPE
  - Three-pass decoding
    - Three tied-mixture acoustic models
- How RDT interacts with the system
  - Trained once, used in three types of acoustic models
  - Integrated with speaker adaptation
28 RDT in Speaker-independent (SI) Training
[Figure: flowchart comparing baseline SI training with SI training with RDT. Labels, in order: Bootstrapping; LDA+MLLT; Initial Transform; ML Training; ML-SI HMM; Lattice Generation; Lattices; MPE Training; MPE-SI HMM.]
29 Experimental Setup
- Data
  - Training: English Conversational Telephone Speech (CTS), 2300 hours, SWB+Fisher
  - Testing: Eval03+Dev04, 3 hours SWB-II, 6 hours Fisher
- Analysis
  - 14 Perceptual Linear Prediction (PLP) cepstral coefficients and normalized energy
  - Vocal Tract Length Normalization (VTLN)
- RDT
  - 15-frame long-span features projected to 60 dimensions
  - initialized from LDA+MLLT
  - 1000 regions, one linear projection per region
- Crossword state-cluster tied model (SCTM), 7K clusters
- The number of Gaussians per state-cluster in the HMM varies across experiments
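To get a feel for the size of the transform in this setup, the arithmetic below counts the projection parameters. It is a back-of-the-envelope sketch that assumes each frame contributes exactly the 15 analysis features listed above (14 PLP cepstra plus energy) and that no projections are shared between regions; offsets are not counted.

```python
frames_in_window = 15            # long-span window length from the setup
features_per_frame = 14 + 1      # 14 PLP cepstra + normalized energy (assumed to be all)
output_dim = 60                  # dimension after projection
regions = 1000                   # one linear projection per region

input_dim = frames_in_window * features_per_frame
params_per_projection = output_dim * input_dim
total_projection_params = regions * params_per_projection

print(input_dim, params_per_projection, total_projection_params)
```

Under these assumptions the long-span input has 225 dimensions and the projections hold 13.5 million parameters in total, which is why a criterion that generalizes well matters for training them.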
30 SI Results (ML)
- Description
  - Two RDTs were trained using HMMs with 12 Gaussians per state-cluster (GPS) and 44 GPS, respectively
  - For decoding, several ML crossword SCTM models of different sizes were trained using either LDA+MLLT or RDT
  - Only the lattice-rescoring pass was run in decoding, for simplicity
- (*) After the other two models (STM, SCTM-NX) were retrained, the WER was further reduced to 20.4%, i.e., 9.3% relatively better than the LDA+MLLT result
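Relative improvements like the one quoted above are computed against the baseline WER. The sketch below defines the relative reduction and inverts it to back out the baseline implied by the two quoted figures, reading them as 20.4% WER and 9.3% relative. The implied baseline is an inference from rounded numbers, not a value reported in the source, and the function names are hypothetical.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction, as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, relative):
    """Invert the formula above: the baseline WER that yields this relative reduction."""
    return new_wer / (1.0 - relative)


baseline = implied_baseline(20.4, 0.093)
print(baseline, relative_reduction(baseline, 20.4))
```

The implied LDA+MLLT baseline comes out between 22% and 23% WER, so the absolute gain is roughly two points.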
31 SI Results (MPE)
- Description
  - Same as the ML experiments, except that the final models were trained under MPE
- (*) After the other two models (STM, SCTM-NX) were trained, the WER was further reduced to 19.2%, i.e., 5.8% relatively better than the LDA+MLLT result
32 Speaker Adaptation
- Speaker adaptation (figure)
  - Assumption: the speaker-dependent models are linearly transformed from an SI model
  - Variations
    - MLLR: assumes that only the Gaussian means are transformed
    - CMLLR: both means and covariances are transformed → equivalent to applying the inverse transform to the features while keeping the model fixed
- Speaker-Adaptive Training (SAT)
  - The SI model is not optimal for adaptation
  - SAT tries to estimate a better model that, when transformed, gives the best likelihood of the data
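The CMLLR equivalence can be verified numerically in one dimension: transforming a Gaussian's mean and variance with a and b gives the same likelihood as inverse-transforming the feature and multiplying by the Jacobian factor 1/|a|. All values below are hypothetical, and the scalar case stands in for the full matrix form.

```python
import math


def gauss(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def model_space(o, mean, var, a, b):
    """Likelihood under the transformed model: mean -> a*mean + b, variance -> a^2 * variance."""
    return gauss(o, a * mean + b, a * a * var)


def feature_space(o, mean, var, a, b):
    """Same likelihood with the model fixed: inverse-transform the feature, scale by 1/|a|."""
    return gauss((o - b) / a, mean, var) / abs(a)


lhs = model_space(1.3, 0.4, 0.8, -1.7, 0.25)
rhs = feature_space(1.3, 0.4, 0.8, -1.7, 0.25)
print(lhs, rhs)
```

Working in feature space means one transform per speaker can be applied to the features once, instead of transforming every Gaussian in the model.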
33 RDT in Speaker-adaptive Training (SAT)
Straightforward approach
- Use SI-RDT transparently
- Simple
- But RDT is not optimized for SAT
[Figure: flowchart. Labels, in order: Train SI RDT; SI RDT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; MPE Training; MPE-SAT HMM.]
34 RDT in Speaker-adaptive Training (SAT)
Iterative approach (SA-RDT)
- Alternately update RDT and the speaker-dependent (SD) transforms
- Back-propagation is used to compute the derivative, since the SD transforms are applied on top of RDT
- RDT is optimized for SAT
[Figure: flowchart. Labels, in order: Train SI RDT; SI RDT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; Update RDT; SA RDT HMM; MPE Training; MPE-SAT HMM.]
35 Adapted Results
- Description
  - Same training and testing data, state clusters, and LM as the unadapted experiments
- 10.9% relative WER reduction for the ML system
- 7.0% relative WER reduction for the MPE system
36 Alternative Procedure for SA-RDT
Simplified SA-RDT
- Similar to the original SA-RDT
- But the speaker-dependent transforms are estimated using the baseline model and features
[Figure: flowchart. Labels, in order: SI LDA+MLLT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; Update RDT; SA RDT HMM; MPE Training; MPE-SAT HMM.]
37 Adapted Results
- Description
  - 500 hours of training data
  - Another set of SD transforms was used before LDA/RDT
  - SA-RDT1 used the simplified procedure
  - SA-RDT2 used the original procedure
- The simplified procedure gave 2/3 of the gain while training the RDT only once
38 Conclusions
- Original work
  - Region-dependent transform
  - Improved discriminative feature training that leads to a more generic feature transform
  - Improved SAT procedure using RDT
- Impact
  - RDT encompasses several other feature transforms, including MPE-HLDA, SPLICE, and the core of fMPE and mean-offset fMPE
  - The method gives a significant WER reduction: 7% relative reduction for the SAT-MPE English CTS system
  - The method is potentially helpful for exploring novel acoustic features
    - We do not have to worry about negative effects when we add new features to the input of the feature transform, because the training decides whether and how to use the new features based on a criterion that is correlated with WER
39 Publications
- B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz. Long span features and minimum phoneme error heteroscedastic linear discriminant analysis. In Proceedings of EARS RT-04 Workshop, 2004.
- B. Zhang and S. Matsoukas. Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition. In Proceedings of ICASSP, 2005.
- B. Zhang, S. Matsoukas, and R. Schwartz. Discriminatively trained region-dependent transform for speech recognition. In Proceedings of ICASSP, 2006.
  - Nominated for the Student Paper Award
  - Awarded the Spoken Language Processing Grant by the IEEE Signal Processing Society
- B. Zhang, S. Matsoukas, and R. Schwartz. Recent progress on the discriminative region-dependent transform for speech feature extraction. In Proceedings of ICSLP, 2006.