Title: Linear Dynamic Model LDM for Automatic Speech Recognition
1Linear Dynamic Model (LDM) for Automatic Speech
Recognition
PhD Candidate Tao Ma Advised by Dr. Joseph
Picone Institute for Signal and Information
Processing (ISIP) Mississippi State University
2- An Example of Kalman Filter (another name of LDM)
- In control system engineering, Kalman Filter
succeeds to model a system with noisy observations
Filtering Position at present time (remove
noise effect) Predicting Position at a future
time Smoothing Position at a time in the past
Observation
A Kalman Filter models the position evolution
3- Why Linear Dynamic Model (LDM)?
- Linear Dynamic Model
- Pilot experiment LDM phone classification on
Aurora 4 - Hybrid HMM/LDM decoder architecture for LVCSR
- Future work
4- HMM Speech Recognition System
Hidden Markov Models
5- Is HMM a perfect model for ASR?
- Progress on improving the accuracy of HMM-based
system has slowed in the past decade - Theory drawbacks of HMM
- False assumption that frames are independent and
stationary - Spatial correlation is ignored (diagonal
covariance matrix) - Limited discrete state space
Accuracy
Clean
Noisy
Time
6- Motivation of Linear Dynamic Model (LDM) Research
- Motivation
- A model which reflects the characteristics of
speech signals will ultimately lead to great ASR
performance improvement - LDM incorporates frame correlation information of
speech signals, which is potential to increase
recognition accuracy - Filter characteristic of LDM has potential to
improve noise robustness of speech recognition - Fast growing computation capacity (thanks to
Intel) make it realistic to build a two-way
HMM/LDM hybrid speech engine
7- Linear Dynamic Model (LDM) is derived from State
Space Model - Equations of State Space Model
y observation feature vector x corresponding
internal state vector h() relationship function
between y and x at current time f() relationship
function between current state and all previous
states epsilon noise component eta noise
component
8- Equations of Linear Dynamic Model (LDM)
- Current state is only determined by previous
state - H, F are linear transform matrices
- Epsilon and Eta are driving components
y observation feature vector x corresponding
internal state vector H linear transform matrix
between y and x F linear transform matrix
between current state and previous state epsilon
driving component eta driving component
9- Kalman filtering for state inference (E-Step of
EM training)
For a speech sound,
Human Being Sound System
e
Kalman Filtering Estimation
10- RTS smoother for better inference
- Rauch-Tung-Striebel (RTS) smoother
- Additional backward pass to minimize inference
error - During EM training, computes the expectations of
state statistics
Standard Kalman Filter
Kalman Filter with RTS smoother
11- Maximum Likelihood Parameter Estimation (M-Step
of EM training)
LDM Parameters
aa ae ah ao aw ay b ch d dh eh er
Nothing but matrix multiplication!
12- LDM for Speech Classification
x
aa
y
x
x
ch
MFCC Feature
Hypothesis
x
eh
HMM-Based Recognition
x
LDM-Based Recognition
aa
y
x
MFCC Feature
x
ch
Hypothesis
x
eh
13- Challenges of Applying LDM to ASR
- Segment-based model
- frame-to-phoneme information is needed before
classification - EM training is sensitive to state initialization
- Each phoneme is modeled by a LDM, EM training is
to find a set of parameters for a specific LDM - No good mechanism for state initialization yet
- More parameters than HMM (23x)
- Currently mono-phone model, to build a tri-phone
model for LVCSR would need more training data
14- Pilot experiment phone classification on Aurora 4
- Aurora 4 Wall Street Journal six kinds of
noises - Airport, Babble, Car, Restaurant, Street, and
Train - Frame-to-phone alignment is generated by ISIP
decoder (force align mode) - Adding language model will get 93 accuracy for
clean data - 40 phones, one vs. all classifier
15- Hybrid HMM/LDM decoder architecture for LVCSR
Confidence Measurement
Best Hypothesis
16- The development of HMM/LDM hybrid decoder is
still in progress - HMM/LDM hybrid decoder is Expected to be done in
2009 - ISIP HMM/SVM hybrid decoder acts as the reference
for implementation - Future work
- Research has proved the nonlinear effects in
speech signals - Investigate the probability of replacing Kalman
filtering with nonlinear filtering (such as
Unscented Kalman Filter, Extended Kalman Filter)
17Questions?
18- Digalakis, V., Segment-based Stochastic Models
of Spectral Dynamics for Continuous Speech
Recognition, Ph.D. Dissertation, Boston
University, Boston, Massachusetts, USA, 1992. - Digalakis, V., Rohlicek, J. and Ostendorf, M.,
ML Estimation of a Stochastic Linear System with
the EM Algorithm and Its Application to Speech
Recognition, IEEE Transactions on Speech and
Audio Processing, vol. 1, no. 4, pp. 431442,
October 1993. - Frankel, J., Linear Dynamic Models for Automatic
Speech Recognition, Ph.D. Dissertation, The
Centre for Speech Technology Research, University
of Edinburgh, Edinburgh, UK, 2003.