Title: CSE 552
1 - CSE 552
- Hidden Markov Models for Speech Recognition
- Spring, 2004
- Oregon Health & Science University
- OGI School of Science & Engineering
- John-Paul Hosom
- Lecture Notes for May 26
- Other Approaches to ASR
2 - Final Project / Final Exam
- Forward-Backward project due June 2
- Assume you need only one mixture component in the GMM
- Use 5 states (plus two null states) for training; if you want to use a
  different number, that's fine, but state your assumptions.
- Send C code and output of the "no" model (after 10 training iterations)
  to hosom_at_cse.ogi.edu
- Results should be very similar to the HMM models given to you in
  project 2 (Viterbi search).
- Final exam will be available after the last class and due at the end of
  finals week (Friday, June 11).
3 - Other Approaches to ASR: Segment-Based Systems
- SUMMIT system
- developed at MIT by Zue, Glass, et al.
- competitive performance at phoneme classification
- segment-based recognition: (a) segment the speech at possible phonetic
  boundaries, (b) create a network of (sub-)phonetic segments, (c) classify
  each segment, (d) search segment probabilities for the most likely
  sequence of phonetic segments
- complicated
4 - Other Approaches to ASR: Segment-Based Systems
[Figure: segment network created from points of spectral change]
- the segment network can also be created using segmentation by recognition
5 - Other Approaches to ASR: Segment-Based Systems
- segment-based recognition: (a) segment the speech at possible phonetic
  boundaries, (b) create a network of (sub-)phonetic segments, (c) classify
  each segment, (d) search segment probabilities for the most likely
  sequence of phonetic segments
- (a,b) segmentation by recognition (A* search)
[Figure: network of hypothesized (sub-)phonetic segments, labeled with
phonemes such as pau, m, ao, f, aa, r, kc, k, n, tc, t, ah, v]
6 - Other Approaches to ASR: Segment-Based Systems
(c) classification: classify each segment (not frame-by-frame) using
information from throughout the segment: MFCC and energy averages over
each segment third, feature derivatives at segment boundaries, segment
duration, and the number of boundaries within the segment.
(d) search: Viterbi search through segments to determine the best
sequence of segments (phonemes).  The search is complicated by the fact
that it must consider all possible segmentations, not just one
hypothesized segmentation.  A rough sketch of the segment-level feature
computation in (c) is given below.
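As an illustration of step (c) only (not code from SUMMIT itself), the
sketch below averages frame-based MFCCs over each third of a hypothesized
segment and appends the segment duration; the array layout, function
name, and number of coefficients are assumptions.

/* Sketch (not SUMMIT code): average frame-based MFCCs over each third
   of a hypothesized segment, and append the segment duration. */
#include <string.h>

#define NUM_MFCC 13

/* mfcc[t][c]: coefficient c of frame t; segment spans frames [start, end) */
void segment_features(float mfcc[][NUM_MFCC], int start, int end,
                      float *out /* 3*NUM_MFCC + 1 values */)
{
    int len = end - start;
    memset(out, 0, (3 * NUM_MFCC + 1) * sizeof(float));
    for (int third = 0; third < 3; third++) {
        int t0 = start + (third * len) / 3;
        int t1 = start + ((third + 1) * len) / 3;
        if (t1 <= t0) t1 = t0 + 1;          /* avoid empty thirds */
        for (int t = t0; t < t1; t++)
            for (int c = 0; c < NUM_MFCC; c++)
                out[third * NUM_MFCC + c] += mfcc[t][c];
        for (int c = 0; c < NUM_MFCC; c++)
            out[third * NUM_MFCC + c] /= (float)(t1 - t0);
    }
    out[3 * NUM_MFCC] = (float)len;         /* duration in frames */
}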
7 - Other Approaches to ASR: Segment-Based Systems
- Feature system
- developed at CMU by Cole, Stern, et al.
- competitive performance at alphabet classification
- segment-based recognition of a single letter: (a) extract information
  about the signal, including spectral properties, F0, and energy in
  frequency bands, (b) locate 4 points in the utterance: beginning of
  utterance, beginning of vowel, offset of vowel, end of utterance,
  (c) extract 50 features (from step (a) at the 4 locations), (d) use a
  decision tree to determine the probabilities of each letter
- fragile: errors in feature extraction or segmentation cannot be
  recovered from
8 - Other Approaches to ASR: Segment-Based Systems
- Determine letter/digit boundaries using an HMM/ANN hybrid (can only
  recover from substitution errors)
- For each hypothesized letter/digit: (a) locate 4 points in the word:
  beginning of word, beginning of sonorant, offset of sonorant, end of
  word, (b) extract information at segment boundaries and at some points
  within the sonorant region: PLP features, zero crossings, peak-to-peak
  amplitude, (d) use an ANN to classify these features into 37 categories
  (26 letters + 11 digits)
- Telephone-band performance of almost 90%; recent HMM performance of
  just over 90%.
9 - Other Approaches to ASR: Template-Based Systems
- Template-based systems with time alignment
- useful for small-vocabulary, isolated-word, single-speaker recognition
  (e.g. cell-phone applications)
- uses Dynamic Time Warping (DTW): warp two sequences of features (a
  reference pattern and a test pattern) to minimize distortion
- recognize words by having multiple reference patterns, one or more
  reference patterns (templates) per word
- the recognized word is the template with the minimum distance between
  the warped template and the test word
- much simpler than the segment-based approach; only need to compare
  frames of speech and find the smallest distortion (a minimal DTW sketch
  follows this list).
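A minimal DTW sketch in C, assuming Euclidean frame distances and the
usual three local moves (match, insertion, deletion); the names and data
layout are illustrative, not taken from any particular system.

/* Sketch of Dynamic Time Warping between a reference template and a
   test pattern.  Returns the total distortion of the best warp path. */
#include <float.h>
#include <math.h>
#include <stdlib.h>

static float frame_dist(const float *a, const float *b, int dim)
{
    float d = 0.0f;
    for (int k = 0; k < dim; k++)
        d += (a[k] - b[k]) * (a[k] - b[k]);
    return sqrtf(d);
}

/* ref: nref frames, test: ntest frames, each frame of dimension dim */
float dtw(float **ref, int nref, float **test, int ntest, int dim)
{
    float *D = malloc(sizeof(float) * nref * ntest);
    for (int i = 0; i < nref; i++) {
        for (int j = 0; j < ntest; j++) {
            float d = frame_dist(ref[i], test[j], dim);
            float best = FLT_MAX;
            if (i == 0 && j == 0) best = 0.0f;
            if (i > 0)            best = fminf(best, D[(i-1)*ntest + j]);
            if (j > 0)            best = fminf(best, D[i*ntest + (j-1)]);
            if (i > 0 && j > 0)   best = fminf(best, D[(i-1)*ntest + (j-1)]);
            D[i*ntest + j] = d + best;       /* cumulative distortion */
        }
    }
    float total = D[nref*ntest - 1];
    free(D);
    return total;
}

The recognized word is then the template whose returned distortion
against the test pattern is smallest (often normalized by path length).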
10 - Other Approaches to ASR: Stochastic Approaches
- includes HMMs and HMM/ANN hybrids
11 - Other Approaches to ASR: Stochastic Approaches
- HMM/ANN hybrids (ICSI, CSLU, Cambridge)
- Technique
- Same as a conventional HMM, except
- (1) Replace the GMM with an ANN to compute observation probabilities
- (2) Divide by the prior probability of each class
- Why?
- ANNs compute posterior probabilities, if certain conditions are met.
  (Duda & Hart; Richard & Lippmann; Kanaya & Miyake; several others)
- Criteria are: a sufficient number of hidden nodes, sufficient training
  data, a Mean Square Error criterion, etc.
- M.D. Richard and R.P. Lippmann, "Neural Network Classifiers Estimate
  Bayesian a posteriori Probabilities," Neural Computation, vol. 3,
  no. 4, pp. 461-483, Winter 1991.
12 - Other Approaches to ASR: Stochastic Approaches
From Duda & Hart: ANNs compute p(cj | ot); HMMs use observation
probabilities p(ot | cj).  From Bayes' rule:

  p(ot | cj) = p(cj | ot) p(ot) / p(cj)

Because p(ot) is constant for all cj, p(cj | ot) / p(cj) represents an
unnormalized probability of p(ot | cj).
13 - Other Approaches to ASR: Stochastic Approaches
Training HMM/ANN Hybrids: Generate a file containing feature vectors and
an index that indicates which phonetic class each feature vector belongs
to.
[Figure: file of PLP coefficients (or MFCCs, with/without deltas), one
feature vector per row, with a phonetic-class index for each vector]
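A hedged sketch of writing one such line per feature vector follows; the
exact file layout used at any particular site is not specified in these
notes and is assumed here.

/* Sketch: write one training line per frame: PLP (or MFCC) values
   followed by the index of the phonetic class for that frame. */
#include <stdio.h>

void write_training_frame(FILE *fp, const float *feat, int num_feat,
                          int class_index)
{
    for (int k = 0; k < num_feat; k++)
        fprintf(fp, "%.4f ", feat[k]);
    fprintf(fp, "%d\n", class_index);
}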
14 - Other Approaches to ASR: Stochastic Approaches
Train a neural network on each of the feature vectors, with a target
value of 1.0 for the associated phonetic class and 0.0 for all other
classes.  The network may be feed-forward (OGI, ICSI) or recurrent
(Cambridge).  Usually fully connected, trained using back-propagation.
[Figure: example training frame with target outputs (y 0.0, eh 0.0,
s 1.0, n 0.0, ow 0.0, pau 0.0) and input features PLP0-PLP6 =
0.5631, -0.3687, -0.0673, 1.2241, -0.8383, 0.3568, 0.4660]
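A minimal sketch of the one-hot targets and the Mean Square Error for a
single labeled frame; the class ordering and names are illustrative only.

/* Sketch: one-hot target for a labeled frame and its squared error
   against the network outputs (Mean Square Error criterion). */
#define NUM_CLASSES 6   /* e.g. y, eh, s, n, ow, pau */

float frame_mse(const float *net_output, int true_class)
{
    float target[NUM_CLASSES], err = 0.0f;
    for (int j = 0; j < NUM_CLASSES; j++)
        target[j] = (j == true_class) ? 1.0f : 0.0f;
    for (int j = 0; j < NUM_CLASSES; j++) {
        float d = net_output[j] - target[j];
        err += d * d;
    }
    return err / NUM_CLASSES;
}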
15 - Other Approaches to ASR: Stochastic Approaches
During recognition, present the network with a feature vector, and use
the outputs as a posteriori probabilities of each class.
[Figure: example recognition frame with network outputs (y 0.03, eh 0.12,
s 0.01, n 0.02, ow 0.81, pau 0.00) and input features PLP0-PLP6 =
1.2904, -0.2140, -0.9214, -0.5846, -0.6672, 0.9754, -0.4610]
Then, divide each output by the a priori probability of that class:
  y: 0.03/0.08,  eh: 0.12/0.17,  s: 0.01/0.25,
  n: 0.02/0.08,  ow: 0.81/0.25,  pau: 0.00/0.17
to arrive at the observation probabilities, e.g. p_ow(ot) = 0.81/0.25 = 3.24.
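A small sketch of this division; with the numbers above, the "ow" output
0.81 divided by its prior 0.25 gives 3.24.  The function name and
interface are illustrative.

/* Sketch: convert ANN outputs (posteriors p(c_j | o_t)) into scaled
   observation probabilities by dividing by the class priors p(c_j). */
void scale_by_priors(const float *posterior, const float *prior,
                     float *scaled_obs, int num_classes)
{
    for (int j = 0; j < num_classes; j++)
        scaled_obs[j] = posterior[j] / prior[j];  /* e.g. 0.81/0.25 = 3.24 */
}

These scaled values (or their logarithms) then stand in for bj(ot) in the
Viterbi or forward computation.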
16 - Other Approaches to ASR: Stochastic Approaches
- Instead of dividing by the a priori likelihoods, the training process
  may be modified to output estimates of p(cj | ot) / p(cj) (Wei and
  van Vuuren, 1998).
- The training process requires that each feature vector be associated
  with a single phoneme with a target probability of 1.0.  Training an
  ANN takes longer than training a GMM (because one feature vector
  affects all outputs).
- Therefore, training of HMM/ANN hybrids is typically performed on
  hand-labeled or force-aligned data, not using the forward-backward
  training procedure.  However, there are methods for incorporating
  forward-backward training into HMM/ANN recognition (e.g. Yan, Fanty,
  Cole 1997).
17 - Other Approaches to ASR: Stochastic Approaches
- Advantages of HMM/ANN hybrids
- Input features may be correlated
- Discriminant training
- Fast execution time
- Disadvantages of HMM/ANN hybrids
- Long training time
- Inability to tweak individual phoneme models
- Relatively small number of output categories possible (500 max.)
- Performance
- Comparable with HMM/GMM systems
- Slightly better performance on smaller tasks
such as phoneme or digit recognition, slightly
worse performance on large-vocabulary tasks.
18 - Other Approaches to ASR
- Variants on HMMs
- multi-band recognition, temporal patterns (Hermansky)
- phonetic-transition-based recognition (Bourlard, Hosom)
- syllable-based recognition (Wu)
- overlapping, low-level articulatory features (Deng)
- dynamic templates (Ghitza, Sondhi)
19 - Other Approaches to ASR: SPAM
Stochastic Perceptual Auditory Model (SPAM): based on evidence from Furui
(1976) and Strange (1999), humans recognize speech based on phonetic
transitions.  Steady-state regions are considered (much) less important.
- HMM/ANN system trained on phonetic boundaries
- all steady-state regions classified as non-transition
- performance increase when combined with a standard HMM
[Figure: the word "one" (w uh n) labeled as transitions (sil-w, w-uh,
uh-n, n-sil) and non-transition (ntr) regions]
20 - Other Approaches to ASR: Modified Diphones
Modified Diphones: both transition regions and steady-state regions are
important
- Advantages
- an optional steady state allows recognition of fast and normal-rate speech
- accounts for both phonetic-transition and steady-state information
- 13% reduction in error on a digits task
- Disadvantages
- restricted task due to the large number of diphones
- context-independent phonemes for steady states due to the large number
  of output categories
21 - Other Approaches to ASR: TRAPS
In contrast to typical speech feature vectors, which represent the entire
spectrum at one small time frame, TRAPS represent the energy in one
frequency band over a long time frame (Sharma, Hermansky 1998).
22 - Other Approaches to ASR: TRAPS
The average TRAP (TempoRAl Pattern) for each phoneme can be computed,
using the center frame of each labeled phoneme as the reference point.
These mean TRAPS show characteristics specific to each phoneme.  A rough
sketch of this computation follows.
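The sketch below computes the mean TRAP for one phoneme in one frequency
band by averaging the band energy over a window centered on each labeled
occurrence; the window length and array layout are assumptions.

/* Sketch: mean TRAP for one phoneme in one frequency band, averaging
   the band energy over a window centered on each labeled occurrence. */
#define TRAP_LEN 100   /* ~1 second at a 10-msec frame rate */

/* band_energy[t]: energy in this band at frame t (num_frames total)
   centers[]: center frames of the labeled occurrences of the phoneme */
void mean_trap(const float *band_energy, int num_frames,
               const int *centers, int num_occurrences, float *trap)
{
    for (int k = 0; k < TRAP_LEN; k++) trap[k] = 0.0f;
    for (int i = 0; i < num_occurrences; i++) {
        for (int k = 0; k < TRAP_LEN; k++) {
            int t = centers[i] - TRAP_LEN / 2 + k;
            if (t < 0) t = 0;
            if (t >= num_frames) t = num_frames - 1;  /* clamp at edges */
            trap[k] += band_energy[t];
        }
    }
    for (int k = 0; k < TRAP_LEN; k++)
        trap[k] /= (float)num_occurrences;
}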
23 - Other Approaches to ASR: TRAPS
Pattern classification can be done by using the 1-second-long TRAPS as
input (100 inputs), 300 hidden nodes, and one phoneme per output node
(29 for a digits task).  Phonetic classification can be improved by
combining the outputs of different frequency-band TRAPS:
[Figure: per-band TRAP classifiers (band 1 ... band 15), whose 15 x 29
outputs (435 values) are combined by a merger network with 29 phoneme
outputs]
24 - Other Approaches to ASR: TRAPS
These phoneme estimates can then be used in an HMM system as estimates of
bj(ot), and compared with a standard HMM/ANN hybrid.
25 - Other Approaches to ASR: Templates as Non-Stationary States
Combination of HMMs and DTW-based recognition (Ghitza, Sondhi 1993).
Each observation does not occupy a single 10-msec frame, but a variable
duration, with a time-varying output.  Diphones are used to maximize the
change within units.  A labeled database of diphones is used to create
templates for each diphone.  The templates are composed of 11th-order
cepstral features with a frame rate of 10 msec.
[Figure: diphone templates, e.g. pau>m, m>aa, aa>r, r>kc, kc>k, k>ao,
ao>f, f>pau]
26 - Other Approaches to ASR: Templates as Non-Stationary States
Then, compute p(O,T | q) = p(O | T,q) p(dur=T | q), where T is the
duration of the diphone.  p(O | T,q) is computed by (a) DTW between the
observations and the template, and (b) classification with a Gaussian PDF
using means and covariances between the warped observations and the
template.  p(dur=T | q) is replaced by a penalty factor of 0.0 for
durations beyond half or twice the duration of the template.  Recognition
is done using a technique similar to a Viterbi search, maximizing
p(O,T | q) instead of just p(O | q), where T = t1, t2, ..., tN.  A hedged
sketch of the duration penalty follows.
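The sketch below combines a DTW/Gaussian score for p(O | T,q) with the
hard duration constraint described above (score zeroed outside half to
twice the template duration); the function name and interface are
hypothetical.

/* Sketch: score a hypothesized diphone q over T observation frames as
   p(O,T|q) = p(O|T,q) * p(dur=T|q), with the duration term replaced by
   a hard penalty outside [half, twice] the template duration. */
float diphone_score(float p_obs_given_T_q,   /* from DTW + Gaussian PDF */
                    int T,                   /* hypothesized duration   */
                    int template_dur)        /* template duration       */
{
    if (2 * T < template_dur || T > 2 * template_dur)
        return 0.0f;                         /* duration penalty of 0.0 */
    return p_obs_given_T_q;
}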
27 - Other Approaches to ASR: Templates as Non-Stationary States
Results for a single speaker uttering TIMIT sentences, evaluated on other
sentences:

             Correct   Ins.   Del.   Accuracy
  Proposed    70.1     23.5   4.2    42.4
  Baseline    63.6     56.1   3.5     4.0

Why is the baseline performance so bad?  Typical speaker-independent
performance on TIMIT is approximately 70% accuracy.  Why are there so
many insertions in both systems?  Not clear.  How could this be extended
to a speaker-independent system (in other words, move away from a static
DTW-based template)?  Not clear.  How is the Viterbi search implemented?
Not clear.
28 - Other Approaches to ASR: TRACE
- TRACE (McClelland and Elman, 1986) has an architecture similar to a
  neural network: (a) it is based on the interactive activation of simple
  units, using excitatory and inhibitory interactions among a large
  number of such units, (b) each unit stands for a hypothesis about the
  input, such as the existence of a bilabial plosive, and the activation
  of a unit is monotonically related to the strength of the hypothesis,
  (c) mutually consistent hypotheses have mutually excitatory
  connections, and inconsistent hypotheses have inhibitory connections
  (a rough sketch of one activation update is given after this list).
- Units may be connected at the same level of abstraction, or at
  different levels.  However, there is no between-level inhibition.  The
  three primary levels are feature, phoneme, and word.
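As an illustration only (not TRACE's exact equations), one activation
update in the style of standard interactive-activation models might look
like the following; all constants and the precise update rule are
assumptions.

/* Sketch of one interactive-activation update step for a single unit.
   The constants and the update rule are illustrative, not TRACE's. */
#define A_MAX   1.0f
#define A_MIN  -0.2f
#define A_REST -0.1f
#define DECAY   0.1f

float update_activation(float a, float net /* excitatory minus inhibitory input */)
{
    float delta;
    if (net > 0.0f)
        delta = net * (A_MAX - a) - DECAY * (a - A_REST);
    else
        delta = net * (a - A_MIN) - DECAY * (a - A_REST);
    a += delta;
    if (a > A_MAX) a = A_MAX;   /* clip to the allowed activation range */
    if (a < A_MIN) a = A_MIN;
    return a;
}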
29 - Other Approaches to ASR: TRACE
- A frame-based architecture is used, but the size of each frame
  increases with the abstractness of the features
- Computationally plausible: two prototypes were built, one for
  processing real speech of a single speaker uttering monosyllables,
  another for processing mock or synthetic speech to illustrate more
  complex recognition.  However, this work has not been extended to a
  single complete connected-word ASR system.
- The feature level contains 7 distinct features, each distinct feature
  having 8 possible values from low to high and a 9th value for silence
- The response of a unit to its inputs is the Bayesian combination of its
  input probabilities
30 - Other Approaches to ASR: TRACE
31 - Why is the HMM the dominant technique for ASR?
- well-defined mathematical structure
- does not require expert knowledge about the speech signal (more people
  study statistics than study speech)
- errors in analysis don't propagate and accumulate
- does not require prior segmentation
- the temporal property of speech is accounted for
- does not require a prohibitively large number of templates
- results are usually the best or among the best