Title: CSE 552
1 - CSE 552
- Hidden Markov Models for Speech Recognition
- Spring, 2004
- Oregon Health & Science University
- OGI School of Science & Engineering
- John-Paul Hosom
- Lecture Notes for May 26
- Other Approaches to ASR
2 - Final Project / Final Exam
- Forward-Backward project due June 2
- Assume you need only one mixture component in the GMM
- Use 5 states (plus two null states) for training; if you want to use a
  different number, that's fine, but state your assumptions.
- Send C code and output of the "no" model (after 10 training iterations)
  to hosom_at_cse.ogi.edu
- Results should be very similar to the HMM models given to you in
  project 2 (Viterbi search).
- Final exam will be available after the last class and due at the end of
  finals week (Friday, June 11).
3 - Other Approaches to ASR: Segment-Based Systems
- SUMMIT system
- developed at MIT by Zue, Glass, et al.
- competitive performance at phoneme classification
- segment-based recognition: (a) segment the speech at possible phonetic
  boundaries, (b) create a network of (sub-)phonetic segments, (c) classify
  each segment, (d) search segment probabilities for the most likely
  sequence of phonetic segments
- complicated
4 - Other Approaches to ASR: Segment-Based Systems
[Figure: segment network created from points of spectral change]
- the segment network can also be created using segmentation by recognition
5 - Other Approaches to ASR: Segment-Based Systems
- segment-based recognition: (a) segment the speech at possible phonetic
  boundaries, (b) create a network of (sub-)phonetic segments, (c) classify
  each segment, (d) search segment probabilities for the most likely
  sequence of phonetic segments
- (a,b) segmentation by recognition (A* search)
[Figure: network of hypothesized (sub-)phonetic segments, labeled with
phonemes such as pau, m, ao, f, aa, r, kc, k, n, tc, t, ah, v]
6 - Other Approaches to ASR: Segment-Based Systems
(c) classification: classify each segment (not frame-by-frame) using
information from throughout the segment: MFCC and energy averages over
each segment third, feature derivatives at segment boundaries, segment
duration, and the number of boundaries within the segment.
(d) search: Viterbi search through segments to determine the best
sequence of segments (phonemes).  The search is complicated by the fact
that it must consider all possible segmentations, not just one
hypothesized segmentation.  A rough sketch of the segment-level feature
computation in (c) is given below.
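As an illustration of step (c) only (not code from SUMMIT itself), the
sketch below averages frame-based MFCCs over each third of a hypothesized
segment and appends the segment duration; the array layout, function
name, and number of coefficients are assumptions.

/* Sketch (not SUMMIT code): average frame-based MFCCs over each third
   of a hypothesized segment, and append the segment duration. */
#include <string.h>

#define NUM_MFCC 13

/* mfcc[t][c]: coefficient c of frame t; segment spans frames [start, end) */
void segment_features(float mfcc[][NUM_MFCC], int start, int end,
                      float *out /* 3*NUM_MFCC + 1 values */)
{
    int len = end - start;
    memset(out, 0, (3 * NUM_MFCC + 1) * sizeof(float));
    for (int third = 0; third < 3; third++) {
        int t0 = start + (third * len) / 3;
        int t1 = start + ((third + 1) * len) / 3;
        if (t1 <= t0) t1 = t0 + 1;          /* avoid empty thirds */
        for (int t = t0; t < t1; t++)
            for (int c = 0; c < NUM_MFCC; c++)
                out[third * NUM_MFCC + c] += mfcc[t][c];
        for (int c = 0; c < NUM_MFCC; c++)
            out[third * NUM_MFCC + c] /= (float)(t1 - t0);
    }
    out[3 * NUM_MFCC] = (float)len;         /* duration in frames */
}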
7 - Other Approaches to ASR: Segment-Based Systems
- Feature system
- developed at CMU by Cole, Stern, et al.
- competitive performance at alphabet classification
- segment-based recognition of a single letter: (a) extract information
  about the signal, including spectral properties, F0, and energy in
  frequency bands, (b) locate 4 points in the utterance: beginning of
  utterance, beginning of vowel, offset of vowel, end of utterance,
  (c) extract 50 features (from step (a) at the 4 locations), (d) use a
  decision tree to determine the probabilities of each letter
- fragile: errors in feature extraction or segmentation cannot be
  recovered from
8 - Other Approaches to ASR: Segment-Based Systems
- Determine letter/digit boundaries using an HMM/ANN hybrid (can only
  recover from substitution errors)
- For each hypothesized letter/digit: (a) locate 4 points in the word:
  beginning of word, beginning of sonorant, offset of sonorant, end of
  word, (b) extract information at segment boundaries and at some points
  within the sonorant region: PLP features, zero crossings, peak-to-peak
  amplitude, (d) use an ANN to classify these features into 37 categories
  (26 letters + 11 digits)
- Telephone-band performance of almost 90%; recent HMM performance of
  just over 90%.
9 - Other Approaches to ASR: Template-Based Systems
- Template-based systems with time alignment
- useful for small-vocabulary, isolated-word, single-speaker recognition
  (e.g. cell-phone applications)
- uses Dynamic Time Warping (DTW): warp two sequences of features (a
  reference pattern and a test pattern) to minimize distortion
- recognize words by having multiple reference patterns, one or more
  reference patterns (templates) per word
- the recognized word is the template with the minimum distance between
  the warped template and the test word
- much simpler than the segment-based approach; only need to compare
  frames of speech and find the smallest distortion (a minimal DTW sketch
  follows this list).
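A minimal DTW sketch in C, assuming Euclidean frame distances and the
usual three local moves (match, insertion, deletion); the names and data
layout are illustrative, not taken from any particular system.

/* Sketch of Dynamic Time Warping between a reference template and a
   test pattern.  Returns the total distortion of the best warp path. */
#include <float.h>
#include <math.h>
#include <stdlib.h>

static float frame_dist(const float *a, const float *b, int dim)
{
    float d = 0.0f;
    for (int k = 0; k < dim; k++)
        d += (a[k] - b[k]) * (a[k] - b[k]);
    return sqrtf(d);
}

/* ref: nref frames, test: ntest frames, each frame of dimension dim */
float dtw(float **ref, int nref, float **test, int ntest, int dim)
{
    float *D = malloc(sizeof(float) * nref * ntest);
    for (int i = 0; i < nref; i++) {
        for (int j = 0; j < ntest; j++) {
            float d = frame_dist(ref[i], test[j], dim);
            float best = FLT_MAX;
            if (i == 0 && j == 0) best = 0.0f;
            if (i > 0)            best = fminf(best, D[(i-1)*ntest + j]);
            if (j > 0)            best = fminf(best, D[i*ntest + (j-1)]);
            if (i > 0 && j > 0)   best = fminf(best, D[(i-1)*ntest + (j-1)]);
            D[i*ntest + j] = d + best;       /* cumulative distortion */
        }
    }
    float total = D[nref*ntest - 1];
    free(D);
    return total;
}

The recognized word is then the template whose returned distortion
against the test pattern is smallest (often normalized by path length).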
10 - Other Approaches to ASR: Stochastic Approaches
- includes HMMs and HMM/ANN hybrids
11 - Other Approaches to ASR: Stochastic Approaches
- HMM/ANN hybrids (ICSI, CSLU, Cambridge)
- Technique
- Same as a conventional HMM, except
- (1) Replace the GMM with an ANN to compute observation probabilities
- (2) Divide by the prior probability of each class
- Why?
- ANNs compute posterior probabilities, if certain conditions are met.
  (Duda & Hart; Richard & Lippmann; Kanaya & Miyake; several others)
- Criteria are: a sufficient number of hidden nodes, sufficient training
  data, a Mean Square Error criterion, etc.
- M.D. Richard and R.P. Lippmann, "Neural Network Classifiers Estimate
  Bayesian a posteriori Probabilities," Neural Computation, vol. 3,
  no. 4, pp. 461-483, Winter 1991.
12 - Other Approaches to ASR: Stochastic Approaches
From Duda & Hart: ANNs compute p(cj | ot); HMMs use observation
probabilities p(ot | cj).  From Bayes' rule:

  p(ot | cj) = p(cj | ot) p(ot) / p(cj)

Because p(ot) is constant for all cj, p(cj | ot) / p(cj) represents an
unnormalized probability of p(ot | cj).
13 - Other Approaches to ASR: Stochastic Approaches
Training HMM/ANN Hybrids: Generate a file containing feature vectors and
an index that indicates which phonetic class each feature vector belongs
to.
[Figure: file of PLP coefficients (or MFCCs, with/without deltas), one
feature vector per row, with a phonetic-class index for each vector]
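A hedged sketch of writing one such line per feature vector follows; the
exact file layout used at any particular site is not specified in these
notes and is assumed here.

/* Sketch: write one training line per frame: PLP (or MFCC) values
   followed by the index of the phonetic class for that frame. */
#include <stdio.h>

void write_training_frame(FILE *fp, const float *feat, int num_feat,
                          int class_index)
{
    for (int k = 0; k < num_feat; k++)
        fprintf(fp, "%.4f ", feat[k]);
    fprintf(fp, "%d\n", class_index);
}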
14 - Other Approaches to ASR: Stochastic Approaches
Train a neural network on each of the feature vectors, with a target
value of 1.0 for the associated phonetic class and 0.0 for all other
classes.  The network may be feed-forward (OGI, ICSI) or recurrent
(Cambridge).  Usually fully connected, trained using back-propagation.
[Figure: example training frame with target outputs (y 0.0, eh 0.0,
s 1.0, n 0.0, ow 0.0, pau 0.0) and input features PLP0-PLP6 =
0.5631, -0.3687, -0.0673, 1.2241, -0.8383, 0.3568, 0.4660]
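A minimal sketch of the one-hot targets and the Mean Square Error for a
single labeled frame; the class ordering and names are illustrative only.

/* Sketch: one-hot target for a labeled frame and its squared error
   against the network outputs (Mean Square Error criterion). */
#define NUM_CLASSES 6   /* e.g. y, eh, s, n, ow, pau */

float frame_mse(const float *net_output, int true_class)
{
    float target[NUM_CLASSES], err = 0.0f;
    for (int j = 0; j < NUM_CLASSES; j++)
        target[j] = (j == true_class) ? 1.0f : 0.0f;
    for (int j = 0; j < NUM_CLASSES; j++) {
        float d = net_output[j] - target[j];
        err += d * d;
    }
    return err / NUM_CLASSES;
}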
15 - Other Approaches to ASR: Stochastic Approaches
During recognition, present the network with a feature vector, and use
the outputs as a posteriori probabilities of each class.
[Figure: example recognition frame with network outputs (y 0.03, eh 0.12,
s 0.01, n 0.02, ow 0.81, pau 0.00) and input features PLP0-PLP6 =
1.2904, -0.2140, -0.9214, -0.5846, -0.6672, 0.9754, -0.4610]
Then, divide each output by the a priori probability of that class:
  y: 0.03/0.08,  eh: 0.12/0.17,  s: 0.01/0.25,
  n: 0.02/0.08,  ow: 0.81/0.25,  pau: 0.00/0.17
to arrive at the observation probabilities, e.g. p_ow(ot) = 0.81/0.25 = 3.24.
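A small sketch of this division; with the numbers above, the "ow" output
0.81 divided by its prior 0.25 gives 3.24.  The function name and
interface are illustrative.

/* Sketch: convert ANN outputs (posteriors p(c_j | o_t)) into scaled
   observation probabilities by dividing by the class priors p(c_j). */
void scale_by_priors(const float *posterior, const float *prior,
                     float *scaled_obs, int num_classes)
{
    for (int j = 0; j < num_classes; j++)
        scaled_obs[j] = posterior[j] / prior[j];  /* e.g. 0.81/0.25 = 3.24 */
}

These scaled values (or their logarithms) then stand in for bj(ot) in the
Viterbi or forward computation.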
16 - Other Approaches to ASR: Stochastic Approaches
- Instead of dividing by the a priori likelihoods, the training process
  may be modified to output estimates of p(cj | ot) / p(cj) (Wei and
  van Vuuren, 1998).
- The training process requires that each feature vector be associated
  with a single phoneme with a target probability of 1.0.  Training an
  ANN takes longer than training a GMM (because one feature vector
  affects all outputs).
- Therefore, training of HMM/ANN hybrids is typically performed on
  hand-labeled or force-aligned data, not using the forward-backward
  training procedure.  However, there are methods for incorporating
  forward-backward training into HMM/ANN recognition (e.g. Yan, Fanty,
  Cole 1997).
17 - Other Approaches to ASR: Stochastic Approaches
- Advantages of HMM/ANN hybrids
- Input features may be correlated
- Discriminant training
- Fast execution time
- Disadvantages of HMM/ANN hybrids
- Long training time
- Inability to tweak individual phoneme models
- Relatively small number of output categories possible (500 max.)
- Performance
- Comparable with HMM/GMM systems
- Slightly better performance on smaller tasks
such as phoneme or digit recognition, slightly
worse performance on large-vocabulary tasks.
18 - Other Approaches to ASR
- Variants on HMMs
- multi-band recognition, temporal patterns (Hermansky)
- phonetic-transition-based recognition (Bourlard, Hosom)
- syllable-based recognition (Wu)
- overlapping, low-level articulatory features (Deng)
- dynamic templates (Ghitza, Sondhi)
19 - Other Approaches to ASR: SPAM
Stochastic Perceptual Auditory Model (SPAM): based on evidence from Furui
(1976) and Strange (1999), humans recognize speech based on phonetic
transitions.  Steady-state regions are considered (much) less important.
- HMM/ANN system trained on phonetic boundaries
- all steady-state regions classified as non-transition
- performance increase when combined with a standard HMM
[Figure: the word "one" (w uh n) labeled as transitions (sil-w, w-uh,
uh-n, n-sil) and non-transition (ntr) regions]
20 - Other Approaches to ASR: Modified Diphones
Modified Diphones: both transition regions and steady-state regions are
important
- Advantages
- an optional steady state allows recognition of fast and normal-rate speech
- accounts for both phonetic-transition and steady-state information
- 13% reduction in error on a digits task
- Disadvantages
- restricted task due to the large number of diphones
- context-independent phonemes for steady states due to the large number
  of output categories
21 - Other Approaches to ASR: TRAPS
In contrast to typical speech feature vectors, which represent the entire
spectrum at one small time frame, TRAPS represent the energy in one
frequency band over a long time frame (Sharma, Hermansky 1998).
22 - Other Approaches to ASR: TRAPS
The average TRAP (TempoRAl Pattern) for each phoneme can be computed,
using the center frame of each labeled phoneme as the reference point.
These mean TRAPS show characteristics specific to each phoneme.  A rough
sketch of this computation follows.
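The sketch below computes the mean TRAP for one phoneme in one frequency
band by averaging the band energy over a window centered on each labeled
occurrence; the window length and array layout are assumptions.

/* Sketch: mean TRAP for one phoneme in one frequency band, averaging
   the band energy over a window centered on each labeled occurrence. */
#define TRAP_LEN 100   /* ~1 second at a 10-msec frame rate */

/* band_energy[t]: energy in this band at frame t (num_frames total)
   centers[]: center frames of the labeled occurrences of the phoneme */
void mean_trap(const float *band_energy, int num_frames,
               const int *centers, int num_occurrences, float *trap)
{
    for (int k = 0; k < TRAP_LEN; k++) trap[k] = 0.0f;
    for (int i = 0; i < num_occurrences; i++) {
        for (int k = 0; k < TRAP_LEN; k++) {
            int t = centers[i] - TRAP_LEN / 2 + k;
            if (t < 0) t = 0;
            if (t >= num_frames) t = num_frames - 1;  /* clamp at edges */
            trap[k] += band_energy[t];
        }
    }
    for (int k = 0; k < TRAP_LEN; k++)
        trap[k] /= (float)num_occurrences;
}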
23 - Other Approaches to ASR: TRAPS
Pattern classification can be done by using the 1-second-long TRAPS as
input (100 inputs), 300 hidden nodes, and one phoneme per output node
(29 for a digits task).  Phonetic classification can be improved by
combining the outputs of different frequency-band TRAPS:
[Figure: per-band TRAP classifiers (band 1 ... band 15), whose 15 x 29
outputs (435 values) are combined by a merger network with 29 phoneme
outputs]
24 - Other Approaches to ASR: TRAPS
These phoneme estimates can then be used in an HMM system as estimates of
bj(ot), and compared with a standard HMM/ANN hybrid.
25 - Other Approaches to ASR: Templates as Non-Stationary States
Combination of HMMs and DTW-based recognition (Ghitza, Sondhi 1993).
Each observation does not occupy a single 10-msec frame, but a variable
duration, with a time-varying output.  Diphones are used to maximize the
change within units.  A labeled database of diphones is used to create
templates for each diphone.  The templates are composed of 11th-order
cepstral features with a frame rate of 10 msec.
[Figure: diphone templates, e.g. pau>m, m>aa, aa>r, r>kc, kc>k, k>ao,
ao>f, f>pau]
26 - Other Approaches to ASR: Templates as Non-Stationary States
Then, compute p(O,T | q) = p(O | T,q) p(dur=T | q), where T is the
duration of the diphone.  p(O | T,q) is computed by (a) DTW between the
observations and the template, and (b) classification with a Gaussian PDF
using means and covariances between the warped observations and the
template.  p(dur=T | q) is replaced by a penalty factor of 0.0 for
durations beyond half or twice the duration of the template.  Recognition
is done using a technique similar to a Viterbi search, maximizing
p(O,T | q) instead of just p(O | q), where T = t1, t2, ..., tN.  A hedged
sketch of the duration penalty follows.
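The sketch below combines a DTW/Gaussian score for p(O | T,q) with the
hard duration constraint described above (score zeroed outside half to
twice the template duration); the function name and interface are
hypothetical.

/* Sketch: score a hypothesized diphone q over T observation frames as
   p(O,T|q) = p(O|T,q) * p(dur=T|q), with the duration term replaced by
   a hard penalty outside [half, twice] the template duration. */
float diphone_score(float p_obs_given_T_q,   /* from DTW + Gaussian PDF */
                    int T,                   /* hypothesized duration   */
                    int template_dur)        /* template duration       */
{
    if (2 * T < template_dur || T > 2 * template_dur)
        return 0.0f;                         /* duration penalty of 0.0 */
    return p_obs_given_T_q;
}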
27 - Other Approaches to ASR: Templates as Non-Stationary States
Results for a single speaker uttering TIMIT sentences, evaluated on other
sentences:

             Correct   Ins.   Del.   Accuracy
  Proposed    70.1     23.5   4.2    42.4
  Baseline    63.6     56.1   3.5     4.0

Why is the baseline performance so bad?  Typical speaker-independent
performance on TIMIT is approximately 70% accuracy.  Why are there so
many insertions in both systems?  Not clear.  How could this be extended
to a speaker-independent system (in other words, move away from a static
DTW-based template)?  Not clear.  How is the Viterbi search implemented?
Not clear.
28 - Other Approaches to ASR: TRACE
- TRACE (McClelland and Elman, 1986) has an architecture similar to a
  neural network: (a) it is based on the interactive activation of simple
  units, using excitatory and inhibitory interactions among a large
  number of such units, (b) each unit stands for a hypothesis about the
  input, such as the existence of a bilabial plosive, and the activation
  of a unit is monotonically related to the strength of the hypothesis,
  (c) mutually consistent hypotheses have mutually excitatory
  connections, and inconsistent hypotheses have inhibitory connections
  (a rough sketch of one activation update is given after this list).
- Units may be connected at the same level of abstraction, or at
  different levels.  However, there is no between-level inhibition.  The
  three primary levels are feature, phoneme, and word.
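As an illustration only (not TRACE's exact equations), one activation
update in the style of standard interactive-activation models might look
like the following; all constants and the precise update rule are
assumptions.

/* Sketch of one interactive-activation update step for a single unit.
   The constants and the update rule are illustrative, not TRACE's. */
#define A_MAX   1.0f
#define A_MIN  -0.2f
#define A_REST -0.1f
#define DECAY   0.1f

float update_activation(float a, float net /* excitatory minus inhibitory input */)
{
    float delta;
    if (net > 0.0f)
        delta = net * (A_MAX - a) - DECAY * (a - A_REST);
    else
        delta = net * (a - A_MIN) - DECAY * (a - A_REST);
    a += delta;
    if (a > A_MAX) a = A_MAX;   /* clip to the allowed activation range */
    if (a < A_MIN) a = A_MIN;
    return a;
}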
29 - Other Approaches to ASR: TRACE
- A frame-based architecture is used, but the size of each frame
  increases with the abstractness of the features
- Computationally plausible: two prototypes were built, one for
  processing real speech of a single speaker uttering monosyllables,
  another for processing mock or synthetic speech to illustrate more
  complex recognition.  However, this work has not been extended to a
  single complete connected-word ASR system.
- The feature level contains 7 distinct features, each distinct feature
  having 8 possible values from low to high and a 9th value for silence
- The response of a unit to its inputs is the Bayesian combination of its
  input probabilities
30 - Other Approaches to ASR: TRACE
31 - Why is the HMM the dominant technique for ASR?
- well-defined mathematical structure
- does not require expert knowledge about the speech signal (more people
  study statistics than study speech)
- errors in analysis don't propagate and accumulate
- does not require prior segmentation
- the temporal property of speech is accounted for
- does not require a prohibitively large number of templates
- results are usually the best or among the best