1
  • CSE 552
  • Hidden Markov Models for Speech Recognition
  • Spring, 2004
  • Oregon Health & Science University
  • OGI School of Science & Engineering
  • John-Paul Hosom
  • Lecture Notes for May 26
  • Other Approaches to ASR

2
Final Project & Final Exam
  • Forward-Backward project due June 2
  • Assume you need only one mixture component in GMM
  • Use 5 states (plus two null states) for training; if you want to use a different number, that's fine, but state your assumptions.
  • Send C code and output of the "no" model (after 10 training iterations) to hosom@cse.ogi.edu
  • Results should be very similar to the HMM models given to you in project 2 (Viterbi search).
  • Final exam will be available after the last class
    and due at the end of finals week (Friday, June
    11).

3
Other Approaches to ASR Segment-Based Systems
  • SUMMIT system
  • developed at MIT by Zue, Glass, et al.
  • competitive performance at phoneme
    classification
  • segment-based recognition:
    (a) segment the speech at possible phonetic boundaries
    (b) create a network of (sub-)phonetic segments
    (c) classify each segment
    (d) search segment probabilities for the most likely sequence of phonetic segments
  • complicated

4
Other Approaches to ASR Segment-Based Systems
  • SUMMIT system dendrogram
  [Figure: dendrogram of hypothesized segments, built up from a measure of spectral change]
  • segment network can also be created using
    segmentation by recognition

5
Other Approaches to ASR Segment-Based Systems
  • segment-based recognition:
    (a) segment the speech at possible phonetic boundaries
    (b) create a network of (sub-)phonetic segments
    (c) classify each segment
    (d) search segment probabilities for the most likely sequence of phonetic segments
  • (a, b) segmentation by recognition (A* search)

[Figure: segment network produced by segmentation-by-recognition, with hypothesized phone labels (e.g. pau, m, ao, r, f, aa, kc, k, t, tc, n, v, ah)]
6
Other Approaches to ASR Segment-Based Systems
(c) classification: classify each segment (not frame-by-frame) using information from throughout the segment:
  • MFCC and energy averages over each segment third,
  • feature derivatives at segment boundaries,
  • segment duration,
  • number of boundaries within the segment
(a small sketch of segment-level features of this kind follows this slide).
(d) search: Viterbi search through the segments to determine the best sequence of segments (phonemes). The search is complicated by the fact that it must consider all possible segmentations, not just the hypothesized segmentation.
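As an illustration of the kind of segment-level features listed in step (c), the following sketch assembles one feature vector per hypothesized segment. It is only a sketch: the function name, array shapes, and exact feature inventory are assumptions for illustration, not the actual SUMMIT feature set.

  import numpy as np

  def segment_features(mfcc, energy, start, end, n_internal_boundaries):
      # mfcc: (num_frames, num_coeffs) array; energy: (num_frames,) array
      # start/end: frame indices of the hypothesized segment (at least 3 frames long)
      seg = mfcc[start:end]
      thirds = np.array_split(seg, 3)                 # MFCC averages over each segment third
      avg_per_third = [t.mean(axis=0) for t in thirds]
      deriv_left = seg[1] - seg[0]                    # crude feature derivative at left boundary
      deriv_right = seg[-1] - seg[-2]                 # crude feature derivative at right boundary
      duration = end - start                          # segment duration in frames
      return np.concatenate(avg_per_third +
                            [deriv_left, deriv_right,
                             [energy[start:end].mean(), duration, n_internal_boundaries]])

Each hypothesized segment in the network would get one such vector, which is then passed to the segment classifier.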
7
Other Approaches to ASR Segment-Based Systems
  • FEATURE system
  • developed at CMU by Cole, Stern, et al.
  • competitive performance at alphabet
    classification
  • segment-based recognition of a single letter:
    (a) extract information about the signal, including spectral properties, F0, and energy in frequency bands
    (b) locate 4 points in the utterance: beginning of utterance, beginning of vowel, offset of vowel, end of utterance
    (c) extract 50 features (from step (a), at the 4 locations)
    (d) use a decision tree to determine the probabilities of each letter
  • fragile: errors in feature extraction or segmentation cannot be recovered from

8
Other Approaches to ASR Segment-Based Systems
  • Determine letter/digit boundaries using HMM/ANN
    hybrid (can only recover from substitution
    errors)
  • For each hypothesized letter/digit:
    (a) locate 4 points in the word: beginning of word, beginning of sonorant, offset of sonorant, end of word
    (b) extract information at segment boundaries and at some points within the sonorant region: PLP features, zero crossings, peak-to-peak amplitude
    (c) use an ANN to classify these features into 37 categories (26 letters + 11 digits)
  • Telephone-band performance of almost 90%; recent HMM performance of just over 90%.

9
Other Approaches to ASR Template-Based Systems
  • Template-based systems with time alignment
  • useful for small vocabulary, isolated-word,
    single-speaker recognition (e.g. cell phone
    applications)
  • uses Dynamic Time Warping (DTW): warp two sequences of feature vectors, a reference pattern and a test pattern, so as to minimize the distortion between them (a small DTW sketch follows this slide)
  • recognize words by having multiple reference patterns, at least 1 reference pattern (template) per word
  • the recognized word is the template with the minimum distance between the warped template and the test word
  • much simpler than the segment-based approach; only need to compare frames of speech and find the smallest distortion.
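A minimal sketch of the DTW comparison mentioned above, assuming ref and test are (num_frames, num_coeffs) arrays of per-frame features (e.g. MFCCs); the names and the plain Euclidean local distance are illustrative choices, not the course's implementation.

  import numpy as np

  def dtw_distance(ref, test):
      # Minimum cumulative frame-to-frame distortion between a reference
      # template and a test pattern, with (diagonal, left, down) predecessors.
      R, T = len(ref), len(test)
      d = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)  # local distances
      D = np.full((R, T), np.inf)
      D[0, 0] = d[0, 0]
      for i in range(R):
          for j in range(T):
              if i == 0 and j == 0:
                  continue
              best_prev = min(D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                              D[i - 1, j] if i > 0 else np.inf,
                              D[i, j - 1] if j > 0 else np.inf)
              D[i, j] = d[i, j] + best_prev
      return D[-1, -1]

  # Recognition: pick the template whose warped distance to the test word is smallest.
  # templates = {"yes": yes_feats, "no": no_feats}   # hypothetical per-word templates
  # recognized = min(templates, key=lambda w: dtw_distance(templates[w], test_feats))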

10
Other Approaches to ASR Stochastic Approaches
  • includes HMMs and HMM/ANN hybrids

11
Other Approaches to ASR Stochastic Approaches
  • HMM/ANN hybrids (ICSI, CSLU, Cambridge)
  • Technique
  • Same as conventional HMM, except
  • (1) Replace GMM with ANN to compute
    observation probabilities
  • (2) Divide by prior probability of each class
  • Why?
  • ANNs compute posterior probabilities, if certain conditions are met (Duda & Hart; Richard & Lippmann; Kanaya & Miyake; several others).
  • The criteria are: a sufficient number of hidden nodes, sufficient training data, a mean-square-error training criterion, etc.
  • M.D. Richard and R.P. Lippmann, "Neural Network Classifiers Estimate Bayesian a posteriori Probabilities," Neural Computation, vol. 3, no. 4, pp. 461-483, Winter 1991.

12
Other Approaches to ASR Stochastic Approaches
From Duda & Hart: ANNs compute p(cj | ot), while HMMs use observation probabilities p(ot | cj). From Bayes' rule,
  p(ot | cj) = p(cj | ot) p(ot) / p(cj).
Because p(ot) is constant for all cj, the ratio p(cj | ot) / p(cj) represents an unnormalized version of p(ot | cj).
13
Other Approaches to ASR Stochastic Approaches
Training HMM/ANN Hybrids: generate a file containing feature vectors and an index that indicates which phonetic class each feature vector belongs to.
[Figure: one row per frame: PLP coefficients (or MFCCs, with or without deltas) followed by the phonetic-class index]
14
Other Approaches to ASR Stochastic Approaches
Train a neural network on each of the feature
vectors, with the target value 1.0 for the
associated phonetic class and 0.0 for all other
classes. The network may be feed-forward (OGI,
ICSI) or recurrent (Cambridge). Usually
fully-connected, trained using back-propagation.
[Example frame: targets y=0.0, eh=0.0, s=1.0, n=0.0, ow=0.0, pau=0.0; inputs PLP0..PLP6 = 0.5631, -0.3687, -0.0673, 1.2241, -0.8383, 0.3568, 0.4660]
(a small training sketch using this frame follows this slide)
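A minimal sketch of the kind of fully-connected network described above, trained frame-by-frame by back-propagation with 1.0/0.0 targets under a mean-square-error criterion. The layer sizes, learning rate, and six-class set are illustrative assumptions; only the example frame is taken from this slide.

  import numpy as np

  rng = np.random.default_rng(0)
  n_in, n_hid, n_out = 7, 20, 6                 # PLP0..PLP6 in; 6 phonetic classes out (illustrative)
  W1 = rng.normal(0.0, 0.1, (n_in, n_hid))
  W2 = rng.normal(0.0, 0.1, (n_hid, n_out))

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def train_step(x, target, lr=0.1):
      # One back-propagation step on a single frame (x: features, target: one-hot).
      global W1, W2
      h = sigmoid(x @ W1)                       # hidden activations
      y = sigmoid(h @ W2)                       # output activations (per-class scores)
      dy = (y - target) * y * (1 - y)           # MSE gradient at the outputs
      dh = (dy @ W2.T) * h * (1 - h)            # gradient propagated to the hidden layer
      W2 -= lr * np.outer(h, dy)
      W1 -= lr * np.outer(x, dh)
      return y

  # The example frame from this slide: target class 's' (index 2) set to 1.0.
  x = np.array([0.5631, -0.3687, -0.0673, 1.2241, -0.8383, 0.3568, 0.4660])
  t = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
  train_step(x, t)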
15
Other Approaches to ASR Stochastic Approaches
During recognition, present the network with a
feature vector, and use the outputs as a
posteriori probabilities of each class
[Example frame: network outputs y=0.03, eh=0.12, s=0.01, n=0.02, ow=0.81, pau=0.00; inputs PLP0..PLP6 = 1.2904, -0.2140, -0.9214, -0.5846, -0.6672, 0.9754, -0.4610]
Then, divide each output by the a priori probability of that class:
  y: 0.03/0.08, eh: 0.12/0.17, s: 0.01/0.25, n: 0.02/0.08, ow: 0.81/0.25, pau: 0.00/0.17
to arrive at the (scaled) observation probabilities, e.g. p_ow(ot) = 0.81/0.25 = 3.24 (checked in the short sketch below).
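The division by priors shown above can be checked directly; the priors are assumed here to be training-set class frequencies, as is usual for this scaling.

  posteriors = {"y": 0.03, "eh": 0.12, "s": 0.01, "n": 0.02, "ow": 0.81, "pau": 0.00}
  priors     = {"y": 0.08, "eh": 0.17, "s": 0.25, "n": 0.08, "ow": 0.25, "pau": 0.17}

  scaled = {c: posteriors[c] / priors[c] for c in posteriors}
  print(scaled["ow"])   # 3.24, matching p_ow(ot) on this slide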
16
Other Approaches to ASR Stochastic Approaches
  • Instead of dividing by the a priori likelihoods, the training process may be modified to output estimates of p(cj | ot) / p(cj) directly (Wei and van Vuuren, 1998).
  • The training process requires that each feature
    vector be associated with a single phoneme with a
    target probability of 1.0. Training an ANN takes
    longer than training a GMM (because one feature
    vector affects all outputs).
  • Therefore, training of HMM/ANN hybrids is
    typically performed on hand-labeled or
    force-aligned data, not using the
    forward-backward training procedure. However,
    there are methods for incorporating
    forward-backward training into HMM/ANN
    recognition (e.g. Yan, Fanty, Cole 1997).

17
Other Approaches to ASR Stochastic Approaches
  • Advantages of HMM/ANN hybrids
  • Input features may be correlated
  • Discriminant training
  • Fast execution time
  • Disadvantages of HMM/ANN hybrids
  • Long training time
  • Inability to tweak individual phoneme models
  • Relatively small number of output categories
    possible (500 max.)
  • Performance
  • Comparable with HMM/GMM systems
  • Slightly better performance on smaller tasks
    such as phoneme or digit recognition, slightly
    worse performance on large-vocabulary tasks.

18
Other Approaches to ASR
  • Variants on HMMs
  • multi-band recognition, temporal patterns
    (Hermansky)
  • phonetic-transition-based recognition (Bourlard,
    Hosom)
  • syllable-based recognition (Wu)
  • overlapping, low-level articulatory features
    (Deng)
  • dynamic templates (Ghitza & Sondhi)

19
Other Approaches to ASR SPAM
Stochastic Perceptual Auditory Model (SPAM): based on evidence from Furui (1976) and Strange (1999) that humans recognize speech largely from phonetic transitions; steady-state regions are considered (much) less important.
  • HMM/ANN system trained on phonetic boundaries
  • all steady-state regions classified as
    non-transition
  • performance increases when combined with a standard HMM

[Figure: the word "one" (w uh n) modeled as transition units sil-w, w-uh, uh-n, n-sil, with non-transition (ntr) regions between them]
20
Other Approaches to ASR Modified Diphones
Modified Diphones: both transition regions and steady-state regions are important
  • Advantages
  • optional steady state allows recognition of
    fast and normal-rate speech
  • accounts for both phonetic transition and
    steady-state info.
  • 13% reduction in error on a digits task
  • Disadvantages
  • restricted task due to large number of diphones
  • context-independent phonemes for steady states
    due to large number of output categories

21
Other Approaches to ASR TRAPS
In contrast to typical speech feature vectors, which represent the entire spectrum at one small time frame, TRAPS represent the energy in one frequency band over a long time frame (Sharma & Hermansky, 1998).
22
Other Approaches to ASR TRAPS
The average TRAP (TempoRAl Pattern) for each
phoneme can be computed, using the center frame
of each labeled phoneme as reference. These
mean TRAPS show characteristics specific to
each phoneme.
23
Other Approaches to ASR TRAPS
Pattern classification can be done by using the 1-second-long TRAPS as input (100 inputs), 300 hidden nodes, and one phoneme per output node (29 for the digits task). Phonetic classification can be improved by combining the outputs of TRAPS from different frequency bands (a small extraction sketch follows this slide).
[Figure: per-band TRAP classifiers for band 1 ... band 15 feed a merger network with 435 inputs (15 bands x 29 outputs) and 29 phoneme outputs]
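A sketch of how one band's TRAP input might be extracted before classification; the window length, mean normalization, and function name are assumptions for illustration.

  import numpy as np

  def extract_trap(band_energy, center_frame, half_width=50):
      # band_energy: log energies in a single critical band, one value per 10-ms frame.
      # Returns the ~1-second (100-frame) temporal pattern around center_frame,
      # mean-normalized, which then feeds that band's 100-input classifier.
      start = center_frame - half_width
      trap = band_energy[start:start + 2 * half_width]
      return trap - trap.mean()

  # One such TRAP per frequency band; a merger network then combines the
  # per-band classifier outputs into a single phoneme estimate.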
24
Other Approaches to ASR TRAPS
These phoneme estimates can then be used in an HMM system as estimates of bj(ot), and compared with a standard HMM/ANN hybrid.
25
Other Approaches to ASR: Templates as Non-Stationary States
Combination of HMMs and DTW-based recognition (Ghitza & Sondhi, 1993). Each observation does not
occupy a single 10-msec frame, but a variable
duration, with time-varying output. Diphones are
used to maximize the change within units. A
labeled database of diphones is used to create
templates for each diphone. The templates are
composed of 11th-order cepstral features with a
frame rate of 10 msec.
[Figure: diphone templates for an utterance, e.g. m>aa, aa>r, kc>k, pau>m, r>kc, f>pau, k>ao, ao>f]
26
Other Approaches to ASR: Templates as Non-Stationary States
Then, compute p(O, T | q) = p(O | T, q) p(dur = T | q), where T is the duration of the diphone.
p(O | T, q) is computed by (a) DTW between the observations and the template, and (b) classification with a Gaussian PDF using means and covariances between the warped observations and the template.
p(dur = T | q) is replaced by a penalty factor of 0.0 for durations beyond half or twice the duration of the template.
Recognition is done using a technique similar to Viterbi search, maximizing p(O, T | q) instead of just p(O | q), where T = t1, t2, ..., tN (a small scoring sketch follows this slide).
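A sketch of the per-diphone score described above, reusing the dtw_distance sketch from the DTW slide. The mapping from DTW distortion to an acoustic score via exp(-distortion) is an assumption for illustration; Ghitza & Sondhi use a Gaussian PDF over the warped observations.

  import numpy as np

  def diphone_score(obs, template):
      # p(O, T | q) ~ p(O | T, q) * p(dur = T | q), with the duration term set to
      # zero outside [0.5, 2.0] times the template duration, as on this slide.
      T = len(obs)
      if T < 0.5 * len(template) or T > 2.0 * len(template):
          return 0.0                                  # duration penalty
      distortion = dtw_distance(template, obs)        # DTW sketch from the earlier slide
      return np.exp(-distortion)                      # stand-in for the Gaussian acoustic score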
27
Other Approaches to ASRTemplates as
Non-Stationary States
Results for a single speaker uttering TIMIT sentences, evaluated on other sentences:

              Correct   Ins.   Del.   Accuracy
  Proposed      70.1    23.5    4.2     42.4
  Baseline      63.6    56.1    3.5      4.0

Why is the baseline performance so bad? Typical speaker-independent performance on TIMIT is about 70% accuracy.
Why so many insertions in both systems? Not clear.
How would this extend to a speaker-independent system (in other words, move away from a static DTW-based template)? Not clear.
How is the Viterbi search implemented? Not clear.
28
Other Approaches to ASR TRACE
  • TRACE (McClelland and Elman, 1986) has an architecture similar to a neural network:
    (a) based on interactive activation of simple units, using excitatory and inhibitory interactions of a large number of such units
    (b) each unit stands for a hypothesis about the input, such as the existence of a bilabial plosive; the activation of a unit is monotonically related to the strength of the hypothesis
    (c) mutually consistent hypotheses have mutually excitatory connections, and inconsistent hypotheses have inhibitory connections
  • Units may be connected at the same level of
    abstraction, or at different levels. However,
    there is no between-level inhibition. The three
    primary levels are feature, phoneme, and word

29
Other Approaches to ASR TRACE
  • A frame-based architecture is used, but the size
    of each frame increases with abstractness of the
    features
  • Computationally plausible: two prototypes were built, one for processing real speech of a single speaker uttering monosyllables, another for processing mock or synthetic speech to illustrate more complex recognition. However, this work has not been extended to a single complete connected-word ASR system.
  • The feature level contains 7 distinct features,
    each distinct feature having 8 possible values
    from low to high and a 9th value for silence
  • The response of a unit to inputs is the Bayesian
    combination of its input probabilities

30
Other Approaches to ASR TRACE
31
Why is the HMM the dominant technique for ASR?
  • well-defined mathematical structure
  • does not require expert knowledge about speech
    signal (more people study statistics than
    study speech)
  • errors in analysis don't propagate and accumulate
  • does not require prior segmentation
  • temporal property of speech is accounted for
  • does not require a prohibitively large number of
    templates
  • results are usually the best or among the best
