1. CSE 552/652: Hidden Markov Models for Speech Recognition
Spring 2005
Oregon Health & Science University, OGI School of Science & Engineering
John-Paul Hosom
Lecture Notes for May 9: One-Pass, Two-Level, Level Building

2. Forward-Backward Training: Multiple Observation Sequences
- Usually, training is performed on a large number of separate observation sequences, e.g. multiple examples of the word "yes."
- If we denote individual observation sequences with a superscript, where O(i) is the ith observation sequence, then we can consider the set of all K observation sequences used in training.
- We want to maximize the probability of this entire training set, given the model.
- The re-estimation formulas are based on frequencies of events for an observation sequence O = o1, o2, ..., oT, e.g.
(equations 1, 2, and 3)
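As a reference point (these are the standard forms from Rabiner's presentation of Baum-Welch, and are assumed here to correspond to the numbered equations on this slide), the quantity being maximized over the K training sequences, and the single-sequence re-estimate of a_ij in terms of the forward and backward variables, are:

P(\mathbf{O} \mid \lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \lambda) = \prod_{k=1}^{K} P_k

\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
             = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}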
3. Forward-Backward Training: Multiple Observation Sequences

- If we have multiple observation sequences, then we can re-write the re-estimation formulas for specific sequences, e.g.
- For example, let's say we have two observation sequences, each of length 3, and furthermore, let's pretend that the following are reasonable numbers:

(equation 4)
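A per-sequence re-estimate of the transition probability, written with a superscript k for the kth sequence (a standard form, assumed here to be what this slide's equation shows), is:

\bar{a}_{ij}^{(k)} = \frac{\sum_{t=1}^{T_k-1} \alpha_t^{(k)}(i)\, a_{ij}\, b_j(o_{t+1}^{(k)})\, \beta_{t+1}^{(k)}(j)}{\sum_{t=1}^{T_k-1} \alpha_t^{(k)}(i)\, \beta_t^{(k)}(i)}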
4. Forward-Backward Training: Multiple Observation Sequences

- If we look at the transition probabilities computed separately for each sequence O(1) and O(2), then:
- One way of computing the re-estimation formula for aij is to set the weight wk to 1.0 and then combine the sequences:
- Another way of re-estimating is to give each individual estimate equal weight by computing the mean, e.g.

(equation 5)
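Two standard ways of combining the per-sequence statistics (presumably the forms that equations 4 and 5 refer to, though the slide's exact notation is assumed): a weighted sum of the per-sequence numerators and denominators, and a simple average of the per-sequence estimates:

\bar{a}_{ij} = \frac{\sum_{k=1}^{K} w_k \sum_{t=1}^{T_k-1} \alpha_t^{(k)}(i)\, a_{ij}\, b_j(o_{t+1}^{(k)})\, \beta_{t+1}^{(k)}(j)}{\sum_{k=1}^{K} w_k \sum_{t=1}^{T_k-1} \alpha_t^{(k)}(i)\, \beta_t^{(k)}(i)}

\bar{a}_{ij} = \frac{1}{K} \sum_{k=1}^{K} \bar{a}_{ij}^{(k)}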
5. Forward-Backward Training: Multiple Observation Sequences

- Rabiner proposes using a weight inversely proportional to the probability of the observation sequence, given the model.
- This weighting gives greater weight in the re-estimation to those utterances that don't fit the model well.
- This is reasonable if one assumes that in training, the model and data should always have a good fit.
- However, we assume that from the (known) words in the training set we can obtain the correct phoneme sequences in the training set. But this assumption is in many cases not valid. Therefore, it can be safer to use a weight of wk = 1.0.
- Also, when dealing with very small values of P(O | λ), small changes in P(O | λ) can yield large changes in the weights.
(equation 6)
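Rabiner's proposed weight, in the notation used above, is the inverse of the probability of the kth sequence given the model:

w_k = \frac{1}{P_k} = \frac{1}{P(O^{(k)} \mid \lambda)}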
6. Forward-Backward Training: Multiple Observation Sequences

- For the third project, you may implement either equation 4 or 5 (above) when dealing with multiple observation sequences (multiple recordings of the same word, in this case).
- As noted in Lecture 8 (slides 25, 26), implementation of either solution involves the use of accumulators: the idea is to add values to the accumulator for each file, and then, when all files have been processed, compute the new model parameters. For example, for equation 4, the numerator of the accumulator contains the sum (over each file) of the per-file numerator terms, and the denominator contains the sum (over each file) of the per-file denominator terms. For equation 5, the accumulator contains the sum of the individual per-file estimates, and this sum is then divided by K.
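A minimal sketch of the accumulator idea in Python (function and variable names are illustrative, not taken from the project's template code): per-file numerator and denominator statistics for a_ij are summed across files, and the division happens only after all files have been processed.

import numpy as np

def accumulate_aij(alpha, beta, A, B_obs):
    """Per-file numerator/denominator statistics for the a_ij update.

    alpha, beta : (T, N) forward/backward variables for this file
    A           : (N, N) current transition probabilities
    B_obs       : (T, N) observation likelihoods b_j(o_t) for this file
    Returns (num, den), each summed over t = 1..T-1.
    """
    T, N = alpha.shape
    num = np.zeros((N, N))
    den = np.zeros(N)
    for t in range(T - 1):
        # numerator term: alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        num += np.outer(alpha[t], B_obs[t + 1] * beta[t + 1]) * A
        # denominator term: alpha_t(i) * beta_t(i)
        den += alpha[t] * beta[t]
    return num, den

def reestimate_A(per_file_stats, weights=None):
    """Combine per-file statistics (equation-4 style): weighted sums of
    numerators and denominators across files, then one division at the end."""
    if weights is None:
        weights = [1.0] * len(per_file_stats)   # w_k = 1.0
    N = per_file_stats[0][0].shape[0]
    num_acc = np.zeros((N, N))
    den_acc = np.zeros(N)
    for w, (num, den) in zip(weights, per_file_stats):
        num_acc += w * num
        den_acc += w * den
    return num_acc / den_acc[:, None]

The same pattern applies to the mean and covariance accumulators; in practice the alpha/beta values would also be scaled (or the w_k = 1/P_k weight applied) to avoid numerical underflow.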
7. Project 3: Forward-Backward Algorithm

- Given existing data files of speech, implement the forward-backward (EM, Baum-Welch) algorithm to train HMMs.
- Template code is available to read in features, write out HMM values to an output file, and provide some context and a starting point.
- The features in the speech files are real, in that they are 7 cepstral coefficients plus 7 delta values from utterances of "yes" and "no," sampled every 10 msec.
- All necessary files (data files and the list of files to train on) are in the project3.zip file on the class web site.
- Train an HMM on the word "no" using the list nolist.txt, which contains the filenames no_1.txt, no_2.txt, and no_3.txt.
- Train for 10 iterations.
8. Project 3: Forward-Backward Algorithm

- The HMM should have 7 states, the first and last of which are "NULL" states.
- You can use the first NULL state to store information about π, and you can start off assuming that the π value for the first "real" (non-null) state is 1.0 and that π for all other states is zero.
- You can use any method to get initial HMM parameters; the flat-start method is easiest (see the sketch after this list).
- You can use only one mixture component in training, and you can assume a diagonal covariance matrix.
- Updating of the parameters using the accumulators is currently set up for accumulating numerators and denominators separately for aij, means, and covariances. If you want to do the updating differently (using only one accumulator each for aij, means, and covariances), feel free to do so.
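A minimal sketch of one way to do a flat start (an assumed approach, not the project's template code): every emitting state gets the global mean and variance of the training features, and transitions get a simple uniform left-to-right topology.

import numpy as np

def flat_start(features, n_states=7):
    """Flat-start initialization: identical Gaussians for all emitting
    states, uniform transitions. 'features' is a list of (T, D) arrays."""
    all_frames = np.vstack(features)
    global_mean = all_frames.mean(axis=0)
    global_var = all_frames.var(axis=0)          # diagonal covariance

    n_emit = n_states - 2                        # first and last states are NULL
    means = np.tile(global_mean, (n_emit, 1))
    variances = np.tile(global_var, (n_emit, 1))

    # simple left-to-right topology: 0.5 self-loop, 0.5 forward transition
    A = np.zeros((n_states, n_states))
    A[0, 1] = 1.0                                # NULL start -> first emitting state
    for s in range(1, n_states - 1):
        A[s, s] = 0.5
        A[s, s + 1] = 0.5
    return A, means, variances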
9. Project 3: Forward-Backward Algorithm

- Sanity check: there are two kinds of sanity checks you can do. First, your output should be close to the HMM file for the word "no" that you used in the Viterbi project. (Results may not be exactly the same, depending on different assumptions made.) Second, you can compare alpha and beta values, as discussed in class, to make sure that they are equal in certain cases (see the identity after this list).
- Submit your results for the 10th iteration of training on the word "no."
- Due on June 1; send your source code and results (the file hmm_no.10 that you created) to hosom at cslu .ogi .edu; late responses generally not accepted.
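One standard alpha/beta consistency check of the kind referred to above: for a conventional HMM formulation (before any modifications for the NULL states), the state-wise product of alpha and beta sums to the same value, P(O | λ), at every frame:

\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \;=\; P(O \mid \lambda) \;=\; \sum_{i=1}^{N} \alpha_T(i), \qquad \text{for every } t,\; 1 \le t \le T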
10. Connected Word Recognition: One-Level, 2-Level, Level Building

So far, we've been using Viterbi search to find the best word-level result. Words are constructed as a network of nodes, and final nodes may connect to initial nodes for the continuous-speech case. Connecting word endings to word beginnings in a Viterbi search is also called the "one-level" search algorithm.

There are two additional approaches, 2-Level and Level Building, for finding the best word sequence, which may be based on Viterbi or DTW calculations. These two approaches are faster than One-Level and are appropriate in cases where the number of connected words N is known in advance. Also, N should be small but greater than 1. These two approaches are presented here for historical background and for their innovative algorithmic approaches to ASR, but they are generally not currently used.
11. Connected Word Recognition

In many cases, we want to recognize multiple words in sequence:

R     = reference word
V     = number of unique vocabulary items
m     = current frame in test utterance
M     = maximum time frames in test utterance
t(m)  = observation vector (features) at time m
T     = entire test utterance
L     = how many words we think were uttered
N_R   = duration (in frames) of hypothesized reference word R
12. Connected Word Recognition

We need some measure of distance between a hypothesized word and the observed speech data. For DTW-based recognition (Rabiner):

where d(·, ·) is a local distance measure (Euclidean, etc.), w(m) is the warping at time m that aligns observation vector t with the reference r, and r_v is a single frame of reference word R_v.
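In the notation above, the DTW distance between reference word R_v and a segment of the test utterance running from frame b to frame e is typically written as the sum of local distances along the best warping path, e.g.:

D(R_v, b, e) = \min_{w(\cdot)} \sum_{m=b}^{e} d\bigl(t(m),\, r_v(w(m))\bigr)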
13. Connected Word Recognition

Equivalent distances can be computed for Viterbi-search-based recognition (standard HMMs); note that these b's are different!

Here t(m) is the observation at time frame m, d(t(m), j) is the local distortion for observation t(m) in state j, D(R_j, m) is the cumulative distortion for word R in state j at frame m, the word-initial state of word R_j connects to the word-final state of the preceding word R_i, b is the frame at the beginning of the word R_j, and e is the frame at the end of the word R_j. Word R_i precedes word R_j. D(R, b, e) is the cumulative distortion for word R beginning at b and ending at e.
14. Connected Word Recognition

The straightforward solution is to iterate over all possible word combinations of length L, and all possible word lengths from Lmin to Lmax, finding the set of words with the lowest distance.

Rs = a (template of) possible series of words, of length L; there are V^L such possible series.

where D is the best distance value. But this is a very large number of distance measures, and each distance measure itself requires many calculations (with l = number of links in path, P = total number of paths in the heuristic).

For V = 10, Lmin = 7, Lmax = 11, there are about 10^11 computations.
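The count on this slide is consistent with summing the number of possible word strings over the allowed lengths:

\sum_{L=7}^{11} V^L = 10^7 + 10^8 + 10^9 + 10^{10} + 10^{11} \approx 1.1 \times 10^{11}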
15. The One-Pass Algorithm

For DTW systems (assuming one type of heuristic):

where dA(·) is the accumulated distance up to frame m of the test utterance and reference frame n of Rv (2 ≤ n ≤ Nv), for reference utterance v (1 ≤ v ≤ V). At the beginning of a reference pattern Rv (n = 1), the accumulated distance is the local distance plus the minimum of (transitioning from the last frame of a previous word) or (staying in the same first frame of the current word). Finally, the overall best score is the minimum accumulated distance over all word-ending frames at the last test frame M.
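A sketch of the recursions described above (notation follows the slide; the exact within-word path constraint depends on the local heuristic assumed):

d_A(m, n, v) = d\bigl(t(m), r_v(n)\bigr) + \min_{n-2 \le n' \le n} d_A(m-1, n', v), \qquad 2 \le n \le N_v

d_A(m, 1, v) = d\bigl(t(m), r_v(1)\bigr) + \min\Bigl( d_A(m-1, 1, v),\; \min_{v'} d_A(m-1, N_{v'}, v') \Bigr)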
16. The One-Pass Algorithm

For HMM systems:

This is the standard Viterbi formula if the templates R_j are individual states and the R_i are the final states in each word i. In this case, the accumulated distance is the minimum of (distance for staying in the same state + local distance) and (distance for coming from a different state + local distance). At the beginning of a word, the "different states" are the ending states of all words. We need to keep track not only of the backtrace ψ, but also of where word boundaries occur. (When transitioning between two states, we know whether this transition marks a word boundary.)
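Written in distortion form (assuming local distortion d(t(m), j) = -log b_j(t(m)) and transition cost -log a_ij, which is one common way to treat Viterbi scores as distances), the recursion sketched above is roughly:

D(j, m) = d\bigl(t(m), j\bigr) + \min_i \bigl[ D(i, m-1) - \log a_{ij} \bigr]

where, for a word-initial state j, the minimization over i also includes the word-final states of all words, and ψ records the best predecessor (and whether it crossed a word boundary).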
17. The One-Pass Algorithm

(figure: search network running from sentence beginning to sentence end)

Advantage: if we connect word-end states to word-beginning states, implementation is easy (Viterbi search).
Problem: we can't specify beforehand the exact number of words in the utterance.
Solution: use a grammar. But implementing a grammar is difficult.
18. The One-Pass Algorithm

Example (figure): a frame-by-frame alignment of the test utterance to the phone states w, n, tc, th, u, with the word boundary found by the search.
19. The 2-Level Dynamic Programming Algorithm

Warning!! The name should really be "2-Step" Dynamic Programming.

Step 1: match every possible word Rv with every possible range of the test utterance T. For each range of the test utterance, only save the best word (and score).
Step 2: use dynamic programming to select the set of ranges that (a) covers the entire range of the test utterance T, and (b) has the best overall score for a given number of words, L.
(Step 3, optional): choose the word length with the best overall score.

(figure: the distance of the best word beginning at frame 3 and ending at frame 4)
20. The 2-Level Algorithm

Step 1: compute distances.

(figure: a grid indexed by begin frame (1 to 6) and end frame (1 to 6); each cell stores the best score and best word from b to e. For example, the best word between frames 2 and 4 is found by choosing the minimum of D(R1,2,4), D(R2,2,4), D(R3,2,4), and D(R4,2,4), where each D(Rv,2,4) is the Viterbi or DTW score for word v beginning at time 2 and ending at time 4.)
21. The 2-Level Algorithm

Step 2: determine the best sequence of best words. The accumulated cost at a given level combines the cost of the best word from b to e with the accumulated cost of l-1 words ending at time b-1.

- Evaluate at time e = M to determine the best l words over the test data.
- Choose the minimum value over all values of l if the exact number of words is not known in advance.
- The word sequence is obtained from the word pointers.
22. The 2-Level Algorithm

Step 2, whole algorithm:
Part (1): Initialization
Part (2): Build level 1 (corresponding to word 1)
Part (3): Iterate for all values of b < e ≤ M, 2 ≤ l ≤ Lmax
Part (4): Terminate
23. The 2-Level Algorithm

Example (from Rabiner & Juang, p. 398): given these D, what are the best paths for 1-, 2-, and 3-word matches?
24. The 2-Level Algorithm

Best N-word path: 1 word, from t = 1 to t = 15, score D1(15) = 60.
25. The Level-Building Dynamic Programming Algorithm

The Nth level in LB corresponds to the Nth word in the hypothesized word string.

Idea: instead of computing distances for all words at all begin and end times, do this:
(1) compute distances for all words with begin time 1, up to the maximum end times
(2) at each possible end time, select the best first word
(3) compute distances for all words beginning where the previous words left off, up to the maximum end times
(4) at each possible end time, select the best second word
Repeat until reaching level (word number) Lmax.
26. The Level-Building Dynamic Programming Algorithm

Define the minimum accumulated distance at level (word) l, with word v, up to frame m. We can evaluate this from mv1(l) to mv2(l).

For DTW with 2:1 expansion and compression, at level 1: mv1(1) = ½ × (length of reference pattern v), and mv2(1) = 2 × (length of reference pattern v).

For HMMs, at level 1: mv1(1) = (number of states in the HMM), and mv2(1) = M, or some reasonable maximum length of v.

The first output of level 1 is the matrix of these distances.
27. The Level-Building Dynamic Programming Algorithm

Then, for each frame m, we compute the best distance at level l up to frame m. We also store the word v that resulted in this best distance, and the starting frame. Then we start the second level, with beginning frames in the range determined by the ending frames of the first level, and search all words beginning at these frames, with the initial accumulated distance scores taken from the results of the first level. Finally, the overall best score (the global best) is found at the final frame M.
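A sketch of the per-level bookkeeping, using D_l(v, m) for the minimum accumulated distance at level l with word v up to frame m (this notation is assumed here):

P_l(m) = \min_{1 \le v \le V} D_l(v, m), \qquad W_l(m) = \arg\min_{1 \le v \le V} D_l(v, m)

\text{global best} = \min_{1 \le l \le L_{\max}} P_l(M)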
28-30. The Level-Building Dynamic Programming Algorithm (figures illustrating the level-building search)
31. Comparison of Approaches

The 2-Level, LB, and One-Pass algorithms in general provide the same answer; the differences are:
(a) 2-Level: can be done time-synchronously, requires more computation than LB, and can specify the exact number of words in the utterance.
(b) Level-Building: can not be done time-synchronously, requires less computation than 2-Level, and can specify the exact number of words in the utterance.
(c) One-Pass: can be done time-synchronously, requires more computation than 2-Level, and can not specify the exact number of words in the utterance without using a finite-state grammar.
32. Reading

Rabiner & Juang, Chapter 6
Rabiner & Juang, Chapter 7 (pp. 390-433)