1
CSE 552/652 Hidden Markov Models for Speech Recognition
Spring 2005
Oregon Health & Science University, OGI School of Science & Engineering
John-Paul Hosom
Lecture Notes for May 9: One-Pass, Two-Level, Level Building
2
Forward-Backward Training: Multiple Observation Sequences
  • Usually, training is performed on a large number of separate
    observation sequences, e.g. multiple examples of the word "yes".
  • If we denote individual observation sequences with a superscript,
    where O^(i) is the ith observation sequence, then we can consider
    the set of all K observation sequences used in training:
    O = { O^(1), O^(2), ..., O^(K) }
  • We want to maximize
    P(O | λ) = Π_{k=1..K} P(O^(k) | λ)
  • The re-estimation formulas are based on frequencies of events for
    an observation sequence O = o_1, o_2, ..., o_T, e.g.
    ā_ij = (expected number of transitions from state i to state j)
           / (expected number of transitions from state i)
         = Σ_{t=1..T-1} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)
           / Σ_{t=1..T-1} α_t(i) β_t(i)
3
Forward-Backward Training: Multiple Observation Sequences
  • If we have multiple observation sequences, then we can re-write the
    re-estimation formulas for specific sequences, e.g. computing an
    estimate ā_ij^(k) from the α and β values of sequence O^(k) alone.
  • For example, let's say we have two observation sequences, each of
    length 3, and furthermore, let's pretend that the following are
    reasonable numbers:
    [per-sequence numerator and denominator values shown on the slide]
4
Forward-Backward Training: Multiple Observation Sequences
  • If we look at the transition probabilities computed separately for
    each sequence O^(1) and O^(2), then we get two different estimates
    of the same parameter.
  • One way of computing the re-estimation formula for a_ij is to set
    the weight w_k to 1.0 and then pool the per-sequence numerators and
    denominators:
    ā_ij = Σ_k (numerator of ā_ij for O^(k)) / Σ_k (denominator of ā_ij for O^(k))   (4)
  • Another way of re-estimating is to give each individual estimate
    equal weight by computing the mean, e.g.
    ā_ij = (1/K) Σ_k ā_ij^(k)   (5)
    A small numeric sketch of both options follows below.
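To make the difference between these two options concrete, here is a minimal Python sketch (my own illustration; the numerator and denominator values are made up, not the numbers from the slide):

```python
# Hypothetical per-sequence numerator/denominator values for one a_ij.
# These are NOT the numbers from the slide, just an illustration.
numerators   = [0.6, 0.2]    # per-sequence numerator sums for a_ij
denominators = [1.0, 0.25]   # per-sequence denominator sums for a_ij

# Equation 4: weight each sequence equally (w_k = 1.0) and pool the sums.
a_ij_eq4 = sum(numerators) / sum(denominators)                      # 0.8 / 1.25 = 0.64

# Equation 5: compute a_ij separately for each sequence, then average.
per_sequence = [n / d for n, d in zip(numerators, denominators)]    # [0.6, 0.8]
a_ij_eq5 = sum(per_sequence) / len(per_sequence)                    # 0.70

print(a_ij_eq4, a_ij_eq5)    # the two weighting schemes generally give different values
```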

5
Forward-Backward Training: Multiple Observation Sequences
  • Rabiner proposes using a weight inversely proportional to the
    probability of the observation sequence, given the model:
    w_k = 1 / P(O^(k) | λ)
  • This weighting gives greater weight in the re-estimation to those
    utterances that don't fit the model well.
  • This is reasonable if one assumes that, in training, the model and
    data should always have a good fit.
  • However, we assume that from the (known) words in the training set
    we can obtain the correct phoneme sequences in the training set.
    But this assumption is in many cases not valid.  Therefore, it can
    be safer to use a weight of w_k = 1.0.
  • Also, when dealing with very small values of P(O | λ), small
    changes in P(O | λ) can yield large changes in the weights.

6
Forward-Backward Training: Multiple Observation Sequences
  • For the third project, you may implement either equation 4 or 5
    (above) when dealing with multiple observation sequences (multiple
    recordings of the same word, in this case).
  • As noted in Lecture 8 (slides 25, 26), implementation of either
    solution involves the use of accumulators; the idea is to add
    values to the accumulator for each file, and then, when all files
    have been processed, compute the new model parameters.  For
    example, for equation 4, the numerator of the accumulator contains
    the sum (over each file) of the per-file numerator terms, and the
    denominator contains the sum (over each file) of the per-file
    denominator terms.  For equation 5, the accumulator contains the
    sum of the individual per-file estimates, and this sum is then
    divided by K.  A sketch of the accumulator loop follows below.
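As a rough illustration of the accumulator idea for equation 4 (my own sketch, not the course template code; the per-file statistics are assumed to come from each file's forward-backward pass):

```python
import numpy as np

def update_transitions(per_file_stats, n_states):
    """Pool per-file statistics into accumulators, then form new a_ij (equation 4).

    per_file_stats: one entry per training file; each entry is a pair
    (xi_sum, gamma_sum), where xi_sum[i, j] is that file's summed numerator
    for a_ij and gamma_sum[i] is its summed denominator.  These names are
    illustrative, not from the course template code.
    """
    num_acc = np.zeros((n_states, n_states))    # numerator accumulator
    den_acc = np.zeros(n_states)                # denominator accumulator
    for xi_sum, gamma_sum in per_file_stats:    # add each file's values
        num_acc += xi_sum
        den_acc += gamma_sum
    # Only after all files have been processed do we compute the new parameters.
    return num_acc / den_acc[:, None]
```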

7
Project 3: Forward-Backward Algorithm
  • Given existing data files of speech, implement the forward-backward
    (EM, Baum-Welch) algorithm to train HMMs.
  • Template code is available to read in features, write out HMM
    values to an output file, and provide some context and a starting
    point.
  • The features in the speech files are real, in that they are 7
    cepstral coefficients plus 7 delta values from utterances of "yes"
    and "no", sampled every 10 msec.
  • All necessary files (data files and the list of files to train on)
    are in the project3.zip file on the class web site.
  • Train an HMM on the word "no" using the list nolist.txt, which
    contains the filenames no_1.txt, no_2.txt, and no_3.txt.
  • Train for 10 iterations.

8
Project 3: Forward-Backward Algorithm
  • The HMM should have 7 states, the first and last of which are
    "NULL" states.
  • You can use the first NULL state to store information about π, and
    you can start off assuming that the π value for the first "real"
    (non-null) state is 1.0 and that π for all other states is zero.
  • You can use any method to get initial HMM parameters; the
    flat-start method is easiest (a flat-start sketch follows this
    list).
  • You can use only one mixture component in training, and you can
    assume a diagonal covariance matrix.
  • Updating of the parameters using the accumulators is currently set
    up for accumulating numerators and denominators separately for
    a_ij, means, and covariances.  If you want to do the updating
    differently (using only one accumulator each for a_ij, means, and
    covariances), feel free to do so.
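A possible flat-start sketch (my own illustration, not the course template code; it assumes 14-dimensional features, i.e. 7 cepstra plus 7 deltas, and the 7-state topology described above):

```python
import numpy as np

def flat_start(all_frames, n_states=7):
    """Flat-start initialization sketch: 7 states, first and last are NULL.

    all_frames: (N, 14) array of feature vectors pooled from the training
    files (7 cepstral coefficients + 7 deltas).  Every real state gets the
    same global mean and diagonal variance; transitions are split evenly
    between staying in a state and moving to the next one.
    """
    n_real = n_states - 2                    # states 1..5 are the "real" states
    mean = all_frames.mean(axis=0)
    var = all_frames.var(axis=0)             # diagonal covariance only

    means = np.tile(mean, (n_real, 1))       # same mean for every real state
    variances = np.tile(var, (n_real, 1))    # same variance for every real state

    # pi: all initial probability on the first real (non-null) state.
    pi = np.zeros(n_states)
    pi[1] = 1.0

    # Left-to-right transitions: 0.5 self-loop, 0.5 to the next state.
    A = np.zeros((n_states, n_states))
    for s in range(1, n_states - 1):
        A[s, s] = 0.5
        A[s, s + 1] = 0.5
    return pi, A, means, variances
```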

9
Project 3: Forward-Backward Algorithm
  • Sanity checks: there are two kinds of sanity checks you can do.
    First, your output should be close to the HMM file for the word
    "no" that you used in the Viterbi project.  (Results may not be
    exactly the same, depending on different assumptions made.)
    Second, you can compare alpha and beta values, as discussed in
    class, to make sure that they are equal in certain cases (see the
    sketch after this list).
  • Submit your results for the 10th iteration of training on the word
    "no".
  • Due on June 1; send your source code and results (the file
    hmm_no.10 that you created) to hosom at cslu .ogi .edu.  Late
    responses are generally not accepted.
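One standard alpha/beta consistency check (a sketch assuming unscaled alpha and beta values; I do not know exactly which cases were discussed in class) is that the sum over states of α_t(i)·β_t(i) equals P(O | λ) at every frame t:

```python
import numpy as np

def check_alpha_beta(alpha, beta):
    """Consistency check for unscaled alpha/beta values.

    alpha, beta: (T, N) arrays from the forward and backward passes.
    sum_i alpha_t(i) * beta_t(i) should equal P(O | lambda) at every frame t,
    so every per-frame sum should match the last one (where beta_T(i) = 1).
    """
    per_frame = (alpha * beta).sum(axis=1)   # one value per frame t
    p_obs = per_frame[-1]                    # equals sum_i alpha_T(i)
    return np.allclose(per_frame, p_obs), p_obs
```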

10
Connected Word Recognition: One-Level, 2-Level, Level Building
So far, we've been using Viterbi search to find the best word-level
result.  Words are constructed as a network of nodes, and final nodes
may connect to initial nodes for the continuous-speech case.
Connecting word endings to word beginnings in a Viterbi search is also
called the "one-level" search algorithm.  There are two additional
approaches, 2-Level and Level Building, for finding the best word
sequence, which may be based on Viterbi calculations or DTW
calculations.  These two approaches are faster than One-Level, and are
appropriate in cases where the number of connected words N is known in
advance.  Also, N should be small but greater than 1.  These two
approaches are presented here for historical background and for their
innovative algorithmic approaches to ASR, but they are generally not
used currently.
11
Connected Word Recognition
In many cases, we want to recognize multiple words in sequence:
  R     = reference word
  V     = number of unique vocabulary items
  m     = current frame in test utterance
  M     = maximum number of time frames in test utterance
  t(m)  = observation vector (features) at time m
  T     = entire test utterance
  L     = how many words we think were uttered
  N_R   = duration (in frames) of hypothesized reference word R
12
Connected Word Recognition
We need some measure of distance between a hypothesized word and the
observed speech data.  For DTW-based recognition (Rabiner), the
distance is a sum of local distances d(t(m), r_v(w(m))), where d(·,·)
is a local distance measure (Euclidean, etc.), w(m) is the warping at
time m that aligns observation vector t(m) with the reference r_v, and
r_v is a single frame of reference word R_v.
13
Connected Word Recognition
Equivalent distances can be computed for Viterbi-search-based
recognition (standard HMMs).  In the corresponding equations, t(m) is
the observation at time frame m, d(t(m), j) is the local distortion
for observation t(m) in state j, and D(R, j, m) is the cumulative
distortion for word R in state j at frame m.  The notation also refers
to the word-initial state of word R_j, the word-final state of word
R_i, and the state just before the word-initial state.  Also, b is the
frame at the beginning of word R_j and e is the frame at the end of
word R_j (note: these b's are different from the observation
probabilities b_j!).  Word R_i precedes word R_j.  D(R, b, e) is the
cumulative distortion for word R beginning at frame b and ending at
frame e.
14
Connected Word Recognition
The straightforward solution is to iterate over all possible word
combinations of length L, and over all possible word lengths from
L_min to L_max, finding the set of words with the lowest distance.
R^s is a (template of a) possible series of words, of length L; there
are V^L such possible series.  D is the best distance value over all
such series.  But this is on the order of Σ_{L=L_min..L_max} V^L
distance measures, and each distance measure itself requires many
calculations (depending on l, the number of links in a path, and P,
the total number of paths in the heuristic).  For V = 10, L_min = 7,
L_max = 11, there are about 10^11 computations.
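As a quick check of that order of magnitude (my own arithmetic; it counts only the candidate word sequences, not the per-sequence distance calculations):

```python
# Number of possible word series R^s for V = 10 words and lengths 7..11.
V, L_min, L_max = 10, 7, 11
total = sum(V**L for L in range(L_min, L_max + 1))
print(total)    # 111110000000, i.e. roughly 10^11 candidate sequences
```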
15
The One-Pass Algorithm
For DTW systems (assuming one type of heuristic), d_A(m, n; v) is the
accumulated distance up to frame m of the test utterance and reference
frame n of R_v (2 ≤ n ≤ N_v), for reference utterance v (1 ≤ v ≤ V).
At the beginning of a reference pattern R_v, the accumulated distance
is the local distance plus the minimum of either (transitioning from
the last frame of a previous word) or (staying in the same first frame
of the current word).  Finally, the best overall score is the minimum,
over all reference words v, of the accumulated distance at the last
test frame M and the last reference frame N_v.
16
The One-Pass Algorithm
For HMM systems, the recursion is the standard Viterbi formula if the
templates R_j are individual states and the word-final states of each
word i are the final states.  In this case, the accumulated distance
is the minimum of (distance for staying in the same state + local
distance) and (distance for coming from a different state + local
distance).  At the beginning of a word, the "different states" are the
ending states of all words.  We need to keep track not only of the
backtrace ψ, but also of where word boundaries occur.  (When we
transition between two states, we know whether this transition marks a
word boundary.)
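A minimal sketch of this idea (my own illustration, not course code): a Viterbi-style pass over the concatenated states of all words, where the backtrace also records whether each transition crossed a word boundary.  The state layout and local-distance input are assumptions for the example.

```python
import numpy as np

def one_pass(local_dist, word_spans):
    """One-pass search sketch over the states of all words.

    local_dist: (T, S) array of local distortions d(t(m), j) for every frame
    and every state (all words' states laid out along one axis).
    word_spans: list of (first_state, last_state) index pairs, one per word.
    Returns the best final score plus backtrace and word-boundary markers.
    """
    T, S = local_dist.shape
    D = np.full((T, S), np.inf)               # cumulative distortion
    back = np.zeros((T, S), dtype=int)        # predecessor state
    boundary = np.zeros((T, S), dtype=bool)   # did we cross a word boundary?

    first_states = [f for f, _ in word_spans]
    last_states = [l for _, l in word_spans]

    # Initialization: at frame 0 we may only be in a word-initial state.
    for f in first_states:
        D[0, f] = local_dist[0, f]

    for t in range(1, T):
        for f, l in word_spans:
            for s in range(f, l + 1):
                # Within-word predecessors: stay in s, or come from s - 1.
                cands = [(D[t - 1, s], s, False)]
                if s > f:
                    cands.append((D[t - 1, s - 1], s - 1, False))
                else:
                    # Word-initial state: may also come from any word-final state.
                    for e in last_states:
                        cands.append((D[t - 1, e], e, True))
                best, prev, is_bnd = min(cands, key=lambda c: c[0])
                D[t, s] = local_dist[t, s] + best
                back[t, s], boundary[t, s] = prev, is_bnd

    end = min(last_states, key=lambda e: D[T - 1, e])
    return D[T - 1, end], back, boundary, end
```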
17
The One-Pass Algorithm
[Diagram: word models connected between sentence-beginning and
sentence-end nodes]
Advantage: if we connect word-end states to word-beginning states,
implementation is easy (Viterbi search).
Problem: we can't specify beforehand the exact number of words in the
utterance.
Solution: use a grammar.  But implementing a grammar is difficult.
18
The One-Pass Algorithm
example
[Figure: example frame-by-frame alignment, with test frames labeled
w w w n n n n n tc tc th u u u u u against the states w, n, tc, th, u]
19
The 2-Level Dynamic Programming Algorithm
Warning!!  The name really should be "2-Step Dynamic Programming."
Step 1: match every possible word R_v with every possible range of the
test utterance T.  For each range of the test utterance, only save the
best word (and its score), e.g. the distance of the best word
beginning at frame 3 and ending at frame 4.
Step 2: use dynamic programming to select ranges of the utterance
(a) that cover the entire range of the test utterance T, and (b) that
have the best overall score for a given number of words, L.
(Step 3, optional): choose the word length with the best overall
score.
20
The 2-Level Algorithm
Step 1: compute distances.
[Figure: a grid of begin frames (1-6) vs. end frames (1-6).  Each cell
holds the best score from b to e and the best word from b to e.  For
example, compute D(R1,2,4), D(R2,2,4), D(R3,2,4), D(R4,2,4) (the
Viterbi or DTW score for each word beginning at time 2 and ending at
time 4), and choose the minimum to get the best word between 2 and 4.]
21
The 2-Level Algorithm
Step 2: determine the best sequence of best-word segments.
The recursion combines the cost of the best word from b to e with the
accumulated cost of l-1 words ending at time b-1.
  • Evaluate at time e = M to determine the best l words over the test
    data.
  • Choose the minimum value, over all values of l, of the accumulated
    cost at e = M if the exact number of words is not known in advance.
  • The word sequence is obtained from the word pointers (see the
    sketch below).
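A compact sketch of Step 2 (my own illustration; best_cost[b][e] and best_word[b][e] are assumed to hold the Step-1 results, i.e. the cost and identity of the best single word covering frames b..e):

```python
import math

def two_level_step2(best_cost, best_word, M, L_max):
    """Step 2 of the 2-level algorithm: combine best single-word costs into
    the best sequence of l words covering frames 1..M, for l = 1..L_max.

    best_cost[b][e]: cost of the best word starting at frame b, ending at e.
    best_word[b][e]: identity of that word.
    Frames are 1-based (index 0 unused in the nested lists).
    """
    INF = math.inf
    # D[l][e] = accumulated cost of the best l words ending at frame e.
    D = [[INF] * (M + 1) for _ in range(L_max + 1)]
    ptr = [[None] * (M + 1) for _ in range(L_max + 1)]   # (begin frame, word)

    # Level 1: a single word must start at frame 1.
    for e in range(1, M + 1):
        D[1][e] = best_cost[1][e]
        ptr[1][e] = (1, best_word[1][e])

    # Levels 2..L_max: best (l-1)-word prefix ending at b-1, plus best word b..e.
    for l in range(2, L_max + 1):
        for e in range(l, M + 1):
            for b in range(2, e + 1):
                cost = D[l - 1][b - 1] + best_cost[b][e]
                if cost < D[l][e]:
                    D[l][e] = cost
                    ptr[l][e] = (b, best_word[b][e])

    # Best number of words: the l with the lowest accumulated cost at e = M.
    best_l = min(range(1, L_max + 1), key=lambda l: D[l][M])
    return best_l, D[best_l][M], ptr   # ptr holds the word pointers for backtracing
```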

22
The 2-Level Algorithm
Step 2, whole algorithm:
part (1): Initialization
part (2): Build level 1 (corresponding to word 1)
part (3): Iterate for all values of b < e ≤ M, 2 ≤ l ≤ L_max
part (4): Terminate
23
The 2-Level Algorithm
Example (from Rabiner & Juang, p. 398):
[Table of best-word distances D(b, e) shown on the slide]
Given these D values, what are the best paths for 1-, 2-, and 3-word
matches?
24
The 2-Level Algorithm
Best N-word path: 1 word from t = 1 to t = 15, score D1(15) = 60
(note: ...)
25
The Level-Building Dynamic Programming Algorithm
The Nth level in LB corresponds to the Nth word in the hypothesized
word string.
Idea: instead of computing distances for all words at all begin and
end times, do this:
(1) compute distances for all words with begin time 1, up to the
    maximum end times
(2) at each possible end time, select the best word 1
(3) compute distances for all words beginning where the previous
    words left off, up to the maximum end times
(4) at each possible end time, select the best word 2
Repeat until reaching level (word number) L_max.
26
The Level-Building Dynamic Programming Algorithm
Define the minimum accumulated distance at level (word) l, with word
v, up to frame m.  We can evaluate this from m_v1(l) to m_v2(l).
For DTW with 2:1 expansion and compression, at level 1:
  m_v1(1) = ½ × (length of reference pattern v)
  m_v2(1) = 2 × (length of reference pattern v)
For HMMs, at level 1:
  m_v1(1) = number of states in the HMM
  m_v2(1) = M, or some reasonable maximum length of v
The first output of level 1 is the matrix of these accumulated
distances (one entry per word v and end frame m).
27
The Level-Building Dynamic Programming Algorithm
Then, for each frame m, we compute the minimum over all words v of the
accumulated distance, which is the best distance at level l to frame
m.  We also store the word v that resulted in this best distance, and
its starting frame.  Then we start the second level, with beginning
frames in the range of end frames produced by the first level, and
search all words beginning at these frames, with the initial
accumulated distance scores taken from the results of the first level.
Finally, the overall (global) best score is the minimum, over all
levels l, of the best accumulated distance at the final frame M.
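A rough sketch of the level-building loop (my own illustration; word_cost(v, b, e) is a hypothetical stand-in for the DTW or Viterbi cost of word v over test frames b..e):

```python
import math

def level_building(word_cost, V, M, L_max):
    """Level-building sketch: level l corresponds to the l-th word.

    word_cost(v, b, e): cost of matching word v against test frames b..e
    (a stand-in for the DTW or Viterbi computation).  Frames are 1-based.
    Returns the best total cost and the best number of levels (words).
    """
    INF = math.inf
    # best[l][m] = best accumulated distance of l words ending exactly at frame m.
    best = [[INF] * (M + 1) for _ in range(L_max + 1)]
    best[0][0] = 0.0   # zero words cover zero frames

    for l in range(1, L_max + 1):
        for b in range(1, M + 1):
            if best[l - 1][b - 1] == INF:
                continue                      # no (l-1)-word path ends at b-1
            for e in range(b, M + 1):
                for v in range(V):            # try every vocabulary word at this level
                    cost = best[l - 1][b - 1] + word_cost(v, b, e)
                    if cost < best[l][e]:
                        best[l][e] = cost     # keep only the best word per end frame
        # (The real algorithm would also store the best word and its starting
        #  frame at each (l, e) for backtracing.)

    best_l = min(range(1, L_max + 1), key=lambda l: best[l][M])
    return best[best_l][M], best_l
```

In the actual algorithm, each reference word's alignment is computed once per level from the allowed begin frames, producing costs for all end frames in a single pass, rather than calling a separate cost function for every (b, e) pair as this sketch does.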
28
The Level-Building Dynamic Programming Algorithm
29
The Level-Building Dynamic Programming Algorithm
30
The Level-Building Dynamic Programming Algorithm
31
Comparison of Approaches
The 2-Level, LB, and One-Pass algorithms in general provide the same
answer; the differences are:
(a) 2-Level: can be done time-synchronously, requires more computation
    than LB, and can specify the exact number of words in the
    utterance.
(b) Level Building: can not be done time-synchronously, requires less
    computation than 2-Level, and can specify the exact number of
    words in the utterance.
(c) One-Pass: can be done time-synchronously, requires more
    computation than 2-Level, and can not specify the exact number of
    words in the utterance without using a finite-state grammar.
32
Reading
Rabiner & Juang, Chapter 6
Rabiner & Juang, Chapter 7 (pp. 390-433)