Title: T-61.184 Special Course in Information Technology IV
1 T-61.184 Special Course in Information Technology IV
HMMs and Speech Recognition
based on Chapter 7 of D. Jurafsky and J. Martin,
Speech and Language Processing
Jaakko Peltonen, October 31, 2001
2
- speech recognition architecture
- HMM, Viterbi, A* decoding
- speech acoustics and features
- computing acoustic probabilities
- speech synthesis
3 Speech Recognition Architecture
- Application: LVCSR (large-vocabulary continuous speech recognition)
- Large vocabulary: dictionary size 5,000 to 60,000 words
- Continuous speech (words not separated)
- Speaker-independent
4
- acoustic input considered a noisy version of a source sentence
- decoding: find the sentence that most probably generated the input
- problems:
  - metric for selecting the best match?
  - efficient algorithm for finding the best match?
5
- acoustic input: symbol sequence O = o1, o2, ..., oT
- sentence: string of words W = w1, w2, ..., wn
- best-match metric: probability P(W | O)
- Bayes' rule:
  W_best = argmax_W P(W | O) = argmax_W P(O | W) P(W) / P(O) = argmax_W P(O | W) P(W)
  where P(O | W) is the observation likelihood (acoustic model) and P(W) is the prior probability (language model)
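
To make the noisy-channel decoding above concrete, here is a minimal Python sketch (not from the slides); the candidate set and the acoustic_logprob and lm_logprob scoring functions are hypothetical helpers supplied by the caller:

import math

def decode(observation, candidates, acoustic_logprob, lm_logprob):
    """Noisy-channel decoding sketch: pick the candidate sentence W that
    maximizes log P(O | W) + log P(W).  P(O) is the same for every
    candidate, so it drops out of the argmax."""
    best_sentence, best_score = None, -math.inf
    for sentence in candidates:
        score = acoustic_logprob(observation, sentence) + lm_logprob(sentence)
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence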
6 Hidden Markov Models (HMMs)
- previously, Markov chains were used to model pronunciation
  - forward algorithm gives the phone sequence likelihood
- real input is not symbolic but consists of spectral features
- input symbols do not correspond to machine states
- HMM definition:
  - state set Q
  - observation symbols O, distinct from Q
  - transition probabilities A and observation likelihoods B, not limited to 1 and 0
  - start and end state(s)
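
As a concrete illustration of these components (made-up states and numbers, not taken from the slides), a minimal Python representation of an HMM for the word "need":

# Q: state set; A: transition probabilities; B: per-state observation likelihoods.
# The symbols o1..o3 stand for (vector-quantized) acoustic observation symbols.
hmm = {
    "states": ["start", "n", "iy", "d", "end"],           # Q
    "transitions": {                                       # A: P(next state | state)
        "start": {"n": 1.0},
        "n":     {"n": 0.4, "iy": 0.6},
        "iy":    {"iy": 0.3, "d": 0.7},
        "d":     {"d": 0.2, "end": 0.8},
    },
    "observation_likelihoods": {                           # B: P(symbol | state)
        "n":  {"o1": 0.7, "o2": 0.2, "o3": 0.1},
        "iy": {"o1": 0.1, "o2": 0.6, "o3": 0.3},
        "d":  {"o1": 0.2, "o2": 0.1, "o3": 0.7},
    },
}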
7 Word Model
[Figure: HMM word model for "need" with states start0, n1, iy2, d3, end4; transition probabilities a01, a11, a12, a22, a23, a24, a33, a34; observation likelihoods b_i(o_t); and an observation sequence o1 ... o6]
8
- word boundaries are unknown, so segmentation is needed, e.g. the phone string
  ay d ih s hh er d s ah m th ih ng ax b aw  =  "I just heard something about"
- assumption: the dynamic programming invariant
  - if the ultimate best path for observation sequence o includes state qi, then it includes the best path up to and including qi
  - does not work for all grammars
9
function VITERBI(observations of len T, state-graph) returns best-path
  num-states <- NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if ((viterbi[s',t+1] = 0) || (new-score > viterbi[s',t+1])) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
- single automaton: combine the single-word networks and add word transition probabilities (bigram probabilities)
- states correspond to subphones in context
- beam search
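
A runnable sketch of the Viterbi recursion in the pseudocode above (dictionary-based parameters with illustrative names; a minimal version without the explicit start/end states or beam pruning):

def viterbi(observations, states, start_probs, trans_probs, obs_probs):
    """Return the most probable state sequence for the observation sequence.
    trellis[t][s] holds (best path probability of reaching s at time t, back-pointer)."""
    trellis = [{s: (start_probs.get(s, 0.0) * obs_probs[s].get(observations[0], 0.0), None)
                for s in states}]
    for t in range(1, len(observations)):
        column = {}
        for s in states:
            best_prev, best_score = None, 0.0
            for prev in states:
                score = trellis[t - 1][prev][0] * trans_probs[prev].get(s, 0.0)
                if score > best_score:
                    best_prev, best_score = prev, score
            column[s] = (best_score * obs_probs[s].get(observations[t], 0.0), best_prev)
        trellis.append(column)
    # Backtrace from the highest-probability state in the final column.
    last = max(states, key=lambda s: trellis[-1][s][0])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(trellis[t][path[-1]][1])
    return list(reversed(path))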
10
- Viterbi has problems:
  - it computes the most probable state sequence, not the most probable word sequence
  - it cannot be used with all language models (only bigrams)
- Solution 1: multiple-pass decoding
  - N-best Viterbi: return the N best sentences and re-sort them with a more complex model
  - word lattice: return a directed word graph with word observation likelihoods, then refine it with a more complex model
11 A* Decoder
- Viterbi uses an approximation of the forward algorithm: max instead of sum
- A* uses the complete forward algorithm, so observation likelihoods are correct and any language model can be used
- best-first search of the word sequence tree
  - priority queue of scored paths to extend
- Algorithm:
  1. select the highest-priority path (pop the queue)
  2. create its possible extensions (if none, stop)
  3. calculate scores for the extended paths (from the forward algorithm and the language model)
  4. add the scored paths to the queue
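
A minimal best-first search sketch of this loop (not from the slides; forward_logprob, lm_logprob and next_words are hypothetical helpers supplied by the recognizer):

import heapq

def a_star_decode(observation, next_words, forward_logprob, lm_logprob):
    """Best-first (A*-style) search over word sequences.
    Higher scores are better, so the heap stores negated scores."""
    queue = [(0.0, ())]                       # (negated score, partial word sequence)
    while queue:
        neg_score, path = heapq.heappop(queue)
        extensions = next_words(path)         # candidate next words; empty at utterance end
        if not extensions:
            return path                       # highest-priority complete path
        for word in extensions:
            new_path = path + (word,)
            score = forward_logprob(observation, new_path) + lm_logprob(new_path)
            heapq.heappush(queue, (-score, new_path))
    return ()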
12 A* Decoder, continued
[Figure: A* search tree over partial word strings, each annotated with a score, e.g. (none) 1, Alice 40, if 30, muscle 31, messy 25, was 29, wants 24, walls 2, Every 25, In 4, music 32; scores combine the forward probability, e.g. p(acoustic | "if"), with language model probabilities, e.g. p("if" | START), p("music" | "if")]
13 A* Decoder, continued
- the score of a word string w is not P(w | y) (y is the acoustic string)
  - reason: a path prefix would always have a higher score than its extensions
- score: the A* evaluation function f*(p) = g(p) + h*(p)
  - g(p): score from the start to the end of the current string
  - h*(p): estimated score of the best extension to the end of the utterance
14 Acoustic Processing of Speech
- wave characteristics: frequency (pitch), amplitude (loudness)
- visible information: vowel/consonant, voicing, length, fricatives, stop closure
- spectral features: Fourier spectrum / LPC spectrum
  - peaks characteristic of different sounds: formants
  - spectrogram shows the changes over time
- digitization: sampling, quantization
- processing yields cepstral features / PLP features
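
As a rough illustration of that last step (a bare-bones sketch, not the full front end, which would add pre-emphasis, mel filterbanks, deltas, etc.), simple cepstral coefficients for one speech frame can be computed as follows:

import numpy as np

def cepstral_features(frame, num_coeffs=12):
    """Toy cepstral analysis of one frame: window, magnitude spectrum,
    log, then the inverse FFT of the log spectrum (the real cepstrum);
    the first few coefficients are kept as features."""
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[:num_coeffs]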
15 Computing Acoustic Probabilities
- simple way: vector quantization (cluster the feature vectors and count cluster occurrences)
- continuous approach: calculate a probability density function (pdf) over the observations
  - Gaussian pdf trained with the forward-backward algorithm
  - Gaussian mixtures, parameter tying
  - multi-layer perceptron (MLP) pdf trained with error back-propagation
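
For the continuous approach, a small illustrative sketch of a diagonal-covariance Gaussian observation likelihood b_j(o_t) and of a Gaussian mixture built from it (the parameter values are assumed to come from training):

import numpy as np

def gaussian_likelihood(obs, mean, var):
    """Diagonal-covariance Gaussian density of one observation vector."""
    obs, mean, var = np.asarray(obs), np.asarray(mean), np.asarray(var)
    norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))
    return norm * np.exp(-0.5 * np.sum((obs - mean) ** 2 / var))

def gmm_likelihood(obs, weights, means, variances):
    """Gaussian mixture observation likelihood: weighted sum of component densities."""
    return sum(w * gaussian_likelihood(obs, m, v)
               for w, m, v in zip(weights, means, variances))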
16 Training a Speech Recognizer
- evaluation metric: word error rate
  1. compute the minimum edit distance between the hypothesized and the correct string
  2. divide the number of errors (substitutions, insertions, deletions) by the number of words in the correct string
- e.g. correct: "I went to a party", hypothesis: "Eye went two a bar tea"
  - 3 substitutions and 1 insertion, so the word error rate is 80%
- state of the art: word error rate around 20% on natural-speech tasks
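
A small sketch of this computation (word-level minimum edit distance with unit costs, divided by the reference length):

def word_error_rate(correct, hypothesis):
    """Word error rate: minimum edit distance between the word sequences,
    divided by the number of words in the correct string, as a percentage."""
    ref, hyp = correct.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I went to a party", "Eye went two a bar tea"))   # 80.0, as on the slide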
17 Embedded Training
- models to be trained:
  - language model p(wi | wi-1, wi-2)
  - observation likelihoods bj(ot)
  - transition probabilities aij
  - pronunciation lexicon: HMM state graph structure
- training data:
  - corpus of speech wavefiles with word transcriptions
  - large text corpus for language model training
  - smaller corpus of phonetically labeled speech
- the N-gram language model is trained as in Chapter 6
- the HMM lexicon structure is built by hand; PRONLEX and CMUdict are off-the-shelf pronunciation dictionaries
18 Embedded Training, continued
- HMM parameters:
  - initial estimate: equal transition probabilities; observation probabilities bootstrapped from the phonetically labeled speech (a label for each frame gives initial Gaussian means / variances)
  - MLP systems: forced Viterbi alignment; with the features and the correct words given, find the best states, which gives a label for each input, then retrain the MLP
  - Gaussian systems: forward-backward algorithm; compute the forward and backward probabilities and re-estimate a and b; since the correct words are known, the model can be pruned
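
As a small illustration of the forward pass used in this re-estimation (a sketch of the forward algorithm only, not the full forward-backward update):

def forward_probabilities(observations, states, start_probs, trans_probs, obs_probs):
    """Forward algorithm sketch: alpha[t][s] = P(o_1..o_t, state at time t = s).
    Unlike Viterbi, predecessor states are summed over rather than maximized."""
    alpha = [{s: start_probs.get(s, 0.0) * obs_probs[s].get(observations[0], 0.0)
              for s in states}]
    for t in range(1, len(observations)):
        alpha.append({
            s: obs_probs[s].get(observations[t], 0.0)
               * sum(alpha[t - 1][prev] * trans_probs[prev].get(s, 0.0) for prev in states)
            for s in states
        })
    return alpha   # total observation likelihood = sum(alpha[-1].values())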
19 Speech Synthesis
- text-to-speech (TTS) system output is a phone sequence with durations and an F0 pitch contour
- waveform concatenation: based on a recorded speech database segmented into short units
  - simplest: one unit per phone; join the units and smooth the edges
  - triphone models have too many combinations, so diphones are used
  - diphones start/end midway through a phone, for stability
- does not model pitch and duration changes (prosody)
20 Speech Synthesis, continued
- use signal processing to change prosody
- the LPC model separates pitch from the spectral envelope
  - to modify pitch: generate pulses at the desired pitch and re-excite the LPC coefficients, giving a modified wave
  - to modify duration: contract/expand the coefficient frames
- TD-PSOLA: frames centered around pitchmarks
  - to change pitch: move the pitchmarks closer together / further apart
  - to change duration: duplicate / leave out frames
  - recombine by overlapping and adding the frames
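
A much-simplified sketch of the duration-change case (illustrative only; real TD-PSOLA also repositions pitchmarks for pitch changes and treats unvoiced regions separately):

import numpy as np

def ola_stretch_duration(signal, pitchmarks, stretch):
    """Toy TD-PSOLA-style duration change with pitch left unchanged:
    Hann-windowed frames two pitch periods wide are taken around each
    pitchmark, duplicated or dropped according to `stretch`, and
    overlap-added at the original frame spacing."""
    period = int(np.median(np.diff(pitchmarks)))       # rough pitch period in samples
    num_out_frames = int(len(pitchmarks) * stretch)
    out = np.zeros((num_out_frames + 2) * period)
    for i in range(num_out_frames):
        # map the i-th output frame back to the nearest input pitchmark
        src = pitchmarks[min(int(round(i / stretch)), len(pitchmarks) - 1)]
        frame = np.asarray(signal[max(src - period, 0):src + period], dtype=float)
        out[i * period:i * period + len(frame)] += frame * np.hanning(len(frame))
    return out[:num_out_frames * period]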
21 Speech Synthesis, continued
- problems with speech synthesis:
  - one example per diphone is insufficient
  - signal processing causes distortion
  - subtle effects are not modeled
- unit selection: collect several examples per unit, with different pitch/duration/linguistic situations
- selection method (F0 contour with 3 values per phone, large unit corpus):
  1. find candidates (closest phone, duration, F0) and rank them by target cost (closeness)
  2. measure the join quality of neighbouring candidates and rank the joins by concatenation cost
- pick the best unit set, giving more natural speech
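
A toy sketch of that selection as a dynamic program (target_cost and concat_cost are hypothetical cost functions; candidate units are assumed to be hashable ids):

def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one candidate unit per target position, minimizing the sum of
    target costs plus concatenation costs between neighbouring units
    (a Viterbi-style dynamic program over the candidate lists)."""
    # best[i][u] = (cost of the best unit sequence ending in u at position i, back-pointer)
    best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        column = {}
        for u in candidates[i]:
            prev, cost = min(((p, best[i - 1][p][0] + concat_cost(p, u))
                              for p in candidates[i - 1]), key=lambda item: item[1])
            column[u] = (cost + target_cost(targets[i], u), prev)
        best.append(column)
    # backtrace the cheapest unit sequence
    unit = min(best[-1], key=lambda u: best[-1][u][0])
    chosen = [unit]
    for i in range(len(targets) - 1, 0, -1):
        chosen.append(best[i][chosen[-1]][1])
    return list(reversed(chosen))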
22 Human Speech Recognition
- PLP analysis is inspired by the human auditory system
- lexical access has properties in common with ASR:
  - frequency
  - parallelism
  - neighborhood effects
  - cue-based processing (phoneme restoration): formant structure, timing, voicing, lexical cues, word association, repetition priming
- differences:
  - time-course: human processing is on-line
  - other cues: prosody
23 Exercises
1. Hand-simulate the Viterbi algorithm: use the automaton in Figure 7.8 on the input aa n n ax n iy d. What is the most probable string of words?
2. Suggest two functions for use in A* decoding. What criteria should the function satisfy for the search to work (i.e. to return the best path)?