Title: T-61.184 Special Course in Information Technology IV
1 T-61.184 Special Course in Information Technology IV
HMMs and Speech Recognition
based on Chapter 7 of D. Jurafsky and J. Martin,
Speech and Language Processing
Jaakko Peltonen, October 31, 2001
2
- speech recognition architecture
- HMM, Viterbi, A* decoding
- speech acoustics and features
- computing acoustic probabilities
- speech synthesis
3 Speech Recognition Architecture
- Application: LVCSR (large-vocabulary continuous speech recognition)
- Large vocabulary: dictionary size 5,000 to 60,000 words
- Continuous speech (words not separated)
- Speaker-independent
4
- acoustic input considered a noisy version of a source sentence
- decoding: find the sentence that most probably generated the input
- problems:
  - metric for selecting the best match?
  - efficient algorithm for finding the best match?
5
- acoustic input: symbol sequence O = o1, o2, ..., oT
- sentence: string of words W = w1, w2, ..., wn
- best-match metric: probability P(W | O)
- Bayes' rule:
  W_best = argmax_W P(W | O) = argmax_W P(O | W) P(W) / P(O) = argmax_W P(O | W) P(W)
  where P(O | W) is the observation likelihood (acoustic model) and P(W) is the prior probability (language model)
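
To make the noisy-channel decoding above concrete, here is a minimal Python sketch (not from the slides); the candidate set and the acoustic_logprob and lm_logprob scoring functions are hypothetical helpers supplied by the caller:

import math

def decode(observation, candidates, acoustic_logprob, lm_logprob):
    """Noisy-channel decoding sketch: pick the candidate sentence W that
    maximizes log P(O | W) + log P(W).  P(O) is the same for every
    candidate, so it drops out of the argmax."""
    best_sentence, best_score = None, -math.inf
    for sentence in candidates:
        score = acoustic_logprob(observation, sentence) + lm_logprob(sentence)
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence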
6 Hidden Markov Models (HMMs)
- previously, Markov chains were used to model pronunciation
  - forward algorithm gives the phone sequence likelihood
- real input is not symbolic but consists of spectral features
- input symbols do not correspond to machine states
- HMM definition:
  - state set Q
  - observation symbols O, distinct from Q
  - transition probabilities A and observation likelihoods B, not limited to 1 and 0
  - start and end state(s)
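
As a concrete illustration of these components (made-up states and numbers, not taken from the slides), a minimal Python representation of an HMM for the word "need":

# Q: state set; A: transition probabilities; B: per-state observation likelihoods.
# The symbols o1..o3 stand for (vector-quantized) acoustic observation symbols.
hmm = {
    "states": ["start", "n", "iy", "d", "end"],           # Q
    "transitions": {                                       # A: P(next state | state)
        "start": {"n": 1.0},
        "n":     {"n": 0.4, "iy": 0.6},
        "iy":    {"iy": 0.3, "d": 0.7},
        "d":     {"d": 0.2, "end": 0.8},
    },
    "observation_likelihoods": {                           # B: P(symbol | state)
        "n":  {"o1": 0.7, "o2": 0.2, "o3": 0.1},
        "iy": {"o1": 0.1, "o2": 0.6, "o3": 0.3},
        "d":  {"o1": 0.2, "o2": 0.1, "o3": 0.7},
    },
}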
7 Word Model
[Figure: HMM word model for "need" with states start0, n1, iy2, d3, end4; transition probabilities a01, a11, a12, a22, a23, a24, a33, a34; observation likelihoods b_i(o_t); and an observation sequence o1 ... o6]
8
- word boundaries are unknown, so segmentation is needed, e.g. the phone string
  ay d ih s hh er d s ah m th ih ng ax b aw  =  "I just heard something about"
- assumption: the dynamic programming invariant
  - if the ultimate best path for observation sequence o includes state qi, then it includes the best path up to and including qi
  - does not work for all grammars
9
function VITERBI(observations of len T, state-graph) returns best-path
  num-states <- NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if ((viterbi[s',t+1] = 0) || (new-score > viterbi[s',t+1])) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
- single automaton: combine the single-word networks and add word transition probabilities (bigram probabilities)
- states correspond to subphones in context
- beam search
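
A runnable sketch of the Viterbi recursion in the pseudocode above (dictionary-based parameters with illustrative names; a minimal version without the explicit start/end states or beam pruning):

def viterbi(observations, states, start_probs, trans_probs, obs_probs):
    """Return the most probable state sequence for the observation sequence.
    trellis[t][s] holds (best path probability of reaching s at time t, back-pointer)."""
    trellis = [{s: (start_probs.get(s, 0.0) * obs_probs[s].get(observations[0], 0.0), None)
                for s in states}]
    for t in range(1, len(observations)):
        column = {}
        for s in states:
            best_prev, best_score = None, 0.0
            for prev in states:
                score = trellis[t - 1][prev][0] * trans_probs[prev].get(s, 0.0)
                if score > best_score:
                    best_prev, best_score = prev, score
            column[s] = (best_score * obs_probs[s].get(observations[t], 0.0), best_prev)
        trellis.append(column)
    # Backtrace from the highest-probability state in the final column.
    last = max(states, key=lambda s: trellis[-1][s][0])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(trellis[t][path[-1]][1])
    return list(reversed(path))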
10
- Viterbi has problems:
  - it computes the most probable state sequence, not the most probable word sequence
  - it cannot be used with all language models (only bigrams)
- Solution 1: multiple-pass decoding
  - N-best Viterbi: return the N best sentences and re-sort them with a more complex model
  - word lattice: return a directed word graph with word observation likelihoods, then refine it with a more complex model
11 A* Decoder
- Viterbi uses an approximation of the forward algorithm: max instead of sum
- A* uses the complete forward algorithm, so observation likelihoods are correct and any language model can be used
- best-first search of the word sequence tree
  - priority queue of scored paths to extend
- Algorithm:
  1. select the highest-priority path (pop the queue)
  2. create its possible extensions (if none, stop)
  3. calculate scores for the extended paths (from the forward algorithm and the language model)
  4. add the scored paths to the queue
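
A minimal best-first search sketch of this loop (not from the slides; forward_logprob, lm_logprob and next_words are hypothetical helpers supplied by the recognizer):

import heapq

def a_star_decode(observation, next_words, forward_logprob, lm_logprob):
    """Best-first (A*-style) search over word sequences.
    Higher scores are better, so the heap stores negated scores."""
    queue = [(0.0, ())]                       # (negated score, partial word sequence)
    while queue:
        neg_score, path = heapq.heappop(queue)
        extensions = next_words(path)         # candidate next words; empty at utterance end
        if not extensions:
            return path                       # highest-priority complete path
        for word in extensions:
            new_path = path + (word,)
            score = forward_logprob(observation, new_path) + lm_logprob(new_path)
            heapq.heappush(queue, (-score, new_path))
    return ()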
12 A* Decoder, continued
[Figure: A* search tree over partial word strings, each annotated with a score, e.g. (none) 1, Alice 40, if 30, muscle 31, messy 25, was 29, wants 24, walls 2, Every 25, In 4, music 32; scores combine the forward probability, e.g. p(acoustic | "if"), with language model probabilities, e.g. p("if" | START), p("music" | "if")]
13 A* Decoder, continued
- the score of a word string w is not P(w | y) (y is the acoustic string)
  - reason: a path prefix would always have a higher score than its extensions
- score: the A* evaluation function f*(p) = g(p) + h*(p)
  - g(p): score from the start to the end of the current string
  - h*(p): estimated score of the best extension to the end of the utterance
14 Acoustic Processing of Speech
- wave characteristics: frequency (pitch), amplitude (loudness)
- visible information: vowel/consonant, voicing, length, fricatives, stop closure
- spectral features: Fourier spectrum / LPC spectrum
  - peaks characteristic of different sounds: formants
  - spectrogram shows the changes over time
- digitization: sampling, quantization
- processing yields cepstral features / PLP features
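
As a rough illustration of that last step (a bare-bones sketch, not the full front end, which would add pre-emphasis, mel filterbanks, deltas, etc.), simple cepstral coefficients for one speech frame can be computed as follows:

import numpy as np

def cepstral_features(frame, num_coeffs=12):
    """Toy cepstral analysis of one frame: window, magnitude spectrum,
    log, then the inverse FFT of the log spectrum (the real cepstrum);
    the first few coefficients are kept as features."""
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[:num_coeffs]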
15 Computing Acoustic Probabilities
- simple way: vector quantization (cluster the feature vectors and count cluster occurrences)
- continuous approach: calculate a probability density function (pdf) over the observations
  - Gaussian pdf trained with the forward-backward algorithm
  - Gaussian mixtures, parameter tying
  - multi-layer perceptron (MLP) pdf trained with error back-propagation
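
For the continuous approach, a small illustrative sketch of a diagonal-covariance Gaussian observation likelihood b_j(o_t) and of a Gaussian mixture built from it (the parameter values are assumed to come from training):

import numpy as np

def gaussian_likelihood(obs, mean, var):
    """Diagonal-covariance Gaussian density of one observation vector."""
    obs, mean, var = np.asarray(obs), np.asarray(mean), np.asarray(var)
    norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))
    return norm * np.exp(-0.5 * np.sum((obs - mean) ** 2 / var))

def gmm_likelihood(obs, weights, means, variances):
    """Gaussian mixture observation likelihood: weighted sum of component densities."""
    return sum(w * gaussian_likelihood(obs, m, v)
               for w, m, v in zip(weights, means, variances))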
16 Training a Speech Recognizer
- evaluation metric: word error rate
  1. compute the minimum edit distance between the hypothesized and the correct string
  2. divide the number of errors (substitutions, insertions, deletions) by the number of words in the correct string
- e.g. correct: "I went to a party", hypothesis: "Eye went two a bar tea"
  - 3 substitutions and 1 insertion, so the word error rate is 80%
- state of the art: word error rate around 20% on natural-speech tasks
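
A small sketch of this computation (word-level minimum edit distance with unit costs, divided by the reference length):

def word_error_rate(correct, hypothesis):
    """Word error rate: minimum edit distance between the word sequences,
    divided by the number of words in the correct string, as a percentage."""
    ref, hyp = correct.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I went to a party", "Eye went two a bar tea"))   # 80.0, as on the slide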
17 Embedded Training
- models to be trained:
  - language model p(wi | wi-1, wi-2)
  - observation likelihoods bj(ot)
  - transition probabilities aij
  - pronunciation lexicon: HMM state graph structure
- training data:
  - corpus of speech wavefiles with word transcriptions
  - large text corpus for language model training
  - smaller corpus of phonetically labeled speech
- the N-gram language model is trained as in Chapter 6
- the HMM lexicon structure is built by hand; PRONLEX and CMUdict are off-the-shelf pronunciation dictionaries
18 Embedded Training, continued
- HMM parameters:
  - initial estimate: equal transition probabilities; observation probabilities bootstrapped from the phonetically labeled speech (a label for each frame gives initial Gaussian means / variances)
  - MLP systems: forced Viterbi alignment; with the features and the correct words given, find the best states, which gives a label for each input, then retrain the MLP
  - Gaussian systems: forward-backward algorithm; compute the forward and backward probabilities and re-estimate a and b; since the correct words are known, the model can be pruned
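
As a small illustration of the forward pass used in this re-estimation (a sketch of the forward algorithm only, not the full forward-backward update):

def forward_probabilities(observations, states, start_probs, trans_probs, obs_probs):
    """Forward algorithm sketch: alpha[t][s] = P(o_1..o_t, state at time t = s).
    Unlike Viterbi, predecessor states are summed over rather than maximized."""
    alpha = [{s: start_probs.get(s, 0.0) * obs_probs[s].get(observations[0], 0.0)
              for s in states}]
    for t in range(1, len(observations)):
        alpha.append({
            s: obs_probs[s].get(observations[t], 0.0)
               * sum(alpha[t - 1][prev] * trans_probs[prev].get(s, 0.0) for prev in states)
            for s in states
        })
    return alpha   # total observation likelihood = sum(alpha[-1].values())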
19 Speech Synthesis
- text-to-speech (TTS) system output is a phone sequence with durations and an F0 pitch contour
- waveform concatenation: based on a recorded speech database segmented into short units
  - simplest: one unit per phone; join the units and smooth the edges
  - triphone models have too many combinations, so diphones are used
  - diphones start/end midway through a phone, for stability
- does not model pitch and duration changes (prosody)
20 Speech Synthesis, continued
- use signal processing to change prosody
- the LPC model separates pitch from the spectral envelope
  - to modify pitch: generate pulses at the desired pitch and re-excite the LPC coefficients, giving a modified wave
  - to modify duration: contract/expand the coefficient frames
- TD-PSOLA: frames centered around pitchmarks
  - to change pitch: move the pitchmarks closer together / further apart
  - to change duration: duplicate / leave out frames
  - recombine by overlapping and adding the frames
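
A much-simplified sketch of the duration-change case (illustrative only; real TD-PSOLA also repositions pitchmarks for pitch changes and treats unvoiced regions separately):

import numpy as np

def ola_stretch_duration(signal, pitchmarks, stretch):
    """Toy TD-PSOLA-style duration change with pitch left unchanged:
    Hann-windowed frames two pitch periods wide are taken around each
    pitchmark, duplicated or dropped according to `stretch`, and
    overlap-added at the original frame spacing."""
    period = int(np.median(np.diff(pitchmarks)))       # rough pitch period in samples
    num_out_frames = int(len(pitchmarks) * stretch)
    out = np.zeros((num_out_frames + 2) * period)
    for i in range(num_out_frames):
        # map the i-th output frame back to the nearest input pitchmark
        src = pitchmarks[min(int(round(i / stretch)), len(pitchmarks) - 1)]
        frame = np.asarray(signal[max(src - period, 0):src + period], dtype=float)
        out[i * period:i * period + len(frame)] += frame * np.hanning(len(frame))
    return out[:num_out_frames * period]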
21 Speech Synthesis, continued
- problems with speech synthesis:
  - one example per diphone is insufficient
  - signal processing causes distortion
  - subtle effects are not modeled
- unit selection: collect several examples per unit, with different pitch/duration/linguistic situations
- selection method (F0 contour with 3 values per phone, large unit corpus):
  1. find candidates (closest phone, duration, F0) and rank them by target cost (closeness)
  2. measure the join quality of neighbouring candidates and rank the joins by concatenation cost
- pick the best unit set, giving more natural speech
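
A toy sketch of that selection as a dynamic program (target_cost and concat_cost are hypothetical cost functions; candidate units are assumed to be hashable ids):

def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one candidate unit per target position, minimizing the sum of
    target costs plus concatenation costs between neighbouring units
    (a Viterbi-style dynamic program over the candidate lists)."""
    # best[i][u] = (cost of the best unit sequence ending in u at position i, back-pointer)
    best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        column = {}
        for u in candidates[i]:
            prev, cost = min(((p, best[i - 1][p][0] + concat_cost(p, u))
                              for p in candidates[i - 1]), key=lambda item: item[1])
            column[u] = (cost + target_cost(targets[i], u), prev)
        best.append(column)
    # backtrace the cheapest unit sequence
    unit = min(best[-1], key=lambda u: best[-1][u][0])
    chosen = [unit]
    for i in range(len(targets) - 1, 0, -1):
        chosen.append(best[i][chosen[-1]][1])
    return list(reversed(chosen))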
22 Human Speech Recognition
- PLP analysis is inspired by the human auditory system
- lexical access has properties in common with ASR:
  - frequency
  - parallelism
  - neighborhood effects
  - cue-based processing (phoneme restoration): formant structure, timing, voicing, lexical cues, word association, repetition priming
- differences:
  - time-course: human processing is on-line
  - other cues: prosody
23 Exercises
1. Hand-simulate the Viterbi algorithm: use the automaton in Figure 7.8 on the input aa n n ax n iy d. What is the most probable string of words?
2. Suggest two functions for use in A* decoding. What criteria should the function satisfy for the search to work (i.e. to return the best path)?