1
T-61.184 Informaatiotekniikan erikoiskurssi IV (Special Course in Information Technology IV)
HMMs and Speech Recognition
based on chapter 7 of D. Jurafsky, J. Martin
Speech and Language Processing
Jaakko Peltonen, October 31, 2001
2
  • speech recognition architecture
  • HMMs, Viterbi, A* decoding
  • speech acoustics and features
  • computing acoustic probabilities
  • speech synthesis

3
Speech Recognition Architecture
  • Application: LVCSR (Large Vocabulary Continuous Speech Recognition)
  • Large vocabulary: dictionary size 5,000 to 60,000 words
  • Continuous speech (words not separated)
  • Speaker-independent

4
  • acoustic input considered a noisy version of a
    source sentence
  • decoding: find the sentence that most probably
    generated the input
  • problems:
    - what metric for selecting the best match?
    - what efficient algorithm for finding the best match?

5
  • acoustic input: symbol sequence O
  • sentence: string of words W
  • best-match metric: probability
  • Bayes rule (noisy channel): pick the W that maximizes
    P(O|W) P(W), where P(O|W) is the observation likelihood
    (acoustic model) and P(W) is the prior probability
    (language model); a minimal decoding sketch follows below
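
A minimal sketch of this decoding rule in Python, assuming hypothetical scoring functions acoustic_logprob and lm_logprob (these names are illustrative, not from the slides):

import math

def decode(observations, candidate_sentences, acoustic_logprob, lm_logprob):
    # pick the sentence W maximizing log P(O|W) + log P(W) (Bayes rule / noisy channel)
    best, best_score = None, -math.inf
    for sentence in candidate_sentences:
        score = acoustic_logprob(observations, sentence) + lm_logprob(sentence)
        if score > best_score:
            best, best_score = sentence, score
    return best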

6
Hidden Markov Models (HMMs)
  • previously, Markov chains were used to model
    pronunciation
  • forward algorithm → phone sequence likelihood
  • real input is not symbolic: spectral features
  • input symbols do not correspond to machine
    states
  • HMM definition (see the sketch after this list):
    - state set Q, observation symbols O (with O ≠ Q)
    - transition probabilities A
    - observation likelihoods B, not limited to 1 and 0
    - start and end state(s)
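
A minimal sketch of these components as a Python structure (the field names are illustrative, not a definition from the slides):

from dataclasses import dataclass, field

@dataclass
class HMM:
    states: list                                    # state set Q, including non-emitting start and end states
    symbols: list                                   # observation symbols O (distinct from Q)
    trans: dict = field(default_factory=dict)       # transition probabilities A: (s, s2) -> prob
    obs_like: dict = field(default_factory=dict)    # observation likelihoods B: (s, o) -> prob, not limited to 0/1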

7
Word Model
(figure) HMM word model for the word 'need' (phones n iy d): states
start0, n1, iy2, d3, end4; transition probabilities a01, a11, a12,
a22, a23, a24, a33, a34; observation likelihoods bj(ot) link the
emitting states to the observation sequence o1 ... o6.
8
  • word boundaries unknown → segmentation:
    [ay d ih s hh er d s ah m th ih ng ax b aw]
    'I just heard something about'
  • assumption: dynamic programming invariant
  • if the ultimate best path for o includes state qi,
    it includes the best path up to and including qi
  • does not work for all grammars

9
function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        new-score ← viterbi[s,t] * a[s,s'] * bs'(ot)
        if ((viterbi[s',t+1] = 0) or (new-score > viterbi[s',t+1])) then
          viterbi[s',t+1] ← new-score
          back-pointer[s',t+1] ← s
  Backtrace from highest probability state in the final
  column of viterbi[] and return path
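
A minimal runnable Python version of the same idea, using dictionaries a and b for the transition probabilities and observation likelihoods (an illustrative sketch, not the book's exact formulation):

def viterbi(observations, states, a, b, start="start", end="end"):
    # a[(s, s2)]: transition probability, b[(s, o)]: observation likelihood
    prob = {s: a.get((start, s), 0.0) * b.get((s, observations[0]), 0.0) for s in states}
    back = [{s: start for s in states}]
    for o in observations[1:]:
        new_prob, pointers = {}, {}
        for s2 in states:
            best_s = max(states, key=lambda s: prob[s] * a.get((s, s2), 0.0))
            new_prob[s2] = prob[best_s] * a.get((best_s, s2), 0.0) * b.get((s2, o), 0.0)
            pointers[s2] = best_s
        prob, back = new_prob, back + [pointers]
    # finish in the end state and backtrace the best state path
    last = max(states, key=lambda s: prob[s] * a.get((s, end), 0.0))
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path))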
  • single automaton: combine single-word networks
    → add word transition probabilities
    (bigram probabilities)
  • states correspond to subphones and context
  • beam search

10
  • Viterbi has problems:
  • it computes the most probable state sequence, not
    word sequence
  • it cannot be used with all language models (only
    bigrams)
  • Solution 1: multiple-pass decoding
  • N-best Viterbi: return the N best sentences, sort
    them with a more complex model
  • word lattice: return a directed word graph with
    word observation likelihoods → refine with a more
    complex model

11
A* Decoder
  • Viterbi uses an approximation of the forward
    algorithm: max instead of sum
  • A* uses the complete forward algorithm → correct
    observation likelihoods, can use any language
    model
  • best-first search of the word sequence tree
  • priority queue of scored paths to extend
  • Algorithm (sketched below):
    1. select the highest-priority path (pop the queue)
    2. create its possible extensions (if none, stop)
    3. calculate scores for the extended paths
       (from the forward algorithm and the language model)
    4. add the scored paths to the queue
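
A minimal sketch of this best-first loop with a priority queue, assuming hypothetical helpers score(path) and extend(path) for steps 2 and 3 (not defined in the slides):

import heapq

def a_star_decode(score, extend, initial_path):
    # score(path): forward-algorithm + language-model score plus an estimate for the best extension
    # extend(path): one-word extensions of path; an empty list means the path is a complete utterance
    queue = [(-score(initial_path), initial_path)]        # heapq pops the smallest, so scores are negated
    while queue:
        _, path = heapq.heappop(queue)                    # 1. select the highest-priority path
        extensions = extend(path)                         # 2. create its possible extensions
        if not extensions:
            return path                                   #    complete hypothesis found
        for new_path in extensions:                       # 3. score the extended paths
            heapq.heappush(queue, (-score(new_path), new_path))   # 4. add them to the queue
    return None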

12
A* Decoder, continued
(figure) A* search tree of partial word hypotheses with scores
(e.g. (none) 1, Alice 40, if 30, walls 2, was 29, wants 24,
Every 25, In 4, muscle 31, music 32, messy 25); each path's score
combines language-model probabilities such as p(if | START) and
p(music | if) with forward probabilities such as p(acoustic | if)
and p(acoustic | music).
13
A* Decoder, continued
  • the score of a word string w is not P(y|w) P(w)
    (y is the acoustic string)
  • reason: a path prefix would always have a higher
    score than its extensions
  • score: the A* evaluation function f*(p) = g(p) + h*(p)
  • g(p): score from the start to the current string end
  • h*(p): estimated score of the best
    extension to the utterance end

14
Acoustic Processing of Speech
  • wave characteristics: frequency → pitch,
    amplitude → loudness
  • visible information: vowel/consonant, voicing,
    length, fricatives, stop closure
  • spectral features: Fourier spectrum / LPC
    spectrum - peaks characteristic of different
    sounds → formants
  • spectrogram: changes over time
  • digitization: sampling, quantization
  • processing → cepstral features / PLP features
    (see the sketch after this list)
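
A minimal NumPy sketch of the framing and Fourier-spectrum step (the parameter values are illustrative; cepstral or PLP features would be computed on top of this):

import numpy as np

def log_spectra(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    # split the digitized waveform into overlapping windowed frames
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))        # magnitude spectrum of one frame
        frames.append(np.log(spectrum + 1e-10))      # log scale; small constant avoids log(0)
    return np.array(frames)                          # one row per frame, i.e. a spectrogram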

15
Computing Acoustic Probabilities
  • simple way: vector quantization (cluster the
    feature vectors, count cluster occurrences)
  • continuous approach: calculate a probability
    density function (pdf) over the observations
  • Gaussian pdf, trained with the forward-backward
    algorithm (see the sketch after this list)
  • Gaussian mixtures, parameter tying
  • multi-layer perceptron (MLP) pdf, trained with
    error back-propagation
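
A minimal sketch of a diagonal-covariance Gaussian observation likelihood bj(ot), one such pdf per state (illustrative, not from the slides):

import numpy as np

def gaussian_loglike(o, mean, var):
    # log likelihood of feature vector o under a diagonal-covariance Gaussian
    o, mean, var = np.asarray(o), np.asarray(mean), np.asarray(var)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

A Gaussian mixture would sum the weighted (exponentiated) component likelihoods; the means and variances are the parameters re-estimated by the forward-backward training described later.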

16
Training A Speech Recognizer
  • evaluation metric: word error rate (sketched below)
    1. compute the minimum edit distance between the
       hypothesized and the correct word string
    2. divide the number of errors (substitutions,
       insertions, deletions) by the length of the
       correct string
  • e.g. correct: 'I went to a party'
    hypothesis: 'Eye went two a bar tea'
    3 substitutions, 1 insertion → word error rate 80%
  • state of the art: word error rate around 20% on
    natural-speech tasks
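
A minimal sketch of this metric via minimum edit distance over words (each substitution, insertion and deletion counts as one error):

def word_error_rate(correct, hypothesis):
    ref, hyp = correct.split(), hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + sub)   # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

word_error_rate("I went to a party", "Eye went two a bar tea") gives 4 / 5 = 0.8, matching the example above.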

17
Embedded Training
  • models to be trained:
    - language model p(wi | wi-1, wi-2)
    - observation likelihoods bj(ot)
    - transition probabilities aij
    - pronunciation lexicon / HMM state graph
  • training data:
    - corpus of speech wavefiles with word transcriptions
    - large text corpus for language model training
    - smaller corpus of phonetically labeled speech
  • N-gram language model trained as in Chapter 6
  • HMM lexicon structure built by hand
    - PRONLEX, CMUdict: off-the-shelf
      pronunciation dictionaries

18
Embedded Training, continued
  • HMM parameters
    - initial estimate: equal transition probabilities;
      observation probabilities bootstrapped (labeled
      speech → a label for each frame → initial
      Gaussian means / variances)
    - MLP systems: forced Viterbi alignment: features
      and correct words given → best states →
      labels for each input → retrain the MLP
    - Gaussian systems: forward-backward algorithm:
      compute forward and backward probabilities
      → re-estimate a and b (see the sketch below);
      correct words are known → prune the model
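
A minimal NumPy sketch of the forward and backward passes behind this re-estimation, for a set of emitting states (illustrative; the actual re-estimation formulas for a and b are omitted):

import numpy as np

def forward_backward(obs_like, trans, initial):
    # obs_like[t, j] = b_j(o_t), trans[i, j] = a_ij, initial[j] = P(first state is j)
    T, N = obs_like.shape
    alpha = np.zeros((T, N))          # forward probabilities
    beta = np.zeros((T, N))           # backward probabilities
    alpha[0] = initial * obs_like[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * obs_like[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (obs_like[t + 1] * beta[t + 1])
    gamma = alpha * beta              # per-frame state posteriors, used to re-estimate a and b
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma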

19
Speech Synthesis
  • text-to-speech (TTS) system output is a phone
    sequence with durations and an F0 pitch contour
  • waveform concatenation: based on a recorded
    speech database, segmented into short units
  • simplest: 1 unit per phone; join the units, smooth
    the edges
  • triphone models: too many combinations →
    diphones used instead
  • diphones start/end midway through a phone, for
    stability
  • does not model pitch and duration changes (prosody)

20
Speech Synthesis, continued
  • use signal processing to change prosody
  • LPC model separates pitch from the spectral
    envelope → to modify pitch: generate
    pulses at the desired pitch, re-excite the LPC
    coefficients → modified wave; to modify
    duration: contract/expand the coefficient
    frames
  • TD-PSOLA: frames centered around pitchmarks
    (see the sketch below) → to change pitch: move the
    pitchmarks closer together / further apart
    → to change duration: duplicate / leave out frames
    → recombine: overlap and add the frames
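
A heavily simplified NumPy sketch of the TD-PSOLA recombination (overlap-and-add) step, assuming the caller has already mapped each target pitchmark to a source pitchmark (duplicating or dropping frames as needed); frame_half is an illustrative half-window length:

import numpy as np

def psola_overlap_add(signal, source_marks, target_marks, frame_half=160):
    # take windowed frames centered on the source pitchmarks and
    # overlap-add them at the (shifted) target pitchmark positions
    out = np.zeros(target_marks[-1] + frame_half + 1)
    window = np.hanning(2 * frame_half + 1)
    for src, tgt in zip(source_marks, target_marks):
        if src < frame_half or src + frame_half + 1 > len(signal) or tgt < frame_half:
            continue                                  # skip frames that run off either signal
        frame = signal[src - frame_half:src + frame_half + 1] * window
        out[tgt - frame_half:tgt + frame_half + 1] += frame
    return out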

21
Speech Synthesis, continued
  • problems with speech synthesis:
    - 1 example per diphone is insufficient
    - signal processing → distortion
    - subtle effects are not modeled
  • unit selection: collect several examples per unit,
    with different pitch / duration / linguistic
    situation
  • selection method (F0 contour with 3 values per phone,
    large unit corpus; see the sketch below):
    1. find candidates (closest phone, duration, F0),
       rank them by target cost (closeness)
    2. measure the join quality of neighbouring candidates,
       rank the joins by concatenation cost
    - pick the best unit set → more natural speech
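
A minimal sketch of choosing the unit sequence by dynamic programming over target and concatenation costs (the cost functions are assumed to be given as described above; units are assumed hashable, e.g. database indices):

def select_units(candidates, target_cost, concat_cost):
    # candidates[i]: units available for target position i
    # target_cost(i, u): closeness of unit u to target i; concat_cost(p, u): join quality of p followed by u
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for u in candidates[i]:
            prev = min(best, key=lambda p: best[p][0] + concat_cost(p, u))
            cost = best[prev][0] + concat_cost(prev, u) + target_cost(i, u)
            new_best[u] = (cost, best[prev][1] + [u])
        best = new_best
    return min(best.values(), key=lambda item: item[0])[1]   # lowest-cost unit sequence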

22
Human Speech Recognition
  • PLP analysis is inspired by the human auditory system
  • lexical access has common properties:
    - frequency
    - parallelism
    - neighborhood effects
    - cue-based processing (phoneme restoration):
      formant structure, timing, voicing, lexical cues,
      word association, repetition priming
  • differences:
    - time-course: human processing is on-line
    - other cues: prosody

23
Exercises
1. Hand-simulate the Viterbi algorithm: use the
automaton in Figure 7.8 on the input [aa n n ax n
iy d]. What is the most probable string of
words?
2. Suggest two functions for use in A* decoding.
What criteria should the function satisfy for the
search to work (i.e. to return the best path)?