Automatic Speech Recognition: Introduction
1
Automatic Speech Recognition: Introduction
  • Readings: Jurafsky & Martin 7.1-2
  • HLT Survey Chapter 1

2
The Human Dialogue System
3
The Human Dialogue System
4
Computer Dialogue Systems
[Diagram: spoken-dialogue pipeline. The input signal passes through Audition and Automatic Speech Recognition to produce words; Natural Language Understanding maps the words to a logical form for Dialogue Management and Planning; Natural Language Generation produces words, and Text-to-speech turns them back into a signal]
5
Computer Dialogue Systems
[Same diagram with abbreviated labels: Audition, ASR, NLU, Dialogue Mgmt./Planning, NLG, Text-to-speech]
6
Parameters of ASR Capabilities
  • Different types of tasks have different
    difficulties
  • Speaking mode (isolated words / continuous speech)
  • Speaking style (read / spontaneous)
  • Enrollment (speaker-independent / speaker-dependent)
  • Vocabulary (small < 20 words / large > 20,000 words)
  • Language model (finite state / context sensitive)
  • Perplexity (small < 10 / large > 100)
  • Signal-to-noise ratio (high > 30 dB / low < 10 dB)
  • Transducer (high-quality microphone / telephone)

7
The Noisy Channel Model
[Diagram: a message passes through a noisy channel and comes out as a signal; decoding recovers the message from the signal]
Decoding model: find Message = argmax P(Message | Signal).
But how do we represent each of these things?
8
ASR using HMMs
  • Try to solve P(Message | Signal) by breaking the
    problem up into separate components
  • Most common method: Hidden Markov Models (HMMs)
  • Assume that a message is composed of words
  • Assume that words are composed of sub-word parts
    (phones)
  • Assume that phones have some sort of acoustic
    realization
  • Use probabilistic models for matching acoustics
    to phones to words

9
HMMs: The Traditional View
[Diagram: for the utterance "go home", a Markov model backbone composed of phones g, o, h, o, m (hidden, because we don't know the correspondences) is linked to acoustic observations x0-x9]
Each line represents a probability estimate (more later).
10
HMMs: The Traditional View
[Same diagram: phone backbone g, o, h, o, m for "go home" with acoustic observations x0-x9]
Even with the same word hypothesis, there can be different alignments. Also, we have to search over all word hypotheses.
11
HMMs as Dynamic Bayesian Networks
Markov model backbone composed of phones
[Diagram: for "go home", hidden state variables q0=g, q1=o, q2=o, q3=o, q4=h, q5=o, q6=o, q7=o, q8=m, q9=m, each emitting an acoustic observation x0-x9]
12
HMMs as Dynamic Bayesian Networks
Markov model backbone composed of phones
[Same diagram: hidden states q0-q9 with acoustic observations x0-x9]
ASR: What is the best assignment to q0...q9 given x0...x9?
13
Hidden Markov Models as DBNs
[Diagrams: the same model shown in its DBN representation and in its Markov model representation]
14
Parts of an ASR System
[Diagram: Feature Calculation turns the signal into features; Acoustic Modeling maps features to phones (e.g. k, @); Language Modeling supplies bigram scores (cat dog 0.00002, cat the 0.0000005, the cat 0.029, the dog 0.031, the mail 0.054); a SEARCH component combines them to output "The cat chased the dog"]
15
Parts of an ASR System
[Same diagram, annotated: Feature Calculation produces acoustics (x_t), Acoustic Modeling maps acoustics to phones, phones are mapped to words, and Language Modeling strings words together]
16
Feature calculation
17
Feature calculation
[Spectrogram: frequency vs. time]
Find the energy at each time step in each frequency channel.
18
Feature calculation
[Spectrogram: frequency vs. time]
Take the inverse Discrete Fourier Transform to decorrelate the frequencies.
19
Feature calculation
Input: the spectrogram (energy per frequency channel per time step)
Output: one feature vector per frame, e.g.
  -0.1  0.3  1.4 -1.2  2.3  2.6
   0.2  0.1  1.2 -1.2  4.4  2.2
  -6.1 -2.1  3.1  2.4  1.0  2.2
   0.2  0.0  1.2 -1.2  4.4  2.2
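The last three slides can be sketched in a few lines of NumPy. This is a toy front end, not the exact recipe from the course: the frame size, hop, window, and rectangular frequency channels are illustrative, and real systems usually use mel-spaced filterbanks. The closing matrix multiply is a DCT, which plays the decorrelating role of the inverse DFT mentioned above.

```python
import numpy as np

def features(signal, sr=16000, frame_len=400, hop=160, n_channels=24, n_ceps=13):
    """Toy front end: log channel energies per frame, decorrelated with a DCT.

    `signal` is a 1-D float array; 400/160 samples = 25 ms / 10 ms at 16 kHz.
    """
    # Slice the waveform into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

    # Energy at each time step in each (crude, rectangular) frequency channel.
    power = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)) ** 2
    bands = np.array_split(power, n_channels, axis=1)
    log_energy = np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

    # DCT-II across channels decorrelates the log energies (cepstral coefficients).
    k = np.arange(n_channels)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_channels))
    return log_energy @ dct_basis.T          # shape: (n_frames, n_ceps)
```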

20
Robust Speech Recognition
  • Different schemes have been developed for dealing
    with noise and reverberation
  • Additive noise: reduce the effects of particular
    frequencies
  • Convolutional noise: remove the effects of linear
    filters (cepstral mean subtraction; see the sketch
    below)
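A minimal sketch of cepstral mean subtraction, assuming `feats` is a (frames × coefficients) matrix like the one produced above:

```python
import numpy as np

def cepstral_mean_subtraction(feats):
    """Remove the per-utterance mean of each cepstral coefficient.

    A linear channel filter becomes an additive constant in the cepstral
    domain, so subtracting the utterance mean largely cancels its effect.
    """
    return feats - feats.mean(axis=0, keepdims=True)
```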

21
Now what?
[Diagram: the feature vectors from the previous slide, a box labelled "???", and the output words "That you"]
22
Machine Learning!
[Same diagram: pattern recognition with HMMs maps the feature vectors to the words "That you"]
23
Hidden Markov Models (again!)
P(state_t+1 | state_t): pronunciation/language models
P(acoustics_t | state_t): acoustic model
24
Acoustic Model
  • Assume that you can label each vector with a
    phonetic label
  • Collect all of the examples of a phone together
    and build a Gaussian model (or some other
    statistical model, e.g. neural networks)

N_a(μ, Σ) ≈ P(X | state = a)
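A rough sketch of that idea, assuming per-frame feature vectors `feats` and phone `labels` are already available (the function names and the diagonal-covariance choice are illustrative):

```python
import numpy as np

def train_gaussians(feats, labels):
    """Fit a diagonal-covariance Gaussian N_a(mu, sigma^2) for each phone label a."""
    models = {}
    for phone in set(labels):
        x = feats[np.asarray(labels) == phone]
        models[phone] = (x.mean(axis=0), x.var(axis=0) + 1e-6)   # mean, variance per dim
    return models

def log_likelihood(x, model):
    """log P(x | state = a) under the diagonal Gaussian for phone a."""
    mu, var = model
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
```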
25
Building up the Markov Model
  • Start with a model for each phone
  • Typically, we use 3 states per phone to give a
    minimum duration constraint, but ignore that here

[Diagram: a left-to-right chain of states for one phone, with self-loops; each arc carries a transition probability]
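As a sketch, a 3-state left-to-right phone model can be written as a list of states plus a transition matrix; the state names and the 0.6 self-loop probability below are placeholders, not trained values:

```python
import numpy as np

def phone_hmm(phone, n_states=3, self_loop=0.6):
    """Build a left-to-right HMM for one phone: each state may repeat or advance.

    Using 3 states per phone enforces a minimum duration of 3 frames.
    Real systems learn these probabilities with EM.
    """
    states = [f"{phone}_{i}" for i in range(n_states)]
    trans = np.zeros((n_states, n_states + 1))      # extra column = exit to the next phone
    for i in range(n_states):
        trans[i, i] = self_loop                      # stay in the same state
        trans[i, i + 1] = 1.0 - self_loop            # advance to the next state / exit
    return states, trans
```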
26
Building up the Markov Model
  • Pronunciation model gives connections between
    phones and words
  • Multiple pronunciations

[Diagram: a pronunciation network with alternative phone paths (phones shown: t, m, ow, ey, ah)]
27
Building up the Markov Model
  • Language model gives connections between words
    (e.g., bigram grammar)

[Diagram: word-to-word transitions, e.g. p(he | that) and p(you | that)]
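A minimal sketch of estimating such bigram probabilities by relative frequency over a toy corpus; real language models add smoothing so that unseen bigrams do not receive zero probability:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate p(next | prev) by relative frequency over a list of word lists."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

lm = train_bigram_lm([["that", "you", "said"], ["that", "he", "left"]])
print(lm[("that", "you")])   # 0.5: "that" is followed by "you" in half the observed cases
```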
28
ASR as Bayesian Inference
[Diagram: hidden states q1-q3 for word w1 with acoustic observations x1-x3, plus word transitions p(he | that) and p(you | that)]
argmax_W P(W | X) = argmax_W P(X | W) P(W) / P(X)
                  = argmax_W P(X | W) P(W)
                  = argmax_W Σ_Q P(X, Q | W) P(W)
                  ≈ argmax_W max_Q P(X, Q | W) P(W)
                  ≈ argmax_W max_Q P(X | Q) P(Q | W) P(W)
29
ASR Probability Models
  • Three probability models
  • P(X | Q): acoustic model
  • P(Q | W): duration/transition/pronunciation model
  • P(W): language model
  • Language/pronunciation models are inferred from prior
    knowledge
  • Other models are learned from data (how?)

30
Parts of an ASR System
[Same diagram as before, now labelled with the probability models: Acoustic Modeling provides P(X | Q), the pronunciation/transition model provides P(Q | W), Language Modeling provides P(W), and SEARCH combines them to output "The cat chased the dog"]
31
EM for ASR: The Forward-Backward Algorithm
  • Determine state occupancy probabilities
  • I.e., assign each data vector to a state
  • Calculate new transition probabilities and new means
    and standard deviations (emission probabilities)
    using these assignments (see the sketch below)
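A minimal sketch of the state-occupancy (E-step) part for a single observation sequence, assuming a transition matrix `A` (states × states), per-frame emission likelihoods `B` (states × frames), and an initial distribution `pi`; new transition and emission parameters are then re-estimated from these occupancies. A real implementation rescales or works in the log domain to avoid underflow.

```python
import numpy as np

def state_occupancy(A, B, pi):
    """Forward-backward: gamma[i, t] = P(state_t = i | all observations)."""
    n_states, n_frames = B.shape
    alpha = np.zeros((n_states, n_frames))
    beta = np.ones((n_states, n_frames))

    alpha[:, 0] = pi * B[:, 0]
    for t in range(1, n_frames):                       # forward pass
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, t]
    for t in range(n_frames - 2, -1, -1):              # backward pass
        beta[:, t] = A @ (B[:, t + 1] * beta[:, t + 1])

    gamma = alpha * beta
    return gamma / gamma.sum(axis=0, keepdims=True)    # normalise per frame
```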

32
ASR as Bayesian Inference
[Diagram: hidden states q1-q3 for word w1 with acoustic observations x1-x3, plus word transitions p(he | that) and p(you | that)]
argmax_W P(W | X) = argmax_W P(X | W) P(W) / P(X)
                  = argmax_W P(X | W) P(W)
                  = argmax_W Σ_Q P(X, Q | W) P(W)
                  ≈ argmax_W max_Q P(X, Q | W) P(W)
                  ≈ argmax_W max_Q P(X | Q) P(Q | W) P(W)
33
Search
  • When trying to find W* = argmax_W P(W | X), we need
    to look at (in theory)
  • All possible word sequences W
  • All possible segmentations/alignments of W to X
  • Generally, this is done by searching the space of W
  • Viterbi search: a dynamic programming approach that
    looks for the most likely path (see the sketch below)
  • A* search: an alternative method that keeps a stack
    of hypotheses around
  • If W is large, pruning becomes important
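A minimal sketch of Viterbi decoding over HMM states in the log domain, using the same `A`, `B`, `pi` conventions as the forward-backward sketch above; a full recognizer searches over word sequences by layering pronunciation and language-model structure on top of this state-level search.

```python
import numpy as np

def viterbi(A, B, pi):
    """Return the most likely state sequence given emission likelihoods B (states x frames)."""
    n_states, n_frames = B.shape
    logA, logB = np.log(A + 1e-300), np.log(B + 1e-300)
    score = np.log(pi + 1e-300) + logB[:, 0]
    back = np.zeros((n_states, n_frames), dtype=int)

    for t in range(1, n_frames):
        cand = score[:, None] + logA                 # cand[i, j]: best path ending with i -> j
        back[:, t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logB[:, t]

    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):             # trace the best path backwards
        path.append(int(back[path[-1], t]))
    return path[::-1]
```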

34
How to train an ASR system
  • Have a speech corpus at hand
  • Should have word (and preferably phone)
    transcriptions
  • Divide into training, development, and test sets
  • Develop models of prior knowledge
  • Pronunciation dictionary
  • Grammar
  • Train acoustic models
  • Possibly realigning corpus phonetically

35
How to train an ASR system
  • Test on your development data (baseline)
  • Think real hard
  • Figure out some neat new modification
  • Retrain system component
  • Test on your development data
  • Lather, rinse, repeat
  • Then, at the end of the project, test on the test
    data.

36
Judging the quality of a system
  • Usually, ASR performance is judged by the word
    error rate
  • Error Rate = 100 × (Subs + Ins + Dels) / N_words
  • REF:  I  WANT  TO   GO  HOME
  • REC:     WANT  TWO  GO  HOME  NOW
  • SC:   D  C     S    C   C     I
  • 100 × (1 Sub + 1 Ins + 1 Del) / 5 = 60%
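Word error rate is an edit distance over words; a minimal sketch that reproduces the example above:

```python
def word_error_rate(ref, rec):
    """WER = 100 * (substitutions + insertions + deletions) / len(ref)."""
    # d[i][j]: minimum edit distance between ref[:i] and rec[:j]
    d = [[0] * (len(rec) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(rec) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(rec) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != rec[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(rec)] / len(ref)

print(word_error_rate("I WANT TO GO HOME".split(),
                      "WANT TWO GO HOME NOW".split()))   # 60.0
```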

37
Judging the quality of a system
  • Usually, ASR performance is judged by the word
    error rate
  • This assumes that all errors are equal
  • Also, there is a bit of a mismatch between the
    optimization criterion and the error measurement
  • Other (task-specific) measures are sometimes used
  • Task completion
  • Concept error rate