Application 3 - PowerPoint PPT Presentation

About This Presentation

Title:

Application 3

Description:

Application 3 AUTOMATIC SPEECH RECOGNITION – PowerPoint PPT presentation

Slides: 57

Provided by: Gonz227

Transcript and Presenter's Notes

1
Automatic Speech Recognition

Application 3

2
The Malevolent HAL
3
The Turing Test: A Philosophical Interlude
4
The Chinese Room
5
What Do the Two Thought Experiments Have in
Common?
6
Types of Performance
7
The Model
Do You Believe This?
8
Why ASR is Hard
9
A Tale of Aspiration

t in "tunafish"
Word-initial
Vocal cords don't vibrate. Produces a puff of
air (aspiration)
t in "starfish"
t preceded by s
No air puff, and voicing starts almost at once, so
this t sounds close to d
k: vocal cords don't vibrate. Produces a puff of
air.
g: vocal cords vibrate. No air puff
But an initial s changes things: k loses its puff
and now sounds close to g
Leads to the mishearing of sk/sg
"the sky"
"this guy"
There's more going on: which hearing is more
probable?

10
Ambiguity Exists at Different Levels (Jim Glass,
2007)

Acoustic-Phonetic
Let us pray / Lettuce spray
Syntactic
Meet her at home
Meter at home
Semantic
Is the baby crying
Is the bay bee crying
Discourse
It is easy to recognize speech
It is easy to wreck a nice beach
Prosody
I'm FLYING to Spokane
I'm flying to SPOKANE

11
What to Do?

Language is not a system of rules
"t makes a certain sound"
"to whom" is correct, "to who" is incorrect
Language is a collection of probabilities

12
Goal of a Probabilistic Noisy Channel Architecture

What is the most likely sequence of words W out
of all word sequences in a language L given some
acoustic input O?
Where O is a sequence of observations
O = o_1, o_2, o_3, ..., o_t
each o_i is a floating point value representing
10 ms worth of energy of that slice of O
And W = w_1, w_2, w_3, ..., w_n
each w_i is a word in L

13
ASR as a Conditional Probability
14
An Historical Aside

ASR Is as Old as Computing
1950s: Bell Labs, RCA Research, Lincoln Labs
Discoveries in acoustic phonetics applied to
recognition of single digits, syllables, vowels
1960s: Pattern recognition techniques used in US,
Japan, Soviet Union
Two Developments in the 1980s
DARPA funding for LVCSR (large-vocabulary continuous speech recognition)
Application of HMMs to speech recognition

15
A Sentimental Journey

Recall the Decoding Task
Given an HMM, M
Given a hidden state sequence, Q, and an observation
sequence, O
Determine p(Q|O)
Recall the Learning Task
Given O and Q, create M
Where M consists of two matrices
Priors: A = a_11, ..., a_1n, ..., a_n1, ..., a_nn,
where a_ij = p(q_j|q_i)
Likelihoods: p(o_i|q_i)
But how do we get from p(W|O),
i.e., the most likely W given O, to our likelihoods and priors?

16
Parson Bayes to the Rescue
Author of Divine Benevolence, or an Attempt to
Prove That the Principal End of the Divine
Providence and Government is the Happiness of His
Creatures (1731)
17
Bayes Rule
Lets us transform

p(W|O)

To

p(O|W) p(W) / p(O)

In Fact
p(O|W): likelihoods, referred to as the
acoustic model
p(W): priors, referred to as the
language model
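
Written out with the W, O, and L of slide 12, the standard noisy-channel derivation behind this slide runs as follows. This is a reconstruction in conventional notation, not the slide's original rendering. The hat on W is an added symbol for the best word sequence.

```latex
\begin{align*}
\hat{W} &= \operatorname*{arg\,max}_{W \in L} \; p(W \mid O) \\
        &= \operatorname*{arg\,max}_{W \in L} \; \frac{p(O \mid W)\, p(W)}{p(O)} \\
        &= \operatorname*{arg\,max}_{W \in L} \; p(O \mid W)\, p(W)
\end{align*}
```

p(O) is the same for every candidate W, so dropping it does not change which W wins. What remains is the acoustic model times the language model.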
18
LVCSR
19
Another View
[Diagram labels: digital signal processing; rep. of acoustic signal; p(O|W); p(W); Viterbi]
20
Creating Feature Vectors: DSP

Digitize the analog signal through sampling
Decide on a window size and perform FFT
Output: amount of energy at each frequency range,
the spectrum
Warp the spectrum onto the mel scale and take the log
Take the (inverse) Fourier transform of the previous value: the cepstrum
Cepstrum is a model of the vocal tract
Save 13 values
Compute the change in these 13 over the next
window
Compute the change in the 13 deltas
Total: 39 features per vector
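
The last three steps (13 values, their change, and the change of that change) can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own code: the function name `add_deltas`, the zero change assigned to the final frame, and the made-up input frames are all assumptions, and real front ends usually compute deltas over a wider window.

```python
def add_deltas(frames):
    """Extend each 13-value frame to 39 values: static + delta + delta-delta."""
    def change_to_next(seq):
        # Change from each frame to the next one. The last frame has no
        # successor, so its change is set to zero (an assumption of this sketch).
        out = []
        for t in range(len(seq)):
            nxt = seq[t + 1] if t + 1 < len(seq) else seq[t]
            out.append([b - a for a, b in zip(seq[t], nxt)])
        return out

    deltas = change_to_next(frames)        # change in the 13 values
    delta_deltas = change_to_next(deltas)  # change in the 13 deltas
    return [s + d + dd for s, d, dd in zip(frames, deltas, delta_deltas)]


# Three made-up frames of 13 cepstral values each (hypothetical data).
frames = [[float(t * k) for k in range(13)] for t in range(3)]
features = add_deltas(frames)
print(len(features[0]))  # 13 + 13 + 13 values per frame
```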

21
Left Out

Computing the likelihood of feature vectors
Given an HMM state
The HMM state is a partial representation of a
linguistic unit
p(o_t|q_i)
But First
What are these speech sounds?

22
Fundamental Theory of Phonetics

Spoken word
Composed of smaller units of speech
Called phones
Def: A phone is a speech sound
Phones are represented by symbols
IPA
ARPABET

23
English Vowels
24
Human Vocal Organs
25
Close-Up
26
Types of sound

Glottis: space between vocal folds.
Vocal folds vibrate/don't vibrate
Voiced: consonants like b, d, g, v, z,
all English vowels
Unvoiced: consonants like p, t, k, f, s
Sounds passing through the nose: nasals
m, n, ng

27
Phone Classes

Consonants
produced by restricting the airflow
Vowels
unrestricted, usually voiced, and longer lasting
Semivowels
y, voiced but shorter

28
Consonant Place of Articulation

Labial: b, m
Labiodental: v, f
Dental: th
Alveolar: s, z, t, d
Palatal: sh, ch, zh (Asian), jh (jar)
Velar: k (cuckoo), g (goose), ng (kingfisher)

29
Consonant Manner of Articulation

How the airflow is restricted
Stop or plosive: b, d, g
Nasal: air passes into the nasal cavity
n, m, ng
Fricative: air flow is not cut off completely
f, v, th, dh,
s, z
Affricate: stop followed by fricative, ch
(chicken), jh (giraffe)
Approximant: two articulators are close together
but not close enough to cause turbulent air flow
y, w, r, l

30
Vowels

Characterized by height and backness
High front: tongue raised toward the front, iy
(lily)
High back: tongue raised toward the back, uw
(tulip)
Low front: ae (bat)
Low back: aa (poppy)

31
Acoustic Phonetics

Based on the sine wave

f: frequency, cycles per second
A: amplitude, height of the wave
T = 1/f: period, the amount of time it takes a cycle to complete
32
Sound Waves

Plot the change in air pressure over time
Imagine an eardrum blocking air pressure waves
Graph measures the amount of compression and
rarefaction (uncompression).

[Figure: waveform of iy in "She just had a baby"]
33
She Just Had a Baby
Notice the vowels, fricative sh, and stop
release b
34
Fourier Analysis

Every complex waveform can be represented as a
sum of component sine waves

[Figure: two waveforms, 10 Hz and 100 Hz]
35
Spectrum

Spectrum of a signal is a representation of each
of its frequency components and their amplitudes.

Spectrum of the combined 10 Hz and 100 Hz waveforms. Note the
two spikes
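
The two spikes can be reproduced with a naive discrete Fourier transform in plain Python. This is a sketch under stated assumptions: the 1000 Hz sample rate, the one-second duration, and the function name `dft_magnitudes` are chosen for illustration, and a real system would use an FFT library instead of this slow loop.

```python
import cmath
import math


def dft_magnitudes(signal, max_bin):
    """Magnitudes of DFT bins 0..max_bin-1 (naive, not an FFT)."""
    n = len(signal)
    mags = []
    for k in range(max_bin):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        mags.append(abs(total))
    return mags


SAMPLE_RATE = 1000  # samples per second (assumed for this sketch)

# One second of a 10 Hz sine wave plus a 100 Hz sine wave.
signal = [math.sin(2 * math.pi * 10 * i / SAMPLE_RATE)
          + math.sin(2 * math.pi * 100 * i / SAMPLE_RATE)
          for i in range(SAMPLE_RATE)]

# With exactly one second of signal, bin k corresponds to k Hz.
mags = dft_magnitudes(signal, max_bin=200)
peaks = [k for k, m in enumerate(mags) if m > 0.5 * max(mags)]
print(peaks)  # frequencies (Hz) carrying most of the energy
```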
36
Waveform for ae in "had"

Note
10 major waves, and 4 smaller waves within each of the 10 larger
The frequency of the larger is 10 cycles / 0.0427 s,
about 234 Hz
The frequency of the smaller is about 4 times
that, or about 930 Hz
Also
Some of the 930 Hz waves have two smaller waves
F = 2 x 930 = 1860 Hz

37
Spectrum for ae
Notice one of the peaks at just under 1000
Hz, another at just under 2000 Hz
38
Conclusion

Spectral peaks that are visible in a spectrum are
characteristic of different phones

39
What Remains

Computing likelihood probability of vectors given
a triphone
p(o_t|q_i)
Language model p(W)

40
Spectrogram

Representation of different frequencies that make
up a waveform over time (spectrum was a single
point in time)
x axis: time
y axis: frequencies in Hz
darkness: amplitude

[Figure: spectrograms of ih, ae, ah]
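
A spectrogram is a spectrum computed frame by frame. The sketch below builds one as a list of rows (time) by columns (frequency bins) for a made-up signal that switches from 50 Hz to 200 Hz halfway through. The 1000 Hz sample rate, the non-overlapping 100-sample frames, and the function name `spectrogram` are assumptions for illustration. Real front ends use shorter, overlapping, windowed frames.

```python
import cmath
import math


def spectrogram(signal, frame_size):
    """Rows are time frames, columns are frequency bins, values are magnitudes."""
    rows = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        rows.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / frame_size)
                    for i, x in enumerate(frame)))
            for k in range(frame_size // 2)
        ])
    return rows


SAMPLE_RATE = 1000  # samples per second (assumed)
FRAME_SIZE = 100    # 100 ms frames, so each bin is 10 Hz wide

# Half a second at 50 Hz, then half a second at 200 Hz (hypothetical signal).
signal = [math.sin(2 * math.pi * (50 if i < 500 else 200) * i / SAMPLE_RATE)
          for i in range(1000)]

spec = spectrogram(signal, FRAME_SIZE)
# The "darkest" cell in each row, converted from bin index to Hz.
loudest_hz = [row.index(max(row)) * SAMPLE_RATE // FRAME_SIZE for row in spec]
print(loudest_hz)
```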
41
We Need a Sequence Classifier
42
HMMs in Action

Observation Sequence in ASR
Acoustic Feature Vectors
39 real-valued features
Represents changes in energy in different
frequency bands
Each vector represents 10ms
Hidden States
words, for simple tasks like digit
recognition/yes-no
phones or (usually) subphones

43
SIX

Bakis network: left-right HMM
Each a_ij is an entry in the priors matrix
Likelihood probabilities not shown
For each state there is a collection of
likelihood observations
Each observation (now a vector of 39 features)
has a probability given the state
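
Decoding over a left-right HMM like this one can be sketched with the Viterbi algorithm. Everything numeric here is invented for illustration: the states `s1, ih, k, s2` stand for the phones of "six", the discrete symbols `S`, `IH`, `K` stand in for 39-feature vectors, and the transition and emission probabilities are made up. Each state either loops or moves one step right, which is what makes it a Bakis network.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log probability."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log probability of the best path that ends in s at time t.
    best = [{s: log(start_p.get(s, 0.0)) + log(emit_p[s].get(obs[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r].get(s, 0.0)))
                 for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


states = ["s1", "ih", "k", "s2"]
start_p = {"s1": 1.0}  # a left-right model always starts in its first state
trans_p = {            # self-loop or one step right (made-up values)
    "s1": {"s1": 0.5, "ih": 0.5},
    "ih": {"ih": 0.5, "k": 0.5},
    "k": {"k": 0.5, "s2": 0.5},
    "s2": {"s2": 0.5},  # the other 0.5 would leave the word (not modeled)
}
emit_p = {             # made-up likelihoods p(o_t|q_i)
    "s1": {"S": 0.8, "IH": 0.1, "K": 0.1},
    "ih": {"S": 0.1, "IH": 0.8, "K": 0.1},
    "k": {"S": 0.1, "IH": 0.1, "K": 0.8},
    "s2": {"S": 0.8, "IH": 0.1, "K": 0.1},
}

path, log_prob = viterbi(["S", "S", "IH", "IH", "K", "S"],
                         states, start_p, trans_p, emit_p)
print(path)
```

Log probabilities are used so that long observation sequences do not underflow.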

44
But Phones Change Over Time
45
Necessary to Model Subphones

As Before
Bakis network: left-right HMM
Each a_ij is an entry in the priors matrix
Likelihood probabilities not shown
For each state there is a collection of
likelihood observations
Each observation (now a vector of 39 features)
has a probability given the state, p(o_t|q_i)

46
Coarticulation
Notice the difference in the 2nd formant of eh
in each context
47
Solution

Triphone
phone
left context
right context
Notation: y-eh+l, eh preceded by y and
followed by l
Suppose there are 50 phones in a language
50^3 = 125,000 possible triphones
Not all will appear in a corpus
English disallows ae-eh+ow and m-j+t
WSJ study: 55,000 triphones needed, but only
18,500 found
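
The triphone notation and the 125,000 figure can be checked with a short Python sketch. The function name `to_triphones` and the use of `sil` (silence) as the context at word edges are assumptions for illustration, not something the slides specify.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone as left-phone+right, padding the edges with `boundary`."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


# Every phone can appear with any left and any right neighbour.
num_phones = 50
possible_triphones = num_phones ** 3

labels = to_triphones(["y", "eh", "l"])
print(possible_triphones, labels)
```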

48
Data Sparsity

Lucky for us, different contexts sometimes have
similar effects. Notice w-iy/r-iy and m-iy/n-iy

49
State Tying
Initial subphones of similar triphones (such as t-iy+n) share
acoustic reps (and likelihood probabilities). How?
Clustering algorithm
50
Acoustic Likelihood/Transition Probability
Computation

Problem 1: Which observation corresponds to which
state?
p(o_t|q_i): likelihoods
Problem 2: What is the transition probability
between states?
Priors
Hand labeled
Training corpus of isolated words in wav files
Start and stop time of each phone is marked by
hand
Can compute the observation likelihoods by
counting (like ice cream)
But requires 400 hours to label an hour of speech
Humans are bad at labeling units smaller than a
phone
Embedded training
Wav file + corresponding transcription
Pronunciation lexicon
Raw (untrained) HMM
Baum-Welch sums over all possible segmentations
of words and phones
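
For the hand-labeled case, "computing by counting" can be sketched as below. The data are invented: each frame is a (state, observation) pair, and the discrete symbols `S` and `IH` stand in for real feature vectors, which would need a continuous density, not a count table. The function name `estimate_by_counting` is hypothetical. Embedded training with Baum-Welch replaces these hard counts with expected counts and is not shown.

```python
from collections import Counter
from fractions import Fraction


def estimate_by_counting(labeled_sequences):
    """Maximum-likelihood transition and emission tables from labeled frames."""
    trans_counts, emit_counts = Counter(), Counter()
    from_counts, state_counts = Counter(), Counter()
    for seq in labeled_sequences:
        for state, obs in seq:
            emit_counts[(state, obs)] += 1
            state_counts[state] += 1
        for (prev, _), (nxt, _) in zip(seq, seq[1:]):
            trans_counts[(prev, nxt)] += 1
            from_counts[prev] += 1
    # Priors: p(q_j|q_i) = count(q_i followed by q_j) / count(q_i with a successor)
    trans = {pair: Fraction(c, from_counts[pair[0]])
             for pair, c in trans_counts.items()}
    # Likelihoods: p(o|q) = count(q emits o) / count(q)
    emit = {pair: Fraction(c, state_counts[pair[0]])
            for pair, c in emit_counts.items()}
    return trans, emit


# Two tiny hand-labeled utterances (hypothetical data).
labeled = [
    [("s", "S"), ("s", "S"), ("ih", "IH")],
    [("s", "S"), ("ih", "IH"), ("ih", "S")],
]
trans, emit = estimate_by_counting(labeled)
print(trans[("s", "ih")], emit[("ih", "IH")])
```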

51
The Raw Materials
52
The Language Model
[Diagram labels: digital signal processing; rep. of acoustic signal; p(O|W); p(W); Viterbi]
53
p(W)

54
Bigram Probabilities
<s> Alex wrote his book </s>
<s> James wrote a different book </s>
<s> Alex wrote a book suggested by Judith </s>
P(wrote|Alex) = 2/2    P(a|wrote) = 2/3
P(book|a) = 1/2    P(</s>|book) = 2/3
Independence Assumption
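
The bigram estimates on this slide can be reproduced by counting over its three sentences. Under the independence assumption, each word depends only on the word before it, so a sentence's probability is the product of its bigram probabilities. The function names `p` and `sentence_probability` and the test sentence are choices made for this sketch.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> Alex wrote his book </s>",
    "<s> James wrote a different book </s>",
    "<s> Alex wrote a book suggested by Judith </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))


def p(word, prev):
    """Maximum-likelihood bigram estimate P(word|prev)."""
    return Fraction(bigrams[(prev, word)], unigrams[prev])


def sentence_probability(sentence):
    """Product of bigram probabilities: each word depends only on the previous one."""
    tokens = sentence.split()
    prob = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob


print(p("wrote", "Alex"), p("a", "wrote"), p("book", "a"), p("</s>", "book"))
print(sentence_probability("<s> Alex wrote a book </s>"))
```

Exact fractions are used so the results can be compared directly with the slide's values.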
55
Issues