Title: Application 3
1Automatic Speech Recognition
2The Malevolent Hal
3The Turing Test: A Philosophical Interlude
4The Chinese Room
5What Do the Two Thought Experiments Have in
Common?
6Types of Performance
7The Model
Do You Believe This?
8Why ASR is Hard
9A Tale of Aspiration
- t in "tunafish"
- Word-initial
- Vocal cords don't vibrate. Produces a puff of air
- t in "starfish"
- t preceded by s
- Vocal cords vibrate. No air puff
- k: vocal cords don't vibrate. Produces a puff of air
- g: vocal cords vibrate. No air puff
- But an initial s changes things: now k vibrates
- Leads to the mishearing of sk/sg
- "the sky"
- "this guy"
- There's more going on: which hearing is more probable?
10Ambiguity Exists at Different Levels (Jim Glass,
2007)
- Acoustic-Phonetic
- Let us pray / Lettuce spray
- Syntactic
- Meet her at home
- Meter at home
- Semantic
- Is the baby crying
- Is the bay bee crying
- Discourse
- It is easy to recognize speech
- It is easy to wreck a nice beach
- Prosody
- I'm FLYING to Spokane
- I'm flying to SPOKANE
11What to Do?
- Language is not a system of rules
- t makes a certain sound
- "to whom" is correct; "to who" is incorrect
- Language is a collection of probabilities
12Goal Of a Probabilistic Noisy Channel Architecture
- What is the most likely sequence of words W out of all word sequences in a language L given some acoustic input O?
- Where O is a sequence of observations
- O = o1, o2, o3, ..., ot
- each oi is a floating point value representing 10ms worth of energy of that slice of O
- And W = w1, w2, w3, ..., wn
- each wi is a word in L
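The idea of turning a signal into one observation per 10 ms slice can be sketched as follows. This is a minimal illustration, not the slides' own front end: the function name frame_energies, the 8000 Hz sample rate, the synthetic sine-wave input, and the use of non-overlapping frames are all assumptions made for the example (practical systems typically use overlapping windows).

```python
import math

def frame_energies(samples, sample_rate, frame_ms=10):
    """Split samples into non-overlapping frames; return mean energy per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per 10 ms slice
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

# Hypothetical input: 50 ms of a 100 Hz sine wave sampled at 8000 Hz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(400)]

# O = o1, ..., ot: one floating point value per 10 ms slice.
O = frame_energies(signal, sample_rate)
print(len(O), O)
```

Each 10 ms frame here holds 80 samples, so the 400-sample signal yields an observation sequence of length 5.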
13ASR as a Conditional Probability
W* = argmax over W in L of p(W|O)
14An Historical Aside
- ASR Is as Old as Computing
- 50s: Bell Labs, RCA Research, Lincoln Labs
- Discoveries in acoustic phonetics applied to recognition of single digits, syllables, vowels
- 60s: Pattern recognition techniques used in US, Japan, Soviet Union
- Two Developments in 80s
- DARPA funding for LVCSR
- Application of HMMs to speech recognition
15A Sentimental Journey
- Recall the Decoding Task
- Given an HMM, M
- Given a hidden state sequence, Q, and an observation sequence, O
- Determine p(Q|O)
- Recall the Learning Task
- Given O and Q, create M
- Where M consists of two matrices
- Priors: A = a11, ..., a1n, ..., an1, ..., ann, where aij = p(qj|qi)
- Likelihoods: p(oi|qi)
- But how do we get from the conditional probability p(W|O)
- to our likelihoods and priors?
16Parson Bayes to the Rescue
Author of Divine Benevolence, or an Attempt to
Prove That the Principal End of the Divine
Providence and Government is the Happiness of His
Creatures (1731)
17Bayes Rule
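Lets us transform p(W|O)
to p(O|W) p(W) / p(O)
In fact, p(O) is the same for every candidate W, so
W* = argmax over W in L of p(O|W) p(W)
p(O|W): likelihoods, referred to as the acoustic model
p(W): priors, referred to as the language model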
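A toy decoder makes the division of labor between the two models concrete. This is only a sketch: the two candidate word sequences are borrowed from the ambiguity slide, every probability below is invented for illustration, and the function name decode is hypothetical. Scores are added in log space, which is equivalent to multiplying p(O|W) by p(W).

```python
import math

# Invented scores for a single acoustic input O (illustration only).
acoustic_model = {            # p(O | W): how well W explains the audio
    "recognize speech": 1e-5,
    "wreck a nice beach": 2e-5,
}
language_model = {            # p(W): how plausible W is as English
    "recognize speech": 1e-4,
    "wreck a nice beach": 1e-7,
}

def decode(candidates):
    """Return the W maximizing p(O|W) * p(W), computed as a sum of logs."""
    def score(w):
        return math.log(acoustic_model[w]) + math.log(language_model[w])
    return max(candidates, key=score)

best = decode(list(acoustic_model))
print(best)
```

With these made-up numbers the acoustic model slightly prefers "wreck a nice beach", but the language model outweighs it, so the decoder returns "recognize speech".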
18LVCSR
19Another View
[Figure labels: digital signal processing; rep. of acoustic signal; p(O|W); p(W); Viterbi]
20Creating Feature Vectors: DSP
- Digitize the analog signal through sampling
- Decide on a window size and perform FFT
- Output: amount of energy at each frequency range (spectrum)
- log(FFT) is mel scale value
- Take FFT of the previous value: cepstrum
- Cepstrum is a model of the vocal tract
- Save 13 values
- Compute the change in these 13 over the next window
- Compute the change in the 13 deltas
- Total: 39 features per vector
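The last three steps (13 values, their changes, and the changes of those changes) can be sketched as below. The FFT and cepstrum stages are not reproduced; the cepstral values are made-up numbers, the function names deltas and feature_vectors are hypothetical, and giving the final window a change of zero is an assumption about how to handle the edge.

```python
def deltas(frames):
    """Change in each value from this window to the next (zero at the last window)."""
    out = []
    for t in range(len(frames)):
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([b - a for a, b in zip(frames[t], nxt)])
    return out

def feature_vectors(cepstra):
    """Concatenate 13 values, 13 deltas, and 13 delta-deltas per window."""
    d = deltas(cepstra)    # change in the 13 values
    dd = deltas(d)         # change in the 13 deltas
    return [c + d1 + d2 for c, d1, d2 in zip(cepstra, d, dd)]

# Hypothetical cepstral values: 4 windows of 13 numbers each.
cepstra = [[float(t * t + k) for k in range(13)] for t in range(4)]
features = feature_vectors(cepstra)
print(len(features), len(features[0]))
```

Each window ends up as a single vector of 13 + 13 + 13 = 39 numbers.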
21Left Out
- Computing the likelihood of feature vectors
- Given an HMM state
- The HMM state is a partial representation of a linguistic unit
- p(ot|qi)
- But First
- What are these Speech Sounds?
22Fundamental Theory of Phonetics
- Spoken word
- Composed of smaller units of speech
- Called phones
- Def: A phone is a speech sound
- Phones are represented by symbols
- IPA
- ARPABET
23English Vowels
24Human Vocal Organs
25Close-Up
26Types of sound
- Glottis: space between vocal folds
- Glottis vibrates/doesn't vibrate
- Voiced consonants like b, d, g, v, z, all English vowels
- Unvoiced consonants like p, t, k, f, s
- Sounds passing through nose: nasals
- m, n, ng
27Phone Classes
- Consonants
- produced by restricting the airflow
- Vowels
- unrestricted, usually voiced, and longer lasting
- semivowels
- y, voiced but shorter
28Consonant Place of Articulation
- Labial: b, m
- Labiodental: v, f
- Dental: th
- Alveolar: s, z, t, d
- Palatal: sh, ch, zh (Asian), jh (jar)
- Velar: k (cuckoo), g (goose), ng (kingfisher)
29Consonant Manner of Articulation
- How the airflow is restricted
- stop or plosive: b, d, g
- nasal: air passes into the nasal cavity: n, m, ng
- fricative: air flow is not cut off completely: f, v, th, dh, s, z
- affricates: stops followed by fricative: ch (chicken), jh (giraffe)
- approximants: two articulators are close together but not close enough to cause turbulent air flow: y, w, r, l
30Vowels
- Characterized by height and backness
- High Front: tongue raised toward the front: iy (lily)
- High Back: tongue raised toward the back: uw (tulip)
- Low Front: ae (bat)
- Low Back: aa (poppy)
31Acoustic Phonetics
f: cycles per second
A: height of the wave
T = 1/f: the amount of time it takes a cycle to complete
32Sound Waves
- Plot the change in air pressure over time
- Imagine an eardrum blocking air pressure waves
- Graph measures the amount of compression and uncompression (rarefaction).
iy in "She just had a baby"
33She Just Had a Baby
Notice the vowels, fricative sh, and stop
release b
34Fourier Analysis
- Every complex wave form can be represented as a
sum of component sine waves
Two waveforms: 10 Hz and 100 Hz
35Spectrum
- Spectrum of a signal is a representation of each
of its frequency components and their amplitudes.
Spectrum of the 10 Hz and 100 Hz waveforms. Note the two spikes
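The two spikes can be reproduced with a small experiment: add a 10 Hz and a 100 Hz sine wave, then measure how much energy sits at each frequency. The sketch uses a plain (slow) discrete Fourier transform rather than an FFT library. The 400 Hz sample rate, the one-second duration, and the function name dft_magnitudes are assumptions chosen so that bin k corresponds to exactly k Hz.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each frequency bin from 0 up to half the sample rate."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags

# One second of signal at 400 samples per second, so bin k is k Hz.
sample_rate = 400
samples = [math.sin(2 * math.pi * 10 * t / sample_rate)
           + math.sin(2 * math.pi * 100 * t / sample_rate)
           for t in range(sample_rate)]

mags = dft_magnitudes(samples)
# The two largest bins are the two component frequencies.
peaks = sorted(sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2])
print(peaks)  # [10, 100]
```

Every other bin is essentially zero, which is why the spectrum of this waveform shows just two spikes.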
36Wave form for ae in Had
- Note
- 10 major waves and 4 smaller within the 10 larger
- The frequency of the larger is 10 cycles/.0427 s = 234 Hz
- The frequency of the smaller is about 4 times that, or 930 Hz
- Also
- Some of the 930 Hz waves have two smaller waves
- F = 2 × 930 = 1860 Hz
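The arithmetic on this slide is just cycles divided by time. A short check, with the function name frequency_hz being a hypothetical helper:

```python
def frequency_hz(cycles, seconds):
    """Frequency is the number of cycles completed per second."""
    return cycles / seconds

f_major = frequency_hz(10, 0.0427)  # the 10 large waves in the ae segment
f_minor = 4 * f_major               # 4 smaller waves inside each large one
period = 1 / f_major                # T = 1/f, seconds per large cycle

print(round(f_major), round(f_minor), round(period, 5))
```

Four times 234 Hz comes out closer to 937 Hz; the slide's 930 Hz is a rough figure read off the waveform, and doubling it gives the 1860 Hz component.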
37Spectrum for ae
Notice one of the peaks at just under 1000 Hz, another at just under 2000 Hz.
38Conclusion
- Spectral peaks that are visible in a spectrum are
characteristic of different phones
39What Remains
- Computing likelihood probability of vectors given a triphone
- p(ot|qi)
- Language model: p(W)
40Spectrogram
- Representation of different frequencies that make up a waveform over time (spectrum was a single point in time)
- x axis: time
- y axis: frequencies in Hz
- darkness: amplitude
ih ae ah
41We Need a Sequence Classifier
42HMMs in Action
- Observation Sequence in ASR
- Acoustic Feature Vectors
- 39 real-valued features
- Represents changes in energy in different frequency bands
- Each vector represents 10ms
- Hidden States
- words for simple tasks like digit recognition/yes-no
- phones or (usually) subphones
43SIX
- Bakis Network: Left-Right HMM
- Each aij is an entry in the priors matrix
- Likelihood probabilities not shown
- For each state there is a collection of likelihood observations
- Each observation (now a vector of 39 features) has a probability given the state
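A left-to-right HMM for "six" can be decoded with Viterbi in a few lines. This is a simplified sketch, not the slide's model: the three observation symbols ("hiss", "vowel", "burst") stand in for the 39-feature vectors, every probability is invented, the state names s1 and s2 simply distinguish the two s phones, and the path is forced to start in the first state and end in the last.

```python
import math

states = ["s1", "ih", "k", "s2"]  # phones of "six", in order

# aij: each state may repeat (self-loop) or move one step to the right.
trans = {(i, i): 0.5 for i in range(4)}
trans.update({(i, i + 1): 0.5 for i in range(3)})

# Invented likelihoods p(o|q) over three coarse observation symbols.
emit = [
    {"hiss": 0.8, "vowel": 0.1, "burst": 0.1},  # s1
    {"hiss": 0.1, "vowel": 0.8, "burst": 0.1},  # ih
    {"hiss": 0.1, "vowel": 0.1, "burst": 0.8},  # k
    {"hiss": 0.8, "vowel": 0.1, "burst": 0.1},  # s2
]

def viterbi(obs):
    """Most likely state path that starts in s1 and ends in s2."""
    n = len(states)
    neg_inf = float("-inf")
    best = [[neg_inf] * n for _ in obs]  # best[t][j]: best log prob ending in j
    back = [[0] * n for _ in obs]        # back[t][j]: predecessor on that path
    best[0][0] = math.log(emit[0][obs[0]])
    for t in range(1, len(obs)):
        for j in range(n):
            for i in (j - 1, j):         # only left-to-right moves are allowed
                if i < 0 or best[t - 1][i] == neg_inf:
                    continue
                score = (best[t - 1][i] + math.log(trans[(i, j)])
                         + math.log(emit[j][obs[t]]))
                if score > best[t][j]:
                    best[t][j] = score
                    back[t][j] = i
    if best[-1][n - 1] == neg_inf:
        return None                      # too few observations to reach s2
    path = [n - 1]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return [states[j] for j in reversed(path)]

path = viterbi(["hiss", "hiss", "vowel", "vowel", "burst", "hiss"])
print(path)
```

For this input the decoder aligns two frames to s1, two to ih, and one each to k and s2, which shows how self-loops let a phone span a variable number of 10 ms frames.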
44But Phones Change Over Time
45Necessary to Model Subphones
- As Before
- Bakis Network: Left-Right HMM
- Each aij is an entry in the priors matrix
- Likelihood probabilities not shown
- For each state there is a collection of likelihood observations
- Each observation (now a vector of 39 features) has a probability given the state: p(ot|qi)
46Coarticulation
Notice the difference in the 2nd formant of eh
in each context
47Solution
- Triphone
- phone
- left context
- right context
- Notation: y-eh+l: eh preceded by y and followed by l
- Suppose there are 50 phones in a language: 125,000 triphones
- Not all will appear in a corpus
- English disallows ae-eh+ow and m-j+t
- WSJ study: 55,000 triphones needed but found only 18,500
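The triphone idea and the 125,000 figure can be checked with a short sketch. The left-phone+right spelling (for example y-eh+l) is assumed here as the notation, the function name to_triphones is hypothetical, and padding word edges with a "sil" (silence) context is an assumption made for the example.

```python
def to_triphones(phones):
    """Label each phone with its left and right context, e.g. y-eh+l."""
    triphones = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"                 # assumed padding
        right = phones[i + 1] if i + 1 < len(phones) else "sil"  # assumed padding
        triphones.append(f"{left}-{phone}+{right}")
    return triphones

n_phones = 50
possible = n_phones ** 3  # any phone, with any left and any right context
print(possible)           # 125000
print(to_triphones(["y", "eh", "l"]))
```

Because every phone can in principle pair with any left and right neighbor, the count grows as the cube of the phone inventory, which is what makes data sparsity a concern.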
48Data Sparsity
- Lucky for us, different contexts sometimes have similar effects. Notice w-iy/r-iy and m-iy/n-iy
49State Tying
Initial subphones of triphones such as t-iy+n share acoustic reps (and likelihood probabilities) with those of similar triphones.
How? Clustering algorithm
50Acoustic Likelihood/transition Probability
Computation
- Problem 1: Which observation corresponds to which state?
- p(ot|qi): Likelihoods
- Problem 2: What is the transition probability between states?
- Priors
- Hand labeled
- Training corpus of isolated words in wav file
- Start and stop time of each phone is marked by hand
- Can compute the observation likelihoods by counting (like ice cream)
- But requires 400 hours to label an hour of speech
- Humans are bad at labeling units smaller than a phone
- Embedded training
- Wav file + corresponding transcription
- Pronunciation lexicon
- Raw (untrained) HMM
- Baum-Welch sums over all possible segmentations of words and phones
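When frames are hand labeled, estimating by counting is straightforward. The sketch below applies the counting idea to the transition probabilities (Problem 2): each probability is the number of times one label follows another, divided by how often the first label occurs. The frame labels for the word "had" are invented, and the function name transition_probs is hypothetical. Embedded training with Baum-Welch is not shown.

```python
from collections import Counter

def transition_probs(labeled_frames):
    """Estimate p(next | current) by counting adjacent frame labels."""
    pair_counts = Counter(zip(labeled_frames, labeled_frames[1:]))
    from_counts = Counter(labeled_frames[:-1])  # last frame has no successor
    return {(a, b): count / from_counts[a]
            for (a, b), count in pair_counts.items()}

# Invented hand labels: one phone label per 10 ms frame of "had".
frames = ["hh", "hh", "ae", "ae", "ae", "ae", "d", "d"]
probs = transition_probs(frames)
for (a, b), p in sorted(probs.items()):
    print(f"p({b}|{a}) = {p}")
```

The long vowel shows up as a high self-loop probability for ae (3 of its 4 outgoing transitions stay in ae).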
51The Raw Materials
52The Language Model
[Figure labels: digital signal processing; rep. of acoustic signal; p(O|W); p(W); Viterbi]
53p(W)
54Bigram Probabilities
<s> Alex wrote his book </s>
<s> James wrote a different book </s>
<s> Alex wrote a book suggested by Judith </s>
P(wrote|Alex) = 2/2    P(a|wrote) = 2/3
P(book|a) = 1/2    P(</s>|book) = 2/3
Independence Assumption
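The bigram estimates on this slide come from counting word pairs in the three training sentences. A minimal sketch follows; the function names p and sentence_prob are hypothetical, and the test sentence scored at the end is an illustration rather than part of the slide.

```python
from collections import Counter

corpus = [
    "<s> Alex wrote his book </s>",
    "<s> James wrote a different book </s>",
    "<s> Alex wrote a book suggested by Judith </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(word, prev):
    """Maximum likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    """Product of bigram probabilities: each word depends only on the one before."""
    words = sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p(word, prev)
    return prob

print(p("wrote", "Alex"), p("a", "wrote"), p("book", "a"), p("</s>", "book"))
print(sentence_prob("<s> Alex wrote a book </s>"))
```

The independence assumption is what lets sentence_prob multiply the bigram terms together, so a sentence never seen in training can still receive a nonzero probability.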
55Issues
- HMM requires independence assumptions
- Researchers are now experimenting with deep belief networks
- Stacks of Boltzmann machines
- Non-global languages: the Google phenomenon
- Detection of physical and emotional states
- anger
- frustration
- sleepiness
- intoxication
- blame classification among married couples
56Language is Our Defining Feature
And We Haven't Even Begun to Talk About Understanding