Application 3 - PowerPoint PPT Presentation

Application 3: AUTOMATIC SPEECH RECOGNITION – PowerPoint PPT presentation

Slides: 57
Provided by: Gonz227

Transcript and Presenter's Notes

1
Automatic Speech Recognition
  • Application 3

2
The Malevolent Hal
3
The Turing Test: A Philosophical Interlude
4
The Chinese Room
5
What Do the Two Thought Experiments Have in
Common?
6
Types of Performance
7
The Model
Do You Believe This?
8
Why ASR is Hard
9
A Tale of Aspiration
  • t in "tunafish"
  • Word-initial
  • Vocal cords don't vibrate. Produces a puff of
    air (aspiration)
  • t in "starfish"
  • t preceded by s
  • No puff of air, and voicing begins almost at once,
    so it sounds closer to d
  • k: vocal cords don't vibrate. Produces a puff of
    air.
  • g: vocal cords vibrate. No puff of air
  • But an initial s changes things: now k loses its
    puff of air and sounds closer to g
  • Leads to the mishearing of sk/sg
  • "the sky"
  • "this guy"
  • There's more going on: which hearing is more
    probable?

10
Ambiguity Exists at Different Levels (Jim Glass,
2007)
  • Acoustic-phonetic: "Let us pray" / "Lettuce spray"
  • Syntactic: "Meet her at home" / "Meter at home"
  • Semantic: "Is the baby crying" / "Is the bay bee crying"
  • Discourse: "It is easy to recognize speech" /
    "It is easy to wreck a nice beach"
  • Prosody: "I'm FLYING to Spokane" / "I'm flying to SPOKANE"

11
What to Do?
  • Language is not a system of rules
  • "t makes a certain sound"
  • "'to whom' is correct; 'to who' is incorrect"
  • Language is a collection of probabilities

12
Goal of a Probabilistic Noisy Channel Architecture
  • What is the most likely sequence of words W out
    of all word sequences in a language L given some
    acoustic input O?
  • Where O is a sequence of observations
  • O = o_1, o_2, o_3, ..., o_t
  • each o_i is a floating point value representing
    10 ms worth of energy of that slice of O
  • And W = w_1, w_2, w_3, ..., w_n
  • each w_i is a word in L

13
ASR as a Conditional Probability
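The formula on this slide did not survive as text. Given the goal stated on the previous slide, it is presumably the following, where the hat notation for the best word sequence is an assumption of this sketch:

```latex
\hat{W} = \operatorname*{arg\,max}_{W \in L} \; p(W \mid O)
```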
14
An Historical Aside
  • ASR is as old as computing
  • 1950s: Bell Labs, RCA Research, Lincoln Labs
  • Discoveries in acoustic phonetics applied to
    recognition of single digits, syllables, vowels
  • 1960s: pattern recognition techniques used in the US,
    Japan, Soviet Union
  • Two developments in the 1980s
  • DARPA funding for LVCSR
  • Application of HMMs to speech recognition

15
A Sentimental Journey
  • Recall the Decoding Task
  • Given an HMM, M
  • Given a hidden state sequence, Q, and an observation
    sequence, O
  • Determine p(Q | O)
  • Recall the Learning Task
  • Given O and Q, create M
  • Where M consists of two matrices
  • Priors: A = a_11, ..., a_1n, ..., a_n1, ..., a_nn,
    where a_ij = p(q_j | q_i)
  • Likelihoods: p(o_i | q_i)
  • But how do we get from the conditional probability
    p(W | O) to our likelihoods and priors?
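The decoding task recalled here is commonly solved with the Viterbi algorithm, which the deck names on a later slide. The following is a minimal sketch for a discrete HMM. The two-state hot/cold weather model, its numbers, and the names `pi`, `A`, `B`, and `viterbi` are illustrative assumptions, not values from the slides.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most probable state path for a discrete HMM.

    pi[i]   : probability of starting in state i
    A[i, j] : transition probability p(q_j | q_i)
    B[i, k] : likelihood p(o = k | q_i)
    obs     : list of observation indices
    """
    n_states, T = len(pi), len(obs)
    with np.errstate(divide="ignore"):          # log(0) = -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    logv = np.full((T, n_states), -np.inf)      # best log score ending in each state
    back = np.zeros((T, n_states), dtype=int)   # best predecessor of each state
    logv[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = logv[t - 1][:, None] + log_A   # scores[i, j]: leave i, enter j
        back[t] = scores.argmax(axis=0)
        logv[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(logv[-1].argmax())]
    for t in range(T - 1, 0, -1):               # follow back-pointers to the start
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Hypothetical model: state 0 = HOT, state 1 = COLD;
# observation k means k + 1 ice creams eaten that day.
pi = np.array([0.8, 0.2])
A = np.array([[0.6, 0.4],
              [0.5, 0.5]])
B = np.array([[0.2, 0.4, 0.4],
              [0.5, 0.4, 0.1]])
path = viterbi(pi, A, B, [2, 0, 2])   # 3, 1, 3 ice creams
print(path)
```

Working in log space avoids underflow on long observation sequences, which matters once each observation is a 10 ms frame.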

16
Parson Bayes to the Rescue
Author of Divine Benevolence, or an Attempt to
Prove That the Principal End of the Divine
Providence and Government is the Happiness of His
Creatures (1731)
17
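Bayes' Rule
Lets us transform
  argmax_W p(W | O)
To
  argmax_W p(O | W) p(W) / p(O) = argmax_W p(O | W) p(W)
(p(O) is the same for every W, so it drops out of the argmax)
In Fact
p(O | W): likelihoods, referred to as the
acoustic model. p(W): priors, referred to as the
language model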
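To make the division of labor between the acoustic model and the language model concrete, here is a small sketch that ranks two candidate transcriptions by log p(O | W) + log p(W). All probabilities are invented for illustration, and the names `acoustic`, `language`, and `decode` are hypothetical.

```python
import math

# Hypothetical scores, invented for illustration only.
acoustic = {                      # p(O | W): how well W explains the audio
    "recognize speech": 0.0020,
    "wreck a nice beach": 0.0025,
}
language = {                      # p(W): how plausible W is as a word sequence
    "recognize speech": 0.0100,
    "wreck a nice beach": 0.0001,
}

def decode(candidates):
    """Pick the candidate W that maximizes log p(O | W) + log p(W)."""
    def score(w):
        return math.log(acoustic[w]) + math.log(language[w])
    return max(candidates, key=score)

best = decode(list(acoustic))
print(best)
```

In this toy setting the acoustic model slightly prefers the wrong hearing, and the language model prior is what tips the decision.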
18
LVCSR
19
Another View
[Figure labels: digital signal processing; rep. of acoustic signal; p(O | W); p(W); Viterbi]
20
Creating Feature Vectors: DSP
  • Digitize the analog signal through sampling
  • Decide on a window size and perform an FFT
  • Output: the amount of energy at each frequency range
    (the spectrum)
  • Warp the spectrum to the mel scale and take the log
  • Take a further Fourier-type transform of the previous
    value (in practice an inverse transform such as the
    DCT): the cepstrum
  • The cepstrum is a model of the vocal tract
  • Save 13 values
  • Compute the change in these 13 over the next
    window
  • Compute the change in the 13 deltas
  • Total: 39 features per vector
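The pipeline above can be sketched with NumPy alone. This is a simplified illustration, not a production front end: it skips the mel filterbank step, and the 25 ms window, Hamming taper, and function names are assumptions made here. The 10 ms hop and the 13 + 13 + 13 layout follow the slides.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25, hop_ms=10):
    """Cut the signal into overlapping windows (one every hop_ms) and taper each."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(win)

def cepstra(frames, n_keep=13):
    """FFT -> log power spectrum -> DCT-II, keeping the first n_keep coefficients.
    Simplification: no mel filterbank is applied before the log."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = np.log(power + 1e-10)           # small offset avoids log(0)
    n_bins = log_power.shape[1]
    k = np.arange(n_keep)[:, None]
    n = np.arange(n_bins)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bins))   # DCT-II basis
    return log_power @ basis.T

def add_deltas(feats):
    """Append frame-to-frame changes (deltas) and changes of those (delta-deltas)."""
    delta = np.diff(feats, axis=0, append=feats[-1:])
    delta2 = np.diff(delta, axis=0, append=delta[-1:])
    return np.hstack([feats, delta, delta2])

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate        # one second of audio
signal = np.sin(2 * np.pi * 220 * t)            # stand-in for recorded speech
features = add_deltas(cepstra(frame_signal(signal, sample_rate)))
print(features.shape)
```

Each row of `features` is one observation o_t in the sense used on the earlier slides: a 39-dimensional vector covering one 10 ms step.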

21
Left Out
  • Computing the likelihood of feature vectors
  • Given an HMM state
  • The HMM state is a partial representation of a
    linguistic unit
  • p(o_t | q_i)
  • But first
  • What are these speech sounds?

22
Fundamental Theory of Phonetics
  • Spoken word
  • Composed of smaller units of speech
  • Called phones
  • Def: A phone is a speech sound
  • Phones are represented by symbols
  • IPA
  • ARPABET

23
English Vowels
24
Human Vocal Organs
25
Close-Up
26
Types of Sound
  • Glottis: the space between the vocal folds.
  • Vocal folds vibrate or don't vibrate
  • Voiced: consonants like b, d, g, v, z, and
    all English vowels
  • Unvoiced: consonants like p, t, k, f, s
  • Sounds passing through the nose: nasals
  • m, n, ng

27
Phone Classes
  • Consonants
  • produced by restricting the airflow
  • Vowels
  • unrestricted, usually voiced, and longer lasting
  • semivowels
  • y, voiced but shorter

28
Consonant Place of Articulation
  • Labial: b, m
  • Labiodental: v, f
  • Dental: th
  • Alveolar: s, z, t, d
  • Palatal: sh, ch, zh (Asian), jh (jar)
  • Velar: k (cuckoo), g (goose), ng (kingfisher)

29
Consonant Manner of Articulation
  • How the airflow is restricted
  • Stop or plosive: b, d, g
  • Nasal: air passes into the nasal cavity:
    n, m, ng
  • Fricative: air flow is not cut off completely:
    f, v, th, dh, s, z
  • Affricate: a stop followed by a fricative: ch
    (chicken), jh (giraffe)
  • Approximant: two articulators are close together
    but not close enough to cause turbulent air flow:
    y, w, r, l

30
Vowels
  • Characterized by height and backness
  • High front: tongue raised toward the front: iy
    (lily)
  • High back: tongue raised toward the back: uw
    (tulip)
  • Low front: ae (bat)
  • Low back: aa (poppy)

31
Acoustic Phonetics
  • Based on the sine wave

f = cycles per second; A = height of the wave; T =
1/f, the amount of time it takes a cycle to complete
32
Sound Waves
  • Plot the change in air pressure over time
  • Imagine an eardrum blocking air pressure waves
  • The graph measures the amount of compression and
    rarefaction (uncompression).

Waveform of iy in "She just had a baby"
33
She Just Had a Baby
Notice the vowels, fricative sh, and stop
release b
34
Fourier Analysis
  • Every complex wave form can be represented as a
    sum of component sine waves

Two waveforms: 10 Hz and 100 Hz
35
Spectrum
  • Spectrum of a signal is a representation of each
    of its frequency components and their amplitudes.

Spectrum of the 10 Hz + 100 Hz waveforms. Note the
two spikes
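The two-spike picture can be reproduced numerically. This sketch builds one second of a 10 Hz plus 100 Hz wave and reads the component frequencies back from an FFT. The sample rate, the amplitudes, and the 0.1 threshold are arbitrary choices made for the illustration.

```python
import numpy as np

sample_rate = 1000                          # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate    # one second of time stamps

# Each component is A * sin(2 * pi * f * t); its period is T = 1 / f.
wave = 1.0 * np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 100 * t)

# Magnitude spectrum, scaled so that a sine of amplitude A shows up as A.
spectrum = np.abs(np.fft.rfft(wave)) / (len(wave) / 2)
freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)

peaks = freqs[spectrum > 0.1]               # frequencies with noticeable energy
print(peaks)
```

Everything outside the two component frequencies is numerically zero, which is why the plotted spectrum shows exactly two spikes.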
36
Waveform for ae in "Had"
  • Note
  • 10 major waves, and 4 smaller waves within each of the
    10 larger
  • The frequency of the larger is 10 cycles / 0.0427 s =
    234 Hz
  • The frequency of the smaller is about 4 times
    that, or 930 Hz
  • Also
  • Some of the 930 Hz waves have two smaller waves
  • F = 2 × 930 = 1860 Hz
37
Spectrum for ae
Notice one of the peaks at just under 1000
Hz. Another at just under 2000 Hz
38
Conclusion
  • Spectral peaks that are visible in a spectrum are
    characteristic of different phones

39
What Remains
  • Computing the likelihood of feature vectors given
    a triphone
  • p(o_t | q_i)
  • Language model: p(W)

40
Spectrogram
  • Representation of the different frequencies that make
    up a waveform over time (a spectrum was a single
    point in time)
  • x axis: time
  • y axis: frequencies in Hz
  • darkness: amplitude
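A spectrogram is a sequence of spectra, one per short window. The sketch below computes one with NumPy, assuming a 256-sample Hann window and a 128-sample hop. Those settings, the test signal, and the function name are choices made for this illustration.

```python
import numpy as np

def spectrogram(signal, sample_rate, win=256, hop=128):
    """Return (times, freqs, mags), where mags[i, j] is the amplitude
    at frequency freqs[j] in the window centered at times[i]."""
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(win)          # taper each window
    mags = np.abs(np.fft.rfft(frames, axis=1))      # one spectrum per window
    freqs = np.fft.rfftfreq(win, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + win / 2) / sample_rate
    return times, freqs, mags

# Test signal: 500 Hz for the first half second, then 1500 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 1500 * t))

times, freqs, mags = spectrogram(signal, sample_rate)
first_peak = freqs[mags[0].argmax()]     # strongest frequency in the first window
last_peak = freqs[mags[-1].argmax()]     # strongest frequency in the last window
print(first_peak, last_peak)
```

Plotting `mags` with time on the x axis, frequency on the y axis, and darkness for magnitude gives the picture described on this slide.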

Spectrograms of ih, ae, ah
41
We Need a Sequence Classifier
42
HMMs in Action
  • Observation Sequence in ASR
  • Acoustic Feature Vectors
  • 39 real-valued features
  • Represents changes in energy in different
    frequency bands
  • Each vector represents 10ms
  • Hidden States
  • words for simple tasks like digit
    recognition/yes-no
  • phones or (usually) subphones

43
SIX
  • Bakis network: left-to-right HMM
  • Each a_ij is an entry in the priors matrix
  • Likelihood probabilities not shown
  • For each state there is a collection of
    likelihood observations
  • Each observation (now a vector of 39 features)
    has a probability given the state

44
But Phones Change Over Time
45
Necessary to Model Subphones
  • As Before
  • Bakis network: left-to-right HMM
  • Each a_ij is an entry in the priors matrix
  • Likelihood probabilities not shown
  • For each state there is a collection of
    likelihood observations
  • Each observation (now a vector of 39 features)
    has a probability given the state: p(o_t | q_i)
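As a rough illustration of a left-to-right topology over subphones, the sketch below builds the transition matrix for the word "six", assuming the pronunciation s ih k s and three subphones per phone. The `_b`/`_m`/`_f` state names, the absorbing end state, and the self-loop probability are hypothetical choices, not values from the slides.

```python
import numpy as np

def bakis_transitions(n_states, p_stay=0.5):
    """Left-to-right transition matrix: every emitting state may either
    repeat (self-loop) or advance to the next state; the final row is an
    absorbing end state. Backward transitions are impossible."""
    A = np.zeros((n_states + 1, n_states + 1))
    for i in range(n_states):
        A[i, i] = p_stay                # stay in the same subphone
        A[i, i + 1] = 1.0 - p_stay      # move on to the next subphone
    A[n_states, n_states] = 1.0         # end state
    return A

phones = ["s", "ih", "k", "s"]          # assumed pronunciation of "six"
subphones = [f"{p}_{part}" for p in phones for part in ("b", "m", "f")]
A = bakis_transitions(len(subphones))
print(subphones)
print(A.shape)
```

The self-loops are what let one subphone account for a variable number of 10 ms frames, so fast and slow pronunciations fit the same model.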

46
Coarticulation
Notice the difference in the 2nd formant of eh
in each context
47
Solution
  • Triphone
  • phone
  • left context
  • right context
  • Notation: y-eh+l is eh preceded by y and
    followed by l
  • Suppose there are 50 phones in a language:
    50^3 = 125,000 triphones
  • Not all will appear in a corpus
  • English disallows ae-eh+ow and m-j+t
  • WSJ study: 55,000 triphones needed but found only
    18,500
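The triphone count and the left-phone-right notation can be checked with a few lines of Python. The `sil` boundary symbol and the function name are assumptions made for this sketch.

```python
def triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbor, written left-phone+right.
    The boundary symbol pads the ends of the sequence."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{center}+{right}"
            for left, center, right in zip(padded, padded[1:], padded[2:])]

n_phones = 50
n_possible = n_phones ** 3      # every (left, phone, right) combination

labels = triphones(["y", "eh", "l"])
print(n_possible)
print(labels)
```

The count grows with the cube of the phone inventory, which is why many triphones never show up in a training corpus.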

48
Data Sparsity
  • Lucky for us, different contexts sometimes have
    similar effects. Notice w-iy/r-iy and m-iy/n-iy

49
State Tying
Initial subphones of t-iy+n and similar triphones share
acoustic reps (and likelihood probabilities). How?
Clustering algorithm
50
Acoustic Likelihood/Transition Probability
Computation
  • Problem 1: Which observation corresponds to which
    state?
  • p(o_t | q_i): likelihoods
  • Problem 2: What is the transition probability
    between states?
  • Priors
  • Hand labeled
  • Training corpus of isolated words in wav files
  • Start and stop time of each phone is marked by
    hand
  • Can compute the observation likelihoods by
    counting (like ice cream)
  • But requires 400 hours to label an hour of speech
  • Humans are bad at labeling units smaller than a
    phone
  • Embedded training
  • Wav file + corresponding transcription
  • Pronunciation lexicon
  • Raw (untrained) HMM
  • Baum-Welch sums over all possible segmentations
    of words and phones
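When every observation has been labeled with its state by hand, both tables can be estimated by counting. The sketch below does this for a toy sequence in the spirit of the ice cream example. The data and names are invented, and the observations are discrete symbols rather than the real-valued 39-feature vectors used in ASR.

```python
from collections import Counter, defaultdict

def estimate_by_counting(labeled_sequences):
    """Relative-frequency estimates of p(next state | state) and p(obs | state)
    from sequences of (state, observation) pairs."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in labeled_sequences:
        for state, obs in seq:
            emit[state][obs] += 1
        for (s1, _), (s2, _) in zip(seq, seq[1:]):
            trans[s1][s2] += 1

    def normalize(table):
        return {s: {k: c / sum(counts.values()) for k, c in counts.items()}
                for s, counts in table.items()}

    return normalize(trans), normalize(emit)

# Hypothetical hand-labeled data: (weather, ice creams eaten).
labeled = [[("HOT", 3), ("HOT", 2), ("COLD", 1), ("COLD", 1), ("HOT", 3)]]
transitions, likelihoods = estimate_by_counting(labeled)
print(transitions)
print(likelihoods)
```

Embedded training with Baum-Welch replaces these hard counts with expected counts, since the alignment between frames and states is no longer given.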

51
The Raw Materials
52
The Language Model
[Figure labels: digital signal processing; rep. of acoustic signal; p(O | W); p(W); Viterbi]
53
p(W)
  •  
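The body of this slide did not survive extraction. A standard way to expand p(W), assumed here because the next slide uses bigram probabilities and an independence assumption, is the chain rule followed by the bigram approximation, with w_0 taken to be a sentence-start marker:

```latex
p(W) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})
     \approx \prod_{i=1}^{n} p(w_i \mid w_{i-1})
```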

54
Bigram Probabilities
<s> Alex wrote his book </s>
<s> James wrote a different book </s>
<s> Alex wrote a book suggested by Judith </s>
P(wrote | Alex) = 2/2    P(a | wrote) = 2/3
P(book | a) = 1/2    P(</s> | book) = 2/3
Independence Assumption
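The bigram probabilities on this slide can be reproduced by counting over the three sentences. This is a minimal sketch with no smoothing; the helper names `p` and `sentence_probability` are made up for the example.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> Alex wrote his book </s>",
    "<s> James wrote a different book </s>",
    "<s> Alex wrote a book suggested by Judith </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words[:-1])             # every word that has a successor
    bigrams.update(zip(words, words[1:]))   # adjacent word pairs

def p(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return Fraction(bigrams[(prev, word)], unigrams[prev])

def sentence_probability(sentence):
    """Product of bigram probabilities: each word depends only on the previous one."""
    words = sentence.split()
    prob = Fraction(1)
    for prev, word in zip(words, words[1:]):
        prob *= p(word, prev)
    return prob

print(p("wrote", "Alex"), p("a", "wrote"), p("book", "a"), p("</s>", "book"))
print(sentence_probability("<s> Alex wrote a book </s>"))
```

Using `Fraction` keeps the results in the same form as the slide, such as 2/3, instead of rounded decimals.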
55
Issues
  • HMM requires independence assumptions
  • Researchers are now experimenting with deep
    belief networks
  • Stacks of Boltzmann machines
  • Non-global languages: the Google phenomenon
  • Detection of physical and emotional states
  • anger
  • frustration
  • sleepiness
  • intoxication
  • blame classification among married couples

56
Language is Our Defining Feature
And We Haven't Even Begun to Talk About
Understanding