Title: LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing
1. LING 138/238 / SYMBSYS 138: Intro to Computer Speech and Language Processing
- Lecture 9: Speech Recognition (I)
- October 26, 2004
- Dan Jurafsky
2. Outline for ASR this week
- Acoustic Phonetics
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system
- Feature Extraction
- Acoustic Model
- Lexicon/Pronunciation Model
- Decoder
- Language Model
- Evaluation
3. Acoustic Phonetics
- Sound Waves
- http://www.kettering.edu/drussell/Demos/waves-intro/waves-intro.html
- http://www.kettering.edu/drussell/Demos/waves/Lwave.gif
4. Waveforms for speech
- Waveform of the vowel iy
- Frequency: repetitions/second of a wave
- Above vowel has 28 reps in .11 secs
- So freq is 28/.11 ≈ 255 Hz
- This is the speed at which the vocal folds move, hence voicing
- Amplitude (y-axis): amount of air pressure at that point in time
- Zero is normal air pressure; negative is rarefaction
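The frequency estimate above is just cycles divided by duration; a minimal sketch using the slide's own counts:

```python
# Estimate fundamental frequency by counting waveform repetitions,
# using the numbers from the slide: 28 cycles in 0.11 seconds.
cycles = 28
duration_s = 0.11
f0_hz = cycles / duration_s  # rate of vocal-fold vibration (voicing)
print(round(f0_hz))  # 255
```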
5. She just had a baby
- What can we learn from a wavefile?
- Vowels are voiced, long, loud
- Length in time = length in space in the waveform picture
- Voicing: regular peaks in amplitude
- When stops are closed: no peaks, silence
- Peaks = voicing: .46 to .58 (vowel iy), from .65 to .74 (vowel ax), and so on
- Silence of stop closure (1.06 to 1.08 for first b, 1.26 to 1.28 for second b)
- Fricatives like sh: intense irregular pattern, see .33 to .46
6. Examples from Ladefoged
pad
bad
spat
7. Spectra
- New idea: spectra (singular: spectrum)
- A different way to view a waveform
- Fourier analysis: every wave can be represented as a sum of many simple waves of different frequencies
- Articulatory facts:
- The vocal cord vibrations create harmonics
- The mouth is an amplifier
- Depending on the shape of the mouth, some harmonics are amplified more than others
8. Part of ae waveform from "had"
- Note the complex wave repeating nine times in the figure
- Plus a smaller wave which repeats 4 times for every large pattern
- Large wave has a frequency of 250 Hz (9 times in .036 seconds)
- Small wave is roughly 4 times this, or roughly 1000 Hz
- Two little tiny waves on top of the peaks of the 1000 Hz waves
9. A spectrum
- A spectrum represents these frequency components
- Computed by the Fourier transform, an algorithm which separates out each frequency component of a wave
- x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
- Peaks at 930 Hz, 1860 Hz, and 3020 Hz
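The Fourier analysis described here can be sketched with a naive DFT (a toy illustration, not the FFT a real system would use). The test signal below, a 250 Hz wave plus a weaker 1000 Hz wave, echoes the "had" example but is invented for this sketch:

```python
import math

def dft_magnitudes(signal, sample_rate):
    """Naive DFT: magnitude of each frequency bin up to the Nyquist limit."""
    n = len(signal)
    mags = []
    for k in range(n // 2):  # bin k corresponds to k * sample_rate / n Hz
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(-signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

# A strong 250 Hz component plus a weaker 1000 Hz component
sr, n = 8000, 800
sig = [math.sin(2 * math.pi * 250 * t / sr)
       + 0.3 * math.sin(2 * math.pi * 1000 * t / sr) for t in range(n)]
mags = dft_magnitudes(sig, sr)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(peak_bin * sr / n)  # 250.0 -- the strongest frequency component
```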
10. Spectrogram
11. Formants
- Vowels are largely distinguished by 2 characteristic pitches.
- One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u (whisper iy eh uw)
- The other goes up for the first four vowels and then down for the next four.
- creaky voice: iy ih eh ae (goes up)
- creaky voice: aa ow uh uw (goes down)
- These are called the "formants" of the vowels; the lower is the 1st formant, the higher the 2nd formant.
12. How formants are produced
- Q: why do vowels have different pitches if the vocal cords vibrate at the same rate?
- A: the mouth as "amplifier" amplifies different frequencies
- Formants are the result of different shapes of the vocal tract.
- Any body of air will vibrate in a way that depends on its size and shape. The air in the vocal tract is set in vibration by the action of the vocal cords. Every time the vocal cords open and close, a pulse of air from the lungs acts like a sharp tap on the air in the vocal tract,
- setting the resonating cavities into vibration so that they produce a number of different frequencies.
13. How to read spectrograms
- bab: closure of lips lowers all formants, so there is a rapid increase in all formants at the beginning of "bab"
- dad: first formant increases, but F2 and F3 fall slightly
- gag: F2 and F3 come together; this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
14. She came back and started again
- 1. lots of high-freq energy
- 3. closure for k
- 4. burst of aspiration for k
- 5. ey vowel; faint 1100 Hz formant is nasalization
- 6. bilabial nasal
- 7. short b closure, voicing barely visible
- 8. ae; note upward transitions after bilabial stop at beginning
- 9. note F2 and F3 coming together for "k"
15. Spectrogram for "She just had a baby"
16. Perceptual properties
- Pitch: perceptual correlate of frequency
- Loudness: perceptual correlate of power, which is related to the square of amplitude
17. Speech Recognition
- Applications of Speech Recognition (ASR)
- Dictation
- Telephone-based Information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Speaker Identification
- Language Identification
- Second language ('L2') (accent reduction)
- Audio archive searching
18. LVCSR
- Large Vocabulary Continuous Speech Recognition
- 20,000-64,000 words
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)
19. LVCSR Design Intuition
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search
20. Speech Recognition Architecture
21. The Noisy Channel Model
- Search through space of all possible sentences.
- Pick the one that is most probable given the
waveform.
22. The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L given some acoustic input O?
- Treat acoustic input O as a sequence of individual observations
- O = o1, o2, o3, …, ot
- Define a sentence as a sequence of words
- W = w1, w2, w3, …, wn
23. Noisy Channel Model (III)
- Probabilistic implication: pick the highest-probability sentence:
- Ŵ = argmax_{W in L} P(W | O)
- We can use Bayes' rule to rewrite this:
- Ŵ = argmax_{W in L} P(O | W) P(W) / P(O)
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
- Ŵ = argmax_{W in L} P(O | W) P(W)
24. A quick derivation of Bayes' Rule
- Conditionals: P(A | B) = P(A, B) / P(B) and P(B | A) = P(A, B) / P(A)
- Rearranging: P(A, B) = P(A | B) P(B)
- And also: P(A, B) = P(B | A) P(A)
25. Bayes (II)
- We know: P(A | B) P(B) = P(B | A) P(A)
- So rearranging things: P(A | B) = P(B | A) P(A) / P(B)
26. Noisy channel model
- Ŵ = argmax_{W in L} P(O | W) P(W), where P(O | W) is the likelihood and P(W) is the prior
27. The noisy channel model
- Ignoring the denominator leaves us with two factors: P(Source) and P(Signal | Source)
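The two remaining factors can be illustrated with a toy decoder; the candidate sentences and their probabilities below are invented for illustration, not outputs of any real acoustic or language model:

```python
# Noisy-channel decoding sketch: pick the sentence W maximizing
# P(O|W) * P(W).  All numbers are made-up toy values.
candidates = {
    "recognize speech":   {"likelihood": 0.40, "prior": 0.010},
    "wreck a nice beach": {"likelihood": 0.45, "prior": 0.001},
}
best = max(candidates,
           key=lambda w: candidates[w]["likelihood"] * candidates[w]["prior"])
print(best)  # recognize speech: the higher prior wins despite a lower likelihood
```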
28. Five easy pieces
- Feature extraction
- Acoustic Modeling
- HMMs, Lexicons, and Pronunciation
- Decoding
- Language Modeling
29. Feature Extraction
- Digitize Speech
- Extract Frames
30. Digitizing Speech
31. Digitizing Speech (A-D)
- Sampling:
- measuring the amplitude of the signal at time t
- 16,000 Hz (samples/sec): Microphone (Wideband)
- 8,000 Hz (samples/sec): Telephone
- Why?
- Need at least 2 samples per cycle
- max measurable frequency is half the sampling rate
- Human speech < 10,000 Hz, so need max 20K
- Telephone filtered at 4K, so 8K is enough
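The 2-samples-per-cycle requirement can be demonstrated directly: a tone above half the sampling rate yields exactly the same samples as a lower-frequency alias. The tone frequencies below are made up for this self-contained sketch:

```python
import math

def sample(freq_hz, rate_hz, n):
    """Sample a sine tone of freq_hz at rate_hz, rounded for comparison."""
    return [round(math.sin(2 * math.pi * freq_hz * t / rate_hz), 6)
            for t in range(n)]

# A 5000 Hz tone sampled at 8000 Hz is indistinguishable from a 3000 Hz
# tone (its alias, 8000 - 5000): the Nyquist limit is rate/2 = 4000 Hz.
s_hi = sample(5000, 8000, 16)
s_lo = [-x for x in sample(3000, 8000, 16)]
print(s_hi == s_lo)  # True
```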
32. Digitizing Speech (II)
- Quantization
- Representing the real value of each amplitude as an integer
- 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
- Formats:
- 16-bit PCM
- 8-bit mu-law; log compression
- LSB (Intel) vs. MSB (Sun, Apple)
- Headers:
- Raw (no header)
- Microsoft wav
- Sun .au (40 byte header)
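The 8-bit mu-law format mentioned above uses logarithmic companding; here is a sketch of the standard mu-law compression curve (mu = 255), shown before the final 8-bit coding step:

```python
import math

MU = 255  # the standard mu for 8-bit mu-law

def mu_law_compress(x):
    """Map a sample in [-1, 1] to [-1, 1] with logarithmic companding."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

# Small amplitudes get much finer resolution than large ones:
print(round(mu_law_compress(0.01), 3))  # 0.228: 1% of full scale uses ~23% of the code range
print(round(mu_law_compress(0.5), 3))   # 0.876
```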
33. Frame Extraction
- A frame (25 ms wide) is extracted every 10 ms
- (Figure from Simon Arnfield: overlapping 25 ms frames spaced 10 ms apart, yielding one feature vector a1, a2, a3, … per frame)
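The 25 ms / 10 ms windowing above can be sketched as a slicing function (the one-second, 16 kHz input is an invented example):

```python
def extract_frames(samples, rate_hz, frame_ms=25, step_ms=10):
    """Slice a sampled signal into overlapping 25 ms frames every 10 ms."""
    frame_len = rate_hz * frame_ms // 1000
    step = rate_hz * step_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step)]

# One second at 16 kHz: 400-sample frames, one every 160 samples
frames = extract_frames(list(range(16000)), 16000)
print(len(frames), len(frames[0]))  # 98 400
```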
34. MFCC (Mel Frequency Cepstral Coefficients)
- Do FFT to get spectral information
- Like the spectrogram/spectrum we saw earlier
- Apply Mel scaling
- Linear below 1 kHz, log above; equal samples above and below 1 kHz
- Models the human ear: more sensitivity in lower freqs
- Plus Discrete Cosine Transformation
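Mel scaling can be sketched with one common formula for the mel scale; the particular constants (2595, 700) are an assumption, since the slide gives only the linear-below-1 kHz, log-above shape:

```python
import math

def hz_to_mel(f_hz):
    """A common mel-scale formula: roughly linear below 1 kHz, log above,
    matching the ear's greater sensitivity at low frequencies."""
    return 2595 * math.log10(1 + f_hz / 700)

# Equal mel steps cover narrow bands at low frequencies, wide ones up high:
print(round(hz_to_mel(1000)))  # 1000
print(round(hz_to_mel(4000)))  # 2146: 3 kHz more of audio, only ~1146 more mels
```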
35. Final Feature Vector
- 39 features per 10 ms frame:
- 12 MFCC features
- 12 Delta MFCC features
- 12 Delta-Delta MFCC features
- 1 (log) frame energy
- 1 Delta (log) frame energy
- 1 Delta-Delta (log) frame energy
- So each frame is represented by a 39D vector
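The delta and delta-delta features are frame-to-frame derivatives; here is a minimal sketch using simple first differences (real front ends usually fit a regression over several frames, so this is an assumption):

```python
def deltas(feature_track):
    """First differences of one per-frame feature: a simple version of the
    delta features appended to the 12 MFCCs and the energy term."""
    return [feature_track[i + 1] - feature_track[i]
            for i in range(len(feature_track) - 1)]

energy = [1.0, 1.5, 2.5, 2.0]  # toy per-frame log energies
d = deltas(energy)             # delta energy
dd = deltas(d)                 # delta-delta energy
print(d, dd)  # [0.5, 1.0, -0.5] [0.5, -1.5]
```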
36. Acoustic Modeling
- Given a 39-d vector corresponding to the observation of one frame, oi
- And given a phone q we want to detect
- Compute p(oi | q)
- Most popular method:
- GMM (Gaussian mixture models)
- Other methods:
- MLP (multi-layer perceptron)
37. Acoustic Modeling: MLP computes p(q | o)
38. Gaussian Mixture Models
- Also called fully-continuous HMMs
- P(o | q) computed by a Gaussian
39. Gaussians for Acoustic Modeling
- A Gaussian is parameterized by a mean and a variance
- (Figure: P(o | q) plotted against o for different means; P(o | q) is highest at the mean and low far from the mean)
40. Training Gaussians
- A (single) Gaussian is characterized by a mean and a variance
- Imagine that we had some training data in which each phone was labeled
- We could just compute the mean and variance from the data
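Computing the mean and variance from labeled frames, then using them as p(o | q), can be sketched in one dimension (the training values are toy numbers):

```python
import math

def fit_gaussian(values):
    """Mean and (population) variance of the frames labeled with one phone."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def gaussian_pdf(o, mean, var):
    """p(o | q) under a single Gaussian."""
    return math.exp(-(o - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy 1-D "cepstral feature" values for frames labeled with one phone
mean, var = fit_gaussian([2.0, 4.0, 6.0])
print(mean)  # 4.0
print(gaussian_pdf(4.0, mean, var) > gaussian_pdf(8.0, mean, var))  # True: highest at the mean
```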
41. But we need 39 Gaussians, not 1!
- The observation o is really a vector of length 39
- So we need a vector of Gaussians
42. Actually, a mixture of Gaussians
- Each phone is modeled by a sum of different Gaussians
- Hence able to model complex facts about the data
- (Figure: mixture densities for Phone A and Phone B)
43. Gaussians: acoustic modeling
- Summary: each phone is represented by a GMM parameterized by:
- M mixture weights
- M mean vectors
- M covariance matrices
- Usually assume the covariance matrix is diagonal
- I.e. just keep a separate variance for each cepstral feature
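The GMM summarized above (weights, means, and variances per mixture) can be sketched in 1-D, with the diagonal covariance reduced to a scalar variance; all parameter values are invented:

```python
import math

def gaussian_pdf(o, mean, var):
    return math.exp(-(o - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_likelihood(o, weights, means, variances):
    """p(o | q) as a weighted sum of M Gaussians (M = 2 here)."""
    return sum(w * gaussian_pdf(o, m, v)
               for w, m, v in zip(weights, means, variances))

# A two-mode phone model that no single Gaussian could capture
w, m, v = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
near_mode = gmm_likelihood(2.0, w, m, v)
between = gmm_likelihood(0.0, w, m, v)
print(near_mode > between)  # True: the density peaks near each mode
```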
44. ASR Lexicon: Markov Models for pronunciation
45. The Hidden Markov Model
46. Formal definition of HMM
- States: a set of states Q = q1, q2, …, qN
- Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann
- Each aij represents P(j | i)
- Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation t
- Special non-emitting initial and final states
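The definition above can be sketched as a small data structure: states Q, transitions aij = P(j | i), and observation likelihoods bi(ot), with non-emitting start and end states. All numbers are invented for illustration, not trained values:

```python
# A toy HMM over two emitting states; the observation symbols and all
# probabilities are made up for this sketch.
hmm = {
    "states": ["start", "ih", "t", "end"],  # start/end are non-emitting
    "A": {("start", "ih"): 1.0, ("ih", "ih"): 0.6, ("ih", "t"): 0.4,
          ("t", "t"): 0.3, ("t", "end"): 0.7},
    "B": {"ih": lambda o: 0.8 if o == "high-F2" else 0.2,
          "t": lambda o: 0.9 if o == "burst" else 0.1},
}

# Score one path (start -> ih -> t -> end) against two observations
path_prob = (hmm["A"][("start", "ih")] * hmm["B"]["ih"]("high-F2")
             * hmm["A"][("ih", "t")] * hmm["B"]["t"]("burst")
             * hmm["A"][("t", "end")])
print(round(path_prob, 4))  # 0.2016
```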
47. Pieces of the HMM
- Observation likelihoods (b), p(o | q), represent the acoustics of each phone, and are computed by the Gaussians (Acoustic Model, or AM)
- Transition probabilities represent the probability of different pronunciations (different sequences of phones)
- States correspond to phones
48. Pieces of the HMM
- Actually, I lied when I said states correspond to phones
- Actually, states usually correspond to triphones
- CHEESE (phones): ch iy z
- CHEESE (triphones): ch+iy, ch-iy+z, iy-z
49. Pieces of the HMM
- Actually, I lied again when I said states correspond to triphones
- In fact, each triphone has 3 states, for the beginning, middle, and end of the triphone
50. A real HMM
51. Cross-word triphones
- Word-Internal Context-Dependent Models
- OUR LIST:
- SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
- Cross-Word Context-Dependent Models
- OUR LIST:
- SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
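The cross-word expansion can be sketched as a small function, assuming the conventional left-ctx - phone + right-ctx triphone notation:

```python
def to_triphones(phones):
    """Expand a phone sequence into cross-word-style triphones,
    left-ctx - phone + right-ctx, padding with SIL at the edges."""
    padded = ["SIL"] + phones + ["SIL"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["AA", "R", "L", "IH", "S", "T"]))  # OUR LIST
# ['SIL-AA+R', 'AA-R+L', 'R-L+IH', 'L-IH+S', 'IH-S+T', 'S-T+SIL']
```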
52. Summary
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system
- Feature Extraction
- 39 MFCC features
- Acoustic Model
- Gaussians for computing p(o | q)
- Lexicon/Pronunciation Model
- HMM
- Next time: Decoding: how to combine these to compute words from speech!