Title: LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing
1. LING 138/238 / SYMBSYS 138: Intro to Computer Speech and Language Processing
- Lecture 9: Speech Recognition (I)
- October 26, 2004
- Dan Jurafsky
2. Outline for ASR this week
- Acoustic Phonetics
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system
- Feature Extraction
- Acoustic Model
- Lexicon/Pronunciation Model
- Decoder
- Language Model
- Evaluation
3. Acoustic Phonetics
- Sound Waves
- http://www.kettering.edu/drussell/Demos/waves-intro/waves-intro.html
- http://www.kettering.edu/drussell/Demos/waves/Lwave.gif
4. Waveforms for speech
- Waveform of the vowel iy
- Frequency: repetitions/second of a wave
- Above vowel has 28 reps in .11 secs
- So freq is 28/.11 ≈ 255 Hz
- This is the speed at which the vocal folds move, hence voicing
- Amplitude (y-axis): amount of air pressure at that point in time
- Zero is normal air pressure; negative is rarefaction
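The frequency estimate above is just cycles divided by duration; a minimal sketch using the slide's own counts:

```python
# Estimate fundamental frequency by counting waveform repetitions,
# using the numbers from the slide: 28 cycles in 0.11 seconds.
cycles = 28
duration_s = 0.11
f0_hz = cycles / duration_s  # rate of vocal-fold vibration (voicing)
print(round(f0_hz))  # 255
```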
5. She just had a baby
- What can we learn from a wavefile?
- Vowels are voiced, long, loud
- Length in time = length in space in the waveform picture
- Voicing: regular peaks in amplitude
- When stops are closed: no peaks, silence
- Peaks = voicing: .46 to .58 (vowel iy), from .65 to .74 (vowel ax), and so on
- Silence of stop closure (1.06 to 1.08 for first b, 1.26 to 1.28 for second b)
- Fricatives like sh: intense irregular pattern, see .33 to .46
6. Examples from Ladefoged
pad
bad
spat
7. Spectra
- New idea: spectra (singular: spectrum)
- A different way to view a waveform
- Fourier analysis: every wave can be represented as a sum of many simple waves of different frequencies
- Articulatory facts:
- The vocal cord vibrations create harmonics
- The mouth is an amplifier
- Depending on the shape of the mouth, some harmonics are amplified more than others
8. Part of ae waveform from "had"
- Note the complex wave repeating nine times in the figure
- Plus a smaller wave which repeats 4 times for every large pattern
- Large wave has a frequency of 250 Hz (9 times in .036 seconds)
- Small wave is roughly 4 times this, or roughly 1000 Hz
- Two little tiny waves on top of the peaks of the 1000 Hz waves
9. A spectrum
- A spectrum represents these frequency components
- Computed by the Fourier transform, an algorithm which separates out each frequency component of a wave
- x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
- Peaks at 930 Hz, 1860 Hz, and 3020 Hz
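The Fourier analysis described here can be sketched with a naive DFT (a toy illustration, not the FFT a real system would use). The test signal below, a 250 Hz wave plus a weaker 1000 Hz wave, echoes the "had" example but is invented for this sketch:

```python
import math

def dft_magnitudes(signal, sample_rate):
    """Naive DFT: magnitude of each frequency bin up to the Nyquist limit."""
    n = len(signal)
    mags = []
    for k in range(n // 2):  # bin k corresponds to k * sample_rate / n Hz
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(-signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

# A strong 250 Hz component plus a weaker 1000 Hz component
sr, n = 8000, 800
sig = [math.sin(2 * math.pi * 250 * t / sr)
       + 0.3 * math.sin(2 * math.pi * 1000 * t / sr) for t in range(n)]
mags = dft_magnitudes(sig, sr)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(peak_bin * sr / n)  # 250.0 -- the strongest frequency component
```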
10. Spectrogram
11. Formants
- Vowels are largely distinguished by 2 characteristic pitches.
- One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u (whisper iy eh uw)
- The other goes up for the first four vowels and then down for the next four.
- creaky voice: iy ih eh ae (goes up)
- creaky voice: aa ow uh uw (goes down)
- These are called the "formants" of the vowels; the lower is the 1st formant, the higher the 2nd formant.
12. How formants are produced
- Q: why do vowels have different pitches if the vocal cords vibrate at the same rate?
- A: the mouth as "amplifier" amplifies different frequencies
- Formants are the result of different shapes of the vocal tract.
- Any body of air will vibrate in a way that depends on its size and shape. The air in the vocal tract is set in vibration by the action of the vocal cords. Every time the vocal cords open and close, a pulse of air from the lungs acts like a sharp tap on the air in the vocal tract,
- setting the resonating cavities into vibration so that they produce a number of different frequencies.
13. How to read spectrograms
- bab: closure of lips lowers all formants, so there is a rapid increase in all formants at the beginning of "bab"
- dad: first formant increases, but F2 and F3 fall slightly
- gag: F2 and F3 come together; this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
14. She came back and started again
- 1. lots of high-freq energy
- 3. closure for k
- 4. burst of aspiration for k
- 5. ey vowel; faint 1100 Hz formant is nasalization
- 6. bilabial nasal
- 7. short b closure, voicing barely visible
- 8. ae; note upward transitions after bilabial stop at beginning
- 9. note F2 and F3 coming together for "k"
15. Spectrogram for "She just had a baby"
16. Perceptual properties
- Pitch: perceptual correlate of frequency
- Loudness: perceptual correlate of power, which is related to the square of amplitude
17. Speech Recognition
- Applications of Speech Recognition (ASR)
- Dictation
- Telephone-based Information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Speaker Identification
- Language Identification
- Second language ('L2') (accent reduction)
- Audio archive searching
18. LVCSR
- Large Vocabulary Continuous Speech Recognition
- 20,000-64,000 words
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)
19. LVCSR Design Intuition
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search
20. Speech Recognition Architecture
21. The Noisy Channel Model
- Search through space of all possible sentences.
- Pick the one that is most probable given the
waveform.
22. The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L given some acoustic input O?
- Treat acoustic input O as a sequence of individual observations
- O = o1, o2, o3, …, ot
- Define a sentence as a sequence of words
- W = w1, w2, w3, …, wn
23. Noisy Channel Model (III)
- Probabilistic implication: pick the highest-probability sentence:
- Ŵ = argmax_{W in L} P(W | O)
- We can use Bayes' rule to rewrite this:
- Ŵ = argmax_{W in L} P(O | W) P(W) / P(O)
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
- Ŵ = argmax_{W in L} P(O | W) P(W)
24. A quick derivation of Bayes' Rule
- Conditionals: P(A | B) = P(A, B) / P(B) and P(B | A) = P(A, B) / P(A)
- Rearranging: P(A, B) = P(A | B) P(B)
- And also: P(A, B) = P(B | A) P(A)
25. Bayes (II)
- We know: P(A | B) P(B) = P(B | A) P(A)
- So rearranging things: P(A | B) = P(B | A) P(A) / P(B)
26. Noisy channel model
- Ŵ = argmax_{W in L} P(O | W) P(W), where P(O | W) is the likelihood and P(W) is the prior
27. The noisy channel model
- Ignoring the denominator leaves us with two factors: P(Source) and P(Signal | Source)
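The two remaining factors can be illustrated with a toy decoder; the candidate sentences and their probabilities below are invented for illustration, not outputs of any real acoustic or language model:

```python
# Noisy-channel decoding sketch: pick the sentence W maximizing
# P(O|W) * P(W).  All numbers are made-up toy values.
candidates = {
    "recognize speech":   {"likelihood": 0.40, "prior": 0.010},
    "wreck a nice beach": {"likelihood": 0.45, "prior": 0.001},
}
best = max(candidates,
           key=lambda w: candidates[w]["likelihood"] * candidates[w]["prior"])
print(best)  # recognize speech: the higher prior wins despite a lower likelihood
```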
28. Five easy pieces
- Feature extraction
- Acoustic Modeling
- HMMs, Lexicons, and Pronunciation
- Decoding
- Language Modeling
29. Feature Extraction
- Digitize Speech
- Extract Frames
30. Digitizing Speech
31. Digitizing Speech (A-D)
- Sampling:
- measuring the amplitude of the signal at time t
- 16,000 Hz (samples/sec): Microphone (Wideband)
- 8,000 Hz (samples/sec): Telephone
- Why?
- Need at least 2 samples per cycle
- max measurable frequency is half the sampling rate
- Human speech < 10,000 Hz, so need max 20K
- Telephone filtered at 4K, so 8K is enough
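The 2-samples-per-cycle requirement can be demonstrated directly: a tone above half the sampling rate yields exactly the same samples as a lower-frequency alias. The tone frequencies below are made up for this self-contained sketch:

```python
import math

def sample(freq_hz, rate_hz, n):
    """Sample a sine tone of freq_hz at rate_hz, rounded for comparison."""
    return [round(math.sin(2 * math.pi * freq_hz * t / rate_hz), 6)
            for t in range(n)]

# A 5000 Hz tone sampled at 8000 Hz is indistinguishable from a 3000 Hz
# tone (its alias, 8000 - 5000): the Nyquist limit is rate/2 = 4000 Hz.
s_hi = sample(5000, 8000, 16)
s_lo = [-x for x in sample(3000, 8000, 16)]
print(s_hi == s_lo)  # True
```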
32. Digitizing Speech (II)
- Quantization
- Representing the real value of each amplitude as an integer
- 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
- Formats:
- 16-bit PCM
- 8-bit mu-law; log compression
- LSB (Intel) vs. MSB (Sun, Apple)
- Headers:
- Raw (no header)
- Microsoft wav
- Sun .au (40 byte header)
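The 8-bit mu-law format mentioned above uses logarithmic companding; here is a sketch of the standard mu-law compression curve (mu = 255), shown before the final 8-bit coding step:

```python
import math

MU = 255  # the standard mu for 8-bit mu-law

def mu_law_compress(x):
    """Map a sample in [-1, 1] to [-1, 1] with logarithmic companding."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

# Small amplitudes get much finer resolution than large ones:
print(round(mu_law_compress(0.01), 3))  # 0.228: 1% of full scale uses ~23% of the code range
print(round(mu_law_compress(0.5), 3))   # 0.876
```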
33. Frame Extraction
- A frame (25 ms wide) is extracted every 10 ms
- (Figure from Simon Arnfield: overlapping 25 ms frames spaced 10 ms apart, yielding one feature vector a1, a2, a3, … per frame)
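The 25 ms / 10 ms windowing above can be sketched as a slicing function (the one-second, 16 kHz input is an invented example):

```python
def extract_frames(samples, rate_hz, frame_ms=25, step_ms=10):
    """Slice a sampled signal into overlapping 25 ms frames every 10 ms."""
    frame_len = rate_hz * frame_ms // 1000
    step = rate_hz * step_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step)]

# One second at 16 kHz: 400-sample frames, one every 160 samples
frames = extract_frames(list(range(16000)), 16000)
print(len(frames), len(frames[0]))  # 98 400
```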
34. MFCC (Mel Frequency Cepstral Coefficients)
- Do FFT to get spectral information
- Like the spectrogram/spectrum we saw earlier
- Apply Mel scaling
- Linear below 1 kHz, log above; equal samples above and below 1 kHz
- Models the human ear: more sensitivity in lower freqs
- Plus Discrete Cosine Transformation
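Mel scaling can be sketched with one common formula for the mel scale; the particular constants (2595, 700) are an assumption, since the slide gives only the linear-below-1 kHz, log-above shape:

```python
import math

def hz_to_mel(f_hz):
    """A common mel-scale formula: roughly linear below 1 kHz, log above,
    matching the ear's greater sensitivity at low frequencies."""
    return 2595 * math.log10(1 + f_hz / 700)

# Equal mel steps cover narrow bands at low frequencies, wide ones up high:
print(round(hz_to_mel(1000)))  # 1000
print(round(hz_to_mel(4000)))  # 2146: 3 kHz more of audio, only ~1146 more mels
```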
35. Final Feature Vector
- 39 features per 10 ms frame:
- 12 MFCC features
- 12 Delta MFCC features
- 12 Delta-Delta MFCC features
- 1 (log) frame energy
- 1 Delta (log) frame energy
- 1 Delta-Delta (log) frame energy
- So each frame is represented by a 39D vector
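The delta and delta-delta features are frame-to-frame derivatives; here is a minimal sketch using simple first differences (real front ends usually fit a regression over several frames, so this is an assumption):

```python
def deltas(feature_track):
    """First differences of one per-frame feature: a simple version of the
    delta features appended to the 12 MFCCs and the energy term."""
    return [feature_track[i + 1] - feature_track[i]
            for i in range(len(feature_track) - 1)]

energy = [1.0, 1.5, 2.5, 2.0]  # toy per-frame log energies
d = deltas(energy)             # delta energy
dd = deltas(d)                 # delta-delta energy
print(d, dd)  # [0.5, 1.0, -0.5] [0.5, -1.5]
```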
36. Acoustic Modeling
- Given a 39-d vector corresponding to the observation of one frame, oi
- And given a phone q we want to detect
- Compute p(oi | q)
- Most popular method:
- GMM (Gaussian mixture models)
- Other methods:
- MLP (multi-layer perceptron)
37. Acoustic Modeling: MLP computes p(q | o)
38. Gaussian Mixture Models
- Also called fully-continuous HMMs
- P(o | q) computed by a Gaussian
39. Gaussians for Acoustic Modeling
- A Gaussian is parameterized by a mean and a variance
- (Figure: P(o | q) plotted against o for different means; P(o | q) is highest at the mean and low far from the mean)
40. Training Gaussians
- A (single) Gaussian is characterized by a mean and a variance
- Imagine that we had some training data in which each phone was labeled
- We could just compute the mean and variance from the data
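Computing the mean and variance from labeled frames, then using them as p(o | q), can be sketched in one dimension (the training values are toy numbers):

```python
import math

def fit_gaussian(values):
    """Mean and (population) variance of the frames labeled with one phone."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def gaussian_pdf(o, mean, var):
    """p(o | q) under a single Gaussian."""
    return math.exp(-(o - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy 1-D "cepstral feature" values for frames labeled with one phone
mean, var = fit_gaussian([2.0, 4.0, 6.0])
print(mean)  # 4.0
print(gaussian_pdf(4.0, mean, var) > gaussian_pdf(8.0, mean, var))  # True: highest at the mean
```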
41. But we need 39 Gaussians, not 1!
- The observation o is really a vector of length 39
- So we need a vector of Gaussians
42. Actually, a mixture of Gaussians
- Each phone is modeled by a sum of different Gaussians
- Hence able to model complex facts about the data
- (Figure: mixture densities for Phone A and Phone B)
43. Gaussians: acoustic modeling
- Summary: each phone is represented by a GMM parameterized by:
- M mixture weights
- M mean vectors
- M covariance matrices
- Usually assume the covariance matrix is diagonal
- I.e. just keep a separate variance for each cepstral feature
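The GMM summarized above (weights, means, and variances per mixture) can be sketched in 1-D, with the diagonal covariance reduced to a scalar variance; all parameter values are invented:

```python
import math

def gaussian_pdf(o, mean, var):
    return math.exp(-(o - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_likelihood(o, weights, means, variances):
    """p(o | q) as a weighted sum of M Gaussians (M = 2 here)."""
    return sum(w * gaussian_pdf(o, m, v)
               for w, m, v in zip(weights, means, variances))

# A two-mode phone model that no single Gaussian could capture
w, m, v = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
near_mode = gmm_likelihood(2.0, w, m, v)
between = gmm_likelihood(0.0, w, m, v)
print(near_mode > between)  # True: the density peaks near each mode
```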
44. ASR Lexicon: Markov Models for pronunciation
45. The Hidden Markov Model
46. Formal definition of HMM
- States: a set of states Q = q1, q2, …, qN
- Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann
- Each aij represents P(j | i)
- Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation t
- Special non-emitting initial and final states
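The definition above can be sketched as a small data structure: states Q, transitions aij = P(j | i), and observation likelihoods bi(ot), with non-emitting start and end states. All numbers are invented for illustration, not trained values:

```python
# A toy HMM over two emitting states; the observation symbols and all
# probabilities are made up for this sketch.
hmm = {
    "states": ["start", "ih", "t", "end"],  # start/end are non-emitting
    "A": {("start", "ih"): 1.0, ("ih", "ih"): 0.6, ("ih", "t"): 0.4,
          ("t", "t"): 0.3, ("t", "end"): 0.7},
    "B": {"ih": lambda o: 0.8 if o == "high-F2" else 0.2,
          "t": lambda o: 0.9 if o == "burst" else 0.1},
}

# Score one path (start -> ih -> t -> end) against two observations
path_prob = (hmm["A"][("start", "ih")] * hmm["B"]["ih"]("high-F2")
             * hmm["A"][("ih", "t")] * hmm["B"]["t"]("burst")
             * hmm["A"][("t", "end")])
print(round(path_prob, 4))  # 0.2016
```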
47. Pieces of the HMM
- Observation likelihoods (b), p(o | q), represent the acoustics of each phone, and are computed by the Gaussians (Acoustic Model, or AM)
- Transition probabilities represent the probability of different pronunciations (different sequences of phones)
- States correspond to phones
48. Pieces of the HMM
- Actually, I lied when I said states correspond to phones
- Actually, states usually correspond to triphones
- CHEESE (phones): ch iy z
- CHEESE (triphones): ch+iy, ch-iy+z, iy-z
49. Pieces of the HMM
- Actually, I lied again when I said states correspond to triphones
- In fact, each triphone has 3 states, for the beginning, middle, and end of the triphone
50. A real HMM
51. Cross-word triphones
- Word-Internal Context-Dependent Models
- OUR LIST:
- SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
- Cross-Word Context-Dependent Models
- OUR LIST:
- SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
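The cross-word expansion can be sketched as a small function, assuming the conventional left-ctx - phone + right-ctx triphone notation:

```python
def to_triphones(phones):
    """Expand a phone sequence into cross-word-style triphones,
    left-ctx - phone + right-ctx, padding with SIL at the edges."""
    padded = ["SIL"] + phones + ["SIL"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["AA", "R", "L", "IH", "S", "T"]))  # OUR LIST
# ['SIL-AA+R', 'AA-R+L', 'R-L+IH', 'L-IH+S', 'IH-S+T', 'S-T+SIL']
```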
52. Summary
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system
- Feature Extraction
- 39 MFCC features
- Acoustic Model
- Gaussians for computing p(o | q)
- Lexicon/Pronunciation Model
- HMM
- Next time: Decoding: how to combine these to compute words from speech!