Title: ROBUST SPEECH RECOGNITION: Signal Processing for Speech Applications
1. ROBUST SPEECH RECOGNITION: Signal Processing for Speech Applications
- Richard Stern
- Robust Speech Recognition Group
- Carnegie Mellon University
- Telephone (412) 268-2535
- Fax (412) 268-3890
- rms@cs.cmu.edu
- http://www.cs.cmu.edu/rms
- Short Course at UNAM
- August 14-17, 2007
2. SIGNAL PROCESSING FOR SPEECH APPLICATIONS
- Major speech technologies
- Speech coding
- Speech synthesis
- Speech recognition
3. OVERVIEW OF SPEECH RECOGNITION
[Block diagram: feature extraction produces speech features; a decision-making procedure compares them to produce phoneme hypotheses]
- Major functional components
- Signal processing to extract features from speech waveforms
- Comparison of features to pre-stored templates
- Important design choices
- Choice of features
- Specific method of comparing features to stored templates
4. GOALS OF SPEECH REPRESENTATIONS
- Capture important phonetic information in speech
- Computational efficiency
- Efficiency in storage requirements
- Optimize generalization
5. GOALS OF THIS LECTURE
- We will describe and explain how feature extraction is accomplished for automatic speech recognition
- Some specific topics
- Sampling
- Linear predictive coding (LPC)
- LPC-derived cepstral coefficients (LPCC)
- Mel-frequency cepstral coefficients (MFCC)
- Some of the underlying mathematics
- Continuous-time Fourier transform (CTFT)
- Discrete-time Fourier transform (DTFT)
- Z-transforms
6. OUTLINE OF PRESENTATION
- Introduction
- Why perform signal processing?
- The source-filter model of speech production
- Sampling of continuous-time signals
- Digital filtering of signals
- Frequency representations in continuous and discrete time
- Feature extraction for speech recognition
7. WHY PERFORM SIGNAL PROCESSING?
- A look at the time-domain waveform of "six"
- It's hard to infer much from the time-domain waveform
8. WHY PERFORM SIGNAL PROCESSING IN THE FREQUENCY DOMAIN?
- Human hearing is based on frequency analysis
- Use of frequency analysis often simplifies signal processing
- Use of frequency analysis often facilitates understanding
9. THE SOURCE-FILTER MODEL FOR SPEECH
- A useful model for representing the generation of speech sounds
[Diagram: excitation p[n] drives a vocal-tract filter to produce the speech waveform]
10. Speech coding: separating the vocal-tract excitation and filter
- Original speech
- Speech with 75-Hz excitation
- Speech with 150-Hz excitation
- Speech with noise excitation
11. SAMPLING CONTINUOUS-TIME SIGNALS
- Original speech waveform and its samples
12. THE SAMPLING THEOREM
- Nyquist theorem: If the signal is sampled at a rate at least twice the maximum frequency of the incoming speech, we can recover the original waveform by lowpass filtering. With lower sampling frequencies, aliasing will occur, which produces distortion from which the original signal cannot be recovered.
[Diagram: speech wave multiplied by sampling pulse train, then passed through a lowpass filter to yield the recovered speech wave]
13. EFFECTS OF ALIASING
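Aliasing is easy to demonstrate numerically. The sketch below (my own illustration, not from the slides) samples a 5 kHz tone at 8 kHz, below its Nyquist rate, and shows that the samples are indistinguishable from those of a 3 kHz tone:

```python
import math

fs = 8000.0                      # sampling rate (Hz): too low for a 5 kHz tone
f_in, f_alias = 5000.0, 3000.0   # 5 kHz aliases down to fs - 5000 = 3000 Hz

n = range(32)
x_in    = [math.cos(2 * math.pi * f_in    * k / fs) for k in n]
x_alias = [math.cos(2 * math.pi * f_alias * k / fs) for k in n]

# The two sample sequences are identical: once sampled, the 5 kHz tone
# cannot be distinguished from a 3 kHz tone, so the original signal
# cannot be recovered by lowpass filtering.
err = max(abs(a - b) for a, b in zip(x_in, x_alias))
```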
14. Digital filtering of signals
[Diagram: x[n] → filter → y[n]]
- Filter 1: y[n] = 3.6 y[n-1] - 5.0 y[n-2] + 3.2 y[n-3] - 0.82 y[n-4] + 0.013 x[n] + 0.032 x[n-1] + 0.044 x[n-2] + 0.033 x[n-3] + 0.013 x[n-4]
- Filter 2: y[n] = -2.7 y[n-1] - 3.3 y[n-2] - 2.0 y[n-3] - 0.57 y[n-4] + 0.35 x[n] - 1.3 x[n-1] + 2.0 x[n-2] - 1.3 x[n-3] + 0.35 x[n-4]
15. Filter 1 in the time domain
16. Output of Filter 1 in the frequency domain
[Figure: original vs. lowpass-filtered spectrum]
17. Filter 2 in the time domain
18. Output of Filter 2 in the frequency domain
[Figure: original vs. highpass-filtered spectrum]
19. OUTLINE OF PRESENTATION
- Introduction
- Frequency representations in continuous and discrete time
- Continuous-time Fourier series (CTFS)
- Continuous-time Fourier transform (CTFT)
- Discrete-time Fourier transform (DTFT)
- Short-time Fourier analysis
- Effects of window size and shape
- Z-transforms
- Feature extraction for speech recognition
20. Frequency Representation: The Continuous-Time Fourier Series (CTFS)
- If x(t) is periodic with period T
- where
21. Frequency Representation: The Continuous-Time Fourier Series (CTFS)
- Comments
- The coefficients X_k are complex
- Alternate representation
- where
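The equations for these two slides were not recoverable, but both are standard. For reference, the CTFS analysis/synthesis pair, and the real (amplitude/phase) form behind the "alternate representation" bullet:

```latex
x(t) = \sum_{k=-\infty}^{\infty} X_k \, e^{j 2\pi k t / T},
\qquad
X_k = \frac{1}{T} \int_{T} x(t) \, e^{-j 2\pi k t / T} \, dt
```

```latex
x(t) = X_0 + \sum_{k=1}^{\infty} 2\,|X_k| \cos\!\left(\frac{2\pi k t}{T} + \angle X_k\right)
```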
22. Example of Fourier series synthesis: building a square wave from a sum of cosines
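A short numerical version of this slide's demonstration (my own sketch): summing the odd-harmonic cosine series for an even square wave that alternates between +1 and -1. The partial sum converges to +1 in the middle of the positive half-cycle:

```python
import math

def square_partial_sum(t, period=1.0, n_terms=50):
    """Partial Fourier sum for an even square wave alternating between +1 and -1.

    Uses the standard series (4/pi) * sum_k (-1)^k cos((2k+1) w0 t) / (2k+1).
    """
    w0 = 2 * math.pi / period
    return (4 / math.pi) * sum(
        (-1) ** k * math.cos((2 * k + 1) * w0 * t) / (2 * k + 1)
        for k in range(n_terms)
    )

center = square_partial_sum(0.0)   # middle of the +1 half-cycle
# More terms sharpen the edges, though the Gibbs overshoot near the
# discontinuities never disappears.
```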
23. Frequency Representation: The Continuous-Time Fourier Transform (CTFT)
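The transform pair for this slide was lost; for reference, the standard CTFT analysis/synthesis pair is:

```latex
X(j\Omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\Omega t}\, dt,
\qquad
x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(j\Omega)\, e^{j\Omega t}\, d\Omega
```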
24. Frequency Representation: The Discrete-Time Fourier Transform (DTFT)
- Comment: DTFTs are always periodic in frequency, so typically only frequencies from -π to π are used.
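For reference, the standard DTFT pair (the periodicity noted above follows because e^{-jωn} is 2π-periodic in ω):

```latex
X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n},
\qquad
x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega
```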
25. EXAMPLES OF CTFTs and DTFTs
- Continuous-time decaying exponential
26. EXAMPLES OF CTFTs and DTFTs
- Discrete-time decaying exponential
27. CORRESPONDENCE BETWEEN FREQUENCY IN DISCRETE AND CONTINUOUS TIME
- Suppose that a continuous-time signal is sampled at a rate greater than the Nyquist rate, with a time between samples of T
- Let Ω represent continuous-time frequency in radians/sec
- Let ω represent discrete-time frequency in radians
- Then ω = ΩT
- Comment: The maximum discrete-time frequency, ω = π, corresponds to the Nyquist frequency in continuous time, half the sampling rate
28. SHORT-TIME FOURIER ANALYSIS
- Problem: Conventional Fourier analysis does not capture the time-varying nature of speech signals
- Solution: Multiply signals by a finite-duration window function, then compute the DTFT
- Side effect: Windowing causes spectral blurring
29. TWO POPULAR WINDOW SHAPES
- Rectangular window
- Hamming window
- Comment: The Hamming window is frequently preferred because of its frequency response
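The two window shapes can be sketched in a few lines; the Hamming formula below is the standard one (the slide's own equation was not recoverable):

```python
import math

def hamming(N):
    """Hamming window of length N: w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def rectangular(N):
    """Rectangular window: no tapering at all."""
    return [1.0] * N

w = hamming(321)   # odd length so the peak lands exactly on a sample
# The Hamming window tapers to 0.08 at its endpoints instead of cutting
# off abruptly, which greatly lowers the sidelobes of its frequency
# response at the cost of a slightly wider mainlobe.
```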
30. TIME AND FREQUENCY RESPONSE OF WINDOWS
31. EFFECT OF WINDOW DURATION
[Figure: spectra computed with a short-duration window vs. a long-duration window]
32. BREAKING UP THE INCOMING SPEECH INTO FRAMES USING HAMMING WINDOWS
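The framing shown on this slide (and quantified later in the deck as 320-sample windows with 50% overlap at 16 kHz) can be sketched as:

```python
import math

def frame_signal(x, frame_len=320, hop=160):
    """Split x into overlapping frames and apply a Hamming window to each.

    Defaults match the analysis parameters described later in these slides:
    320 samples (20 ms at 16 kHz) with 50% overlap (a 160-sample hop).
    """
    win = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
           for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * win[n] for n in range(frame_len)])
    return frames

frames = frame_signal([0.0] * 1600)   # 100 ms of silence -> 9 frames
```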
33. Generalizing the DTFT: the Z-transform
- Discrete-Time Fourier Transform (DTFT): X(e^{jω}) = Σ_n x[n] e^{-jωn}
- Z-transform: X(z) = Σ_n x[n] z^{-n}
- The DTFT is the Z-transform evaluated on the unit circle, z = e^{jω}
34. WHY DO WE USE Z-TRANSFORMS?
- Z-transforms enable us to relate difference equations to the frequency response of linear systems
- Z-transforms facilitate the design of discrete-time systems
- Z-transforms provide insight into linear prediction
35. CHARACTERIZING LINEAR SHIFT-INVARIANT SYSTEMS BY DIFFERENCE EQUATIONS
- A simple example ...
- Difference equation characterizing the system
[Diagram: x[n] → LSI system with impulse response h[n] → y[n]]
36. TIME-DOMAIN AND FREQUENCY-DOMAIN CHARACTERIZATION OF SYSTEMS
- Let h[n] be the response of a system to the unit impulse function and let H(z) be the Z-transform of h[n]
- Then y[n] = x[n] * h[n] (convolution in the time domain)
- and Y(z) = X(z) H(z) (multiplication in the z-domain)
[Diagram: x[n] → LSI system with impulse response h[n] → y[n]]
37. RELATING Z-TRANSFORMS TO DIFFERENCE EQUATIONS
- Z-transform delay property: if x[n] ↔ X(z), then x[n-k] ↔ z^{-k} X(z)
[Diagram: x[n] → LSI system with impulse response h[n] → y[n]]
38. RELATING Z-TRANSFORMS TO DIFFERENCE EQUATIONS
- Difference equation characterizing the system
- Z-transform characterizing the system
[Diagram: x[n] → LSI system with impulse response h[n] → y[n]]
39. POLES AND ZEROS
- Poles and zeros are the roots of the denominator and numerator of LSI system functions
- Zeros of the system are at z = 0, z = 1
- Poles of the system are at z = 0.9e^{jπ/4}, z = 0.9e^{-jπ/4}
40. MAGNITUDE OF THE DTFT FROM POLE AND ZERO LOCATIONS
- To evaluate the magnitude of the DTFT, consider the locus of points in the z-plane corresponding to the unit circle
- For each location on the unit circle, the magnitude of the DTFT is proportional to the product of the distances from the zeros divided by the product of the distances from the poles
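The geometric rule above can be checked numerically with the pole-zero example from the previous slide (zeros at z = 0 and z = 1, poles at 0.9e^{±jπ/4}); this is my own sketch of the distance-product evaluation:

```python
import cmath, math

# Zeros at z = 0 and z = 1; poles at z = 0.9 e^{+/- j*pi/4}.
zeros = [0, 1]
poles = [0.9 * cmath.exp(1j * math.pi / 4),
         0.9 * cmath.exp(-1j * math.pi / 4)]

def dtft_magnitude(w):
    """|H(e^{jw})| as product of distances to zeros over distances to poles."""
    z = cmath.exp(1j * w)
    num = math.prod(abs(z - q) for q in zeros)
    den = math.prod(abs(z - p) for p in poles)
    return num / den

# Walking the unit circle: the response peaks near w = pi/4 (the pole
# angle, where the pole distance is smallest) and is small near w = 0
# (on top of the zero at z = 1).
peak = dtft_magnitude(math.pi / 4)
away = dtft_magnitude(math.pi)
```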
41. Inferring the Magnitude of the DTFT from the Z-transform
- The magnitude of the DTFT is obtained from the magnitude of H(z) as we walk around the unit circle of the z-plane
42. SUMMARY OF Z-TRANSFORM DISCUSSION
- Z-transforms are a generalization of DTFTs
- Difference equations can be easily obtained from Z-transforms
- Locations of poles and zeros in the z-plane provide insight about frequency response
43. OUTLINE OF PRESENTATION
- Introduction
- Frequency representations in continuous and discrete time
- Feature extraction for speech recognition
- Linear prediction (LPC)
- Representations based on cepstral coefficients
- LPCC: linear prediction-based cepstral coefficients
- MFCC: Mel-frequency cepstral coefficients
- Perceptual linear prediction (PLP)
44. LINEAR PREDICTION OF SPEECH
- Find the best all-pole approximation to the DTFT of a segment of speech
- An all-pole model is reasonable for most speech
- Very efficient in terms of data storage
- The coefficients a_k can be computed efficiently
- Phase information is not preserved (not a problem for us)
45. LINEAR PREDICTION EXAMPLE
- Spectra from the /ih/ in "six"
- Comment: The LPC spectrum follows the spectral peaks well
46. FEATURES FOR SPEECH RECOGNITION: CEPSTRAL COEFFICIENTS
- The cepstrum is the inverse transform of the log of the magnitude of the spectrum
- Useful for separating convolved signals (like the source and filter in the speech production model)
- Can be thought of as the Fourier series expansion of the magnitude of the Fourier transform
- Generally provides more efficient and robust coding of speech information than LPC coefficients
- The most common basic feature for speech recognition
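The definition in the first bullet translates directly into code. The toy signal and the naive O(N²) DFT below are my own illustration, adequate only for short sequences:

```python
import cmath, math

def dft(x, inverse=False):
    """Naive DFT/IDFT, adequate for a short illustration (O(N^2))."""
    N, sign = len(x), (1 if inverse else -1)
    out = [sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / N)
               for n in range(N)) for k in range(N)]
    return [v / N for v in out] if inverse else out

def real_cepstrum(x):
    """c[n] = inverse transform of log |DFT(x)| -- the slide's definition."""
    spec = dft(x)
    log_mag = [math.log(abs(v) + 1e-12) for v in spec]  # floor avoids log(0)
    return [v.real for v in dft(log_mag, inverse=True)]

# Toy signal: a decaying sinusoid standing in for a short speech segment.
x = [math.sin(2 * math.pi * 3 * n / 64) * math.exp(-n / 40) for n in range(64)]
c = real_cepstrum(x)
# c[0] is the average log magnitude; low-order coefficients describe the
# smooth spectral envelope (filter), higher-order ones the fine structure
# (source) -- which is why the cepstrum separates convolved signals.
```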
47. TWO WAYS OF DERIVING CEPSTRAL COEFFICIENTS
- LPC-derived cepstral coefficients (LPCC)
- Compute traditional LPC coefficients
- Convert to cepstra using a linear transformation
- Warp cepstra using the bilinear transform
- Mel-frequency cepstral coefficients (MFCC)
- Compute the log magnitude spectrum of the windowed signal
- Multiply by triangular Mel weighting functions
- Compute the inverse discrete cosine transform
48. COMPUTING CEPSTRAL COEFFICIENTS
- Comments
- MFCC is currently the most popular representation
- Typical systems include a combination of
- MFCC coefficients
- Delta MFCC coefficients
- Delta-delta MFCC coefficients
- Power and delta power coefficients
49. COMPUTING LPC CEPSTRAL COEFFICIENTS
- Procedure used in SPHINX-I
- A/D conversion at 16-kHz sampling rate
- Apply Hamming window of duration 320 samples (20 msec) with 50% overlap (100-Hz frame rate)
- Pre-emphasize to boost high-frequency components
- Compute the first 14 autocorrelation coefficients
- Perform the Levinson-Durbin recursion to obtain 14 LPC coefficients
- Convert LPC coefficients to cepstral coefficients
- Perform frequency warping to spread low frequencies
- Apply vector quantization to generate three codebooks
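The autocorrelation-to-LPC step in the procedure above can be sketched with a textbook Levinson-Durbin recursion. This is a simplified illustration, not the SPHINX-I implementation; the AR(2) check at the end uses an exact Yule-Walker autocorrelation sequence so the recursion recovers the true coefficients:

```python
def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelations r[0..order].

    Returns a such that x[n] is predicted as sum_k a[k] * x[n-1-k].
    """
    a = []
    err = r[0]                       # prediction error power, starts at r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))) / err
        # Update previous coefficients and append the new one.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        err *= (1 - k * k)
    return a

# Check on an exact AR(2) autocorrelation sequence with a1 = 0.9, a2 = -0.5:
# Yule-Walker gives r1 = a1/(1 - a2) and r2 = a1*r1 + a2*r0.
r = [1.0, 0.9 / 1.5, 0.9 * (0.9 / 1.5) - 0.5]
a = levinson_durbin(r, 2)   # recovers [0.9, -0.5]
```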
50. An example: the vowel in "welcome"
- The original time function
51. THE TIME FUNCTION AFTER WINDOWING
52. THE RAW SPECTRUM
53. PRE-EMPHASIZING THE SIGNAL
- Typical pre-emphasis filter
- Its frequency response
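The slide's pre-emphasis equation was not recoverable; a common first-difference form is y[n] = x[n] - α x[n-1], and the α = 0.97 below is a typical value rather than the one from the original slide:

```python
import cmath, math

# Assumed coefficient: 0.97 is a common choice in speech front ends;
# the exact value on the original slide is not recoverable.
alpha = 0.97

def preemphasize(x):
    """y[n] = x[n] - alpha * x[n-1]: boosts high-frequency components."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def gain(w):
    """Magnitude response |1 - alpha * e^{-jw}| of the pre-emphasis filter."""
    return abs(1 - alpha * cmath.exp(-1j * w))

low, high = gain(0.0), gain(math.pi)   # 0.03 at DC vs. 1.97 at Nyquist
```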
54. THE SPECTRUM OF THE PRE-EMPHASIZED SIGNAL
55. THE LPC SPECTRUM
56. THE TRANSFORM OF THE CEPSTRAL COEFFICIENTS
57. THE BIG PICTURE: THE ORIGINAL SPECTROGRAM
58. EFFECTS OF LPC PROCESSING
59. COMPARING REPRESENTATIONS
[Figure: spectrograms of the original speech vs. LPCC-derived cepstra (unwarped)]
60. COMPUTING MEL FREQUENCY CEPSTRAL COEFFICIENTS
- Segment the incoming waveform into frames
- Compute the frequency response for each frame using DFTs
- Group the magnitude of the frequency response into 25-40 channels using triangular weighting functions
- Compute the log of the weighted magnitudes for each channel
- Take the inverse DCT/DFT of the weighted magnitudes, producing 14 cepstral coefficients for each frame
- Calculate delta and double-delta coefficients
61. AN EXAMPLE: DERIVING MFCC COEFFICIENTS
62. WEIGHTING THE FREQUENCY RESPONSE
63. THE ACTUAL MEL WEIGHTING FUNCTIONS
64. THE LOG ENERGIES OF THE MEL FILTER OUTPUTS
65. THE CEPSTRAL COEFFICIENTS
66. LOG SPECTRA RECOVERED FROM CEPSTRA
67. COMPARING SPECTRAL REPRESENTATIONS
[Figure: spectrograms of the original speech vs. mel log magnitudes after cepstral smoothing]
68. SUMMARY
- We outlined some of the relevant issues associated with the representation of speech for automatic recognition
- The source-filter model of speech production
- Sampling
- Windowing
- The Discrete-Time Fourier Transform (DTFT)
- Concepts of frequency in continuous time and discrete time
- The Z-transform
- Linear prediction / linear predictive coding (LPC)
- Mel-frequency cepstral coefficients (MFCC)