Title: Speech Signal Representations I
1 Speech Signal Representations I
- Seminar Speech Recognition 2002
- F.R. Verhage
2 Speech Signal Representations I
- Decomposition of the speech signal x[n] as a source e[n] passed through a linear time-varying filter h[n].
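Within a short analysis interval, where the filter can be treated as fixed, this decomposition can be written as a convolution. The following is a sketch in the slide's own symbols; the summation index k is introduced here for illustration only.

```latex
x[n] \;=\; e[n] * h[n] \;=\; \sum_{k} h[k]\, e[n-k]
```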
3 Speech Signal Representations I
- Estimation of the filter, inspired by
  - Speech production models
    - Linear Predictive Coding (LPC)
    - Cepstral analysis
  - Speech perception models (part II)
    - Mel-frequency cepstrum
    - Perceptual Linear Prediction (PLP)
- Speech recognizers estimate filter characteristics and ignore the source
4 Speech Signal Representations I: Short-Time Fourier Analysis
- Spectrogram
  - Representation of a signal highlighting several of its properties based on short-time Fourier analysis
  - Two dimensional: time horizontal and frequency vertical
  - Third dimension: gray or color level indicating energy
5 Speech Signal Representations I: Short-Time Fourier Analysis
- Spectrogram
  - Narrow band
    - Long windows (> 20 ms) → narrow bandwidth
    - Lower time resolution, better frequency resolution
  - Wide band
    - Short windows (< 10 ms) → wide bandwidth
    - Good time resolution, lower frequency resolution
  - Pitch synchronous
    - Requires knowledge of local pitch period
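A spectrogram as described on these slides can be sketched as a sequence of windowed power spectra, with the window length deciding whether it behaves as narrow band or wide band. The code below is a minimal pure-Python illustration, not a reference implementation: the function names, the 8 kHz sampling rate and the two-tone test signal are all assumptions made for the example, and a library FFT (for instance `numpy.fft.rfft`) would normally replace the direct DFT.

```python
import cmath
import math

def hamming(n_len):
    """Symmetric Hamming window of n_len samples (n_len > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1)) for n in range(n_len)]

def dft_power(frame):
    """Power spectrum |X[k]|^2 for k = 0..N/2 using a direct O(N^2) DFT."""
    n_len = len(frame)
    out = []
    for k in range(n_len // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_len) for n, x in enumerate(frame))
        out.append(abs(acc) ** 2)
    return out

def spectrogram(signal, fs, win_ms, hop_ms):
    """Return one power spectrum per analysis frame (time index first, frequency second)."""
    win_len = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    w = hamming(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = [signal[start + n] * w[n] for n in range(win_len)]
        frames.append(dft_power(frame))
    return frames

fs = 8000  # assumed sampling rate in Hz
# Hypothetical test signal: two sinusoids 100 Hz apart, 0.1 s long.
sig = [math.sin(2 * math.pi * 500 * n / fs) + math.sin(2 * math.pi * 600 * n / fs)
       for n in range(800)]

narrow = spectrogram(sig, fs, win_ms=40, hop_ms=10)   # long window: fine frequency grid
wide = spectrogram(sig, fs, win_ms=5, hop_ms=2.5)     # short window: many frames, coarse grid
```

With the 40 ms window the two tones fall on separate frequency bins with a dip between them, while the 5 ms window yields far more frames but only a 200 Hz bin spacing, so the tones merge into one broad peak.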
7 Speech Signal Representations I: Short-Time Fourier Analysis
- Window analysis
  - Series of short segments, analysis frames
    - Short enough so that the signal is stationary
    - Usually constant, 20-30 ms
    - Overlaps possible
  - Different types of window functions (w_m[n])
    - Rectangular (equal to no window function)
    - Hamming
    - Hanning
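The framing step can be sketched as follows: cut the signal into fixed-length, overlapping frames and multiply each by a window function. The 16 kHz sampling rate, the 25 ms frame, the 10 ms hop and the toy sinusoid are assumptions chosen for illustration; the window formulas are the usual symmetric definitions.

```python
import math

def window(kind, n_len):
    """Return w[n] for n = 0..n_len-1; kind is 'rectangular', 'hamming' or 'hanning'."""
    if kind == "rectangular":
        return [1.0] * n_len          # same as applying no window at all
    a = {"hamming": 0.54, "hanning": 0.5}[kind]
    return [a - (1.0 - a) * math.cos(2 * math.pi * n / (n_len - 1)) for n in range(n_len)]

def frames(signal, frame_len, hop):
    """Split signal into analysis frames of frame_len samples, hop samples apart."""
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop)]

fs = 16000                      # assumed sampling rate (Hz)
frame_len = fs * 25 // 1000     # 25 ms frame, inside the 20-30 ms range
hop = fs * 10 // 1000           # 10 ms hop, so consecutive frames overlap
signal = [math.sin(2 * math.pi * 110 * n / fs) for n in range(fs // 10)]  # 100 ms toy signal

w = window("hamming", frame_len)
windowed = [[x * wn for x, wn in zip(f, w)] for f in frames(signal, frame_len, hop)]
```

Each entry of `windowed` is one analysis frame ready for a short-time Fourier transform.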
8 Speech Signal Representations I: Short-Time Fourier Analysis
- Window analysis
  - Window size N must be long enough relative to the pitch period M
    - Rectangular: N ≥ M
    - Hamming, Hanning: N ≥ 2M
  - Pitch period not known in advance → prepare for the lowest pitch → at least 20 ms for rectangular or 40 ms for Hamming/Hanning (50 Hz)
  - But longer windows give a more averaged spectrum instead of distinct spectra → rectangular window has better time resolution
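The 20 ms and 40 ms figures follow directly from the lowest expected pitch. A small sketch of that arithmetic, assuming the rule of thumb on this slide (one pitch period for a rectangular window, two for Hamming or Hanning); the function name is hypothetical:

```python
def min_window_ms(lowest_f0_hz, kind):
    """Minimum window length in ms for the lowest pitch that must be handled."""
    period_ms = 1000.0 / lowest_f0_hz          # longest pitch period to cover
    factor = 1 if kind == "rectangular" else 2  # tapered windows need two periods
    return factor * period_ms

rect_ms = min_window_ms(50, "rectangular")
hamming_ms = min_window_ms(50, "hamming")
```

For a lowest pitch of 50 Hz the pitch period is 20 ms, which gives the two values quoted on the slide.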
16 Speech Signal Representations I: Short-Time Fourier Analysis
- Window analysis
  - Frequency response not completely zero outside main lobe → spectral leakage
  - Second lobe of a Hamming window is approx. 43 dB below main lobe → less spectral leakage
  - Hamming, Hanning, triangular windows offer less spectral leakage → rectangular windows are rarely used despite their better time resolution
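The leakage difference can be checked numerically by sampling the frequency response of each window and reading off the highest side lobe relative to the main-lobe peak. This is a rough sketch under assumed settings (64-sample windows, 512 frequency points, hypothetical function names); the exact dB values depend on window length and frequency grid.

```python
import cmath
import math

def hamming(n_len):
    """Symmetric Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1)) for n in range(n_len)]

def response_db(w, n_points):
    """Window frequency response in dB relative to its value at omega = 0, on [0, pi)."""
    peak = abs(sum(w))
    mags = []
    for k in range(n_points):
        omega = math.pi * k / n_points
        val = abs(sum(x * cmath.exp(-1j * omega * n) for n, x in enumerate(w)))
        mags.append(20 * math.log10(max(val, 1e-12) / peak))  # floor avoids log(0) at nulls
    return mags

def peak_sidelobe_db(w, n_points=512):
    """Highest side-lobe level in dB below the main-lobe peak."""
    mags = response_db(w, n_points)
    k = 1
    while k < n_points - 1 and mags[k] < mags[k - 1]:  # walk down the main lobe
        k += 1
    return max(mags[k:])                                # everything past the first minimum

rect_db = peak_sidelobe_db([1.0] * 64)
hamming_db = peak_sidelobe_db(hamming(64))
```

The rectangular window should come out only a little over 13 dB down, while the Hamming window lands in the low-40s dB range, consistent with the figure quoted on the slide.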
21 Speech Signal Representations I: Short-Time Fourier Analysis
- Short-time spectrum of male voice speech
  - Time signal /ah/, local pitch 110 Hz
  - 30 ms rectangular window
  - 15 ms rectangular window
  - 30 ms Hamming window
  - 15 ms Hamming window
22 Speech Signal Representations I: Short-Time Fourier Analysis
- Short-time spectrum of female voice speech
  - Time signal /aa/, local pitch 200 Hz
  - 30 ms rectangular window
  - 15 ms rectangular window
  - 30 ms Hamming window
  - 15 ms Hamming window
23 Speech Signal Representations I: Short-Time Fourier Analysis
- Short-time spectrum of unvoiced speech
  - Time signal
  - 30 ms rectangular window
  - 15 ms rectangular window
  - 30 ms Hamming window
  - 15 ms Hamming window
24 Speech Signal Representations I: Linear Predictive Coding
- LPC a.k.a. auto-regressive (AR) modeling
  - All-pole filter is a good approximation of speech, with p as the order of the LPC analysis
  - Predicts current sample as a linear combination of the past p samples
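In symbols, the prediction and the corresponding all-pole filter can be sketched as below. The gain G and the sign convention (predictor coefficients a_k entering the denominator with a minus sign) are assumptions made here; other texts use the opposite sign.

```latex
\hat{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k], \qquad
e[n] = x[n] - \hat{x}[n], \qquad
H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}
```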
25 Speech Signal Representations I: Linear Predictive Coding
- To estimate predictor coefficients (a_k), use short-term analysis technique
  - Per segment, minimize the total squared prediction error
  - Take the derivative with respect to each coefficient and equate it to 0; the result is expressed as a set of p linear equations, the Yule-Walker equations
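The minimization step can be written out as follows. This is a sketch under the convention that the predicted sample is the sum of a_k x[n-k] for k = 1..p; the correlation symbol φ is introduced here for illustration and is not from the slides.

```latex
E = \sum_{n} \Bigl( x[n] - \sum_{k=1}^{p} a_k\, x[n-k] \Bigr)^{2},
\qquad
\frac{\partial E}{\partial a_i} = 0
\;\Longrightarrow\;
\sum_{k=1}^{p} a_k\, \phi(i,k) = \phi(i,0), \quad i = 1,\dots,p,
\qquad
\phi(i,k) = \sum_{n} x[n-i]\, x[n-k]
```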
26 Speech Signal Representations I: Linear Predictive Coding
- Solution of the Yule-Walker equations
  - Any standard matrix inversion package
  - Due to the special form of the matrix, efficient solutions exist
    - Covariance method: using the Cholesky decomposition
    - Autocorrelation method: using windows, results in equations with Toeplitz matrices, solved by the Durbin recursion algorithm
    - Lattice method: equivalent to Levinson-Durbin recursion; often used in fixed-point implementations because lack of precision doesn't result in unstable filters
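For the autocorrelation method, the Durbin recursion can be sketched in a few lines. The version below assumes the convention that the predicted sample is the sum of a[k] x[n-k], returns the reflection coefficients alongside the predictor coefficients, and uses a damped sinusoid as a made-up test frame; function and variable names are illustrative only.

```python
import math

def autocorr(x, p):
    """Autocorrelation values r[0..p] of one (already windowed) frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations for a[1..p].

    Returns (predictor coefficients, reflection coefficients, final prediction error).
    """
    a = [0.0] * (p + 1)            # a[0] is unused
    k_refl = []
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err              # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)       # error shrinks at every stage
        k_refl.append(k)
    return a[1:], k_refl, err

# Hypothetical frame: a damped sinusoid, analysed with order p = 2.
frame = [math.sin(0.3 * n) * 0.99 ** n for n in range(200)]
coeffs, refl, err = levinson_durbin(autocorr(frame, 2), 2)
```

The reflection coefficients produced along the way are the quantities the lattice formulation works with directly.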
29 Speech Signal Representations I: Linear Predictive Coding
- Spectral analysis via LPC
  - All-pole (IIR) filter
  - Spectral peaks at the roots of the denominator (the poles)
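The LPC spectrum is obtained by evaluating the all-pole filter on the unit circle. The sketch below assumes a filter of the form G / (1 - sum of a[k] z^-k) and uses a hand-built second-order example with a pole pair near 500 Hz at an assumed 8 kHz sampling rate; none of these values come from the slides.

```python
import cmath
import math

def lpc_spectrum_db(a, gain, n_points):
    """Magnitude in dB of gain / A(e^{j omega}) at n_points frequencies in [0, pi)."""
    out = []
    for m in range(n_points):
        omega = math.pi * m / n_points
        # a[k] is the coefficient for lag k + 1
        denom = 1.0 - sum(ak * cmath.exp(-1j * omega * (k + 1)) for k, ak in enumerate(a))
        out.append(20 * math.log10(gain / abs(denom)))
    return out

fs = 8000                                   # assumed sampling rate (Hz)
r, theta = 0.95, 2 * math.pi * 500 / fs     # pole radius and angle (hypothetical)
a = [2 * r * math.cos(theta), -r * r]       # coefficients for a pole pair at r * exp(+/- j theta)

envelope = lpc_spectrum_db(a, 1.0, 256)
peak_hz = envelope.index(max(envelope)) * (fs / 2) / 256
```

The envelope peaks close to the pole angle, which is how the roots of the denominator show up as resonances.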
30 Speech Signal Representations I: Linear Predictive Coding
- Prediction error
  - Should be (approximately) the excitation
  - Unvoiced speech, expect white noise: OK
  - Voiced speech, expect impulse train: not OK
  - All-pole assumption not altogether valid
  - Real speech not perfectly periodic
  - Pitch synchronous analysis gives better results
- LPC order
  - Larger p gives lower prediction errors
  - Too large a p results in fitting the individual harmonics → separation between filter and source will not be so good
31 Speech Signal Representations I: Linear Predictive Coding
- Prediction error
  - Inverse LPC filter gives residual signal
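Inverse filtering can be sketched as subtracting the prediction from each sample. In the toy example below the coefficients are known exactly, so the residual reproduces the impulse-train excitation; with coefficients estimated from real speech the match is only approximate. The filter coefficients and the 40-sample pitch period are assumptions for illustration.

```python
def all_pole_filter(e, a):
    """Synthesis: x[n] = e[n] + sum_k a[k] * x[n - 1 - k]."""
    x = []
    for n in range(len(e)):
        past = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        x.append(e[n] + past)
    return x

def lpc_residual(x, a):
    """Inverse filter: e[n] = x[n] - sum_k a[k] * x[n - 1 - k]."""
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]

a = [1.3, -0.8]                                                  # hypothetical stable all-pole filter
excitation = [1.0 if n % 40 == 0 else 0.0 for n in range(200)]   # impulse train
speech_like = all_pole_filter(excitation, a)
residual = lpc_residual(speech_like, a)
```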
32 Speech Signal Representations I: Linear Predictive Coding
- Alternatives for the predictor coefficients
  - Line Spectral Frequencies
    - Local sensitivity
    - Efficiency
  - Reflection Coefficients
    - Guaranteed stable → useful for coefficients interpolated over time
  - Log-area ratios
    - Flat spectral sensitivity
  - Roots of the polynomial
    - Represent resonance frequencies and bandwidths
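As an illustration of the last item, a complex root can be mapped to a resonance frequency and bandwidth using the common relations F = angle * fs / (2π) and B = -ln|z| * fs / π. The sketch below handles only a second-order polynomial via the quadratic formula; for higher orders a numerical root finder such as `numpy.roots` would be needed. The 8 kHz rate and the 500 Hz / 100 Hz target values are assumptions for the example.

```python
import cmath
import math

def pole_to_formant(z, fs):
    """Map a pole z to (resonance frequency in Hz, bandwidth in Hz)."""
    freq = math.atan2(z.imag, z.real) * fs / (2 * math.pi)
    bandwidth = -math.log(abs(z)) * fs / math.pi
    return freq, bandwidth

def second_order_roots(a1, a2):
    """Roots of z^2 - a1*z - a2, the poles of 1 / (1 - a1 z^-1 - a2 z^-2)."""
    disc = cmath.sqrt(a1 * a1 + 4 * a2)
    return (a1 + disc) / 2, (a1 - disc) / 2

fs = 8000                                   # assumed sampling rate (Hz)
radius = math.exp(-math.pi * 100 / fs)      # pole radius for a 100 Hz bandwidth
theta = 2 * math.pi * 500 / fs              # pole angle for a 500 Hz resonance
a1, a2 = 2 * radius * math.cos(theta), -radius * radius

upper_pole, lower_pole = second_order_roots(a1, a2)
freq_hz, bw_hz = pole_to_formant(upper_pole, fs)
```

Only the root with positive imaginary part is converted, since the conjugate root describes the same resonance.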
33 Speech Signal Representations I: Cepstral Processing
- A homomorphic transformation converts a convolution into a sum
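The real cepstrum is one such transformation: taking the log magnitude of the spectrum turns the product of two spectra into a sum. The sketch below checks this numerically with a direct DFT and circular convolution on 16-sample sequences; the two sequences stand in for a source and a filter and are made up for the example.

```python
import cmath
import math

def dft(x):
    """Direct DFT of a real or complex sequence."""
    n_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_len) for n in range(n_len))
            for k in range(n_len)]

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum (requires a spectrum with no zeros)."""
    n_len = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / n_len)
                for k in range(n_len)).real / n_len
            for n in range(n_len)]

def circular_convolve(a, b):
    """Circular convolution, which matches multiplication of DFTs exactly."""
    n_len = len(a)
    return [sum(a[m] * b[(n - m) % n_len] for m in range(n_len)) for n in range(n_len)]

source = [1.0, 0.5] + [0.0] * 14            # hypothetical excitation-like sequence
filt = [0.8 ** n for n in range(16)]        # hypothetical decaying impulse response
mixed = circular_convolve(source, filt)

cep_sum = [s + f for s, f in zip(real_cepstrum(source), real_cepstrum(filt))]
cep_mixed = real_cepstrum(mixed)
```

Here `cep_mixed` and `cep_sum` agree up to rounding, which is the convolution-to-sum property in action.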