Title: A 12-Week Project in Speech Coding and Recognition
1. A 12-Week Project in Speech Coding and Recognition
- by Fu-Tien Hsiao
- and Vedrana Andersen
2. Overview
- An Introduction to Speech Signals (Vedrana)
- Linear Prediction Analysis (Fu)
- Speech Coding and Synthesis (Fu)
- Speech Recognition (Vedrana)
3. Speech Coding and Recognition
- AN INTRODUCTION TO SPEECH SIGNALS
4. Speech Production
- Flow of air from lungs
- Vibrating vocal cords
- Speech production cavities
- Lips
- Sound wave
- Vowels (a, e, i), fricatives (f, s, z) and
plosives (p, t, k)
5. Speech Signals
- Sampling frequency 8 to 16 kHz
- Short-time stationarity assumption (frames of 20 to 40 ms)
6. Model for Speech Production
- Excitation (periodic, noisy)
- Vocal tract filter (nasal cavity, oral cavity,
pharynx)
7. Voiced and Unvoiced Sounds
- Voiced sounds: periodic excitation, pitch period
- Unvoiced sounds: noise-like excitation
- Short-time measures: power and zero-crossing rate
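The two short-time measures above can be sketched in a few lines of Python (a minimal illustration with names of our own choosing; the frame is a plain list of samples):

```python
def short_time_measures(frame):
    """Per-frame power and zero-crossing rate.

    Voiced frames tend to show high power and a low zero-crossing
    rate; unvoiced (noise-like) frames show the opposite.
    """
    power = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    zcr = crossings / (len(frame) - 1)
    return power, zcr
```

Classifying a frame as voiced or unvoiced then amounts to thresholding these two values.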
8. Frequency Domain
- Pitch, harmonics (excitation)
- Formants, spectral envelope (vocal tract filter)
- Harmonic product spectrum
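A sketch of the harmonic product spectrum idea for pitch estimation (pure Python with a naive DFT so it is self-contained; in practice one would use an FFT, and all names here are illustrative):

```python
import math

def dft_mag(x):
    """Magnitude spectrum |X(k)| for k = 0..N/2 (naive DFT)."""
    N = len(x)
    mags = []
    for k in range(N // 2 + 1):
        re = sum(x[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        im = sum(-x[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        mags.append(math.hypot(re, im))
    return mags

def hps_pitch(frame, fs, R=3):
    """Harmonic product spectrum: downsample the magnitude spectrum
    by 2..R and multiply; harmonics line up, so the product peaks
    at the fundamental frequency."""
    spec = dft_mag(frame)
    K = len(spec) // R
    prod = [1.0] * K
    for r in range(1, R + 1):
        for k in range(K):
            prod[k] *= spec[k * r]
    k0 = max(range(1, K), key=lambda k: prod[k])  # skip the DC bin
    return k0 * fs / len(frame)
```

On a synthetic voiced frame built from a 250 Hz fundamental and its harmonics, the estimate lands on 250 Hz.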
9. Speech Spectrograms
- Time-varying formant structure
- Narrowband / wideband
10. Speech Coding and Recognition
- LINEAR PREDICTION ANALYSIS
11. Categories
- Vocal Tract Filter
- Linear Prediction Analysis
- Error Minimization
- Levinson-Durbin Recursion
- Residual sequence u(n)
12. Vocal Tract Filter (1)
- Vocal tract filter
- What if we assume an all-pole filter?
- (Diagram: periodic impulse train input, vocal tract filter, output speech)
13. Vocal Tract Filter (2)
- Autoregressive model (all-pole filter):
  s(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p) + A u_g(n),
  where p is called the model order
- Speech is a linear combination of past samples and an extra part, the excitation A u_g(n)
14. Linear Prediction Analysis (1)
- Goal: how to find the coefficients a_k in this all-pole model?
- Physical model vs. analysis system
- (Diagram: the impulse A u_g(n) drives the all-pole model, producing speech s(n); an unknown analysis box turns s(n) into the error e(n))
- The a_k here are fixed but unknown; we try to find estimates â_k of a_k
15. Linear Prediction Analysis (2)
- What is really inside the unknown box?
- A predictor P(z) (an FIR filter), where
  ŝ(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p)
- Prediction error: e(n) = s(n) - ŝ(n), i.e. A(z) = 1 - P(z)
- If â_k = a_k, then e(n) = A u_g(n)
16. Linear Prediction Analysis (3)
- If we can find a predictor generating the smallest error e(n), close to A u_g(n), then we can use A(z) to estimate the filter coefficients
- The resulting all-pole filter 1/A(z) is very similar to the vocal tract model
17. Error Minimization (1)
- Problem: how to find the minimum error?
- Energy of the error: E = sum_n e(n)^2, where e(n) = s(n) - ŝ(n)
- E is a function of the a_i
- Since E is a quadratic function of the a_i, we can find its smallest value by setting dE/da_i = 0 for each i
18. Error Minimization (2)
- Differentiation gives a set of linear equations:
  sum_{k=1..p} a_k r(|i - k|) = r(i), for i = 1..p
- where r(i) = sum_n s(n) s(n + i)
- This is in fact the autocorrelation of s(n)
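The autocorrelation values r(i) used in these equations can be computed directly; a minimal sketch (in practice the frame is windowed first, and the function name is ours):

```python
def autocorr(s, p):
    """Short-time autocorrelation r(i) = sum_n s(n) s(n+i), i = 0..p."""
    N = len(s)
    return [sum(s[n] * s[n + i] for n in range(N - i)) for i in range(p + 1)]
```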
19. Error Minimization (3)
- Hence, let's write the linear equations in matrix form
- The vector of linear prediction coefficients is our goal
- How do we solve the system efficiently?
20. Levinson-Durbin Recursion (1)
- The L-D recursion exploits two properties of the autocorrelation matrix:
  - Symmetric
  - Toeplitz
- Hence we can solve the system in O(p^2) instead of O(p^3)
- Don't forget our objective, which is to find a_k to simulate the vocal tract filter
21. Levinson-Durbin Recursion (2)
- In the exercise we solve the system both by brute force and by L-D recursion; the resulting parameters are identical
- (Plot: error energy vs. predictor order)
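A minimal pure-Python Levinson-Durbin recursion, assuming the autocorrelation values r(0..p) are given; it returns the coefficients of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p (the predictor coefficients are their negatives) together with the final error energy:

```python
def levinson_durbin(r, p):
    """Solve the order-p normal equations using the symmetric
    Toeplitz structure of the autocorrelation matrix in O(p^2)."""
    a = [1.0] + [0.0] * p   # coefficients of A(z)
    E = r[0]                # prediction error energy
    for i in range(1, p + 1):
        # reflection coefficient for order i
        k = -sum(a[j] * r[i - j] for j in range(i)) / E
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        E *= (1.0 - k * k)
    return a, E
```

The error energy E never increases as the order grows, matching the error-energy vs. predictor-order behaviour observed in the exercise.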
22. Residual Sequence u(n)
- Once the filter coefficients are known, we can find the residual sequence u(n) by inverse filtering
- Compare:
  - original s(n)
  - residual u(n)
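Inverse filtering is just running the frame through the FIR filter A(z); a minimal sketch with illustrative names (a holds the predictor coefficients a_1..a_p, and samples before the frame are taken as zero):

```python
def residual(s, a):
    """Inverse-filter s(n) with A(z) = 1 - sum_k a_k z^-k to recover
    the residual u(n) = s(n) - sum_k a_k s(n - k)."""
    u = []
    for n in range(len(s)):
        pred = sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        u.append(s[n] - pred)
    return u
```

For a frame synthesized from a known all-pole model, inverse filtering recovers the original excitation exactly.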
23. Speech Coding and Recognition
- SPEECH CODING AND SYNTHESIS
24. Categories
- Analysis-by-Synthesis
- Perceptual Weighting Filter
- Linear Predictive Coding
- Multi-Pulse Linear Prediction
- Code-Excited Linear Prediction (CELP)
- CELP Experiment
- Quantization
25. Analysis-by-Synthesis (1)
- Analyze the speech by estimating an LP synthesis filter
- Compute a residual sequence as an excitation signal to reconstruct the signal
- Encoder/decoder: parameters such as the LP synthesis filter, gain, and pitch are coded, transmitted, and decoded
26. Analysis-by-Synthesis (2)
- Frame by frame
- Without error minimization
- With error minimization
27. Perceptual Weighting Filter (1)
- Perceptual masking effect: within the formant regions, one is less sensitive to noise
- Idea: design a filter that de-emphasizes the error in the formant regions
- Result: synthetic speech with more error near the formant peaks but less error elsewhere
28. Perceptual Weighting Filter (2)
- In the frequency domain: LP synthesis filter vs. PW filter
- Perceptual weighting coefficient a:
  - a = 1: no filtering
  - as a decreases, the filtering increases
  - the optimal a depends on perception
29. Perceptual Weighting Filter (3)
- In the z-domain: LP filter vs. PW filter
- Numerator: generates zeros at the original poles of the LP synthesis filter
- Denominator: places the poles closer to the origin; a determines the distance
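Moving the poles toward the origin amounts to evaluating A(z/a), i.e. scaling each coefficient a_k by a^k; a minimal sketch (the slides' weighting coefficient a is often written gamma in the literature, and the function name is ours):

```python
def bandwidth_expand(lpc, gamma):
    """Coefficients of A(z/gamma) from those of A(z): each a_k is
    scaled by gamma**k, pulling every pole of 1/A(z) toward the
    origin by the factor gamma."""
    return [c * gamma ** (k + 1) for k, c in enumerate(lpc)]
```

The PW filter is then the ratio of the original A(z) (numerator, zeros at the original poles) to the expanded A(z/gamma) (denominator, poles pulled inward).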
30. Linear Predictive Coding (1)
- Based on the above methods: PW filter and analysis-by-synthesis
- If the excitation signal is an impulse train, then during voicing we can get a reconstructed signal very close to the original
- More often, however, the residual is far from an impulse train
31. Linear Predictive Coding (2)
- Hence there are many kinds of coding that try to improve on this
- They primarily differ in the type of excitation signal
- Two kinds:
  - Multi-Pulse Linear Prediction
  - Code-Excited Linear Prediction (CELP)
32. Multi-Pulse Linear Prediction (1)
- Concept: represent the residual sequence by placing impulses so as to make ŝ(n) closer to s(n)
- (Block diagram: s(n) → LP analysis; the excitation generator produces the multi-pulse u(n), which drives the LP synthesis filter to give ŝ(n); the error is weighted by the PW filter and fed to error minimization)
33. Multi-Pulse Linear Prediction (2)
- Step 1: estimate the LPC filter without excitation
- Step 2: place one impulse (choosing position and amplitude)
- Step 3: determine the new error
- Step 4: repeat steps 2-3 until reaching a desired minimum error
34. Code-Excited Linear Prediction (1)
- The difference:
- Represent the residual v(n) by codewords (found by exhaustive search) from a codebook of zero-mean Gaussian sequences
- Consider the primary pitch pulses, which are predictable over consecutive periods
35. Code-Excited Linear Prediction (2)
- (Block diagram: s(n) → LP analysis → LP parameters; the Gaussian excitation codebook and multi-pulse generator produce u(n), which drives the LP synthesis filter to give ŝ(n); the error s(n) - ŝ(n) is weighted by the PW filter and fed to error minimization)
36. CELP Experiment (1)
- An experiment with CELP
- (Plot: original signal in blue, excitation signal below, reconstructed signal in green)
37. CELP Experiment (2)
- Test the quality for different settings
- LPC model order: initial M = 10, test M = 2
- PW coefficient
38. CELP Experiment (3)
- Codebook parameters (L, K)
- K: codebook size
  - K strongly influences the computation time: e.g. reducing K from 1024 to 256 cuts the time from 13 to 6 seconds
- Initial setting (40, 1024), test setting (40, 16)
- L: length of the random signal
  - L determines the number of subblocks in the frame
39. Quantization
- With quantization:
  - 16000 bps CELP
  - 9600 bps CELP
- Trade-off: bandwidth efficiency vs. speech quality
40. Speech Coding and Recognition
- SPEECH RECOGNITION
41. Dimensions of Difficulty
- Speaker dependent / independent
- Vocabulary size (small, medium, large)
- Discrete words / continuous utterance
- Quiet / noisy environment
42. Feature Extraction
- Overlapping frames
- Feature vector for each frame
- Mel-cepstrum, difference cepstrum, energy, difference energy
43. Vector Quantization
- Vector quantization
- K-means algorithm
- Observation sequence for the whole word
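A toy scalar K-means, to illustrate codebook training for vector quantization (real feature vectors are multi-dimensional mel-cepstra; the initialization and names here are illustrative):

```python
def kmeans_1d(data, k, iters=20):
    """Alternate nearest-centroid assignment and centroid update
    to train a k-entry codebook."""
    centroids = list(data[:k])   # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids
```

Quantizing a frame then means replacing its feature vector by the index of the nearest codebook entry, producing the observation sequence fed to the HMM.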
44. Hidden Markov Model (1)
- Changing states, emitting symbols
- Model parameters: π(1), A, B
- (Diagram: five states, 1 to 5, with transitions among them)
45. Hidden Markov Model (2)
- Probability of transition
- State transition matrix A
- State probability vector
- State equation
46. Hidden Markov Model (3)
- Probability of observing
- Observation probability matrix B
- Observation probability vector
- Observation equation
47. Hidden Markov Model (4)
- Discrete-observation hidden Markov model
- Two HMM problems:
  - Training problem
  - Recognition problem
48. Recognition Using HMM (1)
- Determine the probability that a given HMM produced the observation sequence
- Straightforward computation enumerates all possible state paths: S^T of them (S states, T observations)
49. Recognition Using HMM (2)
- Forward-backward algorithm; only the forward part is needed here
- Forward partial observation sequence
- Forward probability
50. Recognition Using HMM (3)
- Initialization
- Recursion
- Termination
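The three steps above can be sketched directly (a minimal version without the scaling needed for long sequences; A is the state transition matrix, B the observation probability matrix, pi the initial state probabilities):

```python
def forward(A, B, pi, obs):
    """P(observation sequence | model) via the forward recursion.

    alpha[i] holds P(o_1..o_t, state_t = i) for the current t.
    """
    S = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(S)]          # initialization
    for o in obs[1:]:                                          # recursion
        alpha = [sum(alpha[j] * A[j][i] for j in range(S)) * B[i][o]
                 for i in range(S)]
    return sum(alpha)                                          # termination
```

Recognition then scores the observation sequence under each word's HMM and picks the word with the highest probability.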
51. Training HMM
- No known analytical way
- Forward-backward (Baum-Welch) re-estimation, a hill-climbing algorithm
- Re-estimates the HMM parameters in such a way that the probability of the training observations increases
- Method:
  - Uses the forward and backward probabilities to calculate state transition probabilities and observation probabilities
  - Re-estimates the model to improve the probability
- Need for scaling (to avoid numerical underflow)
52. Experiments
- Matrices A and B
- Observation sequences for the words "one" and "two"
53. Thank you!