Title: A 12-Week Project in Speech Coding and Recognition
1. A 12-Week Project in Speech Coding and Recognition
- by Fu-Tien Hsiao
- and Vedrana Andersen
2. Overview
- An Introduction to Speech Signals (Vedrana)
- Linear Prediction Analysis (Fu)
- Speech Coding and Synthesis (Fu)
- Speech Recognition (Vedrana)
3. Speech Coding and Recognition
- AN INTRODUCTION TO SPEECH SIGNALS
4. Speech Production
- Flow of air from lungs
- Vibrating vocal cords
- Speech production cavities
- Lips
- Sound wave
- Vowels (a, e, i), fricatives (f, s, z) and
plosives (p, t, k)
5. Speech Signals
- Sampling frequency 8 to 16 kHz
- Short-time stationarity assumption (frames of 20 to 40 ms)
6. Model for Speech Production
- Excitation (periodic, noisy)
- Vocal tract filter (nasal cavity, oral cavity,
pharynx)
7. Voiced and Unvoiced Sounds
- Voiced sounds: periodic excitation, pitch period
- Unvoiced sounds: noise-like excitation
- Short-time measures: power and zero-crossing rate
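The two short-time measures above can be sketched in a few lines of Python (a minimal illustration with names of our own choosing; the frame is a plain list of samples):

```python
def short_time_measures(frame):
    """Per-frame power and zero-crossing rate.

    Voiced frames tend to show high power and a low zero-crossing
    rate; unvoiced (noise-like) frames show the opposite.
    """
    power = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    zcr = crossings / (len(frame) - 1)
    return power, zcr
```

Classifying a frame as voiced or unvoiced then amounts to thresholding these two values.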
8. Frequency Domain
- Pitch, harmonics (excitation)
- Formants, spectral envelope (vocal tract filter)
- Harmonic product spectrum
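A sketch of the harmonic product spectrum idea for pitch estimation (pure Python with a naive DFT so it is self-contained; in practice one would use an FFT, and all names here are illustrative):

```python
import math

def dft_mag(x):
    """Magnitude spectrum |X(k)| for k = 0..N/2 (naive DFT)."""
    N = len(x)
    mags = []
    for k in range(N // 2 + 1):
        re = sum(x[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        im = sum(-x[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        mags.append(math.hypot(re, im))
    return mags

def hps_pitch(frame, fs, R=3):
    """Harmonic product spectrum: downsample the magnitude spectrum
    by 2..R and multiply; harmonics line up, so the product peaks
    at the fundamental frequency."""
    spec = dft_mag(frame)
    K = len(spec) // R
    prod = [1.0] * K
    for r in range(1, R + 1):
        for k in range(K):
            prod[k] *= spec[k * r]
    k0 = max(range(1, K), key=lambda k: prod[k])  # skip the DC bin
    return k0 * fs / len(frame)
```

On a synthetic voiced frame built from a 250 Hz fundamental and its harmonics, the estimate lands on 250 Hz.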
9. Speech Spectrograms
- Time-varying formant structure
- Narrowband / wideband
10. Speech Coding and Recognition
- LINEAR PREDICTION ANALYSIS
11. Categories
- Vocal Tract Filter
- Linear Prediction Analysis
- Error Minimization
- Levinson-Durbin Recursion
- Residual sequence u(n)
12. Vocal Tract Filter (1)
- Vocal tract filter
- What if we assume an all-pole filter?
- (Diagram: periodic impulse train input, vocal tract filter, output speech)
13. Vocal Tract Filter (2)
- Autoregressive model (all-pole filter):
  s(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p) + A u_g(n),
  where p is called the model order
- Speech is a linear combination of past samples and an extra part, the excitation A u_g(n)
14. Linear Prediction Analysis (1)
- Goal: how to find the coefficients a_k in this all-pole model?
- Physical model vs. analysis system
- (Diagram: the impulse A u_g(n) drives the all-pole model, producing speech s(n); an unknown analysis box turns s(n) into the error e(n))
- The a_k here are fixed but unknown; we try to find estimates â_k of a_k
15. Linear Prediction Analysis (2)
- What is really inside the unknown box?
- A predictor P(z) (an FIR filter), where
  ŝ(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p)
- Prediction error: e(n) = s(n) - ŝ(n), i.e. A(z) = 1 - P(z)
- If â_k = a_k, then e(n) = A u_g(n)
16. Linear Prediction Analysis (3)
- If we can find a predictor generating the smallest error e(n), close to A u_g(n), then we can use A(z) to estimate the filter coefficients
- The resulting all-pole filter 1/A(z) is very similar to the vocal tract model
17. Error Minimization (1)
- Problem: how to find the minimum error?
- Energy of the error: E = sum_n e(n)^2, where e(n) = s(n) - ŝ(n)
- E is a function of the a_i
- Since E is a quadratic function of the a_i, we can find its smallest value by setting dE/da_i = 0 for each i
18. Error Minimization (2)
- Differentiation gives a set of linear equations:
  sum_{k=1..p} a_k r(|i - k|) = r(i), for i = 1..p
- where r(i) = sum_n s(n) s(n + i)
- This is in fact the autocorrelation of s(n)
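The autocorrelation values r(i) used in these equations can be computed directly; a minimal sketch (in practice the frame is windowed first, and the function name is ours):

```python
def autocorr(s, p):
    """Short-time autocorrelation r(i) = sum_n s(n) s(n+i), i = 0..p."""
    N = len(s)
    return [sum(s[n] * s[n + i] for n in range(N - i)) for i in range(p + 1)]
```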
19. Error Minimization (3)
- Hence, let's write the linear equations in matrix form
- The vector of linear prediction coefficients is our goal
- How do we solve the system efficiently?
20. Levinson-Durbin Recursion (1)
- The L-D recursion exploits two properties of the autocorrelation matrix:
  - Symmetric
  - Toeplitz
- Hence we can solve the system in O(p^2) instead of O(p^3)
- Don't forget our objective, which is to find a_k to simulate the vocal tract filter
21. Levinson-Durbin Recursion (2)
- In the exercise we solve the system both by brute force and by L-D recursion; the resulting parameters are identical
- (Plot: error energy vs. predictor order)
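A minimal pure-Python Levinson-Durbin recursion, assuming the autocorrelation values r(0..p) are given; it returns the coefficients of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p (the predictor coefficients are their negatives) together with the final error energy:

```python
def levinson_durbin(r, p):
    """Solve the order-p normal equations using the symmetric
    Toeplitz structure of the autocorrelation matrix in O(p^2)."""
    a = [1.0] + [0.0] * p   # coefficients of A(z)
    E = r[0]                # prediction error energy
    for i in range(1, p + 1):
        # reflection coefficient for order i
        k = -sum(a[j] * r[i - j] for j in range(i)) / E
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        E *= (1.0 - k * k)
    return a, E
```

The error energy E never increases as the order grows, matching the error-energy vs. predictor-order behaviour observed in the exercise.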
22. Residual Sequence u(n)
- Once the filter coefficients are known, we can find the residual sequence u(n) by inverse filtering
- Compare:
  - original s(n)
  - residual u(n)
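Inverse filtering is just running the frame through the FIR filter A(z); a minimal sketch with illustrative names (a holds the predictor coefficients a_1..a_p, and samples before the frame are taken as zero):

```python
def residual(s, a):
    """Inverse-filter s(n) with A(z) = 1 - sum_k a_k z^-k to recover
    the residual u(n) = s(n) - sum_k a_k s(n - k)."""
    u = []
    for n in range(len(s)):
        pred = sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        u.append(s[n] - pred)
    return u
```

For a frame synthesized from a known all-pole model, inverse filtering recovers the original excitation exactly.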
23. Speech Coding and Recognition
- SPEECH CODING AND SYNTHESIS
24. Categories
- Analysis-by-Synthesis
- Perceptual Weighting Filter
- Linear Predictive Coding
- Multi-Pulse Linear Prediction
- Code-Excited Linear Prediction (CELP)
- CELP Experiment
- Quantization
25. Analysis-by-Synthesis (1)
- Analyze the speech by estimating an LP synthesis filter
- Compute a residual sequence as an excitation signal to reconstruct the signal
- Encoder/decoder: parameters such as the LP synthesis filter, gain, and pitch are coded, transmitted, and decoded
26. Analysis-by-Synthesis (2)
- Frame by frame
- Without error minimization
- With error minimization
27. Perceptual Weighting Filter (1)
- Perceptual masking effect: within the formant regions, one is less sensitive to noise
- Idea: design a filter that de-emphasizes the error in the formant regions
- Result: synthetic speech with more error near the formant peaks but less error elsewhere
28. Perceptual Weighting Filter (2)
- In the frequency domain: LP synthesis filter vs. PW filter
- Perceptual weighting coefficient a:
  - a = 1: no filtering
  - as a decreases, the filtering increases
  - the optimal a depends on perception
29. Perceptual Weighting Filter (3)
- In the z-domain: LP filter vs. PW filter
- Numerator: generates zeros at the original poles of the LP synthesis filter
- Denominator: places the poles closer to the origin; a determines the distance
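Moving the poles toward the origin amounts to evaluating A(z/a), i.e. scaling each coefficient a_k by a^k; a minimal sketch (the slides' weighting coefficient a is often written gamma in the literature, and the function name is ours):

```python
def bandwidth_expand(lpc, gamma):
    """Coefficients of A(z/gamma) from those of A(z): each a_k is
    scaled by gamma**k, pulling every pole of 1/A(z) toward the
    origin by the factor gamma."""
    return [c * gamma ** (k + 1) for k, c in enumerate(lpc)]
```

The PW filter is then the ratio of the original A(z) (numerator, zeros at the original poles) to the expanded A(z/gamma) (denominator, poles pulled inward).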
30. Linear Predictive Coding (1)
- Based on the above methods: PW filter and analysis-by-synthesis
- If the excitation signal is an impulse train, then during voicing we can get a reconstructed signal very close to the original
- More often, however, the residual is far from an impulse train
31. Linear Predictive Coding (2)
- Hence there are many kinds of coding that try to improve on this
- They primarily differ in the type of excitation signal
- Two kinds:
  - Multi-Pulse Linear Prediction
  - Code-Excited Linear Prediction (CELP)
32. Multi-Pulse Linear Prediction (1)
- Concept: represent the residual sequence by placing impulses so as to make ŝ(n) closer to s(n)
- (Block diagram: s(n) → LP analysis; the excitation generator produces the multi-pulse u(n), which drives the LP synthesis filter to give ŝ(n); the error is weighted by the PW filter and fed to error minimization)
33. Multi-Pulse Linear Prediction (2)
- Step 1: estimate the LPC filter without excitation
- Step 2: place one impulse (choosing position and amplitude)
- Step 3: determine the new error
- Step 4: repeat steps 2-3 until reaching a desired minimum error
34. Code-Excited Linear Prediction (1)
- The difference:
- Represent the residual v(n) by codewords (found by exhaustive search) from a codebook of zero-mean Gaussian sequences
- Consider the primary pitch pulses, which are predictable over consecutive periods
35. Code-Excited Linear Prediction (2)
- (Block diagram: s(n) → LP analysis → LP parameters; the Gaussian excitation codebook and multi-pulse generator produce u(n), which drives the LP synthesis filter to give ŝ(n); the error s(n) - ŝ(n) is weighted by the PW filter and fed to error minimization)
36. CELP Experiment (1)
- An experiment with CELP
- (Plot: original signal in blue, excitation signal below, reconstructed signal in green)
37. CELP Experiment (2)
- Test the quality for different settings
- LPC model order: initial M = 10, test M = 2
- PW coefficient
38. CELP Experiment (3)
- Codebook parameters (L, K)
- K: codebook size
  - K strongly influences the computation time: e.g. reducing K from 1024 to 256 cuts the time from 13 to 6 seconds
- Initial setting (40, 1024), test setting (40, 16)
- L: length of the random signal
  - L determines the number of subblocks in the frame
39. Quantization
- With quantization:
  - 16000 bps CELP
  - 9600 bps CELP
- Trade-off: bandwidth efficiency vs. speech quality
40. Speech Coding and Recognition
- SPEECH RECOGNITION
41. Dimensions of Difficulty
- Speaker dependent / independent
- Vocabulary size (small, medium, large)
- Discrete words / continuous utterance
- Quiet / noisy environment
42. Feature Extraction
- Overlapping frames
- Feature vector for each frame
- Mel-cepstrum, difference cepstrum, energy, difference energy
43. Vector Quantization
- Vector quantization
- K-means algorithm
- Observation sequence for the whole word
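A toy scalar K-means, to illustrate codebook training for vector quantization (real feature vectors are multi-dimensional mel-cepstra; the initialization and names here are illustrative):

```python
def kmeans_1d(data, k, iters=20):
    """Alternate nearest-centroid assignment and centroid update
    to train a k-entry codebook."""
    centroids = list(data[:k])   # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids
```

Quantizing a frame then means replacing its feature vector by the index of the nearest codebook entry, producing the observation sequence fed to the HMM.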
44. Hidden Markov Model (1)
- Changing states, emitting symbols
- Model parameters: π(1), A, B
- (Diagram: five states, 1 to 5, with transitions among them)
45. Hidden Markov Model (2)
- Probability of transition
- State transition matrix A
- State probability vector
- State equation
46. Hidden Markov Model (3)
- Probability of observing
- Observation probability matrix B
- Observation probability vector
- Observation equation
47. Hidden Markov Model (4)
- Discrete-observation hidden Markov model
- Two HMM problems:
  - Training problem
  - Recognition problem
48. Recognition Using HMM (1)
- Determine the probability that a given HMM produced the observation sequence
- Straightforward computation enumerates all possible state paths: S^T of them (S states, T observations)
49. Recognition Using HMM (2)
- Forward-backward algorithm; only the forward part is needed here
- Forward partial observation sequence
- Forward probability
50. Recognition Using HMM (3)
- Initialization
- Recursion
- Termination
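The three steps above can be sketched directly (a minimal version without the scaling needed for long sequences; A is the state transition matrix, B the observation probability matrix, pi the initial state probabilities):

```python
def forward(A, B, pi, obs):
    """P(observation sequence | model) via the forward recursion.

    alpha[i] holds P(o_1..o_t, state_t = i) for the current t.
    """
    S = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(S)]          # initialization
    for o in obs[1:]:                                          # recursion
        alpha = [sum(alpha[j] * A[j][i] for j in range(S)) * B[i][o]
                 for i in range(S)]
    return sum(alpha)                                          # termination
```

Recognition then scores the observation sequence under each word's HMM and picks the word with the highest probability.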
51. Training HMM
- No known analytical way
- Forward-backward (Baum-Welch) re-estimation, a hill-climbing algorithm
- Re-estimates the HMM parameters in such a way that the probability of the training observations increases
- Method:
  - Uses the forward and backward probabilities to calculate state transition probabilities and observation probabilities
  - Re-estimates the model to improve the probability
- Need for scaling (to avoid numerical underflow)
52. Experiments
- Matrices A and B
- Observation sequences for the words "one" and "two"
53. Thank you!