Voice DSP Processing III

Description:

Voice DSP Processing III, by Yaakov J. Stein, Chief Scientist, RAD Data Communications

Slides: 55

Provided by: YJS5

Transcript and Presenter's Notes

1
Voice DSP Processing III

Yaakov J. Stein
Chief Scientist, RAD Data Communications

2
Voice DSP

Part 1: Speech biology and what we can learn from it
Part 2: Speech DSP (AGC, VAD, features, echo cancellation)
Part 3: Speech compression techniques
Part 4: Speech recognition

3
Voice DSP - Part 3

Simple coders
  G.711 A-law, μ-law
  Delta
  ADPCM
CELP coders
  LPC-10
  RELP/GSM
  CELP
Other methods
  MBE
  MELP
  STC
  Waveform Interpolation

4
Encoder Criteria

Encoders can be compared in many ways; the most important are
  Bit rate (Kbps)
  Speech quality (MOS)
  Delay (algorithmic [frame + lookahead] + computational + propagation)
  Computational complexity
Often less important
  Bit exactness (interoperability)
  Transcoding robustness
  Behavior on non-speech (babble noise, tones, music)
  Bit error robustness

5
PSTN Quality Coders

Rate        ITU-T encoder
128 Kbps    16-bit linear sampling
64 Kbps     G.711 A-law/μ-law 8-bit log sampling
32 Kbps     G.726 ADPCM
16 Kbps     G.728 LD-CELP
8 Kbps      G.729 CS-ACELP
4 Kbps      SG16 Q21 ???
Toll-quality MOS rating, but higher delay

6
Digital Cellular Standards
7
Military / Satellite Standards
8
Voice DSP

Simple coders

9
G.711

16-bit linear sampling at 8 KHz means 128 Kbps
Minimal toll-quality linear sampling is 12 bit (96 Kbps)
8-bit linear sampling (256 levels) is noticeably noisy
Due to
  prevalence of low amplitudes
  logarithmic response of ear
we can use logarithmic sampling
Different standards for different places

10
G.711 - cont.
North America: μ-law, μ = 255
Rest of world: A-law, A = 87.56
Although very different looking, they are nearly identical
G.711 approximates these expressions by 16 staircase straight-line segments (8 negative and 8 positive)
μ-law: horizontal segment through origin; A-law: vertical segment
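As a rough illustration of why logarithmic sampling helps at low amplitudes, the sketch below compares 8-bit uniform quantization with 8-bit quantization after continuous μ-law companding, y = sgn(x) ln(1 + μ|x|) / ln(1 + μ). It uses the continuous curve rather than G.711's segmented approximation, and the function names and the mid-rise quantizer are assumptions made for this example.

```python
import math

MU = 255.0  # North American mu-law parameter quoted on the slide


def mulaw_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(v, bits):
    """Uniform mid-rise quantizer on [-1, 1] with 2**bits levels (assumed)."""
    levels = 2 ** bits
    step = 2.0 / levels
    idx = min(levels - 1, max(0, int(math.floor((v + 1.0) / step))))
    return -1.0 + (idx + 0.5) * step


def linear_pcm(x, bits=8):
    """Quantize the sample directly."""
    return quantize(x, bits)


def log_pcm(x, bits=8):
    """Compress, quantize uniformly, then expand."""
    return mulaw_expand(quantize(mulaw_compress(x), bits))


x = 0.01  # a low-amplitude sample, where speech spends much of its time
print("linear error:", abs(linear_pcm(x) - x))
print("log error:   ", abs(log_pcm(x) - x))
```

With the same 8 bits, the companded quantizer spends its levels where small samples live, so its error on low amplitudes is typically much smaller.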
11
DPCM

Due to the low-pass character of speech, differences are usually smaller than signal values and hence require fewer bits to quantize
Simplest Delta-PCM (DPCM): quantize the first difference signal $\Delta$
Delta-PCM: quantize the difference between signal and prediction
  $\hat{s}_n = p(s_{n-1}, s_{n-2}, \ldots, s_{n-N})$
If we predict using a linear combination (FIR filter), this is linear prediction
  $\hat{s}_n = \sum_i p_i s_{n-i}$
Delta-modulation (DM): use only the sign of the difference (1-bit DPCM)
Sigma-delta (1 bit): oversample, DM, trade off rate for bits
12
DPCM with prediction

If the linear prediction works well, then the prediction error
  $e_n = s_n - \hat{s}_n$
will be lower in energy and whiter than $s_n$ itself!
Only the error is needed for reconstruction, since the predictable portion can be predicted:
  $s_n = \hat{s}_n + e_n$
[Figure: block diagram with input $s_n$ and a prediction filter]
13
DPCM - post-filtering

Simplest case: if highly oversampled, then the previous sample $s_{n-1}$ predicts $s_n$ well, so we can use DM:
  if $\mathrm{sgn}(e_n) < 0$ then $-\Delta$ else $+\Delta$
For DM there is no way to encode zero prediction error, so the decoded signal oscillates wildly
Standard remedy is a post-filter that low-pass filters this noise
But there is a bigger problem!
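The oscillation on a constant input is easy to reproduce. The following is a minimal sketch of 1-bit delta modulation, assuming a reconstruction that starts at zero; the two-tap average is only a crude stand-in for the low-pass post-filter mentioned above, and all names are hypothetical.

```python
def delta_modulate(signal, delta):
    """Encode with 1-bit DM; also return the decoder's staircase output."""
    recon = 0.0
    bits, decoded = [], []
    for s in signal:
        bit = 0 if s - recon < 0 else 1      # only the sign of the error is sent
        recon += delta if bit else -delta    # the decoder applies the same update
        bits.append(bit)
        decoded.append(recon)
    return bits, decoded


def smooth(x):
    """Two-tap moving average, a crude stand-in for a low-pass post-filter."""
    return [(a + b) / 2.0 for a, b in zip(x, x[1:])]


bits, decoded = delta_modulate([0.0] * 6, delta=0.1)
print(bits)             # the bits alternate: zero error cannot be encoded
print(decoded)          # the staircase bounces around the constant input
print(smooth(decoded))  # averaging neighbours flattens the bounce
```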

14
Open-loop Prediction

The encoder's linear predictor is also present in the decoder, but there it runs as feedback
The decoder's predictions are accurate with the precise error $e_n$,
but it gets the quantized error $\tilde{e}_n$ and the models diverge!

15
Side Information

There are two ways to solve the problem ...
The first way is to send the prediction coefficients from the encoder to the decoder, and not to let the decoder derive them
The coefficients sent are called side-information
Using side-information means higher bit-rate (since both $e_n$ and coefficients must be sent)
The second way does not require increasing bit rate

16
Closed-loop Prediction

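To ensure that the encoder and decoder stay in sync, we put the decoder into the encoder
Thus the encoder's predictions are identical to the decoder's, and no model difference accumulates
[Figure: closed-loop encoder and decoder built from a quantizer (Q), inverse quantizers (IQ) and prediction filters (PF), with signals $s_n$ and $e_n$]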
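The difference between open-loop and closed-loop prediction can be seen with a first-order predictor (predict the previous sample) and a coarse quantizer. This is an illustrative sketch, not the structure of any standard coder; the function names, the rounding quantizer and the ramp test signal are all assumptions.

```python
import math


def quantize(e, step):
    """Round the prediction error to the nearest multiple of step."""
    return step * math.floor(e / step + 0.5)


def dpcm(signal, step, closed_loop, a=1.0):
    """Return the decoder output of first-order DPCM.

    Open loop: the encoder predicts from the true previous sample.
    Closed loop: the encoder predicts from the decoder's reconstruction.
    """
    prev_true = 0.0    # encoder-only memory of the real signal
    prev_recon = 0.0   # reconstruction state, identical in the decoder
    decoded = []
    for s in signal:
        pred = a * (prev_recon if closed_loop else prev_true)
        eq = quantize(s - pred, step)
        recon = a * prev_recon + eq   # the decoder only has its own past output
        decoded.append(recon)
        prev_true, prev_recon = s, recon
    return decoded


ramp = [0.3 * n for n in range(50)]
for mode in (False, True):
    out = dpcm(ramp, step=1.0, closed_loop=mode)
    worst = max(abs(s - r) for s, r in zip(ramp, out))
    print("closed loop" if mode else "open loop  ", "max error:", worst)
```

In the open-loop run every difference of 0.3 rounds to zero, so the decoder never moves and the error grows without bound; in the closed-loop run the encoder sees the accumulated mismatch and the error stays within half a quantizer step.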
17
Two types of error

For DM there are two types of error (depending on step size $\Delta$)
[Figure: three panels: $\Delta$ too small (slope overload), $\Delta$ OK, $\Delta$ too large (granular noise)]
18
Adaptive Step Size

Speech signals are very nonstationary
We need to adapt the step size to match signal behavior
  Increase $\Delta$ when signal changes rapidly
  Decrease $\Delta$ when signal is relatively constant
Simplest method (for DM only)
  If present bit is the same as previous, multiply $\Delta$ by K (K = 1.5)
  If present bit is different, divide $\Delta$ by K
  Constrain $\Delta$ to a predefined range
More general method
  Collect N samples in buffer (N = 128 ... 512)
  Compute standard deviation in buffer
  Set $\Delta$ to a fraction of standard deviation
  Send $\Delta$ to decoder as side-information, or
  Use backward adaptation (closed-loop $\Delta$ computation)
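The simplest rule above (multiply or divide the step by K depending on whether the bit repeats) can be sketched as follows. The starting step, the clamping range and the function name are assumptions for illustration; because the adaptation depends only on the transmitted bits, a decoder could repeat it without side-information.

```python
def adaptive_dm(signal, delta0=0.01, k=1.5, dmin=0.001, dmax=1.0):
    """Delta modulation with the bit-repetition step-size rule.

    Returns the decoder's reconstruction. Setting k=1.0 disables adaptation.
    """
    recon, delta, prev_bit = 0.0, delta0, None
    decoded = []
    for s in signal:
        bit = 1 if s >= recon else 0
        if prev_bit is not None:
            # Same bit twice: we are lagging, so grow the step; otherwise shrink it.
            delta = delta * k if bit == prev_bit else delta / k
            delta = min(dmax, max(dmin, delta))   # keep the step in a fixed range
        recon += delta if bit else -delta
        decoded.append(recon)
        prev_bit = bit
    return decoded


step_input = [1.0] * 40                       # sudden jump from 0 to 1
fixed = adaptive_dm(step_input, k=1.0)        # plain DM: slope overload
adaptive = adaptive_dm(step_input, k=1.5)
print("fixed step, final value:   ", fixed[-1])
print("adaptive step, final value:", adaptive[-1])
```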

19
ADPCM

G.726 has
  Adaptive predictor
  Adaptive quantizer and inverse quantizer
  Adaptation speed control
  Tone and transition detector
  Mechanism to prevent loss from tandeming
Computational complexity relatively high (10 MIPS)
24 and 16 Kbps modes defined, but not toll quality
G.727: same rates, but embedded for packetized networks
ADPCM only uses the general low-pass characteristic of speech
What is the next step?

20
Scalar Quantization

Standard A/D has preset, evenly distributed levels
G.711 has preset, non-evenly distributed levels
With a criterion we can make an adaptive quantizer
Simplest criterion: minimum squared quantization error
  $e_n = s_n - \hat{s}_n$, $E = \langle e_n^2 \rangle$ (here $\hat{s}_n$ is the quantized value)
Need an algorithm to find the optimal placement of levels: EM-type algorithms

21
Vector Quantization

We can do the same thing in higher dimensions
Here we wish to match input data $x_i$, $i = 1 \ldots N$,
to a codebook of codewords $C_j$, $j = 1 \ldots M$,
with minimal mean squared error
  $E = \sum_{i=1}^{N} \| x_i - C \|^2$
where $C$ is the codeword closest to $x_i$ in the codebook
22
LBG Algorithm for VQ

Input: $x_i$, $i = 1 \ldots N$ (clustering, unsupervised learning)
Randomly initialize codebook $C_j$, $j = 1 \ldots M$
Loop until converged
  Classification step
    for i = 1 .. N
      for j = 1 .. M
        compute $D_{ij}^2 = \| x_i - C_j \|^2$
      classify $x_i$ to the $C_j$ with minimal $D_{ij}^2$
  Expectation step
    for j = 1 .. M
      correct center $C_j = \frac{1}{N_j} \sum_{i \in C_j} x_i$
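The two steps above translate almost line for line into code. This is a minimal sketch in plain Python, assuming the random initialization picks M of the input vectors (one common choice, not stated on the slide); the optional `init` argument and the handling of empty cells are also assumptions. Like any such iteration it can settle in a poor local minimum for an unlucky start.

```python
import random


def dist2(x, c):
    """Squared Euclidean distance between a vector and a codeword."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def lbg(data, m, init=None, iters=100, seed=0):
    """Train an m-entry codebook on data (a list of equal-length vectors)."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(data, m)
    codebook = [list(c) for c in start]
    for _ in range(iters):
        # Classification step: assign each vector to its nearest codeword.
        cells = [[] for _ in range(m)]
        for x in data:
            j = min(range(m), key=lambda j: dist2(x, codebook[j]))
            cells[j].append(x)
        # Expectation step: move each codeword to the mean of its cell.
        updated = []
        for j in range(m):
            if cells[j]:
                n_j = len(cells[j])
                dim = len(cells[j][0])
                updated.append([sum(x[d] for x in cells[j]) / n_j for d in range(dim)])
            else:
                updated.append(codebook[j])   # empty cell: keep the old codeword
        if updated == codebook:               # nothing moved, so we have converged
            break
        codebook = updated
    return codebook


data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(lbg(data, 2, init=[(0, 0), (0, 1)]))
```

Scalar quantizer design is the one-dimensional case of the same loop.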
23
Speech Application of VQ

OK, I understand what to do with scalar quantization; what is VQ good for?
We could try to simply VQ frames of speech samples, but this doesn't work well!
We can VQ spectra or sub-band components
We often VQ parameter sets (e.g. LPC coefficients)
We also VQ model error signals

24
Voice DSP

CELP coders

25
LPC-10

Based on 10th order LPC (obviously), Bishnu Atal
180 sample blocks are encoded into 54 bits
  Pitch + U/V (found using AMDF)      7 bits
  Gain                                5 bits
  10 reflection coefficients, found by covariance method;
  first two coefficients converted to log area ratios
    L1, L2, a3, a4   5 bits each
    a5, a6, a7, a8   4 bits each
    a9 3 bits, a10 2 bits            41 bits
  1 sync bit                          1 bit
54 bits 44.44 times per second results in 2400 bps
By using VQ we could reduce the bit rate to under 1 Kbps!
LPC-10 speech is intelligible, but synthetic sounding, and much of the speaker identity is lost!
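The bit budget above can be checked with a few lines of arithmetic. The sketch assumes 8 KHz sampling (the rate used throughout these slides), which is what makes 180 samples correspond to 44.44 frames per second; the dictionary keys are just labels.

```python
SAMPLE_RATE = 8000      # Hz, assumed
FRAME_SAMPLES = 180

bits = {
    "pitch + U/V": 7,
    "gain": 5,
    "L1, L2, a3, a4": 4 * 5,
    "a5, a6, a7, a8": 4 * 4,
    "a9": 3,
    "a10": 2,
    "sync": 1,
}

frame_bits = sum(bits.values())
coefficient_bits = frame_bits - bits["pitch + U/V"] - bits["gain"] - bits["sync"]
frames_per_second = SAMPLE_RATE / FRAME_SAMPLES
bit_rate = frame_bits * SAMPLE_RATE / FRAME_SAMPLES

print("coefficient bits:", coefficient_bits)
print("bits per frame:  ", frame_bits)
print("frames per second: %.2f" % frames_per_second)
print("bit rate (bps):  ", bit_rate)
```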

26
The Residual

Recover $s_n$ by adding back the residual error signal
  $s_n = \hat{s}_n + e_n$
So if we send $e_n$ as side-information we can recover $s_n$
$e_n$ is smaller than $s_n$, so may require fewer bits!
But $e_n$ is whiter than $s_n$, so may require many bits!
The question has now become: how can we compress the residual?

27
Encoding the Residual

RELP (6-9.6 Kbps)
  Low-pass filter and downsample residual to 1 KHz
  Encode using ADPCM
VQ-RELP (4.8 Kbps)
  VQ coding of residual
RELP (4.8 Kbps)
  Perform FFT on residual
  Baseband coding
RPE-LTP (GSM-FR at 13 Kbps)
  Regular Pulse Excitation - Long Term Prediction
  Perform long-term prediction (pitch recovery)
  Subtract to obtain new residual
  Decimate by 3, use phase with maximum energy
  Extract 6-bit overall gain
  Encode remainder with 3 bits/sample

28
Residual and Excitation

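Synthesis filter (all-pole), driven by the excitation $e_n$:
  $s_n = e_n + \sum_m a_m s_{n-m}$
Analysis filter (all-zero), producing the residual $r_n$:
  $r_n = s_n - \sum_m a_m s_{n-m}$
So $r_n = e_n$!
Note: the all-zero filter is the inverse of the all-pole filter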
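The inverse relationship between the two filters can be verified numerically. Below is a small sketch that implements both difference equations directly, with zero initial state; the coefficient values and the excitation are arbitrary illustrative numbers, not taken from any coder.

```python
def synthesize(excitation, a):
    """All-pole synthesis: s[n] = e[n] + sum_m a[m] * s[n-m], m = 1..M."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc += am * s[n - m]
        s.append(acc)
    return s


def analyze(signal, a):
    """All-zero analysis: r[n] = s[n] - sum_m a[m] * s[n-m], m = 1..M."""
    r = []
    for n, sn in enumerate(signal):
        acc = sn
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc -= am * signal[n - m]
        r.append(acc)
    return r


a = [0.9, -0.2]                                  # hypothetical LPC coefficients
excitation = [1.0, 0.0, 0.0, 0.5, -0.25, 0.0]
speech = synthesize(excitation, a)
residual = analyze(speech, a)
print(residual)   # matches the excitation up to rounding
```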
29
CELP

Atal's idea
  Find a way to efficiently encode the excitation!
Questions
  How can we find the excitation?
    Theoretically, by algebra (invert the filter!)
  How can we efficiently encode the residual?
    VQ - Code Excited Linear Prediction
  How can we efficiently find the best codeword?
    Exhaustive search

30
CELP - cont.

Atal and friends (Schroeder, Remde, Singhal, etc.) discoveries
  Even random codebooks work well (Gaussian, uniform)
  Don't need large codebooks, e.g. 1024 codewords for 40 samples
  Can center-clip with little loss
  Codebook with constant amplitude almost as good
So we can use codebooks with structure (and save storage/search/bits)
  Multipulse (MP)
  Regular Pulse (RP)
  Constant Amplitude Pulse
31
Special Excitations

Shift technique reduces random CB operations from $O(N^2)$ to $O(N)$
  a b c d e f | c d e f g h | e f g h i j | ...
Using a small number of ±1 amplitude pulses leads to MIPS reduction
  Since most values are zero, there are few operations
  Since amplitudes are ±1, no true multiplications
In a CB containing CW and -CW we can save half
Algebraic codebooks exploit algebraic structure
  Example: choose pulses according to Hadamard matrix
  Using FHT reduces computation
Conjugate structure codebooks
  Excitation is sum of codewords from two related CBs

32
Analysis by Synthesis

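Finding the best codeword by exhaustive search
[Figure: analysis-by-synthesis loop: the LPC synthesis output is subtracted from $s_n$, the energy of the difference is computed, and the minimum is found]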
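The exhaustive search can be sketched as a loop that synthesizes every codeword and keeps the one with the smallest error energy. The per-codeword least-squares gain is an addition for this example (the slide only shows the energy comparison), and the toy codebook, coefficients and function names are hypothetical.

```python
def synthesize(excitation, a):
    """All-pole LPC synthesis: s[n] = e[n] + sum_m a[m] * s[n-m]."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc += am * s[n - m]
        s.append(acc)
    return s


def best_codeword(target, codebook, a):
    """Try every codeword; return (error energy, index, gain) of the best one."""
    best = None
    for index, codeword in enumerate(codebook):
        y = synthesize(codeword, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue                              # an all-zero output cannot match
        # Least-squares gain for this codeword (assumed, not from the slide).
        gain = sum(t * v for t, v in zip(target, y)) / energy
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[0]:
            best = (err, index, gain)
    return best


a = [0.5]
codebook = [[1, 0, 0, 0], [0, 1, 0, -1], [1, 1, 1, 1]]
target = [2.0 * v for v in synthesize(codebook[1], a)]   # built from entry 1
print(best_codeword(target, codebook, a))
```

The cost is one synthesis per codeword per subframe, which is why the structured codebooks described earlier matter.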
33
Perceptual Weighting

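The criterion for selecting the best codeword should be perceptual, not simply the energy of the difference signal!
We perceptually weight the signal and the synthesized signal
Since PW is a filter we need use it only once (filtering is linear, so weighting the difference equals the difference of the weighted signals)
[Figure: analysis-by-synthesis loop with codebook (CB), LPC synthesis, subtraction from $s_n$, and a perceptual weighting (PW) filter]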
34
Perceptual Weighting - cont.

The most important PW effect is masking
Coding error energy near formants is not heard anyway,
so we allow higher error near formants but demand lower perceivable error energy
To do this we de-emphasize according to the LPC spectrum!
Simplest filter is $1 - \sum_i a_i z^{-i}$, where $a_i$ are the LPC coefficients
How do we take the critical bandwidth into account?
We perform bandwidth expansion (denominator expansion > numerator):
  $W(z) = \frac{1 - \sum_i \gamma_1^i a_i z^{-i}}{1 - \sum_i \gamma_2^i a_i z^{-i}}$
  $BW = -\ln(\gamma) F_s / \pi$
  $1 > \gamma_1 > \gamma_2 > 0$
Typical values: $\gamma_1 = 0.9$, $\gamma_2 = 0.6$
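To make the weighting filter concrete, the sketch below scales the LPC coefficients by powers of γ, evaluates the bandwidth expansion formula at the typical values, and runs the resulting pole-zero filter as a difference equation. The 8 KHz default sampling rate, the single example coefficient and the function names are assumptions.

```python
import math


def bandwidth_expand(a, gamma):
    """Scale a_i by gamma**i for i = 1..M."""
    return [gamma ** i * ai for i, ai in enumerate(a, start=1)]


def expansion_hz(gamma, fs=8000.0):
    """Bandwidth expansion BW = -ln(gamma) * Fs / pi, in Hz."""
    return -math.log(gamma) * fs / math.pi


def perceptual_weight(x, a, g1=0.9, g2=0.6):
    """Filter x with (1 - sum g1^i a_i z^-i) / (1 - sum g2^i a_i z^-i)."""
    num = bandwidth_expand(a, g1)
    den = bandwidth_expand(a, g2)
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                # zeros act on past inputs, poles on past outputs
                acc += -num[i - 1] * x[n - i] + den[i - 1] * y[n - i]
        y.append(acc)
    return y


print("gamma 0.9 -> %.1f Hz" % expansion_hz(0.9))
print("gamma 0.6 -> %.1f Hz" % expansion_hz(0.6))
print(perceptual_weight([1.0, 0.0, 0.0], a=[0.5]))   # short impulse response
```

The smaller γ in the denominator gives the larger expansion, consistent with "denominator expansion > numerator".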
35
Post-filter

Not related to the subject, but since we are already here:
In order to increase the subjective quality of many coders, post-filters are often used to emphasize the formant structure
These have the same form as the perceptual weighting filter, but $1 > \gamma_2 > \gamma_1 > 0$, with typical values $\gamma_1 = 0.5$, $\gamma_2 = 0.75$
Denominator expansion < numerator!
The post-filter also reinforces tilt, which should then be compensated by an IIR filter
Since the spectral valleys are de-emphasized, we should change the PW filter parameters $\gamma_1$ and $\gamma_2$
Originally proposed for ADPCM!

36
Subframes

Coders with large frames (> 10 ms) need a long excitation signal and hence a lot of bits to encode
An alternative is to divide the frame into (2-4) subframes, each of which has its own codeword excitation
[Figure: frame n-1, frame n, frame n+1, each divided into subframes]
We really should recompute LPC per subframe, but we can get away with interpolating!
37
Lookahead

If we are already dividing up the frame, we can compute the LPC based on a shifted frame
This is called lookahead, and it adds processing delay!
To decrease delay we can use a backward-looking IIR filter, and then we needn't send/store the LPC coefficients at all!
[Figure: LPC analysis windows shifted relative to the frames, with each frame divided into codeword (CW) subframes]
38
What happened to the pitch?

Unlike LPC, the ABS CELP coder is excited by a codebook
Where does the pitch come from?
  Random CB: minimization will prefer good excitation
  Regular/multi-pulse: pulse spacing (not enough pulses for high pitch)
But this is usually not enough (residual has pitch periodicity)
Two solutions
  Adaptive codebook (Kleijn et al.)
  Long term prediction (Atal, Singhal)
Both of these reinforce the pitch component

39
Adaptive CB

Adaptive codebook is repetitions of previous excitations
Total excitation is weighted sum of stochastic CB (random, MP, RP, etc.) and adaptive CB
[Figure: adaptive CB scaled by gain Ga plus fixed CB scaled by gain Gs, summed and fed to the LPC filter]
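One way to picture the adaptive codebook is as a lookup into the excitation history at a given pitch lag, repeated if the lag is shorter than the subframe. The sketch below follows that reading; the function names, the repetition rule for short lags and the sample numbers are assumptions for illustration.

```python
def adaptive_codeword(past_excitation, lag, length):
    """Build a codeword by repeating the last `lag` samples of past excitation."""
    segment = past_excitation[-lag:]
    return [segment[n % lag] for n in range(length)]


def total_excitation(past_excitation, lag, fixed_codeword, ga, gs):
    """Weighted sum: Ga * adaptive codeword + Gs * fixed (stochastic) codeword."""
    adaptive = adaptive_codeword(past_excitation, lag, len(fixed_codeword))
    return [ga * u + gs * c for u, c in zip(adaptive, fixed_codeword)]


history = [0.0, 0.0, 1.0, 2.0, 3.0]      # excitation of earlier subframes
fixed = [1, 0, 0, 0, -1]                 # sparse ternary codeword
print(adaptive_codeword(history, lag=3, length=5))
print(total_excitation(history, lag=3, fixed_codeword=fixed, ga=0.5, gs=2.0))
```

After each subframe the chosen total excitation would be appended to the history, which is what makes the codebook adaptive.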
40
Long Term Prediction

Using long-term (pitch predictor) and short-term (LPC) prediction
Long-term predictor may have only one delay, but then non-integer:
  $\frac{1}{1 - b z^{-d}}$
[Figure: analysis-by-synthesis loop: codebook, gain, pitch predictor, LPC; output subtracted from $s_n$; perceptual weighting; error computation]
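The one-tap pitch synthesis filter is a single feedback delay. Here is a minimal sketch restricted to integer delays (the non-integer case mentioned above would need interpolation, which is not shown); the function name and the numbers are illustrative.

```python
def pitch_synthesis(x, b, d):
    """Filter x through 1 / (1 - b z^-d): y[n] = x[n] + b * y[n-d]."""
    y = []
    for n, xn in enumerate(x):
        feedback = b * y[n - d] if n >= d else 0.0
        y.append(xn + feedback)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(pitch_synthesis(impulse, b=0.5, d=3))   # a decaying pulse every d samples
```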
41
Federal Standard CELP

FS 1016 at 4.8 Kbps has MOS 3.2
Developed by AT&T Bell Labs for DOD
144 bits / 30 ms frame
  10th order LPC on 30 ms Hamming window
  no pre-emphasis, additional 15 Hz BW expansion (quality and LSP robustness)
  Conversion to LSP and nonuniform scalar quantization to 34 bits
  4 subframes (7.5 ms), LSP interpolation
  512 entry fixed CB, static {-1, 0, +1} from center-clipped Gaussian,
    5 bit nonuniform quantized gain: 56 bits
  256 entry adaptive CB (8 bits), 5 bit nonuniform quantized gain: 48 bits
  optional noninteger delays, optional
  Perceptual weighting
  Postfilter, spectral tilt compensation, removable for noise or tandeming
  FEC 4 bits, SYNC 1 bit, reserved 1 bit
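As a check on the frame budget, the listed fields can be added up. A 512-entry codebook needs a 9-bit index, so the fixed CB costs 4 × (9 + 5) = 56 bits over four subframes. The 48-bit adaptive CB total is taken from the slide as given; it is less than 4 × (8 + 5) = 52, which would be consistent with some subframes coding the delay with fewer bits, but the transcript does not say.

```latex
\begin{align*}
\text{LSP}                    &: 34 \\
\text{fixed CB}               &: 4 \times (9 + 5) = 56 \\
\text{adaptive CB}            &: 48 \\
\text{FEC + SYNC + reserved}  &: 4 + 1 + 1 = 6 \\
\text{total}                  &: 34 + 56 + 48 + 6 = 144 \text{ bits} \\
\text{rate}                   &: \frac{144 \text{ bits}}{30 \text{ ms}} = 4800 \text{ bps}
\end{align*}
```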

42
G.728

16 Kbps with MOS similar to G.726 at 32 Kbps
Low 5 sample (0.625 msec) delay
High computational complexity (about 30 MIPS)
CELP with Backward LPC
LPC order 50 (why not? - we don't transmit side-information!)
Frame of 2.5 ms (20 samples)
4 subframes of 0.625 ms (5 samples)
Perceptual weighting
Only 10 bit index to fixed CB is transmitted
10 bits per 0.625 ms is 16 Kbps !

43
G.729

8 Kbps toll-quality coder for DSVD and VoIP
Computational complexity 20 MIPS, but G.729a is about 10 MIPS
Frame 10 ms (80 samples), lookahead 5 ms (1 subframe)
LPC, LSP, VQ, LSP interpolation
CS-ACELP CB (interleaved single-pulse permutation): 4 ±1 pulses / subframe
Closed-loop pitch prediction and adaptive CB (delay + gain)
2 (40 sample) subframes per frame
For each frame the encoder outputs 80 bits
  LSF coefficients   18 bits
  pitch               8 bits
  gain CB            14 bits
  adaptive CB         5 bits
  parity check        1 bit
  pulse positions    26 bits
  pulse signs         8 bits
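The field sizes listed for G.729 can be summed to confirm both the frame size and the bit rate:

```latex
\begin{align*}
18 + 8 + 14 + 5 + 1 + 26 + 8 &= 80 \text{ bits per frame} \\
\frac{80 \text{ bits}}{10 \text{ ms}} &= 8000 \text{ bps} = 8 \text{ Kbps}
\end{align*}
```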

44
G.729 annexes

A: Compatible reduced-complexity encoder with minimal MOS reduction
B: VAD and CNG
C: Floating point implementation
D: 6.4 Kbps version
  similar to G.729 but 64 output bits per frame, quality better than G.726 at 24 Kbps
  LSF coefficients 18 b, pitch + adaptive CB 8+4 b, gain CB 12 b, fixed CB 22 b
E: 11.8 Kbps coder for high quality and music

45
G.723.1