Title: Speech Compression
1 Speech Compression
2 Quick Overview
Spch Comp
- Simple coders
  - G.711 A-law / μ-law
  - Delta
  - ADPCM
- CELP coders
  - LPC-10
  - RELP/GSM
  - CELP
- Other methods
  - MBE
  - MELP
  - STC
  - Waveform Interpolation
3 Encoder Criteria
- Encoders can be compared in many ways
- The most important are
  - Bit rate (Kbps)
  - Speech quality (MOS)
  - Delay (algorithmic: frame + lookahead; computational; propagation)
  - Computational complexity
- Often less important
  - Bit exactness (interoperability)
  - Transcoding robustness
  - Behavior on non-speech (babble noise, tones, music)
  - Bit error robustness
4 PSTN Quality Coders
Rate: ITU-T encoder
- 128 Kbps: 16-bit linear sampling
- 64 Kbps: G.711 A-law / μ-law, 8-bit log sampling
- 32 Kbps: G.726 ADPCM
- 16 Kbps: G.728 LD-CELP
- 8 Kbps: G.729 CS-ACELP
- 4 Kbps: SG16 Q21 ???
- Toll-quality MOS rating, but higher delay
5 Digital Cellular Standards
6 Military / Satellite Standards
8 G.711
- 16-bit linear sampling at 8 kHz means 128 Kbps
- Minimal toll-quality linear sampling is 12 bit (96 Kbps)
- 8-bit linear sampling (256 levels) is noticeably noisy
- Due to
  - prevalence of low amplitudes
  - logarithmic response of the ear
- we can use logarithmic sampling
- Different standards for different places
9 G.711 - cont.
- μ-law (North America, μ = 255)
- A-law (rest of world, A = 87.56)
- Although very different looking, they are nearly identical
- G.711 approximates these expressions by 16 staircase straight-line segments (8 negative and 8 positive)
- μ-law: horizontal segment through origin; A-law: vertical segment
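The slide's companding expressions did not survive extraction, so the following is a minimal sketch using the standard continuous μ-law and A-law curves, not the 16-segment approximation that G.711 actually specifies. The function names and the test amplitude 0.003 are assumptions chosen for illustration. The sketch compares the error of 8-bit linear quantization with 8-bit quantization of the μ-law compressed value for a low-amplitude sample.

```python
import math

MU = 255.0   # mu-law constant (North America)
A = 87.56    # A-law constant as quoted on the slide

def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

def a_law_compress(x, a=A):
    """Continuous A-law curve: linear near zero, logarithmic elsewhere."""
    ax = abs(x)
    denom = 1.0 + math.log(a)
    if ax < 1.0 / a:
        y = a * ax / denom
    else:
        y = (1.0 + math.log(a * ax)) / denom
    return math.copysign(y, x)

def quantize_uniform(v, bits=8):
    """Round v to the nearest multiple of the step 2 / 2**bits."""
    step = 2.0 / (1 << bits)
    return step * round(v / step)

x = 0.003  # hypothetical low-amplitude sample
linear_error = abs(x - quantize_uniform(x))
log_error = abs(x - mu_law_expand(quantize_uniform(mu_law_compress(x))))
print(linear_error, log_error)
```

For this sample the linear quantizer rounds to zero, while the companded path keeps most of the amplitude, which is the motivation for logarithmic sampling given on the previous slide.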
10 DPCM
- Due to the low-pass character of speech
  - differences are usually smaller than signal values
  - and hence require fewer bits to quantize
- Simplest Delta-PCM (DPCM): quantize the first difference signal Δ
- Delta-PCM: quantize the difference between signal and prediction
  - $\hat{s}_n = p(s_{n-1}, s_{n-2}, \ldots, s_{n-N}) = \sum_i p_i s_{n-i}$
- If we predict using a linear combination (FIR filter), this is linear prediction
- Delta modulation (DM): use only the sign of the difference (1-bit DPCM)
- Sigma-delta (1 bit): oversample, DM, trade off rate for bits
11 DPCM with prediction
- If the linear prediction works well, then the prediction error
  - $e_n = s_n - \hat{s}_n$
- will be lower in energy and whiter than $s_n$ itself!
- Only the error is needed for reconstruction,
  - since the predictable portion can be predicted: $s_n = \hat{s}_n + e_n$
[Diagram: prediction filter applied to $s_n$]
12 DPCM - post-filtering
- Simplest case
  - if highly oversampled
  - then the previous sample $s_{n-1}$ predicts $s_n$ well,
  - so we can use DM:
  - if $e_n < 0$ then send $-\Delta$, else $+\Delta$
- For DM there is no way to encode zero prediction error
  - so the decoded signal oscillates wildly
- Standard remedy is a post-filter that low-pass filters this noise
- But there is a bigger problem!
13 Open-loop Prediction
- The encoder (linear predictor) is present in the decoder
  - but there it runs as feedback
- The decoder's predictions are accurate given the precise error $e_n$
  - but it gets the quantized error $\tilde{e}_n$, and the models diverge!
14 Side Information
- There are two ways to solve the problem ...
- The first way is to send the prediction coefficients
  - from the encoder to the decoder
  - and not to let the decoder derive them
- The coefficients sent are called side-information
- Using side-information means higher bit rate
  - (since both $e_n$ and the coefficients must be sent)
- The second way does not require increasing the bit rate
15 Closed-loop Prediction
- To ensure that the encoder and decoder stay in sync
  - we put the decoder into the encoder
- Thus the encoder's predictions are identical to the decoder's
  - and no model difference accumulates
[Diagram: encoder with quantizer (Q), inverse quantizer (IQ) and prediction filter (PF) in a feedback loop; decoder with IQ and PF]
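The difference between open-loop and closed-loop prediction can be sketched with a first-order predictor (the previous sample) and a uniform quantizer. This is an illustrative sketch, not any standard's algorithm: the function names, the quantizer and the ramp test signal are all assumptions. In the open-loop encoder the prediction uses the original previous sample, which the decoder never sees; in the closed-loop encoder it uses the encoder's own reconstruction, which the decoder can reproduce exactly.

```python
def quantize(x, step):
    """Uniform quantizer: round to the nearest multiple of step."""
    return step * round(x / step)

def encode_open_loop(signal, step):
    """Predict each sample from the ORIGINAL previous sample."""
    codes, prev = [], 0.0
    for s in signal:
        codes.append(quantize(s - prev, step))
        prev = s
    return codes

def encode_closed_loop(signal, step):
    """Predict each sample from the encoder's own RECONSTRUCTION."""
    codes, recon = [], 0.0
    for s in signal:
        q = quantize(s - recon, step)
        codes.append(q)
        recon += q  # same update the decoder performs
    return codes

def decode(codes):
    """Decoder: accumulate the quantized differences."""
    out, recon = [], 0.0
    for q in codes:
        recon += q
        out.append(recon)
    return out

def max_error(signal, decoded):
    return max(abs(s - d) for s, d in zip(signal, decoded))

ramp = [0.1 * n for n in range(100)]  # slowly rising test signal (assumed)
step = 0.25
open_err = max_error(ramp, decode(encode_open_loop(ramp, step)))
closed_err = max_error(ramp, decode(encode_closed_loop(ramp, step)))
print(open_err, closed_err)
```

With this ramp every open-loop difference quantizes to zero, so the decoder's error grows without bound, while the closed-loop error never exceeds half a quantizer step.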
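16 Two types of error
- For DM there are two types of error (depending on step size)
  - Δ too small: slope overload, the decoded signal cannot keep up with the input
  - Δ too large: granular noise, the decoded signal oscillates around the input
[Figure: DM tracking with Δ too small, Δ OK, and Δ too large]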
17 Adaptive Step Size
- Speech signals are very nonstationary
- We need to adapt the step size to match signal behavior
  - Increase Δ when the signal changes rapidly
  - Decrease Δ when the signal is relatively constant
- Simplest method (for DM only)
  - If the present bit is the same as the previous one, multiply Δ by K (K = 1.5)
  - If the present bit is different, divide Δ by K
  - Constrain Δ to a predefined range
- More general method
  - Collect N samples in a buffer (N = 128 to 512)
  - Compute the standard deviation in the buffer
  - Set Δ to a fraction of the standard deviation
  - Send Δ to the decoder as side-information, or
  - Use backward adaptation (closed-loop Δ computation)
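The simplest adaptation rule above can be sketched as an adaptive delta modulator. The initial step, the limits and the function names are assumptions for illustration; only the multiply/divide-by-K rule and the range constraint come from the slide. Because the step is derived from the transmitted bits alone, the decoder can repeat the same computation, which is an instance of backward adaptation.

```python
def adapt_step(step, bit, prev_bit, k, lo, hi):
    """Grow the step after two equal bits, shrink it after a change."""
    if prev_bit is not None:
        step = step * k if bit == prev_bit else step / k
    return min(max(step, lo), hi)  # constrain to a predefined range

def adm_encode(signal, step0=0.05, k=1.5, lo=0.01, hi=1.0):
    """Emit one bit per sample: 1 if the input is at or above the reconstruction."""
    bits, recon, step, prev = [], 0.0, step0, None
    for s in signal:
        bit = 1 if s >= recon else 0
        step = adapt_step(step, bit, prev, k, lo, hi)
        recon += step if bit else -step
        bits.append(bit)
        prev = bit
    return bits

def adm_decode(bits, step0=0.05, k=1.5, lo=0.01, hi=1.0):
    """Rebuild the signal by repeating the encoder's step adaptation."""
    out, recon, step, prev = [], 0.0, step0, None
    for bit in bits:
        step = adapt_step(step, bit, prev, k, lo, hi)
        recon += step if bit else -step
        out.append(recon)
        prev = bit
    return out

step_input = [1.0] * 10  # sudden jump from 0 to 1 (assumed test signal)
adaptive = adm_decode(adm_encode(step_input))
fixed = adm_decode(adm_encode(step_input, k=1.0), k=1.0)  # K = 1 disables adaptation
print(adaptive[-1], fixed[-1])
```

On this jump the fixed-step modulator is still in slope overload after ten samples, while the adaptive one has already reached the new level and is hunting around it.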
18 ADPCM
- G.726 has
  - Adaptive predictor
  - Adaptive quantizer and inverse quantizer
  - Adaptation speed control
  - Tone and transition detector
  - Mechanism to prevent loss from tandeming
- Computational complexity relatively high (10 MIPS)
- 24 and 16 Kbps modes defined, but not toll quality
- G.727: same rates, but embedded for packetized networks
- ADPCM only uses the general low-pass characteristic of speech
- What is the next step?
19 Scalar Quantization
- Standard A/D has preset, evenly distributed levels
- G.711 has preset, non-evenly distributed levels
- With a criterion we can make an adaptive quantizer
- Simplest criterion: minimum squared quantization error
  - $e_n = s_n - \hat{s}_n$, $E = \langle e_n^2 \rangle$
- Need an algorithm to find the optimal placement of levels: EM-type algorithms
20 Vector Quantization
- We can do the same thing in higher dimensions
- Here we wish to match input data $x_i$, $i = 1 \ldots N$
  - to a codebook of codewords $C_j$, $j = 1 \ldots M$
  - with minimal mean squared error
  - $E = \sum_{i=1}^{N} \lVert x_i - C \rVert^2$
  - where $C$ is the codeword closest to $x_i$ in the codebook
21 LBG Algorithm for VQ
- Input $x_i$, $i = 1 \ldots N$ (clustering, unsupervised learning)
- Randomly initialize codebook $C_j$, $j = 1 \ldots M$
- Loop until convergence
  - Classification step
    - for $i = 1 \ldots N$
      - for $j = 1 \ldots M$
        - compute $D_{ij}^2 = \lVert x_i - C_j \rVert^2$
      - classify $x_i$ to the $C_j$ with minimal $D_{ij}^2$
  - Expectation step
    - for $j = 1 \ldots M$ correct the center: $C_j = \frac{1}{N_j} \sum_{i \in C_j} x_i$
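The loop above can be sketched in a few lines. This is a minimal illustration, not a production codebook trainer: the codebook is initialized from randomly chosen input vectors (one possible reading of "randomly initialize"), an empty cell keeps its old codeword, and the six two-dimensional points are made-up test data. With one-dimensional vectors the same loop places the levels of a scalar quantizer.

```python
import random

def dist2(x, c):
    """Squared Euclidean distance between a vector and a codeword."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def lbg(data, m, max_iters=50, seed=0):
    """Train an m-entry codebook on data (a list of equal-length vectors)."""
    rng = random.Random(seed)
    codebook = [list(x) for x in rng.sample(data, m)]  # assumed initialization
    for _ in range(max_iters):
        # Classification step: assign each vector to its nearest codeword.
        cells = [[] for _ in range(m)]
        for x in data:
            j = min(range(m), key=lambda j: dist2(x, codebook[j]))
            cells[j].append(x)
        # Expectation step: move each codeword to the mean of its cell.
        updated = []
        for j, members in enumerate(cells):
            if members:
                dim = len(members[0])
                updated.append([sum(x[d] for x in members) / len(members)
                                for d in range(dim)])
            else:
                updated.append(codebook[j])  # empty cell: keep old codeword
        if updated == codebook:  # converged
            break
        codebook = updated
    return codebook

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
codebook = sorted(lbg(points, 2))
print(codebook)
```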
22 Speech Application of VQ
- OK, I understand what to do with scalar quantization
- What is VQ good for?
- We could try to simply VQ frames of speech samples
  - but this doesn't work well!
- We can VQ spectra or sub-band components
- We often VQ parameter sets (e.g. LPC coefficients)
- We also VQ model error signals
24 LPC-10
- Based on 10th-order LPC (obviously); Bishnu Atal
- 180-sample blocks are encoded into 54 bits
  - Pitch + U/V (found using AMDF): 7 bits
  - Gain: 5 bits
  - 10 reflection coefficients found by the covariance method
    - first two coefficients converted to log area ratios
    - L1, L2, a3, a4: 5 bits each
    - a5, a6, a7, a8: 4 bits each
    - a9: 3 bits, a10: 2 bits
    - total: 41 bits
  - 1 sync bit: 1 bit
- 54 bits 44.44 times per second results in 2400 bps
- By using VQ we could reduce the bit rate to under 1 Kbps!
- LPC-10 speech is intelligible, but synthetic sounding
  - and much of the speaker identity is lost!
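The bit budget above can be checked with a few lines of arithmetic. The dictionary keys are illustrative names; the 8 kHz sampling rate is the telephone-band rate used earlier in these slides.

```python
# Bits per 180-sample frame, as listed on the slide.
coefficient_bits = 4 * 5 + 4 * 4 + 3 + 2   # L1, L2, a3, a4; a5..a8; a9; a10
fields = {
    "pitch_and_voicing": 7,
    "gain": 5,
    "coefficients": coefficient_bits,
    "sync": 1,
}

sample_rate = 8000      # Hz
frame_samples = 180

frame_bits = sum(fields.values())
frame_ms = 1000 * frame_samples / sample_rate
frames_per_second = sample_rate / frame_samples
bit_rate = frame_bits * sample_rate / frame_samples

print(coefficient_bits, frame_bits, frame_ms, round(frames_per_second, 2), bit_rate)
```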
25 The Residual
- Recover $s_n$ by adding back the residual error signal
  - $s_n = \hat{s}_n + e_n$
- So if we send $e_n$ as side-information we can recover $s_n$
- $e_n$ is smaller than $s_n$, so it may require fewer bits!
- But $e_n$ is whiter than $s_n$, so it may require many bits!
- The question has now become
  - How can we compress the residual?
26 Encoding the Residual
- RELP (6-9.6 Kbps)
  - Low-pass filter and downsample the residual to 1 kHz
  - Encode using ADPCM
- VQ-RELP (4.8 Kbps)
  - VQ coding of the residual
- RELP (4.8 Kbps)
  - Perform FFT on the residual
  - Baseband coding
- RPE-LTP (GSM-FR at 13 Kbps)
  - Regular Pulse Excitation - Long Term Predictor
  - Perform long term prediction (pitch recovery)
  - Subtract to obtain a new residual
  - Decimate by 3, use the phase with maximum energy
  - Extract 6-bit overall gain
  - Encode the remainder with 3 bits/sample
27 Residual and Excitation
- Synthesis filter: $s_n = e_n + \sum_m a_m s_{n-m}$ (driven by the excitation $e_n$)
- Analysis filter: $r_n = s_n - \sum_m a_m s_{n-m}$ (produces the residual $r_n$)
- So $r_n = e_n$!
Note: the all-zero filter is the inverse of the all-pole filter
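A quick numerical check of the inverse relationship: pass an excitation through the all-pole synthesis filter, then pass the result through the all-zero analysis filter with the same coefficients. The coefficient values and the excitation are made-up test data, and both filters start from zero state.

```python
def synthesis(excitation, a):
    """All-pole filter: s_n = e_n + sum_m a_m * s_{n-m}."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc += am * s[n - m]
        s.append(acc)
    return s

def analysis(signal, a):
    """All-zero filter: r_n = s_n - sum_m a_m * s_{n-m}."""
    r = []
    for n, sn in enumerate(signal):
        acc = sn
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc -= am * signal[n - m]
        r.append(acc)
    return r

a = [1.3, -0.4]                       # hypothetical stable 2nd-order predictor
excitation = [1.0, 0.0, -0.5, 0.25, 0.0, 0.0, 2.0, -1.0]
speech = synthesis(excitation, a)
residual = analysis(speech, a)
print(residual)
```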
28 CELP
- Atal's idea
  - Find a way to efficiently encode the excitation!
- Questions
  - How can we find the excitation?
    - Theoretically, by algebra (invert the filter!)
  - How can we efficiently encode the residual?
    - VQ - Code Excited Linear Prediction
  - How can we efficiently find the best codeword?
    - Exhaustive search
29 CELP - cont.
- Discoveries of Atal and friends (Schroeder, Remde, Singhal, etc.)
  - Even random codebooks work well (Gaussian, uniform)
  - Don't need large codebooks, e.g. 1024 codewords for 40 samples
  - Can center-clip with little loss
  - Codebook with constant amplitude is almost as good
- So we can use codebooks with structure (and save storage/search/bits)
  - Multipulse (MP)
  - Constant Amplitude Pulse
  - Regular Pulse (RP)
30 Special Excitations
- Shift technique reduces random CB operations from $O(N^2)$ to $O(N)$
  - a b c d e f / c d e f g h / e f g h i j ...
- Using a small number of ±1 amplitude pulses leads to MIPS reduction
  - Since most values are zero, there are few operations
  - Since amplitudes are ±1, no true multiplications
- In a CB containing CW and -CW we can save half
- Algebraic codebooks exploit algebraic structure
  - Example: choose pulses according to a Hadamard matrix
  - Using the FHT reduces computation
- Conjugate structure codebooks
  - Excitation is the sum of codewords from two related CBs
31 Analysis by Synthesis
- Finding the best codeword by exhaustive search
[Diagram: the LPC filter output is subtracted from $s_n$, the energy of the difference is computed, and the minimum is found]
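The exhaustive search can be sketched as follows. This is an illustration under simplifying assumptions, not a standard's search procedure: the LPC filter starts from zero state for every codeword, a least-squares gain is fitted per codeword (the slide only mentions energy and minimum), and the Gaussian codebook, filter coefficients and target are made-up test data.

```python
import random

def lpc_synthesis(excitation, a):
    """All-pole filter s_n = e_n + sum_m a_m * s_{n-m}, from zero state."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc += am * s[n - m]
        s.append(acc)
    return s

def search_codebook(target, codebook, a):
    """Try every codeword; return (index, gain, error_energy) of the best."""
    best = None
    for index, codeword in enumerate(codebook):
        y = lpc_synthesis(codeword, a)
        energy = sum(v * v for v in y)
        # Least-squares gain for this codeword (an assumption of this sketch).
        gain = sum(t * v for t, v in zip(target, y)) / energy if energy > 0 else 0.0
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

rng = random.Random(1)
a = [1.3, -0.4]  # hypothetical LPC coefficients
codebook = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(8)]
# Build a target that codeword 5 can reproduce exactly with gain 0.7.
target = [0.7 * v for v in lpc_synthesis(codebook[5], a)]
index, gain, error = search_codebook(target, codebook, a)
print(index, gain, error)
```

The cost is one filtering operation and one energy computation per codeword, which is why the structured codebooks described on the earlier slides matter in practice.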
32 Perceptual Weighting
- The criterion for selecting the best codeword should be perceptual
  - not simply the energy of the difference signal!
- We perceptually weight the signal and the synthesized signal
- Since PW is a filter, we need to use it only once
[Diagram: CB feeds the LPC filter; its output is subtracted from $s_n$ and the difference passes through PW]
33 Perceptual Weighting - cont.
- The most important PW effect is masking
  - Coding error energy near formants is not heard anyway
  - so we allow higher error near formants
  - but demand lower perceivable error energy
- To do this we de-emphasize according to the LPC spectrum!
- Simplest filter is $1 - \sum_i a_i z^{-i}$, where $a_i$ are the LPC coefficients
- How do we take the critical bandwidth into account?
- We perform bandwidth expansion (denominator expansion > numerator):
  - $\dfrac{1 - \sum_i \gamma_1^i a_i z^{-i}}{1 - \sum_i \gamma_2^i a_i z^{-i}}$ with $1 > \gamma_1 > \gamma_2 > 0$
  - $BW = -\ln(\gamma) \, F_s / \pi$
- Typical values: $\gamma_1 = 0.9$, $\gamma_2 = 0.6$
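As a small worked example of the bandwidth-expansion formula, the sketch below scales a set of LPC coefficients by powers of γ and converts between γ and the added bandwidth in Hz. The function names and the coefficient values are assumptions for illustration; the sampling rate is the 8 kHz telephone rate used throughout these slides.

```python
import math

def expand_coefficients(a, gamma):
    """Replace a_i by gamma**i * a_i (i counted from 1)."""
    return [ai * gamma ** i for i, ai in enumerate(a, start=1)]

def bandwidth_hz(gamma, fs=8000.0):
    """Bandwidth added to each pole: BW = -ln(gamma) * Fs / pi."""
    return -math.log(gamma) * fs / math.pi

def gamma_for_bandwidth(bw_hz, fs=8000.0):
    """Inverse relation: the gamma that adds bw_hz of bandwidth."""
    return math.exp(-math.pi * bw_hz / fs)

a = [1.3, -0.4]  # hypothetical LPC coefficients
numerator = expand_coefficients(a, 0.9)     # gamma_1
denominator = expand_coefficients(a, 0.6)   # gamma_2

print(numerator, denominator)
print(bandwidth_hz(0.9), bandwidth_hz(0.6))
print(gamma_for_bandwidth(15.0))
```

With the typical values, γ = 0.9 adds roughly 270 Hz and γ = 0.6 roughly 1300 Hz, so the denominator is smoothed far more than the numerator; a 15 Hz expansion corresponds to γ of about 0.994.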
34 Post-filter
- Not related to the subject, but if we are already here
- In order to increase the subjective quality of many coders
  - post-filters are often used to emphasize the formant structure
- These have the same form as the perceptual weighting filter
  - but $1 > \gamma_2 > \gamma_1 > 0$, with typical values $\gamma_1 = 0.5$, $\gamma_2 = 0.75$
  - Denominator expansion < numerator!
- The post-filter also reinforces tilt
  - which should then be compensated by an IIR filter
- Since the spectral valleys are de-emphasized
  - we should change the PW filter parameters $\gamma_1$ and $\gamma_2$
- Originally proposed for ADPCM!
35 Subframes
- Coders with large frames (> 10 ms) need a long excitation signal
  - and hence a lot of bits to encode
- An alternative is to divide the frame into (2-4) subframes
  - each of which has its own codeword excitation
[Figure: frames n-1, n and n+1, each divided into subframes]
We really should recompute LPC per subframe, but we can get away with interpolating!
36 Lookahead
- If we are already dividing up the frame
  - we can compute the LPC based on a shifted frame
- This is called lookahead, and it adds processing delay!
- To decrease delay we can use a backward-looking IIR filter
  - and then we needn't send/store the LPC coefficients at all!
[Figure: two shifted LPC analysis windows above a row of eight codeword (CW) subframes]
37 What happened to the pitch?
- Unlike LPC, the ABS CELP coder is excited by a codebook
- Where does the pitch come from?
  - Random CB: minimization will prefer good excitation
  - Regular/Multi pulse: pulse spacing (not enough pulses for high pitch)
- But this is usually not enough (the residual has pitch periodicity)
- Two solutions
  - Adaptive codebook (Kleijn et al.)
  - Long term prediction (Atal, Singhal)
- Both of these reinforce the pitch component
38 Adaptive CB
- Adaptive codebook is repetitions of previous excitations
- Total excitation is a weighted sum of stochastic CB (random, MP, RP, etc.)
  - and adaptive CB
[Diagram: adaptive CB scaled by gain Ga and fixed CB scaled by gain Gs are summed and fed to the LPC filter]
39 Long Term Prediction
- Using long-term (pitch predictor) and short-term (LPC) prediction
- Long term predictor may have only
  - one delay, but then non-integer
  - $\dfrac{1}{1 - \beta z^{-d}}$
[Diagram: codebook, gain, pitch predictor and LPC in cascade; the output is subtracted from $s_n$, followed by perceptual weighting and error computation]
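A one-tap long-term predictor of the form above can be sketched as a simple feedback loop, y[n] = x[n] + β·y[n-d]. This sketch assumes an integer delay d for simplicity, whereas the slide notes that a single-tap predictor generally needs a non-integer delay; the function name and the impulse input are illustrative.

```python
def pitch_synthesis(x, beta, d):
    """One-tap long-term synthesis filter 1 / (1 - beta * z**-d), integer d."""
    y = []
    for n, xn in enumerate(x):
        feedback = beta * y[n - d] if n >= d else 0.0
        y.append(xn + feedback)
    return y

impulse = [1.0] + [0.0] * 9
periodic = pitch_synthesis(impulse, beta=0.5, d=4)
print(periodic)  # pulses repeat every d samples, decaying by beta each time
```

A single excitation pulse is turned into a decaying pulse train with period d, which is how the predictor restores pitch periodicity to a codebook excitation.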
40 Federal Standard CELP
- FS 1016 at 4.8 Kbps has MOS 3.2
- Developed by AT&T Bell Labs for DOD; 144 bits / 30 ms frame
- 10th-order LPC on 30 ms Hamming window
  - no pre-emphasis, additional 15 Hz BW expansion (quality and LSP robustness)
  - Conversion to LSP and nonuniform scalar quantization to 34 bits
- 4 subframes (7.5 ms), LSP interpolation
- 512-entry fixed CB - static, values -1, 0, +1 from center-clipped Gaussian
  - 5-bit nonuniform quantized gain; 56 bits
- 256-entry adaptive CB - 8 bits, 5-bit nonuniform quantized gain; 48 bits
  - optional noninteger delays, optional
- Perceptual weighting
- Postfilter, spectral tilt compensation, removable for noise or tandeming
- FEC 4 bits, SYNC 1 bit, reserved 1 bit
41 G.728
- 16 Kbps with MOS similar to G.726 at 32 Kbps
- Low 5-sample (0.625 ms) delay
- High computational complexity (about 30 MIPS)
- CELP with backward LPC
  - LPC order 50 (why not? - we don't transmit side-information!)
- Frame of 2.5 ms (20 samples)
  - 4 subframes of 0.625 ms (5 samples)
- Perceptual weighting
- Only the 10-bit index to the fixed CB is transmitted
  - 10 bits per 0.625 ms is 16 Kbps!
42 G.729
- 8 Kbps toll-quality coder for DSVD and VoIP
- Computational complexity 20 MIPS, but G.729A is about 10 MIPS
- Frame 10 ms (80 samples), lookahead 5 ms (1 subframe)
- LPC, LSP, VQ, LSP interpolation
- CS-ACELP CB (interleaved single pulse permutation): 4 ±1 pulses / subframe
- Closed-loop pitch prediction and adaptive CB (delay + gain)
- 2 (40-sample) subframes per frame
- For each frame the encoder outputs 80 bits
  - LSF coefficients: 18 bits
  - pitch: 8 bits
  - gain CB: 14 bits
  - adaptive CB: 5 bits
  - parity check: 1 bit
  - pulse positions: 26 bits
  - pulse signs: 8 bits
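The frame layout above can be checked by summing the fields and converting to a bit rate; the same few lines also give the algorithmic delay implied by the frame and lookahead figures. The dictionary keys are illustrative names.

```python
# Bits per 10 ms frame, as listed on the slide.
fields = {
    "lsf_coefficients": 18,
    "pitch": 8,
    "gain_codebook": 14,
    "adaptive_codebook": 5,
    "parity": 1,
    "pulse_positions": 26,
    "pulse_signs": 8,
}

frame_ms = 10
lookahead_ms = 5

frame_bits = sum(fields.values())
bit_rate = frame_bits * 1000 // frame_ms          # bits per second
algorithmic_delay_ms = frame_ms + lookahead_ms    # frame plus lookahead

print(frame_bits, bit_rate, algorithmic_delay_ms)
```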
43 G.729 annexes
- A: Compatible reduced-complexity encoder with minimal MOS reduction
- B: VAD and CNG
- C: Floating point implementation
- D: 6.4 Kbps version
  - similar to G.729 but 64 output bits per frame, quality better than G.726 at 24 Kbps
  - LSF coefficients 18 b, pitch/adaptive CB 8+4 b, gain CB 12 b, fixed CB 22 b
- E: 11.8 Kbps coder for high quality and music
44 G.723.1
- 6.3 (MP-MLQ) and 5.3 (ACELP) Kbps rates
- About 18 MIPS on DSP
- Frame 30 ms (240 samples), lookahead 7.5 ms
- LPC on 30 ms (240-sample) frames, LSP and VQ
- Open-loop pitch computation on half-frames (120 samples)
- Excitation on 4 subframes (60 samples) per frame
- Perceptual weighting and harmonic noise weighting
- Fifth-order closed-loop pitch predictor
- MP-MLQ: 5 or 6 ±1 pulses / subframe, positions all even or all odd
- ACELP: 4 ±1 pulses / subframe, positions differ by 8
- Annex A: VAD/CNG; Annex B: floating point implementation
46 MBE coder
- LPC-10 makes a hard U/V decision - no mixed voicing
- Multi Band Excitation uses a different excitation
  - harmonics of the pitch frequency
  - frequency-dependent binary U/V decision
  - large number of sub-bands (> 16)
- Simultaneous ABS estimation of pitch and spectral envelope
- Then the U/V decision is made based on spectral fit
- Use of dynamic programming for pitch tracking
47 MBE coder - cont.
- DVSI made various MBE, AMBE and IMBE coders for satellite (INMARSAT)
- Bit rates 2.4 - 9.6 Kbps (toll quality at 3.6 Kbps)
- Integral FEC for bit-error robustness
- As an example
  - 128 bits for each 20 ms frame
  - pitch: 8 bits
  - U/V decisions: K bits (K < 12)
  - spectral amplitudes (DCT): 75-K bits
  - FEC (Golay codes): 45 bits
48 MELP
- DOD wanted a new 2.4 Kbps coder with MOS similar to FS 1016
- Main problems with LPC-10
  - voicing determination errors
  - no handling of partially voiced speech
- Unlike MBE, MELP uses the standard LPC model
- MELP excitation is a pulse train plus random noise
- Soft decision in a small number (5) of sub-bands
- Frame 22.5 ms (180 samples)
- 10th-order LPC, 15 Hz BW expansion, LSF, interpolation, VQ
- Pitch refinement
- 5 sub-bands (0-500-1000-2000-3000-4000 Hz): pitch and noise excitation
- FEC
49 Sinusoidal Transform Coder
- McAulay and Quatieri model
  - instead of LPC, use a sum of sine waves
  - $s_n = \sum_{i=1}^{N} A_i \cos(\omega_i n + \phi_i)$
- For each analysis frame (10 - 20 ms) we need to extract the N amplitudes $A_i$, frequencies $\omega_i$ and phases $\phi_i$
- Voiced speech
  - Use pitch and important harmonics from pitch-synchronized STFT
- Unvoiced speech
  - Use peaks of STFT: points where the slope changes from + to -
- At high bit rates: keep magnitudes, frequencies and phases
- At low bit rates: frequencies constrained and phases modeled
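The synthesis side of the sinusoidal model is a direct evaluation of the sum above. The sketch below uses fixed parameters over the whole segment, whereas an actual coder would interpolate them between frames; the function name and the example amplitudes, frequencies and phases are assumptions for illustration.

```python
import math

def synthesize(amplitudes, frequencies, phases, length):
    """s_n = sum_i A_i * cos(w_i * n + phi_i) for n = 0 .. length-1.

    Frequencies are in radians per sample.
    """
    return [
        sum(a * math.cos(w * n + p)
            for a, w, p in zip(amplitudes, frequencies, phases))
        for n in range(length)
    ]

# Hypothetical "voiced" frame: a fundamental and two harmonics.
w0 = 2 * math.pi * 100 / 8000          # 100 Hz pitch at 8 kHz sampling
frame = synthesize([1.0, 0.5, 0.25], [w0, 2 * w0, 3 * w0], [0.0, 0.0, 0.0], 160)
print(frame[:4])
```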
50 STC - cont.
- Sparse spectrum is updated at regularly spaced times
- Amplitude is linearly interpolated between updates
- Interpolated phase must obey 4 conditions (ω and φ at the start, ω and φ at the end)
[Diagram: encoder: $s_n$, overlapped windowing, FFT, peak picker, spectrum encoder; decoder: spectrum decoder (e.g. all-pole model), sum of sinusoids, $s_n$]
51 STC - cont.
- Tracking the sinusoidal components
[Figure: frequency versus time tracks of sinusoidal components, showing the birth and death of tracks]
52 Waveform Interpolation
- Voiced speech is a sequence of pitch-cycle waveforms
- The characteristic waveform usually changes slowly with time
- Useful to think of the waveform in 2d
[Figure: characteristic waveform plotted against time and phase in pitch period]
- This waveform can be the speech signal or the LPC residual
53 WI - cont.
- Per frame, LPC and pitch are extracted
- Represent the CW (characteristic waveform) by features (e.g. DFT coefficients)
- Alignment by circular shift until maximum correlation
- Separate treatment for voiced and unvoiced segments
[Diagram: $s_n$, LPC and pitch tracking, characteristic waveform extraction, 2d CW alignment, quantization, decoding, waveform interpolation, conversion to 1d, $s_n$]
54 We can go even lower
- The Shannon entropy of speech is 200-500 bps
  - So even deeper compression should be possible
- There are even lower rate speech coders (300-600 bps)
  - but they are not practical at this point
    - extremely high MIPS and memory
    - very long latency
    - very sensitive to background noise